# OtoSpeech Turn-Taking

Official DualTurn release of the otoSpeech corpus, with per-frame turn-taking labels and Mimi speech codec features. Each row is one full conversation. Frame rate: 12.5 Hz (80 ms per frame).
- Paper: DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining
- Training code: github.com/anyreachai/dualturn
- Model checkpoint: anyreach-ai/dualturn-qwen2.5-mimi-0.5B
## Splits
| Split | Sessions |
|---|---|
| train | 896 |
| val | 111 |
| test | 113 |
Total audio: 287.0 hours. `splits.json` in the repo root maps every session ID to its split; these are the exact train/val/test splits used for all experiments in the paper.
```python
from huggingface_hub import hf_hub_download
import json

path = hf_hub_download("anyreach-ai/dualturn-otospeech-turn-taking", "splits.json", repo_type="dataset")
splits = json.load(open(path))
print(splits["split_counts"])
```
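Beyond the `split_counts` key shown above, the exact JSON layout is not documented on this card. Assuming it also stores one list of session IDs per split (a hypothetical layout; adjust the key names to the real file), a session-to-split lookup can be built like this:

```python
# Hypothetical layout assumed: splits["train"] / splits["val"] / splits["test"]
# are lists of session IDs. Adjust the keys to the actual splits.json structure.
session_to_split = {sid: name
                    for name in ("train", "val", "test")
                    for sid in splits.get(name, [])}
print(len(session_to_split))
```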
## Features

All multi-dimensional arrays are stored as flat lists (row-major); reshape with `num_frames`.
| Column | Shape | dtype | Description |
|---|---|---|---|
| `session_id` | — | str | Unique session identifier |
| `dataset` | — | str | Source corpus name |
| `duration_s` | — | float32 | Conversation duration (seconds) |
| `num_frames` | — | int32 | T, the total number of frames at 12.5 Hz |
| `codes_ch0` / `codes_ch1` | [T*8] | int16 | Mimi RVQ codes, reshape to (T, 8) |
| `mimi_feat_ch0` / `mimi_feat_ch1` | [T*512] | float16 | Mimi continuous embeddings, reshape to (T, 512) |
| `vad_ch0` / `vad_ch1` | [T] | int8 | Cleaned binary VAD per channel |
| `eot_ch0` / `eot_ch1` | [T] | int8 | End-of-Turn label (sparse) |
| `hold_ch0` / `hold_ch1` | [T] | int8 | Within-turn hold/pause (sparse) |
| `bot_ch0` / `bot_ch1` | [T] | int8 | Beginning-of-Turn (sparse) |
| `bc_ch0` / `bc_ch1` | [T] | int8 | Backchannel (sparse) |
| `fvad_ch0` / `fvad_ch1` | [T*4] | float32 | Future-VAD soft targets at 240/480/960/2000 ms |
Event labels (eot, hold, bot, bc) are sparse binary: 0 everywhere except at event frames.
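Because the event tracks are sparse, it is often convenient to convert them to event times. A minimal sketch using only the fields documented above (frame index divided by the 12.5 Hz frame rate gives seconds):

```python
import numpy as np
from datasets import load_dataset

ds = load_dataset("anyreach-ai/dualturn-otospeech-turn-taking")
s = ds["val"][0]

eot_ch0 = np.array(s["eot_ch0"], dtype=np.int8)
eot_frames = np.flatnonzero(eot_ch0)   # frame indices where an EOT event occurs
eot_times_s = eot_frames / 12.5        # 12.5 Hz frames -> seconds (80 ms per frame)
print(eot_times_s[:5])
```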
## Loading
```python
import numpy as np
from datasets import load_dataset

ds = load_dataset("anyreach-ai/dualturn-otospeech-turn-taking")
s = ds["val"][0]
T = s["num_frames"]

codes_ch0 = np.array(s["codes_ch0"], dtype=np.int16).reshape(T, 8)
mimi_ch0 = np.array(s["mimi_feat_ch0"], dtype=np.float16).reshape(T, 512)
fvad_ch0 = np.array(s["fvad_ch0"], dtype=np.float32).reshape(T, 4)
vad_ch0 = np.array(s["vad_ch0"], dtype=np.int8)
eot_ch0 = np.array(s["eot_ch0"], dtype=np.int8)
```
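Continuing from the snippet above, a quick sanity check that the flat list lengths match the shapes in the Features table:

```python
# Each flat list should equal num_frames times the per-frame width.
assert len(s["codes_ch0"]) == T * 8
assert len(s["mimi_feat_ch0"]) == T * 512
assert len(s["fvad_ch0"]) == T * 4
assert len(s["vad_ch0"]) == len(s["eot_ch0"]) == T
```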
### PyTorch windowed loader
```python
import numpy as np
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset

LABEL_KEYS = ["eot", "hold", "bot", "bc"]

def collate_windows(sessions, window_frames=125, hop_frames=25):
    """Slice each session into fixed-length windows and collate into a batch."""
    windows = []
    for s in sessions:
        T = s["num_frames"]
        codes_ch0 = np.array(s["codes_ch0"], dtype=np.int16).reshape(T, 8)
        codes_ch1 = np.array(s["codes_ch1"], dtype=np.int16).reshape(T, 8)
        vad_ch0 = np.array(s["vad_ch0"], dtype=np.int8)
        vad_ch1 = np.array(s["vad_ch1"], dtype=np.int8)
        labels = {f"{k}_{ch}": np.array(s[f"{k}_{ch}"], dtype=np.int8)
                  for k in LABEL_KEYS for ch in ("ch0", "ch1")}
        # Sessions shorter than window_frames contribute no windows.
        for start in range(0, T - window_frames + 1, hop_frames):
            end = start + window_frames
            w = {
                "codes_ch0": torch.from_numpy(codes_ch0[start:end]).long(),
                "codes_ch1": torch.from_numpy(codes_ch1[start:end]).long(),
                "vad_ch0": torch.from_numpy(vad_ch0[start:end]).float(),
                "vad_ch1": torch.from_numpy(vad_ch1[start:end]).float(),
            }
            for key, arr in labels.items():
                w[key] = torch.from_numpy(arr[start:end]).float()
            windows.append(w)
    return {k: torch.stack([w[k] for w in windows]) for k in windows[0]}

ds = load_dataset("anyreach-ai/dualturn-otospeech-turn-taking")
loader = DataLoader(ds["train"], batch_size=8, shuffle=True,
                    collate_fn=lambda b: collate_windows(b, window_frames=125, hop_frames=25))

batch = next(iter(loader))
print(batch["codes_ch0"].shape)  # [N_windows, 125, 8]
print(batch["eot_ch0"].shape)    # [N_windows, 125]
```
## Label definitions
| Label | Meaning |
|---|---|
| EOT | End-of-Turn: speech offset where the other speaker takes the floor within 4 s |
| HOLD | Within-turn pause: speech offset where the same speaker resumes (no handover) |
| BOT | Beginning-of-Turn: speech onset (>=1 s) following the other speaker |
| BC | Backchannel: isolated utterance <=1 s with >=1 s silence before and after |
| VAD | Voice Activity Detection — binary speech presence per frame |
| FVAD | Future VAD — mean voice activity over 4 future horizons (240/480/960/2000 ms) |
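The card gives the FVAD horizons but not the exact construction; one plausible reading of "mean voice activity over 4 future horizons" is the average of the binary VAD over the next 240/480/960/2000 ms from each frame. The sketch below implements that assumed reading, so it may not reproduce `fvad_ch0` bit-for-bit:

```python
import numpy as np

def future_vad_targets(vad, horizons_ms=(240, 480, 960, 2000), frame_ms=80):
    """Assumed construction: mean of the binary VAD over the next H ms per frame."""
    T = len(vad)
    out = np.zeros((T, len(horizons_ms)), dtype=np.float32)
    for j, h_ms in enumerate(horizons_ms):
        h = max(1, round(h_ms / frame_ms))       # horizon length in frames
        for t in range(T):
            future = vad[t + 1 : t + 1 + h]      # frames strictly after t (assumed)
            out[t, j] = future.mean() if len(future) else 0.0
    return out

approx_fvad = future_vad_targets(vad_ch0)  # vad_ch0 from the Loading snippet above
```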
## Authors
- Shangeth Rajaa — Senior ML Research Scientist, Anyreach AI
## Citation
This dataset was used for all training and evaluation in the DualTurn paper. `splits.json` contains the exact train/val/test splits used in the paper.
Paper: DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining
```bibtex
@misc{rajaa2026dualturnlearningturntakingdualchannel,
  title={DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining},
  author={Shangeth Rajaa},
  year={2026},
  eprint={2603.08216},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2603.08216},
}
```
If you use this dataset, please also cite the source corpus otoearth/otoSpeech-full-duplex-280h:
```bibtex
@misc{otoSpeech-full-duplex-280h,
  title = {otoSpeech-full-duplex-280h: Full-Duplex Conversational Speech Dataset},
  author = {otoearth},
  year = {2025},
  howpublished = {\url{https://huggingface.co/datasets/otoearth/otoSpeech-full-duplex-280h}},
  note = {License: CC BY 4.0}
}
```