OtoSpeech Turn-Taking

Official DualTurn release of the otospeech corpus, with per-frame turn-taking labels and Mimi speech codec features. Each row is one full conversation. Frame rate 12.5 Hz (80 ms per frame).

Splits

| Split | Sessions |
|-------|----------|
| train | 896 |
| val   | 111 |
| test  | 113 |

Total audio: 287.0 hours. The splits.json file in the repo root maps every session ID to its split; these are the exact train/val/test splits used for all experiments in the paper.

import json
from huggingface_hub import hf_hub_download

# Download splits.json from the dataset repo (cached locally after the first call)
path = hf_hub_download("anyreach-ai/dualturn-otospeech-turn-taking", "splits.json", repo_type="dataset")
with open(path) as f:
    splits = json.load(f)
print(splits["split_counts"])

Features

All multi-dim arrays are stored as flat lists (row-major); reshape with num_frames.

| Column | Shape | dtype | Description |
|--------|-------|-------|-------------|
| session_id | scalar | str | Unique session identifier |
| dataset | scalar | str | Source corpus name |
| duration_s | scalar | float32 | Conversation duration (seconds) |
| num_frames | scalar | int32 | T, the total number of frames at 12.5 Hz |
| codes_ch0 / codes_ch1 | [T*8] | int16 | Mimi RVQ codes, reshape to (T, 8) |
| mimi_feat_ch0 / mimi_feat_ch1 | [T*512] | float16 | Mimi continuous embeddings, reshape to (T, 512) |
| vad_ch0 / vad_ch1 | [T] | int8 | Cleaned binary VAD per channel |
| eot_ch0 / eot_ch1 | [T] | int8 | End-of-Turn label (sparse) |
| hold_ch0 / hold_ch1 | [T] | int8 | Within-turn hold/pause (sparse) |
| bot_ch0 / bot_ch1 | [T] | int8 | Beginning-of-Turn (sparse) |
| bc_ch0 / bc_ch1 | [T] | int8 | Backchannel (sparse) |
| fvad_ch0 / fvad_ch1 | [T*4] | float32 | Future-VAD soft targets at 240/480/960/2000 ms, reshape to (T, 4) |

Event labels (eot, hold, bot, bc) are sparse binary: 0 everywhere except at event frames.
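
Because the event labels are sparse, it can be more convenient to work with event times than with per-frame vectors. The following is a minimal sketch, assuming frame i starts at i × 80 ms (the card states the frame step but not the exact time alignment). The helper name `event_times` and the toy array are illustrative and not part of the dataset.

```python
import numpy as np

FRAME_S = 0.08  # 80 ms per frame (12.5 Hz), as stated in this card

def event_times(label, frame_s=FRAME_S):
    """Return the assumed start time (seconds) of every frame where a sparse label is 1."""
    frames = np.flatnonzero(np.asarray(label) == 1)
    return frames * frame_s

# Toy example with made-up data: events at frames 2 and 5
toy_eot = np.array([0, 0, 1, 0, 0, 1, 0], dtype=np.int8)
print(event_times(toy_eot))
```

The same helper applies unchanged to the hold, bot, and bc columns of either channel.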

Loading

import numpy as np
from datasets import load_dataset

ds = load_dataset("anyreach-ai/dualturn-otospeech-turn-taking")
s = ds["val"][0]
T = s["num_frames"]

codes_ch0 = np.array(s["codes_ch0"], dtype=np.int16).reshape(T, 8)
mimi_ch0  = np.array(s["mimi_feat_ch0"], dtype=np.float16).reshape(T, 512)
fvad_ch0  = np.array(s["fvad_ch0"], dtype=np.float32).reshape(T, 4)
vad_ch0   = np.array(s["vad_ch0"], dtype=np.int8)
eot_ch0   = np.array(s["eot_ch0"], dtype=np.int8)
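
A reshape with the wrong width fails with a generic NumPy error, so it may help to check the flat length against num_frames first. This is a small sketch only; `reshape_flat` is a hypothetical helper, not something shipped with the dataset, and the toy list stands in for a real column.

```python
import numpy as np

def reshape_flat(values, num_frames, width, dtype):
    """Reshape a row-major flat list to (num_frames, width), checking its length first."""
    arr = np.asarray(values, dtype=dtype)
    expected = num_frames * width
    if arr.size != expected:
        raise ValueError(f"expected {expected} values, got {arr.size}")
    return arr.reshape(num_frames, width)

# Toy example with made-up data: 3 frames of 8 codebook indices
toy_codes = list(range(24))
codes = reshape_flat(toy_codes, num_frames=3, width=8, dtype=np.int16)
print(codes.shape)  # (3, 8)
```

With a real row, the call would look like `reshape_flat(s["codes_ch0"], s["num_frames"], 8, np.int16)`.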

PyTorch windowed loader

import numpy as np
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset

LABEL_KEYS = ["eot", "hold", "bot", "bc"]

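def collate_windows(sessions, window_frames=125, hop_frames=25):
    """Slice each session into fixed-length windows and collate into a batch.

    At 12.5 Hz the defaults give 10 s windows (125 frames) with a 2 s hop (25 frames).
    Sessions shorter than window_frames contribute no windows.
    """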
    windows = []
    for s in sessions:
        T = s["num_frames"]
        codes_ch0 = np.array(s["codes_ch0"], dtype=np.int16).reshape(T, 8)
        codes_ch1 = np.array(s["codes_ch1"], dtype=np.int16).reshape(T, 8)
        vad_ch0   = np.array(s["vad_ch0"], dtype=np.int8)
        vad_ch1   = np.array(s["vad_ch1"], dtype=np.int8)
        labels    = {f"{k}_{ch}": np.array(s[f"{k}_{ch}"], dtype=np.int8)
                      for k in LABEL_KEYS for ch in ("ch0", "ch1")}
        for start in range(0, T - window_frames + 1, hop_frames):
            end = start + window_frames
            w = {
                "codes_ch0": torch.from_numpy(codes_ch0[start:end]).long(),
                "codes_ch1": torch.from_numpy(codes_ch1[start:end]).long(),
                "vad_ch0":   torch.from_numpy(vad_ch0[start:end]).float(),
                "vad_ch1":   torch.from_numpy(vad_ch1[start:end]).float(),
            }
            for key, arr in labels.items():
                w[key] = torch.from_numpy(arr[start:end]).float()
            windows.append(w)
    if not windows:
        raise ValueError("no session in this batch has at least window_frames frames")
    return {k: torch.stack([w[k] for w in windows]) for k in windows[0]}

ds     = load_dataset("anyreach-ai/dualturn-otospeech-turn-taking")
loader = DataLoader(ds["train"], batch_size=8, shuffle=True,
                    collate_fn=lambda b: collate_windows(b, window_frames=125, hop_frames=25))

batch = next(iter(loader))
print(batch["codes_ch0"].shape)   # [N_windows, 125, 8]
print(batch["eot_ch0"].shape)     # [N_windows, 125]
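
The first dimension of each batch tensor depends on the session lengths, since every session is cut into as many windows as fit. The sketch below reproduces the count implied by the `range(0, T - window_frames + 1, hop_frames)` loop above; `num_windows` is an illustrative helper, and the 11250-frame session (15 minutes) is a made-up length.

```python
def num_windows(num_frames, window_frames=125, hop_frames=25):
    """Count the windows yielded by range(0, num_frames - window_frames + 1, hop_frames)."""
    if num_frames < window_frames:
        return 0
    return (num_frames - window_frames) // hop_frames + 1

# Hypothetical 15-minute session: 900 s * 12.5 Hz = 11250 frames
print(num_windows(11250))  # 446
```

Any trailing frames that do not fill a complete window are dropped, so up to `hop_frames - 1` frames at the end of a session are never seen.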

Label definitions

| Label | Meaning |
|-------|---------|
| EOT | End-of-Turn: speech offset where the other speaker takes the floor within 4 s |
| HOLD | Within-turn pause: speech offset where the same speaker resumes (no handover) |
| BOT | Beginning-of-Turn: speech onset (>=1 s) following the other speaker |
| BC | Backchannel: isolated utterance <=1 s with >=1 s silence before and after |
| VAD | Voice Activity Detection: binary speech presence per frame |
| FVAD | Future VAD: mean voice activity over 4 future horizons (240/480/960/2000 ms) |
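
To make the FVAD definition concrete, the sketch below derives soft targets from a binary VAD vector. At 80 ms per frame the four horizons correspond to 3, 6, 12, and 25 frames. The card does not say how the windows are laid out, so this sketch makes several assumptions: each window is cumulative, it starts at the frame after the current one, and frames beyond the end of the session count as silence. It is not necessarily the procedure used to produce the fvad columns.

```python
import numpy as np

HORIZON_FRAMES = (3, 6, 12, 25)  # 240/480/960/2000 ms at 80 ms per frame

def future_vad(vad, horizons=HORIZON_FRAMES):
    """Mean VAD over frames t+1 .. t+h for each horizon h; returns shape (T, len(horizons)).

    Assumption: frames past the end of the session are treated as silence.
    """
    vad = np.asarray(vad, dtype=np.float32)
    T = len(vad)
    # csum[i] = sum(vad[:i]), so sum(vad[a:b]) = csum[b] - csum[a]
    csum = np.concatenate([[0.0], np.cumsum(vad)])
    out = np.zeros((T, len(horizons)), dtype=np.float32)
    t = np.arange(T)
    for j, h in enumerate(horizons):
        start = np.minimum(t + 1, T)
        end = np.minimum(t + 1 + h, T)
        out[:, j] = (csum[end] - csum[start]) / h
    return out

# Toy example with made-up data
toy_vad = np.array([0, 1, 1, 1, 0, 0], dtype=np.int8)
fvad = future_vad(toy_vad)
print(fvad.shape)  # (6, 4)
```

The cumulative sum keeps the computation linear in the number of frames rather than looping over every window.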

Authors

Citation

This dataset was used for all training and evaluation in the DualTurn paper. splits.json contains the exact train/val/test splits used in the paper.

Paper: DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining

@misc{rajaa2026dualturnlearningturntakingdualchannel,
      title={DualTurn: Learning Turn-Taking from Dual-Channel Generative Speech Pretraining},
      author={Shangeth Rajaa},
      year={2026},
      eprint={2603.08216},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2603.08216},
}

If you use this dataset, please also cite the source corpus otoearth/otoSpeech-full-duplex-280h:

@misc{otoSpeech-full-duplex-280h,
  title        = {otoSpeech-full-duplex-280h: Full-Duplex Conversational Speech Dataset},
  author       = {otoearth},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/datasets/otoearth/otoSpeech-full-duplex-280h}},
  note         = {License: CC BY 4.0}
}