Dataset columns (from the dataset viewer schema)

| Column | Type | Length / range / distinct values |
|---|---|---|
| text | string | length 2 to 6.58M |
| doc-id | string | length 18 to 64 |
| category | string (class) | 8 values |
| data-source | string | length 3 to 35 |
| script | string (class) | 2 values |
| age-estimate | string (class) | 110 values |
| license | string (class) | 14 values |
| misc | string (class) | 2 values |
| num-tokens | int64 | 1 to 1.34M |
| language | string (class) | 2 values |

BabyLM English–Chinese 50/50 Stratified

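A bilingual training corpus for the Multilingual BabyLM project, built by combining a 50% sample of the English BabyBabelLM corpus with a 50% sample of the Chinese BabyBabelLM corpus. The 50% refers to the fraction sampled from each corpus, not to the language mix: by token count the result is 41.8% English and 58.2% Chinese.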

Sampling is stratified by category: each source category is sampled independently to preserve the original category proportions within each language.
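
The stratified sampling described above could look roughly like the following. This is a minimal sketch, not the script used to build the corpus: it assumes sampling is done per document (the card does not say whether the 50% is measured in documents or tokens), and the function name `stratified_sample`, the seed, and the toy records are hypothetical. Only the `language` and `category` field names come from the dataset schema.

```python
import random
from collections import defaultdict


def stratified_sample(docs, fraction=0.5, seed=0):
    """Draw `fraction` of the documents from every (language, category) stratum.

    Assumption: sampling is per document; the dataset card does not specify this.
    """
    rng = random.Random(seed)

    # Group documents into strata keyed by language and category.
    strata = defaultdict(list)
    for doc in docs:
        strata[(doc["language"], doc["category"])].append(doc)

    # Sample each stratum independently so category proportions
    # within each language are preserved.
    sample = []
    for key in sorted(strata):
        group = strata[key]
        k = round(len(group) * fraction)
        sample.extend(rng.sample(group, k))
    return sample


# Toy records (hypothetical), using the dataset's field names.
docs = (
    [{"language": "eng", "category": "child-books"} for _ in range(6)]
    + [{"language": "eng", "category": "child-wiki"} for _ in range(2)]
    + [{"language": "zho", "category": "subtitles"} for _ in range(4)]
)
sample = stratified_sample(docs)
```

Because every stratum is halved separately, a small category such as `child-wiki` keeps the same relative weight inside its language that it had before sampling.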

Token counts

| Language | Tokens | Share |
|---|---|---|
| English (eng) | 49,481,353 | 41.8% |
| Chinese (zho) | 68,925,921 | 58.2% |
| Total | 118,407,274 | 100% |
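
The shares in this table follow directly from the token counts. The short sketch below recomputes them; the variable names are illustrative only.

```python
# Token counts per language, copied from the table above.
tokens = {"eng": 49_481_353, "zho": 68_925_921}

total = sum(tokens.values())

# Share of each language in percent, rounded to one decimal place.
shares = {lang: round(100 * n / total, 1) for lang, n in tokens.items()}

print(total)
print(shares)
```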

Category breakdown

| Category | English tokens | English % | Chinese tokens | Chinese % |
|---|---|---|---|---|
| child-available-speech | 4,551,767 | 9.2% | 3,702,039 | 5.4% |
| child-books | 13,406,594 | 27.1% | 7,992,009 | 11.6% |
| child-directed-speech | 13,857,314 | 28.0% | 4,821,121 | 7.0% |
| child-wiki | 7,304,856 | 14.8% | 12,967 | 0.0% |
| educational | 0 | 0.0% | 6,732,733 | 9.8% |
| padding-opensubtitles | 9,856,922 | 19.9% | 0 | 0.0% |
| padding-wikipedia | 503,900 | 1.0% | 0 | 0.0% |
| subtitles | 0 | 0.0% | 45,665,052 | 66.3% |
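
As a consistency check, the per-category counts can be summed and compared with the per-language totals of 49,481,353 English and 68,925,921 Chinese tokens. The sketch below does this with the numbers from the category breakdown; the helper name `pct` is hypothetical.

```python
# (English tokens, Chinese tokens) per category, copied from the category breakdown.
breakdown = {
    "child-available-speech": (4_551_767, 3_702_039),
    "child-books": (13_406_594, 7_992_009),
    "child-directed-speech": (13_857_314, 4_821_121),
    "child-wiki": (7_304_856, 12_967),
    "educational": (0, 6_732_733),
    "padding-opensubtitles": (9_856_922, 0),
    "padding-wikipedia": (503_900, 0),
    "subtitles": (0, 45_665_052),
}

# Column sums should equal the per-language totals.
eng_total = sum(eng for eng, _ in breakdown.values())
zho_total = sum(zho for _, zho in breakdown.values())


def pct(part, whole):
    """Percentage of `whole` rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Within-language share of every category.
eng_pct = {cat: pct(eng, eng_total) for cat, (eng, _) in breakdown.items()}
zho_pct = {cat: pct(zho, zho_total) for cat, (_, zho) in breakdown.items()}
```

The percentages are computed within each language, so each percentage column sums to roughly 100% on its own rather than across both languages.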

Source datasets
