| text | doc-id | category | data-source | script | age-estimate | license | misc | num-tokens | language |
|---|---|---|---|---|---|---|---|---|---|
"A:\tOkay.\nA:\tSo, What kind of experience do you, do you have, then with child care?\nB:\tI guess,(...TRUNCATED) | 8022e87a5dc2fac75de0a3f56faec75318d929235047e081d8df84be16eaa50f | child-available-speech | Switchboard | Latn | n/a | cc-by-nc-sa-3.0 | {} | 1,342,024 | eng |
"Well it's just that, you know, a pound, or a hundred pounds today, is not the same as a hundred pou(...TRUNCATED) | d67176e066a5b0b49b31180ae8c9b7cc7af57ea263b895bcd7a0f8e59c521253 | child-available-speech | BNC | Latn | n/a | BNC User License | {} | 3,801 | eng |
"You want me to start again?\nYeah.\nRight erm, could you tell me about how you left school please?\(...TRUNCATED) | f08b0626a8a70d4ca55ca9d839dffb25957e92573f9d3490ded91382a3dd3b0a | child-available-speech | BNC | Latn | n/a | BNC User License | {} | 14,372 | eng |
"Is this on yet?\nYeah.\nOh.\nOkay well, good morning.\nErm I have a er an important administrative (...TRUNCATED) | 61919dfdd0afefd51c936db62c66215d1c53974d29301516b15951e634702605 | child-available-speech | BNC | Latn | n/a | BNC User License | {} | 6,611 | eng |
"Come in, good morning.\nHello, well what's your mum been doing to you this morning ?\n.\nWell she's(...TRUNCATED) | 9b9479dfb49cdda37da32b684b92dd88dc56f62fef21a1291b6f998f6673f4f3 | child-available-speech | BNC | Latn | n/a | BNC User License | {} | 363 | eng |
"Order.\nEr, just a couple of announcements colleagues, if er, those delegates who actually smoke, i(...TRUNCATED) | ea96f7fd36f3dfea684022ba9e6f1d2ac02cc0721c156d7d1fd2f99fc3e107f7 | child-available-speech | BNC | Latn | n/a | BNC User License | {} | 12,902 | eng |
"Okay Ron there are, thanks for coming over for a start, there are you've got all the er the brochur(...TRUNCATED) | 1a8c3a6a3d017aaa7c2756aa53663e50db2ba80036f3a69aaf355f8a64550bd2 | child-available-speech | BNC | Latn | n/a | BNC User License | {} | 11,925 | eng |
"I actually wanted to that I didn't really want to go as far as for example deciding that the chair (...TRUNCATED) | 6da6be7247ec0efb4de60eb96a602703df1d7e0f65027df90f64ad85a0aac37b | child-available-speech | BNC | Latn | n/a | BNC User License | {} | 400 | eng |
"L E A T L E A T our local Environment Action Teams, initiative proved to be so outstanding.\nL E A (...TRUNCATED) | 0fd0263d223577f9ace342ef4e88052e80768991060a0baf88284fd5f3da1019 | child-available-speech | BNC | Latn | n/a | BNC User License | {} | 5,793 | eng |
"fifty to three hundred and fifty pounds and Lot one four three three hundred and fifty to four hund(...TRUNCATED) | b0d13ac149c2eedb71c30f178ac787e3d28458010b8db524d413ead071c8d57f | child-available-speech | BNC | Latn | n/a | BNC User License | {} | 11,734 | eng |
BabyLM English–Chinese 50/50 Stratified
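A bilingual training corpus for the Multilingual BabyLM project, combining a 50% sample of the English BabyBabelLM corpus with a 50% sample of the Chinese BabyBabelLM corpus. The "50/50" in the name refers to this sampling rate, not to the token split between the two languages.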
Sampling is stratified by category: each source category is sampled independently to preserve the original category proportions within each language.
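The stratified sampling described above can be sketched as follows. This is a minimal illustration and not the script used to build the corpus: the function name `stratified_sample`, the fixed seed, and the choice to sample whole documents (rather than to match a token budget) are all assumptions. The field names `language` and `category` follow the columns shown in the preview.

```python
import random
from collections import defaultdict


def stratified_sample(docs, fraction=0.5, seed=0):
    """Sample `fraction` of the documents within each (language, category) stratum.

    Assumption: sampling is done per document; the real pipeline may
    instead sample to a token budget.
    """
    rng = random.Random(seed)

    # Group documents by stratum so each category is sampled independently.
    strata = defaultdict(list)
    for doc in docs:
        strata[(doc["language"], doc["category"])].append(doc)

    sample = []
    for key in sorted(strata):  # sorted for reproducible ordering
        group = strata[key]
        k = round(len(group) * fraction)
        sample.extend(rng.sample(group, k))  # sampling without replacement
    return sample
```

Because every stratum is reduced by the same fraction, the category proportions within each language stay approximately the same as in the full corpus, up to rounding in small strata.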
Token counts
| Language | Tokens | Share |
|---|---|---|
| English (eng) | 49,481,353 | 41.8% |
| Chinese (zho) | 68,925,921 | 58.2% |
| Total | 118,407,274 | 100% |
Category breakdown
| Category | English tokens | English % | Chinese tokens | Chinese % |
|---|---|---|---|---|
| child-available-speech | 4,551,767 | 9.2% | 3,702,039 | 5.4% |
| child-books | 13,406,594 | 27.1% | 7,992,009 | 11.6% |
| child-directed-speech | 13,857,314 | 28.0% | 4,821,121 | 7.0% |
| child-wiki | 7,304,856 | 14.8% | 12,967 | 0.0% |
| educational | 0 | 0.0% | 6,732,733 | 9.8% |
| padding-opensubtitles | 9,856,922 | 19.9% | 0 | 0.0% |
| padding-wikipedia | 503,900 | 1.0% | 0 | 0.0% |
| subtitles | 0 | 0.0% | 45,665,052 | 66.3% |
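A breakdown like the one above could be recomputed from the per-document `num-tokens`, `language`, and `category` fields. The sketch below assumes each record is a plain dictionary with those keys and that `num-tokens` is an integer; the function name `token_breakdown` is hypothetical, and how records are loaded is left out.

```python
from collections import Counter


def token_breakdown(rows):
    """Return token totals per (language, category) and each category's
    percentage share within its language."""
    totals = Counter()
    per_language = Counter()
    for row in rows:
        n = row["num-tokens"]
        totals[(row["language"], row["category"])] += n
        per_language[row["language"]] += n

    # Share of each category within its own language, in percent.
    shares = {
        (lang, cat): 100.0 * n / per_language[lang]
        for (lang, cat), n in totals.items()
        if per_language[lang] > 0
    }
    return totals, shares


# Tiny illustrative input (made-up numbers, not taken from the corpus).
rows = [
    {"language": "eng", "category": "child-books", "num-tokens": 300},
    {"language": "eng", "category": "child-wiki", "num-tokens": 100},
    {"language": "zho", "category": "subtitles", "num-tokens": 200},
]
totals, shares = token_breakdown(rows)
```

Shares are computed within each language, which matches how the "English %" and "Chinese %" columns each sum to roughly 100%.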
Source datasets