African Languages Lab Multi-Open
multi-open is the open-source multilingual subset released by the
African Languages Lab. It contains English-target
parallel text for 31 African languages.
The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP
Issaka et al., ACL 2026.
The paper presents ALL Lab's broader collaborative program: systematic, quality-controlled data infrastructure, empirical evaluation for low-resource African NLP, and local research capacity building. It reports a broader multimodal collection spanning 40 languages and evaluates translation across 31 of them. This repository is not the complete paper corpus; it is a restricted release covering the paper's 31 target languages.
Data format
Each configuration is a separate English-target parallel dataset. Text and token-count column
names follow the language pair (for example, english, swahili, english_token_count, and
swahili_token_count).
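The naming rule above can be sketched as a small helper. This is only an illustration, not part of the release: the function name expected_columns is hypothetical, and it assumes every configuration name has the form english-<language> with a single hyphen, as in the english-swahili example. Configurations whose language names contain hyphens or other separators would need different handling.

```python
def expected_columns(config_name: str) -> list[str]:
    """Derive the column names implied by a config name such as "english-swahili".

    Assumption: the config name is two language names joined by one hyphen.
    """
    languages = config_name.split("-")
    if len(languages) != 2 or not all(languages):
        raise ValueError(f"expected '<language>-<language>', got {config_name!r}")
    # Text columns are named after the languages themselves.
    text_columns = list(languages)
    # Token-count columns append the "_token_count" suffix to each language name.
    count_columns = [f"{language}_token_count" for language in languages]
    return text_columns + count_columns


print(expected_columns("english-swahili"))
```

For english-swahili this yields the four column names listed above, which can be compared against the column names of a loaded configuration as a quick sanity check.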
Usage
from datasets import load_dataset

# Each language pair is its own configuration, for example "english-swahili".
dataset = load_dataset("African-Languages-Lab/multi-open", "english-swahili")
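Once a configuration is loaded, the token-count columns can be used to select sentence pairs by length. The sketch below is a hedged example rather than a documented workflow: the function within_token_budget and the 128-token limit are hypothetical, the token counts are assumed to be integers, and the sample rows are placeholders that only mimic the documented column layout so the snippet runs without downloading anything.

```python
def within_token_budget(row: dict, language: str = "swahili", max_tokens: int = 128) -> bool:
    """Return True when both sides of a pair have at most max_tokens tokens.

    Assumption: token-count columns hold integers and follow the
    "<language>_token_count" naming described in the data format section.
    """
    return (
        row["english_token_count"] <= max_tokens
        and row[f"{language}_token_count"] <= max_tokens
    )


# Placeholder rows (not real data) that mirror the documented columns.
rows = [
    {"english": "<short english text>", "swahili": "<short swahili text>",
     "english_token_count": 12, "swahili_token_count": 15},
    {"english": "<long english text>", "swahili": "<long swahili text>",
     "english_token_count": 300, "swahili_token_count": 340},
]

short_rows = [row for row in rows if within_token_budget(row)]
print(len(short_rows))
```

The same predicate can be passed to the filter method of a loaded split, for example dataset["train"].filter(within_token_budget), assuming the configuration exposes a split named train.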
Citation
@inproceedings{issaka-etal-2026-african,
title = "The {A}frican Languages Lab: A Collaborative Approach to Advancing Low-Resource {A}frican {NLP}",
author = "Issaka, Sheriff and Wang, Keyi and Ajibola, Yinka and Samuel-Ipaye, Oluwatumininu and Zhang, Zhaoyi and Jimenez, Nicte Aguillon and Agyei, Evans Kofi and Lin, Abraham and Ramachandran, Rohan and Mumin, Sadick Abdul and Nchifor, Faith and Issah, Mohammed Shuraim and Gonzalez, Erick Rosas and Liu, Lieqi and Kpei, Sylvester and Osei, Jemimah Kusi and Ajeneza, Carlene and Boateng, Persis and Yeboah, Prisca Adwoa Dufie and Gabriel, Saadia",
booktitle = "Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2026",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.acl-long.1965/",
pages = "42460--42477"
}