# Dataset importers

Dataset importers can be used in the `datasets` sections of the training config.

Example:

```yaml
datasets:
  train:
    - tc_Tatoeba-Challenge-v2023-09-26
  devtest:
    - flores_dev
  test:
    - flores_devtest
```

| Data source | Prefix | Name examples | Type | Comments |
|---|---|---|---|---|
| MTData | `mtdata` | `newstest2017_ruen` | corpus | Supports many datasets. Run `mtdata list -l ru-en` to see the datasets for a specific language pair. |
| OPUS | `opus` | `ParaCrawl__v7.1` | corpus | Many open-source datasets. Go to the website, choose a language pair, and check the links in the Moses column to see which name and version are used in a link. The name and version should be separated by a double underscore (`__`). |
| SacreBLEU | `sacrebleu` | `wmt20` | corpus | Official evaluation datasets available in the SacreBLEU tool. Recommended for the `datasets:test` config section. Look up the supported datasets and language pairs in the `sacrebleu.dataset` Python module. |
| Flores | `flores` | `dev`, `devtest` | corpus | Evaluation dataset from Facebook that supports 100 languages. |
| Custom parallel | `url` | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst` | corpus | A custom zst-compressed parallel dataset, for example one uploaded to GCS. The language pair should be split into two files; `[LANG]` will be replaced with the source and target language codes. |
| Paracrawl | `paracrawl-mono` | `paracrawl8` | mono | Datasets crawled from the web. Only monolingual datasets are used by this importer; the parallel corpus is available through the `opus` importer. |
| News crawl | `news-crawl` | `news.2019` | mono | Monolingual news datasets from WMT21. |
| Common crawl | `commoncrawl` | `wmt16` | mono | Huge web-crawl datasets. The links are posted on the WMT21 website. |
| Custom mono | `url` | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst` | mono | A custom zst-compressed monolingual dataset, for example one uploaded to GCS. |
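Putting prefixes and names together, a config mixing several importers might look like the sketch below. It assumes each entry takes the form `<prefix>_<name>`, following the `flores_dev` example in the config snippet at the top; the dataset names are the examples from the table above.

```yaml
datasets:
  train:
    - opus_ParaCrawl__v7.1
    - mtdata_newstest2017_ruen
  devtest:
    - flores_dev
  test:
    - sacrebleu_wmt20
    - flores_devtest
```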

## Find datasets

You can also use the `find-corpus` tool to find all datasets for an importer and get them formatted for use in the config.

Set up a local poetry environment, then run:

```shell
task find-corpus -- en ru
```

Make sure to check the licenses of the datasets before using them.

## Adding a new importer

Add a shell script named `<prefix>.sh` to the `corpus` or `mono` folder. It should accept the same parameters as the other scripts in that folder.
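As a rough illustration, a skeleton for a hypothetical `mysource` importer might look like the following. This is a sketch only: the parameter names and order are assumptions and must be matched to the existing scripts in the folder, and the actual download step is replaced with an `echo`.

```shell
#!/bin/bash
# Hypothetical importer skeleton (mysource.sh) -- a sketch, not the project's actual API.
# Assumed parameters: dataset name, source language, target language, output prefix.
set -euo pipefail

import_dataset() {
  local dataset=$1 src=$2 trg=$3 output_prefix=$4
  # A real importer would download and zst-compress each side of the corpus here;
  # this sketch only prints the file names it would produce.
  local lang
  for lang in "$src" "$trg"; do
    echo "fetching ${dataset}.${lang} -> ${output_prefix}.${lang}.zst"
  done
}

import_dataset ParaCrawl__v7.1 en ru ./corpus
```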

## Issues

- Currently it is not possible to download specific datasets per language pair; the tool downloads the same datasets for all language pairs. If a dataset doesn't exist for a given language pair, dummy files are created. Do you want to collaborate? Feel free to work on this issue.
- There is currently no full support for downloading monolingual datasets; the use of monolingual data is not fully implemented, and only bilingual data is supported at this time. Do you want to collaborate? Feel free to work on this issue.