# Dataset importers

Dataset importers can be used in the `datasets` sections of the training config.

Example:

```yaml
datasets:
  train:
    - tc_Tatoeba-Challenge-v2023-09-26
  devtest:
    - flores_dev
  test:
    - flores_devtest
```

| Data source | Prefix | Name examples | Type | Comments |
|---|---|---|---|---|
| MTData | `mtdata` | `newstest2017_ruen` | corpus | Supports many datasets. Run `mtdata list -l ru-en` to see the datasets for a specific language pair. |
| OPUS | `opus` | `ParaCrawl__v7.1` | corpus | Many open-source datasets. Go to the website, choose a language pair, and check the links in the Moses column to see which name and version are used in a link. The name and version should be separated by a double underscore (`__`). |
| SacreBLEU | `sacrebleu` | `wmt20` | corpus | Official evaluation datasets available in the SacreBLEU tool. Recommended for the `datasets:test` config section. Look up the supported datasets and language pairs in the `sacrebleu.dataset` Python module. |
| Flores | `flores` | `dev`, `devtest` | corpus | Evaluation dataset from Facebook that supports 100 languages. |
| Custom parallel | `url` | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.[LANG].zst` | corpus | A custom zst-compressed parallel dataset, for example one uploaded to GCS. The language pair should be split into two files; `[LANG]` will be replaced with the source and target language codes. |
| Paracrawl | `paracrawl-mono` | `paracrawl8` | mono | Datasets crawled from the web. Only monolingual datasets are used by this importer; the parallel corpus is available through the `opus` importer. |
| News crawl | `news-crawl` | `news.2019` | mono | Monolingual news datasets from WMT21. |
| Common crawl | `commoncrawl` | `wmt16` | mono | Huge web-crawl datasets. The links are posted on the WMT21 website. |
| Custom mono | `url` | `https://storage.googleapis.com/releng-translations-dev/data/en-ru/pytest-dataset.ru.zst` | mono | A custom zst-compressed monolingual dataset, for example one uploaded to GCS. |
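Putting prefixes and names together, a config mixing several importers might look like the sketch below. It assumes each entry takes the form `<prefix>_<name>`, following the `flores_dev` example in the config snippet at the top; the dataset names are the examples from the table above.

```yaml
datasets:
  train:
    - opus_ParaCrawl__v7.1
    - mtdata_newstest2017_ruen
  devtest:
    - flores_dev
  test:
    - sacrebleu_wmt20
    - flores_devtest
```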

## Find datasets

You can also use the `find-corpus` tool to find all datasets for an importer and get them formatted for use in the config.

Set up a local poetry environment, then run:

```shell
task find-corpus -- en ru
```

Make sure to check the licenses of the datasets before using them.

## Adding a new importer

Add a shell script named `<prefix>.sh` to the `corpus` or `mono` folder. It should accept the same parameters as the other scripts in that folder.
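As a rough illustration, a skeleton for a hypothetical `mysource` importer might look like the following. This is a sketch only: the parameter names and order are assumptions and must be matched to the existing scripts in the folder, and the actual download step is replaced with an `echo`.

```shell
#!/bin/bash
# Hypothetical importer skeleton (mysource.sh) -- a sketch, not the project's actual API.
# Assumed parameters: dataset name, source language, target language, output prefix.
set -euo pipefail

import_dataset() {
  local dataset=$1 src=$2 trg=$3 output_prefix=$4
  # A real importer would download and zst-compress each side of the corpus here;
  # this sketch only prints the file names it would produce.
  local lang
  for lang in "$src" "$trg"; do
    echo "fetching ${dataset}.${lang} -> ${output_prefix}.${lang}.zst"
  done
}

import_dataset ParaCrawl__v7.1 en ru ./corpus
```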

## Issues

- Currently it is not possible to download specific datasets per language pair; the tool downloads the same datasets for all language pairs. If a dataset doesn't exist for a given language pair, dummy files are created. Do you want to collaborate? Feel free to work on this issue.
- There is currently no full support for downloading monolingual datasets; the use of monolingual data is not fully implemented, and only bilingual data is supported at this time. Do you want to collaborate? Feel free to work on this issue.