Dataset importers
Dataset importers can be used in the datasets section of the training config.
Example:
    datasets:
      train:
        - tc_Tatoeba-Challenge-v2023-09-26
      devtest:
        - flores_dev
      test:
        - flores_devtest
Data source | Prefix | Name examples | Type | Comments
---|---|---|---|---
MTData | mtdata | newstest2017_ruen | corpus | Supports many datasets. Run
OPUS | opus | ParaCrawl__v7.1 | corpus | Many open source datasets. Go to the website, choose a language pair, and check the links under the Moses column to see which names and versions are used in a link. The version should be separated from the name by a double underscore (__).
SacreBLEU | sacrebleu | wmt20 | corpus | Official evaluation datasets available in the SacreBLEU tool. Recommended to use in
Flores | flores | dev, devtest | corpus | Evaluation dataset from Facebook that supports 100 languages.
Custom parallel | url | | corpus | A custom zst-compressed parallel dataset, for instance uploaded to GCS. The language pairs should be split into two files. the
ParaCrawl | paracrawl-mono | paracrawl8 | mono | Datasets crawled from the web. This importer uses only the monolingual datasets; the parallel corpus is available through the opus importer.
News crawl | news-crawl | news.2019 | mono | Some monolingual news datasets from WMT21.
Common crawl | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on WMT21.
Custom mono | url | | mono | A custom zst-compressed monolingual dataset, for instance uploaded to GCS.
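The Prefix and Name examples columns combine into the keys used in the config, as in flores_devtest from the example at the top. The following is a minimal sketch of how such a key could be split back into its two parts. It assumes the key has the form <prefix>_<name> and that the prefix itself contains no underscore; the function name split_dataset_key is hypothetical and not part of the pipeline.

```shell
# Hypothetical helper: split a dataset key into importer prefix and name.
# Assumption: the key is "<prefix>_<name>" and the prefix has no underscore.
split_dataset_key() {
  key="$1"
  prefix="${key%%_*}"   # everything before the first underscore
  name="${key#*_}"      # everything after the first underscore
  printf '%s %s\n' "$prefix" "$name"
}

split_dataset_key "flores_devtest"
split_dataset_key "opus_ParaCrawl__v7.1"
split_dataset_key "mtdata_newstest2017_ruen"
```

Splitting at the first underscore only keeps names that contain underscores of their own, such as ParaCrawl__v7.1, intact. A key with no underscore at all is not handled by this sketch.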
Find datasets
You can also use the find-corpus tool to list all datasets for an importer, formatted for use in the config.
Set up a local Poetry environment first, then run:
    task find-corpus -- en ru
Make sure to check the licenses of the datasets before using them.
Adding a new importer
To add a new importer, add a shell script named <prefix>.sh to the corpus or mono folder.
The script must accept the same parameters as the other scripts in that folder.
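As a starting point, a new importer script might be laid out as in the sketch below. Everything specific here is an assumption made for illustration: the four positional parameters, the function name import_corpus, and the output file naming are hypothetical, and the body writes placeholder text instead of downloading anything. Copy the real parameter list from an existing script in the same folder.

```shell
#!/bin/sh
# Hypothetical skeleton for corpus/<prefix>.sh.
# The parameters below are assumed, not taken from the pipeline;
# check a neighbouring importer script for the real ones.
set -eu

import_corpus() {
  src="$1"            # source language code, e.g. en (assumed)
  trg="$2"            # target language code, e.g. ru (assumed)
  output_prefix="$3"  # path prefix for the output files (assumed)
  dataset="$4"        # dataset name without the importer prefix (assumed)

  mkdir -p "$(dirname "$output_prefix")"

  # Placeholder for the real download step: one output file per language.
  for lang in "$src" "$trg"; do
    printf 'placeholder for %s (%s)\n' "$dataset" "$lang" \
      > "${output_prefix}.${lang}"
  done
}

# Run only when all four arguments are supplied.
if [ "$#" -eq 4 ]; then
  import_corpus "$@"
fi
```

Keeping the work inside a function makes the argument handling easy to compare with the other importers before the download logic is filled in.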
Issues
Currently, it is not possible to download specific datasets per language pair; the tool downloads the same dataset for all language pairs. If a dataset doesn’t exist for a given language pair, dummy files are created. Do you want to collaborate? Feel free to work on this issue.
Downloading monolingual datasets is not currently supported: the use of monolingual data is not fully implemented, and only bilingual data is supported at this time. Do you want to collaborate? Feel free to work on this issue.