Configuration Files¶
The configuration files for OpusDistillery are written in YAML format and are divided into two main sections:
experiment
: Contains the general setup and parameters for the experiment, excluding dataset information.

datasets
: Specifies the datasets used for training, development, and evaluation. Details about datasets can be found in Dataset Importers.
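A minimal sketch of how the two sections fit together in one file (the experiment values repeat the example further below; the contents of datasets are described in Dataset Importers and are omitted here):
experiment:
  # general setup and parameters, e.g. dirname, name, langpairs
  dirname: test
  name: fiu-eng
datasets:
  # training, development, and evaluation datasets (see Dataset Importers)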
Experiment Setup¶
In the experiment section, the following key parameters must be defined:

dirname
: The directory where all experiment outputs will be stored.

name
: The name of the experiment. All generated data and models will be saved under dirname/name.

langpairs
: A list of language pairs for the student model, using ISO two-letter language codes.
Example configuration:
experiment:
  dirname: test
  name: fiu-eng
  langpairs:
    - et-en
    - fi-en
    - hu-en
Data processing¶
OpusFilter¶
OpusDistillery supports OpusFilter, a tool for filtering and combining parallel corpora. Instead of the default cleaning, you can choose to filter data using OpusFilter with either a default configuration or a custom configuration that you provide.
In the configuration file, if you want to use a default configuration, see this example. Otherwise, you can specify the path to a custom OpusFilter configuration file such as this one.
opusfilter:
  config: default # Or specify the path to an OpusFilter configuration file
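For example, a custom configuration could be referenced like this (the filename below is only illustrative):
opusfilter:
  config: "configs/opusfilter/my-filters.yml"  # illustrative path to a custom OpusFilter config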
Teacher models¶
You can select a teacher model from OPUS-MT or Hugging Face.
OPUS-MT Teachers¶
To specify an OPUS-MT teacher, use the opusmt-teacher parameter. It can be one of the following:
A URL to an OPUS-MT model:
opusmt-teacher: "https://object.pouta.csc.fi/Tatoeba-MT-models/fiu-eng/opus4m-2020-08-12.zip"
A path to a local OPUS-MT model:
opusmt-teacher: "/path/to/opus-mt/model"
A list of OPUS-MT models:
opusmt-teacher:
  - "https://object.pouta.csc.fi/Tatoeba-MT-models/gem-gem/opus-2020-10-04.zip"
  - "https://object.pouta.csc.fi/Tatoeba-MT-models/eng-swe/opus+bt-2021-04-14.zip"
For multilingual students, specify different teachers for each language pair:
opusmt-teacher:
  en-uk: "https://object.pouta.csc.fi/Tatoeba-MT-models/eng-ukr/opus+bt-2021-04-14.zip"
  en-ru: "https://object.pouta.csc.fi/Tatoeba-MT-models/eng-rus/opus+bt-2021-04-14.zip"
  en-be: "https://object.pouta.csc.fi/Tatoeba-MT-models/eng-bel/opus+bt-2021-03-07.zip"
Use the best option to automatically select the best teacher for each language pair, based on FLORES200+ scores from the OPUS-MT dashboard.
opusmt-teacher: "best"
Hugging Face Teachers¶
You can also use a Hugging Face model as a teacher.

modelname
: The model identifier from the Hugging Face Hub.

modelclass
: The class of the model being loaded.
huggingface:
  modelname: "Helsinki-NLP/opus-mt-mul-en"
  modelclass: "transformers.AutoModelForSeq2SeqLM"
You can also configure the decoding options:
huggingface:
  modelname: "HPLT/translate-et-en-v1.0-hplt_opus"
  modelclass: "transformers.AutoModelForSeq2SeqLM"
  config:
    top_k: 50
    top_p: 0.90
    temperature: 0.1
    max_new_tokens: 128
For models that use language tags, additional parameters are required:
lang_info
: Set to True if language tags are needed.

lang_tags
: A mapping of language codes to the tags used by the model.
huggingface:
  modelname: "facebook/nllb-200-distilled-600M"
  modelclass: "transformers.AutoModelForSeq2SeqLM"
  lang_info: True
  lang_tags:
    en: eng_Latn
    et: est_Latn
Finally, for models requiring a prompt, you can define it like this:
huggingface:
  modelname: "google-t5/t5-small"
  modelclass: "transformers.AutoModelForSeq2SeqLM"
  lang_tags:
    en: English
    de: German
  prompt: "Translate {src_lang} to {tgt_lang}: {source}"
In this case, the lang_tags mapping will be used in the prompt.
Note: When using a Hugging Face model as a teacher, there is no scoring or cross-entropy filtering.
CTranslate2¶
The pipeline also supports CTranslate2 inference for Hugging Face models, which provides a considerable speedup. To enable it, simply add a new boolean key:
huggingface:
  modelname: "facebook/nllb-200-distilled-1.3B"
  lang_info: True
  batch_size: 4096
  lang_tags:
    en: eng_Latn
    ja: jpn_Jpan
  ct2: True
We have done some benchmarking on 4 NVIDIA Ampere A100 GPUs, which shows that CTranslate2 provides roughly 26x faster inference:
| Model                            | Type        | Batch size | Return sequences | Sent/s  |
|----------------------------------|-------------|------------|------------------|---------|
| facebook/nllb-200-distilled-1.3B | ctranslate2 | 8192       | 8                | 406.316 |
| facebook/nllb-200-distilled-1.3B | huggingface | 8          | 8                | 15.37   |
Backward models¶
Currently, only OPUS-MT models are available as backward models for scoring translations.
To specify a backward model, use:

opusmt-backward
: The URL or path to an OPUS-MT model. Like the teacher models, this can also be a dictionary for multilingual students (as in the example below) or best.
opusmt-backward:
  uk-en: "https://object.pouta.csc.fi/Tatoeba-MT-models/ukr-eng/opus+bt-2021-04-30.zip"
  ru-en: "https://object.pouta.csc.fi/Tatoeba-MT-models/rus-eng/opus+bt-2021-04-30.zip"
  be-en: "https://object.pouta.csc.fi/Tatoeba-MT-models/bel-eng/opus+bt-2021-04-30.zip"
If left empty, the cross-entropy filtering step will be skipped.
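As with the teacher models, you can also let the pipeline pick the backward model for each language pair automatically:
opusmt-backward: "best"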
Multilinguality¶
Specify whether the teacher, backward, and student models are one-to-many, so that language tags are handled properly. By default, this is set to False.
one2many-teacher
: True or False (default). If opusmt-teacher is set to best, this should also be set to best, as shown below.

one2many-backward
: True or False (default). If opusmt-backward is set to best, this should also be set to best.

one2many-student
: True or False (default).
# Specify if the teacher and the student are one2many
one2many-teacher: True
one2many-student: True
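If the teacher and backward models were selected automatically with best, the corresponding flags are set to best instead; a minimal sketch:
one2many-teacher: best
one2many-backward: best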
Training¶
Marian arguments¶
You can override default pipeline settings with Marian-specific settings.
You can use the following options: training-teacher, decoding-teacher, training-backward, decoding-backward, training-student, and training-student-finetuned.
marian-args:
  # These configs override pipeline/train/configs
  training-student:
    dec-depth: 3
    enc-depth: 3
    dim-emb: 512
    tied-embeddings-all: true
    transformer-decoder-autoreg: rnn
    transformer-dim-ffn: 2048
    transformer-ffn-activation: relu
    transformer-ffn-depth: 2
    transformer-guided-alignment-layer: last
    transformer-heads: 8
    transformer-postprocess: dan
    transformer-preprocess: ""
    transformer-tied-layers: []
    transformer-train-position-embeddings: false
    type: transformer
Opustrainer¶
OpusDistillery supports OpusTrainer for curriculum training and data augmentation.
You can specify a path to the OpusTrainer configuration, such as in this example. This assumes you know the final paths of the data, as defined in this file.
Currently, this is implemented only for student training.
opustrainer:
  model: student # Ideally, could be teacher or backward
  path: "configs/opustrainer/config.fiu-eng.opustrainer.stages.yml" # This assumes you already know the paths to the data
Exporting¶
The final student model is exported in the Bergamot format, which uses shortlists. Since shortlists are trained using alignments, and there is an option to train a student without guided alignment (not using the tiny architecture), you can disable exporting by setting export in the configuration file:
export: "no"
Other¶
parallel-max-sentences
: Maximum number of parallel sentences to download from each dataset.

split-length
: The number of sentences into which the training data is split for forward translation.

best-model
: Metric used to select the best model.

spm-sample-size
: Sample size for training the student's SPM vocabulary.

spm-vocab-size
: Vocabulary size for training the student's SPM vocabulary.

student-prefix
: To train multiple students on the same data, add a prefix to the student name; this allows several students to be trained under the same directory structure with the same data. More details on the directory structure can be found here.
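A sketch of how these options might look under the experiment section (all values below are purely illustrative, not recommended defaults):
experiment:
  # ...
  parallel-max-sentences: 100000000   # illustrative cap on downloaded parallel sentences per dataset
  split-length: 2000000               # illustrative number of sentences per forward-translation split
  best-model: chrf                    # illustrative metric for selecting the best model
  spm-sample-size: 10000000           # illustrative sample size for the student SPM vocabulary
  spm-vocab-size: 32000               # illustrative vocabulary size for the student SPM vocabulary
  student-prefix: "run1"              # illustrative prefix for training several students on the same data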