# Pipeline steps

Below is an overview of the pipeline steps:


The pipeline consists of five main steps:

- Data Preprocessing: Downloads data from publicly available repositories and handles basic data cleaning.
- Synthetic Dataset Generation: Downloads the relevant teacher and backward models, forward-translates all source sentences into the target languages with our teacher model(s), computes cross-entropy scores with a backward model, and then filters the synthetic dataset.
- Student Training: Trains a small transformer model on the filtered synthetic dataset with guided alignment.
- Exporting: Creates the final student. This includes a fine-tuning step, a quantization step, and finally the export step, which saves the model so that it is ready for deployment.
- Evaluation: Evaluates the trained model.

The steps are based on the train-student recipe. They can be represented as a Directed Acyclic Graph (DAG).
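As a minimal illustration of that structure, the sketch below hand-writes a small dependency map (the step names are abridged from the table that follows, and the edges are an assumption for illustration; the real DAG is derived by the workflow engine from the pipeline rules) and recovers a valid execution order with a topological sort:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical, abridged dependency map: each step lists the steps
# it depends on. Illustrative only; not the pipeline's actual graph.
deps = {
    "download_data": set(),
    "clean_data": {"download_data"},
    "merge_dedupe": {"clean_data"},
    "train_vocab": {"merge_dedupe"},
    "translate_teacher": {"merge_dedupe"},
    "ce_filter": {"translate_teacher"},
    "train_alignments": {"ce_filter", "train_vocab"},
    "train_student": {"ce_filter", "train_alignments"},
    "finetune_student": {"train_student"},
    "quantize": {"finetune_student"},
    "export": {"quantize"},
    "evaluate": {"quantize"},
}

# static_order() yields each step only after all of its dependencies,
# i.e. one valid way to schedule the DAG.
print(list(TopologicalSorter(deps).static_order()))
```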

| Step | Description | Bottleneck | Comments |
|---|---|---|---|
| Installation | Installing dependencies and compiling | CPU | Takes ~1 hour. |
| Data downloading | Downloads datasets and samples sentences | Network, Disk | Time depends on dataset size; sampling huge mono datasets (100M+ sentences) is the most intensive operation. |
| Data cleaning | Basic preprocessing: dataset-specific, language-specific, rule-based and other attempts to clean noisy data in parallel and mono datasets | CPU | Parallelizes well across CPU cores. To make cleaning of a new language more efficient, add it to `clean_parallel.py`. |
| Merge and dedupe | Merges the clean datasets and applies deduplication | CPU, Disk | |
| Training vocabulary | Trains a SentencePiece vocabulary/tokenizer model on the parallel corpus | CPU | |
| Teacher download | Downloads the teacher model | CPU | |
| Backward model download | Downloads the backward model | CPU | |
| Translation by teacher | Translates a corpus using the teacher model(s) | GPU | The slowest part of the pipeline; it can take days. It can be sped up by using multiple nodes in cluster mode. |
| Cross-entropy filtering | Scores the translated corpus with the backward seq2seq model and removes the lowest-scoring part of the corpus to reduce noise (see the sketch after this table) | GPU, CPU, Disk | The datasets are huge at this point; very disk intensive. |
| Training alignments and shortlist | Trains alignments using `fast_align` and extracts a lexical shortlist using the `extract_lex` tool | CPU, Disk | Some tools require uncompressed datasets on disk, which are huge at this point. Parallelizes well across CPU cores. |
| Training student | Trains a small transformer student model on the filtered data with guided alignment | GPU | Shuffling in RAM might fail if the dataset is huge and the machine does not have enough RAM, so it is recommended to disable it and use Marian's `shuffle: batches` setting (see issue). |
| Fine-tuning student | Fine-tunes the student model by emulating 8-bit GEMM during training | GPU | Converges very quickly and then degrades; it is fast, but you might want to reduce the early-stopping threshold. |
| Quantization | Applies 8-bit quantization to the fine-tuned student model and runs evaluation on CPU | CPU | CPU threads must be set to 1 for this step. |
| Evaluation | Calculates metrics (BLEU, chrF) for all models using SacreBLEU (see the sketch at the end of this page) | GPU | Uses the `datasets.test` configuration section. |
| Export | Exports the trained model and shortlist to the [bergamot-translator](https://github.com/mozilla/bergamot-translator) format | | |
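To make the cross-entropy filtering step concrete, here is a minimal sketch of the idea, assuming a tab-separated input file where each line carries the backward model's per-word cross-entropy for one sentence pair (the file names, column layout, and the 20% removal fraction are illustrative assumptions, not the pipeline's actual parameters):

```python
# Minimal cross-entropy filtering sketch. Assumes "scored.tsv" holds
# lines of the form "score<TAB>source<TAB>target", where the score is
# the backward model's per-word cross-entropy (lower = more likely).
REMOVE_FRACTION = 0.2  # illustrative; drop the worst-scoring 20%

with open("scored.tsv", encoding="utf-8") as f:
    rows = [line.rstrip("\n").split("\t") for line in f]

# Best (lowest cross-entropy) pairs first, then keep the top 80%.
rows.sort(key=lambda r: float(r[0]))
kept = rows[: int(len(rows) * (1 - REMOVE_FRACTION))]

with open("filtered.tsv", "w", encoding="utf-8") as f:
    for _score, src, tgt in kept:
        f.write(f"{src}\t{tgt}\n")
```

In practice a fixed threshold or a per-language quantile can replace a global fraction; the point is only that the noisiest synthetic pairs, as judged by the backward model, never reach student training.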

## Configurable steps

The table below summarizes the main OpusDistillery steps. For each step, we report the compute resource used (CPU or GPU), whether the step is optional, and whether it is configurable or hard-coded.

| Main Step | Step | Resource | Optional | Configurable |
|---|---|---|---|---|
| Data Processing | Data Download | CPU | | |
| Data Processing | Data Cleaning | CPU | | |
| Synthetic Dataset Generation | Teacher Model Download | CPU | | |
| Synthetic Dataset Generation | Forward Translation | GPU | | |
| Synthetic Dataset Generation | Backward Model Download | CPU | | |
| Synthetic Dataset Generation | Cross-Entropy Scoring | GPU | | |
| Synthetic Dataset Generation | Cross-Entropy Filtering | CPU | | |
| Student Training | Alignment Training | CPU | | |
| Student Training | Vocabulary Training | CPU | | |
| Student Training | Student Training | GPU | | |
| Exporting | Fine-tuning | GPU | | |
| Exporting | Quantization | CPU | | |
| Exporting | Export | - | | |
| Evaluation | Evaluation | GPU | | |
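As a small, standalone pointer for the evaluation step, the sketch below computes BLEU and chrF with the `sacrebleu` Python package (the toy hypothesis and reference strings are placeholders; the pipeline itself runs SacreBLEU over the `datasets.test` sets):

```python
import sacrebleu

# Toy data: one hypothesis stream and one reference stream.
hypotheses = ["The quick brown fox jumps over the lazy dog."]
references = [["The fast brown fox jumped over the lazy dog."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}  chrF = {chrf.score:.2f}")
```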