# Quickstart

MAMMOTH is specifically designed for distributed training of modular systems in multi-GPU SLURM environments. In the example below, we show you how to configure MAMMOTH to train a machine translation model with language-specific encoders and decoders.

### Step 0: Install MAMMOTH

```bash
pip install mammoth-nlp
```

Check out the [installation guide](install) for instructions on installing in specific clusters.

### Step 1: Prepare the data

Before running the training, we download data for the chosen language pairs and create a SentencePiece tokenizer for the model. **Refer to the data preparation [tutorial](prepare_data) for more details.**

In the following steps, we assume that you already have an encoded dataset of `*.sp` files for the `europarl` dataset and the languages `cs` and `bg`. Your data directory `europarl_data/encoded` should therefore contain 8 files following the pattern `{train/valid}.{cs/bg}-en.{lang}.sp`, where `{lang}` is either side of the pair (e.g. `train.bg-en.bg.sp` and `train.bg-en.en.sp`). If you use other datasets, please update the paths in the configurations below.

### Step 2: Configurations

MAMMOTH uses configuration files to build a new transformer model and to define your training settings, such as which modules are trained on data from which languages. Below are a few examples of training configurations that work out of the box in a one-node, two-GPU environment.
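Before editing one of the configurations, it can help to verify that the data layout from Step 1 is in place. The snippet below is a minimal sanity-check sketch, assuming the Europarl file naming described above:

```bash
# Minimal sanity check (assumes the Europarl naming scheme from Step 1):
# all eight expected train/valid files should exist under europarl_data/encoded.
for split in train valid; do
  for pair in bg-en cs-en; do
    for side in "${pair%-*}" "${pair#*-}"; do
      f="europarl_data/encoded/${split}.${pair}.${side}.sp"
      [ -f "$f" ] && echo "found:   $f" || echo "missing: $f"
    done
  done
done
```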
#### Task-specific encoders and decoders

In this example, we create a model whose encoders and decoders are **unshared** across languages: each language gets its own encoder and decoder module. This is defined by `enc_sharing_group` and `dec_sharing_group`; tasks that list the same group name (here, the `en` decoder used by both `bg-en` and `cs-en`) share that module's parameters. Each task is also pinned to a device via `node_gpu` (in `node:gpu` format); note that these configs expect you to have access to 2 GPUs.

```yaml
# TRAINING CONFIG
world_size: 2
gpu_ranks: [0, 1]
batch_type: tokens
batch_size: 4096

# INPUT/OUTPUT VOCABULARY CONFIG
src_vocab:
  bg: vocab/opusTC.mul.vocab.onmt
  cs: vocab/opusTC.mul.vocab.onmt
  en: vocab/opusTC.mul.vocab.onmt
tgt_vocab:
  cs: vocab/opusTC.mul.vocab.onmt
  en: vocab/opusTC.mul.vocab.onmt

# MODEL CONFIG
model_dim: 512
tasks:
  train_bg-en:
    src_tgt: bg-en
    enc_sharing_group: [bg]
    dec_sharing_group: [en]
    node_gpu: "0:0"
    path_src: europarl_data/encoded/train.bg-en.bg.sp
    path_tgt: europarl_data/encoded/train.bg-en.en.sp
  train_cs-en:
    src_tgt: cs-en
    enc_sharing_group: [cs]
    dec_sharing_group: [en]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.cs.sp
    path_tgt: europarl_data/encoded/train.cs-en.en.sp
  train_en-cs:
    src_tgt: en-cs
    enc_sharing_group: [en]
    dec_sharing_group: [cs]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.en.sp
    path_tgt: europarl_data/encoded/train.cs-en.cs.sp
enc_layers: [6]
dec_layers: [6]
```
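The task list can be extended by following the same pattern. As a purely illustrative sketch (not part of the example above), a reverse `en-bg` direction could be added as one more entry under `tasks`; since it introduces a `bg` decoder, the vocabulary file for `bg` would also need to be listed under `tgt_vocab`:

```yaml
# Hypothetical additional task (illustration only), to be placed alongside the
# existing entries under tasks:. It reuses the en encoder and adds a bg decoder,
# so bg: vocab/opusTC.mul.vocab.onmt must also appear under tgt_vocab.
tasks:
  train_en-bg:
    src_tgt: en-bg
    enc_sharing_group: [en]
    dec_sharing_group: [bg]
    node_gpu: "0:0"
    path_src: europarl_data/encoded/train.bg-en.en.sp
    path_tgt: europarl_data/encoded/train.bg-en.bg.sp
```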
#### Arbitrarily shared layers in encoders and task-specific decoders

The training and vocabulary configs are the same as in the previous example. The difference is that each encoder now consists of two blocks of layers: a language-specific block (`bg`, `cs`, or `en`) followed by a block shared across `all` tasks. Each entry in `enc_layers` gives the number of layers of the sharing group at the same position in `enc_sharing_group`, so `enc_layers: [4, 4]` assigns 4 layers to each block.

```yaml
# TRAINING CONFIG
world_size: 2
gpu_ranks: [0, 1]
batch_type: tokens
batch_size: 4096

# INPUT/OUTPUT VOCABULARY CONFIG
src_vocab:
  bg: vocab/opusTC.mul.vocab.onmt
  cs: vocab/opusTC.mul.vocab.onmt
  en: vocab/opusTC.mul.vocab.onmt
tgt_vocab:
  cs: vocab/opusTC.mul.vocab.onmt
  en: vocab/opusTC.mul.vocab.onmt

# MODEL CONFIG
model_dim: 512
tasks:
  train_bg-en:
    src_tgt: bg-en
    enc_sharing_group: [bg, all]
    dec_sharing_group: [en]
    node_gpu: "0:0"
    path_src: europarl_data/encoded/train.bg-en.bg.sp
    path_tgt: europarl_data/encoded/train.bg-en.en.sp
  train_cs-en:
    src_tgt: cs-en
    enc_sharing_group: [cs, all]
    dec_sharing_group: [en]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.cs.sp
    path_tgt: europarl_data/encoded/train.cs-en.en.sp
  train_en-cs:
    src_tgt: en-cs
    enc_sharing_group: [en, all]
    dec_sharing_group: [cs]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.en.sp
    path_tgt: europarl_data/encoded/train.cs-en.cs.sp
enc_layers: [4, 4]
dec_layers: [4]
```
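If you want a different balance between language-specific and shared capacity, only the per-group layer counts need to change. The following is a hypothetical variant (an illustration, not part of the example above) with 2 language-specific encoder layers followed by 6 shared ones:

```yaml
# Hypothetical variant (illustration only): each entry in enc_layers pairs with
# the sharing group at the same position in enc_sharing_group, e.g. [bg, all].
enc_layers: [2, 6]
dec_layers: [4]
```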
#### Non-modular multilingual system

In this example, we share the input/output vocabulary across all languages. Hence, we define a vocabulary for an `all` language, which we then use in the model definition. Every task points to the same `shared_enc` and `shared_dec` sharing groups, so the resulting model is a fully shared (non-modular) multilingual transformer.

```yaml
# TRAINING CONFIG
world_size: 2
gpu_ranks: [0, 1]
batch_type: tokens
batch_size: 4096

# INPUT/OUTPUT VOCABULARY CONFIG
src_vocab:
  all: vocab/opusTC.mul.vocab.onmt
tgt_vocab:
  all: vocab/opusTC.mul.vocab.onmt

# MODEL CONFIG
model_dim: 512
tasks:
  train_bg-en:
    src_tgt: all-all
    enc_sharing_group: [shared_enc]
    dec_sharing_group: [shared_dec]
    node_gpu: "0:0"
    path_src: europarl_data/encoded/train.bg-en.bg.sp
    path_tgt: europarl_data/encoded/train.bg-en.en.sp
  train_cs-en:
    src_tgt: all-all
    enc_sharing_group: [shared_enc]
    dec_sharing_group: [shared_dec]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.cs.sp
    path_tgt: europarl_data/encoded/train.cs-en.en.sp
  train_en-cs:
    src_tgt: all-all
    enc_sharing_group: [shared_enc]
    dec_sharing_group: [shared_dec]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.en.sp
    path_tgt: europarl_data/encoded/train.cs-en.cs.sp
enc_layers: [6]
dec_layers: [6]
```
**To proceed, copy-paste one of these configurations into a new file named `my_config.yaml`.** For further information, check out the documentation of all parameters in **[train.py](options/train)**. For more complex scenarios, we recommend our [automatic configuration generation tool](config_config).

### Step 3: Start training

You can start your training on a single machine by simply running the Python script `train.py`, optionally specifying which GPUs to use. Note that the example configs above assume two GPUs available on one machine.

```shell
CUDA_VISIBLE_DEVICES=0,1 python3 train.py -config my_config.yaml -save_model output_dir -tensorboard -tensorboard_log_dir log_dir
```

When running `train.py`, you can pass any of the parameters from [train.py](options/train) as command-line arguments. In the case of duplicate arguments, the command-line parameters override those found in `my_config.yaml`.

### Step 4: Translate

Now that you have successfully trained your multilingual machine translation model with MAMMOTH, it's time to put it to use for translation.

```bash
python3 -u $MAMMOTH/translate.py \
    --config "my_config.yaml" \
    --model "$model_checkpoint" \
    --task_id "train_$src_lang-$tgt_lang" \
    --src "$path_to_src_language/$lang_pair.$src_lang.sp" \
    --output "$out_path/$src_lang-$tgt_lang.hyp.sp" \
    --gpu 0 --shard_size 0 \
    --batch_size 512
```

Provide the necessary details with the following options (a concrete, fully substituted version of this command is sketched at the end of this page):

- Configuration file: `--config "my_config.yaml"`
- Model checkpoint: `--model "$model_checkpoint"`
- Translation task: `--task_id "train_$src_lang-$tgt_lang"`
- Source-language file to translate: `--src "$path_to_src_language/$lang_pair.$src_lang.sp"`
- Path for saving the translated output: `--output "$out_path/$src_lang-$tgt_lang.hyp.sp"`
- GPU and batching settings, adjusted to your requirements: `--gpu 0 --shard_size 0 --batch_size 512`

We also provide a model checkpoint trained with the encoder-sharing scheme described in [this tutorial](examples/sharing_schemes.md):

```bash
wget https://mammoth-share.a3s.fi/encoder-shared-models.tar.gz
```

Congratulations! You've successfully translated text using your MAMMOTH model. Adjust the parameters as needed for your specific translation tasks.

### Further reading

A complete example of training on the Europarl dataset is available at [MAMMOTH101](examples/train_mammoth_101.md), and a complete example of configuring different sharing schemes is available at [MAMMOTH sharing schemes](examples/sharing_schemes.md).
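For reference, here is a minimal sketch of the Step 4 command with the shell variables substituted for the `bg-en` task from the first configuration. The checkpoint name and output directory are hypothetical placeholders; use the actual checkpoint produced by your training run.

```bash
# Minimal sketch (hypothetical paths): translate the Europarl validation set
# with the bg->en task from the first configuration above.
src_lang=bg
tgt_lang=en
lang_pair=bg-en
model_checkpoint=output_dir_step_100000.pt   # hypothetical checkpoint file name
out_path=translations
mkdir -p "$out_path"

python3 -u $MAMMOTH/translate.py \
    --config "my_config.yaml" \
    --model "$model_checkpoint" \
    --task_id "train_${src_lang}-${tgt_lang}" \
    --src "europarl_data/encoded/valid.${lang_pair}.${src_lang}.sp" \
    --output "${out_path}/${src_lang}-${tgt_lang}.hyp.sp" \
    --gpu 0 --shard_size 0 \
    --batch_size 512
```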