Quickstart

MAMMOTH is specifically designed for distributed training of modular systems in multi-GPU SLURM environments.

In the example below, we show how to configure MAMMOTH to train a machine translation model with language-specific encoders and decoders.

Step 0: Install mammoth

pip install mammoth-nlp

Check out the installation guide for instructions specific to particular clusters.
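
If you want a quick sanity check of the installation, importing the package should succeed (assuming here that pip installs the Python module under the name mammoth):

python3 -c "import mammoth; print('MAMMOTH import OK')"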

Step 1: Prepare the data

Before running the training, we download data for the chosen language pairs and create a SentencePiece tokenizer for the model.

Refer to the data preparation tutorial for more details.

In the following steps, we assume that you already have an encoded dataset of *.sp files for the Europarl dataset and the languages cs and bg, each paired with en. Your data directory europarl_data/encoded should therefore contain 8 files named {train,valid}.{bg,cs}-en.{lang}.sp, where lang is either side of the pair (e.g. train.bg-en.bg.sp and train.bg-en.en.sp). If you use other datasets, please update the paths in the configurations below.
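
For illustration, encoding raw parallel text into the expected *.sp files might look like the sketch below. The SentencePiece model path vocab/opusTC.mul.model and the raw-text file names are placeholders, not files produced by this quickstart; see the data preparation tutorial for the actual pipeline.

# encode one side of one pair; repeat for the other side, the other pair, and the valid split
spm_encode --model=vocab/opusTC.mul.model --output_format=piece \
  < raw/train.bg-en.bg > europarl_data/encoded/train.bg-en.bg.sp
spm_encode --model=vocab/opusTC.mul.model --output_format=piece \
  < raw/train.bg-en.en > europarl_data/encoded/train.bg-en.en.sp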

Step 2: Configurations

MAMMOTH uses configuration files to build a new Transformer model and to define your training settings, such as which modules are trained on data from which languages.

Below are a few examples of training configurations that work out of the box in a one-node, two-GPU environment.

Task-specific encoders and decoders

In this example, we create a model whose encoders and decoders are unshared, i.e. specific to the listed languages. This is defined by enc_sharing_group and dec_sharing_group. Note that the configs expect you to have access to 2 GPUs.

# TRAINING CONFIG
world_size: 2
gpu_ranks: [0, 1]

batch_type: tokens
batch_size: 4096

# INPUT/OUTPUT VOCABULARY CONFIG

src_vocab:
  bg: vocab/opusTC.mul.vocab.onmt
  cs: vocab/opusTC.mul.vocab.onmt
  en: vocab/opusTC.mul.vocab.onmt
tgt_vocab:
  cs: vocab/opusTC.mul.vocab.onmt
  en: vocab/opusTC.mul.vocab.onmt

# MODEL CONFIG

model_dim: 512

tasks:
  train_bg-en:
    src_tgt: bg-en
    enc_sharing_group: [bg]
    dec_sharing_group: [en]
    node_gpu: "0:0"
    path_src: europarl_data/encoded/train.bg-en.bg.sp
    path_tgt: europarl_data/encoded/train.bg-en.en.sp
  train_cs-en:
    src_tgt: cs-en
    enc_sharing_group: [cs]
    dec_sharing_group: [en]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.cs.sp
    path_tgt: europarl_data/encoded/train.cs-en.en.sp
  train_en-cs:
    src_tgt: en-cs
    enc_sharing_group: [en]
    dec_sharing_group: [cs]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.en.sp
    path_tgt: europarl_data/encoded/train.cs-en.cs.sp

enc_layers: [6]
dec_layers: [6]

Arbitrarily shared layers in encoders and task-specific decoders

The training and vocab config is the same as in the previous example. The difference is in the sharing groups and layer counts: each entry in enc_layers corresponds to one group in enc_sharing_group, so enc_layers: [4, 4] builds four language-specific encoder layers followed by four encoder layers shared across all tasks (the all group); the decoders are defined as in the previous example.

# TRAINING CONFIG
world_size: 2
gpu_ranks: [0, 1]

batch_type: tokens
batch_size: 4096

# INPUT/OUTPUT VOCABULARY CONFIG

src_vocab:
  bg: vocab/opusTC.mul.vocab.onmt
  cs: vocab/opusTC.mul.vocab.onmt
  en: vocab/opusTC.mul.vocab.onmt
tgt_vocab:
  cs: vocab/opusTC.mul.vocab.onmt
  en: vocab/opusTC.mul.vocab.onmt

# MODEL CONFIG

model_dim: 512

tasks:
  train_bg-en:
    src_tgt: bg-en
    enc_sharing_group: [bg, all]
    dec_sharing_group: [en]
    node_gpu: "0:0"
    path_src: europarl_data/encoded/train.bg-en.bg.sp
    path_tgt: europarl_data/encoded/train.bg-en.en.sp
  train_cs-en:
    src_tgt: cs-en
    enc_sharing_group: [cs, all]
    dec_sharing_group: [en]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.cs.sp
    path_tgt: europarl_data/encoded/train.cs-en.en.sp
  train_en-cs:
    src_tgt: en-cs
    enc_sharing_group: [en, all]
    dec_sharing_group: [cs]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.en.sp
    path_tgt: europarl_data/encoded/train.cs-en.cs.sp

enc_layers: [4, 4]
dec_layers: [4]

Non-modular multilingual system

In this example, we share the input/output vocabulary over all languages. Hence, we define the vocabulary under a single key all, which we then use in the model definition (src_tgt: all-all).

# TRAINING CONFIG
world_size: 2
gpu_ranks: [0, 1]

batch_type: tokens
batch_size: 4096

# INPUT/OUTPUT VOCABULARY CONFIG

src_vocab:
  all: vocab/opusTC.mul.vocab.onmt
tgt_vocab:
  all: vocab/opusTC.mul.vocab.onmt

# MODEL CONFIG

model_dim: 512

tasks:
  train_bg-en:
    src_tgt: all-all
    enc_sharing_group: [shared_enc]
    dec_sharing_group: [shared_dec]
    node_gpu: "0:0"
    path_src: europarl_data/encoded/train.bg-en.bg.sp
    path_tgt: europarl_data/encoded/train.bg-en.en.sp
  train_cs-en:
    src_tgt: all-all
    enc_sharing_group: [shared_enc]
    dec_sharing_group: [shared_dec]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.cs.sp
    path_tgt: europarl_data/encoded/train.cs-en.en.sp
  train_en-cs:
    src_tgt: all-all
    enc_sharing_group: [shared_enc]
    dec_sharing_group: [shared_dec]
    node_gpu: "0:1"
    path_src: europarl_data/encoded/train.cs-en.en.sp
    path_tgt: europarl_data/encoded/train.cs-en.cs.sp

enc_layers: [6]
dec_layers: [6]

To proceed, copy one of the configurations above into a new file named my_config.yaml.

For further information, check out the documentation of all parameters in train.py.

For more complex scenarios, we recommend generating your configurations with our automatic configuration generation tool.

Step 3: Start training

You can start training on a single machine by simply running the train.py script, optionally specifying which GPUs to use. Note that the example configs above assume two GPUs available on one machine.

CUDA_VISIBLE_DEVICES=0,1 python3 train.py -config my_config.yaml -save_model output_dir -tensorboard -tensorboard_log_dir log_dir
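
On a SLURM cluster, the same run can be submitted as a batch job. The sketch below is only an outline; the job name, GPU request syntax, time limit, and environment setup are placeholders that depend on your cluster.

#!/bin/bash
#SBATCH --job-name=mammoth-quickstart
#SBATCH --nodes=1
#SBATCH --gres=gpu:2            # two GPUs on one node, matching world_size: 2
#SBATCH --time=24:00:00

# activate your environment here, e.g. module load ... / source venv/bin/activate

srun python3 train.py -config my_config.yaml -save_model output_dir \
    -tensorboard -tensorboard_log_dir log_dir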

Note that all train.py parameters can also be passed as command-line arguments. In case of duplicate arguments, the command-line values override the ones found in my_config.yaml.
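
For example, to override the batch size from the command line without editing the config (any other train.py parameter can be overridden the same way):

# -batch_size 2048 takes precedence over batch_size: 4096 in my_config.yaml
CUDA_VISIBLE_DEVICES=0,1 python3 train.py -config my_config.yaml -save_model output_dir -batch_size 2048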

Step 4: Translate

Now that you have successfully trained your multilingual machine translation model using Mammoth, it’s time to put it to use for translation.

python3 -u $MAMMOTH/translate.py \
  --config "my_config.yaml" \
  --model "$model_checkpoint" \
  --task_id "train_$src_lang-$tgt_lang" \
  --src "$path_to_src_language/$lang_pair.$src_lang.sp" \
  --output "$out_path/$src_lang-$tgt_lang.hyp.sp" \
  --gpu 0 --shard_size 0 \
  --batch_size 512

A breakdown of the translation command above:

  • Provide necessary details using the following options:

    • Configuration File: --config "my_config.yaml"

    • Model Checkpoint: --model "$model_checkpoint"

    • Translation Task: --task_id "train_$src_lang-$tgt_lang"

  • Point to the source language file for translation: --src "$path_to_src_language/$lang_pair.$src_lang.sp"

  • Define the path for saving the translated output: --output "$out_path/$src_lang-$tgt_lang.hyp.sp"

  • Adjust GPU and batch size settings based on your requirements: --gpu 0 --shard_size 0 --batch_size 512

  • We also provide a model checkpoint trained with the shared-encoder scheme described in this tutorial:

    wget https://mammoth-share.a3s.fi/encoder-shared-models.tar.gz
    

Congratulations! You’ve successfully translated text using your Mammoth model. Adjust the parameters as needed for your specific translation tasks.
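
Note that the hypotheses written to $out_path are still SentencePiece-encoded (.sp). Below is a minimal sketch for detokenizing them back to plain text, assuming the same SentencePiece model that was used for encoding (the path vocab/opusTC.mul.model is a placeholder):

# decode SentencePiece pieces back into plain text
spm_decode --model=vocab/opusTC.mul.model --input_format=piece \
  < "$out_path/$src_lang-$tgt_lang.hyp.sp" > "$out_path/$src_lang-$tgt_lang.hyp.txt"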

Further reading

A complete example of training on the Europarl dataset is available at MAMMOTH101, and a complete example for configuring different sharing schemes is available at MAMMOTH sharing schemes.