MAMMOTH Sharing Schemes¶
MAMMOTH is designed as a flexible modular system, allowing users to configure, train, and test various sharing schemes. This tutorial will guide you through the process of setting up and experimenting with different sharing schemes, including:
- fully shared
- fully unshared
- encoder shared
- decoder shared
The configuration for each scheme is managed through YAML files, ensuring a seamless and customizable experience.
Dataset¶
For this tutorial, we will use the UNPC dataset, which consists of manually translated UN documents spanning the years 1990 to 2014 in the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish.
Before diving into the sharing schemes, we need preprocessed data. You can download the already processed data using the following command:
wget https://mammoth-share.a3s.fi/unpc.tar
Additionally, we require the corresponding vocabularies for the dataset. Download the vocabularies with the following command:
wget https://mammoth-share.a3s.fi/vocab.tar.gz
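After downloading, extract both archives, for example:
tar -xf unpc.tar
tar -xzf vocab.tar.gz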
Now, let’s look at an overview of the sharing schemes to better understand how they work.
Sharing Schemes Overview¶
Each MAMMOTH sharing scheme differs in which modules are reused across tasks:
- fully shared: all languages use the same encoder and the same decoder.
- fully unshared: each language has its own encoder and decoder modules.
- encoder shared: one encoder is shared across all source languages, while each target language keeps its own decoder.
- decoder shared: one decoder is shared across all target languages, while each source language keeps its own encoder.
Which modules are shared is determined by the sharing groups assigned to each task in the configuration file.
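As a rough sketch of how this looks in practice (the per-task keys enc_sharing_group and dec_sharing_group are assumed here from MAMMOTH's task-level configuration; check the provided config files for the exact layout), an encoder shared setup could assign every task the same encoder group while giving each target language its own decoder group:
tasks:
  train_en-fr:
    src_tgt: en-fr
    enc_sharing_group: [all]   # every task reuses one shared encoder group
    dec_sharing_group: [fr]    # one decoder group per target language
  train_en-es:
    src_tgt: en-es
    enc_sharing_group: [all]
    dec_sharing_group: [es]
A fully shared scheme would use a single group on both the encoder and decoder side, while a fully unshared scheme would give each language its own groups on both sides.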
Training Modular Systems¶
1. Setup:¶
To initiate the training process for MAMMOTH’s modular systems, start by setting up the necessary environment variables:
export MAMMOTH=/path/to/mammoth
export CONFIG=/path/to/configs/config.yaml
2. Training Command:¶
Execute the following command to commence training:
srun /path/to/wrapper.sh $MAMMOTH/train.py \
-config $CONFIG \
-master_ip $SLURMD_NODENAME \
-master_port 9969
For the wrapper script, a minimal example looks like the following:
#!/bin/bash
python -u "$@" --node_rank $SLURM_NODEID
This tutorial utilizes SLURM for job scheduling and parallel computing.
You can tailor the provided commands for your specific needs, adapting them to alternative job scheduling systems or standalone setups.
Ensure that the config.yaml file specifies the desired sharing scheme.
The training can also be run on a single GPU, in which case the wrapper script is not needed. In that case, you can train with the following command:
python -u $MAMMOTH/train.py -config $CONFIG
3. Inference Command:¶
After training, use the following command to test the model:
python3 -u $MAMMOTH/translate.py \
--config $CONFIG \
--model "$checkpoint" \
--task_id train_$sl-$tl \
--src $processed_data/$lp/$lp.$sl.sp \
--output $out_path/$sl-$tl.${base}hyp.sp \
--gpu 0 --shard_size 0 \
--batch_size 512
Remember to replace $checkpoint, $sl (source language), $tl (target language), $lp (language pair), $processed_data, and $out_path (as well as $base, which tags the output file name) with appropriate values.
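For example, for an English-French run the variables might be set along the following lines (all values are illustrative placeholders, not actual paths from this tutorial):
checkpoint=/path/to/models/checkpoint.pt   # hypothetical checkpoint path
sl=en
tl=fr
lp=en-fr
processed_data=/path/to/unpc
out_path=/path/to/outputs
base=""   # optional tag inserted into the output file name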
We provide a model checkpoint trained using the encoder shared scheme described above:
wget https://mammoth-share.a3s.fi/encoder-shared-models.tar.gz
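Extract the archive with:
tar -xzf encoder-shared-models.tar.gz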
Notes:¶
- Make sure to adapt the paths and variables to your specific directory structure.
- Adjust the --gpu flag in the testing command based on your GPU availability.
- Ensure that the configuration file (config.yaml) contains the correct sharing scheme for your experiment.
This tutorial serves as a general guide, and it is recommended to refer to the specific configuration file for additional details and customization options. Feel free to explore and adapt the commands to suit your specific training and testing requirements, regardless of the job scheduling system you choose to employ.