MAMMOTH Sharing Schemes¶
MAMMOTH is designed as a flexible modular system, allowing users to configure, train, and test various sharing schemes. This tutorial will guide you through the process of setting up and experimenting with different sharing schemes, including:
- fully shared
- fully unshared
- encoder shared
- decoder shared
The configuration for each scheme is managed through YAML files, ensuring a seamless and customizable experience.
Dataset¶
For this tutorial, we will use the UNPC dataset, which consists of manually translated UN documents spanning the years 1990 to 2014 in the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish.
Before diving into the sharing schemes, we need preprocessed data. You can download the already processed data using the following command:
wget https://mammoth-share.a3s.fi/unpc.tar
Additionally, we require the corresponding vocabularies for the dataset. Download the vocabularies with the following command:
wget https://mammoth-share.a3s.fi/vocab.tar.gz
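After downloading, extract both archives, for example:
tar -xf unpc.tar
tar -xzf vocab.tar.gz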
Now, let’s look at an overview of the sharing schemes to better understand how they work.
Sharing Schemes Overview¶
Each MAMMOTH sharing scheme differs in which modules are reused across tasks:
- fully shared: all languages use the same encoder and the same decoder.
- fully unshared: each language has its own encoder and decoder modules.
- encoder shared: one encoder is shared across all source languages, while each target language keeps its own decoder.
- decoder shared: one decoder is shared across all target languages, while each source language keeps its own encoder.
Which modules are shared is determined by the sharing groups assigned to each task in the configuration file.
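As a rough sketch of how this looks in practice (the per-task keys enc_sharing_group and dec_sharing_group are assumed here from MAMMOTH's task-level configuration; check the provided config files for the exact layout), an encoder shared setup could assign every task the same encoder group while giving each target language its own decoder group:
tasks:
  train_en-fr:
    src_tgt: en-fr
    enc_sharing_group: [all]   # every task reuses one shared encoder group
    dec_sharing_group: [fr]    # one decoder group per target language
  train_en-es:
    src_tgt: en-es
    enc_sharing_group: [all]
    dec_sharing_group: [es]
A fully shared scheme would use a single group on both the encoder and decoder side, while a fully unshared scheme would give each language its own groups on both sides.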
Training Modular Systems¶
1. Setup:¶
To initiate the training process for MAMMOTH’s modular systems, start by setting up the necessary environment variables:
export MAMMOTH=/path/to/mammoth
export CONFIG=/path/to/configs/config.yaml
2. Training Command:¶
Execute the following command to commence training:
srun /path/to/wrapper.sh $MAMMOTH/train.py \
-config $CONFIG \
-master_ip $SLURMD_NODENAME \
-master_port 9969
For the wrapper script, a minimal example looks like the following:
#!/bin/bash
python -u "$@" --node_rank $SLURM_NODEID
This tutorial utilizes SLURM for job scheduling and parallel computing.
You can tailor the provided commands for your specific needs, adapting them to alternative job scheduling systems or standalone setups.
Ensure that the config.yaml file specifies the desired sharing scheme.
The training can also be run on a single GPU, in which case the wrapper script is not needed. In that case, you can train with the following command:
python -u $MAMMOTH/train.py -config $CONFIG
3. Inference Command:¶
After training, use the following command to test the model:
python3 -u $MAMMOTH/translate.py \
--config $CONFIG \
--model "$checkpoint" \
--task_id train_$sl-$tl \
--src $processed_data/$lp/$lp.$sl.sp \
--output $out_path/$sl-$tl.${base}hyp.sp \
--gpu 0 --shard_size 0 \
--batch_size 512
Remember to replace $checkpoint, $sl (source language), $tl (target language), $lp (language pair), $processed_data, and $out_path (as well as $base, which tags the output file name) with appropriate values.
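For example, for an English-French run the variables might be set along the following lines (all values are illustrative placeholders, not actual paths from this tutorial):
checkpoint=/path/to/models/checkpoint.pt   # hypothetical checkpoint path
sl=en
tl=fr
lp=en-fr
processed_data=/path/to/unpc
out_path=/path/to/outputs
base=""   # optional tag inserted into the output file name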
We provide a model checkpoint trained using the encoder shared scheme described above:
wget https://mammoth-share.a3s.fi/encoder-shared-models.tar.gz
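Extract the archive with:
tar -xzf encoder-shared-models.tar.gz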
Notes:¶
- Make sure to adapt the paths and variables to your specific directory structure.
- Adjust the --gpu flag in the testing command based on your GPU availability.
- Ensure that the configuration file (config.yaml) contains the correct sharing scheme for your experiment.
This tutorial serves as a general guide, and it is recommended to refer to the specific configuration file for additional details and customization options. Feel free to explore and adapt the commands to suit your specific training and testing requirements, regardless of the job scheduling system you choose to employ.