Prepare Data

Before running these scripts, make sure that you have installed MAMMOTH, which brings in the dependencies required below.

Quickstart: Europarl

In the Quickstart tutorial, we assume that you will download and preprocess the Europarl data by following the steps below.

Step 1: Download the data

The Europarl parallel corpus is a multilingual resource extracted from European Parliament proceedings and contains texts in 21 European languages. Download Release v7 (a further expanded and improved version of the corpus, released on 15 May 2012) from the original website, or download the data preprocessed by us:

wget https://mammoth101.a3s.fi/europarl.tar.gz
mkdir europarl_data
tar -xvzf europarl.tar.gz -C europarl_data

Note that the extracted dataset takes around 30 GB of disk space. Alternatively, you can download only the data for the three example languages (666 MB):

wget https://mammoth101.a3s.fi/europarl-3langs.tar.gz
mkdir europarl_data
tar -xvzf europarl-3langs.tar.gz -C europarl_data

We use a SentencePiece tokenizer trained on OPUS Tatoeba Challenge data with 64k vocabulary size. Download the SentencePiece model and the vocabulary:

# Download the SentencePiece model
wget https://mammoth101.a3s.fi/opusTC.mul.64k.spm
# Download the vocabulary
wget https://mammoth101.a3s.fi/opusTC.mul.vocab.onmt

mkdir vocab
mv opusTC.mul.64k.spm vocab/.
mv opusTC.mul.vocab.onmt vocab/.
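
To make sure the download worked, you can load the model with the sentencepiece Python package and encode a sample sentence. This is a minimal sketch assuming the vocab/ layout created by the commands above; the sample sentence is arbitrary.

import sentencepiece as sp

# load the downloaded model (path as created by the commands above)
spm = sp.SentencePieceProcessor(model_file='vocab/opusTC.mul.64k.spm')

print(spm.get_piece_size())                       # vocabulary size, expected to be around 64k
print(spm.encode('Hello, world!', out_type=str))  # SentencePiece pieces for a sample sentence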

If you would like to create and use a custom SentencePiece tokenizer, take a look at the OPUS tutorial below.

Step 2: Tokenization

Next, we read the parallel text data, process it, and generate output files for the training and validation sets. Here is a high-level summary of the main processing steps. For each language in langs, the script will:

  • read the parallel data files,

  • clean the data by removing empty lines,

  • shuffle the data randomly,

  • tokenize the text using SentencePiece and write the tokenized data to separate output files for the training and validation sets.

You can skip this step if you downloaded the preprocessed data directly.

import random
import pathlib

import tqdm
import sentencepiece as sp

langs = ["bg", "cs"]

sp_path = 'vocab/opusTC.mul.64k.spm'
spm = sp.SentencePieceProcessor(model_file=sp_path)

input_dir = 'europarl_data/europarl'
output_dir = 'europarl_data/encoded'

for lang in tqdm.tqdm(langs):
    en_side_in = f'{input_dir}/{lang}-en/europarl-v7.{lang}-en.en'
    xx_side_in = f'{input_dir}/{lang}-en/europarl-v7.{lang}-en.{lang}'
    with open(xx_side_in) as xx_stream, open(en_side_in) as en_stream:
        data = zip(map(str.strip, xx_stream), map(str.strip, en_stream))
        data = [(xx, en) for xx, en in tqdm.tqdm(data, leave=False, desc=f'read {lang}') if xx and en] # drop empty lines
        random.shuffle(data)
    pathlib.Path(output_dir).mkdir(exist_ok=True) 
    en_side_out = f'{output_dir}/valid.{lang}-en.en.sp'
    xx_side_out = f'{output_dir}/valid.{lang}-en.{lang}.sp'
    with open(xx_side_out, 'w') as xx_stream, open(en_side_out, 'w') as en_stream:
        for xx, en in tqdm.tqdm(data[:1000], leave=False, desc=f'valid {lang}'):
            print(*spm.encode(xx, out_type=str), file=xx_stream)
            print(*spm.encode(en, out_type=str), file=en_stream)
    en_side_out = f'{output_dir}/train.{lang}-en.en.sp'
    xx_side_out = f'{output_dir}/train.{lang}-en.{lang}.sp'
    with open(xx_side_out, 'w') as xx_stream, open(en_side_out, 'w') as en_stream:
        for xx, en in tqdm.tqdm(data[1000:], leave=False, desc=f'train {lang}'):
            print(*spm.encode(xx, out_type=str), file=xx_stream)
            print(*spm.encode(en, out_type=str), file=en_stream)

The script will produce encoded datasets in europarl_data/encoded that you can then use for training.
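
As an optional sanity check, you can decode the first line of one of the generated files back into plain text. This sketch assumes the bg-en files produced by the script above and the SentencePiece model downloaded earlier.

import sentencepiece as sp

spm = sp.SentencePieceProcessor(model_file='vocab/opusTC.mul.64k.spm')

# read the first tokenized line and detokenize it again
with open('europarl_data/encoded/train.bg-en.en.sp') as f:
    pieces = f.readline().split()

print(pieces[:10])         # the first few SentencePiece pieces
print(spm.decode(pieces))  # the reconstructed plain-text sentence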

UNPC

UNPC consists of manually translated UN documents covering 25 years (1990 to 2014) in the six official UN languages: Arabic, Chinese, English, French, Russian, and Spanish. We have preprocessed the data; you can download the processed version with:

wget https://mammoth-share.a3s.fi/unpc.tar

Alternatively, you can use the scripts provided in the tarball to process the data yourself.
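
If you prefer to stay in Python, the archive can also be unpacked with the standard library. A minimal sketch, assuming unpc.tar was downloaded to the current directory and using unpc_data as an arbitrary target directory:

import tarfile

# unpack the downloaded archive into a dedicated directory and peek at a few entries
with tarfile.open('unpc.tar') as tar:
    tar.extractall('unpc_data')
    print(tar.getnames()[:5])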

If you use this corpus, please cite: Ziemski, M., Junczys-Dowmunt, M., and Pouliquen, B. (2016). The United Nations Parallel Corpus. Language Resources and Evaluation (LREC'16), Portorož, Slovenia, May 2016.

OPUS 100

In this tutorial, we will also create a custom SentencePiece tokenizer.

To do that, you will need to build SentencePiece from source in your environment (the pip package alone is not sufficient, since the scripts below call the compiled command-line tools). Follow the build instructions on the SentencePiece GitHub page.

After that, download the OPUS-100 dataset from OPUS 100, as shown in Step 1 below.

Step 1: Set relevant paths, variables and download

SP_PATH=your/sentencepiece/path/build/src
DATA_PATH=your/path/to/save/dataset
# Download the default datasets into the $DATA_PATH; mkdir if it doesn't exist
mkdir -p $DATA_PATH

CUR_DIR=$(pwd)

# candidate vocabulary sizes (largest first, smaller ones as fallbacks) and the sentence sampling size
vocab_sizes=(32000 16000 8000 4000 2000 1000)
input_sentence_size=10000000

cd $DATA_PATH
echo "Downloading and extracting Opus100"
wget -q --trust-server-names https://object.pouta.csc.fi/OPUS-100/v1.0/opus-100-corpus-v1.0.tar.gz
tar -xzvf opus-100-corpus-v1.0.tar.gz
cd $CUR_DIR

language_pairs=( $( ls $DATA_PATH/opus-100-corpus/v1.0/supervised/ ) )

Step 2: Train SentencePiece models and get vocabs

From this point on, the original files are assumed to be in $DATA_PATH.

echo "$0: Training SentencePiece models"
rm -f $DATA_PATH/train.txt
rm -f $DATA_PATH/train.en.txt
for lp in "${language_pairs[@]}"
do
IFS=- read sl tl <<< $lp
if [[ $sl = "en" ]]
then
other_lang=$tl
else
other_lang=$sl
fi
# train a SentencePiece model on the non-English side of the pair
sort -u $DATA_PATH/opus-100-corpus/v1.0/supervised/$lp/opus.$lp-train.$other_lang | shuf > $DATA_PATH/train.txt
# try the vocab sizes from largest to smallest until training succeeds
for vocab_size in "${vocab_sizes[@]}"
do
echo "Training SentencePiece model for $other_lang with vocab size $vocab_size"
cd $SP_PATH
./spm_train --input=$DATA_PATH/train.txt \
            --model_prefix=$DATA_PATH/opus.$other_lang \
            --vocab_size=$vocab_size --character_coverage=0.98 \
            --input_sentence_size=$input_sentence_size --shuffle_input_sentence=true # to use a subset of sentences sampled from the entire training set
cd $CUR_DIR
if [ -f "$DATA_PATH"/opus."$other_lang".vocab ]
then
    # get vocab in onmt format
    cut -f 1 "$DATA_PATH"/opus."$other_lang".vocab > "$DATA_PATH"/opus."$other_lang".vocab.onmt
    break
fi
done
rm $DATA_PATH/train.txt
# append the English data to a single file for training the English model
cat $DATA_PATH/opus-100-corpus/v1.0/supervised/$lp/opus.$lp-train.en >> $DATA_PATH/train.en.txt
done

# train the SentencePiece model for English
echo "Training SentencePiece model for en"
sort -u $DATA_PATH/train.en.txt | shuf -n $input_sentence_size > $DATA_PATH/train.txt
rm $DATA_PATH/train.en.txt
cd $SP_PATH
./spm_train --input=$DATA_PATH/train.txt --model_prefix=$DATA_PATH/opus.en \
            --vocab_size=${vocab_sizes[0]} --character_coverage=0.98 \
            --input_sentence_size=$input_sentence_size --shuffle_input_sentence=true # to use a subset of sentences sampled from the entire training set
# Other options to consider:
# --max_sentence_length=  # to set max length when filtering sentences
# --train_extremely_large_corpus=true
cd $CUR_DIR
rm $DATA_PATH/train.txt
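
Before moving on to Step 3, it can be useful to check that every language now has a trained model alongside its onmt-format vocabulary. A minimal sketch in Python, using the same placeholder directory as $DATA_PATH above (substitute your actual path):

from pathlib import Path

data_path = Path('your/path/to/save/dataset')  # same directory as $DATA_PATH above

# each language should have opus.<lang>.model, opus.<lang>.vocab and opus.<lang>.vocab.onmt
for vocab_file in sorted(data_path.glob('opus.*.vocab.onmt')):
    lang = vocab_file.name.split('.')[1]
    model_file = data_path / f'opus.{lang}.model'
    status = 'ok' if model_file.exists() else 'MISSING MODEL'
    print(f'{lang}: {status}')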

Step 3: Parse train, valid and test sets for supervised translation directions

mkdir -p $DATA_PATH/supervised
for lp in "${language_pairs[@]}"
do
mkdir -p $DATA_PATH/supervised/$lp
IFS=- read sl tl <<< $lp

echo "$lp: parsing train data"
dir=$DATA_PATH/opus-100-corpus/v1.0/supervised
cd $SP_PATH
./spm_encode --model=$DATA_PATH/opus.$sl.model \
                < $dir/$lp/opus.$lp-train.$sl \
                > $DATA_PATH/supervised/$lp/opus.$lp-train.$sl.sp
./spm_encode --model=$DATA_PATH/opus.$tl.model \
                < $dir/$lp/opus.$lp-train.$tl \
                > $DATA_PATH/supervised/$lp/opus.$lp-train.$tl.sp
cd $CUR_DIR

if [ -f $dir/$lp/opus.$lp-dev.$sl ]
then
    echo "$lp: parsing dev data"
    cd $SP_PATH
    ./spm_encode --model=$DATA_PATH/opus.$sl.model \
                < $dir/$lp/opus.$lp-dev.$sl \
                > $DATA_PATH/supervised/$lp/opus.$lp-dev.$sl.sp
    ./spm_encode --model=$DATA_PATH/opus.$tl.model \
                < $dir/$lp/opus.$lp-dev.$tl \
                > $DATA_PATH/supervised/$lp/opus.$lp-dev.$tl.sp
    cd $CUR_DIR
else
    echo "$lp: dev data not found"
fi

if [ -f $dir/$lp/opus.$lp-test.$sl ]
then
    echo "$lp: parsing test data"
    cd $SP_PATH
    ./spm_encode --model=$DATA_PATH/opus.$sl.model \
                < $dir/$lp/opus.$lp-test.$sl \
                > $DATA_PATH/supervised/$lp/opus.$lp-test.$sl.sp
    ./spm_encode --model=$DATA_PATH/opus.$tl.model \
                < $dir/$lp/opus.$lp-test.$tl \
                > $DATA_PATH/supervised/$lp/opus.$lp-test.$tl.sp
    cd $CUR_DIR
else
    echo "$lp: test data not found"
fi
done

Step 4: Parse the test sets for zero-shot translation directions

mkdir -p $DATA_PATH/zero-shot
for dir in $DATA_PATH/opus-100-corpus/v1.0/zero-shot/*
do
lp=$(basename $dir)  # get name of dir from full path
mkdir -p $DATA_PATH/zero-shot/$lp
echo "$lp: parsing zero-shot test data"
IFS=- read sl tl <<< $lp
cd $SP_PATH
./spm_encode --model=$DATA_PATH/opus.$sl.model \
            < $dir/opus.$lp-test.$sl \
            > $DATA_PATH/zero-shot/$lp/opus.$lp-test.$sl.sp
./spm_encode --model=$DATA_PATH/opus.$tl.model \
            < $dir/opus.$lp-test.$tl \
            > $DATA_PATH/zero-shot/$lp/opus.$lp-test.$tl.sp
cd $CUR_DIR
done