Config-config Tool

A meta-configuration tool, or config generator, for MAMMOTH.

The MAMMOTH configuration options have become unwieldy as complexity has increased. In particular, the introduction of LayerStacks (the mechanism for dividing encoders and decoders into several subcomponents with different parameter sharing) and Adapters has made specifying parameters on the command line a total nightmare, and even writing yaml configs by hand is cumbersome. Other functionality that is cumbersome to specify by hand includes:

  • Node and GPU assignments for massively multilingual models,

  • Weights and curricula (starting step for specific tasks),

  • Parameter sharing groups based on language similarity.

To ease the creation of configs, the config-config tool reads in a human-writable configuration template, and computes the specific values expected by OpenNMT, writing them out as a less-readable yaml file.

Command

python3 mammoth/tools/config_config.py config_all --in_config input.yaml --out_config output.yaml

Inputs

The primary input is a yaml file, which contains two types of parameters:

  • Pass-through parameters: values copied as-is from the input into the output (the output being an OpenNMT yaml config file).

  • Meta-parameters, defining what config-config should do. These are contained under the config_config key.

Input yaml

See the example in mammoth/examples/config_config.yaml

The meta-parameters under the config_config key:

src_path and tgt_path

Path templates for source and target corpora, respectively. The path templates can contain the following variables that will be substituted by config_config:

  • Directional corpus mode

    • {src_lang}: The source language of the task

    • {tgt_lang}: The target language of the task

    • {lang_pair}: {src_lang}-{tgt_lang} for convenience

  • Symmetric corpus mode

    • {lang_a}: The alphabetically first language

    • {lang_b}: The alphabetically second language

    • {side_a}: ‘src’ if the language pair is used in the “forward” direction, otherwise ‘trg’. Tatoeba uses ‘trg’, not ‘tgt’. Deal with it.

    • {side_b}: ‘trg’ if the language pair is used in the “forward” direction, otherwise ‘src’.

    • {sorted_pair}: the source and target languages in alphabetical order, separated by a hyphen.

So for example, let’s say your corpus contains the files ben-eng/train.src.gz (Bengali side) and ben-eng/train.trg.gz (English side). You want to use the data symmetrically, for both the ben-to-eng and eng-to-ben directions. For the first, {lang_pair} and {sorted_pair} are the same. For the second, {lang_pair} is “eng-ben”, but {sorted_pair} is still “ben-eng”. In order to use the files in the correct order, you should use the template {sorted_pair}/train.{side_a}.gz for the source template, and {sorted_pair}/train.{side_b}.gz for the target template.
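Expressed in the input yaml, that is (a minimal sketch; the quotes keep yaml from parsing the leading brace as a flow mapping):

config_config:
  src_path: "{sorted_pair}/train.{side_a}.gz"
  tgt_path: "{sorted_pair}/train.{side_b}.gz"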

ae_path

Path templates for monolingual data for autoencoder tasks. The same data will be used as both the source and target for the task: noise is introduced using transforms. The path templates can contain the following variables that will be substituted by config_config: {src_lang}, {tgt_lang}, and {sorted_pair}. If unset, autoencoder pairs will use src_path and tgt_path instead.

autoencoder

If set to True, autoencoder tasks will be added.
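A minimal sketch combining ae_path with the autoencoder flag (the mono/ directory layout is illustrative, not something the tool requires):

config_config:
  # add monolingual autoencoder tasks
  autoencoder: True
  # hypothetical layout: one monolingual file per language
  ae_path: "mono/{src_lang}/train.txt.gz"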

distance_matrix

Path to the distance matrix comma-separated value (csv) file.

n_groups

The number of language groups to create when clustering.

use_weight

If set to True, use corpus weights based on temperature-adjusted corpus size.

Note that the actual weight of a task is relative to the weights of the other tasks assigned to the same GPU. E.g. if only one task is assigned to a GPU, it will receive 100% weight regardless of the computed weight.

use_introduce_at_training_step

If set to True, use a curriculum introducing corpora based on temperature-adjusted corpus size.

Note that if both use_weight and use_introduce_at_training_step are specified, the computed weight is divided between the two mechanisms by taking its square root, so that when both are applied (multiplicatively), the desired weight is achieved.
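For example, suppose the temperature-adjusted weight of a task works out to 0.25: with both flags set, the task receives a weight of sqrt(0.25) = 0.5, and its starting step is chosen so that it trains for roughly half of the run, so the two factors multiply back to the intended 0.25. (A worked sketch of the principle, not the exact formula used by the tool.)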

Note that high-resource language pairs (those that would train for over 75% of the training time anyway) all start at step 0. This avoids beginning the run with only one GPU doing work while the other GPUs sit idle, waiting for their language pairs to start.

use_src_lang_token

Only has an effect when using the prefix transform. Normally, the prefix transform only includes a target language selector token: <to_yyy>, where yyy is the code of the target language. If this flag is set, the source language is also specified, e.g. <from_xxx> <to_yyy>.

translation_config_dir

The directory in which to generate translation configs. One config per language pair will be generated. Configs are generated only for supervised pairs, unless zero_shot is True.

zero_shot

If set to True, also generate translation configs for zero-shot directions.
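A minimal sketch (the directory name is illustrative):

config_config:
  translation_config_dir: translation_configs
  zero_shot: True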

transforms and ae_transforms

A list of transforms, for translation tasks and autoencoder tasks, respectively. Use this to apply subword segmentation (e.g. using sentencepiece) and denoising for the autoencoder tasks. Both of these may change the sequence length, necessitating a filtertoolong transform.
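For example (a sketch; it assumes the sentencepiece, denoising, and filtertoolong transforms are the ones you want):

config_config:
  transforms:
    - sentencepiece
    - filtertoolong
  ae_transforms:
    - sentencepiece
    - denoising
    - filtertoolong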

enc_sharing_groups and dec_sharing_groups

A list of parameter sharing patterns, one for each LayerStack in the (enc|dec)oder. Each list element takes one of 7 values:

  • FULL: fully shared parameters. Will be named using the constant “full”.

  • SRC_GROUP: groupwise shared parameters. Will be named according to the cluster id of the source language.

  • TGT_GROUP: groupwise shared parameters. Will be named according to the cluster id of the target language.

  • GROUP: groupwise shared parameters. Same as SRC_GROUP for encoder and TGT_GROUP for decoder.

  • SRC_LANGUAGE: language specific parameters. Will be named according to the source language code.

  • TGT_LANGUAGE: language specific parameters. Will be named according to the target language code.

  • LANGUAGE: language specific parameters. Same as SRC_LANGUAGE for encoder and TGT_LANGUAGE for decoder.

Note that it is possible to have target-language-dependent components in the encoder, by using TGT_LANGUAGE or TGT_GROUP in the enc_sharing_groups.
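For example, an encoder with two LayerStacks (group-specific layers at the bottom, fully shared layers on top) and a decoder with a single language-specific LayerStack could be configured as follows (a sketch; the list length must match the number of LayerStacks):

config_config:
  enc_sharing_groups:
    - GROUP
    - FULL
  dec_sharing_groups:
    - LANGUAGE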

n_nodes and n_gpus_per_node

The number of nodes and GPUs per node, for the assignment of tasks to devices. Note that you also need to specify this information separately to Slurm.
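E.g. for two nodes with four GPUs each:

config_config:
  n_nodes: 2
  n_gpus_per_node: 4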

Top-level keys other than config_config

Parameter sharing in adapters

The key adapters.encoder.{adapter_name}.ids takes one of 3 values:

  • FULL: fully shared parameters. Will be named using the constant “full”.

  • GROUP: groupwise shared parameters. Will be named according to the cluster id.

  • LANGUAGE: language specific parameters. Will be named according to the language code.

(Adapters do not currently support the SRC_ and TGT_ prefixes)
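A sketch (the adapter names are illustrative, and the remaining settings an adapter definition needs are omitted here):

adapters:
  encoder:
    my_enc_adapter:
      ids: GROUP
  decoder:
    my_dec_adapter:
      ids: LANGUAGE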

Distance matrix

See the example distance matrix in mammoth/examples/config_config.distance.csv.

The distance matrix is given as a csv file, with a lang column followed by one column per language. There should be one row per language, with the language code in the first (lang) column, followed by floats giving the distance to the language named by each column. The rows must appear in the same order as the columns, making the matrix square; because distances are symmetric, the matrix must also be symmetric. The upper and lower triangles are thus redundant, but both must be given.

Note that the values are distances: the distance of a language to itself (the diagonal) should be 0.
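A three-language sketch with made-up distances:

lang,ben,eng,hin
ben,0.0,0.8,0.7
eng,0.8,0.0,0.4
hin,0.7,0.4,0.0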

Alternative: specify language groups manually

To specify groups manually instead of using clustering, you must do two things:

  1. Leave distance_matrix unset.

  2. Specify a mapping of languages to groups: config_config.groups.{lang}: {group} (see the example below).
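For example (the group names are free-form labels):

config_config:
  groups:
    eng: germanic
    deu: germanic
    ben: indic
    hin: indic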

The actual corpora

The actual corpus files are used in two ways:

  • The presence of the files for a language pair determines whether it is included or not.

  • Line counts from the files are used for weighting.

Because of this, you need to run config_config on a machine where it can access the corpora via the specified src_path and tgt_path.

Usage

python3 mammoth/tools/config_config.py config_all --in_config path/to/input.yaml --out_config path/to/output.yaml

Stages

The tool runs in multiple stages. The meta-stage config_all runs all of the stages in order.

It is possible to run the stages individually, by feeding the output of the previous stage in as the input of the next stage (see the example after this list). This allows more control:

  • Skipping unnecessary steps.

  • Overriding what a particular step does by specifying its output manually.
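For example, to run only the first two stages by hand (a sketch, assuming each stage name works as a subcommand taking the same --in_config and --out_config flags as config_all):

python3 mammoth/tools/config_config.py complete_language_pairs --in_config input.yaml --out_config step1.yaml
python3 mammoth/tools/config_config.py corpora_schedule --in_config step1.yaml --out_config step2.yaml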

complete_language_pairs

Determine which language pairs have data. The languages to consider as candidates are determined from the vocabulary keys.

corpora_schedule

Determine weighting and curriculum for the tasks.

cluster_languages

Determine language groups by clustering. This step can be easily skipped by leaving the distance_matrix unset. If the step is skipped, you should define the config_config.groups dict in the input yaml.

sharing_groups

Apply the parameter sharing groups to tasks.

set_transforms

Apply the transforms to tasks.

allocate_devices

Allocate tasks to nodes and GPUs. A local search procedure is used, taking into account parameter sharing groups and tasks delayed by the curriculum.

adapter_config

Determine the adapter configuration.

translation_configs

Generate the translation yaml configs.

remove_temporary_keys

Remove any meta-parameters that are not accepted by OpenNMT. This should always be the last step.

config_all

Meta-stage to run all of the stages in order.

Command line overrides

Some parameters can also be given on the command line. If a value is given both in the input yaml and on the command line, the command line takes precedence.