Config-config Tool¶
A meta-configuration tool, or config generator, for MAMMOTH.
The MAMMOTH configuration options have become unwieldy as complexity has increased. In particular, the introduction of LayerStacks (the mechanism for dividing encoders and decoders into several subcomponents with different parameter sharing) and Adapters has turned specifying parameters on the command line into a total nightmare, and even writing yaml configs by hand is cumbersome. Other functionality that is cumbersome to specify by hand includes:
- Node and GPU assignments for massively multilingual models,
- Weights and curricula (the starting step for specific tasks),
- Parameter sharing groups based on language similarity.
To ease the creation of configs, the config-config tool reads in a human-writable configuration template, and computes the specific values expected by OpenNMT, writing them out as a less-readable yaml file.
Command¶
python3 mammoth/tools/config_config.py config_all --in_config input.yaml --out_config output.yaml
Inputs¶
The primary input is a yaml file, which contains two types of parameters:
- Passed-through parameters: values copied from the input into the output, i.e. an OpenNMT yaml config file.
- Meta-parameters, defining what config-config should do. These are contained under the config_config key.
Input yaml¶
See the example in mammoth/examples/config_config.yaml.
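As a rough sketch of the overall shape (the values shown here are hypothetical placeholders, not defaults; see the example file for real settings):

```yaml
# Passed-through parameters: copied unchanged into the generated OpenNMT config.
save_model: models/my_model   # hypothetical value

# Meta-parameters: interpreted by config-config itself.
config_config:
  autoencoder: True
  n_nodes: 2                  # hypothetical cluster size
  n_gpus_per_node: 4
```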
The meta-parameters under the config_config key:
src_path and tgt_path¶
Path templates for source and target corpora, respectively.
The path templates can contain the following variables that will be substituted by config_config:
Directional corpus mode:
- {src_lang}: The source language of the task
- {tgt_lang}: The target language of the task
- {lang_pair}: {src_lang}-{tgt_lang}, for convenience
Symmetric corpus mode:
- {lang_a}: The alphabetically first language
- {lang_b}: The alphabetically second language
- {side_a}: ‘src’ if the language pair is used in the “forward” direction, otherwise ‘trg’. Tatoeba uses ‘trg’, not ‘tgt’. Deal with it.
- {side_b}: ‘trg’ if the language pair is used in the “forward” direction, otherwise ‘src’.
- {sorted_pair}: The source and target languages in alphabetical order, separated by a hyphen.
So for example, let’s say your corpus contains the files ben-eng/train.src.gz (Bengali side) and ben-eng/train.trg.gz (English side). You want to use the data symmetrically for both ben-to-eng and eng-to-ben directions. For the first, {lang_pair} and {sorted_pair} are the same. For the second, {lang_pair} is “eng-ben”, but {sorted_pair} is “ben-eng”. In order to use the files in the correct order, you should use {sorted_pair}/train.{side_a}.gz as the source template and {sorted_pair}/train.{side_b}.gz as the target template.
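Continuing the example, the corresponding path templates could be written as follows (the data/ prefix is hypothetical):

```yaml
config_config:
  src_path: data/{sorted_pair}/train.{side_a}.gz
  tgt_path: data/{sorted_pair}/train.{side_b}.gz
```

For the eng-to-ben task these resolve to data/ben-eng/train.trg.gz as the source and data/ben-eng/train.src.gz as the target.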
ae_path¶
Path templates for monolingual data for autoencoder tasks.
The same data will be used as both the source and target for the task: noise is introduced using transforms.
The path templates can contain the following variables that will be substituted by config_config: {src_lang}, {tgt_lang}, and {sorted_pair}.
If unset, autoencoder pairs will use src_path and tgt_path instead.
autoencoder¶
If set to True, autoencoder tasks will be added.
distance_matrix¶
Path to the distance matrix comma-separated values (csv) file.
n_groups¶
The number of language groups to create when clustering.
use_weight¶
If set to True, use corpus weights based on temperature-adjusted corpus size.
Note that the actual weight is relative to the weights of the other tasks assigned to the same GPU. E.g. if only one task is assigned to a GPU, it will receive 100% weight regardless of the computed weight.
use_introduce_at_training_step¶
If set to True, use a curriculum introducing corpora based on temperature-adjusted corpus size.
Note that if both use_weight and use_introduce_at_training_step are specified, the weight is distributed between the two mechanisms according to the square root, so that when both of them are applied (multiplicatively), the desired weight is achieved. E.g. a desired weight of 0.25 becomes a corpus weight of sqrt(0.25) = 0.5, combined with a starting step that lets the task train for roughly half of the training time.
Note that high-resource language pairs (those that would train for over 75% of the training time) all start at step 0. This avoids starting training with only one GPU doing work, while the other GPUs are idle waiting for their language pairs to start.
use_src_lang_token¶
Only has an effect when using the prefix transform.
Normally, the prefix transform only includes a target language selector token <to_yyy>, where yyy is the code of the target language. If this flag is set, the source language is also specified, e.g. <from_xxx> <to_yyy>.
translation_config_dir¶
The directory in which to generate translation configs.
One config per language pair will be generated.
Only supervised pairs are generated, unless zero_shot is True.
zero_shot¶
If set to True, translation configs are also generated for zero-shot directions.
transforms and ae_transforms¶
A list of transforms, for translation tasks and autoencoder tasks, respectively. Use these to apply subword segmentation (e.g. using sentencepiece) and, for the autoencoder, denoising noise. Both of these may change the sequence length, necessitating a filtertoolong transform.
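For example, a sketch combining the transforms mentioned above (any transform-specific settings, such as the filtertoolong length limits, are omitted here):

```yaml
config_config:
  transforms:
    - sentencepiece
    - filtertoolong
  ae_transforms:
    - sentencepiece
    - denoising
    - filtertoolong
```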
enc_sharing_groups and dec_sharing_groups¶
A list of parameter sharing patterns, one for each LayerStack in the (enc|dec)oder. Each list element takes one of 7 values:
- FULL: fully shared parameters. Will be named using the constant “full”.
- SRC_GROUP: groupwise shared parameters. Will be named according to the cluster id of the source language.
- TGT_GROUP: groupwise shared parameters. Will be named according to the cluster id of the target language.
- GROUP: groupwise shared parameters. Same as SRC_GROUP for the encoder and TGT_GROUP for the decoder.
- SRC_LANGUAGE: language specific parameters. Will be named according to the source language code.
- TGT_LANGUAGE: language specific parameters. Will be named according to the target language code.
- LANGUAGE: language specific parameters. Same as SRC_LANGUAGE for the encoder and TGT_LANGUAGE for the decoder.
Note that it is possible to have target-language-dependent components in the encoder, by using TGT_LANGUAGE or TGT_GROUP in the enc_sharing_groups.
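For example, an encoder split into two LayerStacks, with group-wise shared lower layers and fully shared upper layers, paired with a language-specific single-LayerStack decoder, might be specified like this (a sketch; the sizes of the LayerStacks themselves are configured elsewhere):

```yaml
config_config:
  enc_sharing_groups:
    - GROUP
    - FULL
  dec_sharing_groups:
    - LANGUAGE
```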
n_nodes and n_gpus_per_node¶
The number of nodes and GPUs, for assignment of tasks to devices. Note that you also need to specify this information separately to Slurm.
Top-level keys other than config_config¶
Parameter sharing in adapters¶
The key adapters.encoder.{adapter_name}.ids takes one of 3 values:
- FULL: fully shared parameters. Will be named using the constant “full”.
- GROUP: groupwise shared parameters. Will be named according to the cluster id.
- LANGUAGE: language specific parameters. Will be named according to the language code.
(Adapters do not currently support the SRC_ and TGT_ prefixes.)
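A sketch, assuming an encoder adapter named bottleneck (the adapter name is hypothetical, and other adapter hyperparameters are omitted; see the example config for the full structure):

```yaml
adapters:
  encoder:
    bottleneck:
      ids: LANGUAGE
      # ... other adapter hyperparameters ...
```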
Distance matrix¶
See the example distance matrix in mammoth/examples/config_config.distance.csv.
The distance matrix is given as a csv file, with a column lang and one column per language. There should be one row per language, with the language code in the first lang column, followed by floats giving the distance to the language specified by each column.
The rows should appear in the same order as the columns.
This means that the matrix must be square and symmetrical.
The upper and lower triangles are redundant, but both must be given.
Note that the values are distances: the distance of a language to itself (the diagonal) should be 0.
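A minimal sketch with three languages (the distance values are made up for illustration):

```
lang,ben,eng,hin
ben,0.0,0.9,0.5
eng,0.9,0.0,0.8
hin,0.5,0.8,0.0
```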
Alternative: specify language groups manually¶
To specify groups manually instead of using clustering, you must do two things:
1. Leave distance_matrix unset.
2. Specify a mapping of languages to groups: config_config.groups.{lang}: {group}.
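For example (the language codes and group labels here are hypothetical):

```yaml
config_config:
  groups:
    ben: indic
    hin: indic
    eng: germanic
```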
The actual corpora¶
The actual corpus files are used in two ways:
- The presence of the files for a language pair determines whether it is included or not.
- Line counts from the files are used for weighting.
Because of this, you need to run config_config so that it can access the corpora using the specified src_path and tgt_path.
Usage¶
python3 mammoth/tools/config_config.py config_all --in_config path/to/input.yaml --out_config path/to/output.yaml
Stages¶
The tool runs in multiple stages. The meta-stage config_all runs all of the stages in order.
It is possible to run the steps individually, by feeding in the output of the previous step as the input of the next step. This allows more control:
- Skipping unnecessary steps.
- Overriding what a particular step does by specifying its output manually.
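For example, assuming each stage name can be passed as the positional argument in the same way as config_all, the first two stages could be chained like this (file names are hypothetical):

```
python3 mammoth/tools/config_config.py complete_language_pairs --in_config input.yaml --out_config step1.yaml
python3 mammoth/tools/config_config.py corpora_schedule --in_config step1.yaml --out_config step2.yaml
```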
complete_language_pairs¶
Determines which language pairs have data. The languages to consider as candidates are determined from the vocabulary keys.
corpora_schedule¶
Determine weighting and curriculum for the tasks.
cluster_languages¶
Determine language groups by clustering.
This step can be easily skipped by leaving distance_matrix unset. If the step is skipped, you should define the config_config.groups dict in the input yaml.
sharing_groups¶
Apply the parameter sharing groups to tasks.
set_transforms¶
Apply the transforms to tasks.
allocate_devices¶
Allocate tasks to nodes and GPUs. A local search procedure is used, taking into account parameter sharing groups and tasks delayed by curriculum weighting.
adapter_config¶
Determine the adapter configuration.
translation_configs¶
Generate the translation yaml configs.
remove_temporary_keys¶
Remove any meta-parameters that are not accepted by OpenNMT. This should always be the last step.
config_all¶
Meta-stage to run all of the stages in order.
Command line overrides¶
Some parameters can also be given on the command line. If a value is given both in the input yaml and on the command line, the command line takes precedence.