Config-config Tool¶
A meta-configuration tool, or config generator, for MAMMOTH.
The MAMMOTH configuration options have become unwieldy as complexity has increased. In particular, the introduction of LayerStacks (the mechanism for dividing encoders and decoders into several subcomponents with different parameter sharing) and Adapters has turned specifying parameters on the command line into a total nightmare, and even writing yaml configs by hand is cumbersome. Other functionality that is cumbersome to specify by hand includes:
- Node and GPU assignments for massively multilingual models,
- Weights and curricula (the starting step for specific tasks),
- Parameter sharing groups based on language similarity.
To ease the creation of configs, the config-config tool reads in a human-writable configuration template, and computes the specific values expected by OpenNMT, writing them out as a less-readable yaml file.
Command¶
python3 mammoth/tools/config_config.py config_all --in_config input.yaml --out_config output.yaml
Inputs¶
The primary input is a yaml file, which contains two types of parameters:
- Passed-through parameters: values copied from the input into the output, i.e. an OpenNMT yaml config file.
- Meta-parameters, defining what config-config should do. These are contained under the config_config key.
Input yaml¶
See the example in mammoth/examples/config_config.yaml.
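As a rough sketch of the overall shape (the values shown here are hypothetical placeholders, not defaults; see the example file for real settings):

```yaml
# Passed-through parameters: copied unchanged into the generated OpenNMT config.
save_model: models/my_model   # hypothetical value

# Meta-parameters: interpreted by config-config itself.
config_config:
  autoencoder: True
  n_nodes: 2                  # hypothetical cluster size
  n_gpus_per_node: 4
```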
The meta-parameters under the config_config key:
src_path and tgt_path¶
Path templates for source and target corpora, respectively.
The path templates can contain the following variables that will be substituted by config_config:
Directional corpus mode:
- {src_lang}: The source language of the task
- {tgt_lang}: The target language of the task
- {lang_pair}: {src_lang}-{tgt_lang}, for convenience
Symmetric corpus mode:
- {lang_a}: The alphabetically first language
- {lang_b}: The alphabetically second language
- {side_a}: ‘src’ if the language pair is used in the “forward” direction, otherwise ‘trg’. Tatoeba uses ‘trg’, not ‘tgt’. Deal with it.
- {side_b}: ‘trg’ if the language pair is used in the “forward” direction, otherwise ‘src’.
- {sorted_pair}: The source and target languages in alphabetical order, separated by a hyphen.
So for example, let’s say your corpus contains the files ben-eng/train.src.gz (Bengali side) and ben-eng/train.trg.gz (English side). You want to use the data symmetrically for both ben-to-eng and eng-to-ben directions. For the first, {lang_pair} and {sorted_pair} are the same. For the second, {lang_pair} is “eng-ben”, but {sorted_pair} is “ben-eng”. In order to use the files in the correct order, you should use {sorted_pair}/train.{side_a}.gz as the source template and {sorted_pair}/train.{side_b}.gz as the target template.
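Continuing the example, the corresponding path templates could be written as follows (the data/ prefix is hypothetical):

```yaml
config_config:
  src_path: data/{sorted_pair}/train.{side_a}.gz
  tgt_path: data/{sorted_pair}/train.{side_b}.gz
```

For the eng-to-ben task these resolve to data/ben-eng/train.trg.gz as the source and data/ben-eng/train.src.gz as the target.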
ae_path¶
Path templates for monolingual data for autoencoder tasks.
The same data will be used as both the source and target for the task: noise is introduced using transforms.
The path templates can contain the following variables that will be substituted by config_config: {src_lang}, {tgt_lang}, and {sorted_pair}.
If unset, autoencoder pairs will use src_path and tgt_path instead.
autoencoder¶
If set to True, autoencoder tasks will be added.
distance_matrix¶
Path to the distance matrix comma-separated values (csv) file.
n_groups¶
The number of language groups to create when clustering.
use_weight¶
If set to True, use corpus weights based on temperature-adjusted corpus size.
Note that the actual weight is relative to the weights of the other tasks assigned to the same GPU. E.g. if only one task is assigned to a GPU, it will receive 100% weight regardless of the computed weight.
use_introduce_at_training_step¶
If set to True, use a curriculum introducing corpora based on temperature-adjusted corpus size.
Note that if both use_weight and use_introduce_at_training_step are specified, the weight is distributed between the two mechanisms according to the square root, so that when both of them are applied (multiplicatively), the desired weight is achieved. E.g. a desired weight of 0.25 becomes a corpus weight of sqrt(0.25) = 0.5, combined with a starting step that lets the task train for roughly half of the training time.
Note that high-resource language pairs (those that would train for over 75% of the training time) all start at step 0. This avoids starting training with only one GPU doing work, while the other GPUs are idle waiting for their language pairs to start.
use_src_lang_token¶
Only has an effect when using the prefix transform.
Normally, the prefix transform only includes a target language selector token <to_yyy>, where yyy is the code of the target language. If this flag is set, the source language is also specified, e.g. <from_xxx> <to_yyy>.
translation_config_dir¶
The directory in which to generate translation configs.
One config per language pair will be generated.
Only supervised pairs are generated, unless zero_shot is True.
zero_shot¶
If set to True, translation configs are also generated for zero-shot directions.
transforms and ae_transforms¶
A list of transforms, for translation tasks and autoencoder tasks, respectively. Use these to apply subword segmentation (e.g. using sentencepiece) and, for the autoencoder, denoising noise. Both of these may change the sequence length, necessitating a filtertoolong transform.
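For example, a sketch combining the transforms mentioned above (any transform-specific settings, such as the filtertoolong length limits, are omitted here):

```yaml
config_config:
  transforms:
    - sentencepiece
    - filtertoolong
  ae_transforms:
    - sentencepiece
    - denoising
    - filtertoolong
```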
enc_sharing_groups and dec_sharing_groups¶
A list of parameter sharing patterns, one for each LayerStack in the (enc|dec)oder. Each list element takes one of 7 values:
- FULL: fully shared parameters. Will be named using the constant “full”.
- SRC_GROUP: groupwise shared parameters. Will be named according to the cluster id of the source language.
- TGT_GROUP: groupwise shared parameters. Will be named according to the cluster id of the target language.
- GROUP: groupwise shared parameters. Same as SRC_GROUP for the encoder and TGT_GROUP for the decoder.
- SRC_LANGUAGE: language specific parameters. Will be named according to the source language code.
- TGT_LANGUAGE: language specific parameters. Will be named according to the target language code.
- LANGUAGE: language specific parameters. Same as SRC_LANGUAGE for the encoder and TGT_LANGUAGE for the decoder.
Note that it is possible to have target-language-dependent components in the encoder, by using TGT_LANGUAGE or TGT_GROUP in the enc_sharing_groups.
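For example, an encoder split into two LayerStacks, with group-wise shared lower layers and fully shared upper layers, paired with a language-specific single-LayerStack decoder, might be specified like this (a sketch; the sizes of the LayerStacks themselves are configured elsewhere):

```yaml
config_config:
  enc_sharing_groups:
    - GROUP
    - FULL
  dec_sharing_groups:
    - LANGUAGE
```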
n_nodes and n_gpus_per_node¶
The number of nodes and GPUs, for assignment of tasks to devices. Note that you also need to specify this information separately to Slurm.
Top-level keys other than config_config¶
Parameter sharing in adapters¶
The key adapters.encoder.{adapter_name}.ids takes one of 3 values:
- FULL: fully shared parameters. Will be named using the constant “full”.
- GROUP: groupwise shared parameters. Will be named according to the cluster id.
- LANGUAGE: language specific parameters. Will be named according to the language code.
(Adapters do not currently support the SRC_ and TGT_ prefixes.)
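A sketch, assuming an encoder adapter named bottleneck (the adapter name is hypothetical, and other adapter hyperparameters are omitted; see the example config for the full structure):

```yaml
adapters:
  encoder:
    bottleneck:
      ids: LANGUAGE
      # ... other adapter hyperparameters ...
```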
Distance matrix¶
See the example distance matrix in mammoth/examples/config_config.distance.csv.
The distance matrix is given as a csv file, with a column lang and one column per language. There should be one row per language, with the language code in the first lang column, followed by floats giving the distance to the language specified by each column.
The rows should appear in the same order as the columns.
This means that the matrix must be square and symmetrical.
The upper and lower triangles are redundant, but both must be given.
Note that the values are distances: the distance of a language to itself (the diagonal) should be 0.
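A minimal sketch with three languages (the distance values are made up for illustration):

```
lang,ben,eng,hin
ben,0.0,0.9,0.5
eng,0.9,0.0,0.8
hin,0.5,0.8,0.0
```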
Alternative: specify language groups manually¶
To specify groups manually instead of using clustering, you must do two things:
1. Leave distance_matrix unset.
2. Specify a mapping of languages to groups: config_config.groups.{lang}: {group}.
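For example (the language codes and group labels here are hypothetical):

```yaml
config_config:
  groups:
    ben: indic
    hin: indic
    eng: germanic
```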
The actual corpora¶
The actual corpus files are used in two ways:
- The presence of the files for a language pair determines whether it is included or not.
- Line counts from the files are used for weighting.
Because of this, you need to run config_config so that it can access the corpora using the specified src_path and tgt_path.
Usage¶
python3 mammoth/tools/config_config.py config_all --in_config path/to/input.yaml --out_config path/to/output.yaml
Stages¶
The tool runs in multiple stages. The meta-stage config_all runs all of the stages in order.
It is possible to run the steps individually, by feeding in the output of the previous step as the input of the next step. This allows more control:
- Skipping unnecessary steps.
- Overriding what a particular step does by specifying its output manually.
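For example, assuming each stage name can be passed as the positional argument in the same way as config_all, the first two stages could be chained like this (file names are hypothetical):

```
python3 mammoth/tools/config_config.py complete_language_pairs --in_config input.yaml --out_config step1.yaml
python3 mammoth/tools/config_config.py corpora_schedule --in_config step1.yaml --out_config step2.yaml
```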
complete_language_pairs¶
Determines which language pairs have data. The languages to consider as candidates are determined from the vocabulary keys.
corpora_schedule¶
Determine weighting and curriculum for the tasks.
cluster_languages¶
Determine language groups by clustering.
This step can be easily skipped by leaving distance_matrix unset. If the step is skipped, you should define the config_config.groups dict in the input yaml.
sharing_groups¶
Apply the parameter sharing groups to tasks.
set_transforms¶
Apply the transforms to tasks.
allocate_devices¶
Allocate tasks to nodes and GPUs. A local search procedure is used, taking into account parameter sharing groups and tasks delayed by curriculum weighting.
adapter_config¶
Determine the adapter configuration.
translation_configs¶
Generate the translation yaml configs.
remove_temporary_keys¶
Remove any meta-parameters that are not accepted by OpenNMT. This should always be the last step.
config_all¶
Meta-stage to run all of the stages in order.
Command line overrides¶
Some parameters can also be given on the command line. If a value is given both in the input yaml and on the command line, the command line takes precedence.