Train

train.py

usage: train.py [-h] [-config CONFIG] [-save_config SAVE_CONFIG] -tasks TASKS [-skip_empty_level {silent,warning,error}]
                [-mammoth_transforms {prefix,denoising,filtertoolong,filterwordratio,filterrepetitions,filterterminalpunct,filternonzeronumerals,filterfeats,inferfeats,switchout,tokendrop,tokenmask,sentencepiece,bpe,onmt_tokenize} [{prefix,denoising,filtertoolong,filterwordratio,filterrepetitions,filterterminalpunct,filternonzeronumerals,filterfeats,inferfeats,switchout,tokendrop,tokenmask,sentencepiece,bpe,onmt_tokenize} ...]]
                [-save_data SAVE_DATA] [-overwrite] [-n_sample N_SAMPLE] [-dump_transforms] -src_vocab SRC_VOCAB [-tgt_vocab TGT_VOCAB] [-share_vocab]
                [-vocab_paths VOCAB_PATHS] [-src_feats_vocab SRC_FEATS_VOCAB] [-src_vocab_size SRC_VOCAB_SIZE] [-tgt_vocab_size TGT_VOCAB_SIZE]
                [-vocab_size_multiple VOCAB_SIZE_MULTIPLE] [-src_words_min_frequency SRC_WORDS_MIN_FREQUENCY]
                [-tgt_words_min_frequency TGT_WORDS_MIN_FREQUENCY] [--src_seq_length_trunc SRC_SEQ_LENGTH_TRUNC]
                [--tgt_seq_length_trunc TGT_SEQ_LENGTH_TRUNC] [-both_embeddings BOTH_EMBEDDINGS] [-src_embeddings SRC_EMBEDDINGS]
                [-tgt_embeddings TGT_EMBEDDINGS] [-embeddings_type {GloVe,word2vec}] [--permute_sent_ratio PERMUTE_SENT_RATIO]
                [--rotate_ratio ROTATE_RATIO] [--insert_ratio INSERT_RATIO] [--random_ratio RANDOM_RATIO] [--mask_ratio MASK_RATIO]
                [--mask_length {subword,word,span-poisson}] [--poisson_lambda POISSON_LAMBDA] [--replace_length {-1,0,1}]
                [--denoising_objective {bart,mass}] [--src_seq_length SRC_SEQ_LENGTH] [--tgt_seq_length TGT_SEQ_LENGTH]
                [--word_ratio_threshold WORD_RATIO_THRESHOLD] [--rep_threshold REP_THRESHOLD] [--rep_min_len REP_MIN_LEN] [--rep_max_len REP_MAX_LEN]
                [--punct_threshold PUNCT_THRESHOLD] [--nonzero_threshold NONZERO_THRESHOLD] [--reversible_tokenization {joiner,spacer}]
                [--prior_tokenization] [-switchout_temperature SWITCHOUT_TEMPERATURE] [-tokendrop_temperature TOKENDROP_TEMPERATURE]
                [-tokenmask_temperature TOKENMASK_TEMPERATURE] [-src_subword_model SRC_SUBWORD_MODEL] [-tgt_subword_model TGT_SUBWORD_MODEL]
                [-src_subword_nbest SRC_SUBWORD_NBEST] [-tgt_subword_nbest TGT_SUBWORD_NBEST] [-src_subword_alpha SRC_SUBWORD_ALPHA]
                [-tgt_subword_alpha TGT_SUBWORD_ALPHA] [-src_subword_vocab SRC_SUBWORD_VOCAB] [-tgt_subword_vocab TGT_SUBWORD_VOCAB]
                [-src_vocab_threshold SRC_VOCAB_THRESHOLD] [-tgt_vocab_threshold TGT_VOCAB_THRESHOLD] [-src_subword_type {none,sentencepiece,bpe}]
                [-tgt_subword_type {none,sentencepiece,bpe}] [-src_onmttok_kwargs SRC_ONMTTOK_KWARGS] [-tgt_onmttok_kwargs TGT_ONMTTOK_KWARGS]
                [--share_decoder_embeddings] [--share_embeddings] [--enable_embeddingless] [--position_encoding] [-update_vocab]
                [--feat_merge {concat,sum,mlp}] [--feat_vec_size FEAT_VEC_SIZE] [--feat_vec_exponent FEAT_VEC_EXPONENT] [-model_task {seq2seq,lm}]
                [--model_type {text}] [--model_dtype {fp32,fp16}] [--encoder_type {mean,transformer}] [--decoder_type {transformer}] [--layers LAYERS]
                [--enc_layers ENC_LAYERS [ENC_LAYERS ...]] [--dec_layers DEC_LAYERS [DEC_LAYERS ...]] [--model_dim MODEL_DIM]
                [--pos_ffn_activation_fn {relu,gelu}] [-normformer] [--bridge] [--bridge_extra_node BRIDGE_EXTRA_NODE] [--bidir_edges BIDIR_EDGES]
                [--state_dim STATE_DIM] [--n_edge_types N_EDGE_TYPES] [--n_node N_NODE] [--n_steps N_STEPS] [--src_ggnn_size SRC_GGNN_SIZE]
                [--global_attention {dot,general,mlp,none}] [--global_attention_function {softmax}] [--self_attn_type SELF_ATTN_TYPE]
                [--max_relative_positions MAX_RELATIVE_POSITIONS] [--heads HEADS] [--transformer_ff TRANSFORMER_FF] [--aan_useffn]
                [--lambda_align LAMBDA_ALIGN] [--alignment_layer ALIGNMENT_LAYER] [--alignment_heads ALIGNMENT_HEADS] [--full_context_alignment]
                [--copy_attn] [--copy_attn_type {dot,general,mlp,none}] [--generator_function {softmax}] [--copy_attn_force] [--reuse_copy_attn]
                [--copy_loss_by_seqlength] [--coverage_attn] [--lambda_coverage LAMBDA_COVERAGE] [--loss_scale LOSS_SCALE] [--apex_opt_level {O0,O1,O2,O3}]
                [--hidden_ab_size HIDDEN_AB_SIZE] [--ab_fixed_length AB_FIXED_LENGTH] [--ab_layers [{lin,simple,transformer,perceiver,feedforward} ...]]
                [--ab_layer_norm {none,rmsnorm,layernorm}] [-adapters ADAPTERS] [--data_type DATA_TYPE] [--save_model SAVE_MODEL] [--save_all_gpus]
                [--save_checkpoint_steps SAVE_CHECKPOINT_STEPS] [--keep_checkpoint KEEP_CHECKPOINT] [--gpuid [GPUID ...]] [--gpu_ranks [GPU_RANKS ...]]
                [--n_nodes N_NODES] --node_rank NODE_RANK [--world_size WORLD_SIZE] [--gpu_backend GPU_BACKEND] [--gpu_verbose_level GPU_VERBOSE_LEVEL]
                [--master_ip MASTER_IP] [--master_port MASTER_PORT] [--queue_size QUEUE_SIZE] [--seed SEED] [--param_init PARAM_INIT] [--param_init_glorot]
                [--train_from TRAIN_FROM] [--reset_optim {none,all,states,keep_states}] [--pre_word_vecs_enc PRE_WORD_VECS_ENC]
                [--pre_word_vecs_dec PRE_WORD_VECS_DEC] [--freeze_word_vecs_enc] [--freeze_word_vecs_dec] [--batch_size BATCH_SIZE]
                [--batch_size_multiple BATCH_SIZE_MULTIPLE] [--batch_type {sents,tokens}] [--normalization {sents,tokens}]
                [--accum_count ACCUM_COUNT [ACCUM_COUNT ...]] [--accum_steps ACCUM_STEPS [ACCUM_STEPS ...]]
                [--task_distribution_strategy {weighted_sampling,roundrobin}] [--valid_steps VALID_STEPS] [--valid_batch_size VALID_BATCH_SIZE]
                [--max_generator_batches MAX_GENERATOR_BATCHES] [--train_steps TRAIN_STEPS] [--single_pass] [--epochs EPOCHS]
                [--early_stopping EARLY_STOPPING] [--early_stopping_criteria [EARLY_STOPPING_CRITERIA ...]]
                [--optim {sgd,adagrad,adadelta,adam,adamw,adafactor,fusedadam}] [--adagrad_accumulator_init ADAGRAD_ACCUMULATOR_INIT]
                [--max_grad_norm MAX_GRAD_NORM] [--weight_decay WEIGHT_DECAY] [--dropout DROPOUT [DROPOUT ...]]
                [--attention_dropout ATTENTION_DROPOUT [ATTENTION_DROPOUT ...]] [--dropout_steps DROPOUT_STEPS [DROPOUT_STEPS ...]]
                [--truncated_decoder TRUNCATED_DECODER] [--adam_beta1 ADAM_BETA1] [--adam_beta2 ADAM_BETA2] [--label_smoothing LABEL_SMOOTHING]
                [--average_decay AVERAGE_DECAY] [--average_every AVERAGE_EVERY] [--learning_rate LEARNING_RATE] [--learning_rate_decay LEARNING_RATE_DECAY]
                [--start_decay_steps START_DECAY_STEPS] [--decay_steps DECAY_STEPS] [--decay_method {noam,noamwd,rsqrt,linear_warmup,none}]
                [--warmup_steps WARMUP_STEPS] [--log_file LOG_FILE] [--structured_log_file STRUCTURED_LOG_FILE]
                [--log_file_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET,50,40,30,20,10,0}] [--verbose] [--report_every REPORT_EVERY]
                [--exp_host EXP_HOST] [--exp EXP] [--tensorboard] [--tensorboard_log_dir TENSORBOARD_LOG_DIR] [--report_stats_from_parameters]
                [-pool_size POOL_SIZE] [-n_buckets N_BUCKETS]

Configuration

-config, --config

Path of the main YAML config file.

-save_config, --save_config

Path where to save the config.

Data/Tasks

-tasks, --tasks

List of datasets and their specifications. See examples/*.yaml for further details.
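
For illustration only, a task entry in the YAML config might look like the sketch below; the nested field names shown here are assumptions, so consult examples/*.yaml for the authoritative schema:

    tasks:
      train_de-en:                        # arbitrary task identifier
        src_tgt: de-en                    # assumed key: source-target language pair
        path_src: data/de-en/train.de     # placeholder corpus paths
        path_tgt: data/de-en/train.en
        transforms: [sentencepiece, filtertoolong]   # per-task transform pipeline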

-skip_empty_level, --skip_empty_level

Possible choices: silent, warning, error

Level of strictness when encountering empty examples. silent: silently ignore/skip the empty example; warning: log a warning when ignoring/skipping an empty example; error: raise an error and stop execution when an empty example is encountered.

Default: “warning”

-mammoth_transforms, --mammoth_transforms

Possible choices: prefix, denoising, filtertoolong, filterwordratio, filterrepetitions, filterterminalpunct, filternonzeronumerals, filterfeats, inferfeats, switchout, tokendrop, tokenmask, sentencepiece, bpe, onmt_tokenize

Default transform pipeline to apply to the data. Can be overridden for each corpus, as in the sketch below.

Default: []
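
For example, a default pipeline can be set globally in the config and overridden for a single corpus (the per-task transforms key is an assumption; see examples/*.yaml):

    mammoth_transforms: [sentencepiece, filtertoolong]           # default for all tasks
    tasks:
      train_de-en:
        transforms: [sentencepiece, denoising, filtertoolong]    # override for this corpus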

-save_data, --save_data

Output base path for objects that will be saved (vocab, transforms, embeddings, …).

-overwrite, --overwrite

Overwrite existing objects if any.

Default: False

-n_sample, --n_sample

Stop after saving this number of transformed samples per corpus. Can be [-1, 0, N>0]. Set to -1 to process the full corpus, or 0 to skip.

Default: 0

-dump_transforms, --dump_transforms

Dump the transforms to disk as *.transforms.pt. -save_data must be set, as it is used as the saving prefix.

Default: False

Vocab

-src_vocab, --src_vocab

Path to src (or shared) vocabulary file. Format: one <word> or <word> <count> per line.

-tgt_vocab, --tgt_vocab

Path to tgt vocabulary file. Format: one <word> or <word> <count> per line.

-share_vocab, --share_vocab

Share source and target vocabulary.

Default: False

-vocab_paths, --vocab_paths

Path to a file mapping languages to vocabularies. Each line contains three tab-separated fields: the side (enc or dec), the language name, and the path to the vocabulary file.

-src_feats_vocab, --src_feats_vocab

List of paths to src features vocabulary files. Files format: one <word> or <word> <count> per line.

-src_vocab_size, --src_vocab_size

Maximum size of the source vocabulary.

Default: 50000

-tgt_vocab_size, --tgt_vocab_size

Maximum size of the target vocabulary.

Default: 50000

-vocab_size_multiple, --vocab_size_multiple

Make the vocabulary size a multiple of this value.

Default: 1

-src_words_min_frequency, --src_words_min_frequency

Discard source words with lower frequency.

Default: 0

-tgt_words_min_frequency, --tgt_words_min_frequency

Discard target words with lower frequency.

Default: 0

Pruning

--src_seq_length_trunc, -src_seq_length_trunc

Truncate source sequence length.

--tgt_seq_length_trunc, -tgt_seq_length_trunc

Truncate target sequence length.

Embeddings

-both_embeddings, --both_embeddings

Path to the embeddings file to use for both source and target tokens.

-src_embeddings, --src_embeddings

Path to the embeddings file to use for source tokens.

-tgt_embeddings, --tgt_embeddings

Path to the embeddings file to use for target tokens.

-embeddings_type, --embeddings_type

Possible choices: GloVe, word2vec

Type of embeddings file.

Transform/Denoising AE

--permute_sent_ratio, -permute_sent_ratio

Permute this proportion of sentences (boundaries defined by [‘.’, ‘?’, ‘!’]) in all inputs.

Default: 0.0

--rotate_ratio, -rotate_ratio

Rotate this proportion of inputs.

Default: 0.0

--insert_ratio, -insert_ratio

Insert this percentage of additional random tokens.

Default: 0.0

--random_ratio, -random_ratio

Instead of using <mask>, use a random token this often. Incompatible with MASS.

Default: 0.0

--mask_ratio, -mask_ratio

Fraction of words/subwords that will be masked.

Default: 0.0

--mask_length, -mask_length

Possible choices: subword, word, span-poisson

Length of masking window to apply.

Default: “subword”

--poisson_lambda, -poisson_lambda

Lambda for Poisson distribution to sample span length if -mask_length set to span-poisson.

Default: 3.0

--replace_length, -replace_length

Possible choices: -1, 0, 1

When masking N tokens, replace with 0, 1, or N tokens. (use -1 for N)

Default: -1

--denoising_objective

Possible choices: bart, mass

Choose between BART-style and MASS-style denoising objectives.

Default: “bart”
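
Putting the denoising options together, a BART-style setup might be configured roughly as follows in the YAML config (the values are arbitrary examples, not recommendations):

    mammoth_transforms: [sentencepiece, denoising]
    denoising_objective: bart
    mask_ratio: 0.3            # mask 30% of subwords
    mask_length: span-poisson  # sample span lengths from a Poisson distribution
    poisson_lambda: 3.0
    replace_length: 1          # replace each masked span with a single <mask> token
    permute_sent_ratio: 0.5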

Transform/Filter

--src_seq_length, -src_seq_length

Maximum source sequence length.

Default: 200

--tgt_seq_length, -tgt_seq_length

Maximum target sequence length.

Default: 200

--word_ratio_threshold, -word_ratio_threshold

Threshold for discarding sentences based on word ratio.

Default: 3

--rep_threshold, -rep_threshold

Number of times the substring is repeated.

Default: 2

--rep_min_len, -rep_min_len

Minimum length of the repeated pattern.

Default: 3

--rep_max_len, -rep_max_len

Maximum length of the repeated pattern.

Default: 100

--punct_threshold, -punct_threshold

Minimum penalty score for discarding sentences based on their terminal punctuation signs

Default: -2

--nonzero_threshold, -nonzero_threshold

Threshold for discarding sentences based on a comparison of the numerals in the two segments, with zeros removed.

Default: 0.5

Transform/InferFeats

--reversible_tokenization, -reversible_tokenization

Possible choices: joiner, spacer

Type of reversible tokenization applied by the tokenizer.

Default: “joiner”

--prior_tokenization, -prior_tokenization

Whether the input has already been tokenized.

Default: False

Transform/SwitchOut

-switchout_temperature, --switchout_temperature

Sampling temperature for SwitchOut: τ⁻¹ in [WPDN18]. Smaller values make the data more diverse.

Default: 1.0

Transform/Token_Drop

-tokendrop_temperature, --tokendrop_temperature

Sampling temperature for token deletion.

Default: 1.0

Transform/Token_Mask

-tokenmask_temperature, --tokenmask_temperature

Sampling temperature for token masking.

Default: 1.0

Transform/Subword/Common

Common options shared by all subword transforms, including options to indicate the subword model path, subword regularization/BPE-dropout, and vocabulary restriction.

-src_subword_model, --src_subword_model

Path of subword model for src (or shared).

-tgt_subword_model, --tgt_subword_model

Path of subword model for tgt.

-src_subword_nbest, --src_subword_nbest

Number of candidates for subword regularization (source side). Used by SentencePiece unigram sampling; not applicable to BPE-dropout.

Default: 1

-tgt_subword_nbest, --tgt_subword_nbest

Number of candidates for subword regularization (target side). Used by SentencePiece unigram sampling; not applicable to BPE-dropout.

Default: 1

-src_subword_alpha, --src_subword_alpha

Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)

Default: 0

-tgt_subword_alpha, --tgt_subword_alpha

Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)

Default: 0

-src_subword_vocab, --src_subword_vocab

Path to the vocabulary file for src subword. Format: <word> <count> per line.

Default: “”

-tgt_subword_vocab, --tgt_subword_vocab

Path to the vocabulary file for tgt subword. Format: <word> <count> per line.

Default: “”

-src_vocab_threshold, --src_vocab_threshold

Only produce src subword in src_subword_vocab with frequency >= src_vocab_threshold.

Default: 0

-tgt_vocab_threshold, --tgt_vocab_threshold

Only produce tgt subword in tgt_subword_vocab with frequency >= tgt_vocab_threshold.

Default: 0
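
As an illustration, SentencePiece segmentation with unigram sampling could be configured along these lines (paths and values are placeholders):

    mammoth_transforms: [sentencepiece, filtertoolong]
    src_subword_model: vocab/spm.src.model   # placeholder path
    tgt_subword_model: vocab/spm.tgt.model
    src_subword_nbest: 64      # sample from the 64 best segmentations
    tgt_subword_nbest: 64
    src_subword_alpha: 0.1     # unigram sampling smoothing
    tgt_subword_alpha: 0.1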

Transform/Subword/ONMTTOK

-src_subword_type, --src_subword_type

Possible choices: none, sentencepiece, bpe

Type of subword model for src (or shared) in pyonmttok.

Default: “none”

-tgt_subword_type, --tgt_subword_type

Possible choices: none, sentencepiece, bpe

Type of subword model for tgt in pyonmttok.

Default: “none”

-src_onmttok_kwargs, --src_onmttok_kwargs

Other pyonmttok options for src, given as a dict string, excluding the subword-related options listed above.

Default: “{‘mode’: ‘none’}”

-tgt_onmttok_kwargs, --tgt_onmttok_kwargs

Other pyonmttok options for tgt, given as a dict string, excluding the subword-related options listed above.

Default: “{‘mode’: ‘none’}”

Model - Embeddings

--share_decoder_embeddings, -share_decoder_embeddings

Use a shared weight matrix for the input and output word embeddings in the decoder.

Default: False

--share_embeddings, -share_embeddings

Share the word embeddings between encoder and decoder. Requires a shared vocabulary.

Default: False

--enable_embeddingless, -enable_embeddingless

Enable the use of byte-based embeddingless models (Shaham et al., 2021): https://aclanthology.org/2021.naacl-main.17/

Default: False

--position_encoding, -position_encoding

Use a sinusoidal encoding to mark word positions. Necessary for non-RNN-style models.

Default: False

-update_vocab, --update_vocab

Update the existing source and target vocabularies.

Default: False

Model - Embedding Features

--feat_merge, -feat_merge

Possible choices: concat, sum, mlp

Merge action for incorporating features embeddings. Options [concat|sum|mlp].

Default: “concat”

--feat_vec_size, -feat_vec_size

If specified, feature embedding sizes will be set to this. Otherwise, feat_vec_exponent will be used.

Default: -1

--feat_vec_exponent, -feat_vec_exponent

If -feat_vec_size is not set, feature embedding sizes will be set to N^feat_vec_exponent, where N is the number of values the feature takes. For example, a feature with 40 distinct values gets an embedding of size 40^0.7 ≈ 13.

Default: 0.7

Model - Task

-model_task, --model_task

Possible choices: seq2seq, lm

Type of task for the model: either seq2seq or lm.

Default: “seq2seq”

Model - Encoder-Decoder

--model_type, -model_type

Possible choices: text

Type of source model to use. Allows the system to incorporate non-text inputs. Options are [text].

Default: “text”

--model_dtype, -model_dtype

Possible choices: fp32, fp16

Data type of the model.

Default: “fp32”

--encoder_type, -encoder_type

Possible choices: mean, transformer

Type of encoder layer to use. Non-RNN layers are experimental. Options are [mean|transformer].

Default: “transformer”

--decoder_type, -decoder_type

Possible choices: transformer

Type of decoder layer to use. Non-RNN layers are experimental. Options are [transformer].

Default: “transformer”

--layers, -layers

Deprecated

Default: -1

--enc_layers, -enc_layers

Number of layers in each encoder

--dec_layers, -dec_layers

Number of layers in each decoder
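
Both options accept a list of values; in the common case a single value is given, e.g.:

    enc_layers: [6]
    dec_layers: [6]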

--model_dim, -model_dim

Size of the model's hidden states.

Default: -1

--pos_ffn_activation_fn, -pos_ffn_activation_fn

Possible choices: relu, gelu

The activation function to use in the PositionwiseFeedForward layer. Choices: relu, gelu. Defaults to relu.

Default: “relu”

-normformer, --normformer

NormFormer-style normalization

Default: False

--bridge, -bridge

Have an additional layer between the last encoder state and the first decoder state

Default: False

--bridge_extra_node, -bridge_extra_node

Graph encoder bridges only extra node to decoder as input

Default: True

--bidir_edges, -bidir_edges

Graph encoder autogenerates bidirectional edges

Default: True

--state_dim, -state_dim

Number of state dimensions in the graph encoder

Default: 512

--n_edge_types, -n_edge_types

Number of edge types in the graph encoder

Default: 2

--n_node, -n_node

Number of nodes in the graph encoder

Default: 2

--n_steps, -n_steps

Number of steps to advance graph encoder

Default: 2

--src_ggnn_size, -src_ggnn_size

Vocab size plus feature space for embedding input

Default: 0

Model - Attention

--global_attention, -global_attention

Possible choices: dot, general, mlp, none

The attention type to use: dot product, general (Luong), or MLP (Bahdanau).

Default: “general”

--global_attention_function, -global_attention_function

Possible choices: softmax

Default: “softmax”

--self_attn_type, -self_attn_type

Self attention type in Transformer decoder layer – currently “scaled-dot” or “average”

Default: “scaled-dot”

--max_relative_positions, -max_relative_positions

Maximum distance between inputs in relative positions representations. For more detailed information, see: https://arxiv.org/pdf/1803.02155.pdf

Default: 0

--heads, -heads

Number of heads for transformer self-attention

Default: 8

--transformer_ff, -transformer_ff

Size of hidden transformer feed-forward

Default: 2048

--aan_useffn, -aan_useffn

Turn on the FFN layer in the AAN decoder

Default: False

Model - Alignment

--lambda_align, -lambda_align

Lambda value for the alignment loss of Garg et al. (2019). For more detailed information, see: https://arxiv.org/abs/1909.02074

Default: 0.0

--alignment_layer, -alignment_layer

Layer number which has to be supervised.

Default: -3

--alignment_heads, -alignment_heads

Number of cross-attention heads per layer to supervise with the alignment loss.

Default: 0

--full_context_alignment, -full_context_alignment

Whether alignment is conditioned on full target context.

Default: False

Generator

--copy_attn, -copy_attn

Train copy attention layer.

Default: False

--copy_attn_type, -copy_attn_type

Possible choices: dot, general, mlp, none

The copy attention type to use. Leave as None to use the same as -global_attention.

--generator_function, -generator_function

Possible choices: softmax

Which function to use for generating probabilities over the target vocabulary (choices: softmax)

Default: “softmax”

--copy_attn_force, -copy_attn_force

When available, train to copy.

Default: False

--reuse_copy_attn, -reuse_copy_attn

Reuse standard attention for copy

Default: False

--copy_loss_by_seqlength, -copy_loss_by_seqlength

Divide copy loss by length of sequence

Default: False

--coverage_attn, -coverage_attn

Train a coverage attention layer.

Default: False

--lambda_coverage, -lambda_coverage

Lambda value for the coverage loss of See et al. (2017).

Default: 0.0

--loss_scale, -loss_scale

For FP16 training, the static loss scale to use. If not set, the loss scale is dynamically computed.

Default: 0

--apex_opt_level, -apex_opt_level

Possible choices: O0, O1, O2, O3

For FP16 training, the opt_level to use. See https://nvidia.github.io/apex/amp.html#opts-levels.

Default: “O1”

Attention bridge

--hidden_ab_size, -hidden_ab_size

Size of attention bridge hidden states

Default: 2048

--ab_fixed_length, -ab_fixed_length

Number of attention heads in attention bridge (fixed length of output)

Default: 50

--ab_layers, -ab_layers

Possible choices: lin, simple, transformer, perceiver, feedforward

Composition of the attention bridge

Default: []

--ab_layer_norm, -ab_layer_norm

Possible choices: none, rmsnorm, layernorm

Use layer normalization after lin, simple and feedforward bridge layers

Default: “layernorm”
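
For instance, a fixed-length attention bridge could be configured with the options above (values are arbitrary examples):

    ab_layers: [perceiver]     # composition of the bridge
    ab_fixed_length: 50        # fixed length of the bridge output
    hidden_ab_size: 2048
    ab_layer_norm: layernorm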

Adapters

-adapters, --adapters

Adapter specifications

General

--data_type, -data_type

Type of the source input. Options are [text].

Default: “text”

--save_model, -save_model

Model filename prefix (the model will be saved as <save_model>_N.pt, where N is the number of steps).

Default: “model”

--save_all_gpus, -save_all_gpus

Whether to store a full model from every GPU (in addition to the modules).

Default: False

--save_checkpoint_steps, -save_checkpoint_steps

Save a checkpoint every X steps

Default: 5000

--keep_checkpoint, -keep_checkpoint

Keep X checkpoints (negative: keep all)

Default: -1
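
For example, the following config keeps the ten most recent checkpoints, written every 5000 steps (the prefix is a placeholder):

    save_model: models/my_run          # checkpoints saved as models/my_run_N.pt
    save_checkpoint_steps: 5000
    keep_checkpoint: 10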

--gpuid, -gpuid

Deprecated; see world_size and gpu_ranks.

Default: []

--gpu_ranks, -gpu_ranks

List of ranks of each process.

Default: []

--n_nodes, -n_nodes

Total number of training nodes.

Default: 1

--node_rank, -node_rank

Index of the current node (0-based). When using non-distributed training (CPU or single GPU), set to 0.

--world_size, -world_size

Total number of distributed processes.

Default: 1

--gpu_backend, -gpu_backend

Type of torch distributed backend

Default: “nccl”

--gpu_verbose_level, -gpu_verbose_level

Gives more info on each process per GPU.

Default: 0

--master_ip, -master_ip

IP of master for torch.distributed training.

Default: “localhost”

--master_port, -master_port

Port of master for torch.distributed training.

Default: 10000

--queue_size, -queue_size

Size of the queue for each process in the producer/consumer setup.

Default: 40
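
A minimal single-node, multi-GPU sketch of the distributed options above (for multi-node setups, node_rank and master_ip change per node; check the MAMMOTH examples for the exact conventions):

    world_size: 4            # total number of processes
    gpu_ranks: [0, 1, 2, 3]  # ranks handled on this node
    n_nodes: 1
    node_rank: 0             # 0 for single-node training
    gpu_backend: nccl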

Reproducibility

--seed, -seed

Set random seed used for better reproducibility between experiments.

Default: -1

Initialization

--param_init, -param_init

Parameters are initialized from a uniform distribution with support (-param_init, param_init). Use 0 to skip initialization.

Default: 0.1

--param_init_glorot, -param_init_glorot

Init parameters with xavier_uniform. Required for transformer.

Default: False

--train_from, -train_from

If training from a checkpoint then this is the path to the pretrained model’s state_dict.

Default: “”

--reset_optim, -reset_optim

Possible choices: none, all, states, keep_states

How to reset the optimizer when training with train_from.

Default: “none”

--pre_word_vecs_enc, -pre_word_vecs_enc

If a valid path is specified, then this will load pretrained word embeddings on the encoder side. See README for specific formatting instructions.

--pre_word_vecs_dec, -pre_word_vecs_dec

If a valid path is specified, then this will load pretrained word embeddings on the decoder side. See README for specific formatting instructions.

--freeze_word_vecs_enc, -freeze_word_vecs_enc

Freeze word embeddings on the encoder side.

Default: False

--freeze_word_vecs_dec, -freeze_word_vecs_dec

Freeze word embeddings on the decoder side.

Default: False

Optimization - Type

--batch_size, -batch_size

Maximum batch size for training

Default: 64

--batch_size_multiple, -batch_size_multiple

Batch size multiple for token batches.

--batch_type, -batch_type

Possible choices: sents, tokens

Batch grouping for batch_size. Standard is sents; tokens enables dynamic batching.

Default: “sents”

--normalization, -normalization

Possible choices: sents, tokens

Normalization method of the gradient.

Default: “sents”

--accum_count, -accum_count

Accumulate gradient this many times. Approximately equivalent to updating batch_size * accum_count batches at once. Recommended for Transformer.

Default: [1]

--accum_steps, -accum_steps

Steps at which accum_count values change

Default: [0]
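
As a worked example of token batching with gradient accumulation: with batch_type tokens, batch_size 4096, and accum_count 2, each optimizer step covers roughly 2 × 4096 = 8192 tokens per process. A corresponding config sketch (values are illustrative):

    batch_type: tokens
    batch_size: 4096
    batch_size_multiple: 8
    normalization: tokens
    accum_count: [2, 4]
    accum_steps: [0, 50000]   # accumulate 2 batches until step 50000, then 4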

--task_distribution_strategy, -task_distribution_strategy

Possible choices: weighted_sampling, roundrobin

Strategy for the order in which tasks (e.g. language pairs) are scheduled for training

Default: “weighted_sampling”

--valid_steps, -valid_steps

Perform validation every X steps.

Default: 10000

--valid_batch_size, -valid_batch_size

Maximum batch size for validation

Default: 32

--max_generator_batches, -max_generator_batches

Maximum batches of words in a sequence to run the generator on in parallel. Higher is faster, but uses more memory. Set to 0 to disable.

Default: 32

--train_steps, -train_steps

Number of training steps

Default: 100000

--single_pass, -single_pass

Make a single pass over the training dataset.

Default: False

--epochs, -epochs

Deprecated; see train_steps.

Default: 0

--early_stopping, -early_stopping

Number of validation steps without improvement before training stops early.

Default: 0

--early_stopping_criteria, -early_stopping_criteria

Criteria to use for early stopping.

--optim, -optim

Possible choices: sgd, adagrad, adadelta, adam, adamw, adafactor, fusedadam

Optimization method.

Default: “sgd”

--adagrad_accumulator_init, -adagrad_accumulator_init

Initializes the accumulator values in adagrad. Mirrors the initial_accumulator_value option in the tensorflow adagrad (use 0.1 for their default).

Default: 0

--max_grad_norm, -max_grad_norm

If the norm of the gradient vector exceeds this, renormalize it to have the norm equal to max_grad_norm

Default: 5

--weight_decay, -weight_decay

L2 penalty (weight decay) regularizer

Default: 0

--dropout, -dropout

Dropout probability; applied in LSTM stacks.

Default: [0.3]

--attention_dropout, -attention_dropout

Attention Dropout probability.

Default: [0.1]

--dropout_steps, -dropout_steps

Steps at which dropout changes.

Default: [0]

--truncated_decoder, -truncated_decoder

Truncated bptt.

Default: 0

--adam_beta1, -adam_beta1

The beta1 parameter used by Adam. Almost without exception a value of 0.9 is used in the literature, seemingly giving good results, so we would discourage changing this value from the default without due consideration.

Default: 0.9

--adam_beta2, -adam_beta2

The beta2 parameter used by Adam. Typically a value of 0.999 is recommended, as this is the value suggested by the original paper describing Adam and adopted by other frameworks such as TensorFlow and Keras (see: https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer or https://keras.io/optimizers/). Although the paper “Attention is All You Need” suggested a value of 0.98 for beta2, that value may not work well for normal models / default baselines.

Default: 0.999

--label_smoothing, -label_smoothing

Label smoothing value epsilon. Probabilities of all non-true labels will be smoothed by epsilon / (vocab_size - 1). Set to zero to turn off label smoothing. For more detailed information, see: https://arxiv.org/abs/1512.00567

Default: 0.0

--average_decay, -average_decay

Moving average decay. Set to a value other than 0 (e.g. 1e-4) to activate. Similar to the Marian NMT implementation: http://www.aclweb.org/anthology/P18-4020. For more detail on exponential moving averages, see: https://en.wikipedia.org/wiki/Moving_average

Default: 0

--average_every, -average_every

Step for moving average. Default is every update, if -average_decay is set.

Default: 1

Optimization - Rate

--learning_rate, -learning_rate

Starting learning rate. Recommended settings: sgd = 1, adagrad = 0.1, adadelta = 1, adam = 0.001

Default: 1.0

--learning_rate_decay, -learning_rate_decay

Decay the learning rate by this much once the step count passes start_decay_steps.

Default: 0.5

--start_decay_steps, -start_decay_steps

Start decaying the learning rate every decay_steps after start_decay_steps.

Default: 50000

--decay_steps, -decay_steps

Decay every decay_steps

Default: 10000

--decay_method, -decay_method

Possible choices: noam, noamwd, rsqrt, linear_warmup, none

Use a custom decay rate.

Default: “none”

--warmup_steps, -warmup_steps

Number of warmup steps for custom decay.

Default: 4000
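
A common Transformer-style schedule combines Adam with noam decay and warmup; an illustrative sketch (values are examples, not recommendations):

    optim: adam
    adam_beta2: 0.998
    learning_rate: 2.0        # scaled by the noam schedule
    decay_method: noam
    warmup_steps: 8000
    label_smoothing: 0.1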

Logging

--log_file, -log_file

Output logs to a file under this path.

Default: “”

--structured_log_file, -structured_log_file

Output machine-readable structured logs to a file under this path.

Default: “”

--log_file_level, -log_file_level

Possible choices: CRITICAL, ERROR, WARNING, INFO, DEBUG, NOTSET, 50, 40, 30, 20, 10, 0

Default: “0”

--verbose, -verbose

Print data loading and statistics for all processes (by default, only the first process is logged).

Default: False

--report_every, -report_every

Print stats at this interval.

Default: 50

--exp_host, -exp_host

Send logs to this crayon server.

Default: “”

--exp, -exp

Name of the experiment for logging.

Default: “”

--tensorboard, -tensorboard

Use tensorboard for visualization during training. Must have the library tensorboard >= 1.14.

Default: False

--tensorboard_log_dir, -tensorboard_log_dir

Log directory for Tensorboard. This is also the name of the run.

Default: “runs/mammoth”

--report_stats_from_parameters, -report_stats_from_parameters

Report parameter-level statistics in tensorboard. This has a huge impact on performance: only use for debugging.

Default: False

Dynamic data

-pool_size, --pool_size

Number of examples to dynamically pool before batching.

Default: 2048

-n_buckets, --n_buckets

Maximum number of bins for batching.

Default: 1024