Train¶
train.py
usage: train.py [-h] [-config CONFIG] [-save_config SAVE_CONFIG] -tasks TASKS [-skip_empty_level {silent,warning,error}]
[-mammoth_transforms {prefix,denoising,filtertoolong,filterwordratio,filterrepetitions,filterterminalpunct,filternonzeronumerals,filterfeats,inferfeats,switchout,tokendrop,tokenmask,sentencepiece,bpe,onmt_tokenize} [{prefix,denoising,filtertoolong,filterwordratio,filterrepetitions,filterterminalpunct,filternonzeronumerals,filterfeats,inferfeats,switchout,tokendrop,tokenmask,sentencepiece,bpe,onmt_tokenize} ...]]
[-save_data SAVE_DATA] [-overwrite] [-n_sample N_SAMPLE] [-dump_transforms] -src_vocab SRC_VOCAB [-tgt_vocab TGT_VOCAB] [-share_vocab]
[-vocab_paths VOCAB_PATHS] [-src_feats_vocab SRC_FEATS_VOCAB] [-src_vocab_size SRC_VOCAB_SIZE] [-tgt_vocab_size TGT_VOCAB_SIZE]
[-vocab_size_multiple VOCAB_SIZE_MULTIPLE] [-src_words_min_frequency SRC_WORDS_MIN_FREQUENCY]
[-tgt_words_min_frequency TGT_WORDS_MIN_FREQUENCY] [--src_seq_length_trunc SRC_SEQ_LENGTH_TRUNC]
[--tgt_seq_length_trunc TGT_SEQ_LENGTH_TRUNC] [-both_embeddings BOTH_EMBEDDINGS] [-src_embeddings SRC_EMBEDDINGS]
[-tgt_embeddings TGT_EMBEDDINGS] [-embeddings_type {GloVe,word2vec}] [--permute_sent_ratio PERMUTE_SENT_RATIO]
[--rotate_ratio ROTATE_RATIO] [--insert_ratio INSERT_RATIO] [--random_ratio RANDOM_RATIO] [--mask_ratio MASK_RATIO]
[--mask_length {subword,word,span-poisson}] [--poisson_lambda POISSON_LAMBDA] [--replace_length {-1,0,1}]
[--denoising_objective {bart,mass}] [--src_seq_length SRC_SEQ_LENGTH] [--tgt_seq_length TGT_SEQ_LENGTH]
[--word_ratio_threshold WORD_RATIO_THRESHOLD] [--rep_threshold REP_THRESHOLD] [--rep_min_len REP_MIN_LEN] [--rep_max_len REP_MAX_LEN]
[--punct_threshold PUNCT_THRESHOLD] [--nonzero_threshold NONZERO_THRESHOLD] [--reversible_tokenization {joiner,spacer}]
[--prior_tokenization] [-switchout_temperature SWITCHOUT_TEMPERATURE] [-tokendrop_temperature TOKENDROP_TEMPERATURE]
[-tokenmask_temperature TOKENMASK_TEMPERATURE] [-src_subword_model SRC_SUBWORD_MODEL] [-tgt_subword_model TGT_SUBWORD_MODEL]
[-src_subword_nbest SRC_SUBWORD_NBEST] [-tgt_subword_nbest TGT_SUBWORD_NBEST] [-src_subword_alpha SRC_SUBWORD_ALPHA]
[-tgt_subword_alpha TGT_SUBWORD_ALPHA] [-src_subword_vocab SRC_SUBWORD_VOCAB] [-tgt_subword_vocab TGT_SUBWORD_VOCAB]
[-src_vocab_threshold SRC_VOCAB_THRESHOLD] [-tgt_vocab_threshold TGT_VOCAB_THRESHOLD] [-src_subword_type {none,sentencepiece,bpe}]
[-tgt_subword_type {none,sentencepiece,bpe}] [-src_onmttok_kwargs SRC_ONMTTOK_KWARGS] [-tgt_onmttok_kwargs TGT_ONMTTOK_KWARGS]
[--share_decoder_embeddings] [--share_embeddings] [--enable_embeddingless] [--position_encoding] [-update_vocab]
[--feat_merge {concat,sum,mlp}] [--feat_vec_size FEAT_VEC_SIZE] [--feat_vec_exponent FEAT_VEC_EXPONENT] [-model_task {seq2seq,lm}]
[--model_type {text}] [--model_dtype {fp32,fp16}] [--encoder_type {mean,transformer}] [--decoder_type {transformer}] [--layers LAYERS]
[--enc_layers ENC_LAYERS [ENC_LAYERS ...]] [--dec_layers DEC_LAYERS [DEC_LAYERS ...]] [--model_dim MODEL_DIM]
[--pos_ffn_activation_fn {relu,gelu}] [-normformer] [--bridge] [--bridge_extra_node BRIDGE_EXTRA_NODE] [--bidir_edges BIDIR_EDGES]
[--state_dim STATE_DIM] [--n_edge_types N_EDGE_TYPES] [--n_node N_NODE] [--n_steps N_STEPS] [--src_ggnn_size SRC_GGNN_SIZE]
[--global_attention {dot,general,mlp,none}] [--global_attention_function {softmax}] [--self_attn_type SELF_ATTN_TYPE]
[--max_relative_positions MAX_RELATIVE_POSITIONS] [--heads HEADS] [--transformer_ff TRANSFORMER_FF] [--aan_useffn]
[--lambda_align LAMBDA_ALIGN] [--alignment_layer ALIGNMENT_LAYER] [--alignment_heads ALIGNMENT_HEADS] [--full_context_alignment]
[--copy_attn] [--copy_attn_type {dot,general,mlp,none}] [--generator_function {softmax}] [--copy_attn_force] [--reuse_copy_attn]
[--copy_loss_by_seqlength] [--coverage_attn] [--lambda_coverage LAMBDA_COVERAGE] [--loss_scale LOSS_SCALE] [--apex_opt_level {O0,O1,O2,O3}]
[--hidden_ab_size HIDDEN_AB_SIZE] [--ab_fixed_length AB_FIXED_LENGTH] [--ab_layers [{lin,simple,transformer,perceiver,feedforward} ...]]
[--ab_layer_norm {none,rmsnorm,layernorm}] [-adapters ADAPTERS] [--data_type DATA_TYPE] [--save_model SAVE_MODEL] [--save_all_gpus]
[--save_checkpoint_steps SAVE_CHECKPOINT_STEPS] [--keep_checkpoint KEEP_CHECKPOINT] [--gpuid [GPUID ...]] [--gpu_ranks [GPU_RANKS ...]]
[--n_nodes N_NODES] --node_rank NODE_RANK [--world_size WORLD_SIZE] [--gpu_backend GPU_BACKEND] [--gpu_verbose_level GPU_VERBOSE_LEVEL]
[--master_ip MASTER_IP] [--master_port MASTER_PORT] [--queue_size QUEUE_SIZE] [--seed SEED] [--param_init PARAM_INIT] [--param_init_glorot]
[--train_from TRAIN_FROM] [--reset_optim {none,all,states,keep_states}] [--pre_word_vecs_enc PRE_WORD_VECS_ENC]
[--pre_word_vecs_dec PRE_WORD_VECS_DEC] [--freeze_word_vecs_enc] [--freeze_word_vecs_dec] [--batch_size BATCH_SIZE]
[--batch_size_multiple BATCH_SIZE_MULTIPLE] [--batch_type {sents,tokens}] [--normalization {sents,tokens}]
[--accum_count ACCUM_COUNT [ACCUM_COUNT ...]] [--accum_steps ACCUM_STEPS [ACCUM_STEPS ...]]
[--task_distribution_strategy {weighted_sampling,roundrobin}] [--valid_steps VALID_STEPS] [--valid_batch_size VALID_BATCH_SIZE]
[--max_generator_batches MAX_GENERATOR_BATCHES] [--train_steps TRAIN_STEPS] [--single_pass] [--epochs EPOCHS]
[--early_stopping EARLY_STOPPING] [--early_stopping_criteria [EARLY_STOPPING_CRITERIA ...]]
[--optim {sgd,adagrad,adadelta,adam,adamw,adafactor,fusedadam}] [--adagrad_accumulator_init ADAGRAD_ACCUMULATOR_INIT]
[--max_grad_norm MAX_GRAD_NORM] [--weight_decay WEIGHT_DECAY] [--dropout DROPOUT [DROPOUT ...]]
[--attention_dropout ATTENTION_DROPOUT [ATTENTION_DROPOUT ...]] [--dropout_steps DROPOUT_STEPS [DROPOUT_STEPS ...]]
[--truncated_decoder TRUNCATED_DECODER] [--adam_beta1 ADAM_BETA1] [--adam_beta2 ADAM_BETA2] [--label_smoothing LABEL_SMOOTHING]
[--average_decay AVERAGE_DECAY] [--average_every AVERAGE_EVERY] [--learning_rate LEARNING_RATE] [--learning_rate_decay LEARNING_RATE_DECAY]
[--start_decay_steps START_DECAY_STEPS] [--decay_steps DECAY_STEPS] [--decay_method {noam,noamwd,rsqrt,linear_warmup,none}]
[--warmup_steps WARMUP_STEPS] [--log_file LOG_FILE] [--structured_log_file STRUCTURED_LOG_FILE]
[--log_file_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET,50,40,30,20,10,0}] [--verbose] [--report_every REPORT_EVERY]
[--exp_host EXP_HOST] [--exp EXP] [--tensorboard] [--tensorboard_log_dir TENSORBOARD_LOG_DIR] [--report_stats_from_parameters]
[-pool_size POOL_SIZE] [-n_buckets N_BUCKETS]
Configuration¶
- -config, --config
Path of the main YAML config file.
- -save_config, --save_config
Path where to save the config.
Data/Tasks¶
- -tasks, --tasks
List of datasets and their specifications. See examples/*.yaml for further details, and the sketch at the end of this section.
- -skip_empty_level, --skip_empty_level
Possible choices: silent, warning, error
Action to take when an empty example is encountered. silent: silently ignore/skip the empty example; warning: log a warning when ignoring/skipping an empty example; error: raise an error and stop execution when an empty example is encountered.
Default: “warning”
- -mammoth_transforms, --mammoth_transforms
Possible choices: prefix, denoising, filtertoolong, filterwordratio, filterrepetitions, filterterminalpunct, filternonzeronumerals, filterfeats, inferfeats, switchout, tokendrop, tokenmask, sentencepiece, bpe, onmt_tokenize
Default transform pipeline to apply to the data. Can be specified per corpus/task to override.
Default: []
- -save_data, --save_data
Output base path for objects that will be saved (vocab, transforms, embeddings, …).
- -overwrite, --overwrite
Overwrite existing objects if any.
Default: False
- -n_sample, --n_sample
Stop after saving this number of transformed samples per corpus. Can be -1, 0, or N > 0. Set to -1 to process the full corpus, or 0 to skip.
Default: 0
- -dump_transforms, --dump_transforms
Dump transforms to disk as *.transforms.pt. -save_data must be set and is used as the saving prefix.
Default: False
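In practice, -tasks is usually given in the YAML config file passed via -config rather than on the command line. The sketch below illustrates one plausible way a single translation task could be declared; the task name and key names (src_tgt, path_src, path_tgt, transforms) are assumptions for illustration and should be checked against examples/*.yaml for your version:

    tasks:
      train_en-de:
        src_tgt: en-de
        path_src: data/train.en
        path_tgt: data/train.de
        transforms: [sentencepiece, filtertoolong]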
Vocab¶
- -src_vocab, --src_vocab
Path to src (or shared) vocabulary file. Format: one <word> or <word> <count> per line.
- -tgt_vocab, --tgt_vocab
Path to tgt vocabulary file. Format: one <word> or <word> <count> per line.
- -share_vocab, --share_vocab
Share source and target vocabulary.
Default: False
- -vocab_paths, --vocab_paths
Path to a file listing vocabularies, one per line: the side (encoder or decoder), the language name, and the path to the vocabulary file, separated by tabs. See the sketch at the end of this section.
- -src_feats_vocab, --src_feats_vocab
List of paths to src features vocabulary files. File format: one <word> or <word> <count> per line.
- -src_vocab_size, --src_vocab_size
Maximum size of the source vocabulary.
Default: 50000
- -tgt_vocab_size, --tgt_vocab_size
Maximum size of the target vocabulary
Default: 50000
- -vocab_size_multiple, --vocab_size_multiple
Make the vocabulary size a multiple of this value.
Default: 1
- -src_words_min_frequency, --src_words_min_frequency
Discard source words with lower frequency.
Default: 0
- -tgt_words_min_frequency, --tgt_words_min_frequency
Discard target words with lower frequency.
Default: 0
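For illustration, a vocabulary file for -src_vocab or -tgt_vocab lists one entry per line in the <word> or <word> <count> format described above, for example:

    the 104532
    of 75310
    , 68214

A -vocab_paths file lists one tab-separated entry per line: the side, the language name, and the vocabulary path. The side labels, language codes, and paths below are illustrative assumptions only:

    enc	en	vocab/en.vocab
    dec	de	vocab/de.vocab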
Pruning¶
- --src_seq_length_trunc, -src_seq_length_trunc
Truncate source sequence length.
- --tgt_seq_length_trunc, -tgt_seq_length_trunc
Truncate target sequence length.
Embeddings¶
- -both_embeddings, --both_embeddings
Path to the embeddings file to use for both source and target tokens.
- -src_embeddings, --src_embeddings
Path to the embeddings file to use for source tokens.
- -tgt_embeddings, --tgt_embeddings
Path to the embeddings file to use for target tokens.
- -embeddings_type, --embeddings_type
Possible choices: GloVe, word2vec
Type of embeddings file.
Transform/Denoising AE¶
- --permute_sent_ratio, -permute_sent_ratio
Permute this proportion of sentences (boundaries defined by [‘.’, ‘?’, ‘!’]) in all inputs.
Default: 0.0
- --rotate_ratio, -rotate_ratio
Rotate this proportion of inputs.
Default: 0.0
- --insert_ratio, -insert_ratio
Insert this proportion of additional random tokens.
Default: 0.0
- --random_ratio, -random_ratio
Instead of using <mask>, use a random token this often. Incompatible with MASS.
Default: 0.0
- --mask_ratio, -mask_ratio
Fraction of words/subwords that will be masked.
Default: 0.0
- --mask_length, -mask_length
Possible choices: subword, word, span-poisson
Length of masking window to apply.
Default: “subword”
- --poisson_lambda, -poisson_lambda
Lambda for Poisson distribution to sample span length if -mask_length set to span-poisson.
Default: 3.0
- --replace_length, -replace_length
Possible choices: -1, 0, 1
When masking N tokens, replace with 0, 1, or N tokens. (use -1 for N)
Default: -1
- --denoising_objective
Possible choices: bart, mass
Choose between BART-style and MASS-style denoising objectives (see the sketch at the end of this section).
Default: “bart”
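As a rough sketch of how these options combine, the YAML fragment below enables BART-style denoising with span masking. It assumes, as is usual for this config format, that each flag can be given as a YAML key of the same name; the values are illustrative rather than recommended:

    mammoth_transforms: [denoising]
    denoising_objective: bart
    mask_ratio: 0.35
    mask_length: span-poisson
    poisson_lambda: 3.5
    replace_length: 1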
Transform/Filter¶
- --src_seq_length, -src_seq_length
Maximum source sequence length.
Default: 200
- --tgt_seq_length, -tgt_seq_length
Maximum target sequence length.
Default: 200
- --word_ratio_threshold, -word_ratio_threshold
Threshold for discarding sentences based on word ratio.
Default: 3
- --rep_threshold, -rep_threshold
Number of times the substring is repeated.
Default: 2
- --rep_min_len, -rep_min_len
Minimum length of the repeated pattern.
Default: 3
- --rep_max_len, -rep_max_len
Maximum length of the repeated pattern.
Default: 100
- --punct_threshold, -punct_threshold
Minimum penalty score for discarding sentences based on their terminal punctuation signs
Default: -2
- --nonzero_threshold, -nonzero_threshold
Threshold for discarding sentence pairs based on a comparison of the numerals between the segments, with zeros removed.
Default: 0.5
Transform/InferFeats¶
- --reversible_tokenization, -reversible_tokenization
Possible choices: joiner, spacer
Type of reversible tokenization applied by the tokenizer.
Default: “joiner”
- --prior_tokenization, -prior_tokenization
Whether the input has already been tokenized.
Default: False
Transform/SwitchOut¶
- -switchout_temperature, --switchout_temperature
Sampling temperature for SwitchOut. \(\tau^{-1}\) in [WPDN18]. Smaller value makes data more diverse.
Default: 1.0
Transform/Token_Drop¶
- -tokendrop_temperature, --tokendrop_temperature
Sampling temperature for token deletion.
Default: 1.0
Transform/Token_Mask¶
- -tokenmask_temperature, --tokenmask_temperature
Sampling temperature for token masking.
Default: 1.0
Transform/Subword/Common¶
Attention
Common options shared by all subword transforms, including options to indicate the subword model path, subword regularization / BPE-dropout, and vocabulary restriction. A combined sketch appears at the end of this section.
- -src_subword_model, --src_subword_model
Path of subword model for src (or shared).
- -tgt_subword_model, --tgt_subword_model
Path of subword model for tgt.
- -src_subword_nbest, --src_subword_nbest
Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (source side)
Default: 1
- -tgt_subword_nbest, --tgt_subword_nbest
Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (target side)
Default: 1
- -src_subword_alpha, --src_subword_alpha
Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)
Default: 0
- -tgt_subword_alpha, --tgt_subword_alpha
Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)
Default: 0
- -src_subword_vocab, --src_subword_vocab
Path to the vocabulary file for src subword. Format: <word> <count> per line.
Default: “”
- -tgt_subword_vocab, --tgt_subword_vocab
Path to the vocabulary file for tgt subword. Format: <word> <count> per line.
Default: “”
- -src_vocab_threshold, --src_vocab_threshold
Only produce src subword in src_subword_vocab with frequency >= src_vocab_threshold.
Default: 0
- -tgt_vocab_threshold, --tgt_vocab_threshold
Only produce tgt subword in tgt_subword_vocab with frequency >= tgt_vocab_threshold.
Default: 0
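A combined sketch of the subword options above, applying SentencePiece with unigram sampling on the source side. The model paths are placeholders and the values are illustrative, assuming each flag maps to a YAML key of the same name:

    mammoth_transforms: [sentencepiece]
    src_subword_model: spm/en.model
    tgt_subword_model: spm/de.model
    src_subword_nbest: 64
    src_subword_alpha: 0.1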
Transform/Subword/ONMTTOK¶
- -src_subword_type, --src_subword_type
Possible choices: none, sentencepiece, bpe
Type of subword model for src (or shared) in pyonmttok.
Default: “none”
- -tgt_subword_type, --tgt_subword_type
Possible choices: none, sentencepiece, bpe
Type of subword model for tgt in pyonmttok.
Default: “none”
- -src_onmttok_kwargs, --src_onmttok_kwargs
Other pyonmttok options for src, given as a dict string, excluding the subword-related options listed earlier (see the sketch at the end of this section).
Default: “{‘mode’: ‘none’}”
- -tgt_onmttok_kwargs, --tgt_onmttok_kwargs
Other pyonmttok options for tgt, given as a dict string, excluding the subword-related options listed earlier.
Default: “{‘mode’: ‘none’}”
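For example, BPE via pyonmttok could be configured roughly as below. Aggressive tokenization mode and joiner annotation are standard pyonmttok options, but the exact set of accepted keys should be checked against the pyonmttok documentation; the paths are placeholders, and flags are assumed to map to YAML keys of the same name:

    mammoth_transforms: [onmt_tokenize]
    src_subword_type: bpe
    src_subword_model: bpe/codes.en
    src_onmttok_kwargs: "{'mode': 'aggressive', 'joiner_annotate': True}"
    tgt_subword_type: bpe
    tgt_subword_model: bpe/codes.de
    tgt_onmttok_kwargs: "{'mode': 'aggressive', 'joiner_annotate': True}"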
Model-Embeddings¶
- --share_decoder_embeddings, -share_decoder_embeddings
Use a shared weight matrix for the input and output word embeddings in the decoder.
Default: False
- --share_embeddings, -share_embeddings
Share the word embeddings between encoder and decoder. Need to use shared dictionary for this option.
Default: False
- --enable_embeddingless, -enable_embeddingless
Enable the use of byte-based embeddingless models (Shaham et al., 2021): https://aclanthology.org/2021.naacl-main.17/
Default: False
- --position_encoding, -position_encoding
Use a sinusoidal encoding to mark relative word positions. Necessary for non-RNN style models.
Default: False
- -update_vocab, --update_vocab
Update existing source and target vocabularies.
Default: False
Model-Embedding Features¶
- --feat_merge, -feat_merge
Possible choices: concat, sum, mlp
Merge action for incorporating feature embeddings. Options: [concat|sum|mlp].
Default: “concat”
- --feat_vec_size, -feat_vec_size
If specified, feature embedding sizes will be set to this. Otherwise, feat_vec_exponent will be used.
Default: -1
- --feat_vec_exponent, -feat_vec_exponent
If -feat_vec_size is not set, feature embedding sizes will be set to N^feat_vec_exponent, where N is the number of values the feature takes.
Default: 0.7
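As a worked example: with the default feat_vec_exponent of 0.7 and feat_vec_size left at -1, a feature taking N = 40 distinct values would receive an embedding of size \(40^{0.7} \approx 13\) (the exact rounding is implementation-dependent).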
Model- Task¶
- -model_task, --model_task
Possible choices: seq2seq, lm
Type of task for the model: either seq2seq or lm.
Default: “seq2seq”
Model- Encoder-Decoder¶
- --model_type, -model_type
Possible choices: text
Type of source model to use. Allows the system to incorporate non-text inputs. Options are [text].
Default: “text”
- --model_dtype, -model_dtype
Possible choices: fp32, fp16
Data type of the model.
Default: “fp32”
- --encoder_type, -encoder_type
Possible choices: mean, transformer
Type of encoder layer to use. Non-RNN layers are experimental. Options are [mean|transformer].
Default: “transformer”
- --decoder_type, -decoder_type
Possible choices: transformer
Type of decoder layer to use. Non-RNN layers are experimental. Options are [transformer].
Default: “transformer”
- --layers, -layers
Deprecated
Default: -1
- --enc_layers, -enc_layers
Number of layers in each encoder
- --dec_layers, -dec_layers
Number of layers in each decoder
- --model_dim, -model_dim
Dimension of the model hidden states (model dimension).
Default: -1
- --pos_ffn_activation_fn, -pos_ffn_activation_fn
Possible choices: relu, gelu
The activation function to use in the PositionwiseFeedForward layer. Choices: relu, gelu. Defaults to relu.
Default: “relu”
- -normformer, --normformer
NormFormer-style normalization
Default: False
- --bridge, -bridge
Have an additional layer between the last encoder state and the first decoder state
Default: False
- --bridge_extra_node, -bridge_extra_node
Graph encoder bridges only extra node to decoder as input
Default: True
- --bidir_edges, -bidir_edges
Graph encoder autogenerates bidirectional edges
Default: True
- --state_dim, -state_dim
Number of state dimensions in the graph encoder
Default: 512
- --n_edge_types, -n_edge_types
Number of edge types in the graph encoder
Default: 2
- --n_node, -n_node
Number of nodes in the graph encoder
Default: 2
- --n_steps, -n_steps
Number of steps to advance graph encoder
Default: 2
- --src_ggnn_size, -src_ggnn_size
Vocab size plus feature space for embedding input
Default: 0
Model- Attention¶
- --global_attention, -global_attention
Possible choices: dot, general, mlp, none
The attention type to use: dotprod or general (Luong) or MLP (Bahdanau)
Default: “general”
- --global_attention_function, -global_attention_function
Possible choices: softmax
Default: “softmax”
- --self_attn_type, -self_attn_type
Self attention type in Transformer decoder layer – currently “scaled-dot” or “average”
Default: “scaled-dot”
- --max_relative_positions, -max_relative_positions
Maximum distance between inputs in relative positions representations. For more detailed information, see: https://arxiv.org/pdf/1803.02155.pdf
Default: 0
- --heads, -heads
Number of heads for transformer self-attention
Default: 8
- --transformer_ff, -transformer_ff
Size of hidden transformer feed-forward
Default: 2048
- --aan_useffn, -aan_useffn
Turn on the FFN layer in the AAN decoder
Default: False
Model - Alignment¶
- --lambda_align, -lambda_align
Lambda value for the alignment loss of Garg et al. (2019). For more detailed information, see: https://arxiv.org/abs/1909.02074
Default: 0.0
- --alignment_layer, -alignment_layer
Layer number which has to be supervised.
Default: -3
- --alignment_heads, -alignment_heads
Number of cross-attention heads per layer to supervise with.
Default: 0
- --full_context_alignment, -full_context_alignment
Whether alignment is conditioned on full target context.
Default: False
Generator¶
- --copy_attn, -copy_attn
Train copy attention layer.
Default: False
- --copy_attn_type, -copy_attn_type
Possible choices: dot, general, mlp, none
The copy attention type to use. Leave as None to use the same as -global_attention.
- --generator_function, -generator_function
Possible choices: softmax
Which function to use for generating probabilities over the target vocabulary (choices: softmax)
Default: “softmax”
- --copy_attn_force, -copy_attn_force
When available, train to copy.
Default: False
- --reuse_copy_attn, -reuse_copy_attn
Reuse standard attention for copy
Default: False
- --copy_loss_by_seqlength, -copy_loss_by_seqlength
Divide copy loss by length of sequence
Default: False
- --coverage_attn, -coverage_attn
Train a coverage attention layer.
Default: False
- --lambda_coverage, -lambda_coverage
Lambda value for the coverage loss of See et al. (2017).
Default: 0.0
- --loss_scale, -loss_scale
For FP16 training, the static loss scale to use. If not set, the loss scale is dynamically computed.
Default: 0
- --apex_opt_level, -apex_opt_level
Possible choices: O0, O1, O2, O3
For FP16 training, the opt_level to use. See https://nvidia.github.io/apex/amp.html#opts-levels.
Default: “O1”
Attention bridge¶
- --hidden_ab_size, -hidden_ab_size
Size of attention bridge hidden states
Default: 2048
- --ab_fixed_length, -ab_fixed_length
Number of attention heads in attention bridge (fixed length of output)
Default: 50
- --ab_layers, -ab_layers
Possible choices: lin, simple, transformer, perceiver, feedforward
Composition of the attention bridge (see the sketch at the end of this section).
Default: []
- --ab_layer_norm, -ab_layer_norm
Possible choices: none, rmsnorm, layernorm
Use layer normalization after lin, simple and feedforward bridge layers
Default: “layernorm”
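As an illustrative sketch (again assuming each flag maps to a YAML key of the same name), a two-layer transformer attention bridge with a fixed output length of 50 could be configured as:

    ab_layers: [transformer, transformer]
    ab_fixed_length: 50
    hidden_ab_size: 2048
    ab_layer_norm: layernorm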
Adapters¶
- -adapters, --adapters
Adapter specifications
General¶
- --data_type, -data_type
Type of the source input. Options are [text].
Default: “text”
- --save_model, -save_model
Model filename (the model will be saved as <save_model>_N.pt, where N is the number of steps).
Default: “model”
- --save_all_gpus, -save_all_gpus
Whether to store a model from every gpu (in addition to the modules)
Default: False
- --save_checkpoint_steps, -save_checkpoint_steps
Save a checkpoint every X steps
Default: 5000
- --keep_checkpoint, -keep_checkpoint
Keep X checkpoints (negative: keep all)
Default: -1
- --gpuid, -gpuid
Deprecated; see world_size and gpu_ranks.
Default: []
- --gpu_ranks, -gpu_ranks
List of ranks of each process.
Default: []
- --n_nodes, -n_nodes
Total number of training nodes.
Default: 1
- --node_rank, -node_rank
Index of the current node (0-based). When using non-distributed training (CPU, single GPU), set to 0. See the sketch at the end of this section.
- --world_size, -world_size
Total number of distributed processes.
Default: 1
- --gpu_backend, -gpu_backend
Type of torch distributed backend
Default: “nccl”
- --gpu_verbose_level, -gpu_verbose_level
Gives more info on each process per GPU.
Default: 0
- --master_ip, -master_ip
IP of master for torch.distributed training.
Default: “localhost”
- --master_port, -master_port
Port of master for torch.distributed training.
Default: 10000
- --queue_size, -queue_size
Size of queue for each process in producer/consumer
Default: 40
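As a concrete single-node sketch, training on one machine with four GPUs could use the following settings (values illustrative; multi-node setups additionally need master_ip, master_port, and a per-node -node_rank):

    world_size: 4
    n_nodes: 1
    gpu_ranks: [0, 1, 2, 3]

with the process started as, e.g., python train.py -config config.yaml -node_rank 0.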
Reproducibility¶
- --seed, -seed
Set random seed used for better reproducibility between experiments.
Default: -1
Initialization¶
- --param_init, -param_init
Parameters are initialized over a uniform distribution with support (-param_init, param_init). Use 0 to disable initialization.
Default: 0.1
- --param_init_glorot, -param_init_glorot
Init parameters with xavier_uniform. Required for transformer.
Default: False
- --train_from, -train_from
If training from a checkpoint then this is the path to the pretrained model’s state_dict.
Default: “”
- --reset_optim, -reset_optim
Possible choices: none, all, states, keep_states
Which parts of the optimizer state to reset when continuing training with train_from.
Default: “none”
- --pre_word_vecs_enc, -pre_word_vecs_enc
If a valid path is specified, then this will load pretrained word embeddings on the encoder side. See README for specific formatting instructions.
- --pre_word_vecs_dec, -pre_word_vecs_dec
If a valid path is specified, then this will load pretrained word embeddings on the decoder side. See README for specific formatting instructions.
- --freeze_word_vecs_enc, -freeze_word_vecs_enc
Freeze word embeddings on the encoder side.
Default: False
- --freeze_word_vecs_dec, -freeze_word_vecs_dec
Freeze word embeddings on the decoder side.
Default: False
Optimization- Type¶
- --batch_size, -batch_size
Maximum batch size for training
Default: 64
- --batch_size_multiple, -batch_size_multiple
Batch size multiple for token batches.
- --batch_type, -batch_type
Possible choices: sents, tokens
Batch grouping for batch_size. Standard is sents; tokens will do dynamic batching.
Default: “sents”
- --normalization, -normalization
Possible choices: sents, tokens
Normalization method of the gradient.
Default: “sents”
- --accum_count, -accum_count
Accumulate gradients over this many batches before updating, approximately equivalent to an effective batch size of batch_size * accum_count. Recommended for Transformer.
Default: [1]
- --accum_steps, -accum_steps
Steps at which accum_count values change (see the sketch at the end of this section).
Default: [0]
- --task_distribution_strategy, -task_distribution_strategy
Possible choices: weighted_sampling, roundrobin
Strategy for the order in which tasks (e.g. language pairs) are scheduled for training
Default: “weighted_sampling”
- --valid_steps, -valid_steps
Perform validation every X steps.
Default: 10000
- --valid_batch_size, -valid_batch_size
Maximum batch size for validation
Default: 32
- --max_generator_batches, -max_generator_batches
Maximum batches of words in a sequence to run the generator on in parallel. Higher is faster, but uses more memory. Set to 0 to disable.
Default: 32
- --train_steps, -train_steps
Number of training steps
Default: 100000
- --single_pass, -single_pass
Make a single pass over the training dataset.
Default: False
- --epochs, -epochs
Deprecated; see train_steps.
Default: 0
- --early_stopping, -early_stopping
Number of validation steps without improvement before training stops early.
Default: 0
- --early_stopping_criteria, -early_stopping_criteria
Criteria to use for early stopping.
- --optim, -optim
Possible choices: sgd, adagrad, adadelta, adam, adamw, adafactor, fusedadam
Optimization method.
Default: “sgd”
- --adagrad_accumulator_init, -adagrad_accumulator_init
Initializes the accumulator values in adagrad. Mirrors the initial_accumulator_value option in the tensorflow adagrad (use 0.1 for their default).
Default: 0
- --max_grad_norm, -max_grad_norm
If the norm of the gradient vector exceeds this, renormalize it to have the norm equal to max_grad_norm
Default: 5
- --weight_decay, -weight_decay
L2 penalty (weight decay) regularizer
Default: 0
- --dropout, -dropout
Dropout probability applied within the model layers.
Default: [0.3]
- --attention_dropout, -attention_dropout
Attention Dropout probability.
Default: [0.1]
- --dropout_steps, -dropout_steps
Steps at which dropout changes.
Default: [0]
- --truncated_decoder, -truncated_decoder
Truncated bptt.
Default: 0
- --adam_beta1, -adam_beta1
The beta1 parameter used by Adam. Almost without exception a value of 0.9 is used in the literature, seemingly giving good results, so we would discourage changing this value from the default without due consideration.
Default: 0.9
- --adam_beta2, -adam_beta2
The beta2 parameter used by Adam. Typically a value of 0.999 is recommended, as this is the value suggested by the original paper describing Adam, and is also the value adopted in other frameworks such as Tensorflow and Keras, i.e. see: https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer or https://keras.io/optimizers/ . Whereas recently the paper “Attention is All You Need” suggested a value of 0.98 for beta2, this parameter may not work well for normal models / default baselines.
Default: 0.999
- --label_smoothing, -label_smoothing
Label smoothing value epsilon. Probabilities of all non-true labels will be smoothed by epsilon / (vocab_size - 1). Set to zero to turn off label smoothing. For more detailed information, see: https://arxiv.org/abs/1512.00567
Default: 0.0
- --average_decay, -average_decay
Moving average decay. Set to other than 0 (e.g. 1e-4) to activate. Similar to Marian NMT implementation: http://www.aclweb.org/anthology/P18-4020 For more detail on Exponential Moving Average: https://en.wikipedia.org/wiki/Moving_average
Default: 0
- --average_every, -average_every
Step for moving average. Default is every update, if -average_decay is set.
Default: 1
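The *_steps options pair with their value lists positionally: each value takes effect at the corresponding step. A sketch (illustrative values, flags assumed to map to YAML keys of the same name):

    accum_count: [1, 4]
    accum_steps: [0, 10000]
    dropout: [0.3, 0.1]
    dropout_steps: [0, 10000]

Here a single batch is accumulated per update and dropout is 0.3 until step 10000, after which gradients are accumulated over 4 batches and dropout drops to 0.1.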
Optimization- Rate¶
- --learning_rate, -learning_rate
Starting learning rate. Recommended settings: sgd = 1, adagrad = 0.1, adadelta = 1, adam = 0.001
Default: 1.0
- --learning_rate_decay, -learning_rate_decay
If update_learning_rate, decay learning rate by this much if steps have gone past start_decay_steps
Default: 0.5
- --start_decay_steps, -start_decay_steps
Start decaying every decay_steps after start_decay_steps
Default: 50000
- --decay_steps, -decay_steps
Decay every decay_steps
Default: 10000
- --decay_method, -decay_method
Possible choices: noam, noamwd, rsqrt, linear_warmup, none
Use a custom decay rate.
Default: “none”
- --warmup_steps, -warmup_steps
Number of warmup steps for custom decay.
Default: 4000
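For reference, the classic noam schedule of Vaswani et al. (2017), which the noam choice of -decay_method follows, scales the learning rate at step \(s\) as \(\mathrm{lr}(s) = \mathrm{learning\_rate} \cdot \mathrm{model\_dim}^{-0.5} \cdot \min(s^{-0.5},\ s \cdot \mathrm{warmup\_steps}^{-1.5})\): the rate rises linearly for warmup_steps updates and then decays as the inverse square root of the step. The exact constant factors applied by this implementation may differ, so treat this as a guide rather than a specification.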
Logging¶
- --log_file, -log_file
Output logs to a file under this path.
Default: “”
- --structured_log_file, -structured_log_file
Output machine-readable structured logs to a file under this path.
Default: “”
- --log_file_level, -log_file_level
Possible choices: CRITICAL, ERROR, WARNING, INFO, DEBUG, NOTSET, 50, 40, 30, 20, 10, 0
Default: “0”
- --verbose, -verbose
Print data loading and statistics for all processes (by default, only the first process is logged).
Default: False
- --report_every, -report_every
Print stats at this interval.
Default: 50
- --exp_host, -exp_host
Send logs to this crayon server.
Default: “”
- --exp, -exp
Name of the experiment for logging.
Default: “”
- --tensorboard, -tensorboard
Use tensorboard for visualization during training. Must have the library tensorboard >= 1.14.
Default: False
- --tensorboard_log_dir, -tensorboard_log_dir
Log directory for Tensorboard. This is also the name of the run.
Default: “runs/mammoth”
- --report_stats_from_parameters, -report_stats_from_parameters
Report parameter-level statistics in tensorboard. This has a huge impact on performance: only use for debugging.
Default: False
Dynamic data¶
- -pool_size, --pool_size
Number of examples to dynamically pool before batching.
Default: 2048
- -n_buckets, --n_buckets
Maximum number of bins for batching.
Default: 1024