Translate¶
translate.py
usage: translate.py [-h] [-config CONFIG] [-save_config SAVE_CONFIG] -tasks TASKS [-skip_empty_level {silent,warning,error}]
[-mammoth_transforms {prefix,denoising,filtertoolong,filterwordratio,filterrepetitions,filterterminalpunct,filternonzeronumerals,filterfeats,inferfeats,switchout,tokendrop,tokenmask,sentencepiece,bpe,onmt_tokenize} [{prefix,denoising,filtertoolong,filterwordratio,filterrepetitions,filterterminalpunct,filternonzeronumerals,filterfeats,inferfeats,switchout,tokendrop,tokenmask,sentencepiece,bpe,onmt_tokenize} ...]]
[-save_data SAVE_DATA] [-overwrite] [-n_sample N_SAMPLE] [-dump_transforms] -src_vocab SRC_VOCAB [-tgt_vocab TGT_VOCAB] [-share_vocab]
[-vocab_paths VOCAB_PATHS] [-src_feats_vocab SRC_FEATS_VOCAB] [-src_vocab_size SRC_VOCAB_SIZE] [-tgt_vocab_size TGT_VOCAB_SIZE]
[-vocab_size_multiple VOCAB_SIZE_MULTIPLE] [-src_words_min_frequency SRC_WORDS_MIN_FREQUENCY]
[-tgt_words_min_frequency TGT_WORDS_MIN_FREQUENCY] [--src_seq_length_trunc SRC_SEQ_LENGTH_TRUNC]
[--tgt_seq_length_trunc TGT_SEQ_LENGTH_TRUNC] [-both_embeddings BOTH_EMBEDDINGS] [-src_embeddings SRC_EMBEDDINGS]
[-tgt_embeddings TGT_EMBEDDINGS] [-embeddings_type {GloVe,word2vec}] [--permute_sent_ratio PERMUTE_SENT_RATIO]
[--rotate_ratio ROTATE_RATIO] [--insert_ratio INSERT_RATIO] [--random_ratio RANDOM_RATIO] [--mask_ratio MASK_RATIO]
[--mask_length {subword,word,span-poisson}] [--poisson_lambda POISSON_LAMBDA] [--replace_length {-1,0,1}]
[--denoising_objective {bart,mass}] [--src_seq_length SRC_SEQ_LENGTH] [--tgt_seq_length TGT_SEQ_LENGTH]
[--word_ratio_threshold WORD_RATIO_THRESHOLD] [--rep_threshold REP_THRESHOLD] [--rep_min_len REP_MIN_LEN] [--rep_max_len REP_MAX_LEN]
[--punct_threshold PUNCT_THRESHOLD] [--nonzero_threshold NONZERO_THRESHOLD] [--reversible_tokenization {joiner,spacer}]
[--prior_tokenization] [-switchout_temperature SWITCHOUT_TEMPERATURE] [-tokendrop_temperature TOKENDROP_TEMPERATURE]
[-tokenmask_temperature TOKENMASK_TEMPERATURE] [-src_subword_model SRC_SUBWORD_MODEL] [-tgt_subword_model TGT_SUBWORD_MODEL]
[-src_subword_nbest SRC_SUBWORD_NBEST] [-tgt_subword_nbest TGT_SUBWORD_NBEST] [-src_subword_alpha SRC_SUBWORD_ALPHA]
[-tgt_subword_alpha TGT_SUBWORD_ALPHA] [-src_subword_vocab SRC_SUBWORD_VOCAB] [-tgt_subword_vocab TGT_SUBWORD_VOCAB]
[-src_vocab_threshold SRC_VOCAB_THRESHOLD] [-tgt_vocab_threshold TGT_VOCAB_THRESHOLD] [-src_subword_type {none,sentencepiece,bpe}]
[-tgt_subword_type {none,sentencepiece,bpe}] [-src_onmttok_kwargs SRC_ONMTTOK_KWARGS] [-tgt_onmttok_kwargs TGT_ONMTTOK_KWARGS] --model
MODEL [MODEL ...] [--fp32] [--int8] [--avg_raw_probs] --task_id TASK_ID [--data_type DATA_TYPE] --src SRC [-src_feats SRC_FEATS]
[--tgt TGT] [--shard_size SHARD_SIZE] [--output OUTPUT] [--report_align] [--report_time] [--beam_size BEAM_SIZE] [--ratio RATIO]
[--random_sampling_topk RANDOM_SAMPLING_TOPK] [--random_sampling_topp RANDOM_SAMPLING_TOPP]
[--random_sampling_temp RANDOM_SAMPLING_TEMP] [--seed SEED] [--length_penalty {none,wu,avg}] [--alpha ALPHA]
[--coverage_penalty {none,wu,summary}] [--beta BETA] [--stepwise_penalty] [--min_length MIN_LENGTH] [--max_length MAX_LENGTH]
[--max_sent_length] [--block_ngram_repeat BLOCK_NGRAM_REPEAT] [--ignore_when_blocking IGNORE_WHEN_BLOCKING [IGNORE_WHEN_BLOCKING ...]]
[--replace_unk] [--ban_unk_token] [--phrase_table PHRASE_TABLE] [--log_file LOG_FILE] [--structured_log_file STRUCTURED_LOG_FILE]
[--log_file_level {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET,50,40,30,20,10,0}] [--verbose] [--attn_debug] [--align_debug]
[--dump_beam DUMP_BEAM] [--n_best N_BEST] [--batch_size BATCH_SIZE] [--batch_type {sents,tokens}] [--gpu GPU]
[--output_model OUTPUT_MODEL]
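For orientation, a minimal invocation might look like the following sketch; the config file, checkpoint, task id, and data paths are all hypothetical, and in practice -tasks and the vocabulary options are usually supplied through the YAML config rather than on the command line.
python translate.py \
    -config config.yaml \
    --model model_step_50000.pt \
    --task_id train_en-de \
    --src test.en \
    --output pred.de \
    --gpu 0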
Configuration¶
- -config, --config
Path of the main YAML config file.
- -save_config, --save_config
Path where to save the config.
Data/Tasks¶
- -tasks, --tasks
List of datasets and their specifications. See examples/*.yaml for further details, and the sketch at the end of this section.
- -skip_empty_level, --skip_empty_level
Possible choices: silent, warning, error
Action to take when an empty example is encountered. silent: skip the example silently; warning: skip it and log a warning; error: raise an error and stop execution.
Default: “warning”
- -mammoth_transforms, --mammoth_transforms
Possible choices: prefix, denoising, filtertoolong, filterwordratio, filterrepetitions, filterterminalpunct, filternonzeronumerals, filterfeats, inferfeats, switchout, tokendrop, tokenmask, sentencepiece, bpe, onmt_tokenize
Default transform pipeline to apply to the data. Can be overridden per corpus in the task configuration (see the sketch at the end of this section).
Default: []
- -save_data, --save_data
Output base path for objects that will be saved (vocab, transforms, embeddings, …).
- -overwrite, --overwrite
Overwrite existing objects if any.
Default: False
- -n_sample, --n_sample
Stop after saving this number of transformed samples per corpus. Accepts -1, 0, or N>0: -1 processes the full corpus, 0 skips sampling.
Default: 0
- -dump_transforms, --dump_transforms
Dump transforms as *.transforms.pt files to disk. -save_data must be set, as it is used as the save prefix.
Default: False
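The task specifications referenced by -tasks live in the YAML config. Below is a hedged sketch of one plausible shape, written as a shell heredoc; the task id, keys, and paths are illustrative assumptions, so consult examples/*.yaml for the authoritative schema.
cat > config.yaml <<'EOF'
tasks:
  train_en-de:                    # hypothetical task id
    src_tgt: en-de                # language pair
    path_src: data/test.en
    path_tgt: data/test.de
    transforms: [sentencepiece]   # per-corpus override of -mammoth_transforms
EOF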
Vocab¶
- -src_vocab, --src_vocab
Path to src (or shared) vocabulary file. Format: one <word> or <word> <count> per line.
- -tgt_vocab, --tgt_vocab
Path to tgt vocabulary file. Format: one <word> or <word> <count> per line.
- -share_vocab, --share_vocab
Share source and target vocabulary.
Default: False
- -vocab_paths, --vocab_paths
Path to a file mapping vocabularies, one tab-separated entry per line: <enc|dec> TAB <language name> TAB <path to the vocab>.
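A hedged sketch of the expected file contents, written with printf so the tab separators are explicit; the language names and paths are illustrative.
# Fields are tab-separated: <enc|dec> TAB <language name> TAB <vocab path>
printf 'enc\ten\tvocabs/vocab.en\n' >  vocab_paths.tsv
printf 'dec\tde\tvocabs/vocab.de\n' >> vocab_paths.tsv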
- -src_feats_vocab, --src_feats_vocab
List of paths to src features vocabulary files. File format: one <word> or <word> <count> per line.
- -src_vocab_size, --src_vocab_size
Maximum size of the source vocabulary.
Default: 50000
- -tgt_vocab_size, --tgt_vocab_size
Maximum size of the target vocabulary.
Default: 50000
- -vocab_size_multiple, --vocab_size_multiple
Make the vocabulary size a multiple of this value.
Default: 1
- -src_words_min_frequency, --src_words_min_frequency
Discard source words with lower frequency.
Default: 0
- -tgt_words_min_frequency, --tgt_words_min_frequency
Discard target words with lower frequency.
Default: 0
Pruning¶
- --src_seq_length_trunc, -src_seq_length_trunc
Truncate source sequence length.
- --tgt_seq_length_trunc, -tgt_seq_length_trunc
Truncate target sequence length.
Embeddings¶
- -both_embeddings, --both_embeddings
Path to the embeddings file to use for both source and target tokens.
- -src_embeddings, --src_embeddings
Path to the embeddings file to use for source tokens.
- -tgt_embeddings, --tgt_embeddings
Path to the embeddings file to use for target tokens.
- -embeddings_type, --embeddings_type
Possible choices: GloVe, word2vec
Type of embeddings file.
Transform/Denoising AE¶
- --permute_sent_ratio, -permute_sent_ratio
Permute this proportion of sentences (boundaries defined by [‘.’, ‘?’, ‘!’]) in all inputs.
Default: 0.0
- --rotate_ratio, -rotate_ratio
Rotate this proportion of inputs.
Default: 0.0
- --insert_ratio, -insert_ratio
Insert this proportion of additional random tokens.
Default: 0.0
- --random_ratio, -random_ratio
Instead of the <mask> token, use a random token this proportion of the time. Incompatible with MASS.
Default: 0.0
- --mask_ratio, -mask_ratio
Fraction of words/subwords that will be masked.
Default: 0.0
- --mask_length, -mask_length
Possible choices: subword, word, span-poisson
Length of masking window to apply.
Default: “subword”
- --poisson_lambda, -poisson_lambda
Lambda for the Poisson distribution used to sample span length if -mask_length is set to span-poisson.
Default: 3.0
- --replace_length, -replace_length
Possible choices: -1, 0, 1
When masking N tokens, replace with 0, 1, or N tokens. (use -1 for N)
Default: -1
- --denoising_objective
Possible choices: bart, mass
Choose between BART-style and MASS-style denoising objectives.
Default: “bart”
Transform/Filter (filtertoolong)¶
- --src_seq_length, -src_seq_length
Maximum source sequence length.
Default: 200
- --tgt_seq_length, -tgt_seq_length
Maximum target sequence length.
Default: 200
Transform/Filter (filterwordratio)¶
- --word_ratio_threshold, -word_ratio_threshold
Threshold for discarding sentence pairs based on their word-count ratio.
Default: 3
Transform/Filter (filterrepetitions)¶
- --rep_threshold, -rep_threshold
Number of times a substring must be repeated before the example is filtered.
Default: 2
- --rep_min_len, -rep_min_len
Minimum length of the repeated pattern.
Default: 3
- --rep_max_len, -rep_max_len
Maximum length of the repeated pattern.
Default: 100
Transform/Filter (filterterminalpunct)¶
- --punct_threshold, -punct_threshold
Minimum penalty score for discarding sentences based on their terminal punctuation marks.
Default: -2
Transform/Filter (filternonzeronumerals)¶
- --nonzero_threshold, -nonzero_threshold
Threshold for discarding sentence pairs based on the match of numerals between the segments, with zeros removed.
Default: 0.5
Transform/InferFeats¶
- --reversible_tokenization, -reversible_tokenization
Possible choices: joiner, spacer
Type of reversible tokenization applied by the tokenizer.
Default: “joiner”
- --prior_tokenization, -prior_tokenization
Whether the input has already been tokenized.
Default: False
Transform/SwitchOut¶
- -switchout_temperature, --switchout_temperature
Sampling temperature for SwitchOut, corresponding to τ^(-1) in [WPDN18]. A smaller value makes the data more diverse.
Default: 1.0
Transform/Token_Drop¶
- -tokendrop_temperature, --tokendrop_temperature
Sampling temperature for token deletion.
Default: 1.0
Transform/Token_Mask¶
- -tokenmask_temperature, --tokenmask_temperature
Sampling temperature for token masking.
Default: 1.0
Transform/Subword/Common¶
Attention
Common options shared by all subword transforms, including options for specifying the subword model path, subword regularization/BPE-dropout, and vocabulary restriction.
- -src_subword_model, --src_subword_model
Path of subword model for src (or shared).
- -tgt_subword_model, --tgt_subword_model
Path of subword model for tgt.
- -src_subword_nbest, --src_subword_nbest
Number of candidates for subword regularization. Applies to unigram sampling; not used by BPE-dropout. (source side)
Default: 1
- -tgt_subword_nbest, --tgt_subword_nbest
Number of candidates for subword regularization. Applies to unigram sampling; not used by BPE-dropout. (target side)
Default: 1
- -src_subword_alpha, --src_subword_alpha
Smoothing parameter for SentencePiece unigram sampling, or dropout probability for BPE-dropout. (source side)
Default: 0
- -tgt_subword_alpha, --tgt_subword_alpha
Smoothing parameter for SentencePiece unigram sampling, or dropout probability for BPE-dropout. (target side)
Default: 0
- -src_subword_vocab, --src_subword_vocab
Path to the vocabulary file for src subword. Format: <word> <count> per line.
Default: “”
- -tgt_subword_vocab, --tgt_subword_vocab
Path to the vocabulary file for tgt subword. Format: <word> <count> per line.
Default: “”
- -src_vocab_threshold, --src_vocab_threshold
Only produce src subwords that appear in src_subword_vocab with frequency >= src_vocab_threshold.
Default: 0
- -tgt_vocab_threshold, --tgt_vocab_threshold
Only produce tgt subwords that appear in tgt_subword_vocab with frequency >= tgt_vocab_threshold.
Default: 0
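A hedged sketch enabling the sentencepiece transform with unigram sampling on the source side; the model path is hypothetical and the remaining required options are elided.
python translate.py ... \
    -mammoth_transforms sentencepiece \
    -src_subword_model spm/source.model \
    -src_subword_nbest 5 \
    -src_subword_alpha 0.1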
Transform/Subword/ONMTTOK¶
- -src_subword_type, --src_subword_type
Possible choices: none, sentencepiece, bpe
Type of subword model for src (or shared) in pyonmttok.
Default: “none”
- -tgt_subword_type, --tgt_subword_type
Possible choices: none, sentencepiece, bpe
Type of subword model for tgt in pyonmttok.
Default: “none”
- -src_onmttok_kwargs, --src_onmttok_kwargs
Other pyonmttok options for src, given as a dict string, excluding the subword-related options listed above.
Default: “{‘mode’: ‘none’}”
- -tgt_onmttok_kwargs, --tgt_onmttok_kwargs
Other pyonmttok options for tgt, given as a dict string, excluding the subword-related options listed above.
Default: “{‘mode’: ‘none’}”
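A hedged sketch of the onmt_tokenize transform with BPE; the codes path is hypothetical, and the kwargs shown ('mode', 'joiner_annotate') are standard pyonmttok options.
python translate.py ... \
    -mammoth_transforms onmt_tokenize \
    -src_subword_type bpe \
    -src_subword_model bpe/source.codes \
    -src_onmttok_kwargs "{'mode': 'aggressive', 'joiner_annotate': True}"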
Model¶
- --model, -model
Path to model .pt file(s). Multiple models can be specified for ensemble decoding.
Default: []
- --fp32, -fp32
Force the model to run in FP32, since FP16 is very slow on GTX 1080 (Ti) GPUs.
Default: False
- --int8, -int8
Enable dynamic 8-bit quantization (CPU only).
Default: False
- --avg_raw_probs, -avg_raw_probs
If this is set, during ensembling scores from different models will be combined by averaging their raw probabilities and then taking the log. Otherwise, the log probabilities will be averaged directly. Necessary for models whose output layers can assign zero probability.
Default: False
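A hedged ensemble-decoding sketch; the checkpoint names are hypothetical, and --avg_raw_probs makes the ensemble average raw probabilities before taking the log.
python translate.py ... \
    --model model_a.pt model_b.pt model_c.pt \
    --avg_raw_probs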
- --task_id, -task_id
Task ID determining which model components to load for translation.
Data¶
- --data_type, -data_type
Type of the source input. Options: [text].
Default: “text”
- --src, -src
Source sequence to decode (one line per sequence).
- -src_feats, --src_feats
Source sequence features (dict format). Ex: {‘feat_0’: ‘../data.txt.feats0’, ‘feat_1’: ‘../data.txt.feats1’}
- --tgt, -tgt
True target sequence (optional).
- --shard_size, -shard_size
Divide src and tgt (if applicable) into multiple smaller files, then build shards, each with shard_size samples except the last. shard_size=0 means no segmentation; shard_size>0 segments the dataset into multiple shards of shard_size samples each.
Default: 10000
- --output, -output
Path to output the predictions (each line will be the decoded sequence).
Default: “pred.txt”
- --report_align, -report_align
Report alignment for each translation.
Default: False
- --report_time, -report_time
Report some translation time metrics.
Default: False
Beam Search¶
- --beam_size, -beam_size
Beam size.
Default: 5
- --ratio, -ratio
Ratio-based beam stopping condition.
Default: -0.0
Random Sampling¶
- --random_sampling_topk, -random_sampling_topk
Set this to -1 to sample from the full distribution, to k>1 to restrict sampling to the k most likely next tokens, or to 1 to use argmax.
Default: 0
- --random_sampling_topp, -random_sampling_topp
Probability for top-p/nucleus sampling. Restrict tokens to the most likely ones until their cumulative probability exceeds p. In range [0, 1]. https://arxiv.org/abs/1904.09751
Default: 0.0
- --random_sampling_temp, -random_sampling_temp
If doing random sampling, divide the logits by this before computing softmax during decoding.
Default: 1.0
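A hedged nucleus-sampling sketch combining these options; setting --beam_size 1 to disable beam search during sampling is an assumption carried over from similar toolkits, and the p and temperature values are illustrative.
python translate.py ... \
    --beam_size 1 \
    --random_sampling_topk -1 \
    --random_sampling_topp 0.9 \
    --random_sampling_temp 0.8 \
    --seed 42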
Reproducibility¶
- --seed, -seed
Set random seed used for better reproducibility between experiments.
Default: -1
Penalties¶
Note
Coverage Penalty is not available in sampling.
- --length_penalty, -length_penalty
Possible choices: none, wu, avg
Length Penalty to use.
Default: “none”
- --alpha, -alpha
Google NMT length penalty parameter (higher = longer generation).
Default: 0.0
- --coverage_penalty, -coverage_penalty
Possible choices: none, wu, summary
Coverage Penalty to use. Only available in beam search.
Default: “none”
- --beta, -beta
Coverage penalty parameter.
Default: -0.0
- --stepwise_penalty, -stepwise_penalty
Apply the coverage penalty at every decoding step. Helpful for the summary coverage penalty.
Default: False
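A hedged beam-search sketch using the Wu-style penalties; alpha=0.6 is a common choice from the GNMT paper, not a recommendation specific to this tool, and the beta value is illustrative.
python translate.py ... \
    --beam_size 5 \
    --length_penalty wu --alpha 0.6 \
    --coverage_penalty wu --beta 0.2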
Decoding tricks¶
Tip
The following options can be used to limit the decoding length or content.
- --min_length, -min_length
Minimum prediction length.
Default: 0
- --max_length, -max_length
Maximum prediction length.
Default: 100
- --max_sent_length, -max_sent_length
Deprecated; use -max_length instead.
- --block_ngram_repeat, -block_ngram_repeat
Block repetition of ngrams during decoding.
Default: 0
- --ignore_when_blocking, -ignore_when_blocking
Ignore these strings when blocking repeats; sentence delimiters are typical candidates.
Default: []
- --replace_unk, -replace_unk
Replace the generated UNK tokens with the source token that had highest attention weight. If phrase_table is provided, it will look up the identified source token and give the corresponding target token. If it is not provided (or the identified source token does not exist in the table), then it will copy the source token.
Default: False
- --ban_unk_token, -ban_unk_token
Prevent unk token generation by setting the unk probability to 0.
Default: False
- --phrase_table, -phrase_table
If phrase_table is provided (with replace_unk), it will look up the identified source token and give the corresponding target token. If it is not provided (or the identified source token does not exist in the table), then it will copy the source token.
Default: “”
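A hedged sketch limiting prediction length and blocking repeated trigrams; the delimiter strings passed to --ignore_when_blocking are illustrative.
python translate.py ... \
    --min_length 5 --max_length 150 \
    --block_ngram_repeat 3 \
    --ignore_when_blocking "." "</s>"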
Logging¶
- --log_file, -log_file
Output logs to a file under this path.
Default: “”
- --structured_log_file, -structured_log_file
Output machine-readable structured logs to a file under this path.
Default: “”
- --log_file_level, -log_file_level
Possible choices: CRITICAL, ERROR, WARNING, INFO, DEBUG, NOTSET, 50, 40, 30, 20, 10, 0
Default: “0”
- --verbose, -verbose
Print scores and predictions for each sentence.
Default: False
- --attn_debug, -attn_debug
Print the best attention weights for each word.
Default: False
- --align_debug, -align_debug
Print the best alignment for each word.
Default: False
- --dump_beam, -dump_beam
File to dump beam information to.
Default: “”
- --n_best, -n_best
If verbose is set, output the n_best decoded sentences.
Default: 1
Efficiency¶
- --batch_size, -batch_size
Batch size.
Default: 30
- --batch_type, -batch_type
Possible choices: sents, tokens
Batch grouping for batch_size. Standard is sents; tokens enables dynamic batching.
Default: “sents”
- --gpu, -gpu
Device to run on (GPU id; -1 for CPU).
Default: -1
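A hedged efficiency sketch using dynamic token-based batching on GPU 0; the token budget is illustrative.
python translate.py ... \
    --batch_type tokens --batch_size 4096 \
    --gpu 0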
- --output_model, -output_model
Path to the model output.