Build Vocab¶
build_vocab.py
usage: build_vocab.py [-h] [-config CONFIG] [-save_config SAVE_CONFIG] -tasks TASKS [-skip_empty_level {silent,warning,error}]
[-mammoth_transforms {prefix,denoising,filtertoolong,filterwordratio,filterrepetitions,filterterminalpunct,filternonzeronumerals,filterfeats,inferfeats,switchout,tokendrop,tokenmask,sentencepiece,bpe,onmt_tokenize} [{prefix,denoising,filtertoolong,filterwordratio,filterrepetitions,filterterminalpunct,filternonzeronumerals,filterfeats,inferfeats,switchout,tokendrop,tokenmask,sentencepiece,bpe,onmt_tokenize} ...]]
-save_data SAVE_DATA [-overwrite] [-n_sample N_SAMPLE] [-dump_samples] [-num_threads NUM_THREADS]
[-vocab_sample_queue_size VOCAB_SAMPLE_QUEUE_SIZE] -src_vocab SRC_VOCAB [-tgt_vocab TGT_VOCAB] [-share_vocab]
[-vocab_paths VOCAB_PATHS] [-src_feats_vocab SRC_FEATS_VOCAB] [--permute_sent_ratio PERMUTE_SENT_RATIO] [--rotate_ratio ROTATE_RATIO]
[--insert_ratio INSERT_RATIO] [--random_ratio RANDOM_RATIO] [--mask_ratio MASK_RATIO] [--mask_length {subword,word,span-poisson}]
[--poisson_lambda POISSON_LAMBDA] [--replace_length {-1,0,1}] [--denoising_objective {bart,mass}] [--src_seq_length SRC_SEQ_LENGTH]
[--tgt_seq_length TGT_SEQ_LENGTH] [--word_ratio_threshold WORD_RATIO_THRESHOLD] [--rep_threshold REP_THRESHOLD]
[--rep_min_len REP_MIN_LEN] [--rep_max_len REP_MAX_LEN] [--punct_threshold PUNCT_THRESHOLD] [--nonzero_threshold NONZERO_THRESHOLD]
[--reversible_tokenization {joiner,spacer}] [--prior_tokenization] [-switchout_temperature SWITCHOUT_TEMPERATURE]
[-tokendrop_temperature TOKENDROP_TEMPERATURE] [-tokenmask_temperature TOKENMASK_TEMPERATURE] [-src_subword_model SRC_SUBWORD_MODEL]
[-tgt_subword_model TGT_SUBWORD_MODEL] [-src_subword_nbest SRC_SUBWORD_NBEST] [-tgt_subword_nbest TGT_SUBWORD_NBEST]
[-src_subword_alpha SRC_SUBWORD_ALPHA] [-tgt_subword_alpha TGT_SUBWORD_ALPHA] [-src_subword_vocab SRC_SUBWORD_VOCAB]
[-tgt_subword_vocab TGT_SUBWORD_VOCAB] [-src_vocab_threshold SRC_VOCAB_THRESHOLD] [-tgt_vocab_threshold TGT_VOCAB_THRESHOLD]
[-src_subword_type {none,sentencepiece,bpe}] [-tgt_subword_type {none,sentencepiece,bpe}] [-src_onmttok_kwargs SRC_ONMTTOK_KWARGS]
[-tgt_onmttok_kwargs TGT_ONMTTOK_KWARGS] [--seed SEED]
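The options that appear without brackets in the usage line above (-tasks, -save_data, -src_vocab) are the required ones. The following is a minimal sketch of assembling such an invocation from Python. The helper name, the file paths, the "python build_vocab.py" launcher, and the assumption that the YAML config supplies the tasks are all illustrative and not taken from the tool itself.

```python
import shlex


def build_vocab_argv(config_path, save_data, src_vocab, n_sample=5000, overwrite=False):
    """Assemble an argument list for build_vocab.py (hypothetical helper)."""
    argv = [
        "python", "build_vocab.py",
        "-config", config_path,      # YAML config, assumed here to define the tasks
        "-save_data", save_data,     # output base path
        "-src_vocab", src_vocab,     # where the src (or shared) vocab is written
        "-n_sample", str(n_sample),  # -1 = full corpus, 0 = skip, N > 0 = N samples
    ]
    if overwrite:
        argv.append("-overwrite")
    return argv


# Placeholder paths, for illustration only.
argv = build_vocab_argv("my_config.yaml", "run/example", "run/example.vocab.src", n_sample=-1)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run as-is; building a list rather than a single string avoids shell quoting issues.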
Configuration¶
- -config, --config
Path of the main YAML config file.
- -save_config, --save_config
Path where the config will be saved.
Data/Tasks¶
- -tasks, --tasks
List of datasets and their specifications. See examples/*.yaml for further details.
- -skip_empty_level, --skip_empty_level
Possible choices: silent, warning, error
Severity level applied when an empty example is encountered. silent: silently skip the empty example; warning: skip the empty example and emit a warning; error: raise an error and stop execution.
Default: “warning”
- -mammoth_transforms, --mammoth_transforms
Possible choices: prefix, denoising, filtertoolong, filterwordratio, filterrepetitions, filterterminalpunct, filternonzeronumerals, filterfeats, inferfeats, switchout, tokendrop, tokenmask, sentencepiece, bpe, onmt_tokenize
Default transform pipeline to apply to data. Can be specified in each corpus of data to override.
Default: []
- -save_data, --save_data
Output base path for objects that will be saved (vocab, transforms, embeddings, …).
- -overwrite, --overwrite
Overwrite existing objects if any.
Default: False
- -n_sample, --n_sample
Number of transformed samples per corpus used to build the vocab. Accepted values: -1, 0, or N > 0. Set to -1 to use the full corpus, or 0 to skip.
Default: 5000
- -dump_samples, --dump_samples
Dump samples when building vocab. Warning: this may slow down the process.
Default: False
- -num_threads, --num_threads
Number of parallel threads to build the vocab.
Default: 1
- -vocab_sample_queue_size, --vocab_sample_queue_size
Size of queues used in the build_vocab dump path.
Default: 20
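The semantics of -n_sample and -skip_empty_level described above can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, and taking the first N examples (rather than a random subset) is an assumption.

```python
import warnings
from itertools import islice


def take_samples(corpus, n_sample=5000):
    """Select the examples used for vocabulary counting (illustrative only)."""
    if n_sample == -1:
        return list(corpus)                    # use the full corpus
    if n_sample == 0:
        return []                              # skip this corpus entirely
    if n_sample > 0:
        return list(islice(corpus, n_sample))  # assumed: the first N examples
    raise ValueError("n_sample must be -1, 0, or a positive integer")


def keep_example(line, skip_empty_level="warning"):
    """Return True if the example is kept; apply the empty-example policy otherwise."""
    if line.strip():
        return True
    if skip_empty_level == "error":
        raise ValueError("empty example encountered")
    if skip_empty_level == "warning":
        warnings.warn("skipping empty example")
    return False  # both "silent" and "warning" skip the example
```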
Vocab¶
- -src_vocab, --src_vocab
Path to save src (or shared) vocabulary file. Format: one <word> or <word> <count> per line.
- -tgt_vocab, --tgt_vocab
Path to save tgt vocabulary file. Format: one <word> or <word> <count> per line.
- -share_vocab, --share_vocab
Share source and target vocabulary.
Default: False
- -vocab_paths, --vocab_paths
Name of a file whose lines contain ENCorDEC, the language name, and the path of the vocab, separated by TAB characters.
- -src_feats_vocab, --src_feats_vocab
List of paths to save src features vocabulary files. File format: one <word> or <word> <count> per line.
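To make the <word> <count> file format concrete, here is a minimal sketch that counts tokens and renders them one entry per line. It assumes whitespace tokenization, a tab between word and count, and ordering by descending frequency. None of these details are specified above, and the function names are hypothetical.

```python
from collections import Counter


def count_tokens(lines):
    """Count whitespace-separated tokens over an iterable of sentences."""
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    return counter


def format_vocab(counter):
    """Render one '<word><TAB><count>' entry per line (separator is an assumption)."""
    # Sort by descending count, then alphabetically so the order is stable.
    items = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{word}\t{count}" for word, count in items]


sample = ["the cat sat", "the dog sat", "the end"]
vocab_lines = format_vocab(count_tokens(sample))
print("\n".join(vocab_lines))
```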
Transform/Denoising AE¶
- --permute_sent_ratio, -permute_sent_ratio
Permute this proportion of sentences (boundaries defined by ['.', '?', '!']) in all inputs.
Default: 0.0
- --rotate_ratio, -rotate_ratio
Rotate this proportion of inputs.
Default: 0.0
- --insert_ratio, -insert_ratio
Insert this percentage of additional random tokens.
Default: 0.0
- --random_ratio, -random_ratio
Instead of using <mask>, use a random token this often. Incompatible with MASS.
Default: 0.0
- --mask_ratio, -mask_ratio
Fraction of words/subwords that will be masked.
Default: 0.0
- --mask_length, -mask_length
Possible choices: subword, word, span-poisson
Length of masking window to apply.
Default: “subword”
- --poisson_lambda, -poisson_lambda
Lambda of the Poisson distribution used to sample span lengths when -mask_length is set to span-poisson.
Default: 3.0
- --replace_length, -replace_length
Possible choices: -1, 0, 1
When masking N tokens, replace them with 0, 1, or N tokens (use -1 for N).
Default: -1
- --denoising_objective
Possible choices: bart, mass
Choose between BART-style and MASS-style denoising objectives.
Default: “bart”
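The three -replace_length settings can be illustrated with a small sketch. It shows only how a masked span of N tokens is rewritten. The function name and the way the span is chosen are hypothetical, and this is not the denoising transform itself.

```python
def replace_span(tokens, start, length, replace_length=-1, mask="<mask>"):
    """Rewrite tokens[start:start + length] according to replace_length."""
    if replace_length == -1:
        filler = [mask] * length  # N masked tokens become N <mask> tokens
    elif replace_length == 0:
        filler = []               # the span is deleted
    elif replace_length == 1:
        filler = [mask]           # the whole span collapses to one <mask>
    else:
        raise ValueError("replace_length must be -1, 0, or 1")
    return tokens[:start] + filler + tokens[start + length:]


tokens = ["a", "b", "c", "d", "e"]
for setting in (-1, 0, 1):
    print(setting, replace_span(tokens, 1, 3, setting))
```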
Transform/Filter¶
- --src_seq_length, -src_seq_length
Maximum source sequence length.
Default: 200
- --tgt_seq_length, -tgt_seq_length
Maximum target sequence length.
Default: 200
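A length filter of this kind can be sketched as below, assuming lengths are counted in whitespace-separated tokens and that an example is discarded only when it strictly exceeds a limit. Both points are assumptions, and the function name is hypothetical.

```python
def is_too_long(src, tgt, src_seq_length=200, tgt_seq_length=200):
    """Return True if either side exceeds its maximum sequence length."""
    return len(src.split()) > src_seq_length or len(tgt.split()) > tgt_seq_length
```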
Transform/Filter¶
- --word_ratio_threshold, -word_ratio_threshold
Threshold for discarding sentences based on word ratio.
Default: 3
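One plausible reading of the word ratio threshold is sketched below: the ratio of the longer side to the shorter side, in whitespace-separated words, is compared against the threshold. The exact definition used by the filter is not stated above, so the formula and the function names are assumptions.

```python
def word_ratio(src, tgt):
    """Ratio of the longer side to the shorter side, counted in words."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if min(n_src, n_tgt) == 0:
        return float("inf")  # an empty side makes the ratio unbounded
    return max(n_src, n_tgt) / min(n_src, n_tgt)


def exceeds_word_ratio(src, tgt, word_ratio_threshold=3):
    """Return True if the pair would be discarded under this reading."""
    return word_ratio(src, tgt) > word_ratio_threshold
```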
Transform/Filter¶
- --rep_threshold, -rep_threshold
Number of times the substring is repeated.
Default: 2
- --rep_min_len, -rep_min_len
Minimum length of the repeated pattern.
Default: 3
- --rep_max_len, -rep_max_len
Maximum length of the repeated pattern.
Default: 100
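The three repetition options can be combined into a single regular expression, as in the sketch below. It assumes that lengths are measured in characters and that -rep_threshold counts how many times the pattern recurs after its first occurrence. Both are assumptions, and the function names are hypothetical.

```python
import re


def repetition_regex(rep_threshold=2, rep_min_len=3, rep_max_len=100):
    """Build a regex matching a substring followed by repeated copies of itself."""
    # (.{min,max}?) lazily captures a candidate pattern; \1{threshold,} requires
    # at least `rep_threshold` further copies immediately after it.
    return re.compile(r"(.{%d,%d}?)\1{%d,}" % (rep_min_len, rep_max_len, rep_threshold))


def has_repetition(text, **kwargs):
    """Return True if the text contains such a repeated pattern."""
    return repetition_regex(**kwargs).search(text) is not None
```

Under these assumptions, "abcabcabc" is flagged with the default values (one occurrence plus two repeats), while "abcabc" is not.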
Transform/Filter¶
- --punct_threshold, -punct_threshold
Minimum penalty score for discarding sentences based on their terminal punctuation.
Default: -2
Transform/Filter¶
- --nonzero_threshold, -nonzero_threshold
Threshold for discarding sentences based on the similarity of numerals between the segments, with zeros removed.
Default: 0.5
Transform/InferFeats¶
- --reversible_tokenization, -reversible_tokenization
Possible choices: joiner, spacer
Type of reversible tokenization applied on the tokenizer.
Default: “joiner”
- --prior_tokenization, -prior_tokenization
Whether the input has already been tokenized.
Default: False
Transform/SwitchOut¶
Caution
This transform will not take effect when building vocabulary.
- -switchout_temperature, --switchout_temperature
Sampling temperature for SwitchOut. Corresponds to \(\tau^{-1}\) in [WPDN18]. A smaller value makes the data more diverse.
Default: 1.0
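To see why a smaller temperature yields more diverse data, consider the following sketch. It assumes the number of replaced tokens n is drawn with probability proportional to exp(-n * temperature), which matches the \(\tau^{-1}\) reading above. The range of n and the function names are assumptions, and this is not presented as the transform's actual code.

```python
import math


def distance_probs(n_tokens, temperature=1.0):
    """Probability of replacing n = 0 .. n_tokens - 1 tokens (assumed range)."""
    logits = [-n * temperature for n in range(n_tokens)]
    normalizer = sum(math.exp(logit) for logit in logits)
    return [math.exp(logit) / normalizer for logit in logits]


def expected_distance(n_tokens, temperature):
    """Average number of replaced tokens under distance_probs."""
    return sum(n * p for n, p in enumerate(distance_probs(n_tokens, temperature)))


for temperature in (0.5, 1.0, 2.0):
    print(temperature, round(expected_distance(10, temperature), 3))
```

A lower temperature flattens the distribution, so larger numbers of replacements become more likely and the augmented sentences differ more from the originals.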
Transform/Token_Drop¶
- -tokendrop_temperature, --tokendrop_temperature
Sampling temperature for token deletion.
Default: 1.0
Transform/Token_Mask¶
- -tokenmask_temperature, --tokenmask_temperature
Sampling temperature for token masking.
Default: 1.0
Transform/Subword/Common¶
Attention
Common options shared by all subword transforms, including options for indicating the subword model path, Subword Regularization/BPE-Dropout, and Vocabulary Restriction.
- -src_subword_model, --src_subword_model
Path of subword model for src (or shared).
- -tgt_subword_model, --tgt_subword_model
Path of subword model for tgt.
- -src_subword_nbest, --src_subword_nbest
Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (source side)
Default: 1
- -tgt_subword_nbest, --tgt_subword_nbest
Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (target side)
Default: 1
- -src_subword_alpha, --src_subword_alpha
Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)
Default: 0
- -tgt_subword_alpha, --tgt_subword_alpha
Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)
Default: 0
- -src_subword_vocab, --src_subword_vocab
Path to the vocabulary file for src subword. Format: <word> <count> per line.
Default: “”
- -tgt_subword_vocab, --tgt_subword_vocab
Path to the vocabulary file for tgt subword. Format: <word> <count> per line.
Default: “”
- -src_vocab_threshold, --src_vocab_threshold
Only produce src subwords that appear in src_subword_vocab with frequency >= src_vocab_threshold.
Default: 0
- -tgt_vocab_threshold, --tgt_vocab_threshold
Only produce tgt subwords that appear in tgt_subword_vocab with frequency >= tgt_vocab_threshold.
Default: 0
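The vocabulary restriction options above amount to building a set of permitted subwords from a <word> <count> file. A minimal sketch follows, assuming whitespace-separated fields and skipping malformed lines. The function name is hypothetical, and how the subword library handles pieces outside the set is not shown.

```python
def load_allowed_subwords(lines, vocab_threshold=0):
    """Collect subwords whose count is at least vocab_threshold."""
    allowed = set()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # assumption: ignore lines that are not '<word> <count>'
        word, count = parts
        if int(count) >= vocab_threshold:
            allowed.add(word)
    return allowed


vocab_file_lines = ["▁the 120", "ing 45", "zz 2"]
print(sorted(load_allowed_subwords(vocab_file_lines, vocab_threshold=10)))
```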
Transform/Subword/ONMTTOK¶
- -src_subword_type, --src_subword_type
Possible choices: none, sentencepiece, bpe
Type of subword model for src (or shared) in pyonmttok.
Default: “none”
- -tgt_subword_type, --tgt_subword_type
Possible choices: none, sentencepiece, bpe
Type of subword model for tgt in pyonmttok.
Default: “none”
- -src_onmttok_kwargs, --src_onmttok_kwargs
Other pyonmttok options for src, given as a dict string, excluding the subword-related options listed earlier.
Default: “{'mode': 'none'}”
- -tgt_onmttok_kwargs, --tgt_onmttok_kwargs
Other pyonmttok options for tgt, given as a dict string, excluding the subword-related options listed earlier.
Default: “{'mode': 'none'}”
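Because the kwargs options take a dict written as a string, it can help to check that the string parses before launching a run. The sketch below uses ast.literal_eval for that check. Whether the tool itself parses the value this way is not stated here, and the function name is hypothetical. Note that the string must use plain ASCII quotes.

```python
import ast


def parse_onmttok_kwargs(kwargs_string="{'mode': 'none'}"):
    """Parse a dict string into a dict, rejecting anything that is not a dict literal."""
    kwargs = ast.literal_eval(kwargs_string)
    if not isinstance(kwargs, dict):
        raise ValueError("expected a dict string such as \"{'mode': 'none'}\"")
    return kwargs


print(parse_onmttok_kwargs())
print(parse_onmttok_kwargs("{'mode': 'aggressive', 'joiner_annotate': True}"))
```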
Reproducibility¶
- --seed, -seed
Random seed, set for better reproducibility between experiments.
Default: -1