Build Vocab

build_vocab.py

usage: build_vocab.py [-h] [-config CONFIG] [-save_config SAVE_CONFIG] -tasks TASKS [-skip_empty_level {silent,warning,error}]
                      [-mammoth_transforms {prefix,denoising,filtertoolong,filterwordratio,filterrepetitions,filterterminalpunct,filternonzeronumerals,filterfeats,inferfeats,switchout,tokendrop,tokenmask,sentencepiece,bpe,onmt_tokenize} [{prefix,denoising,filtertoolong,filterwordratio,filterrepetitions,filterterminalpunct,filternonzeronumerals,filterfeats,inferfeats,switchout,tokendrop,tokenmask,sentencepiece,bpe,onmt_tokenize} ...]]
                      -save_data SAVE_DATA [-overwrite] [-n_sample N_SAMPLE] [-dump_samples] [-num_threads NUM_THREADS]
                      [-vocab_sample_queue_size VOCAB_SAMPLE_QUEUE_SIZE] -src_vocab SRC_VOCAB [-tgt_vocab TGT_VOCAB] [-share_vocab]
                      [-vocab_paths VOCAB_PATHS] [-src_feats_vocab SRC_FEATS_VOCAB] [--permute_sent_ratio PERMUTE_SENT_RATIO] [--rotate_ratio ROTATE_RATIO]
                      [--insert_ratio INSERT_RATIO] [--random_ratio RANDOM_RATIO] [--mask_ratio MASK_RATIO] [--mask_length {subword,word,span-poisson}]
                      [--poisson_lambda POISSON_LAMBDA] [--replace_length {-1,0,1}] [--denoising_objective {bart,mass}] [--src_seq_length SRC_SEQ_LENGTH]
                      [--tgt_seq_length TGT_SEQ_LENGTH] [--word_ratio_threshold WORD_RATIO_THRESHOLD] [--rep_threshold REP_THRESHOLD]
                      [--rep_min_len REP_MIN_LEN] [--rep_max_len REP_MAX_LEN] [--punct_threshold PUNCT_THRESHOLD] [--nonzero_threshold NONZERO_THRESHOLD]
                      [--reversible_tokenization {joiner,spacer}] [--prior_tokenization] [-switchout_temperature SWITCHOUT_TEMPERATURE]
                      [-tokendrop_temperature TOKENDROP_TEMPERATURE] [-tokenmask_temperature TOKENMASK_TEMPERATURE] [-src_subword_model SRC_SUBWORD_MODEL]
                      [-tgt_subword_model TGT_SUBWORD_MODEL] [-src_subword_nbest SRC_SUBWORD_NBEST] [-tgt_subword_nbest TGT_SUBWORD_NBEST]
                      [-src_subword_alpha SRC_SUBWORD_ALPHA] [-tgt_subword_alpha TGT_SUBWORD_ALPHA] [-src_subword_vocab SRC_SUBWORD_VOCAB]
                      [-tgt_subword_vocab TGT_SUBWORD_VOCAB] [-src_vocab_threshold SRC_VOCAB_THRESHOLD] [-tgt_vocab_threshold TGT_VOCAB_THRESHOLD]
                      [-src_subword_type {none,sentencepiece,bpe}] [-tgt_subword_type {none,sentencepiece,bpe}] [-src_onmttok_kwargs SRC_ONMTTOK_KWARGS]
                      [-tgt_onmttok_kwargs TGT_ONMTTOK_KWARGS] [--seed SEED]
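
For example, a typical invocation reads most settings from a YAML config file and samples the full corpus (the config file name below is a placeholder):

    python build_vocab.py -config my_config.yaml -n_sample -1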

Configuration

-config, --config

Path of the main YAML config file.

-save_config, --save_config

Path where to save the config.

Data/Tasks

-tasks, --tasks

List of datasets and their specifications. See examples/*.yaml for further details.
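
A minimal sketch of a tasks entry in the YAML config. The task name, field names, and paths below are illustrative assumptions; see examples/*.yaml for the authoritative format:

    tasks:
      train_en-de:
        src_tgt: en-de
        path_src: data/train.en
        path_tgt: data/train.de
        transforms: [sentencepiece, filtertoolong]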

-skip_empty_level, --skip_empty_level

Possible choices: silent, warning, error

How to handle empty examples. silent: silently ignore/skip the empty example; warning: log a warning when ignoring/skipping an empty example; error: raise an error and stop execution when an empty example is encountered.

Default: “warning”

-mammoth_transforms, --mammoth_transforms

Possible choices: prefix, denoising, filtertoolong, filterwordratio, filterrepetitions, filterterminalpunct, filternonzeronumerals, filterfeats, inferfeats, switchout, tokendrop, tokenmask, sentencepiece, bpe, onmt_tokenize

Default transform pipeline to apply to the data. Can be overridden per corpus in the tasks configuration.

Default: []
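
A sketch of setting a default pipeline and overriding it for one corpus (the task name and transform choices are illustrative assumptions):

    mammoth_transforms: [sentencepiece, filtertoolong]
    tasks:
      train_en-de:
        # this corpus overrides the default pipeline
        transforms: [sentencepiece]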

-save_data, --save_data

Output base path for objects that will be saved (vocab, transforms, embeddings, …).

-overwrite, --overwrite

Overwrite existing objects if any.

Default: False

-n_sample, --n_sample

Number of transformed samples per corpus used to build the vocabulary. Can be -1, 0, or N > 0. Set to -1 to use the full corpus, or 0 to skip.

Default: 5000

-dump_samples, --dump_samples

Dump samples when building vocab. Warning: this may slow down the process.

Default: False

-num_threads, --num_threads

Number of parallel threads to build the vocab.

Default: 1

-vocab_sample_queue_size, --vocab_sample_queue_size

Size of queues used in the build_vocab dump path.

Default: 20

Vocab

-src_vocab, --src_vocab

Path to save src (or shared) vocabulary file. Format: one <word> or <word> <count> per line.

-tgt_vocab, --tgt_vocab

Path to save tgt vocabulary file. Format: one <word> or <word> <count> per line.
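
For reference, a vocabulary file in the format described above might look like this (words and counts are purely illustrative):

    the 123456
    of 98765
    dog 1024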

-share_vocab, --share_vocab

Share source and target vocabulary.

Default: False

-vocab_paths, --vocab_paths

Path to a file listing vocabularies, one per line, with three tab-separated fields: ENC or DEC, the language name, and the path to the vocabulary file.
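
A sketch of the expected file contents, with <TAB> marking tab characters. The casing of the first field, the language codes, and the paths are illustrative assumptions:

    ENC<TAB>en<TAB>/path/to/vocab.en
    DEC<TAB>de<TAB>/path/to/vocab.de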

-src_feats_vocab, --src_feats_vocab

List of paths to save src features vocabulary files. File format: one <word> or <word> <count> per line.

Transform/Denoising AE

--permute_sent_ratio, -permute_sent_ratio

Permute this proportion of sentences (boundaries defined by [‘.’, ‘?’, ‘!’]) in all inputs.

Default: 0.0

--rotate_ratio, -rotate_ratio

Rotate this proportion of inputs.

Default: 0.0

--insert_ratio, -insert_ratio

Insert this proportion of additional random tokens.

Default: 0.0

--random_ratio, -random_ratio

Instead of using <mask>, use a random token this often. Incompatible with MASS.

Default: 0.0

--mask_ratio, -mask_ratio

Fraction of words/subwords that will be masked.

Default: 0.0

--mask_length, -mask_length

Possible choices: subword, word, span-poisson

Length of masking window to apply.

Default: “subword”

--poisson_lambda, -poisson_lambda

Lambda for Poisson distribution to sample span length if -mask_length set to span-poisson.

Default: 3.0

--replace_length, -replace_length

Possible choices: -1, 0, 1

When masking N tokens, replace them with 0, 1, or N tokens (use -1 for N).

Default: -1

--denoising_objective

Possible choices: bart, mass

Choose between BART-style and MASS-style denoising objectives.

Default: “bart”
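
A sketch of enabling BART-style denoising in the YAML config. The values are illustrative; the config keys are assumed to mirror the option names above:

    mammoth_transforms: [denoising]
    denoising_objective: bart
    mask_ratio: 0.3
    mask_length: span-poisson
    poisson_lambda: 3.0
    replace_length: 1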

Transform/Filter (filtertoolong)

--src_seq_length, -src_seq_length

Maximum source sequence length.

Default: 200

--tgt_seq_length, -tgt_seq_length

Maximum target sequence length.

Default: 200
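
A sketch of a length-filter configuration (values are illustrative; keys are assumed to mirror the option names above):

    mammoth_transforms: [filtertoolong]
    src_seq_length: 200
    tgt_seq_length: 200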

Transform/Filter (filterwordratio)

--word_ratio_threshold, -word_ratio_threshold

Threshold for discarding sentences based on word ratio.

Default: 3

Transform/Filter (filterrepetitions)

--rep_threshold, -rep_threshold

Number of times a substring must be repeated before the sentence is filtered.

Default: 2

--rep_min_len, -rep_min_len

Minimum length of the repeated pattern.

Default: 3

--rep_max_len, -rep_max_len

Maximum length of the repeated pattern.

Default: 100

Transform/Filter (filterterminalpunct)

--punct_threshold, -punct_threshold

Minimum penalty score for discarding sentences based on their terminal punctuation marks.

Default: -2

Transform/Filter (filternonzeronumerals)

--nonzero_threshold, -nonzero_threshold

Threshold for discarding sentences based on the numerals in the segments, with zeros removed.

Default: 0.5

Transform/InferFeats

--reversible_tokenization, -reversible_tokenization

Possible choices: joiner, spacer

Type of reversible tokenization applied by the tokenizer.

Default: “joiner”

--prior_tokenization, -prior_tokenization

Whether the input has already been tokenized.

Default: False

Transform/SwitchOut

Caution

This transform will not take effect when building vocabulary.

-switchout_temperature, --switchout_temperature

Sampling temperature for SwitchOut. τ^{-1} in [WPDN18]. A smaller value makes the data more diverse.

Default: 1.0

Transform/Token_Drop

-tokendrop_temperature, --tokendrop_temperature

Sampling temperature for token deletion.

Default: 1.0

Transform/Token_Mask

-tokenmask_temperature, --tokenmask_temperature

Sampling temperature for token masking.

Default: 1.0

Transform/Subword/Common

Attention

Common options shared by all subword transforms, including options that indicate the subword model path, subword regularization / BPE-dropout, and vocabulary restriction.

-src_subword_model, --src_subword_model

Path of subword model for src (or shared).

-tgt_subword_model, --tgt_subword_model

Path of subword model for tgt.

-src_subword_nbest, --src_subword_nbest

Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (source side)

Default: 1

-tgt_subword_nbest, --tgt_subword_nbest

Number of candidates in subword regularization. Valid for unigram sampling, invalid for BPE-dropout. (target side)

Default: 1

-src_subword_alpha, --src_subword_alpha

Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (source side)

Default: 0

-tgt_subword_alpha, --tgt_subword_alpha

Smoothing parameter for sentencepiece unigram sampling, and dropout probability for BPE-dropout. (target side)

Default: 0

-src_subword_vocab, --src_subword_vocab

Path to the vocabulary file for src subword. Format: <word> <count> per line.

Default: “”

-tgt_subword_vocab, --tgt_subword_vocab

Path to the vocabulary file for tgt subword. Format: <word> <count> per line.

Default: “”

-src_vocab_threshold, --src_vocab_threshold

Only produce src subwords appearing in src_subword_vocab with frequency >= src_vocab_threshold.

Default: 0

-tgt_vocab_threshold, --tgt_vocab_threshold

Only produce tgt subwords appearing in tgt_subword_vocab with frequency >= tgt_vocab_threshold.

Default: 0
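
A sketch of a shared SentencePiece configuration with subword regularization (the model paths and values are illustrative assumptions):

    mammoth_transforms: [sentencepiece]
    src_subword_model: spm/shared.model
    tgt_subword_model: spm/shared.model
    src_subword_nbest: 64
    src_subword_alpha: 0.1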

Transform/Subword/ONMTTOK

-src_subword_type, --src_subword_type

Possible choices: none, sentencepiece, bpe

Type of subword model for src (or shared) in pyonmttok.

Default: “none”

-tgt_subword_type, --tgt_subword_type

Possible choices: none, sentencepiece, bpe

Type of subword model for tgt in pyonmttok.

Default: “none”

-src_onmttok_kwargs, --src_onmttok_kwargs

Other pyonmttok options for src, given as a dict string, except the subword-related options listed earlier.

Default: “{‘mode’: ‘none’}”

-tgt_onmttok_kwargs, --tgt_onmttok_kwargs

Other pyonmttok options for tgt, given as a dict string, except the subword-related options listed earlier.

Default: “{‘mode’: ‘none’}”
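
A sketch of an onmt_tokenize configuration using BPE. The model paths and tokenizer options are illustrative; consult the pyonmttok documentation for valid keys:

    mammoth_transforms: [onmt_tokenize]
    src_subword_type: bpe
    src_subword_model: bpe/codes.en
    src_onmttok_kwargs: "{'mode': 'aggressive', 'joiner_annotate': True}"
    tgt_subword_type: bpe
    tgt_subword_model: bpe/codes.de
    tgt_onmttok_kwargs: "{'mode': 'aggressive', 'joiner_annotate': True}"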

Reproducibility

--seed, -seed

Set the random seed for better reproducibility between experiments.

Default: -1