Tokenizer
Tokenize parallel texts.
Parameters:

* `tokenizer`: tokenizer type, or a list of types for each input
* `languages`: a list of language codes for each input
* `options`: tokenizer options dictionary, or a list of option dictionaries for multiple tokenizers (optional)
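For example, a minimal `preprocess` step using the Tokenizer might look like the sketch below; the input and output file names are placeholders, not part of the Tokenizer interface:

```yaml
steps:
  - type: preprocess
    parameters:
      inputs: [input.en, input.de]            # placeholder file names
      outputs: [tokenized.en, tokenized.de]
      preprocessors:
        - Tokenizer:
            tokenizer: moses                  # a single type applies to all inputs
            languages: [en, de]
```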
Supported tokenizers:
* `moses`: Uses the opus-fast-mosestokenizer package (a fork of fast-mosestokenizer). Available for most languages. Options are passed to the `mosestokenizer.MosesTokenizer` class; see its documentation for the available options.
* `jieba`: Uses the jieba package. Only available for Chinese (zh, zh_CN). In order to keep track of original space characters, they are by default converted to “␣” before tokenization. The character can be changed with the `map_space_to` option, or the feature can be disabled by giving `null` or an empty string as the value. Other options are passed to the `jieba.cut` function; see its documentation for the available options. If you use jieba, please install OpusFilter with extras `[jieba]` or `[all]`.
* `mecab`: Uses the MeCab package. Only available for Japanese (jp). In order to keep track of original space characters, they are by default converted to “␣” before tokenization. The character can be changed with the `map_space_to` option, or the feature can be disabled by giving `null` or an empty string as the value. By default, the `unidic-lite` dictionary is installed and used. Other dictionaries can be used by providing an appropriate option string in the `mecab_args` option. If you use MeCab, please install OpusFilter with extras `[mecab]` or `[all]`. A combined configuration sketch for all three tokenizers is given after this list.
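As noted above, the following sketch combines all three tokenizers for an English–Chinese–Japanese input. The file names are placeholders, and the option values are only illustrative; check each tokenizer's documentation for the options it actually accepts:

```yaml
steps:
  - type: preprocess
    parameters:
      inputs: [input.en, input.zh, input.jp]              # placeholder file names
      outputs: [tokenized.en, tokenized.zh, tokenized.jp]
      preprocessors:
        - Tokenizer:
            tokenizer: [moses, jieba, mecab]              # one tokenizer per language
            languages: [en, zh, jp]
            options:
              - aggressive_dash_splits: true              # passed to mosestokenizer.MosesTokenizer
              - map_space_to: "␣"                         # default space placeholder
                cut_all: false                            # passed to jieba.cut
              - map_space_to: "␣"
                mecab_args: ""                            # assumed empty string keeps the default unidic-lite dictionary
```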
The list of language codes should match the languages of the input files given in the preprocess step. If more than one tokenizer is provided, the length of the list should match the number of languages. Likewise, if more than one set of tokenizer options is provided, the length of that list should match the number of languages.
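For instance, in the sketch below the same tokenizer type is listed once per language, and a separate options dictionary is given for each; both lists have the same length as `languages` (the option name is again only illustrative):

```yaml
preprocessors:
  - Tokenizer:
      tokenizer: [moses, moses]           # one entry per language
      languages: [en, fr]
      options:                            # one options dictionary per language
        - aggressive_dash_splits: true
        - aggressive_dash_splits: false
```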