Script and language identification filters
AlphabetRatioFilter
Filter segments based on what proportion of the characters are alphabetic characters.
Parameters:
threshold
: minimum proportion of alphabetic characters in a segment (default 0.75)
exclude_whitespace
: whether to exclude whitespace characters from the ratio (default false)
Returned scores are proportions of alphabetic characters in the segments (after removing whitespace if exclude_whitespace is true). In filtering, all values have to be equal to or greater than the minimum threshold.
To allow different thresholds per language, the threshold parameter can also be given as a list.
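The scoring and thresholding described above can be sketched in a few lines of Python. This is a minimal illustration, not the library's implementation: the function names `alphabet_ratio` and `accept` are hypothetical, Python's `str.isalpha` stands in for whatever definition of "alphabetic" the filter actually uses, and the score of 0.0 for an empty segment is an assumption.

```python
def alphabet_ratio(segment: str, exclude_whitespace: bool = False) -> float:
    """Return the proportion of alphabetic characters in a segment."""
    if exclude_whitespace:
        # Drop all whitespace before computing the ratio.
        segment = "".join(segment.split())
    if not segment:
        return 0.0  # assumption: an empty segment gets the lowest score
    return sum(ch.isalpha() for ch in segment) / len(segment)


def accept(scores, threshold=0.75):
    """Accept only if every score is >= its threshold.

    The threshold is either a single number applied to all segments
    or a list with one value per language.
    """
    if isinstance(threshold, (int, float)):
        thresholds = [threshold] * len(scores)
    else:
        thresholds = list(threshold)
    return all(s >= t for s, t in zip(scores, thresholds))


pair = ["Hello world", "1234 5678 abc"]
scores = [alphabet_ratio(s) for s in pair]
print(scores)
print(accept(scores))                # single threshold for both segments
print(accept(scores, [0.75, 0.2]))   # one threshold per language
```

With a single threshold of 0.75 the second segment causes the pair to be rejected; giving a lower threshold for that language lets the pair through.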
CharacterScoreFilter
Filter segments based on what proportion of their alphabetic characters are in a given script. For a list of valid scripts, see e.g. www.regular-expressions.info/unicode.html.
Parameters:
scripts
: scripts for input segments
thresholds
: minimum proportion of characters in a script (default 1)
Returned scores are proportions of valid characters in the segments. In filtering, all values have to be equal to or greater than the minimum thresholds.
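As a rough illustration of the score, the sketch below computes the proportion of alphabetic characters that belong to a given script. It is not the library's implementation. The standard library has no direct Unicode script lookup, so the hypothetical helper `in_script` approximates membership by checking the prefix of the Unicode character name; this works for scripts such as Latin, Cyrillic, and Greek but not for cases where the name does not start with the script name (for example Han characters). Returning 1.0 for a segment without alphabetic characters is also an assumption.

```python
import unicodedata


def in_script(ch: str, script: str) -> bool:
    """Approximate script membership from the Unicode character name."""
    return unicodedata.name(ch, "").startswith(script.upper())


def character_score(segment: str, script: str) -> float:
    """Proportion of the segment's alphabetic characters in the given script."""
    letters = [ch for ch in segment if ch.isalpha()]
    if not letters:
        return 1.0  # assumption: nothing alphabetic, so nothing to penalize
    return sum(in_script(ch, script) for ch in letters) / len(letters)


def accept(segments, scripts, thresholds):
    """Accept only if every segment's score is >= its threshold."""
    scores = [character_score(s, sc) for s, sc in zip(segments, scripts)]
    return all(score >= t for score, t in zip(scores, thresholds))


print(character_score("Hello мир", "Latin"))
print(accept(["Hello", "Привет"], ["Latin", "Cyrillic"], [1, 1]))
print(accept(["Hello мир", "Привет"], ["Latin", "Cyrillic"], [1, 1]))
```

With the default threshold of 1, a single character from another script is enough to reject the segment, which is why the mixed-script example fails.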
LanguageIDFilter
Filter segments based on their language identification confidence scores.
Parameters:
languages
: expected languages (ISO 639 language codes) for the segments
id_method
: language identification method (langid for using the langid library, cld2 for using the cld2 library, or fasttext for using a fasttext model; the default is langid)
thresholds
: minimum identification confidence score for the segments (a single float or a list of floats per language)
fasttext_model_path
: path for a fasttext model (required only for the fasttext method; default null)
langid_languages
: limit detection to a list of possible languages (valid only for the langid method; default null)
cld2_options
: a dictionary of options for the cld2 method (valid only for the cld2 method; default null)
Returned scores are the language identification confidence scores from a given identification method for the segments. The scores range from 0 to 1. In filtering, all values have to be greater than the minimum thresholds. A negative threshold can be used to skip filtering for a language.
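The thresholding rule can be sketched as follows. This covers only the acceptance decision, not the language identification itself, and the function name `accept` is hypothetical rather than part of the library. It assumes the confidence scores have already been computed by one of the identification methods.

```python
def accept(scores, thresholds):
    """Accept only if every score is strictly greater than its threshold.

    thresholds is a single number or a list with one value per language.
    A negative threshold disables the check for that language.
    """
    if isinstance(thresholds, (int, float)):
        thresholds = [thresholds] * len(scores)
    return all(t < 0 or s > t for s, t in zip(scores, thresholds))


print(accept([0.9, 0.8], 0.5))          # single threshold for both languages
print(accept([0.9, 0.0], [0.5, 0]))     # second score does not exceed 0
print(accept([0.9, 0.0], [0.5, -1]))    # check skipped for second language
```

Because the comparison is strict, a score of 0 fails even a threshold of 0, so a negative value is the way to let every segment through for a given language.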
See langid.py and pycld2 for the method-specific options. A pretrained fasttext model can be downloaded from fasttext.cc/docs/en/language-identification.html.