Special character and similarity filters
HtmlTagFilter
Filter segments based on whether they contain HTML tags or not.
The returned scores are two boolean values indicating whether the segments contain HTML tags. In filtering, a segment pair is accepted if none of the segments contains HTML tags.
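As a rough illustration of the scoring and acceptance logic described above, the following sketch detects tags with Python's standard html.parser module. The names contains_html, html_scores and accept_html are hypothetical, and the choice of parser is an assumption; the filter itself may detect tags differently.

```python
from html.parser import HTMLParser


class _TagDetector(HTMLParser):
    """Records whether the parser has seen any start or end tag."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        self.found = True

    def handle_endtag(self, tag):
        self.found = True


def contains_html(segment):
    """Return True if the segment contains at least one HTML tag."""
    detector = _TagDetector()
    detector.feed(segment)
    detector.close()
    return detector.found


def html_scores(pair):
    """One boolean per segment: does it contain HTML tags?"""
    return [contains_html(segment) for segment in pair]


def accept_html(scores):
    """Accept the pair only if no segment contains HTML tags."""
    return not any(scores)
```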
TerminalPunctuationFilter
Filter segments based on a penalty score with respect to the
co-occurrence of terminal punctuation marks (‘.’, ‘…’, ‘?’, ‘!’) in
source and target segments [Vázquez et al., 2019]. The
score is formulated as follows: the initial score is the absolute
difference in source and target terminal punctuation counts; the score
is then incremented by the number of terminal punctuation marks beyond the
first occurrence in each of the two segments; and finally, the score is updated
with score = -log(score + 1). The score of the greatest co-occurrence is
0, and smaller values indicate a greater penalty.
This filter works only for bilingual input.
Parameters:
threshold: minimum score threshold (default -2)
The returned score is a single terminal punctuation score. In filtering, the score has to be equal to or greater than the minimum threshold.
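The score computation can be sketched in a few lines of Python. This is a minimal reading of the description, not the library's code: the function names are hypothetical, and it assumes that the increment is applied separately for each segment and that the natural logarithm is used.

```python
import math

TERMINAL_PUNCTUATION = {".", "…", "?", "!"}


def terminal_punctuation_score(source, target):
    """Penalty score: 0 is best, more negative means a larger mismatch."""
    src_count = sum(1 for ch in source if ch in TERMINAL_PUNCTUATION)
    tgt_count = sum(1 for ch in target if ch in TERMINAL_PUNCTUATION)
    # Start from the absolute difference of the two counts.
    score = abs(src_count - tgt_count)
    # Penalize every terminal mark beyond the first one in each segment.
    score += max(src_count - 1, 0)
    score += max(tgt_count - 1, 0)
    return -math.log(score + 1)


def accept_terminal_punctuation(score, threshold=-2):
    """Accept when the score is equal to or greater than the threshold."""
    return score >= threshold
```

For example, a pair in which each segment ends with a single full stop scores 0, while a source with three full stops against a target with none gets a raw penalty of 3 + 2 = 5 and a final score of -log(6), which is still above the default threshold of -2.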
NonZeroNumeralsFilter
Filter segments based on a similarity measure of numerals between the
segments with zeros removed [Vázquez et al., 2019].
Non-zero numerals are extracted from all segments preserving the
relative order of the numerals. The similarity score between the
numeral sequences is produced with SequenceMatcher.ratio() from
Python’s difflib library.
Parameters:
threshold: minimum score threshold (default 0.5)
require_all: if True, all scores (for pairs of n segments) have to reach the threshold; otherwise at least one of the scores has to reach the threshold
The returned value is a list of similarity scores for all language pairs. For n-lingual input, the scores will include C(n, 2) values. In filtering, the pairwise scores have to be equal to or greater than the minimum threshold: all of them if require_all is True, otherwise at least one.
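A minimal sketch of this scoring, using difflib.SequenceMatcher from the standard library. The helper names are hypothetical, and the restriction to the ASCII digits 1-9 is an assumption.

```python
import difflib
import itertools

NONZERO_DIGITS = set("123456789")


def nonzero_numerals(segment):
    """Extract the non-zero digits of a segment, preserving their order."""
    return [ch for ch in segment if ch in NONZERO_DIGITS]


def numeral_scores(segments):
    """SequenceMatcher ratio for every pair of segments: C(n, 2) values."""
    numerals = [nonzero_numerals(segment) for segment in segments]
    return [
        difflib.SequenceMatcher(None, first, second).ratio()
        for first, second in itertools.combinations(numerals, 2)
    ]


def accept_numerals(scores, threshold=0.5, require_all=True):
    """Require all scores, or at least one score, to reach the threshold."""
    check = all if require_all else any
    return check(score >= threshold for score in scores)
```

Because zeros are dropped, "100 ... 25" and "1000 ... 25" both reduce to the sequence 1, 2, 5 and receive a ratio of 1.0. SequenceMatcher.ratio() also returns 1.0 when both sequences are empty, so segments without any non-zero numerals count as fully similar in this sketch.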
LongestCommonSubstringFilter
Filter segments based on the normalized length of the longest common substring.
Parameters:
threshold: filter segments if the normalized length is equal to or above the threshold (optional; default 0.9)
require_all: if True, all ratios (for pairs of n segments) have to be below the threshold; otherwise at least one of the ratios has to be below the threshold
Returned scores are ratios between the length of the longest common substring and the length of the shorter of the compared strings for all language pairs. For n-lingual input, the scores will include C(n, 2) values.
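The ratio can be illustrated with difflib.SequenceMatcher.find_longest_match. This is only a sketch: the function names are hypothetical, and returning 0 when one of the strings is empty is an assumption.

```python
import difflib
import itertools


def longest_common_substring_ratio(first, second):
    """Length of the longest common substring over the shorter string's length."""
    shorter = min(len(first), len(second))
    if shorter == 0:
        return 0.0  # assumption: empty strings share nothing
    matcher = difflib.SequenceMatcher(None, first, second, autojunk=False)
    match = matcher.find_longest_match(0, len(first), 0, len(second))
    return match.size / shorter


def lcs_scores(segments):
    """One ratio per pair of segments: C(n, 2) values for n segments."""
    return [
        longest_common_substring_ratio(first, second)
        for first, second in itertools.combinations(segments, 2)
    ]


def accept_lcs(scores, threshold=0.9, require_all=True):
    """Accept when all ratios (or at least one) are below the threshold."""
    check = all if require_all else any
    return check(score < threshold for score in scores)
```

With the default threshold of 0.9, identical segments (ratio 1.0) are rejected, which makes the filter a simple guard against untranslated copies.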
SimilarityFilter
Filter segments based on string or word-sequence similarity computed from the Levenshtein distance.
Parameters:
threshold: filter segments if the similarity is equal to or above the threshold (optional; default 0.9)
weights: a list of three integers corresponding to the costs of three edit operations: insertion, deletion, substitution (optional; default [1, 1, 1])
unit: type of unit for calculating the distance (optional; word for words or any whitespace-separated units, and character or char for characters; the default is char)
lowercase: lowercase strings as preprocessing (default false)
require_all: if True, all similarities (for pairs of n segments) have to be below the threshold; otherwise at least one of the similarities has to be below the threshold
The returned scores are normalized similarities (1 - edit distance / max edit distance) between the compared sequences for all language pairs. For n-lingual input, the scores will include C(n, 2) values.
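The following pure-Python sketch shows one way to compute such a normalized similarity with weighted edit operations. It is illustrative only: the function names are hypothetical, and the upper bound used as the "max edit distance" (the cheaper of deleting and re-inserting everything, or substituting as much as possible) is an assumption rather than the library's documented definition.

```python
def edit_distance(first, second, weights=(1, 1, 1)):
    """Weighted Levenshtein distance for turning `first` into `second`."""
    insertion, deletion, substitution = weights
    previous = [j * insertion for j in range(len(second) + 1)]
    for i, item_a in enumerate(first, 1):
        current = [i * deletion]
        for j, item_b in enumerate(second, 1):
            current.append(min(
                previous[j] + deletion,       # drop item_a
                current[j - 1] + insertion,   # add item_b
                previous[j - 1] + (0 if item_a == item_b else substitution),
            ))
        previous = current
    return previous[-1]


def max_edit_distance(len_a, len_b, weights=(1, 1, 1)):
    """Assumed upper bound on the weighted distance for the given lengths."""
    insertion, deletion, substitution = weights
    rewrite = len_a * deletion + len_b * insertion
    if len_a >= len_b:
        substitute = len_b * substitution + (len_a - len_b) * deletion
    else:
        substitute = len_a * substitution + (len_b - len_a) * insertion
    return min(rewrite, substitute)


def similarity(first, second, weights=(1, 1, 1), unit="char", lowercase=False):
    """Normalized similarity: 1 - edit distance / max edit distance."""
    if lowercase:
        first, second = first.lower(), second.lower()
    if unit == "word":
        first, second = first.split(), second.split()
    maximum = max_edit_distance(len(first), len(second), weights)
    if maximum == 0:
        return 1.0  # two empty sequences are treated as identical
    return 1 - edit_distance(first, second, weights) / maximum
```

With unit weights, "kitten" and "sitting" have an edit distance of 3 and a maximum of 7, giving a similarity of 1 - 3/7.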
RepetitionFilter
Filter segments with repeated content. Useful e.g. for filtering data generated by a low-quality NMT model.
Parameters:
threshold: number of repetitions required to activate the filter (optional; default 2)
min_length: minimum number of characters in the repeated sequence (optional; default 3)
max_length: maximum number of characters in the repeated sequence (optional; default 100)
The returned scores are the numbers of repetitions if at least threshold repetitions were found (first occurrence of the string is not counted), or zero if no repetitions were found, or all were below the threshold. The returned number of repetitions is for the first match, and it is possible that the segment contains longer repetitions.
There may be optional space character(s) between the repeated strings; these are not counted toward the length. The repeated string cannot start with a whitespace character but is otherwise unrestricted.
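The behaviour described above can be approximated with a single backreference pattern. The sketch below uses the standard re module; the function name and the exact pattern are assumptions for illustration, not the filter's actual expression.

```python
import re


def repetition_score(segment, threshold=2, min_length=3, max_length=100):
    """Number of repetitions in the first match, or 0 if none reach the threshold."""
    # Group 1: a non-whitespace character followed by as few characters as
    # possible, so that the group is min_length..max_length characters long.
    # It must then recur at least `threshold` times, optionally after spaces.
    pattern = re.compile(
        r"(\S.{%d,%d}?)(?: *\1){%d,}" % (min_length - 1, max_length - 1, threshold)
    )
    match = pattern.search(segment)
    if match is None:
        return 0
    # The first occurrence of the string is not counted as a repetition.
    return match.group(0).count(match.group(1)) - 1
```

For instance, "abc abc abc" yields 2 with the default parameters, whereas "abc abc" yields 0 because a single repetition stays below the threshold.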
RegExpFilter
Filter out segments that match (or do not match) an arbitrary regular expression.
Parameters:
regexps: a regular expression or a list of expressions to match
accept_match: accept matching segments instead of rejecting (default false)
You can either provide a single regexp or one for each language in the
parallel data. If accept_match is false, the pair is accepted only
if none of the segments matches the corresponding regular expression. If
accept_match is true, the pair is accepted only if all segments
match the corresponding regular expression.
The regex module is used for the regular expressions.
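The acceptance rule can be sketched as follows. To stay self-contained, the example uses the standard re module instead of regex, and it assumes that a "match" means the pattern is found anywhere in the segment (search semantics); both the function names and that interpretation are assumptions.

```python
import re


def regexp_scores(pair, regexps):
    """One boolean per segment: does it match its regular expression?"""
    if isinstance(regexps, str):
        # A single expression is applied to every segment.
        regexps = [regexps] * len(pair)
    return [
        bool(re.search(pattern, segment))
        for pattern, segment in zip(regexps, pair)
    ]


def accept_regexp(scores, accept_match=False):
    """Accept if all segments match (accept_match) or if none matches."""
    return all(scores) if accept_match else not any(scores)
```

For example, with the single expression "https?://" and accept_match left at false, a pair is rejected as soon as either segment contains a URL.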