Special character and similarity filters

HtmlTagFilter

Filter segments based on whether they contain HTML tags or not.

The returned scores are two boolean values indicating whether the segments contain HTML tags. In filtering, a segment pair is accepted if none of the segments contains HTML tags.
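As a rough illustration of the accept rule above, the following sketch detects tags with the standard library's html.parser. The text does not say how the filter itself detects HTML, so the parser choice and the names contains_html and accept_pair are assumptions made for this example only.

```python
from html.parser import HTMLParser


class _TagDetector(HTMLParser):
    """Records whether any start or end tag was seen (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        self.found = True

    def handle_endtag(self, tag):
        self.found = True


def contains_html(segment):
    # Hypothetical helper: True if the parser reports at least one tag.
    detector = _TagDetector()
    detector.feed(segment)
    detector.close()
    return detector.found


def accept_pair(segments):
    # One boolean per segment; the pair is accepted only if none is True.
    scores = [contains_html(seg) for seg in segments]
    return not any(scores)
```

For instance, accept_pair(["<b>Hello</b>", "Hei"]) rejects the pair because the first segment contains a tag.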

TerminalPunctuationFilter

Filter segments based on a penalty score with respect to the co-occurrence of terminal punctuation marks (‘.’, ‘…’, ‘?’, ‘!’) in source and target segments [Vázquez et al., 2019]. The score is computed as follows: the initial score is the absolute difference between the source and target terminal punctuation counts; the score is then incremented by the number of terminal punctuation marks beyond the first occurrence in each of the two segments; finally, the score is updated with score=-log(score+1). The score for the greatest co-occurrence is 0, and smaller values indicate a greater penalty.

This filter works only for bilingual input.

Parameters:

threshold: minimum score threshold (default -2)

The returned score is a single terminal punctuation score. In filtering, the score has to be equal to or greater than the minimum threshold.
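The scoring steps described above can be sketched as follows. This is a minimal sketch written from the description, not the library's code; the function name terminal_punctuation_score is hypothetical.

```python
import math

# Terminal punctuation marks listed in the description above.
TERMINAL_MARKS = ('.', '…', '?', '!')


def terminal_punctuation_score(source, target):
    # Count terminal punctuation marks in each segment.
    src_count = sum(1 for ch in source if ch in TERMINAL_MARKS)
    tgt_count = sum(1 for ch in target if ch in TERMINAL_MARKS)
    # Step 1: absolute difference of the counts.
    score = abs(src_count - tgt_count)
    # Step 2: add the number of marks beyond the first in each segment.
    score += max(src_count - 1, 0) + max(tgt_count - 1, 0)
    # Step 3: map the penalty so that 0 is best and lower is worse.
    return -math.log(score + 1)
```

With one terminal mark on each side the penalty is 0. A source with two marks and a target with one gives -log(3), about -1.10, which still passes the default threshold of -2.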

NonZeroNumeralsFilter

Filter segments based on a similarity measure of numerals between the segments with zeros removed [Vázquez et al., 2019]. Non-zero numerals are extracted from all segments preserving the relative order of the numerals. The similarity score between the numeral sequences is produced with SequenceMatcher.ratio() from Python’s difflib library.

Parameters:

threshold: minimum score threshold (default 0.5)
require_all: if True, all scores (for pairs of n segments) have to reach the threshold; otherwise at least one of the scores has to reach the threshold

The returned value is a list of similarity scores for all language pairs. For n-lingual input, the scores will include C(n, 2) values. In filtering, the pairwise scores have to be equal to or greater than the minimum threshold: all of them if require_all is true, otherwise at least one.
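A minimal sketch of this scoring, assuming that "numerals" means the individual digits 1 to 9 in each segment. The function names nonzero_numeral_scores and accept are hypothetical; only difflib.SequenceMatcher.ratio() comes from the description.

```python
import difflib
import itertools


def nonzero_numeral_scores(segments):
    # Keep the non-zero digits of each segment in their original order.
    numerals = [[ch for ch in seg if ch in "123456789"] for seg in segments]
    # One ratio per unordered pair of segments: C(n, 2) values in total.
    return [
        difflib.SequenceMatcher(None, a, b).ratio()
        for a, b in itertools.combinations(numerals, 2)
    ]


def accept(scores, threshold=0.5, require_all=True):
    # require_all: every pair must reach the threshold; otherwise one suffices.
    passed = [score >= threshold for score in scores]
    return all(passed) if require_all else any(passed)
```

Two segments without any non-zero digits compare as identical, since SequenceMatcher.ratio() returns 1.0 for two empty sequences.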

LongestCommonSubstringFilter

Filter segments based on the normalized length of the longest common substring.

Parameters:

threshold: filter segments if the normalized length is equal to or above the threshold (optional; default 0.9)
require_all: if True, all ratios (for pairs of n segments) have to be below the threshold; otherwise at least one of the ratios has to be below the threshold

Returned scores are ratios between the length of the longest common substring and the length of the shorter of the compared strings for all language pairs. For n-lingual input, the scores will include C(n, 2) values.
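The ratio can be illustrated with difflib from the standard library. This is a sketch under the assumption that the comparison is done on characters and that an empty segment yields a ratio of 0; the name lcs_ratios is hypothetical.

```python
import difflib
import itertools


def lcs_ratios(segments):
    ratios = []
    for seg1, seg2 in itertools.combinations(segments, 2):
        # autojunk=False keeps frequent characters in long strings.
        matcher = difflib.SequenceMatcher(None, seg1, seg2, autojunk=False)
        match = matcher.find_longest_match(0, len(seg1), 0, len(seg2))
        shorter = min(len(seg1), len(seg2))
        # Normalize by the shorter string; assume 0.0 for empty input.
        ratios.append(match.size / shorter if shorter else 0.0)
    return ratios
```

For "abcdef" and "xxcdexx" the longest common substring is "cde", so the ratio is 3 / 6 = 0.5, which is below the default threshold of 0.9 and would not trigger the filter.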

SimilarityFilter

Filter segments by string or word sequence similarity computed from the Levenshtein distance.

Parameters:

threshold: filter segments if the similarity is equal to or above the threshold (optional, default 0.9)
weights: a list of three integers corresponding to the costs of three edit operations: insertion, deletion, substitution (optional; default [1, 1, 1])
unit: type of unit for calculating the distance (optional; word for words or any whitespace-separated units, and character or char for characters; the default is char)
lowercase: lowercase strings as preprocessing (default false)
require_all: if True, all similarities (for pairs of n segments) have to be below the threshold; otherwise at least one of the similarities has to be below the threshold

The returned scores are normalized similarities (1 - edit distance / max edit distance) between the compared sequences for all language pairs. For n-lingual input, the scores will include C(n, 2) values.
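The normalized similarity can be sketched for the default weights [1, 1, 1]. Under unit costs this sketch takes the maximum edit distance to be the length of the longer sequence; how the maximum is derived for other weights is not covered here. The function names are hypothetical, and the plain dynamic-programming distance stands in for whatever implementation the filter actually uses.

```python
def levenshtein(a, b):
    # Classic dynamic programming over two rows, with unit costs for
    # insertion, deletion, and substitution.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def similarity(seg1, seg2, unit="char", lowercase=False):
    if lowercase:
        seg1, seg2 = seg1.lower(), seg2.lower()
    if unit == "word":
        # Compare whitespace-separated units instead of characters.
        seq1, seq2 = seg1.split(), seg2.split()
    else:
        seq1, seq2 = seg1, seg2
    longest = max(len(seq1), len(seq2))
    if longest == 0:
        return 1.0  # assumption: two empty sequences count as identical
    return 1 - levenshtein(seq1, seq2) / longest
```

For example, "kitten" and "sitting" are three edits apart, so their character-level similarity is 1 - 3/7, roughly 0.57.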

RepetitionFilter

Filter segments with repeated content. Useful e.g. for filtering data generated by a low-quality NMT model.

Parameters:

threshold: number of repetitions required to activate the filter (optional, default 2)
min_length: minimum number of characters in the repeated sequence (optional, default 3)
max_length: maximum number of characters in the repeated sequence (optional, default 100)

The returned scores are the numbers of repetitions if at least threshold repetitions were found (first occurrence of the string is not counted), or zero if no repetitions were found, or all were below the threshold. The returned number of repetitions is for the first match, and it is possible that the segment contains longer repetitions.

There may be optional space character(s) between the repeated strings that are not counted towards the length. The repeated string cannot start with a whitespace character but is not limited otherwise.
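One way to express these rules is a backreference pattern. The sketch below uses the standard re module and a hypothetical function count_repetitions; the exact pattern used by the filter is not given in the text, so treat this as an approximation of the described behavior.

```python
import re


def count_repetitions(segment, threshold=2, min_length=3, max_length=100):
    # Group 1: a string of min_length..max_length characters that starts
    # with a non-whitespace character (lazy, so shorter units are tried first).
    # Then: at least `threshold` further copies, optionally preceded by spaces.
    pattern = r'(\S.{%d,%d}?)(?: *\1){%d,}' % (
        min_length - 1, max_length - 1, threshold)
    match = re.search(pattern, segment)
    if match is None:
        return 0, None
    repeated = match.group(1)
    # The first occurrence is not counted as a repetition.
    return match.group(0).count(repeated) - 1, repeated
```

With the defaults, "abc abc abc" yields two repetitions of "abc", while "abc abc" yields zero because a single repetition is below the threshold.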

RegExpFilter

Filter out segments that match (or do not match) an arbitrary regular expression.

Parameters:

regexps: a regular expression or a list of expressions to match
accept_match: accept matching segments instead of rejecting (default false)

You can either provide a single regexp or one for each language in the parallel data. If accept_match is false, the pair is accepted only if none of the segments matches the corresponding regular expression. If accept_match is true, the pair is accepted only if all segments match the corresponding regular expression.

The regex module is used for the regular expressions.
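The accept logic can be sketched as below. The sketch uses the standard re module rather than regex, and it assumes that "match" means a match anywhere in the segment (re.search); the function name regexp_accept is hypothetical.

```python
import re


def regexp_accept(segments, regexps, accept_match=False):
    # A single expression is applied to every segment.
    if isinstance(regexps, str):
        regexps = [regexps] * len(segments)
    matches = [
        re.search(rx, seg) is not None
        for rx, seg in zip(regexps, segments)
    ]
    if accept_match:
        # Accept only if every segment matches its expression.
        return all(matches)
    # Accept only if no segment matches its expression.
    return not any(matches)
```

For example, regexp_accept(["hello 1", "hei"], r"\d") rejects the pair because the first segment contains a digit.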