Custom filters
You can also import your own filters by defining the module key in
the filter configuration entries.
The custom filters should inherit the abstract base class FilterABC
from the opusfilter package. They should implement two abstract
methods, score and accept, and one abstract property,
score_direction. Additionally, for filters with adjustable
thresholds, defining accept_threshold and reject_threshold
properties is recommended.
The score method is a generator that takes an iterator over tuples
of parallel sentences, and yields a score object for each pair. The
score may either be a single number, or if multiple score values need
to be yielded, a dictionary that has the numbers as values.
The accept method takes a single output yielded by the score
method, and returns whether the sentence pair should be accepted based
on the score.
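As a minimal sketch of this contract, the hypothetical class below yields a dictionary with two score values per pair and accepts a pair only if both values pass. It is a plain Python class rather than a FilterABC subclass, so that it runs without opusfilter installed; the class name, the score keys, and the max_chars parameter are illustrative assumptions.

```python
class LengthScoreSketch:
    """Hypothetical stand-in showing the score/accept contract (not a FilterABC subclass)."""

    def __init__(self, max_chars=10):
        # Illustrative parameter: maximum number of characters per segment
        self.max_chars = max_chars

    def score(self, pairs):
        # Generator: yields one score object per sentence pair
        for src, tgt in pairs:
            # Several values are needed, so yield a dictionary of numbers
            yield {"src": len(src), "tgt": len(tgt)}

    def accept(self, score):
        # Takes a single item yielded by score() and returns a boolean
        return all(length <= self.max_chars for length in score.values())


sketch = LengthScoreSketch(max_chars=10)
pairs = [("hello", "hei"), ("a very long sentence", "lyhyt")]
scores = list(sketch.score(iter(pairs)))
decisions = [sketch.accept(score) for score in scores]
```

Here the second pair is rejected because its first segment is longer than max_chars.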
The score_direction should be one of the following constants defined
in the opusfilter module depending on the output of the score()
method:
- CLEAN_LOW: scores below a threshold parameter indicate clean data
- CLEAN_HIGH: scores above a threshold parameter indicate clean data
- CLEAN_BETWEEN: scores between minimum and maximum thresholds indicate clean data
- CLEAN_TRUE: score value True indicates clean data
- CLEAN_FALSE: score value False indicates clean data
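The sketch below spells out how each direction relates a score to a decision. The string constants are stand-ins for the opusfilter constants so that the snippet is self-contained, the helper is_clean is hypothetical, and the handling of boundary values (strict for single thresholds, inclusive for ranges) is an assumption rather than documented behavior.

```python
# Stand-in names mirroring the constants listed above (assumption: the real
# values in opusfilter may differ; only the semantics are illustrated)
CLEAN_LOW, CLEAN_HIGH, CLEAN_BETWEEN = "CLEAN_LOW", "CLEAN_HIGH", "CLEAN_BETWEEN"
CLEAN_TRUE, CLEAN_FALSE = "CLEAN_TRUE", "CLEAN_FALSE"


def is_clean(direction, score, threshold=None):
    """Hypothetical helper: does `score` indicate clean data for `direction`?"""
    if direction == CLEAN_LOW:
        return score < threshold
    if direction == CLEAN_HIGH:
        return score > threshold
    if direction == CLEAN_BETWEEN:
        low, high = threshold  # (minimum, maximum) thresholds
        return low <= score <= high
    if direction == CLEAN_TRUE:
        return score is True
    if direction == CLEAN_FALSE:
        return score is False
    raise ValueError(f"unknown direction: {direction}")
```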
If the filter requires any parameters (e.g. score thresholds for the
accept method), the class should also implement the __init__
method. Arbitrary keyword arguments should be accepted (with
**kwargs), and the __init__ method of the base class (FilterABC)
should be called with the remaining keyword arguments. The keyword
argument name is reserved for giving names to the filters, and
workdir for a location for non-temporary files.
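The forwarding pattern can be sketched as follows. BaseSketch is a hypothetical stand-in for FilterABC that only stores the two reserved arguments; its signature and default values are assumptions made so that the snippet runs without opusfilter installed.

```python
class BaseSketch:
    """Hypothetical stand-in for the base class; it only stores the reserved arguments."""

    def __init__(self, name=None, workdir="", **kwargs):
        self.name = name
        self.workdir = workdir


class ThresholdFilterSketch(BaseSketch):
    """Hypothetical filter with one parameter of its own."""

    def __init__(self, threshold=0.5, **kwargs):
        # Consume the filter's own parameter here ...
        self.threshold = threshold
        # ... and pass everything else (e.g. name, workdir) on to the base class
        super().__init__(**kwargs)


flt = ThresholdFilterSketch(threshold=0.3, name="my_filter", workdir="work")
```

Because the subclass never names the reserved arguments itself, they reach the base class unchanged.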
For compatibility with the included automatic configuration generation tools, the following should also be considered:
- If there is a threshold value used by accept, the argument should be named threshold (a single global threshold) or thresholds (multiple thresholds, e.g. one per language). The accept_threshold and reject_threshold properties should have threshold values that force all inputs to be accepted or rejected, respectively. That is, a sensible threshold value will always be between accept_threshold and reject_threshold.
- If there are lower and upper thresholds used by accept (i.e. score_direction is CLEAN_BETWEEN), the respective arguments should be named min_threshold and max_threshold, or min_length and max_length. The accept_threshold and reject_threshold properties should have tuples of two threshold values (for lower and upper thresholds) that force all inputs to be accepted or rejected, respectively.
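For the two-threshold case, a minimal sketch of such tuples could look like the following. The class is a hypothetical plain-Python stand-in (not a FilterABC subclass), and the concrete tuple values are one possible choice that satisfies the requirement, not values taken from the library.

```python
import math


class LengthRangeSketch:
    """Hypothetical CLEAN_BETWEEN-style filter (plain class, not a FilterABC subclass)."""

    # (lower, upper) pair that lets every length through
    accept_threshold = (0, math.inf)
    # (lower, upper) pair that no length can satisfy
    reject_threshold = (math.inf, 0)

    def __init__(self, min_length=1, max_length=100):
        self.min_length = min_length
        self.max_length = max_length

    def score(self, pairs):
        for pair in pairs:
            # One length (in characters) per segment in the pair
            yield [len(segment) for segment in pair]

    def accept(self, score):
        return all(self.min_length <= length <= self.max_length for length in score)


pairs = [("", "x"), ("some text", "jotain")]
accept_all = LengthRangeSketch(*LengthRangeSketch.accept_threshold)
reject_all = LengthRangeSketch(*LengthRangeSketch.reject_threshold)
```

Any sensible (min_length, max_length) setting then lies between these two extremes.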
Based on the score and accept methods, the abstract class
FilterABC implements the following three generators that take
an iterator over segment pairs as input:
- decisions yields results of the accept method
- filter yields only accepted segments
- filterfalse yields only rejected segments
These should not be redefined except for a good reason.
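To illustrate why redefining them is rarely needed, the sketch below shows one way the three generators can be derived from score and accept alone. It is a hypothetical stand-in, not the actual FilterABC implementation; the toy score and threshold are assumptions chosen only for the demonstration.

```python
import itertools


class GeneratorsSketch:
    """Hypothetical stand-in showing how the three generators can follow from score/accept."""

    def score(self, pairs):
        for src, tgt in pairs:
            yield abs(len(src) - len(tgt))  # toy score: length difference

    def accept(self, score):
        return score < 5  # toy threshold

    def decisions(self, pairs):
        # One boolean per input pair
        for score in self.score(pairs):
            yield self.accept(score)

    def filter(self, pairs):
        # Duplicate the iterator: one copy for scoring, one for output
        scored, output = itertools.tee(pairs)
        for pair, ok in zip(output, self.decisions(scored)):
            if ok:
                yield pair

    def filterfalse(self, pairs):
        scored, output = itertools.tee(pairs)
        for pair, ok in zip(output, self.decisions(scored)):
            if not ok:
                yield pair


data = [("short", "lyhyt"), ("a much longer segment", "x")]
sketch = GeneratorsSketch()
```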
The example below shows code for a simple filter that calculates the proportion of uppercase letters in the sentences, and accepts the pair only if all sentences have less than 50% (or a given threshold) of uppercase characters:
import opusfilter


class UppercaseFilter(opusfilter.FilterABC):

    score_direction = opusfilter.CLEAN_LOW
    # Ratios are at most 1, so a threshold just above 1 accepts everything
    accept_threshold = 1 + 10**-6
    # No ratio is below 0, so a threshold of 0 rejects everything
    reject_threshold = 0

    def __init__(self, threshold=0.5, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)

    def uppercase_ratio(self, sentence):
        # Proportion of uppercase characters; 0 for an empty sentence
        length = len(sentence)
        if length > 0:
            return sum(1 for char in sentence if char.isupper()) / length
        return 0

    def score(self, pairs):
        # Yield one list of ratios (one per sentence) for each pair
        for pair in pairs:
            yield [self.uppercase_ratio(sentence) for sentence in pair]

    def accept(self, score):
        return all(ratio < self.threshold for ratio in score)
Assuming that the above code is in a module named customfilter in
the Python environment (e.g. save the code as customfilter.py and add
the directory that contains it to the PYTHONPATH environment variable),
it can be selected in the filter configurations as follows:
steps:
  ...
  - type: filter
    parameters:
      ...
      filters:
        - UppercaseFilter:
            threshold: 0.5
          module: customfilter
If a filter requires external resource files (e.g. for model
parameters), or stores non-temporary files itself, they should be
located in the path defined by the attribute workdir. The
implementation of the filter should join workdir with relative file
paths using os.path.join().
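A minimal sketch of this path handling is shown below. The class is a hypothetical stand-in that sets workdir itself (in a real filter the attribute would come from the base class), and the model_file parameter and the file name model.bin are illustrative assumptions.

```python
import os


class ResourceFilterSketch:
    """Hypothetical stand-in; in a real filter, workdir would be set by the base class."""

    def __init__(self, model_file="model.bin", workdir=""):
        self.workdir = workdir
        # Resolve the relative file name against workdir
        self.model_path = os.path.join(self.workdir, model_file)


flt = ResourceFilterSketch(model_file="model.bin", workdir="work")
```

With an empty workdir, os.path.join() returns the relative path unchanged, so the file is looked up relative to the current directory.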