Custom filters
You can also import your own filters by defining the module key in the filter configuration entries.
The custom filters should inherit the abstract base class FilterABC from the opusfilter package. They should implement two abstract methods, score and accept, and one abstract property, score_direction. Additionally, for filters with adjustable thresholds, defining accept_threshold and reject_threshold properties is recommended.
The score method is a generator that takes an iterator over tuples of parallel sentences, and yields a score object for each pair. The score may either be a single number, or if multiple score values need to be yielded, a dictionary that has the numbers as values.
The accept method takes a single output yielded by the score method, and returns whether the sentence pair should be accepted based on the score.
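The contract between the two methods can be sketched as follows. This is only an illustration under stated assumptions: WordCountSketch, max_words, and the dictionary keys "source" and "target" are hypothetical names, and the class deliberately does not inherit FilterABC so that it runs without opusfilter installed.

```python
class WordCountSketch:
    """Hypothetical filter-like class; not derived from FilterABC here"""

    def __init__(self, max_words=10):
        self.max_words = max_words

    def score(self, pairs):
        # Generator: one score object per input pair, in input order
        for source, target in pairs:
            # Several values per pair, so yield a dictionary of numbers
            yield {"source": len(source.split()), "target": len(target.split())}

    def accept(self, score):
        # Takes a single object yielded by score() and returns a boolean
        return all(count <= self.max_words for count in score.values())
```

Note that accept never sees the sentences themselves, only the score object, so everything needed for the decision has to be included in what score yields.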
The score_direction should be one of the following constants defined in the opusfilter module depending on the output of the score() method:
CLEAN_LOW: scores below a threshold parameter indicate clean data
CLEAN_HIGH: scores above a threshold parameter indicate clean data
CLEAN_BETWEEN: scores between minimum and maximum thresholds indicate clean data
CLEAN_TRUE: score value True indicates clean data
CLEAN_FALSE: score value False indicates clean data
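How a score direction relates to a threshold comparison can be sketched roughly as below. This is not opusfilter code: the string values are hypothetical stand-ins for the constants defined in the opusfilter module, is_clean is an invented helper, and the treatment of the boundary values (strict for a single threshold, inclusive for a range) is an assumption made for the illustration.

```python
# Hypothetical stand-ins; the real constants are defined in opusfilter
CLEAN_LOW, CLEAN_HIGH, CLEAN_BETWEEN = "low", "high", "between"
CLEAN_TRUE, CLEAN_FALSE = "true", "false"


def is_clean(score, direction, threshold=None):
    """Return True if the score indicates clean data for the direction"""
    if direction == CLEAN_LOW:
        return score < threshold
    if direction == CLEAN_HIGH:
        return score > threshold
    if direction == CLEAN_BETWEEN:
        low, high = threshold  # (minimum, maximum)
        return low <= score <= high
    if direction == CLEAN_TRUE:
        return score is True
    if direction == CLEAN_FALSE:
        return score is False
    raise ValueError(f"Unknown score direction: {direction}")
```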
If the filter requires any parameters (e.g. score thresholds for the accept method), the class should also implement the __init__ method. Arbitrary keyword arguments should be accepted (with **kwargs), and the __init__ method of the base class (FilterABC) should be called with the remaining keyword arguments. The keyword argument name is reserved for giving names to the filters and workdir for a location for non-temporary files.
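The keyword-argument forwarding pattern can be sketched without opusfilter as shown here. BaseSketch is a hypothetical stand-in for FilterABC: it is only assumed that the base constructor takes name and workdir as keyword arguments, and the unused attribute is invented to make the leftover arguments visible.

```python
class BaseSketch:
    """Hypothetical stand-in for FilterABC; only the constructor is modeled"""

    def __init__(self, name=None, workdir="", **kwargs):
        # Assumption: the base class stores the two reserved arguments
        self.name = name
        self.workdir = workdir
        self.unused = kwargs  # anything no class in the chain consumed


class RatioFilterSketch(BaseSketch):
    def __init__(self, threshold=0.5, **kwargs):
        # Consume the filter's own parameter first ...
        self.threshold = threshold
        # ... then forward the rest (e.g. name, workdir) to the base class
        super().__init__(**kwargs)
```

With this pattern, RatioFilterSketch(threshold=0.3, name="upper", workdir="work") sets threshold on the subclass and leaves name and workdir to the base class, so the subclass does not need to know about the reserved arguments.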
For compatibility with the included automatic configuration generation tools, the following should also be considered:
If there is a threshold value used by accept, the argument should be named threshold (a single global threshold) or thresholds (multiple thresholds, e.g. one per language). The accept_threshold and reject_threshold properties should have threshold values that force all inputs to be accepted or rejected, respectively. That is, a sensible threshold value will always be between accept_threshold and reject_threshold.
If there are lower and upper thresholds used by accept (i.e. score_direction is CLEAN_BETWEEN), the respective arguments should be named min_threshold and max_threshold or min_length and max_length. The accept_threshold and reject_threshold properties should have tuples of two threshold values (for lower and upper thresholds) that force all inputs to be accepted or rejected, respectively.
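What "force all inputs to be accepted or rejected" means can be illustrated with two small stand-in classes. Both are hypothetical and independent of opusfilter. The single-threshold case assumes scores in the range from 0 to 1 and a strict comparison; the tuple values (0, math.inf) and (math.inf, 0) in the range case are one possible choice for non-negative lengths, picked for this sketch rather than taken from the library.

```python
import math


class LowSketch:
    """Hypothetical CLEAN_LOW-style filter with scores between 0 and 1"""
    accept_threshold = 1 + 10**-6  # above any possible score: accepts all
    reject_threshold = 0           # no score is below zero: rejects all

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def accept(self, score):
        return score < self.threshold


class BetweenSketch:
    """Hypothetical CLEAN_BETWEEN-style filter on non-negative lengths"""
    accept_threshold = (0, math.inf)  # every length falls inside the range
    reject_threshold = (math.inf, 0)  # empty range: nothing falls inside

    def __init__(self, min_length=1, max_length=100):
        self.min_length = min_length
        self.max_length = max_length

    def accept(self, score):
        return self.min_length <= score <= self.max_length
```

Constructing LowSketch(threshold=LowSketch.accept_threshold) then accepts every score in the assumed range, and BetweenSketch(*BetweenSketch.reject_threshold) rejects every length, which is the behavior the two properties are meant to describe.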
Based on the score and accept methods, the abstract class FilterABC implements the following three generators that take an iterator over segment pairs as input:
decisions yields results of the accept method
filter yields only accepted segments
filterfalse yields only rejected segments
These should not be redefined except for a good reason.
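One way such generators could be derived from score and accept is sketched below. This is not the FilterABC implementation, only a plausible reconstruction: the class name, the word-count scoring, and the use of itertools.tee to read the input iterator twice are all assumptions of the sketch.

```python
import itertools


class GeneratorSketch:
    """Hypothetical stand-in showing generators built on score and accept"""

    def __init__(self, max_words=5):
        self.max_words = max_words

    def score(self, pairs):
        for pair in pairs:
            yield [len(segment.split()) for segment in pair]

    def accept(self, score):
        return all(count <= self.max_words for count in score)

    def decisions(self, pairs):
        # Yield the accept() result for every input pair
        for score in self.score(pairs):
            yield self.accept(score)

    def filter(self, pairs):
        # Duplicate the iterator: one copy for output, one for scoring
        output, scored = itertools.tee(pairs)
        for pair, decision in zip(output, self.decisions(scored)):
            if decision:
                yield pair

    def filterfalse(self, pairs):
        output, scored = itertools.tee(pairs)
        for pair, decision in zip(output, self.decisions(scored)):
            if not decision:
                yield pair
```

Because all three generators depend only on score and accept, a custom filter that implements those two methods gets consistent behavior from the inherited generators without further code.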
The example below shows code for a simple filter that calculates the proportion of uppercase characters in each sentence, and accepts the pair only if all sentences have less than 50% (or a given threshold) of uppercase characters:
import opusfilter


class UppercaseFilter(opusfilter.FilterABC):
    """Filter segments by the proportion of uppercase characters"""

    score_direction = opusfilter.CLEAN_LOW
    accept_threshold = 1 + 10**-6
    reject_threshold = 0

    def __init__(self, threshold=0.5, **kwargs):
        self.threshold = threshold
        super().__init__(**kwargs)

    def uppercase_ratio(self, sentence):
        """Return the proportion of uppercase characters in the sentence"""
        length = len(sentence)
        if length > 0:
            return sum(1 for char in sentence if char.isupper()) / length
        return 0

    def score(self, pairs):
        for pair in pairs:
            yield [self.uppercase_ratio(sentence) for sentence in pair]

    def accept(self, score):
        return all(ratio < self.threshold for ratio in score)
Assuming that the above code is in a module named customfilter in the Python environment (e.g. save the code as customfilter.py and add the directory that contains it to the PYTHONPATH environment variable), it can be selected in the filter configurations as follows:
steps:

  ...

  - type: filter
    parameters:

      ...

      filters:

        - UppercaseFilter:
            threshold: 0.5
          module: customfilter
If a filter requires external resource files (e.g. for model parameters), or stores non-temporary files itself, they should be located in the path defined by the attribute workdir. The implementation of the filter should join workdir with relative file paths using os.path.join().
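A minimal sketch of the path handling, assuming a hypothetical filter with a model parameter that holds a relative file name. ModelFilterSketch and model_path are invented names, and the class sets workdir itself only so that the example runs without opusfilter; in an actual filter the attribute would come from the base class.

```python
import os


class ModelFilterSketch:
    """Hypothetical filter that loads parameters from a file under workdir"""

    def __init__(self, model="model.bin", workdir=""):
        # In a real filter, workdir would be set by the base class __init__
        self.workdir = workdir
        # Resolve the relative file name against the work directory
        self.model_path = os.path.join(self.workdir, model)
```

With an empty workdir, os.path.join returns the file name unchanged, so the same code works whether or not a work directory has been configured.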