Alignment model filters
WordAlignFilter
Filter segments by word alignment scores using eflomal [Östling and Tiedemann, 2016].
Parameters:
src_threshold
: score threshold for source language (default 0)

tgt_threshold
: score threshold for target language (default 0)

src_tokenizer
: tokenizer for source language (optional; default null)

tgt_tokenizer
: tokenizer for target language (optional; default null)

model
: eflomal model type (optional; default 3)

priors
: priors for the alignment (optional; default null)

score_for_empty
: score value for empty input pairs (default -100)
A segment pair is accepted if scores for both directions are lower than the corresponding thresholds.
The supported tokenizers are listed in Tokenizer. For example, to
enable the Moses tokenizer, provide a tuple containing moses and an
appropriate two-letter language code, e.g. [moses, en] for English.
The eflomal model types are 1 for IBM1, 2 for IBM1 + HMM, and 3 for IBM1 + HMM + fertility. See github.com/robertostling/eflomal for details.
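As an illustration, a filter step using WordAlignFilter could look as follows in a YAML configuration. This is a sketch: the file names and threshold values are placeholders, and the thresholds should be tuned for your data.

```yaml
steps:
  - type: filter
    parameters:
      inputs: [corpus.en.gz, corpus.de.gz]
      outputs: [filtered.en.gz, filtered.de.gz]
      filters:
        - WordAlignFilter:
            # Placeholder thresholds; lower scores are better,
            # so a pair is accepted if both scores stay below these.
            src_threshold: -5
            tgt_threshold: -5
            src_tokenizer: [moses, en]
            tgt_tokenizer: [moses, de]
            model: 3
```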
See train_alignment for training priors. Use the same tokenizer and model parameters for training the priors as for the filter.
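A priors file might be produced with a training step along the following lines. This is only a sketch: the exact parameter names (src_data, tgt_data, parameters, output) are assumptions and should be checked against the train_alignment documentation.

```yaml
steps:
  - type: train_alignment
    parameters:
      src_data: corpus.en.gz
      tgt_data: corpus.de.gz
      # Tokenizer and model settings here should match the
      # WordAlignFilter configuration that will use the priors.
      parameters:
        src_tokenizer: [moses, en]
        tgt_tokenizer: [moses, de]
        model: 3
      output: alignment.priors
```

The resulting file can then be passed to WordAlignFilter via the priors parameter.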
Caveats:
eflomal
is a stochastic method and two runs will not produce exactly
the same result. Thus, if you include it in your pipeline, full
replicability of the end result is not possible.
Moreover, eflomal estimates the model parameters from the input data
even if you provide a priors file. Consequently, the size of the input
matters, and aligning the same data in chunks will not give the same
results as aligning it all at once. Thus the following may give worse
results than expected:
- Using score if the input size is larger than chunksize (default 100000)
- Using filter with filterfalse on if the input size is larger than chunksize (default 100000)
- Using parallel processing (n_jobs > 1) for score or filter, regardless of the other options