Automatic configuration generation
You can generate OpusFilter configuration files with the `opusfilter-autogen`
script. Currently, the script supports adding only a single filter step, with
a few options for determining the filter parameters. The usage description for
the script is as follows:
usage: opusfilter-autogen [-h] --files TEXTFILE [TEXTFILE ...]
                          [--langs LANGCODE [LANGCODE ...]]
                          [--scripts SCRIPT [SCRIPT ...]]
                          [--method {defaults,percentiles,clustering}]
                          [--sample-size INT] [--noisy-percentile FLOAT]
                          [--clusters INT] [--work-dir WORK_DIR]
                          [--inter-dir INTER_DIR] [--plot] [--list-defaults]
                          [--add-filter CLASS JSON] [--overwrite]
                          [-o CONFIGFILE]

Generate initial configuration based on parallel text data

options:
  -h, --help            show this help message and exit
  --files TEXTFILE [TEXTFILE ...]
                        parallel text input file(s)
  --langs LANGCODE [LANGCODE ...]
                        Language codes corresponding to the input files. If
                        omitted, LanguageIDFilters will not be used.
  --scripts SCRIPT [SCRIPT ...]
                        Alphabetic scripts (e.g. Latin) corresponding to the
                        input files. If omitted, CharacterScoreFilter will
                        not be used.
  --method {defaults,percentiles,clustering}
                        Method for selecting filter thresholds (default:
                        clustering)
  --sample-size INT     Max number of sentence pairs used for data-based
                        methods (default 100000)
  --noisy-percentile FLOAT
                        Proportion of the data considered to be noisy; only
                        for percentiles method (default 0.001)
  --clusters INT, -k INT
                        Number of clusters for the clustering method; try
                        increasing if too much data is clustered as noisy
                        (default 2)
  --work-dir WORK_DIR   Location of the source and target files for the
                        generated configuration (default work)
  --inter-dir INTER_DIR
                        Save intermediate files in this directory (use a
                        temporary directory if not given)
  --plot                Show a scatter plot of the clustering and histograms
                        of feature data distributions; only for the
                        clustering method
  --list-defaults       List default filters of the method to the output and
                        quit
  --add-filter CLASS JSON
                        Instead of using default filters, add a filter of
                        CLASS with JSON parameters object ("{}" for default
                        parameters). The class name may be followed by a dot
                        and a unique filter identifier in order to allow
                        multiple filters of the same class. Example:
                        --add-filter LanguageIDFilter.cld2 '{"id_method": "cld2"}'
  --overwrite           Overwrite existing intermediate files
  -o CONFIGFILE, --output CONFIGFILE
                        Output configuration file (default -)
The `--method` option selects how the filter parameters are determined. The
option `defaults` uses the default parameters defined in the filter classes.
The option `percentiles` assumes that a proportion of the data (set by
`--noisy-percentile`) is noisy, and sets the threshold for each filter
independently based on the corresponding score percentile. The `clustering`
option may be the most useful of the three and is described in more detail
below; however, it is applicable to a more limited set of filters.
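For example, to see which filters the `percentiles` method would use by default, and then to generate a configuration that treats 1% of the data as noisy (the file names are again placeholders):

```
opusfilter-autogen --files corpus.en corpus.fi --method percentiles --list-defaults
opusfilter-autogen --files corpus.en corpus.fi --method percentiles --noisy-percentile 0.01 -o config.yaml
```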
Unsupervised threshold selection for filters
This implements the method introduced by Aulamo et al. [2023]. It takes a
parallel corpus as input and tries to separate the clean and noisy samples in
order to generate threshold parameters for the filters. The currently
supported filters are `AlphabetRatioFilter`, `CharacterScoreFilter`,
`LanguageIDFilter`, `LengthRatioFilter`, `NonZeroNumeralsFilter`, and
`TerminalPunctuationFilter`, but this list will be expanded and made more
flexible in the future.
First, we remove duplicates and empty sentences from the input corpus. Next,
we take a subset (`--sample-size`, 100k sentence pairs by default) of the
corpus and produce scores for each sentence pair in the subset with the
previously mentioned filters. These scores are used as features for K-means
clustering to group the sentence pairs into clean and noisy pairs. The values
of the noisy cluster center are used as the filter threshold parameters in the
generated config file. If it looks like too many samples are clustered as
noisy, increasing the number of clusters (`--clusters`) may help.
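As a rough illustration of the idea only (not the actual OpusFilter implementation), the sketch below clusters a matrix of per-pair filter scores with scikit-learn's K-means and reads candidate thresholds off the noisy cluster center; the score matrix, the feature names, and the rule for picking the noisy cluster are simplified assumptions.

```python
# Illustrative sketch: cluster per-sentence-pair filter scores and use the
# noisy cluster's center as candidate filter thresholds. The score matrix
# and feature names below are placeholders, not OpusFilter's real features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

feature_names = ["AlphabetRatioFilter", "LengthRatioFilter", "TerminalPunctuationFilter"]
scores = np.random.rand(100_000, len(feature_names))  # placeholder score matrix

# Standardize the features so that no single score dominates the distances.
scaler = StandardScaler().fit(scores)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaler.transform(scores))

# Simplified rule: call the cluster with the lower mean standardized score
# "noisy"; in practice the direction depends on each filter's score semantics.
noisy = int(np.argmin(kmeans.cluster_centers_.sum(axis=1)))

# Map the noisy cluster center back to the original score scale and use its
# values as threshold candidates for the corresponding filters.
thresholds = dict(zip(feature_names, scaler.inverse_transform(kmeans.cluster_centers_)[noisy]))
print(thresholds)
```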
Figures of the clustering and histograms of the score distributions are
plotted when the `--plot` option is given. If you also want to save the
intermediate files, make sure to use the `--inter-dir` argument.
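For example, a clustering run that shows the plots and keeps the intermediate files in a hypothetical `autogen_tmp` directory could be invoked as:

```
opusfilter-autogen --files corpus.en corpus.fi --langs en fi --plot --inter-dir autogen_tmp -o config.yaml
```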
Note: The method should be considered experimental, and it is not expected to give good results on all corpora. If you try it, please consider giving feedback on the project's issues page.