Automatic configuration generation
You can generate OpusFilter config files with the opusfilter-autogen
script. Currently the script supports only adding a single filter
step, with a few options for determining the filter parameters.
The usage description for the script is as follows:
usage: opusfilter-autogen [-h] --files TEXTFILE [TEXTFILE ...]
                          [--langs LANGCODE [LANGCODE ...]]
                          [--scripts SCRIPT [SCRIPT ...]]
                          [--method {defaults,percentiles,clustering}]
                          [--sample-size INT] [--noisy-percentile FLOAT]
                          [--clusters INT] [--work-dir WORK_DIR]
                          [--inter-dir INTER_DIR] [--plot] [--list-defaults]
                          [--add-filter CLASS JSON] [--overwrite]
                          [-o CONFIGFILE]
Generate initial configuration based on parallel text data
options:
  -h, --help            show this help message and exit
  --files TEXTFILE [TEXTFILE ...]
                        parallel text input file(s)
  --langs LANGCODE [LANGCODE ...]
                        Language codes corresponding to the input files. If
                        omitted, LanguageIDFilters will not be used.
  --scripts SCRIPT [SCRIPT ...]
                        Alphabetic scripts (e.g. Latin) corresponding to the
                        input files. If omitted, CharacterScoreFilter will not
                        be used.
  --method {defaults,percentiles,clustering}
                        Method for selecting filter thresholds (default:
                        clustering)
  --sample-size INT     Max number of sentence pairs used for data-based
                        methods (default 100000)
  --noisy-percentile FLOAT
                        Proportion of the data considered to be noisy; only
                        for percentiles method (default 0.001)
  --clusters INT, -k INT
                        Number of clusters for the clustering method; try
                        increasing if too much data is clustered as noisy
                        (default 2)
  --work-dir WORK_DIR   Location of the source and target files for the
                        generated configuration (default work)
  --inter-dir INTER_DIR
                        Save intermediate files in this directory (use a
                        temporary directory if not given)
  --plot                Show a scatter plot of the clustering and histograms
                        of feature data distributions; only for the clustering
                        method
  --list-defaults       List default filters of the method to the output and
                        quit
  --add-filter CLASS JSON
                        Instead of using default filters, add a filter of
                        CLASS with JSON parameters object ("{}" for default
                        parameters). The class name may be followed by a dot
                        and a unique filter identifier in order to allow
                        multiple filters of the same class. Example: --add-
                        filter LanguageIDFilter.cld2 '{"id_method": "cld2"}'
  --overwrite           Overwrite existing intermediate files
  -o CONFIGFILE, --output CONFIGFILE
                        Output configuration file (default -)
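The JSON argument of --add-filter is easy to misquote on the command line. The following is a minimal sketch of assembling an invocation from Python so that the JSON is serialized and quoted automatically. The helper name build_command and the file names are hypothetical; only the flags and the filter names come from the usage text above.

```python
import json
import shlex


def build_command(files, langs, filters, output):
    """Assemble an opusfilter-autogen argument list (hypothetical helper)."""
    cmd = ["opusfilter-autogen", "--files", *files, "--langs", *langs]
    for name, params in filters:
        # An empty dict serializes to "{}", which requests default parameters.
        cmd += ["--add-filter", name, json.dumps(params)]
    cmd += ["-o", output]
    return cmd


cmd = build_command(
    files=["corpus.en", "corpus.fi"],   # hypothetical input files
    langs=["en", "fi"],
    filters=[
        # The suffix after the dot is the unique filter identifier.
        ("LanguageIDFilter.cld2", {"id_method": "cld2"}),
        ("LengthRatioFilter", {}),
    ],
    output="config.yaml",               # hypothetical output path
)
# Print a shell-safe version of the command for copying into a terminal.
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run without any shell quoting; shlex.join is only needed when the command is shown to a person or pasted into a shell.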
The --method option determines how the filter parameters are chosen.
The defaults method uses the default parameters defined in the filter
classes. The percentiles method assumes that a proportion of the
data (set by --noisy-percentile) is noisy, and sets the threshold
for each filter independently based on that percentile. The
clustering method may be the most useful of the three and is described
in more detail below. However, it is applicable to a more limited set
of filters.
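To illustrate the idea behind the percentiles method, here is a minimal sketch that picks a cutoff for each filter independently, so that roughly the given proportion of scores falls on the noisy side. It is not OpusFilter's implementation, which may compute percentiles differently. The function names, the input layout and the score direction assigned to each filter are assumptions made for the example.

```python
def percentile_threshold(scores, noisy_fraction, higher_is_cleaner=True):
    """Return a cutoff with about noisy_fraction of the scores on the noisy side."""
    if not scores:
        raise ValueError("scores must not be empty")
    # Put the noisiest scores first: lowest first if high scores mean clean.
    ordered = sorted(scores, reverse=not higher_is_cleaner)
    # The first score that is not counted as noisy becomes the threshold.
    index = min(int(len(ordered) * noisy_fraction), len(ordered) - 1)
    return ordered[index]


def thresholds_by_filter(scores_by_filter, noisy_fraction=0.001):
    """Apply the cutoff to each filter separately (hypothetical input layout).

    scores_by_filter maps a filter name to (scores, higher_is_cleaner).
    """
    return {
        name: percentile_threshold(scores, noisy_fraction, higher_is_cleaner)
        for name, (scores, higher_is_cleaner) in scores_by_filter.items()
    }


# Toy scores; the direction chosen for each filter is an assumption.
example = {
    "LanguageIDFilter": ([i / 10 for i in range(10)], True),
    "LengthRatioFilter": ([1.0 + i for i in range(10)], False),
}
print(thresholds_by_filter(example, noisy_fraction=0.1))
```

Because each filter is handled on its own, a sentence pair can fall below the cutoff of one filter and still pass all the others, so the proportion of pairs removed by the combined filters can exceed the per-filter proportion.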
Unsupervised threshold selection for filters
The clustering method implements the approach introduced by Aulamo et
al. [2023]. It takes a parallel corpus as input and tries to separate
the clean and noisy samples to generate threshold parameters for filters. The
currently supported filters are AlphabetRatioFilter,
CharacterScoreFilter, LanguageIDFilter, LengthRatioFilter,
NonZeroNumeralsFilter and TerminalPunctuationFilter, but this list
will be expanded and made more flexible in the future.
First, we remove duplicates and empty sentences from the input
corpus. Next, we take a subset (--sample-size, 100k sentence pairs
by default) of the corpus and produce scores for each sentence pair in
the subset with the previously mentioned filters. These scores are
used as features for K-means clustering to group the sentence pairs
into clean and noisy pairs. The values of the noisy cluster center are
used as the filter threshold parameters in the generated config file.
If it looks like too many samples are clustered as noisy, increasing
the number of clusters (--clusters) may help.
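The steps above can be sketched in plain Python as follows. This is a simplified illustration, not the code used by opusfilter-autogen. It assumes that the scores are already computed, lie in comparable ranges, and are oriented so that higher values mean cleaner data. It uses a deterministic initialization instead of the randomized one typical of K-means libraries, and treats the cluster with the lowest center as the noisy one. All function names are hypothetical.

```python
import random


def clean_pairs(pairs):
    """Drop pairs with an empty side and exact duplicates, keeping first occurrences."""
    seen, kept = set(), []
    for pair in pairs:
        if all(side.strip() for side in pair) and pair not in seen:
            seen.add(pair)
            kept.append(pair)
    return kept


def sample_pairs(pairs, sample_size=100000, seed=0):
    """Take a random subset of at most sample_size pairs."""
    if len(pairs) <= sample_size:
        return list(pairs)
    return random.Random(seed).sample(pairs, sample_size)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k=2, iterations=100):
    """Lloyd's algorithm with a deterministic start; returns (centers, labels)."""
    # Assumption: start from points spread evenly along the ordering by score sum.
    ordered = sorted(points, key=sum)
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Move the center to the mean of its members; an empty cluster
            # keeps its previous center.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return centers, labels


def noisy_center(centers):
    """Assumption: the center with the lowest scores represents the noisy cluster."""
    return min(centers, key=sum)
```

Each component of the tuple returned by noisy_center corresponds to one score column and would serve as the threshold for the filter that produced that column. With more than two clusters, the same selection still yields a single noisy center while the remaining clusters are treated as clean.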
With the --plot option, the script shows a scatter plot of the
clustering and histograms of the score distributions. If you also want
to save the intermediate files, use the --inter-dir argument.
Note: The method should be considered experimental, and it is not expected to give good results on all corpora. If you try it, please consider giving feedback on the project issues page.