Command line tools for analysis
Apart from the main scripts (opusfilter and opusfilter-cmd) and
configuration generation (opusfilter-autogen), the package also provides
command line tools for testing and analyzing the configurations and filters.
opusfilter-diagram
Draws a diagram (a directed acyclic graph) from an OpusFilter
configuration file using the graphviz library.
opusfilter-diagram [--rankdir {TB,LR}] FILE FILE
The --rankdir option changes the direction of the graph from
left-to-right (default) to top-to-bottom. If the output file ends
with .dot, the raw dot format is used; otherwise the graph is
rendered to the format indicated by the extension (e.g. PDF or PNG).
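The extension rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not OpusFilter's own code; the helper name output_format is hypothetical.

```python
import os

def output_format(path):
    """Return (format, rendered) for an output file name.

    Hypothetical helper: ".dot" means raw dot output, any other
    extension names the format the graph is rendered to.
    """
    ext = os.path.splitext(path)[1].lstrip(".")
    if not ext:
        raise ValueError(f"cannot infer an output format from {path!r}")
    return ext, ext != "dot"

for name in ("pipeline.dot", "pipeline.pdf", "pipeline.png"):
    fmt, rendered = output_format(name)
    print(f"{name}: {'rendered as ' + fmt if rendered else 'raw dot'}")
```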
opusfilter-duplicates
This is a simple script based on the remove_duplicates function that,
instead of filtering the data, prints out statistics of the duplicate
entries. You can either provide a single corpus (as one monolingual file
or multiple parallel files) to count the duplicates in it, or two corpora
to calculate the overlap between them. The syntax of
opusfilter-duplicates is:
opusfilter-duplicates [--overlap FILE [FILE ...]] [--hash HASH] [--letters-only] [--lowercase] FILE [FILE ...]
The options are essentially the same as for remove_duplicates.
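To make the two modes concrete, here is a minimal Python sketch of hash-based duplicate and overlap counting. It does not use OpusFilter's code: the function names are hypothetical, SHA-256 stands in for whatever --hash selects, and "letters only" is assumed to mean keeping the characters for which str.isalpha() is true.

```python
import hashlib
from collections import Counter

def normalize(text, lowercase=False, letters_only=False):
    """Optionally lowercase the text and drop non-letter characters."""
    if lowercase:
        text = text.lower()
    if letters_only:
        text = "".join(ch for ch in text if ch.isalpha())
    return text

def entry_key(segments, **options):
    """Hash one entry (one segment per parallel file) into a fixed-size key."""
    joined = "\n".join(normalize(s, **options) for s in segments)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def duplicate_stats(corpus, **options):
    """Count total, unique, and duplicate entries within a single corpus."""
    counts = Counter(entry_key(entry, **options) for entry in corpus)
    total = sum(counts.values())
    return {"total": total, "unique": len(counts), "duplicates": total - len(counts)}

def overlap(corpus_a, corpus_b, **options):
    """Count entries of corpus_a whose key also occurs in corpus_b."""
    keys_b = {entry_key(entry, **options) for entry in corpus_b}
    return sum(1 for entry in corpus_a if entry_key(entry, **options) in keys_b)

# Made-up parallel data for illustration.
corpus = [
    ("Hello world", "Hallo Welt"),
    ("hello world!", "hallo welt!"),
    ("Bye", "Tschüss"),
]
print(duplicate_stats(corpus))
print(duplicate_stats(corpus, lowercase=True, letters_only=True))
print(overlap(corpus, [("Bye", "Tschüss")]))
```

With the default options the first two entries count as distinct; with lowercasing and letters-only normalization they collapse into one, which is the kind of difference the tool's options are meant to expose.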
opusfilter-scores
This is a tool that can be used to calculate and plot statistics from
scores produced by the score function. The tool has several subcommands
that all take the JSON Lines score file as the input, and either print
or plot the output:
list: Print score column names
describe: Print basic score statistics
corr: Plot score correlation matrix
hist: Plot score histograms
scatter-matrix: Plot scatter matrix for scores
values: Plot score values by line number
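As a rough picture of what the list and describe subcommands report, the sketch below reads a JSON Lines score file with the standard library only, flattens nested scores into dotted column names, and prints simple statistics. The sample records, the dotted naming scheme, and the helper names are assumptions for illustration; the actual tool may name columns and compute statistics differently.

```python
import io
import json
import statistics

def flatten(obj, prefix=""):
    """Flatten nested dicts/lists of scores into {column_name: value}."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        name = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, name))
    return flat

def read_columns(lines):
    """Collect one list of values per flattened column name."""
    columns = {}
    for line in lines:
        if line.strip():
            for name, value in flatten(json.loads(line)).items():
                columns.setdefault(name, []).append(value)
    return columns

# Made-up score records; the filter names and layout are illustrative only.
sample = io.StringIO(
    '{"LengthFilter": [3, 4], "LengthRatioFilter": 1.33}\n'
    '{"LengthFilter": [10, 8], "LengthRatioFilter": 1.25}\n'
)
columns = read_columns(sample)

print(sorted(columns))  # roughly what "list" would show
for name, values in sorted(columns.items()):  # a tiny stand-in for "describe"
    print(name, "mean:", statistics.mean(values), "min:", min(values), "max:", max(values))
```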
opusfilter-test
This is a simple script based on the filter function that can be used
to calculate the number of segments that the given filter(s) would
remove from the parallel data, and optionally output the to-be-removed
segments.
The syntax of opusfilter-test is:
opusfilter-test [--yaml FILE] [--add CLASS JSON] [--removed FILE] FILE [FILE ...]
The filters to test can be defined either in a YAML file (--yaml),
using a definition similar to the filters parameter of the filter
function, or by adding them one by one with the --add option, which
takes the filter class as the first argument and the filter parameters
as a JSON object as the second argument. For default filter
parameters, an empty dictionary ('{}') should be provided.
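Quoting JSON on the command line is easy to get wrong, so it can help to build the --add arguments programmatically. The sketch below only assembles and prints a command line from the syntax shown above; it does not run anything. The helper build_test_command is hypothetical, the file names are placeholders, and the filter classes and the threshold parameter are used as examples on the assumption that they match your installed version.

```python
import json
import shlex

def build_test_command(filters, input_files, removed=None):
    """Assemble an opusfilter-test argument list (hypothetical helper).

    filters: list of (class_name, params_dict); use {} for default parameters.
    """
    cmd = ["opusfilter-test"]
    for class_name, params in filters:
        # Each --add takes the class name and a JSON object of parameters.
        cmd += ["--add", class_name, json.dumps(params)]
    if removed is not None:
        cmd += ["--removed", removed]
    cmd += list(input_files)
    return cmd

cmd = build_test_command(
    [("LengthFilter", {}), ("LengthRatioFilter", {"threshold": 3})],
    ["corpus.en", "corpus.de"],
    removed="removed.jsonl",
)
# shlex.join adds the shell quoting needed around the JSON arguments.
print(shlex.join(cmd))
```

Passing the list form to subprocess.run avoids shell quoting altogether; the quoted string is mainly useful for pasting into a terminal.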
The script first calculates the total number of segments in the input
files, and then runs the filters on them one by one. The number and
proportion of removed segments are printed. In addition, it is possible
to write the removed segments to a file in JSON Lines format
(--removed), and to collect the scores from the filters similarly to
the score function (--scores).
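The procedure can be mimicked with plain Python predicates to show what the reported numbers mean. This is a simplified sketch, not OpusFilter's implementation: the two toy filters and the count_removed helper are hypothetical, and it assumes each filter is evaluated independently on the full input.

```python
def non_empty(segments):
    """Accept an entry only if every segment contains some text."""
    return all(s.strip() for s in segments)

def length_ok(segments, max_words=5):
    """Accept an entry only if every segment has at most max_words words."""
    return all(len(s.split()) <= max_words for s in segments)

def count_removed(pairs, filters):
    """Return the total count and, per filter, (removed count, proportion, removed entries)."""
    total = len(pairs)
    report = {}
    for name, accept in filters:
        removed = [p for p in pairs if not accept(p)]
        proportion = len(removed) / total if total else 0.0
        report[name] = (len(removed), proportion, removed)
    return total, report

# Made-up parallel segments for illustration.
pairs = [
    ("a b c", "x y z"),
    ("", "x"),
    ("one two three four five six", "uno"),
    ("ok", "ok"),
]
total, report = count_removed(pairs, [("non_empty", non_empty), ("length_ok", length_ok)])

print("segments:", total)
for name, (count, proportion, removed) in report.items():
    print(f"{name}: removed {count} ({proportion:.0%})")
```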