Downloading and selecting data
opus_read
Read a corpus from the OPUS corpus collection [Tiedemann, 2012] using the OpusTools [Aulamo et al., 2020] interface.
Parameters:
corpus_name: name of the corpus in OPUS
source_language: language code for the source language
target_language: language code for the target language
release: version of the corpus in OPUS
preprocessing: moses or raw for untokenized and xml for tokenized segments
src_output: output file for the source language
tgt_output: output file for the target language
suppress_prompts: false (default) prompts the user to confirm before download, true downloads without prompting
The moses preprocessing type (available with OpusTools version 1.6.2 and above) is recommended for those corpora for which it exists. The output is equivalent to raw, but in some cases it can significantly reduce the amount of data downloaded in the process.
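As an illustration, such a step might appear in a YAML configuration file roughly as sketched below. The steps/type/parameters layout is assumed here, and the corpus name, release, and output file names are only examples.

```yaml
steps:
  - type: opus_read
    parameters:
      corpus_name: RF            # example corpus; use any corpus available in OPUS
      source_language: en
      target_language: sv
      release: latest
      preprocessing: moses       # fall back to raw if moses is not available for the corpus
      src_output: rf.en.gz
      tgt_output: rf.sv.gz
      suppress_prompts: true     # do not ask for confirmation before downloading
```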
concatenate
Concatenate two or more text files.
Parameters:
inputs: a list of input files
output: output file
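For instance, a concatenate step entry (under the same steps list as above) might be written as follows; the file names are illustrative.

```yaml
  - type: concatenate
    parameters:
      inputs:
        - rf.en.gz
        - books.en.gz
      output: combined.en.gz   # contains all input lines in the given order
```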
download
Download a file from a URL.
Parameters:
url: URL for the file to download
output: output file
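A hypothetical download step, with a placeholder URL:

```yaml
  - type: download
    parameters:
      url: https://example.org/data/corpus.en.gz   # placeholder URL
      output: corpus.en.gz
```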
head
Take the first n lines from files.
Parameters:
inputs: a list of input files
outputs: a list of output files
n: number of output lines
tail
Take the last n lines from files.
Parameters:
inputs: a list of input files
outputs: a list of output files
n: number of output lines
Note: The memory requirement of tail is proportional to n. Use slice if you need all except the first n lines.
slice
Take a slice of lines from files.
Parameters:
inputs: a list of input files
outputs: a list of output files
start: start index (optional; default 0)
stop: stop index (optional; default null)
step: step size (optional; default 1)
Either start, stop, or both of them should be given. If stop is not given, the step reads until the end of the file.
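As a sketch, the following step would skip the first 100000 lines and keep the rest, complementing the head example above; the file names are illustrative.

```yaml
  - type: slice
    parameters:
      inputs: [combined.en.gz, combined.sv.gz]
      outputs: [rest.en.gz, rest.sv.gz]
      start: 100000   # stop is omitted, so the step reads until the end of the files
```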
split
Split files into two parts, giving the approximate proportions as fractions.
Parameters:
inputs: input file(s)
outputs: output file(s) for selected lines
outputs_2: output file(s) for the rest of the lines (optional)
divisor: divisor for the modulo operation (e.g. 2 for splitting into equal-sized parts)
threshold: threshold for the output of the modulo operation (optional; default 1)
compare: input files to use for the hash operation, either all or a list of file indices (optional; default all)
hash: hash algorithm from xxhash (optional; default xxh64)
seed: integer seed for the hash algorithm (optional; default 0)
Input files are processed line by line in parallel. The content of the parallel lines is concatenated and hashed into an integer; if the condition hash(content) % divisor < threshold holds, the lines are written to outputs. If the condition does not hold and outputs_2 is defined, the lines are written there instead.
Compared to random splitting (see subset) or using the modulo operation on the line number, the benefit of the hash-based approach is that the decision is fully deterministic and based only on the content of the lines. Consequently, identical content always goes to the same output file(s). For example, if you split a parallel corpus into test and training sets, you can be sure that your test data does not contain exactly the same samples as the training data even if the original data has duplicates.
The downside is that you need to be careful if you use several splits for the same data. The divisors used in consecutive splits should not themselves have common divisors, or the proportion of the data in the output files may be unexpected; distinct prime numbers are good choices. Setting a different seed value for the hash function in each split also prevents the issue.
The compare parameter can be used to select which input files are used to generate the content for the hash function. For example, if you have source and target language files and you want the split to depend only on the source or target sentence, set compare to [0] or [1], respectively.
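Putting these together, a roughly 1% / 99% test/training split based only on the source sentences could be sketched as below; the file names are illustrative.

```yaml
  - type: split
    parameters:
      inputs: [all.en.gz, all.sv.gz]
      outputs: [test.en.gz, test.sv.gz]      # lines with hash(content) % 100 < 1, about 1%
      outputs_2: [train.en.gz, train.sv.gz]  # the remaining lines, about 99%
      divisor: 100
      threshold: 1
      compare: [0]   # hash only the source (first input file) lines
```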
subset
Take a random subset from parallel corpus files.
Parameters:
inputs: input files
outputs: output files for storing the subset
size: number of lines to select for the subset
seed: seed for the random generator; set to ensure that two runs select the same lines (optional; default null)
shuffle_subset: shuffle the order of the selected lines for each language except for the first; can be used to produce noisy examples for training a corpus filtering model (optional; default false)
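A subset step might be configured as in the following sketch; the size and seed are arbitrary example values.

```yaml
  - type: subset
    parameters:
      inputs: [train.en.gz, train.sv.gz]
      outputs: [sample.en.gz, sample.sv.gz]
      size: 10000   # number of parallel lines to select
      seed: 42      # fix the seed to make the selection repeatable
```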
product
Create a Cartesian product of parallel segments and optionally sample from them.
Parameters:
inputs: a list of input file lists
outputs: a list of output files
skip_empty: skip empty lines (optional; default true)
skip_duplicates: skip duplicate lines per language (optional; default true)
k: sample at most k random items per product (optional; default null)
seed: seed for the random generator; set to ensure that two runs produce the same lines (optional; default null)
Can be used to combine parallel files of the same language that contain alternative translations or other meaningful variation (e.g. alternative subword segmentations). For example, if you have the same text translated to language A by N translators and to language B by M translators, you can combine the N + M files into two files having N x M lines for each original line.
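For the translator example above with N = M = 2, a product step could be sketched as follows; the file names are illustrative. Each original line yields up to 2 x 2 = 4 combinations, of which at most k are sampled.

```yaml
  - type: product
    parameters:
      inputs:
        - [trans-a1.en.gz, trans-a2.en.gz]   # two alternative English translations
        - [trans-b1.sv.gz, trans-b2.sv.gz]   # two alternative Swedish translations
      outputs: [product.en.gz, product.sv.gz]
      k: 2          # sample at most 2 of the 4 combinations per line
      seed: 42
```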
unzip
Unzip parallel segments joined in a single file into multiple files.
Parameters:
input: input file
outputs: a list of output files
separator: a string separator in the input file
Can be used to split e.g. Moses-style (|||) or tab-separated parallel text files into parts.
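For instance, a tab-separated parallel file could be unzipped with a step like the following sketch; the file names are illustrative, and the separator should match whatever string actually appears in the input file.

```yaml
  - type: unzip
    parameters:
      input: parallel.en-sv.tsv.gz
      outputs: [unzipped.en.gz, unzipped.sv.gz]
      separator: "\t"   # e.g. "|||" for Moses-style files
```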
write
Write a specified string into a file.
Parameters:
output: output file
data: input data to write to the output (converted to a string if not already)
Useful mostly for testing.
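A minimal write step for testing could look like this sketch; the output file name and data value are arbitrary.

```yaml
  - type: write
    parameters:
      output: constant.txt
      data: "test line"
```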