Downloading and selecting data

opus_read

Read a corpus from the OPUS corpus collection [Tiedemann, 2012] using the OpusTools [Aulamo et al., 2020] interface.

Parameters:

  • corpus_name: name of the corpus in OPUS

  • source_language: language code for the source language

  • target_language: language code for the target language

  • release: version of the corpus in OPUS

  • preprocessing: moses or raw for untokenized segments, xml for tokenized segments

  • src_output: output file for source language

  • tgt_output: output file for target language

  • suppress_prompts: if false (default), prompt the user to confirm before downloading; if true, download without prompting

The moses preprocessing type (available with OpusTools version 1.6.2 and above) is recommended for those corpora for which it exists. The output is equivalent to raw, but in some cases it can significantly reduce the amount of data downloaded in the process.

concatenate

Concatenate two or more text files.

Parameters:

  • inputs: a list of input files

  • output: output file

download

Download a file from a URL.

Parameters:

  • url: URL for the file to download

  • output: output file

tail

Take the last n lines from files.

Parameters:

  • inputs: a list of input files

  • outputs: a list of output files

  • n: number of output lines

Note: The memory requirement of tail is proportional to n. Use slice if you need all except the first n lines.
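The memory note can be illustrated with a minimal Python sketch. It is not the OpusFilter implementation, and the function name tail_lines is hypothetical; it only shows why keeping the last n lines of a stream needs a buffer of size n.

```python
from collections import deque


def tail_lines(lines, n):
    """Return the last n items of an iterable, holding at most n in memory."""
    # A deque with maxlen=n discards its oldest item whenever a new one
    # arrives, so memory use grows with n rather than with the input size.
    return list(deque(lines, maxlen=n))
```

Because the input is consumed as a stream, the whole file never has to be loaded, but a large n still means a large buffer.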

slice

Take a slice of lines from files.

Parameters:

  • inputs: a list of input files

  • outputs: a list of output files

  • start: start index (optional; default 0)

  • stop: stop index (optional; default null)

  • step: step size (optional; default 1)

Either start, stop, or both of them should be given. If stop is not given, reads until the end of the file.
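As a rough sketch of how the three parameters interact, the following Python function applies them to any iterable of lines. It assumes that the indices follow Python's slice conventions (zero-based, stop exclusive), which the description above does not state explicitly; the function name slice_lines is hypothetical.

```python
from itertools import islice


def slice_lines(lines, start=0, stop=None, step=1):
    """Lazily yield the lines selected by start, stop and step."""
    # stop=None means "read until the end of the input".
    return islice(lines, start, stop, step)
```

With start=n and no stop, this yields all lines except the first n.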

split

Split files into two parts, giving the approximate proportions as fractions.

Parameters:

  • inputs: input file(s)

  • outputs: output file(s) for selected lines

  • outputs_2: output file(s) for the rest of the lines (optional)

  • divisor: divisor for the modulo operation (e.g. 2 for splitting into equal-sized parts)

  • threshold: threshold for the output of the modulo operation (optional; default 1)

  • compare: select files to use for the hash operation (optional; all or a list of indices; default all)

  • hash: select hash algorithm from xxhash (optional; default xxh64)

  • seed: integer seed for the hash algorithm (optional; default 0)

Input files are processed line by line in parallel. For each set of parallel lines, the condition hash(content) % divisor < threshold is evaluated, where content is a concatenation of the input lines and the hash function returns an integer. If the condition holds, the lines are written to the outputs. If it does not hold and outputs_2 are defined, the lines are written there.
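The condition can be sketched in Python as follows. This is an illustration under stated assumptions, not the OpusFilter code: the standard-library blake2b hash stands in for xxh64, the way the lines are concatenated and the seed is mixed in is made up for the example, and the names content_hash and split_pairs are hypothetical.

```python
import hashlib


def content_hash(lines, seed=0):
    """Map a sequence of parallel lines to an integer (stand-in for xxh64)."""
    # Assumption: joining with a NUL byte and prefixing the seed is
    # illustrative only; the real tool may combine them differently.
    data = (f"{seed}\x00" + "\x00".join(lines)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def split_pairs(pairs, divisor, threshold=1, compare=None, seed=0):
    """Route each tuple of parallel lines to one of two lists by content hash."""
    selected, rest = [], []
    for lines in pairs:
        # compare=None hashes all inputs; otherwise only the listed indices.
        content = lines if compare is None else [lines[i] for i in compare]
        if content_hash(content, seed) % divisor < threshold:
            selected.append(lines)
        else:
            rest.append(lines)
    return selected, rest
```

With divisor 10 and threshold 1, roughly one tenth of the distinct lines end up in the first list, and repeated lines always land on the same side.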

Compared to random splitting (see subset) or using the modulo operation on the line number, the benefit of the hash-based approach is that the decision is fully deterministic and based only on the content of the lines. Consequently, identical content always goes to the same output file(s). For example, if you split a parallel corpus into test and training sets, you can be sure that your test data does not contain exactly the same samples as the training data, even if the original data has duplicates.

The downside is that you need to be careful if you use several splits for the same data. The divisors used in consecutive splits should not themselves have common divisors, or the proportion of the data in the output files may be unexpected. Distinct prime numbers are good choices. Setting a different seed value for the hash function in each split also prevents the issue.
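A small worked example makes the divisor issue concrete. In this sketch, consecutive integers stand in for hash values, the threshold is 1 in both splits, and the same hash values are reused for the second split (that is, the seed is unchanged). The function name kept_fraction is hypothetical.

```python
def kept_fraction(first_divisor, second_divisor, n_hashes=600):
    """Fraction of first-split survivors that also pass a second split."""
    # Lines kept by the first split are those with hash % first_divisor == 0.
    first = [h for h in range(n_hashes) if h % first_divisor < 1]
    # The second split reuses the same hash values on the survivors.
    second = [h for h in first if h % second_divisor < 1]
    return len(second) / len(first)


print(kept_fraction(2, 2))  # 1.0: the second split selects everything
print(kept_fraction(2, 4))  # 0.5: half is selected instead of a quarter
print(kept_fraction(2, 3))  # about one third, as intended
```

Only the pair without a common divisor (2 and 3) yields the proportion that the second divisor suggests.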

The compare parameter can be used to select which input files are used to generate the content for the hash function. For example, if you have source and target language files and you want the split to depend only on the source or the target sentence, set compare to [0] or [1], respectively.

subset

Take a random subset from parallel corpus files.

Parameters:

  • inputs: input files

  • outputs: output files for storing the subset

  • size: number of lines to select for the subset

  • seed: seed for the random generator; set to ensure that two runs select the same lines (optional; default null)

  • shuffle_subset: shuffle the order of the selected lines for each language except for the first; can be used to produce noisy examples for training a corpus filtering model (optional; default false)
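The effect of seed and shuffle_subset can be sketched in Python as below. This is a simplified in-memory illustration, not the OpusFilter implementation; the function name take_subset is hypothetical, and keeping the selected lines in their original relative order is an assumption of the sketch.

```python
import random


def take_subset(columns, size, seed=None, shuffle_subset=False):
    """Pick the same random line indices from each parallel column."""
    rng = random.Random(seed)
    total = len(columns[0])
    # Assumption: sorting keeps the selected lines in their original order.
    indices = sorted(rng.sample(range(total), size))
    outputs = [[col[i] for i in indices] for col in columns]
    if shuffle_subset:
        # Break the alignment on purpose: every column but the first
        # is shuffled independently.
        for out in outputs[1:]:
            rng.shuffle(out)
    return outputs
```

With shuffle_subset enabled, the first column is unchanged while the others contain the same lines in a different order, which gives misaligned pairs that can serve as noisy examples.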

product

Create a Cartesian product of parallel segments and optionally sample from them.

Parameters:

  • inputs: a list of lists of input files

  • outputs: a list of output files

  • skip_empty: skip empty lines (optional; default true)

  • skip_duplicates: skip duplicate lines per language (optional; default true)

  • k: sample at most k random items per product (optional; default null)

  • seed: seed for the random generator; set to ensure that two runs produce the same lines (optional; default null)

Can be used to combine parallel files of the same language that contain alternative translations or other meaningful variation (e.g. alternative subword segmentations). For example, if you have the same text translated to language A by N translators and to language B by M translators, you can combine the N + M files into two files having N x M lines for each original line.
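For a single original line, the N x M combination can be sketched in Python as follows. This is a minimal illustration that mirrors the default behaviour described by the parameters above (skipping empty and duplicate variants); it is not the OpusFilter code, and the function name segment_product is hypothetical.

```python
import itertools
import random


def segment_product(alternatives_a, alternatives_b, k=None, seed=None):
    """Pair every variant of a segment in language A with every variant in B."""
    # Drop empty strings and repeated variants within each language,
    # keeping the order of first occurrence.
    variants_a = list(dict.fromkeys(s for s in alternatives_a if s))
    variants_b = list(dict.fromkeys(s for s in alternatives_b if s))
    pairs = list(itertools.product(variants_a, variants_b))
    if k is not None and len(pairs) > k:
        # Sample at most k of the combinations.
        pairs = random.Random(seed).sample(pairs, k)
    return pairs
```

Two distinct variants in language A and three in language B give six pairs for that line; setting k caps the number of pairs kept.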

unzip

Unzip parallel segments joined in a single file into multiple files.

Parameters:

  • input: input file

  • outputs: a list of output files

  • separator: a string separator in the input file

Can be used to split e.g. Moses-style (|||) or tab-separated parallel text files into parts.
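The operation amounts to splitting each line on the separator and distributing the parts over the outputs. The Python sketch below shows this for lines held in memory; the function name unzip_lines is hypothetical, and raising an error when a line has an unexpected number of fields is an assumption of the sketch rather than documented behaviour.

```python
def unzip_lines(lines, separator, n_outputs):
    """Split each joined line on the separator into n_outputs columns."""
    columns = [[] for _ in range(n_outputs)]
    for line in lines:
        parts = line.rstrip("\n").split(separator)
        if len(parts) != n_outputs:
            # How mismatched lines are handled is an assumption of this sketch.
            raise ValueError(f"expected {n_outputs} fields, got {len(parts)}")
        for column, part in zip(columns, parts):
            column.append(part)
    return columns
```

For tab-separated input the separator would be the tab character; for Moses-style input, whether the spaces around ||| belong to the separator depends on the file.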

write

Write a specified string into a file.

Parameters:

  • output: output file

  • data: input data to write to the output (converted to a string if not already)

Useful mostly for testing.