OpusFilter

Welcome to OpusFilter’s documentation!

OpusFilter is a tool for filtering and combining parallel corpora. It uses the OpusTools library [Aulamo et al., 2020] to download data from the OPUS corpus collection [Tiedemann, 2012], but can be used with any corpora in raw text format.

Features:

  • Corpus preprocessing pipelines configured with YAML

  • Simple downloading of parallel corpora from OPUS with OpusTools

  • Implementations for many common text file operations on parallel files

  • Memory-efficient processing of large files

  • Built-in filters based on, for example, language identification, word alignment, n-gram language models, and multilingual sentence embeddings

  • Extendable with your own filters written in Python
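To give a rough idea of what filtering a parallel corpus means, the following is a minimal plain-Python sketch of a length-ratio filter that streams over sentence pairs. It does not use OpusFilter's API: the names `length_ratio` and `filter_pairs`, the threshold of 3.0, and the sample sentences are all assumptions made for illustration only.

```python
from typing import Iterable, Iterator, Tuple

# A parallel segment: (source sentence, target sentence).
Pair = Tuple[str, str]


def length_ratio(pair: Pair) -> float:
    """Return the ratio of the longer to the shorter segment, counted in words."""
    src_len = len(pair[0].split())
    tgt_len = len(pair[1].split())
    if min(src_len, tgt_len) == 0:
        # An empty side can never be a valid translation; give it the worst score.
        return float("inf")
    return max(src_len, tgt_len) / min(src_len, tgt_len)


def filter_pairs(pairs: Iterable[Pair], max_ratio: float = 3.0) -> Iterator[Pair]:
    """Lazily yield only the pairs whose length ratio is within max_ratio.

    Because this is a generator, pairs are handled one at a time and the
    whole corpus never has to be held in memory.
    """
    for pair in pairs:
        if length_ratio(pair) <= max_ratio:
            yield pair


# Hypothetical sample data (English-Finnish), for illustration only.
pairs = [
    ("Hello world", "Hei maailma"),
    ("Yes", "Tämä on aivan liian pitkä käännös"),
    ("", "tyhjä"),
]
kept = list(filter_pairs(pairs))
```

Here only the first pair survives: the second has a 1:6 word ratio and the third has an empty source side. A real pipeline would read the pairs from files rather than from an in-memory list.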

OpusFilter was presented at the ACL 2020 system demonstrations.