OpusFilter
Welcome to OpusFilter’s documentation!
OpusFilter is a tool for filtering and combining parallel corpora. It uses the OpusTools library [Aulamo et al., 2020] to download data from the OPUS corpus collection [Tiedemann, 2012], but it can be used with any corpus in raw text format.
Features:
Corpus preprocessing pipelines configured with YAML
Simple downloading of parallel corpora from OPUS with OpusTools
Implementations for many common text file operations on parallel files
Memory-efficient processing of large files
Filters based on, for example, language identification, word alignment, n-gram language models, and multilingual sentence embeddings
Extendable with your own filters written in Python
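To make the filtering idea in the feature list concrete, here is a minimal, self-contained sketch of a length-ratio filter applied to a stream of sentence pairs. It is plain Python and does not use OpusFilter's own classes or API; the names `length_ratio` and `filter_pairs`, the threshold of 3.0, and the sample sentences are all assumptions made for illustration. Because the filter is a generator, pairs are handled one at a time rather than loaded into memory together.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def length_ratio(pair: Pair) -> float:
    """Return the longer-to-shorter word-count ratio of a sentence pair."""
    shorter, longer = sorted(len(segment.split()) for segment in pair)
    if shorter == 0:
        # An empty side gets an infinite ratio, so it is always rejected.
        return float("inf")
    return longer / shorter


def filter_pairs(pairs: Iterable[Pair], threshold: float = 3.0) -> Iterator[Pair]:
    """Lazily yield only the pairs whose length ratio is below the threshold."""
    for pair in pairs:
        if length_ratio(pair) < threshold:
            yield pair


# Hypothetical toy data standing in for two aligned text files.
source = ["Hello world", "Yes", ""]
target = ["Hei maailma", "Kyllä se on aivan totta", "Tyhjä"]

kept = list(filter_pairs(zip(source, target)))
print(kept)
```

With real corpora, the two lists could be replaced by open file handles, since `zip` over files also reads the lines lazily.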
OpusFilter was presented in the ACL 2020 system demonstrations.
Get started
Available functions
Available filters
Available preprocessors
Other information