OpusFilter
3.2

Get started

  • Installation
  • Basic usage
  • Automatic configuration generation
  • Command line tools for analysis

Available functions

  • Downloading and selecting data
  • Preprocessing text
  • Filtering and scoring
  • Using score files
  • Training language and alignment models
  • Training and using classifiers

Available filters

  • Length filters
  • Script and language identification filters
  • Special character and similarity filters
  • Language model filters
  • Alignment model filters
  • Sentence embedding filters
  • Custom filters

Available preprocessors

  • Tokenizer
  • Detokenizer
  • WhitespaceNormalizer
  • RegExpSub
  • MonolingualSentenceSplitter
  • BPESegmentation
  • MorfessorSegmentation
  • Custom preprocessors

Other information

  • Citing and references
  • Contributing
  • Changelog