Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
Unreleased
Fixed
broken citations and links in the documentation
3.3.0 - 2025-07-02
Added
letter_words_only option for duplicate detection
support for selection of possible languages for lingua language detection
support for language detection based on heliport (a port of HeLI-OTS)
Changed
update library requirements and include Python 3.13 tests
removed pycld2 and fasttext from the [all] extras
refactored all language identification filters to have their own classes and marked LanguageIDFilter as deprecated
Fixed
fix broken import in opusfilter-scores
3.2.0 - 2024-08-14
Changed
make pycld2 and fasttext libraries optional
replace langid.py library with py3langid
update github workflows and include Python 3.12 tests
Fixed
OpusRead interface using moses format (requires opustools >= 1.6.2)
3.1.0 - 2024-06-05
Added
support for lingua-based language detection (https://github.com/Helsinki-NLP/OpusFilter/pull/65)
Removed
Python 3.7 support
Fixed
fix score method in SentenceEmbeddingFilter (https://github.com/Helsinki-NLP/OpusFilter/pull/71)
fix filter and filterfalse methods in SentenceEmbeddingFilter
3.0.0 - 2023-10-11
Added
opusfilter-autogen script for automatic filter config generation
score_direction, accept_threshold, and reject_threshold properties for filters
Changed
refactor code and move auxiliary methods to opusfilter.util
update varikn installation instructions (installable from PyPI)
update github workflows and include Python 3.11 tests
update library version requirements to support Python 3.11
use xxhash instead of pyhash for hash functions
use opus-fast-mosestokenizer instead of fast-mosestokenizer
install eflomal from PyPI and use the new interface in WordAlignFilter
Removed
Python 3.6 support
Fixed
catch NotImplementedError from beautifulsoup 4.11.2
catch ParserRejectedMarkup from beautifulsoup 4.12.0
2.6.0 - 2022-11-30
Added
add slice, which was missing from the enabled steps
Changed
improve documentation
import slow libraries only when needed
use chunks for the filter method of SentenceEmbeddingFilter
change RepetitionFilter to use single score for consistency with the threshold
Fixed
allow float thresholds for AverageWordLengthFilter
remove unnecessary code from RegExpSub
add setuptools version requirement
2.5.1 - 2022-09-28
Fixed
add missing document file
2.5.0 - 2022-09-28
Added
map_space_to option for Jieba and MeCab tokenizers to preserve existing space characters in input
parallel processing options for filter, score, and preprocess steps
Changed
re-organize documentation and support building it with sphinx
Fixed
catch TypeError exceptions from BeautifulSoup in HtmlTagFilter
2.4.0 - 2022-04-05
Added
an option to write filter scores to a file with opusfilter-test
new filters: AlphabetRatioFilter, RegExpFilter, SimilarityFilter, SentenceEmbeddingFilter
support for Japanese word segmentation using MeCab as a tokenizer
preprocessing methods for subword segmentation (BPESegmentation, MorfessorSegmentation)
subword segmentation support for the n-gram language models and language model filters
Changed
allow per-language parameters for LengthFilter, LengthRatioFilter, LongWordFilter, and AverageWordLengthFilter
fix documentation for train_alignment parameters
2.3.1 - 2022-01-28
Fixed
fix bug in classifier training without development set
2.3.0 - 2022-01-18
Added
new OpusFilterRuntimeError exception for having e.g. empty training data
option to save scores from the training data when creating word alignment priors
RepetitionFilter for filtering segments with repeated substrings
new preprocessor for sentence splitting monolingual data
method-specific options for LanguageIDFilter
chunksize option to the common section
LMClassifierFilter for classification based on n-gram language models
Changed
add workdir attribute to the FilterABC base class; filters should now use it for any file parameters
increase default chunksize in FilterPipeline from 10000 to 100000
refactor and clean up code
2.2.0 - 2021-11-23
Added
support for Chinese word segmentation using jieba as a tokenizer (https://github.com/Helsinki-NLP/OpusFilter/pull/27)
2.1.2 - 2021-11-11
Fixed
fix wrong keyword argument name in opusfilter-duplicates
2.1.1 - 2021-10-19
Changed
move “How to contribute” to docs/CONTRIBUTING.md
Fixed
fix setuptools requirement (https://github.com/Helsinki-NLP/OpusFilter/issues/21)
fix version requirement for pandas (>=1.0.0)
2.1.0 - 2021-08-31
Changed
replace PyYAML with ruamel.yaml
Added
support for variables in the YAML configuration (https://github.com/Helsinki-NLP/OpusFilter/pull/13)
support for fasttext-based language detection (https://github.com/Helsinki-NLP/OpusFilter/pull/20)
suppress_prompts parameter for opus_read (https://github.com/Helsinki-NLP/OpusFilter/pull/19)
download and write steps
“How to contribute” section to README.md
changelog
bibliography and improved references
2.0.0 - 2021-06-01
Changed
extend to n-lingual parallel data instead of just bilingual data
switch tokenizer to fast-mosestokenizer
Added
new commands: opusfilter-diagram, opusfilter-duplicates, opusfilter-test
new filters: LongestCommonSubstringFilter, AverageWordLengthFilter
new steps: preprocess
set “latest” as the default corpus release for opus_read (https://github.com/Helsinki-NLP/OpusFilter/pull/5)
overlap option for remove_duplicates
lower threshold option for CrossEntropyFilter
github CI workflow for flake8 and unittests
Fixed
behaviour of simple filters on empty segments
1.0.1 - 2020-05-25
Added
improved logging, documentation, and project files
Fixed
prevent UnboundLocalError for empty output after filter
1.0.0 - 2020-04-10
First tagged version.