Citing and references

Citing

If you use OpusFilter in your research, please cite our ACL 2020 paper [Aulamo et al., 2020]:

@inproceedings{aulamo-etal-2020-opusfilter,
    title = "{O}pus{F}ilter: A Configurable Parallel Corpus Filtering Toolbox",
    author = {Aulamo, Mikko and Virpioja, Sami and Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.20",
    doi = "10.18653/v1/2020.acl-demos.20",
    pages = "150--156"
}

References

A full bibliography of papers cited in the documentation and code:

[artetxe-schwenk-2018-margin]

Mikel Artetxe and Holger Schwenk. Margin-based parallel corpus mining with multilingual sentence embeddings. CoRR, 2018. URL: http://arxiv.org/abs/1811.01136, arXiv:1811.01136.

[aulamo-etal-2023-unsupervised]

Mikko Aulamo, Ona de Gibert, Sami Virpioja, and Jörg Tiedemann. Unsupervised feature selection for effective parallel corpus filtering. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, 31–38. Tampere, Finland, June 2023. European Association for Machine Translation. URL: https://aclanthology.org/2023.eamt-1.4.

[aulamo-etal-2020-opustools]

Mikko Aulamo, Umut Sulubacak, Sami Virpioja, and Jörg Tiedemann. OpusTools and parallel corpus diagnostics. In Proceedings of the 12th Language Resources and Evaluation Conference, 3782–3789. Marseille, France, May 2020. European Language Resources Association. URL: https://aclanthology.org/2020.lrec-1.467.

[aulamo-etal-2020-opusfilter]

Mikko Aulamo, Sami Virpioja, and Jörg Tiedemann. OpusFilter: a configurable parallel corpus filtering toolbox. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 150–156. Association for Computational Linguistics, July 2020. URL: https://aclanthology.org/2020.acl-demos.20, doi:10.18653/v1/2020.acl-demos.20.

[chaudhary-etal-2019-low]

Vishrav Chaudhary, Yuqing Tang, Francisco Guzmán, Holger Schwenk, and Philipp Koehn. Low-resource corpus filtering using multilingual sentence embeddings. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), 261–266. Florence, Italy, August 2019. Association for Computational Linguistics. URL: https://aclanthology.org/W19-5435, doi:10.18653/v1/W19-5435.

[joulin-etal-2016-fasttext]

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. FastText.zip: compressing text classification models. CoRR, 2016. URL: http://arxiv.org/abs/1612.03651, arXiv:1612.03651.

[joulin-etal-2017-bag]

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 427–431. Valencia, Spain, April 2017. Association for Computational Linguistics. URL: https://aclanthology.org/E17-2068.

[koehn-2005-europarl]

Philipp Koehn. Europarl: a parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, 79–86. Phuket, Thailand, September 2005. URL: https://aclanthology.org/2005.mtsummit-papers.11.

[lui-baldwin-2012-langid]

Marco Lui and Timothy Baldwin. langid.py: an off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, 25–30. Jeju Island, Korea, July 2012. Association for Computational Linguistics. URL: https://aclanthology.org/P12-3005.

[moore-lewis-2010-intelligent]

Robert C. Moore and William Lewis. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, 220–224. Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL: https://aclanthology.org/P10-2041.

[sennrich-etal-2016-neural]

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1715–1725. Berlin, Germany, August 2016. Association for Computational Linguistics. URL: https://aclanthology.org/P16-1162, doi:10.18653/v1/P16-1162.

[siivola-etal-2007-growing]

Vesa Siivola, Teemu Hirsimäki, and Sami Virpioja. On growing and pruning Kneser-Ney smoothed n-gram models. IEEE Transactions on Audio, Speech and Language Processing, 15(5):1617–1624, 2007. URL: https://doi.org/10.1109/TASL.2007.896666.

[tiedemann-2012-parallel]

Jörg Tiedemann. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2214–2218. Istanbul, Turkey, May 2012. European Language Resources Association (ELRA). URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.

[vatanen-etal-2010-language]

Tommi Vatanen, Jaakko J. Väyrynen, and Sami Virpioja. Language identification of short text segments with n-gram models. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10). European Language Resources Association (ELRA), May 2010.

[virpioja-etal-2013-morfessor]

Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. Report 25/2013 in Aalto University publication series SCIENCE + TECHNOLOGY, Department of Signal Processing and Acoustics, Aalto University, Helsinki, Finland, 2013.

[vazquez-etal-2019-university]

Raúl Vázquez, Umut Sulubacak, and Jörg Tiedemann. The University of Helsinki submission to the WMT19 parallel corpus filtering task. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), 294–300. Florence, Italy, August 2019. Association for Computational Linguistics. URL: https://aclanthology.org/W19-5441, doi:10.18653/v1/W19-5441.

[ostling-tiedemann-2016-efficient]

Robert Östling and Jörg Tiedemann. Efficient word alignment with Markov Chain Monte Carlo. Prague Bulletin of Mathematical Linguistics, 106:125–146, October 2016. URL: http://ufal.mff.cuni.cz/pbml/106/art-ostling-tiedemann.pdf.