Citing and references

Citing

If you use OpusFilter in your research, please cite our ACL 2020 paper [Aulamo et al., 2020]:

@inproceedings{aulamo-etal-2020-opusfilter,
    title = "{O}pus{F}ilter: A Configurable Parallel Corpus Filtering Toolbox",
    author = {Aulamo, Mikko and Virpioja, Sami and Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.acl-demos.20",
    doi = "10.18653/v1/2020.acl-demos.20",
    pages = "150--156"
}
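
A minimal sketch of how the entry above might be cited from a LaTeX document. It assumes the entry has been saved in a file named references.bib (a hypothetical filename) and that the standard BibTeX workflow with the plain bibliography style is used; the example sentence is illustrative only:

```latex
\documentclass{article}
\begin{document}

% The citation key matches the @inproceedings entry above.
We filtered the parallel data with OpusFilter~\cite{aulamo-etal-2020-opusfilter}.

% references.bib is an assumed filename holding the BibTeX entry.
\bibliographystyle{plain}
\bibliography{references}

\end{document}
```

With this setup, the document would typically be compiled by running LaTeX, then BibTeX, then LaTeX twice more so that the citation resolves.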

References

A full bibliography of papers cited in the documentation and code:

[artetxe-schwenk-2018-margin]

Mikel Artetxe and Holger Schwenk. Margin-based parallel corpus mining with multilingual sentence embeddings. CoRR, 2018. URL: http://arxiv.org/abs/1811.01136, arXiv:1811.01136.

[aulamo-etal-2023-unsupervised]

Mikko Aulamo, Ona de Gibert, Sami Virpioja, and Jörg Tiedemann. Unsupervised feature selection for effective parallel corpus filtering. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, 31–38. Tampere, Finland, June 2023. European Association for Machine Translation. URL: https://aclanthology.org/2023.eamt-1.4.

[aulamo-etal-2020-opustools]

Mikko Aulamo, Umut Sulubacak, Sami Virpioja, and Jörg Tiedemann. OpusTools and parallel corpus diagnostics. In Proceedings of the 12th Language Resources and Evaluation Conference, 3782–3789. Marseille, France, May 2020. European Language Resources Association. URL: https://aclanthology.org/2020.lrec-1.467.

[aulamo-etal-2020-opusfilter]

Mikko Aulamo, Sami Virpioja, and Jörg Tiedemann. OpusFilter: a configurable parallel corpus filtering toolbox. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 150–156. Association for Computational Linguistics, July 2020. URL: https://aclanthology.org/2020.acl-demos.20, doi:10.18653/v1/2020.acl-demos.20.

[chaudhary-etal-2019-low]

Vishrav Chaudhary, Yuqing Tang, Francisco Guzmán, Holger Schwenk, and Philipp Koehn. Low-resource corpus filtering using multilingual sentence embeddings. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), 261–266. Florence, Italy, August 2019. Association for Computational Linguistics. URL: https://aclanthology.org/W19-5435, doi:10.18653/v1/W19-5435.

[joulin-etal-2016-fasttext]

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomás Mikolov. FastText.zip: compressing text classification models. CoRR, 2016. URL: http://arxiv.org/abs/1612.03651, arXiv:1612.03651.

[joulin-etal-2017-bag]

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 427–431. Valencia, Spain, April 2017. Association for Computational Linguistics. URL: https://aclanthology.org/E17-2068.

[koehn-2005-europarl]

Philipp Koehn. Europarl: a parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, 79–86. Phuket, Thailand, September 2005. URL: https://aclanthology.org/2005.mtsummit-papers.11.

[lui-baldwin-2012-langid]

Marco Lui and Timothy Baldwin. Langid.py: an off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, 25–30. Jeju Island, Korea, July 2012. Association for Computational Linguistics. URL: https://aclanthology.org/P12-3005.

[moore-lewis-2010-intelligent]

Robert C. Moore and William Lewis. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, 220–224. Uppsala, Sweden, July 2010. Association for Computational Linguistics. URL: https://aclanthology.org/P10-2041.

[sennrich-etal-2016-neural]

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1715–1725. Berlin, Germany, August 2016. Association for Computational Linguistics. URL: https://aclanthology.org/P16-1162, doi:10.18653/v1/P16-1162.

[siivola-etal-2007-growing]

Vesa Siivola, Teemu Hirsimäki, and Sami Virpioja. On growing and pruning Kneser-Ney smoothed n-gram models. IEEE Transactions on Audio, Speech and Language Processing, 15(5):1617–1624, 2007. URL: https://doi.org/10.1109/TASL.2007.896666.

[tiedemann-2012-parallel]

Jörg Tiedemann. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2214–2218. Istanbul, Turkey, May 2012. European Language Resources Association (ELRA). URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.

[vatanen-etal-2010-language]

Tommi Vatanen, Jaakko J. Väyrynen, and Sami Virpioja. Language identification of short text segments with n-gram models. In Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10). European Language Resources Association (ELRA), May 2010.

[virpioja-etal-2013-morfessor]

Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. Report 25/2013 in Aalto University publication series SCIENCE + TECHNOLOGY, Department of Signal Processing and Acoustics, Aalto University, Helsinki, Finland, 2013.

[vazquez-etal-2019-university]

Raúl Vázquez, Umut Sulubacak, and Jörg Tiedemann. The University of Helsinki submission to the WMT19 parallel corpus filtering task. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), 294–300. Florence, Italy, August 2019. Association for Computational Linguistics. URL: https://aclanthology.org/W19-5441, doi:10.18653/v1/W19-5441.

[ostling-tiedemann-2016-efficient]

Robert Östling and Jörg Tiedemann. Efficient word alignment with Markov Chain Monte Carlo. Prague Bulletin of Mathematical Linguistics, 106:125–146, October 2016. URL: http://ufal.mff.cuni.cz/pbml/106/art-ostling-tiedemann.pdf.

References as BibTeX

Finally, here are BibTeX entries for all the references:

% Artetxe & Schwenk (2018)
@article{artetxe-schwenk-2018-margin,
  author    = {Mikel Artetxe and Holger Schwenk},
  title     = {Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings},
  journal   = {CoRR},
  volume    = {abs/1811.01136},
  year      = {2018},
  url       = {http://arxiv.org/abs/1811.01136},
  eprinttype = {arXiv},
  eprint    = {1811.01136},
  timestamp = {Thu, 22 Nov 2018 17:58:30 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1811-01136.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

% Aulamo et al. (2020a)
@inproceedings{aulamo-etal-2020-opustools,
    title = "{O}pus{T}ools and Parallel Corpus Diagnostics",
    author = {Aulamo, Mikko and Sulubacak, Umut and Virpioja, Sami and Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.467",
    pages = "3782--3789",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

% Aulamo et al. (2020b)
@inproceedings{aulamo-etal-2020-opusfilter,
    title = "{O}pus{F}ilter: A Configurable Parallel Corpus Filtering Toolbox",
    author = {Aulamo, Mikko and Virpioja, Sami and Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.acl-demos.20",
    doi = "10.18653/v1/2020.acl-demos.20",
    pages = "150--156",
}

% Aulamo et al. (2023)
@inproceedings{aulamo-etal-2023-unsupervised,
    title = "Unsupervised Feature Selection for Effective Parallel Corpus Filtering",
    author = {Aulamo, Mikko and de Gibert, Ona and Virpioja, Sami and Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 24th Annual Conference of the European Association for Machine Translation",
    month = jun,
    year = "2023",
    address = "Tampere, Finland",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2023.eamt-1.4",
    pages = "31--38",
}

% Chaudhary et al. (2019)
@inproceedings{chaudhary-etal-2019-low,
    title = "Low-Resource Corpus Filtering Using Multilingual Sentence Embeddings",
    author = "Chaudhary, Vishrav and Tang, Yuqing and Guzm{\'a}n, Francisco and Schwenk, Holger and Koehn, Philipp",
    booktitle = "Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W19-5435",
    doi = "10.18653/v1/W19-5435",
    pages = "261--266"
}

% Joulin et al. (2016)
@article{joulin-etal-2016-fasttext,
    author    = {Armand Joulin and Edouard Grave and Piotr Bojanowski and Matthijs Douze and Herv{\'{e}} J{\'{e}}gou and Tom{\'{a}}s Mikolov},
    title     = {{F}ast{T}ext.zip: Compressing text classification models},
    journal   = {CoRR},
    volume    = {abs/1612.03651},
    year      = {2016},
    url       = {http://arxiv.org/abs/1612.03651},
    archivePrefix = {arXiv},
    eprint    = {1612.03651},
    timestamp = {Mon, 28 Dec 2020 11:31:02 +0100},
    biburl    = {https://dblp.org/rec/journals/corr/JoulinGBDJM16.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

% Joulin et al. (2017)
@inproceedings{joulin-etal-2017-bag,
    title = "Bag of Tricks for Efficient Text Classification",
    author = "Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas",
    booktitle = "Proceedings of the 15th Conference of the {E}uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers",
    month = apr,
    year = "2017",
    address = "Valencia, Spain",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/E17-2068",
    pages = "427--431",
}

% Koehn (2005)
@inproceedings{koehn-2005-europarl,
    title = "{E}uroparl: A Parallel Corpus for Statistical Machine Translation",
    author = "Koehn, Philipp",
    booktitle = "Proceedings of Machine Translation Summit X: Papers",
    month = sep,
    year = "2005",
    address = "Phuket, Thailand",
    url = "https://aclanthology.org/2005.mtsummit-papers.11",
    pages = "79--86"
}

% Lui and Baldwin (2012)
@inproceedings{lui-baldwin-2012-langid,
    title = "langid.py: An Off-the-shelf Language Identification Tool",
    author = "Lui, Marco and Baldwin, Timothy",
    booktitle = "Proceedings of the {ACL} 2012 System Demonstrations",
    month = jul,
    year = "2012",
    address = "Jeju Island, Korea",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P12-3005",
    pages = "25--30",
}

% Moore and Lewis (2010)
@inproceedings{moore-lewis-2010-intelligent,
    title = "Intelligent Selection of Language Model Training Data",
    author = "Moore, Robert C. and Lewis, William",
    booktitle = "Proceedings of the {ACL} 2010 Conference Short Papers",
    month = jul,
    year = "2010",
    address = "Uppsala, Sweden",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P10-2041",
    pages = "220--224",
}

% Östling and Tiedemann (2016)
@article{ostling-tiedemann-2016-efficient,
    title = {Efficient word alignment with {M}arkov {C}hain {M}onte {C}arlo},
    author = {Robert {\"O}stling and J{\"o}rg Tiedemann},
    journal = {Prague Bulletin of Mathematical Linguistics},
    year = {2016},
    month = {October},
    pages = {125--146},
    volume = {106},
    url = {http://ufal.mff.cuni.cz/pbml/106/art-ostling-tiedemann.pdf}
}

% Sennrich et al. (2016)
@inproceedings{sennrich-etal-2016-neural,
    title = "Neural Machine Translation of Rare Words with Subword Units",
    author = "Sennrich, Rico and Haddow, Barry and Birch, Alexandra",
    booktitle = "Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2016",
    address = "Berlin, Germany",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/P16-1162",
    doi = "10.18653/v1/P16-1162",
    pages = "1715--1725",
}

% Siivola et al. (2007)
@article{siivola-etal-2007-growing,
    author = {Vesa Siivola and Teemu Hirsim{\"a}ki and Sami Virpioja},
    title = {On Growing and Pruning {K}neser-{N}ey Smoothed N-Gram Models},
    journal = {IEEE Transactions on Audio, Speech and Language Processing},
    volume = {15},
    number = {5},
    pages = {1617--1624},
    year = {2007},
    url = {https://doi.org/10.1109/TASL.2007.896666}
}

% Tiedemann (2012)
@inproceedings{tiedemann-2012-parallel,
    title = "Parallel Data, Tools and Interfaces in {OPUS}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf",
    pages = "2214--2218"
}

% Vatanen et al. (2010)
@inproceedings{vatanen-etal-2010-language,
    title = "Language Identification of Short Text Segments with N-gram Models",
    author = "Tommi Vatanen and V{\"a}yrynen, {Jaakko J.} and Sami Virpioja",
    year = "2010",
    month = may,
    editor = "Nicoletta Calzolari and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias",
    booktitle = "Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10)",
    publisher = "European Language Resources Association (ELRA)"
}

% Vázquez et al. (2019)
@inproceedings{vazquez-etal-2019-university,
    title = "The {U}niversity of {H}elsinki Submission to the {WMT}19 Parallel Corpus Filtering Task",
    author = {V{\'a}zquez, Ra{\'u}l and Sulubacak, Umut and Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W19-5441",
    doi = "10.18653/v1/W19-5441",
    pages = "294--300"
}

% Virpioja et al. (2013)
@techreport{virpioja-etal-2013-morfessor,
    address = {Helsinki, Finland},
    author = {Virpioja, Sami and Smit, Peter and Gr{\"o}nroos, Stig-Arne and Kurimo, Mikko},
    institution = {Department of Signal Processing and Acoustics, Aalto University},
    language = {eng},
    number = {25/2013 in Aalto University publication series SCIENCE + TECHNOLOGY},
    pages = {38},
    series = {Aalto University publication series SCIENCE + TECHNOLOGY},
    title = {Morfessor 2.0: {P}ython Implementation and Extensions for {M}orfessor {B}aseline},
    type = {Report},
    year = {2013},
}