MonolingualSentenceSplitter
Split monolingual text segments into sentences.
Parameters:
- language: language code for the input
- non_breaking_prefix_file: override the language’s non-breaking prefix file with a custom one (optional; default null)
- enable_parallel: do not raise an exception if the input is parallel data (optional; default false)
The sentence splitting method is imported from the sentence-splitter library. It uses a heuristic algorithm by Philipp Koehn and Josh Schroeder, developed for the Europarl corpus [Koehn, 2005]. It mostly supports European languages, but a non-breaking prefix file can be provided for new languages.
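To illustrate what a non-breaking prefix list does, here is a minimal, simplified sketch in Python. It is not the sentence-splitter library's actual implementation: the function name split_sentences and the tiny prefix set are hypothetical, and the real heuristic handles many more cases (quotes, numbers, language-specific rules).

```python
# Simplified illustration of prefix-aware sentence splitting.
# Assumption: this tiny set stands in for a real non-breaking prefix file.
NON_BREAKING_PREFIXES = {"Mr", "Mrs", "Dr", "Prof", "e.g", "i.e", "etc"}


def split_sentences(text, prefixes=NON_BREAKING_PREFIXES):
    """Split text at ., ? or ! unless the period belongs to a known prefix."""
    tokens = text.split()
    sentences = []
    current = []
    for i, token in enumerate(tokens):
        current.append(token)
        if i == len(tokens) - 1:
            break  # the last token always closes the final sentence below
        if token.endswith((".", "?", "!")):
            stem = token.rstrip(".?!")
            # A period after a listed prefix (e.g. "Dr.") does not end a sentence.
            is_prefix = token.endswith(".") and stem in prefixes
            # Only break if the next token looks like a sentence start.
            next_starts_upper = tokens[i + 1][:1].isupper()
            if not is_prefix and next_starts_upper:
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


result = split_sentences("Dr. Smith arrived late. He apologized to Mr. Jones.")
print(result)
# ['Dr. Smith arrived late.', 'He apologized to Mr. Jones.']
```

Without the prefix set, the same rule would break after "Dr." and "Mr.", which is why a custom prefix file matters when adding a new language.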
Warning: This is not intended for parallel data, because the number
of output lines produced for each parallel input line would not always
match. For this reason, you can define only a single language, and an
exception is raised if multiple input files are provided. The
exception can be disabled with the enable_parallel option for
special cases.
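The line-count mismatch described in the warning can be sketched as follows. This is a hedged illustration using a deliberately naive regex splitter (the hypothetical helper naive_split), not the library's algorithm; the sentence pair is invented for the example.

```python
import re


def naive_split(line):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", line.strip()) if s]


# One aligned line pair: the English side has two sentences,
# the German side expresses the same content in a single sentence.
source_line = "It was late. We went home."
target_line = "Es war spät, also gingen wir nach Hause."

source_sentences = naive_split(source_line)
target_sentences = naive_split(target_line)

# After splitting, one aligned pair becomes 2 lines on one side and
# 1 line on the other, so the files no longer align line by line.
print(len(source_sentences), len(target_sentences))
```

Because each side is split independently, every later line in the two output files would be shifted relative to its translation.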