Text to Sentence Splitter

.. image:: https://travis-ci.org/berkmancenter/mediacloud-sentence-splitter.svg?branch=develop
    :target: https://travis-ci.org/berkmancenter/mediacloud-sentence-splitter

.. image:: https://coveralls.io/repos/github/berkmancenter/mediacloud-sentence-splitter/badge.svg?branch=develop
    :target: https://coveralls.io/github/berkmancenter/mediacloud-sentence-splitter?branch=develop

Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder.

This module splits text paragraphs into sentences. It is based on scripts developed by Philipp Koehn and Josh Schroeder for processing the `Europarl corpus <http://www.statmt.org/europarl/>`_.

The module is a port of the `Lingua::Sentence Perl module <http://search.cpan.org/perldoc?Lingua::Sentence>`_ with some additions: improved non-breaking prefix lists for some languages, and added support for Danish, Finnish, Lithuanian, Norwegian (Bokmål), Romanian, and Turkish.

Usage

The module uses punctuation and capitalization clues to split plain text into a list of sentences:

.. code-block:: python

    from sentence_splitter import SentenceSplitter, split_text_into_sentences

    #
    # Object interface
    #
    splitter = SentenceSplitter(language='en')
    print(splitter.split(text='This is a paragraph. It contains several sentences. "But why," you ask?'))
    # ['This is a paragraph.', 'It contains several sentences.', '"But why," you ask?']

    #
    # Functional interface
    #
    print(split_text_into_sentences(
        text='This is a paragraph. It contains several sentences. "But why," you ask?',
        language='en'
    ))
    # ['This is a paragraph.', 'It contains several sentences.', '"But why," you ask?']
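To give a feel for what "punctuation and capitalization clues" means, here is a minimal standalone sketch of the general idea. It is not the module's actual implementation: the ``naive_split`` function, the regular expression, and the tiny prefix set are all assumptions made for illustration, and the real prefix lists are much longer and language-specific.

```python
import re

# Hypothetical, tiny prefix set for illustration only.
NON_BREAKING_PREFIXES = {"Mr", "Mrs", "Dr", "Prof", "e.g", "i.e"}

# Candidate boundary: sentence-ending punctuation, optional closing quotes or
# brackets, whitespace, then (lookahead) optional opening quotes or brackets
# followed by an uppercase ASCII letter.
_BOUNDARY = re.compile(r'([.?!]+["\')\]]*)\s+(?=["\'(\[]*[A-Z])')


def naive_split(text):
    """Split text at candidate boundaries, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        # The word immediately before the punctuation, e.g. "Dr" in "Dr. Smith".
        words = text[start:match.start(1)].split()
        last_word = words[-1] if words else ""
        if match.group(1).startswith(".") and last_word in NON_BREAKING_PREFIXES:
            continue  # Looks like an abbreviation, not a sentence end.
        sentences.append(text[start:match.end(1)].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(naive_split('This is a paragraph. It contains several sentences. "But why," you ask?'))
# ['This is a paragraph.', 'It contains several sentences.', '"But why," you ask?']
print(naive_split("Dr. Smith arrived. He was late."))
# ['Dr. Smith arrived.', 'He was late.']
```

The second call shows why a non-breaking prefix list matters: without it, the period after "Dr" would be treated as a sentence end.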

You can provide your own non-breaking prefix file to add support for new Latin languages or improve sentence tokenization of the currently supported ones:

.. code-block:: python

    from sentence_splitter import SentenceSplitter, split_text_into_sentences

    # Object interface
    splitter = SentenceSplitter(language='en', non_breaking_prefix_file='custom_english_non_breaking_prefixes.txt')
    print(splitter.split(text='This is a paragraph. It contains several sentences. "But why," you ask?'))

    # Functional interface
    print(split_text_into_sentences(
        text='This is a paragraph. It contains several sentences. "But why," you ask?',
        language='en',
        non_breaking_prefix_file='custom_english_non_breaking_prefixes.txt'
    ))
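This README does not describe the layout of a non-breaking prefix file. The sketch below assumes the convention used by the Europarl/Moses prefix lists: one prefix per line, ``#`` for comments, and a trailing ``#NUMERIC_ONLY#`` marker for prefixes that are non-breaking only when a number follows. Check the prefix files bundled with the module before relying on this. ``load_prefixes`` is a hypothetical helper, not part of the module's API.

```python
import io

NUMERIC_ONLY_MARKER = "#NUMERIC_ONLY#"


def load_prefixes(lines):
    """Return (always_non_breaking, numeric_only) sets from an iterable of lines."""
    always, numeric_only = set(), set()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and comments.
        if line.endswith(NUMERIC_ONLY_MARKER):
            prefix = line[: -len(NUMERIC_ONLY_MARKER)].strip()
            if prefix:
                numeric_only.add(prefix)
        else:
            always.add(line)
    return always, numeric_only


# Sample content in the assumed format.
SAMPLE = """\
# Titles that never end a sentence
Dr
Prof

# Non-breaking only before a number, as in "No. 5"
No #NUMERIC_ONLY#
"""

always, numeric_only = load_prefixes(io.StringIO(SAMPLE))
print(sorted(always))        # ['Dr', 'Prof']
print(sorted(numeric_only))  # ['No']
```

A quick parse like this can help sanity-check a custom file for stray whitespace or malformed lines before passing its path as ``non_breaking_prefix_file``.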

Languages

Currently supported languages are:

  • Catalan (ca)
  • Czech (cs)
  • Danish (da)
  • Dutch (nl)
  • English (en)
  • Finnish (fi)
  • French (fr)
  • German (de)
  • Greek (el)
  • Hungarian (hu)
  • Icelandic (is)
  • Italian (it)
  • Latvian (lv)
  • Lithuanian (lt)
  • Norwegian (Bokmål) (no)
  • Polish (pl)
  • Portuguese (pt)
  • Romanian (ro)
  • Russian (ru)
  • Slovak (sk)
  • Slovene (sl)
  • Spanish (es)
  • Swedish (sv)
  • Turkish (tr)
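If language codes come from user input or configuration, it can be handy to check them against the list above before constructing a splitter. The mapping below is copied from that list. ``require_supported`` is a hypothetical helper for illustration, not part of the module, and the module may perform its own validation.

```python
# Language codes and names as listed in this README.
SUPPORTED_LANGUAGES = {
    "ca": "Catalan", "cs": "Czech", "da": "Danish", "nl": "Dutch",
    "en": "English", "fi": "Finnish", "fr": "French", "de": "German",
    "el": "Greek", "hu": "Hungarian", "is": "Icelandic", "it": "Italian",
    "lv": "Latvian", "lt": "Lithuanian", "no": "Norwegian (Bokmål)",
    "pl": "Polish", "pt": "Portuguese", "ro": "Romanian", "ru": "Russian",
    "sk": "Slovak", "sl": "Slovene", "es": "Spanish", "sv": "Swedish",
    "tr": "Turkish",
}


def require_supported(language):
    """Normalize a language code and raise ValueError if it is not listed."""
    code = language.strip().lower()
    if code not in SUPPORTED_LANGUAGES:
        known = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"Unsupported language {language!r}; expected one of: {known}")
    return code


print(require_supported(" EN "))  # en
print(SUPPORTED_LANGUAGES[require_supported("lt")])  # Lithuanian
```

For a language that is not listed, a custom non-breaking prefix file (as described in the usage section) is the documented way to extend support.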

License

Copyright (C) 2010 by Digital Silk Road, 2017 Linas Valiukas.

Portions Copyright (C) 2005 by Philipp Koehn and Josh Schroeder (used with permission).

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/.
