neurosnap / Sentences

License: MIT
A multilingual command line sentence tokenizer in Golang

Sentences - A command line sentence tokenizer

This command line utility will convert a blob of text into a list of sentences.

Install

go get gopkg.in/neurosnap/sentences.v1
go install gopkg.in/neurosnap/sentences.v1/_cmd/sentences

Binaries

Linux

Mac

Windows

Command

Get it

go get gopkg.in/neurosnap/sentences.v1

Use it

package main

import (
    "fmt"
    "log"

    "gopkg.in/neurosnap/sentences.v1"
    "gopkg.in/neurosnap/sentences.v1/data"
)

func main() {
    text := `A perennial also-ran, Stallings won his seat when longtime lawmaker David Holmes
    died 11 days after the filing deadline. Suddenly, Stallings was a shoo-in, not
    the long shot. In short order, the Legislature attempted to pass a law allowing
    former U.S. Rep. Carolyn Cheeks Kilpatrick to file; Stallings challenged the
    law in court and won. Kilpatrick mounted a write-in campaign, but Stallings won.`

    // Compiling language specific data into a binary file can be accomplished
    // by using `make <lang>` and then loading the `json` data:
    b, err := data.Asset("data/english.json")
    if err != nil {
        log.Fatal(err)
    }

    // load the training data
    training, err := sentences.LoadTraining(b)
    if err != nil {
        log.Fatal(err)
    }

    // create the default sentence tokenizer
    tokenizer := sentences.NewSentenceTokenizer(training)

    // named `sents` so the variable does not shadow the imported `sentences` package
    sents := tokenizer.Tokenize(text)
    for _, s := range sents {
        fmt.Println(s.Text)
    }
}

English

This package attempts to fix some problems I noticed for English.

package main

import (
    "fmt"

    "gopkg.in/neurosnap/sentences.v1/english"
)

func main() {
    text := "Hi there. Does this really work?"

    tokenizer, err := english.NewSentenceTokenizer(nil)
    if err != nil {
        panic(err)
    }

    sentences := tokenizer.Tokenize(text)
    for _, s := range sentences {
        fmt.Println(s.Text)
    }
}

Contributing

I need help maintaining this library. If you are interested in contributing, please start by looking at the golden-rules branch, which tests the Golden Rules for English sentence tokenization created by the Pragmatic Segmenter library.

Pick a particular failing test, then open an issue or submit a PR for it.

I'm happy to help anyone willing to contribute.

Customizable

Sentences was built around composability; most major components of this package can be extended.
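
To illustrate what composability can look like in a tokenizer of this kind, here is a minimal, self-contained sketch. Every name in it (Token, Annotator, AnnotatorFunc, markPeriods, noBreakAfter, tokenize) is hypothetical and chosen for this example; it is not the package's actual API. The idea shown is that each processing step sits behind a small interface, so a caller can append or swap steps without touching the rest.

```go
package main

import (
	"fmt"
	"strings"
)

// Token is a hypothetical token type for this sketch only.
type Token struct {
	Text      string
	SentBreak bool
}

// Annotator is a hypothetical extension point: one step that inspects
// the tokens and updates their flags.
type Annotator interface {
	Annotate(tokens []Token) []Token
}

// AnnotatorFunc lets a plain function satisfy Annotator.
type AnnotatorFunc func(tokens []Token) []Token

func (f AnnotatorFunc) Annotate(tokens []Token) []Token { return f(tokens) }

// markPeriods flags every token that ends in a period as a sentence break.
var markPeriods = AnnotatorFunc(func(tokens []Token) []Token {
	for i := range tokens {
		tokens[i].SentBreak = strings.HasSuffix(tokens[i].Text, ".")
	}
	return tokens
})

// noBreakAfter returns an annotator that clears the break flag on the
// listed tokens, for example known abbreviations.
func noBreakAfter(words ...string) Annotator {
	set := map[string]bool{}
	for _, w := range words {
		set[w] = true
	}
	return AnnotatorFunc(func(tokens []Token) []Token {
		for i := range tokens {
			if set[tokens[i].Text] {
				tokens[i].SentBreak = false
			}
		}
		return tokens
	})
}

// tokenize runs the annotators in order, then groups tokens into sentences.
func tokenize(text string, annotators ...Annotator) []string {
	var tokens []Token
	for _, f := range strings.Fields(text) {
		tokens = append(tokens, Token{Text: f})
	}
	for _, a := range annotators {
		tokens = a.Annotate(tokens)
	}
	var out, cur []string
	for _, t := range tokens {
		cur = append(cur, t.Text)
		if t.SentBreak {
			out = append(out, strings.Join(cur, " "))
			cur = nil
		}
	}
	if len(cur) > 0 {
		out = append(out, strings.Join(cur, " "))
	}
	return out
}

func main() {
	text := "Mr. Smith left. He returned."
	fmt.Println(len(tokenize(text, markPeriods)))                      // prints 3
	fmt.Println(len(tokenize(text, markPeriods, noBreakAfter("Mr.")))) // prints 2
}
```

Adding the second annotator changes the result without modifying the first one, which is the kind of extension the paragraph above describes.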

Eager to make ad hoc changes but don't know how to start? Have a look at github.com/neurosnap/sentences/english for a solid example.

Notice

I have not tested this tokenizer in any language other than English. By default, the command line utility loads the English training data. I welcome anyone willing to test the other languages and submit updates as needed.

A primary goal for this package is to be multilingual, so I'm willing to help in any way possible.

This library is a port of NLTK's Punkt tokenizer.

A Punkt Tokenizer

An unsupervised multilingual sentence boundary detection library for Go. The Punkt system accomplishes this by training the tokenizer on text in the given language. Once the likelihoods of abbreviations, collocations, and sentence starters are determined, finding sentence boundaries becomes easier.
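
As a rough intuition for how abbreviation candidates can be learned from unlabeled text, the sketch below counts how often each word type appears with a trailing period. This is a deliberately crude stand-in, not the statistic Punkt uses and not code from this package; the function name abbreviationCandidates and the minCount threshold are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// abbreviationCandidates returns the lowercased word types that occur at
// least minCount times and are followed by a period every time they occur.
// Hypothetical helper: a crude proxy for a learned abbreviation likelihood.
func abbreviationCandidates(text string, minCount int) []string {
	total := map[string]int{}
	withPeriod := map[string]int{}
	for _, tok := range strings.Fields(text) {
		word := strings.ToLower(strings.TrimSuffix(tok, "."))
		if word == "" {
			continue
		}
		total[word]++
		if strings.HasSuffix(tok, ".") {
			withPeriod[word]++
		}
	}
	var out []string
	for w, n := range total {
		if n >= minCount && withPeriod[w] == n {
			out = append(out, w)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	corpus := "Dr. Smith met Dr. Jones. Mr. Brown saw Dr. Smith. Jones left. Mr. Brown stayed."
	fmt.Println(abbreviationCandidates(corpus, 2)) // prints [dr mr]
}
```

Ordinary words such as "Smith" and "Jones" are dropped because they also appear without a period, while "Dr." and "Mr." never do. A real trainer needs far more text and a more careful statistic than this all-or-nothing rule.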

Many problems arise when tokenizing text into sentences; the primary one is abbreviations. The Punkt system attempts to determine whether a word is an abbreviation, the end of a sentence, or both by training on text in the given language. It incorporates both token- and type-based analysis of the text through two different phases of annotation.
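
The abbreviation problem is easy to reproduce. The sketch below is a toy splitter, not the Punkt algorithm and not this package's code: it breaks after any token ending in ".", "?" or "!" unless the token is in a supplied abbreviation set. The function name splitSentences and the hand-written abbreviation set are assumptions for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSentences breaks text after tokens ending in '.', '?' or '!',
// except when the token (lowercased, final period removed) is listed in
// abbrevs. Hypothetical helper for illustration only.
func splitSentences(text string, abbrevs map[string]bool) []string {
	var out, cur []string
	for _, tok := range strings.Fields(text) {
		cur = append(cur, tok)
		if !strings.HasSuffix(tok, ".") && !strings.HasSuffix(tok, "?") && !strings.HasSuffix(tok, "!") {
			continue
		}
		if abbrevs[strings.ToLower(strings.TrimSuffix(tok, "."))] {
			continue // known abbreviation: do not end the sentence here
		}
		out = append(out, strings.Join(cur, " "))
		cur = nil
	}
	if len(cur) > 0 {
		out = append(out, strings.Join(cur, " "))
	}
	return out
}

func main() {
	text := "Former U.S. Rep. Carolyn Kilpatrick filed. Stallings won."

	// No abbreviation knowledge: "U.S." and "Rep." are treated as sentence ends.
	fmt.Println(len(splitSentences(text, nil))) // prints 4

	// With a small abbreviation set the two real sentences are recovered.
	abbrevs := map[string]bool{"u.s": true, "rep": true}
	fmt.Println(len(splitSentences(text, abbrevs))) // prints 2
}
```

This toy never breaks after a listed abbreviation, so it fails on the "both" case mentioned above, where an abbreviation also ends a sentence. Handling that case is one reason context around each token matters.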

Reference: Kiss, T. and Strunk, J. (2006). Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32(4).

Performance

Using the Brown Corpus, which is annotated American English text, we compare this package with other libraries across multiple programming languages.

Library    Avg Speed (s, 10 runs)  Accuracy (%)
Sentences  1.96                    98.95
NLTK       5.22                    99.21
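
How the averages above were measured is not described here. As one plausible way to obtain a comparable "average over 10 runs" figure, the sketch below times a function repeatedly and returns the mean wall-clock duration. The helper name averageRuntime and the stand-in workload are assumptions; to time the real tokenizer, replace the workload with a call to tokenizer.Tokenize on the corpus text.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// averageRuntime calls fn `runs` times and returns the mean wall-clock
// duration of a single call. Hypothetical helper for illustration.
func averageRuntime(runs int, fn func()) time.Duration {
	if runs <= 0 {
		return 0
	}
	var total time.Duration
	for i := 0; i < runs; i++ {
		start := time.Now()
		fn()
		total += time.Since(start)
	}
	return total / time.Duration(runs)
}

func main() {
	text := strings.Repeat("Hi there. Does this really work? ", 10000)

	avg := averageRuntime(10, func() {
		_ = strings.Split(text, ". ") // stand-in workload, not a real tokenizer
	})
	fmt.Printf("average over 10 runs: %v\n", avg)
}
```

For the figures in the table, 5.22 s against 1.96 s works out to roughly a 2.7x speed difference, at a cost of 0.26 percentage points of accuracy.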