All Projects → jdkato → Prose

jdkato / Prose

Licence: mit
📖 A Golang library for text processing, including tokenization, part-of-speech tagging, and named-entity extraction.

Programming Languages

go
31211 projects - #10 most used programming language
python
139335 projects - #7 most used programming language
shell
77523 projects
Makefile
30231 projects

Projects that are alternatives of or similar to Prose

Pytorch Bert Crf Ner
KoBERT와 CRF로 만든 한국어 개체명인식기 (BERT+CRF based Named Entity Recognition model for Korean)
Stars: ✭ 236 (-91.69%)
Mutual labels:  natural-language-processing
Jack
Jack the Reader
Stars: ✭ 242 (-91.48%)
Mutual labels:  natural-language-processing
Low Resource Languages
Resources for conservation, development, and documentation of low resource (human) languages.
Stars: ✭ 247 (-91.3%)
Mutual labels:  natural-language-processing
Pykakasi
NLP: Convert Japanese Kana-kanji sentences into Kana-Roman in simple algorithm.
Stars: ✭ 238 (-91.62%)
Mutual labels:  natural-language-processing
Tensorflow qrnn
QRNN implementation for TensorFlow
Stars: ✭ 241 (-91.51%)
Mutual labels:  natural-language-processing
Dont Stop Pretraining
Code associated with the Don't Stop Pretraining ACL 2020 paper
Stars: ✭ 243 (-91.44%)
Mutual labels:  natural-language-processing
Deepnlp Models Pytorch
Pytorch implementations of various Deep NLP models in cs-224n(Stanford Univ)
Stars: ✭ 2,760 (-2.82%)
Mutual labels:  natural-language-processing
Datacamp Python Data Science Track
All the slides, accompanying code and exercises all stored in this repo. 🎈
Stars: ✭ 250 (-91.2%)
Mutual labels:  natural-language-processing
Summarization Papers
Summarization Papers
Stars: ✭ 238 (-91.62%)
Mutual labels:  natural-language-processing
Awesome Grounding
awesome grounding: A curated list of research papers in visual grounding
Stars: ✭ 247 (-91.3%)
Mutual labels:  natural-language-processing
Malaya
Natural Language Toolkit for bahasa Malaysia, https://malaya.readthedocs.io/
Stars: ✭ 239 (-91.58%)
Mutual labels:  natural-language-processing
Pytorch Sentiment Analysis
Tutorials on getting started with PyTorch and TorchText for sentiment analysis.
Stars: ✭ 3,209 (+12.99%)
Mutual labels:  natural-language-processing
Insight
Repository for Project Insight: NLP as a Service
Stars: ✭ 246 (-91.34%)
Mutual labels:  natural-language-processing
Chazutsu
The tool to make NLP datasets ready to use
Stars: ✭ 238 (-91.62%)
Mutual labels:  natural-language-processing
Good Papers
I try my best to keep updated cutting-edge knowledge in Machine Learning/Deep Learning and Natural Language Processing. These are my notes on some good papers
Stars: ✭ 248 (-91.27%)
Mutual labels:  natural-language-processing
Mitie
MITIE: library and tools for information extraction
Stars: ✭ 2,693 (-5.18%)
Mutual labels:  natural-language-processing
Bertviz
Tool for visualizing attention in the Transformer model (BERT, GPT-2, Albert, XLNet, RoBERTa, CTRL, etc.)
Stars: ✭ 3,443 (+21.23%)
Mutual labels:  natural-language-processing
Book Nlp
Natural language processing pipeline for book-length documents
Stars: ✭ 249 (-91.23%)
Mutual labels:  natural-language-processing
Awesome Tensorlayer
A curated list of dedicated resources and applications
Stars: ✭ 248 (-91.27%)
Mutual labels:  natural-language-processing
Soulvercore
A powerful Swift framework for evaluating mathematical expressions
Stars: ✭ 245 (-91.37%)
Mutual labels:  natural-language-processing

prose Build Status GoDoc Coverage Status Go Report Card codebeat badge Awesome

prose is a natural language processing library (English only, at the moment) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.

You can find a more detailed summary on the library's performance here: Introducing prose v2.0.0: Bringing NLP to Go.

Installation

$ go get github.com/jdkato/prose/v2

Usage

Contents

Overview

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag, tok.Label)
        // Go NNP B-GPE
        // is VBZ O
        // an DT O
        // ...
    }

    // Iterate over the doc's named-entities:
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Go GPE
        // Google GPE
    }

    // Iterate over the doc's sentences:
    for _, sent := range doc.Sentences() {
        fmt.Println(sent.Text)
        // Go is an open-source programming language created at Google.
    }
}

The document-creation process adheres to the following sequence of steps:

tokenization -> POS tagging -> NE extraction
            \
             segmentation

Each step may be disabled (assuming later steps aren't required) by passing the appropriate functional option. To disable named-entity extraction, for example, you'd do the following:

doc, err := prose.NewDocument(
        "Go is an open-source programming language created at Google.",
        prose.WithExtraction(false))

Tokenizing

prose includes a tokenizer capable of processing modern text, including the non-word character spans shown below.

Type Example
Email addresses [email protected]
Hashtags #trending
Mentions @jdkato
URLs https://github.com/jdkato/prose
Emoticons :-), >:(, o_0, etc.
package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag)
        // @jdkato NN
        // , ,
        // go VB
        // to TO
        // http://example.com NN
        // thanks NNS
        // :) SYM
        // . .
    }
}

Segmenting

prose includes one of the most accurate sentence segmenters available, according to the Golden Rules created by the developers of the pragmatic_segmenter.

Name Language License GRS (English) GRS (Other) Speed†
Pragmatic Segmenter Ruby MIT 98.08% (51/52) 100.00% 3.84 s
prose Go MIT 75.00% (39/52) N/A 0.96 s
TactfulTokenizer Ruby GNU GPLv3 65.38% (34/52) 48.57% 46.32 s
OpenNLP Java APLv2 59.62% (31/52) 45.71% 1.27 s
Standford CoreNLP Java GNU GPLv3 59.62% (31/52) 31.43% 0.92 s
Splitta Python APLv2 55.77% (29/52) 37.14% N/A
Punkt Python APLv2 46.15% (24/52) 48.57% 1.79 s
SRX English Ruby GNU GPLv3 30.77% (16/52) 28.57% 6.19 s
Scapel Ruby GNU GPLv3 28.85% (15/52) 20.00% 0.13 s

† The original tests were performed using a MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running 10.9.5, while prose was timed using a MacBook Pro 2.9 GHz Intel Core i7 running 10.13.3.

package main

import (
    "fmt"
    "strings"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, _ := prose.NewDocument(strings.Join([]string{
        "I can see Mt. Fuji from here.",
        "St. Michael's Church is on 5th st. near the light."}, " "))

    // Iterate over the doc's sentences:
    sents := doc.Sentences()
    fmt.Println(len(sents)) // 2
    for _, sent := range sents {
        fmt.Println(sent.Text)
        // I can see Mt. Fuji from here.
        // St. Michael's Church is on 5th st. near the light.
    }
}

Tagging

prose includes a tagger based on Textblob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:

Library Accuracy 5-Run Average (sec)
NLTK 0.893 7.224
prose 0.961 2.538

(See scripts/test_model.py for more information.)

The full list of supported POS tags is given below.

TAG DESCRIPTION
( left round bracket
) right round bracket
, comma
: colon
. period
'' closing quotation mark
`` opening quotation mark
# number sign
$ currency
CC conjunction, coordinating
CD cardinal number
DT determiner
EX existential there
FW foreign word
IN conjunction, subordinating or preposition
JJ adjective
JJR adjective, comparative
JJS adjective, superlative
LS list item marker
MD verb, modal auxiliary
NN noun, singular or mass
NNP noun, proper singular
NNPS noun, proper plural
NNS noun, plural
PDT predeterminer
POS possessive ending
PRP pronoun, personal
PRP$ pronoun, possessive
RB adverb
RBR adverb, comparative
RBS adverb, superlative
RP adverb, particle
SYM symbol
TO infinitival to
UH interjection
VB verb, base form
VBD verb, past tense
VBG verb, gerund or present participle
VBN verb, past participle
VBP verb, non-3rd person singular present
VBZ verb, 3rd person singular present
WDT wh-determiner
WP wh-pronoun, personal
WP$ wh-pronoun, possessive
WRB wh-adverb

NER

prose v2.0.0 includes a much improved version of v1.0.0's chunk package, which can identify people (PERSON) and geographical/political Entities (GPE) by default.

package main

import (
    "github.com/jdkato/prose/v2"
)

func main() {
    doc, _ := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Lebron James PERSON
        // Los Angeles GPE
    }
}

However, in an attempt to make this feature more useful, we've made it straightforward to train your own models for specific use cases. See Prodigy + prose: Radically efficient machine teaching in Go for a tutorial.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].