
clipperhouse / jargon

License: MIT
Tokenizers and lemmatizers for Go


Projects that are alternatives of or similar to jargon

simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Stars: ✭ 32 (-67.35%)
Mutual labels:  tokenizer, lemmatizer
mystem-scala
Morphological analyzer `mystem` (Russian language) wrapper for JVM languages
Stars: ✭ 21 (-78.57%)
Mutual labels:  tokenizer, lemmatizer
golem
A lemmatizer implemented in Go
Stars: ✭ 54 (-44.9%)
Mutual labels:  lemmatizer
farasapy
A Python implementation of Farasa toolkit
Stars: ✭ 69 (-29.59%)
Mutual labels:  tokenizer
tokenizer
A simple tokenizer in Ruby for NLP tasks.
Stars: ✭ 44 (-55.1%)
Mutual labels:  tokenizer
xontrib-output-search
Get identifiers, paths, URLs and words from the previous command output and use them for the next command in xonsh shell.
Stars: ✭ 26 (-73.47%)
Mutual labels:  tokenizer
lex
Lex is an implementation of the lex tool in Ruby.
Stars: ✭ 49 (-50%)
Mutual labels:  tokenizer
suika
Suika 🍉 is a Japanese morphological analyzer written in pure Ruby
Stars: ✭ 31 (-68.37%)
Mutual labels:  tokenizer
elasticsearch-plugins
Some native scoring script plugins for elasticsearch
Stars: ✭ 30 (-69.39%)
Mutual labels:  tokenizer
lindera
A morphological analysis library.
Stars: ✭ 226 (+130.61%)
Mutual labels:  tokenizer
rustfst
Rust re-implementation of OpenFST - library for constructing, combining, optimizing, and searching weighted finite-state transducers (FSTs). A Python binding is also available.
Stars: ✭ 104 (+6.12%)
Mutual labels:  tokenizer
lara-hungarian-nlp
NLP class for rapid ChatBot development in Hungarian language
Stars: ✭ 27 (-72.45%)
Mutual labels:  lemmatizer
python-mecab
A repository to bind mecab for Python 3.5+. Not using swig nor pybind. (Not Maintained Now)
Stars: ✭ 27 (-72.45%)
Mutual labels:  tokenizer
psr2r-sniffer
A PSR-2-R code sniffer and code-style auto-correction-tool - including many useful additions
Stars: ✭ 32 (-67.35%)
Mutual labels:  tokenizer
snapdragon-lexer
Converts a string into an array of tokens, with useful methods for looking ahead and behind, capturing, matching, et cetera.
Stars: ✭ 19 (-80.61%)
Mutual labels:  tokenizer
neural tokenizer
Tokenize English sentences using neural networks.
Stars: ✭ 64 (-34.69%)
Mutual labels:  tokenizer
chinese-tokenizer
Tokenizes Chinese texts into words.
Stars: ✭ 72 (-26.53%)
Mutual labels:  tokenizer
SwiLex
A universal lexer library in Swift.
Stars: ✭ 29 (-70.41%)
Mutual labels:  tokenizer
hunspell
High-Performance Stemmer, Tokenizer, and Spell Checker for R
Stars: ✭ 101 (+3.06%)
Mutual labels:  tokenizer
Turkish-Lemmatizer
Lemmatization for Turkish Language
Stars: ✭ 72 (-26.53%)
Mutual labels:  lemmatizer

Jargon

Jargon is a text pipeline, focused on recognizing variations on canonical and synonymous terms.

For example, jargon lemmatizes react, React.js, React JS and REACTJS to a canonical reactjs.
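The idea can be sketched in a few lines of plain Go. This is an illustrative stand-in, not jargon's implementation: `canonicalize` is a hypothetical helper that normalizes a term (lowercase, strip separators) and looks it up in a table of canonical forms, which is roughly what a lemma lookup does.

```go
package main

import (
	"fmt"
	"strings"
)

// canonicalize is a hypothetical sketch of lemma lookup: normalize
// the term, then map it to a canonical form if one is known.
func canonicalize(term string) string {
	lemmas := map[string]string{
		"react":   "reactjs",
		"reactjs": "reactjs",
	}
	key := strings.ToLower(term)
	key = strings.NewReplacer(".", "", " ", "", "-", "").Replace(key)
	if canon, ok := lemmas[key]; ok {
		return canon
	}
	return term
}

func main() {
	for _, v := range []string{"react", "React.js", "React JS", "REACTJS"} {
		fmt.Println(canonicalize(v)) // each prints "reactjs"
	}
}
```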

Install

Binaries are available on the Releases page.

If you have Homebrew:

brew install clipperhouse/tap/jargon

If you have a Go installation:

go install github.com/clipperhouse/jargon/cmd/jargon

To display usage, simply type:

jargon

Example:

curl -s https://en.wikipedia.org/wiki/Computer_programming | jargon -html -stack -lemmas -lines

CLI usage and details...

In your code

See GoDoc. Example:

package main

import (
	"fmt"
	"log"

	"github.com/clipperhouse/jargon"
	"github.com/clipperhouse/jargon/filters/stackoverflow"
)

func main() {
	text := `Let’s talk about Ruby on Rails and ASPNET MVC.`
	stream := jargon.TokenizeString(text).Filter(stackoverflow.Tags)

	// Scan() returns true while tokens remain; it returns false
	// on error or at the end of the stream.
	for stream.Scan() {
		token := stream.Token()
		// Do stuff with token
		fmt.Print(token)
	}

	if err := stream.Err(); err != nil {
		// Because the source is I/O, errors are possible
		log.Fatal(err)
	}
}

As an iterator, a token stream is “forward-only”; once you consume a token, you can’t go back.

See also the convenience methods String, ToSlice and WriteTo.

Token filters

Canonical terms (lemmas) are looked up in token filters. Several are available:

Stack Overflow technology tags

  • Ruby on Rails → ruby-on-rails
  • ObjC → objective-c

Contractions

  • Couldn’t → Could not

ASCII fold

  • café → cafe

Stem

  • Manager|management|manages → manag

To implement your own, see the Filter type.
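As a conceptual sketch of what a custom filter does, the standalone example below uses simplified stand-ins: the real jargon Filter operates on a token stream, not a slice, and the `Token`, `Filter`, and `lowercase` names here are hypothetical, chosen only to illustrate the transform-the-tokens idea.

```go
package main

import (
	"fmt"
	"strings"
)

// Token and Filter are simplified stand-ins for jargon's types;
// the real Filter works on a streaming token source.
type Token struct{ Value string }

type Filter func([]Token) []Token

// lowercase is a hypothetical custom filter that canonicalizes
// every token to lower case, leaving the token count unchanged.
func lowercase(tokens []Token) []Token {
	out := make([]Token, len(tokens))
	for i, t := range tokens {
		out[i] = Token{Value: strings.ToLower(t.Value)}
	}
	return out
}

func main() {
	tokens := []Token{{"Ruby"}, {" "}, {"Rails"}}
	var f Filter = lowercase
	for _, t := range f(tokens) {
		fmt.Print(t.Value) // prints "ruby rails"
	}
	fmt.Println()
}
```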

Performance

jargon is designed to work in constant memory, regardless of input size. It buffers input and streams tokens.

Execution time is designed to be O(n) in input size. It is I/O-bound. In your code, you control I/O, and its performance implications, through the Reader you pass to Tokenize.

Tokenizer

Jargon includes a tokenizer based partially on Unicode text segmentation. It’s good for many common cases.

It preserves all tokens verbatim, including whitespace and punctuation, so the original text can be reconstructed with fidelity (“round tripped”).

Background

When dealing with technical terms in text – say, a job listing or a resume – it’s easy to use different words for the same thing. This is acute for things like “react” where it’s not obvious what the canonical term is. Is it React or reactjs or react.js?

This presents a problem when searching for such terms. We know the above terms are synonymous but databases don’t.

A further problem is that some n-grams should be understood as a single term. We know that “Objective C” represents one technology, but databases naively see two words.

What’s it for?

  • Recognition of domain terms in text
  • NLP on unstructured data, where we wish to ensure a consistent vocabulary for statistical analysis
  • Search applications, where a query for “Ruby on Rails” is understood as a single entity rather than three unrelated words, and where “React”, “reactjs” and “react.js” are handled synonymously