
dustalov / greeb

Licence: MIT
Greeb is a simple Unicode-aware regexp-based tokenizer.

Programming Languages

ruby
36898 projects - #4 most used programming language

Projects that are alternatives of or similar to greeb

Tokenizer
Fast and customizable text tokenization library with BPE and SentencePiece support
Stars: ✭ 132 (+725%)
Mutual labels:  unicode, tokenizer
Stringi
THE String Processing Package for R (with ICU)
Stars: ✭ 204 (+1175%)
Mutual labels:  unicode
Guide To Swift Strings Sample Code
Xcode Playground Sample Code for the Flight School Guide to Swift Strings
Stars: ✭ 136 (+750%)
Mutual labels:  unicode
Stringz
💯 Super fast unicode-aware string manipulation Javascript library
Stars: ✭ 181 (+1031.25%)
Mutual labels:  unicode
Transliterate
Convert Unicode characters to Latin characters using transliteration
Stars: ✭ 152 (+850%)
Mutual labels:  unicode
Diagon
Interactive ASCII art diagram generators. 🌟
Stars: ✭ 189 (+1081.25%)
Mutual labels:  unicode
Punic
PHP translation and localization made easy!
Stars: ✭ 133 (+731.25%)
Mutual labels:  unicode
V Emoji Picker
🌟 A Lightweight and customizable package of Emoji Picker in Vue using emojis natives (unicode).
Stars: ✭ 231 (+1343.75%)
Mutual labels:  unicode
Regexpu
A source code transpiler that enables the use of ES2015 Unicode regular expressions in ES5.
Stars: ✭ 201 (+1156.25%)
Mutual labels:  unicode
Voca rs
Voca_rs is the ultimate Rust string library inspired by Voca.js, string.py and Inflector, implemented as independent functions and on Foreign Types (String and str).
Stars: ✭ 167 (+943.75%)
Mutual labels:  unicode
Textwrap
An efficient and powerful Rust library for word wrapping text.
Stars: ✭ 164 (+925%)
Mutual labels:  unicode
Harfbuzz
HarfBuzz text shaping engine
Stars: ✭ 2,206 (+13687.5%)
Mutual labels:  unicode
Encoding rs
A Gecko-oriented implementation of the Encoding Standard in Rust
Stars: ✭ 196 (+1125%)
Mutual labels:  unicode
Idna
Internationalized Domain Names for Python (IDNA 2008 and UTS #46)
Stars: ✭ 138 (+762.5%)
Mutual labels:  unicode
Cowsay Files
A collection of additional/alternative cowsay files.
Stars: ✭ 216 (+1250%)
Mutual labels:  unicode
Rust Unic
UNIC: Unicode and Internationalization Crates for Rust
Stars: ✭ 189 (+1081.25%)
Mutual labels:  unicode
Rabbit
Another Zawgyi <=> Unicode Converter
Stars: ✭ 157 (+881.25%)
Mutual labels:  unicode
Text
An efficient packed, immutable Unicode text type for Haskell, with a powerful loop fusion optimization framework.
Stars: ✭ 248 (+1450%)
Mutual labels:  unicode
Twitter Text
Twitter Text Libraries. This code is used at Twitter to tokenize and parse text to meet the expectations for what can be used on the platform.
Stars: ✭ 2,627 (+16318.75%)
Mutual labels:  unicode
Contour
Modern C++ Terminal Emulator
Stars: ✭ 191 (+1093.75%)
Mutual labels:  unicode

Greeb

Greeb [grʲip] is a simple Unicode-aware text segmenter based on regular expressions. The API documentation is available on RubyDoc.info.


Installation

Add this line to your application's Gemfile:

gem 'greeb'

And then execute:

$ bundle

Or install it yourself as:

$ gem install greeb

Usage

Greeb handles such essential text processing problems as tokenization and segmentation. There are two ways to use it: (1) as a command-line application, or (2) as a Ruby library.

Command-Line Interface

The greeb application reads the input text from STDIN and writes one token per line to STDOUT.

% echo 'Hello http://nlpub.ru guys, how are you?' | greeb
Hello
http://nlpub.ru
guys
,
how
are
you
?

Tokenization API

Greeb has a very convenient API that makes you happy.
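The library examples below assume that the gem has already been loaded; in a plain Ruby script or IRB session this is typically done with a require call matching the gem name:

require 'greeb'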

pp Greeb::Tokenizer.tokenize('Hello!')
=begin
[#<struct Greeb::Span from=0, to=5, type=:letter>,
 #<struct Greeb::Span from=5, to=6, type=:punct>]
=end
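
Each span stores character offsets into the original string, so the matched substrings can be recovered by plain slicing. A minimal sketch based on the offsets shown above:

text = 'Hello!'
pp Greeb::Tokenizer.tokenize(text).map { |t| text[t.from...t.to] }
=begin
["Hello", "!"]
=end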

It is also possible to process texts that are much more complex than the one above.

text = <<-EOF
Hello! I am 18! My favourite number is 133.7...

What about you?
EOF

pp Greeb::Tokenizer.tokenize(text)
=begin
[#<struct Greeb::Span from=0, to=5, type=:letter>,
 #<struct Greeb::Span from=5, to=6, type=:punct>,
 #<struct Greeb::Span from=6, to=7, type=:space>,
 #<struct Greeb::Span from=7, to=8, type=:letter>,
 #<struct Greeb::Span from=8, to=9, type=:space>,
 #<struct Greeb::Span from=9, to=11, type=:letter>,
 #<struct Greeb::Span from=11, to=12, type=:space>,
 #<struct Greeb::Span from=12, to=14, type=:integer>,
 #<struct Greeb::Span from=14, to=15, type=:punct>,
 #<struct Greeb::Span from=15, to=16, type=:space>,
 #<struct Greeb::Span from=16, to=18, type=:letter>,
 #<struct Greeb::Span from=18, to=19, type=:space>,
 #<struct Greeb::Span from=19, to=28, type=:letter>,
 #<struct Greeb::Span from=28, to=29, type=:space>,
 #<struct Greeb::Span from=29, to=35, type=:letter>,
 #<struct Greeb::Span from=35, to=36, type=:space>,
 #<struct Greeb::Span from=36, to=38, type=:letter>,
 #<struct Greeb::Span from=38, to=39, type=:space>,
 #<struct Greeb::Span from=39, to=44, type=:float>,
 #<struct Greeb::Span from=44, to=47, type=:punct>,
 #<struct Greeb::Span from=47, to=49, type=:break>,
 #<struct Greeb::Span from=49, to=53, type=:letter>,
 #<struct Greeb::Span from=53, to=54, type=:space>,
 #<struct Greeb::Span from=54, to=59, type=:letter>,
 #<struct Greeb::Span from=59, to=60, type=:space>,
 #<struct Greeb::Span from=60, to=63, type=:letter>,
 #<struct Greeb::Span from=63, to=64, type=:punct>,
 #<struct Greeb::Span from=64, to=65, type=:break>]
=end
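
Whitespace and line breaks are reported as ordinary spans, so filtering by type is a simple way to keep only the content tokens. A sketch over the text tokenized above:

spans = Greeb::Tokenizer.tokenize(text)
words = spans.reject { |span| [:space, :break].include?(span.type) }
pp words.map { |span| text[span.from...span.to] }
=begin
["Hello", "!", "I", "am", "18", "!", "My", "favourite", "number", "is",
 "133.7", "...", "What", "about", "you", "?"]
=end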

Segmentation API

The analyzer can also perform sentence detection.

text = 'Hello! How are you?'
tokens = Greeb::Tokenizer.tokenize(text)
pp Greeb::Segmentator.new(tokens).sentences
=begin
[#<struct Greeb::Span from=0, to=6, type=:sentence>,
 #<struct Greeb::Span from=7, to=19, type=:sentence>]
=end
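
Sentence spans use the same offset convention, so the detected sentences can be turned back into strings by slicing the original text. A minimal sketch:

pp Greeb::Segmentator.new(tokens).sentences.map { |s| text[s.from...s.to] }
=begin
["Hello!", "How are you?"]
=end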

Having obtained the sentence boundaries, it is possible to extract the tokens covered by each sentence.

text = 'Hello! How are you?'
tokens = Greeb::Tokenizer.tokenize(text)
segmentator = Greeb::Segmentator.new(tokens)
pp segmentator.extract(segmentator.sentences)
=begin
{#<struct Greeb::Span from=0, to=6, type=:sentence>=>
  [#<struct Greeb::Span from=0, to=5, type=:letter>,
   #<struct Greeb::Span from=5, to=6, type=:punct>],
 #<struct Greeb::Span from=7, to=19, type=:sentence>=>
  [#<struct Greeb::Span from=7, to=10, type=:letter>,
   #<struct Greeb::Span from=10, to=11, type=:space>,
   #<struct Greeb::Span from=11, to=14, type=:letter>,
   #<struct Greeb::Span from=14, to=15, type=:space>,
   #<struct Greeb::Span from=15, to=18, type=:letter>,
   #<struct Greeb::Span from=18, to=19, type=:punct>]}
=end
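
Since extract returns a hash that maps each sentence span to its token spans, it is easy to print every sentence together with its tokens. A sketch reusing the slicing approach from above:

segmentator.extract(segmentator.sentences).each do |sentence, sentence_tokens|
  puts text[sentence.from...sentence.to]
  pp sentence_tokens.map { |t| text[t.from...t.to] }
end
=begin
Hello!
["Hello", "!"]
How are you?
["How", " ", "are", " ", "you", "?"]
=end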

Parsing API

Texts often include special entries such as URLs and e-mail addresses. Greeb can assist you in extracting them.

Extraction of URLs and e-mails

text = 'My website is http://nlpub.ru and e-mail is [email protected].'

pp Greeb::Parser.urls(text).map { |e| [e, e.slice(text)] }
=begin
[[#<struct Greeb::Span from=14, to=29, type=:url>, "http://nlpub.ru"]]
=end

pp Greeb::Parser.emails(text).map { |e| [e, e.slice(text)] }
=begin
[[#<struct Greeb::Span from=44, to=63, type=:email>, "[email protected]"]]
=end

Please do not use Greeb for the development of spam lists. Spam sucks.

Extraction of abbreviations

text = 'Hello, G.L.H.F. everyone!'

pp Greeb::Parser.abbrevs(text).map { |e| [e, e.slice(text)] }
=begin
[[#<struct Greeb::Span from=7, to=15, type=:abbrev>, "G.L.H.F."]]
=end

The algorithm is not highly accurate, but it is still useful in many practical situations.

Extraction of time stamps

text = 'Our time is running out: 13:37 or 14:89.'

pp Greeb::Parser.time(text).map { |e| [e, e.slice(text)] }
=begin
[[#<struct Greeb::Span from=25, to=30, type=:time>, "13:37"]]
=end
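
The individual parser methods can be combined to collect all supported entities from one string. The following sketch uses a made-up input and the four extractors shown above; the exact matches depend on the individual extractors:

text = 'Meet me at 13:37 and visit http://nlpub.ru for details.'

entities = [Greeb::Parser.urls(text),
            Greeb::Parser.emails(text),
            Greeb::Parser.abbrevs(text),
            Greeb::Parser.time(text)].flatten.sort_by(&:from)

pp entities.map { |e| [e.type, e.slice(text)] }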

Spans

Greeb operates on spans: tuples of (from, to, type), where from is the start offset of the span, to is its end offset, and type is the kind of content it covers.

There are several span types at the tokenization stage: :letter, :float, :integer, :separ, :punct (for punctuation), :spunct (for in-sentence punctuation), :space, and :break.
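
Because every span carries its type, the standard Enumerable methods are enough for simple statistics over a tokenized text. A minimal sketch that counts spans per type, based on the spans shown in the earlier examples:

spans = Greeb::Tokenizer.tokenize('Hello! How are you?')
pp spans.group_by(&:type).transform_values(&:size)
=begin
{:letter=>4, :punct=>2, :space=>3}
=end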

Contributing

  1. Fork it;
  2. Create your feature branch (git checkout -b my-new-feature);
  3. Commit your changes (git commit -am 'Added some feature');
  4. Push to the branch (git push origin my-new-feature);
  5. Create a new Pull Request.

Copyright

Copyright (c) 2010-2019 Dmitry Ustalov. See LICENSE for details.
