WordsCounted

We are all in the gutter, but some of us are looking at the stars.

-- Oscar Wilde

WordsCounted is a Ruby natural language processing library. It lets you implement powerful tokenisation strategies with a very flexible tokeniser class.

Are you using WordsCounted to do something interesting? Please tell me about it.

RubyDoc documentation.

Demo

Visit this website for one example of what you can do with WordsCounted.

Features

  • Out of the box, get the following data from any string, readable file, or URL:
    • Token count and unique token count
    • Token densities, frequencies, and lengths
    • Char count and average chars per token
    • The longest tokens and their lengths
    • The most frequent tokens and their frequencies
  • A flexible way to exclude tokens from the tokeniser. You can pass a string, regexp, symbol, lambda, or an array of any combination of those types for powerful tokenisation strategies.
  • Pass your own regexp rules to the tokeniser if you prefer. The default regexp filters special characters but keeps hyphens and apostrophes. It also plays nicely with diacritics (Unicode characters): Bayrūt is treated as ["Bayrūt"] and not ["Bayr", "ū", "t"], for example.
  • Opens and reads files. Pass in a file path or a URL instead of a string.

Installation

Add this line to your application's Gemfile:

gem 'words_counted'

And then execute:

$ bundle

Or install it yourself as:

$ gem install words_counted

Usage

Pass in a string or a file path, and an optional filter and/or regexp.

counter = WordsCounted.count(
  "We are all in the gutter, but some of us are looking at the stars."
)

# Using a file
counter = WordsCounted.from_file("path/or/url/to/my/file.txt")

.count and .from_file are convenience methods that take an input, tokenise it, and return an instance of WordsCounted::Counter initialized with the resulting tokens. The WordsCounted::Tokeniser and WordsCounted::Counter classes can also be used on their own.
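For example, you can run the two steps yourself (a sketch based on the behaviour described above, where Counter is initialized with the token array):

# Tokenise first, then count
tokens  = WordsCounted::Tokeniser.new("We are all in the gutter").tokenise
counter = WordsCounted::Counter.new(tokens)
counter.token_count #=> 6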

API

WordsCounted

WordsCounted.count(input, options = {})

Tokenises input and initializes a WordsCounted::Counter object with the resulting tokens.

counter = WordsCounted.count("Hello Beirut!")

Accepts two options: exclude and pattern. See Excluding tokens from the tokeniser and Passing in a Custom Regexp respectively.

WordsCounted.from_file(path, options = {})

Reads and tokenises a file, and initializes a WordsCounted::Counter object with the resulting tokens.

counter = WordsCounted.count("hello_beirut.txt")

Accepts the same options as .count.

Tokeniser

The tokeniser allows you to tokenise text in a variety of ways. You can pass in your own rules for tokenisation, and apply a powerful filter with any combination of rules, as long as they can be boiled down to a lambda.

Out of the box the tokeniser includes only alpha chars. Hyphenated tokens and tokens with apostrophes are considered a single token.

#tokenise([pattern: TOKEN_REGEXP, exclude: nil])

tokeniser = WordsCounted::Tokeniser.new("Hello Beirut!").tokenise

# With `exclude`
tokeniser = WordsCounted::Tokeniser.new("Hello Beirut!").tokenise(exclude: "hello")

# With `pattern`
tokeniser = WordsCounted::Tokeniser.new("I <3 Beirut!").tokenise(pattern: /[a-z]/i)

See Excluding tokens from the tokeniser and Passing in a Custom Regexp for more information.

Counter

The WordsCounted::Counter class allows you to collect various statistics from an array of tokens.

#token_count

Returns the token count of a given string.

counter.token_count #=> 15

#token_frequency

Returns a sorted (unstable) two-dimensional array where each element is a token and its frequency. The array is sorted by frequency in descending order.

counter.token_frequency

[
  ["the", 2],
  ["are", 2],
  ["we",  1],
  # ...
  ["all", 1]
]

#most_frequent_tokens

Returns a hash where each key-value pair is a token and its frequency.

counter.most_frequent_tokens

{ "are" => 2, "the" => 2 }

#token_lengths

Returns a sorted (unstable) two-dimensional array where each element contains a token and its length. The array is sorted by length in descending order.

counter.token_lengths

[
  ["looking", 7],
  ["gutter",  6],
  ["stars",   5],
  # ...
  ["in",      2]
]

#longest_tokens

Returns a hash where each key-value pair is a token and its length.

counter.longest_tokens

{ "looking" => 7 }

#token_density([ precision: 2 ])

Returns a sorted (unstable) two-dimensional array where each element contains a token and its density as a float, rounded to two decimal places by default. The array is sorted by density in descending order. Accepts a precision argument, which defaults to 2.

counter.token_density

[
  ["are",     0.13],
  ["the",     0.13],
  ["but",     0.07 ],
  # ...
  ["we",      0.07 ]
]
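If you need more decimal places, pass a larger precision (an illustrative call based on the signature above; densities here are 2/15 for the two most frequent tokens in the example string):

counter.token_density(precision: 4)

[
  ["are", 0.1333],
  ["the", 0.1333],
  # ...
]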

#char_count

Returns the char count of tokens.

counter.char_count #=> 76

#average_chars_per_token([ precision: 2 ])

Returns the average character count per token, rounded to two decimal places by default. Accepts a precision argument, which defaults to 2.

counter.average_chars_per_token #=> 4

#unique_token_count

Returns the number of unique tokens.

counter.unique_token_count #=> 13

Excluding tokens from the tokeniser

You can exclude anything you want from the input by passing the exclude option. The exclude option accepts a variety of filters and is extremely flexible.

  1. A space-delimited string. The filter will normalise the string.
  2. A regular expression.
  3. A lambda.
  4. A symbol that names a predicate method. For example, :ascii_only?.
  5. An array of any combination of the above.

tokeniser =
  WordsCounted::Tokeniser.new(
    "Magnificent! That was magnificent, Trevor."
  )

# Using a string
tokeniser.tokenise(exclude: "was magnificent")
# => ["that", "trevor"]

# Using a regular expression
tokeniser.tokenise(exclude: /Trevor/)
# => ["that", "was", "magnificent"]

# Using a lambda
tokeniser.tokenise(exclude: ->(t) { t.length < 4 })
# => ["magnificent", "trevor"]

# Using a symbol
tokeniser = WordsCounted::Tokeniser.new("Hello! محمد")
tokeniser.tokenise(exclude: :ascii_only?)
# => ["محمد"]

# Using an array
tokeniser = WordsCounted::Tokeniser.new(
  "Hello! اسماءنا هي محمد، كارولينا، سامي، وداني"
)
tokeniser.tokenise(
  exclude: [:ascii_only?, /محمد/, ->(t) { t.length > 6 }, "و"]
)
# => ["هي", "سامي", "وداني"]

Passing in a Custom Regexp

The default regexp accounts for letters, hyphenated tokens, and apostrophes. This means twenty-one is treated as one token. So is Mohamad's.

/[\p{Alpha}\-']+/

You can pass your own criteria as a Ruby regular expression to split your string as desired.

For example, if you wanted to include numbers, you can override the regular expression:

counter = WordsCounted.count("Numbers 1, 2, and 3", pattern: /[\p{Alnum}\-']+/)
counter.tokens
#=> ["Numbers", "1", "2", "and", "3"]

Opening and Reading Files

Use the from_file method to open files. from_file accepts the same options as .count. The file path can be a URL.

counter = WordsCounted.from_file("url/or/path/to/file.text")

Gotchas

A hyphen used in lieu of an em or en dash will form part of the token. This is a consequence of the tokeniser's default pattern, which treats hyphens as part of a token.

counter = WordsCounted.count("How do you do?-you are well, I see.")
counter.token_frequency

[
  ["do",   2],
  ["how",  1],
  ["you",  1],
  ["-you", 1], # WTF, mate!
  ["are",  1],
  # ...
]

In this example -you and you are separate tokens. Also, the tokeniser does not include numbers by default. Remember that you can pass your own regular expression if the default behaviour does not fit your needs.

A note on case sensitivity

The program normalises (downcases) all incoming strings for consistency and to simplify filtering.
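For example, tokens that differ only in case are counted together (illustrative output based on the behaviour described above):

WordsCounted.count("Hello HELLO hello").token_frequency
#=> [["hello", 3]]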

Road Map

  1. Add ability to open URLs.
  2. Add Ngram support.

Ability to read URLs

Something like...

def self.from_url
  # open url and send string here after removing html
end
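A minimal sketch of one possible implementation, assuming Ruby's standard open-uri library and a naive tag-stripping step (a real version would likely use an HTML parser such as Nokogiri):

require "open-uri"

def self.from_url(url, options = {})
  html = URI.open(url).read        # fetch the page
  text = html.gsub(/<[^>]+>/, " ") # crude HTML removal; placeholder only
  count(text, options)             # reuse the existing .count pipeline
end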

About

Originally I wrote this program for a code challenge on Treehouse. You can find the original implementation on Code Review.

Contributors

See contributors. Not listed there is Dave Yarwood.

Contributing

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request