wrathematics / Ngram

Licence: other
Fast n-Gram Tokenization

Programming Languages

C
50402 projects - #5 most used programming language
R
7636 projects

Projects that are alternatives of or similar to Ngram

lda2vec
Mixing Dirichlet topic models and word embeddings to make lda2vec, from the paper https://arxiv.org/abs/1605.02019
Stars: ✭ 27 (-50.91%)
Mutual labels:  text-mining, text
converse
Conversational text Analysis using various NLP techniques
Stars: ✭ 147 (+167.27%)
Mutual labels:  text-mining, text
Orange3 Text
🍊 📄 Text Mining add-on for Orange3
Stars: ✭ 83 (+50.91%)
Mutual labels:  text-mining, text
Artificial Adversary
🗣️ Tool to generate adversarial text examples and test machine learning models against them
Stars: ✭ 348 (+532.73%)
Mutual labels:  text-mining, text
Art
🎨 ASCII art library for Python
Stars: ✭ 1,026 (+1765.45%)
Mutual labels:  text
Language Modelling
Generating Text using Deep Learning in Python - LSTM, RNN, Keras
Stars: ✭ 38 (-30.91%)
Mutual labels:  text
Nlp Experiments In Pytorch
PyTorch repository for text categorization and NER experiments in Turkish and English.
Stars: ✭ 35 (-36.36%)
Mutual labels:  text
Femto
A toy text editor with no dependencies written in Ruby
Stars: ✭ 34 (-38.18%)
Mutual labels:  text
Calyx
A Ruby library for generating text with recursive template grammars.
Stars: ✭ 51 (-7.27%)
Mutual labels:  text
Sketch Textbox Fit Content
Set the height of a selected text layer, or of all text layers in a selected group, to its content's height.
Stars: ✭ 49 (-10.91%)
Mutual labels:  text
Randomdatagenerator
This is a configurable generator to create random data like Lorem Ipsum text, words, text patterns, first/last names, MAC addresses, IP addresses, GUIDs, and DateTime values.
Stars: ✭ 45 (-18.18%)
Mutual labels:  text
Header
Header Tool for Editor.js 2.0
Stars: ✭ 39 (-29.09%)
Mutual labels:  text
Text Split
Text wrapping for type animations.
Stars: ✭ 46 (-16.36%)
Mutual labels:  text
Gsoc2018 3gm
💫 Automated codification of Greek Legislation with NLP
Stars: ✭ 36 (-34.55%)
Mutual labels:  text-mining
Spark Nkp
Natural Korean Processor for Apache Spark
Stars: ✭ 50 (-9.09%)
Mutual labels:  text-mining
Tidytext
Text mining using tidy tools ✨📄✨
Stars: ✭ 975 (+1672.73%)
Mutual labels:  text-mining
Urlify
A simple macOS app to create valid file and URL names from clipboard text.
Stars: ✭ 44 (-20%)
Mutual labels:  text
Insert Text At Cursor
Fast crossbrowser insertion of text at cursor position in a textarea / input
Stars: ✭ 49 (-10.91%)
Mutual labels:  text
Tadw
An implementation of "Network Representation Learning with Rich Text Information" (IJCAI '15).
Stars: ✭ 43 (-21.82%)
Mutual labels:  text-mining
Gpt2 Telegram Chatbot
GPT-2 Telegram Chat bot
Stars: ✭ 41 (-25.45%)
Mutual labels:  text

ngram

  • Version: 3.1.0
  • Author: Drew Schmidt and Christian Heckendorf

ngram is an R package for constructing n-grams ("tokenizing"), as well as for generating new text based on the n-gram structure of a given text input ("babbling"). The package can be used for serious analysis, or for creating "bots" that say amusing things. See the Package Details section below for more information.

The package is designed to be extremely fast at tokenizing, summarizing, and babbling tokenized corpora. Because of the architectural design, we are also able to handle very large volumes of text, with performance scaling very nicely. Benchmarks and example usage can be found in the package vignette.
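As a rough illustration of the tokenization speed, here is a minimal timing sketch (the corpus construction uses only base R; timings will vary by machine, and the package vignette contains proper benchmarks):

library(ngram)

# Build a throwaway corpus of one million single-letter "words".
words <- sample(letters, 1e6, replace=TRUE)
txt <- paste(words, collapse=" ")

# Time the tokenization; this should complete very quickly.
system.time(ng <- ngram(txt, n=3))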

Package Details

The original purpose of the package was to combine the book "Modern Applied Statistics in S" with the collected works of H. P. Lovecraft and generate amusing nonsense. This resulted in the post Modern Applied Statistics in R'lyeh. I had originally tried several other available R packages for this, but they took hours on a subset of the full combined corpus just to preprocess the data, and into a somewhat inconvenient format at that. The ngram package, by contrast, can do the preprocessing into the desired format in well under a second (with about half of the preprocessing time spent on copying data for R coherency).

The package is mostly C, with the object returned to R being an external pointer. In fact, the underlying C code can be compiled as a standalone library. There is some minimal support for exporting the data to proper R data structures, but it is incomplete at this time.
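As a small sketch of that C/R boundary, one of the export routines can be used to copy the tokenized data back into an ordinary R vector (get.ngrams() is one of the "getter" functions discussed in the Example Usage section below):

library(ngram)

ng <- ngram("a b a c a b b", n=2)

# The object itself wraps an external pointer to the C-side data;
# get.ngrams() copies the n-grams back out into an R character vector.
get.ngrams(ng)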

For more information, see the package vignette.

Installation

You can install the stable version from CRAN using the usual install.packages():

install.packages("ngram")

Development Version

The development version is maintained on GitHub, and can easily be installed with any of the packages that offer installation from GitHub:

### Pick your preference
devtools::install_github("wrathematics/ngram")
ghit::install_github("wrathematics/ngram")
remotes::install_github("wrathematics/ngram")

Example Usage

Here we present a few simple examples on how to use the ngram package. See the package vignette for more detailed information on package usage.

Tokenization, Summarizing, and Babbling

Let's take the sequence

x <- "a b a c a b b"

Eagle-eyed readers will recognize this as the blood code from Mortal Kombat, but you can pretend it's something boring, like an amino acid sequence. We can form the n-gram structure of this sequence with the ngram function:

library(ngram)

ng <- ngram(x, n=3)

There are various ways of printing the object.

ng
# [1] "An ngram object with 5 3-grams"

print(ng, output="truncated")
# a b a 
# c {1} | 
# 
# a c a 
# b {1} | 
# 
# b a c 
# a {1} | 
# 
# a b b 
# NULL {1} | 
# 
# c a b 
# b {1} | 

With output="truncated", only the first 5 n-grams are shown (here there are only 5 in total). To see all of them in the case that there are more than 5, set output="full".
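For example (for this small object the full and truncated views coincide):

print(ng, output="full")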

There are several "getter" functions, but they are incomplete (see the Notes section below). Perhaps the most useful of them generates a "phrase table": a list of n-grams by their frequency and proportion in the input text. Since all five 3-grams above occur exactly once, we use a bigram (n=2) tokenization of the same sequence here instead:

ng2 <- ngram(x, n=2)
get.phrasetable(ng2)
#   ngrams freq      prop
# 1    a b    2 0.3333333
# 2    b a    1 0.1666667
# 3    c a    1 0.1666667
# 4    a c    1 0.1666667
# 5    b b    1 0.1666667

Finally, we can use the glory of Markov Chains to babble new sequences:

babble(ng=ng, genlen=12)
# [1] "a b b c a b b a b a c a "

For reproducibility, use the seed argument:

babble(ng=ng, genlen=12, seed=1234)
# [1] "a b a c a b b a b b a b "

Note, however, that the seed may not guarantee the same results across machines. Currently only Solaris is known to produce different values from the mainstream platforms (Windows, Mac, Linux, FreeBSD), but others could potentially differ as well.

Weka-Like Tokenization

There is also a tokenizer that behaves identically to the one in the RWeka package (except that the ngram version is significantly faster!). Using the same sequence x as above:

ngram::ngram_asweka(x, min=2, max=3)
##  [1] "a b a" "b a c" "a c a" "c a b" "a b b" "a b"   "b a"   "a c"   "c a"  
## [10] "a b"   "b b"
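For comparison, the equivalent RWeka call would look something like the following (a sketch, assuming the RWeka package and its Java dependency are installed):

library(RWeka)

# The same min/max n-gram range, tokenized by Weka itself.
NGramTokenizer(x, Weka_control(min=2, max=3))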