bnosac / ruimtehol

Licence: other
R package to Embed All the Things! using StarSpace

Projects that are alternatives of or similar to ruimtehol

Magnetloss Pytorch
PyTorch implementation of a deep metric learning technique called "Magnet Loss" from Facebook AI Research (FAIR) in ICLR 2016.
Stars: ✭ 217 (+128.42%)
Mutual labels:  embeddings, classification
support-tickets-classification
This case study shows how to create a model for text analysis and classification and deploy it as a web service in Azure cloud in order to automatically classify support tickets. This project is a proof of concept made by Microsoft (Commercial Software Engineering team) in collaboration with Endava http://endava.com/en
Stars: ✭ 142 (+49.47%)
Mutual labels:  text-mining, classification
Ml Classify Text Js
Machine learning based text classification in JavaScript using n-grams and cosine similarity
Stars: ✭ 38 (-60%)
Mutual labels:  similarity, classification
Graph 2d cnn
Code and data for the paper 'Classifying Graphs as Images with Convolutional Neural Networks' (new title: 'Graph Classification with 2D Convolutional Neural Networks')
Stars: ✭ 67 (-29.47%)
Mutual labels:  embeddings, classification
Awesome Text Classification
Awesome-Text-Classification Projects,Papers,Tutorial .
Stars: ✭ 158 (+66.32%)
Mutual labels:  text-mining, classification
Sytora
A sophisticated smart symptom search engine
Stars: ✭ 111 (+16.84%)
Mutual labels:  embeddings, classification
Sensegram
Making sense embedding out of word embeddings using graph-based word sense induction
Stars: ✭ 209 (+120%)
Mutual labels:  similarity, embeddings
Fastrtext
R wrapper for fastText
Stars: ✭ 103 (+8.42%)
Mutual labels:  embeddings, classification
Applied Text Mining In Python
Repo for Applied Text Mining in Python (coursera) by University of Michigan
Stars: ✭ 59 (-37.89%)
Mutual labels:  text-mining, classification
Rmdl
RMDL: Random Multimodel Deep Learning for Classification
Stars: ✭ 375 (+294.74%)
Mutual labels:  text-mining, classification
Eda nlp
Data augmentation for NLP, presented at EMNLP 2019
Stars: ✭ 902 (+849.47%)
Mutual labels:  embeddings, classification
textlearnR
A simple collection of well working NLP models (Keras, H2O, StarSpace) tuned and benchmarked on a variety of datasets.
Stars: ✭ 16 (-83.16%)
Mutual labels:  text-mining, classification
Nlp Journey
Documents, papers and codes related to Natural Language Processing, including Topic Model, Word Embedding, Named Entity Recognition, Text Classification, Text Generation, Text Similarity, Machine Translation, etc. All codes are implemented in TensorFlow 2.0.
Stars: ✭ 1,290 (+1257.89%)
Mutual labels:  similarity, classification
Artificial Adversary
🗣️ Tool to generate adversarial text examples and test machine learning models against them
Stars: ✭ 348 (+266.32%)
Mutual labels:  text-mining, classification
Fake news detection
Fake News Detection in Python
Stars: ✭ 194 (+104.21%)
Mutual labels:  text-mining, classification
lda2vec
Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec from this paper https://arxiv.org/abs/1605.02019
Stars: ✭ 27 (-71.58%)
Mutual labels:  text-mining, embeddings
textstem
Tools for fast text stemming & lemmatization
Stars: ✭ 36 (-62.11%)
Mutual labels:  text-mining
text2class
Multi-class text categorization using state-of-the-art pre-trained contextualized language models, e.g. BERT
Stars: ✭ 15 (-84.21%)
Mutual labels:  classification
Recommender-Systems-with-Collaborative-Filtering-and-Deep-Learning-Techniques
Implemented User Based and Item based Recommendation System along with state of the art Deep Learning Techniques
Stars: ✭ 41 (-56.84%)
Mutual labels:  embeddings
Text and Audio classification with Bert
Text Classification in Turkish Texts with Bert
Stars: ✭ 34 (-64.21%)
Mutual labels:  embeddings

ruimtehol: R package to Embed All the Things! using StarSpace

This repository contains an R package which wraps the StarSpace C++ library (https://github.com/facebookresearch/StarSpace), allowing the following:

  • Text classification
  • Learning word, sentence or document level embeddings
  • Finding sentence or document similarity
  • Ranking web documents
  • Content-based recommendation (e.g. recommend text/music based on the content)
  • Collaborative filtering based recommendation (e.g. recommend text/music based on interest)
  • Identification of entity relationships

Installation

  • For regular users, install the package from your local CRAN mirror: install.packages("ruimtehol")
  • To install the development version of this package: devtools::install_github("bnosac/ruimtehol", build_vignettes = TRUE)

See the vignette and the documentation of the functions:

vignette("ground-control-to-ruimtehol", package = "ruimtehol")
help(package = "ruimtehol")

Main functionalities

This R package allows you to:

  • Build StarSpace models on your own text
  • Get embeddings of words/ngrams/sentences/documents/labels
  • Get predictions from a model (e.g. classification / ranking)
  • Find nearest neighbours by similarity

The following functions are made available.

Function                   Functionality
starspace                  Low-level interface to build a StarSpace model
starspace_load_model       Load a pre-trained model or a tab-separated file
starspace_save_model       Save a StarSpace model
starspace_embedding        Get embeddings of documents/words/ngrams/labels
starspace_knn              Find k-nearest neighbouring information for new text
starspace_dictionary       Get words/labels part of the model dictionary
predict.textspace          Get predictions from a StarSpace model
as.matrix                  Get words and label embeddings
embedding_similarity       Cosine/dot product similarity between embeddings - top-n most similar text
embed_wordspace            Build a StarSpace model which calculates word/ngram embeddings
embed_sentencespace        Build a StarSpace model which calculates sentence embeddings
embed_articlespace         Build a StarSpace model for embedding articles - sentence-article similarities
embed_tagspace             Build a StarSpace model for multi-label classification
embed_docspace             Build a StarSpace model for content-based recommendation
embed_pagespace            Build a StarSpace model for interest-based recommendation
embed_entityrelationspace  Build a StarSpace model for entity relationship completion

Example

Short example showing word embeddings

library(ruimtehol)
set.seed(123456789)

## Get some training data
download.file("https://s3.amazonaws.com/fair-data/starspace/wikipedia_train250k.tgz", "wikipedia_train250k.tgz")
## readLines() transparently decompresses the gzip-compressed archive
x <- readLines("wikipedia_train250k.tgz", encoding = "UTF-8")
x <- x[-c(1:9)]
x <- x[sample(x = length(x), size = 10000)]
writeLines(text = x, sep = "\n", con = "wikipedia_train10k.txt")
## Train
set.seed(123456789)
model <- starspace(file = "wikipedia_train10k.txt", fileFormat = "labelDoc", dim = 10, trainMode = 3)
model

Object of class textspace
 dimension of the embedding: 10
 training arguments:
      loss: hinge
      margin: 0.05
      similarity: cosine
      epoch: 5
      adagrad: TRUE
      lr: 0.01
      termLr: 1e-09
      norm: 1
      maxNegSamples: 10
      negSearchLimit: 50
      p: 0.5
      shareEmb: TRUE
      ws: 5
      dropoutLHS: 0
      dropoutRHS: 0
      initRandSd: 0.001

embedding <- as.matrix(model)
embedding[c("school", "house"), ]

              [,1]         [,2]        [,3]        [,4]         [,5]        [,6]       [,7]       [,8]         [,9]       [,10]
school 0.008395348  0.002858619 0.004770191 -0.03791502 -0.016193179 0.008368539 -0.0221493 0.01587386 -0.002012054 0.029385706
house  0.005371093 -0.007831781 0.010563998  0.01040361  0.000616577 0.005770847 -0.0097075 0.01678141 -0.004738560 0.009139475

dictionary <- starspace_dictionary(model)
## Save trained model as a binary file or as TSV so that you can inspect the embeddings e.g. with data.table::fread("wikipedia_embeddings.tsv")
starspace_save_model(model, file = "textspace.ruimtehol",      method = "ruimtehol")
starspace_save_model(model, file = "wikipedia_embeddings.tsv", method = "tsv-data.table")
## Load a pre-trained model or pre-trained embeddings
model <- starspace_load_model("textspace.ruimtehol",      method = "ruimtehol")
model <- starspace_load_model("wikipedia_embeddings.tsv", method = "tsv-data.table", trainMode = 3)

## Get the document embedding
starspace_embedding(model, "get the embedding of a full document")

                                          [,1]        [,2]      [,3]       [,4]      [,5]      [,6]       [,7]      [,8]     [,9]     [,10]
get the embedding of a full document 0.1489144 -0.09543591 0.1242385 -0.1080941 0.6971645 0.3131362 -0.3405705 0.3293449 0.231894 -0.281555

The following functions do similar things: they find the word or sentence closest to a provided sentence.

## Which terms from the dictionary are closest to the text
starspace_knn(model, "What does this bunch of text look like", k = 10)

## Which sentence from a vector of sentences is closest to the text
predict(model, newdata = "what does this bunch of text look like", 
        basedoc = c("what does this bunch of text look like", 
                    "word abracadabra was not part of the dictionary", 
                    "give me back my mojo",
                    "cosine distance is what i show"))
                    
## Get the cosine similarity between 2 sentence vectors
embedding_similarity(
  starspace_embedding(model, "what does this bunch of text look like"),
  starspace_embedding(model, "word abracadabra was not part of the dictionary"), 
  type = "cosine")
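
As a point of reference for what a cosine or dot-product similarity with a top-n cut-off computes, here is a minimal base R sketch. It is not the package's implementation: similarity_sketch and top_n_sketch are hypothetical helpers, and the small matrices hold made-up values rather than real StarSpace embeddings.

```r
## Hypothetical helper: similarity between every row of x and every row of y
similarity_sketch <- function(x, y, type = c("cosine", "dot")) {
  type <- match.arg(type)
  if (type == "cosine") {
    ## Scale each row to unit length so that the dot product equals the cosine
    x <- x / sqrt(rowSums(x^2))
    y <- y / sqrt(rowSums(y^2))
  }
  x %*% t(y)
}

## Hypothetical helper: the n highest-scoring rows of y for one query row
top_n_sketch <- function(sim, query, n = 2) {
  scores <- sort(sim[query, ], decreasing = TRUE)
  head(scores, n)
}

## Made-up 3-dimensional "embeddings" (illustrative values only)
query <- matrix(c(1, 0, 1), nrow = 1, dimnames = list("query", NULL))
docs  <- rbind(same      = c(2, 0, 2),
               unrelated = c(0, 1, 0),
               opposite  = c(-1, 0, -1))

sim_cosine <- similarity_sketch(query, docs, type = "cosine")
sim_dot    <- similarity_sketch(query, docs, type = "dot")
top_n_sketch(sim_cosine, "query", n = 2)
```

The cosine variant ignores vector length (the "same" row scores 1 although it is twice as long as the query), whereas the dot product grows with vector length.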

Short example showing classification modelling (tagspace)

Below, StarSpace is used for multi-label classification.

library(ruimtehol)
data("dekamer", package = "ruimtehol")
dekamer$x <- strsplit(dekamer$question, "\\W")
dekamer$x <- sapply(dekamer$x, FUN = function(x) paste(setdiff(x, ""), collapse = " "))
dekamer$x <- tolower(dekamer$x)
dekamer$y <- strsplit(dekamer$question_theme, split = ",")
dekamer$y <- lapply(dekamer$y, FUN=function(x) gsub(" ", "-", x))

set.seed(123456789)
model <- embed_tagspace(x = dekamer$x, y = dekamer$y,
                        dim = 50, 
                        lr = 0.01, epoch = 40, loss = "softmax", adagrad = TRUE, 
                        similarity = "cosine", negSearchLimit = 50,
                        ngrams = 2, minCount = 2)
plot(model)                        
            
text <- c("de nmbs heeft het treinaanbod uitgebreid via onteigening ...",
          "de migranten komen naar europa de asielcentra ...")                   
predict(model, text, k = 3)  
predict(model, "koning filip", k = 10, type = "knn")
predict(model, "koning filip", k = 10, type = "embedding")
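
To see what the preprocessing steps of the tagspace example do before any model is trained, the sketch below applies the same base R calls to two toy questions. The question and themes vectors are hypothetical stand-ins, not values from the dekamer data. The hyphenation of labels presumably keeps each label a single whitespace-free token; that reading is an assumption.

```r
## Toy inputs (hypothetical, not taken from the dekamer data)
question <- c("de trein, de bus!", "Asiel & migratie")
themes   <- c("VERVOER,OPENBARE DIENST", "MIGRATIE")

## Same steps as in the example: split on non-word characters,
## drop empty tokens, glue the tokens back together, lowercase
x <- strsplit(question, "\\W")
x <- sapply(x, FUN = function(x) paste(setdiff(x, ""), collapse = " "))
x <- tolower(x)

## Split the comma-separated themes and replace spaces inside a label by hyphens
y <- strsplit(themes, split = ",")
y <- lapply(y, FUN = function(x) gsub(" ", "-", x))

x
y
```

Note that setdiff() returns unique values, so besides removing empty tokens it also drops repeated words within a question: the second "de" in the first toy question does not survive.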

Notes

  • Why did you call the package ruimtehol? Because that is the translation of StarSpace in West Flemish (WestVlaams).
  • The R wrapper is distributed under the Mozilla Public License 2.0. The package contains a copy of the StarSpace C++ code (all code under src/Starspace), which has a BSD license (available in the file LICENSE.notes) and an accompanying PATENTS file.

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be
