
rth / Vtext

Licence: apache-2.0
Simple NLP in Rust with Python bindings


Projects that are alternatives of or similar to Vtext

text2text
Text2Text: Cross-lingual natural language processing and generation toolkit
Stars: ✭ 188 (+74.07%)
Mutual labels:  information-retrieval, tf-idf
Forte
Forte is a flexible and powerful NLP builder FOR TExt. This is part of the CASL project: http://casl-project.ai/
Stars: ✭ 89 (-17.59%)
Mutual labels:  information-retrieval
Greynir
The greynir.is natural language processing website for Icelandic
Stars: ✭ 47 (-56.48%)
Mutual labels:  tf-idf
Soqal
Arabic Open Domain Question Answering System using Neural Reading Comprehension
Stars: ✭ 72 (-33.33%)
Mutual labels:  tf-idf
Bert Vietnamese Question Answering
Vietnamese question answering system with BERT
Stars: ✭ 57 (-47.22%)
Mutual labels:  information-retrieval
Stringlifier
Stringlifier is an open-source ML library for detecting random strings in raw text. It can be used for sanitising logs, detecting accidentally exposed credentials, and as a pre-processing step in unsupervised ML-based analysis of application text data.
Stars: ✭ 85 (-21.3%)
Mutual labels:  tf-idf
Domain discovery tool
This repository contains the Domain Discovery Tool (DDT) project. DDT is an interactive system that helps users explore and better understand a domain (or topic) as it is represented on the Web.
Stars: ✭ 33 (-69.44%)
Mutual labels:  information-retrieval
Sert
Semantic Entity Retrieval Toolkit
Stars: ✭ 100 (-7.41%)
Mutual labels:  information-retrieval
Textclustering
Stars: ✭ 89 (-17.59%)
Mutual labels:  tf-idf
Vectorsinsearch
Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Searching with Vectors' talk from Haystack 2019 (US). Builds upon my conceptual search and semantic search work from 2015
Stars: ✭ 71 (-34.26%)
Mutual labels:  information-retrieval
Wordtokenizers.jl
High performance tokenizers for natural language processing and other related tasks
Stars: ✭ 63 (-41.67%)
Mutual labels:  information-retrieval
Freediscovery
Web Service for E-Discovery Analytics
Stars: ✭ 59 (-45.37%)
Mutual labels:  information-retrieval
Pyndri
pyndri is a Python interface to the Indri search engine.
Stars: ✭ 85 (-21.3%)
Mutual labels:  information-retrieval
Scdv
Text classification with Sparse Composite Document Vectors.
Stars: ✭ 54 (-50%)
Mutual labels:  information-retrieval
Sypht Java Client
A Java client for the Sypht API
Stars: ✭ 93 (-13.89%)
Mutual labels:  information-retrieval
Predicting Myers Briggs Type Indicator With Recurrent Neural Networks
Stars: ✭ 43 (-60.19%)
Mutual labels:  tf-idf
How To Mine Newsfeed Data And Extract Interactive Insights In Python
A practical guide to topic mining and interactive visualizations
Stars: ✭ 61 (-43.52%)
Mutual labels:  tf-idf
Textrank Keyword Extraction
Keyword extraction using TextRank algorithm after pre-processing the text with lemmatization, filtering unwanted parts-of-speech and other techniques.
Stars: ✭ 79 (-26.85%)
Mutual labels:  information-retrieval
Ds2i
A library of inverted index data structures
Stars: ✭ 104 (-3.7%)
Mutual labels:  information-retrieval
Flexneuart
Flexible classic and NeurAl Retrieval Toolkit
Stars: ✭ 99 (-8.33%)
Mutual labels:  information-retrieval

vtext


NLP in Rust with Python bindings

This package aims to provide a high-performance toolkit for ingesting textual data for machine learning applications.

Features

  • Tokenization: Regexp tokenizer, Unicode segmentation + language specific rules
  • Stemming: Snowball (in Python 15-20x faster than NLTK)
  • Token counting: converting token counts to sparse matrices for use in machine learning libraries. Similar to CountVectorizer and HashingVectorizer in scikit-learn, but with less broad functionality.
  • Levenshtein edit distance; Sørensen-Dice, Jaro, and Jaro-Winkler string similarities
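
To make the string metrics in the last bullet concrete, here is a minimal pure-Python sketch of two of them: Levenshtein edit distance and the Sørensen-Dice coefficient. It is an illustration of the techniques only, not vtext's implementation or API. The function names `levenshtein` and `dice_similarity` are hypothetical, and using character bigrams for the Dice coefficient is an assumption.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def dice_similarity(a: str, b: str) -> float:
    """Sørensen-Dice coefficient over multisets of character bigrams (assumed unit)."""
    bigrams_a = Counter(a[i:i + 2] for i in range(len(a) - 1))
    bigrams_b = Counter(b[i:i + 2] for i in range(len(b) - 1))
    total = sum(bigrams_a.values()) + sum(bigrams_b.values())
    if total == 0:
        # Strings shorter than two characters have no bigrams.
        return 1.0 if a == b else 0.0
    overlap = sum((bigrams_a & bigrams_b).values())
    return 2.0 * overlap / total


print(levenshtein("kitten", "sitting"))   # 3
print(dice_similarity("night", "nacht"))  # 0.25
```

The edit distance keeps only two rows of the dynamic-programming table, so memory grows with the shorter string rather than with the product of both lengths.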

Usage

Usage in Python

vtext requires Python 3.6+ and can be installed with,

pip install vtext

Below is a simple tokenization example,

>>> from vtext.tokenize import VTextTokenizer
>>> VTextTokenizer("en").tokenize("Flights can't depart after 2:00 pm.")
["Flights", "ca", "n't", "depart", "after", "2:00", "pm", "."]

For more details see the project documentation: vtext.io/doc/latest/index.html
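
For comparison, a plain regexp tokenizer treats the same sentence quite differently. The sketch below uses only the standard library and does not call vtext. The pattern is an assumption chosen for illustration: runs of two or more word characters, which is the default `token_pattern` documented for scikit-learn's CountVectorizer.

```python
import re

# Assumed pattern: two or more consecutive word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def regexp_tokenize(text):
    """Return every non-overlapping match of TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(text)


print(regexp_tokenize("Flights can't depart after 2:00 pm."))
# ['Flights', 'can', 'depart', 'after', '00', 'pm']
```

The contraction, the time, and the final period are all split or dropped, which is the kind of case that Unicode segmentation with language-specific rules is meant to handle.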

Usage in Rust

Add the following to Cargo.toml,

[dependencies]
vtext = "0.2.0"

For more details see the Rust documentation: docs.rs/vtext

Benchmarks

Tokenization

The following benchmarks illustrate the tokenization accuracy (F1 score) on UD treebanks,

lang  dataset  regexp  spacy 2.1  vtext
en    EWT      0.812   0.972      0.966
en    GUM      0.881   0.989      0.996
de    GSD      0.896   0.944      0.964
fr    Sequoia  0.844   0.968      0.971

and the English tokenization speed,

                      regexp  spacy 2.1  vtext
Speed (10⁶ tokens/s)  3.1     0.14       2.1
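
The accuracy figures are F1 scores. The exact evaluation procedure behind them is not described here, so the following is only a sketch of one common way to score a tokenizer: map gold and predicted tokens to character spans and count exact span matches. The helper names `token_spans` and `tokenization_f1` are hypothetical, and the "gold" tokens are simply borrowed from the Python usage example for illustration.

```python
def token_spans(text, tokens):
    """Locate each token in text, left to right, as (start, end) offsets."""
    spans = []
    cursor = 0
    for token in tokens:
        start = text.index(token, cursor)  # raises ValueError if a token is missing
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


def tokenization_f1(text, gold_tokens, predicted_tokens):
    """F1 over exact character-span matches between gold and predicted tokens."""
    gold = set(token_spans(text, gold_tokens))
    predicted = set(token_spans(text, predicted_tokens))
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


text = "Flights can't depart after 2:00 pm."
gold = ["Flights", "ca", "n't", "depart", "after", "2:00", "pm", "."]
predicted = text.split()  # naive whitespace tokenizer as the system under test

print(round(tokenization_f1(text, gold, predicted), 3))  # 0.571
```

Here four of the six predicted spans match four of the eight gold spans, giving precision 4/6, recall 4/8, and F1 = 8/14.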

Text vectorization

Below are benchmarks for converting textual data to a sparse document-term matrix using the 20 newsgroups dataset, run on an Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz,

Speed (MB/s)                   scikit-learn 0.20.1  vtext (n_jobs=1)  vtext (n_jobs=4)
CountVectorizer.fit            14                   104               225
CountVectorizer.transform      14                   82                303
CountVectorizer.fit_transform  14                   70                NA
HashingVectorizer.transform    19                   89                309

Note however that these two estimators in vtext currently support only a fraction of scikit-learn's functionality. See benchmarks/README.md for more details.
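
As a rough picture of what these two estimators compute, the sketch below builds a sparse document-term matrix in CSR form (`data`, `indices`, `indptr`) in pure Python. It is not vtext's or scikit-learn's implementation: the function names are hypothetical, tokenization is a lowercase whitespace split for brevity, and CRC32 stands in for the hash function a real hashing vectorizer would use.

```python
from collections import Counter
import zlib


def count_vectorize(docs):
    """Return (vocabulary, data, indices, indptr) describing a CSR count matrix."""
    vocabulary = {}
    data, indices, indptr = [], [], [0]
    for doc in docs:
        counts = Counter(doc.lower().split())
        for token, count in sorted(counts.items()):
            # New tokens get the next free column index.
            column = vocabulary.setdefault(token, len(vocabulary))
            indices.append(column)
            data.append(count)
        indptr.append(len(indices))  # marks the end of this document's row
    return vocabulary, data, indices, indptr


def hashing_vectorize(docs, n_features=2 ** 20):
    """Same layout, but columns come from a hash, so no vocabulary is stored."""
    data, indices, indptr = [], [], [0]
    for doc in docs:
        counts = Counter()
        for token in doc.lower().split():
            counts[zlib.crc32(token.encode("utf-8")) % n_features] += 1
        for column, count in sorted(counts.items()):
            indices.append(column)
            data.append(count)
        indptr.append(len(indices))
    return data, indices, indptr


docs = ["the cat sat", "the cat saw the dog"]
vocabulary, data, indices, indptr = count_vectorize(docs)
print(vocabulary)  # {'cat': 0, 'sat': 1, 'the': 2, 'dog': 3, 'saw': 4}
print(indptr)      # [0, 3, 7]
```

If SciPy is available, the three arrays can be passed to `scipy.sparse.csr_matrix((data, indices, indptr), shape=(len(docs), len(vocabulary)))`. The hashing variant trades possible column collisions for a stateless transform, which is why it has no separate fit step.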

License

vtext is released under the Apache License, Version 2.0.
