Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language model. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Format for Linguistic Annotation).

Stars: ✭ 426 (+439.24%)

Mutual labels: nlp-library

Spacy

💫 Industrial-strength Natural Language Processing (NLP) in Python

Stars: ✭ 21,978 (+27720.25%)

Mutual labels: nlp-library

Underthesea

Underthesea - Vietnamese NLP Toolkit

Stars: ✭ 823 (+941.77%)

Mutual labels: nlp-library

Contextualized Topic Models

A python package to run contextualized topic modeling. CTMs combine BERT with topic models to get coherent topics. Also supports multilingual tasks. Cross-lingual Zero-shot model published at EACL 2021.

Stars: ✭ 318 (+302.53%)

Mutual labels: nlp-library

Simplenetnlp

.NET NLP library

Stars: ✭ 38 (-51.9%)

Mutual labels: nlp-library

Quick Nlp

Pytorch NLP library based on FastAI

Stars: ✭ 279 (+253.16%)

Mutual labels: nlp-library

Janome

Japanese morphological analysis engine written in pure Python

Stars: ✭ 630 (+697.47%)

Mutual labels: nlp-library

Farm

🏡 Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

Stars: ✭ 1,140 (+1343.04%)

Mutual labels: nlp-library

Tika Python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Stars: ✭ 997 (+1162.03%)

Mutual labels: nlp-library

Atr4s

Toolkit with state-of-the-art Automatic Terms Recognition methods in Scala

Stars: ✭ 23 (-70.89%)

Mutual labels: nlp-library

View All Similar Projects ➔

simstring

A Python implementation of the SimString, a simple and efficient algorithm for approximate string matching.

Features

With this library, you can extract strings/texts which has certain similarity from large amount of strings/texts. It will help you when you develop applications related to language processing.

This library supports variety of similarity functions such as Cossine similarity, Jaccard similarity, and supports Word N-gram and Character N-gram as features. You can also implement your own feature extractor easily.

SimString has the following features:

Fast algorithm for approximate string retrieval.
100% exact retrieval. Although some algorithms allow misses (false positives) for faster query response, SimString is guaranteed to achieve 100% correct retrieval with fast query response.
Unicode support.
Extensibility. You can implement your own feature extractor easily.
Japanese support. MeCabを使った形態素Nグラムをサポートしています。

Please see this paper for more details.

Install

pip install simstring-pure

Usage

from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.measure.cosine import CosineMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher

db = DictDatabase(CharacterNgramFeatureExtractor(2))
db.add('foo')
db.add('bar')
db.add('fooo')

searcher = Searcher(db, CosineMeasure())
results = searcher.search('foo', 0.8)
print(results)
# => ['foo', 'fooo']

If you want to use other feature, measure, and database, simply replace these classes. You can replace these classes easily by your own classes if you want.

from simstring.feature_extractor.word_ngram import WordNgramFeatureExtractor
from simstring.measure.jaccard import JaccardMeasure
from simstring.database.mongo import MongoDatabase
from simstring.searcher import Searcher

db = MongoDatabase(WordNgramFeatureExtractor(2))
db.add('You are so cool.')

searcher = Searcher(db, JaccardMeasure())
results = searcher.search('You are cool.', 0.8)
print(results)

Supported String Similarity Measures

Cosine
Dice
Jaccard

Run Tests

docker-compose run main bash -c 'source activate simstring && python -m unittest discover tests'

Benchmark

About 1ms to search strings from 5797 strings(company names).
About 14ms to search strings from 235544 strings(unabridged dictionary).

search from `dev/data/company_names.txt`

$ python dev/benchmark.py
benchmark for using dict as database
## benchmarker:         release 4.0.1 (for python)
## python version:      3.7.0
## python compiler:     GCC 7.2.0
## python platform:     Linux-4.9.87-linuxkit-aufs-x86_64-with-debian-9.4
## python executable:   /opt/conda/envs/simstring/bin/python
## cpu model:           Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz  # 3300.000 MHz
## parameters:          loop=1, cycle=1, extra=0

##                        real    (total    = user    + sys)
initialize database(5797 lines)    0.1227    0.1200    0.1200    0.0000
search text(5797 times)    6.9719    6.9400    6.8900    0.0500

## Ranking                real
initialize database(5797 lines)    0.1227  (100.0) ********************
search text(5797 times)    6.9719  (  1.8)

## Matrix                 real    [01]    [02]
[01] initialize database(5797 lines)    0.1227   100.0  5680.9
[02] search text(5797 times)    6.9719     1.8   100.0

benchmark for using Mongo as database
## benchmarker:         release 4.0.1 (for python)
## python version:      3.7.0
## python compiler:     GCC 7.2.0
## python platform:     Linux-4.9.87-linuxkit-aufs-x86_64-with-debian-9.4
## python executable:   /opt/conda/envs/simstring/bin/python
## cpu model:           Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz  # 3300.000 MHz
## parameters:          loop=1, cycle=1, extra=0

##                        real    (total    = user    + sys)
initialize database(5797 lines)    4.5762    2.4900    1.9200    0.5700
search text(5797 times)  177.8401   60.9100   47.2500   13.6600

## Ranking                real
initialize database(5797 lines)    4.5762  (100.0) ********************
search text(5797 times)  177.8401  (  2.6) *

## Matrix                 real    [01]    [02]
[01] initialize database(5797 lines)    4.5762   100.0  3886.2
[02] search text(5797 times)  177.8401     2.6   100.0

search from `dev/data/unabridged_dictionary.txt`

$ python dev/benchmark.py
benchmark for using dict as database
## benchmarker:         release 4.0.1 (for python)
## python version:      3.7.0
## python compiler:     GCC 7.2.0
## python platform:     Linux-4.9.87-linuxkit-aufs-x86_64-with-debian-9.4
## python executable:   /opt/conda/envs/simstring/bin/python
## cpu model:           Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz  # 3300.000 MHz
## parameters:          loop=1, cycle=1, extra=0

##                        real    (total    = user    + sys)
initialize database(235544 lines)    2.2576    2.2300    2.1200    0.1100
search text(10000 times)  141.0302  140.6400  139.9600    0.6800

## Ranking                real
initialize database(235544 lines)    2.2576  (100.0) ********************
search text(10000 times)  141.0302  (  1.6)

## Matrix                 real    [01]    [02]
[01] initialize database(235544 lines)    2.2576   100.0  6246.8
[02] search text(10000 times)  141.0302     1.6   100.0

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].

Stars: ✭ 79

Visit Git Page 🔗Visit User Page 🔗Visit Issues Page (4) 🔗

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

nullnull / Simstring

Programming Languages

Labels

Projects that are alternatives of or similar to Simstring

simstring

Features

Install

Usage

Supported String Similarity Measures

Run Tests

Benchmark

search from dev/data/company_names.txt

search from dev/data/unabridged_dictionary.txt

search from `dev/data/company_names.txt`

search from `dev/data/unabridged_dictionary.txt`