vi3k6i5 / synonym-extractor

Licence: MIT license
Extract synonyms and keywords from sentences using a modified implementation of the Aho-Corasick algorithm.

This project has moved to Flash Text.

synonym-extractor

Synonym Extractor is a Python library that is loosely based on the Aho-Corasick algorithm.

The idea is to extract words that we care about from a given sentence in one pass.

For example, say I have a vocabulary of 10K words and I want to find every word from that set that appears in a sentence. A simple regex match takes a long time because it has to loop over all 10K words.

Hence we use a simpler yet much faster algorithm to get the desired result.
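
To make the cost concrete, here is a minimal sketch of the naive regex approach described above. It is not part of synonym-extractor: `extract_with_regex` and the sample vocabulary are hypothetical names used only for illustration. Each vocabulary word triggers its own scan of the sentence, so the work grows with the size of the vocabulary.

```python
import re

def extract_with_regex(sentence, vocabulary):
    """Naive baseline (illustrative only): one regex search per vocabulary word."""
    found = []
    for word in vocabulary:
        # re.escape guards against regex metacharacters in the word;
        # \b keeps 'NY' from matching inside a longer word such as 'NYC'.
        pattern = r'\b' + re.escape(word) + r'\b'
        if re.search(pattern, sentence):
            found.append(word)
    return found

vocabulary = ['NY', 'new-york', 'SF']
print(extract_with_regex('I love SF and NY. new-york is the best.', vocabulary))
# ['NY', 'new-york', 'SF']
```

With 10K words, this loop runs 10K searches for every sentence, which is the cost the library is meant to avoid.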

Installation

pip install synonym-extractor

Usage

# import module
from synonym.extractor import SynonymExtractor

# Create an object of SynonymExtractor
synonym_extractor = SynonymExtractor()

# add synonyms
synonym_names = ['NY', 'new-york', 'SF']
clean_names = ['new york', 'new york', 'san francisco']

for synonym_name, clean_name in zip(synonym_names, clean_names):
    synonym_extractor.add_to_synonym(synonym_name, clean_name)

# extract the clean name for every synonym found in the sentence
synonyms_found = synonym_extractor.get_synonyms_from_sentence('I love SF and NY. new-york is the best.')

print(synonyms_found)
# ['san francisco', 'new york', 'new york']

Algorithm

synonym-extractor is loosely based on the Aho-Corasick algorithm.
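
As a rough illustration of the general idea, the sketch below stores the synonyms in a character trie and walks the sentence against it. This is an assumption about the approach, not the library's actual implementation, and it is a simplified trie walk rather than full Aho-Corasick (there are no failure links). The names `build_trie`, `is_boundary`, `extract` and the `'_end'` marker are hypothetical.

```python
def build_trie(synonyms):
    """Build a character trie; the '_end' key stores the clean name at a terminal node."""
    root = {}
    for synonym, clean_name in synonyms.items():
        node = root
        for char in synonym:
            node = node.setdefault(char, {})
        node['_end'] = clean_name
    return root

def is_boundary(sentence, index):
    """True if index is outside the sentence or points at a non-word character."""
    if index < 0 or index >= len(sentence):
        return True
    char = sentence[index]
    return not (char.isalnum() or char == '_')

def extract(sentence, trie):
    """Return the clean names of all synonyms found, in order of appearance."""
    found = []
    i = 0
    while i < len(sentence):
        match_end, match_name = None, None
        # Only start a match at the beginning of a word.
        if is_boundary(sentence, i - 1):
            node = trie
            j = i
            while j < len(sentence) and sentence[j] in node:
                node = node[sentence[j]]
                j += 1
                # Remember the longest synonym that ends on a word boundary.
                if '_end' in node and is_boundary(sentence, j):
                    match_end, match_name = j, node['_end']
        if match_end is not None:
            found.append(match_name)
            i = match_end  # skip past the matched text
        else:
            i += 1
    return found

trie = build_trie({'NY': 'new york', 'new-york': 'new york', 'SF': 'san francisco'})
print(extract('I love SF and NY. new-york is the best.', trie))
# ['san francisco', 'new york', 'new york']
```

Because each starting position follows at most one path through the trie, the cost depends on the sentence length and the length of the synonyms, not on how many synonyms are in the vocabulary.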

Documentation

Documentation can be found at Read the Docs.

Why

Say you have a corpus where similar words appear frequently.

eg: Last weekend I was in NY.
I am traveling to new york next weekend.

If you train a word2vec model on this corpus, or do any other sort of NLP, it will treat NY and new york as two different words.

Instead if you create a synonym dictionary like:

eg: NY=>new york
new york=>new york

Then you can extract NY and new york as the same text.

To do the same with regex it will take a lot of time:

Docs count     # Synonyms    Regex       synonym-extractor
1.5 million    2K            16 hours    NA
2.5 million    10K           15 days     15 mins

The idea for this library came from the following StackOverflow question.

License

The project is licensed under the MIT license.