Starter code to solve real-world text data problems. Includes: Gensim Word2Vec, phrase embeddings, text classification with logistic regression, word count with PySpark, simple text preprocessing, pre-trained embeddings and more.

Stars: ✭ 790 (+3661.9%)

Mutual labels: tf-idf

devsearch

A web search engine built with Python which uses TF-IDF and PageRank to sort search results.

Stars: ✭ 52 (+147.62%)

Mutual labels: tf-idf

Textclustering

Stars: ✭ 89 (+323.81%)

Mutual labels: tf-idf

text-classification-baseline

Pipeline for quickly building TF-IDF + LogReg text classification baselines.

Stars: ✭ 55 (+161.9%)

Mutual labels: tf-idf

2018 Machinelearning Lectures Esa

Machine Learning Lectures at the European Space Agency (ESA) in 2018

Stars: ✭ 280 (+1233.33%)

Mutual labels: tf-idf

clusterix

Visual exploration of clustered data.

Stars: ✭ 44 (+109.52%)

Mutual labels: tf-idf

Vntk

Vietnamese NLP Toolkit for Node

Stars: ✭ 170 (+709.52%)

Mutual labels: tf-idf

ResumeRise

An NLP tool which classifies and summarizes resumes

Stars: ✭ 29 (+38.1%)

Mutual labels: tf-idf

iresearch

IResearch is a cross-platform, high-performance, document-oriented search engine library written entirely in C++, with a focus on pluggable ranking/similarity models.

Stars: ✭ 121 (+476.19%)

Mutual labels: tf-idf

keras-knn

Code for the blog post Nearest Neighbors with Keras and CoreML

Stars: ✭ 25 (+19.05%)

Mutual labels: cosine-similarity

Greynir

The greynir.is natural language processing website for Icelandic

Stars: ✭ 47 (+123.81%)

Mutual labels: tf-idf

set-sketch-paper

SetSketch: Filling the Gap between MinHash and HyperLogLog

Stars: ✭ 23 (+9.52%)

Mutual labels: cosine-similarity

weibo-summary

Chinese Microblog (Weibo) Automatic Summary System

Stars: ✭ 28 (+33.33%)

Mutual labels: tf-idf

Naive-Resume-Matching

Text similarity applied to resumes: compares resumes with job descriptions and produces a score to rank them, similar to an ATS.

Stars: ✭ 27 (+28.57%)

Mutual labels: cosine-similarity

soan

Social analysis based on WhatsApp data

Stars: ✭ 106 (+404.76%)

Mutual labels: tf-idf

Coursera Uw Machine Learning Clustering Retrieval

Stars: ✭ 25 (+19.05%)

Mutual labels: tf-idf

SentimentAnalysis

(BOW, TF-IDF, Word2Vec, BERT) Word Embeddings + (SVM, Naive Bayes, Decision Tree, Random Forest) Base Classifiers + Pre-trained BERT on Tensorflow Hub + 1-D CNN and Bi-Directional LSTM on IMDB Movie Reviews Dataset

Stars: ✭ 40 (+90.48%)

Mutual labels: tf-idf

Vtext

Simple NLP in Rust with Python bindings

Stars: ✭ 108 (+414.29%)

Mutual labels: tf-idf

tf-idf-python

Term frequency–inverse document frequency for Chinese novels/documents, implemented in Python.

Stars: ✭ 98 (+366.67%)

Mutual labels: tf-idf

Moviebox

Machine learning movie recommending system

Stars: ✭ 504 (+2300%)

Mutual labels: tf-idf

TextAudit

Implementation approach and demo of a text moderation module for a short-video app

Stars: ✭ 63 (+200%)

Mutual labels: tf-idf

Cadmium

Natural Language Processing (NLP) library for Crystal

Stars: ✭ 172 (+719.05%)

Mutual labels: tf-idf

text-classification-cn

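Chinese text classification in practice, based on the Sogou news corpus, using traditional machine learning methods as well as pre-trained models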

Stars: ✭ 81 (+285.71%)

Mutual labels: tf-idf

Polyfuzz

Fuzzy string matching, grouping, and evaluation.

Stars: ✭ 292 (+1290.48%)

Mutual labels: tf-idf

Keyword-Extracter

Problem statement: given a PDF or text document, how can keywords be extracted and ranked by weight using Python?

Stars: ✭ 17 (-19.05%)

Mutual labels: tf-idf

Stringlifier

Stringlifier is an open-source ML library for detecting random strings in raw text. It can be used for sanitising logs, detecting accidentally exposed credentials, and as a pre-processing step in unsupervised ML-based analysis of application text data.

Stars: ✭ 85 (+304.76%)

Mutual labels: tf-idf

topic modelling financial news

Topic modelling on financial news with Natural Language Processing

Stars: ✭ 51 (+142.86%)

Mutual labels: tf-idf

Textmining

Python text mining system (Research of Text Mining System)

Stars: ✭ 268 (+1176.19%)

Mutual labels: tf-idf

Python Tf Idf

An extremely simple Python library to perform TF-IDF document comparison.

Stars: ✭ 214 (+919.05%)

Mutual labels: tf-idf

text2text

Text2Text: Cross-lingual natural language processing and generation toolkit

Stars: ✭ 188 (+795.24%)

Mutual labels: tf-idf

tika-similarity

Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on metadata features.

Stars: ✭ 92 (+338.1%)

Mutual labels: cosine-similarity

KeywordExtraction

Implementation of keyword extraction algorithms, including TextRank, TF-IDF, and a combination of both

Stars: ✭ 95 (+352.38%)

Mutual labels: tf-idf

How To Mine Newsfeed Data And Extract Interactive Insights In Python

A practical guide to topic mining and interactive visualizations

Stars: ✭ 61 (+190.48%)

Mutual labels: tf-idf

Java String Similarity

Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-Winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...

Stars: ✭ 2,403 (+11342.86%)

Mutual labels: cosine-similarity

lucilla

Fast, efficient, in-memory Full Text Search for Kotlin

Stars: ✭ 102 (+385.71%)

Mutual labels: tf-idf

Simple-Plagiarism-Checker

Web application for checking the similarity between a query and a document using cosine similarity.

Stars: ✭ 47 (+123.81%)

Mutual labels: cosine-similarity

Textvec

Text vectorization tool to outperform TFIDF for classification tasks

Stars: ✭ 167 (+695.24%)

Mutual labels: tf-idf

AI-for-Trading

📈 This repo contains detailed notes and multiple projects implemented in Python related to AI and Finance. Follow the blog here: https://purvasingh.medium.com

Stars: ✭ 59 (+180.95%)

Mutual labels: cosine-similarity

watchman

Watchman: An open-source social-media event-detection system

Stars: ✭ 18 (-14.29%)

Mutual labels: tf-idf

live-cctv

Detects significant changes in a live CCTV feed to avoid storing large amounts of data. Once a change is noticed, the goal is to track the object or person causing it, using computer vision concepts. The major focus is on deep learning, with more features to be added along the way.

Stars: ✭ 23 (+9.52%)

Mutual labels: cosine-similarity

Predicting Myers Briggs Type Indicator With Recurrent Neural Networks

Stars: ✭ 43 (+104.76%)

Mutual labels: tf-idf

occupationcoder

Given a job title and job description, the algorithm assigns a standard occupational classification (SOC) code to the job.

Stars: ✭ 30 (+42.86%)

Mutual labels: tf-idf

Document-Classification-using-LSA

Document classification using latent semantic analysis in Python

Stars: ✭ 16 (-23.81%)