minven / nlp-lt

Licence: other
Natural Language Processing for Lithuanian language

Programming languages: Python, TeX

Projects that are alternatives to, or similar to, nlp-lt

Multilingual Latent Dirichlet Allocation Lda
A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.
Stars: ✭ 64 (+276.47%)
Mutual labels:  clustering, lda
Lda Topic Modeling
A PureScript, browser-based implementation of LDA topic modeling.
Stars: ✭ 91 (+435.29%)
Mutual labels:  clustering, lda
Nlp
Selected Machine Learning algorithms for natural language processing and semantic analysis in Golang
Stars: ✭ 304 (+1688.24%)
Mutual labels:  lda, svd
policy-data-analyzer
Building a model to recognize incentives for landscape restoration in environmental policies from Latin America, the US and India. Bringing NLP to the world of policy analysis through an extensible framework that includes scraping, preprocessing, active learning and text analysis pipelines.
Stars: ✭ 22 (+29.41%)
Mutual labels:  text-classification, lda
Text Analytics With Python
Learn how to process, classify, cluster, summarize, understand syntax, semantics and sentiment of text data with the power of Python! This repository contains code and datasets used in my book, "Text Analytics with Python" published by Apress/Springer.
Stars: ✭ 1,132 (+6558.82%)
Mutual labels:  text-classification, clustering
Vectorai
Vector AI — A platform for building vector based applications. Encode, query and analyse data using vectors.
Stars: ✭ 195 (+1047.06%)
Mutual labels:  search-engine, clustering
Ml code
A repository for recording the machine learning code
Stars: ✭ 75 (+341.18%)
Mutual labels:  clustering, svd
kwx
BERT, LDA, and TFIDF based keyword extraction in Python
Stars: ✭ 33 (+94.12%)
Mutual labels:  text-classification, lda
ML2017FALL
Machine Learning (EE 5184) in NTU
Stars: ✭ 66 (+288.24%)
Mutual labels:  text-classification, clustering
Meta
A Modern C++ Data Sciences Toolkit
Stars: ✭ 600 (+3429.41%)
Mutual labels:  search-engine, text-classification
text clustering
Text clustering (K-means, DBSCAN, LDA, Single-pass)
Stars: ✭ 230 (+1252.94%)
Mutual labels:  clustering, lda
minicore
Fast and memory-efficient clustering + coreset construction, including fast distance kernels for Bregman and f-divergences.
Stars: ✭ 28 (+64.71%)
Mutual labels:  clustering
google-this
🔎 A simple yet powerful module to retrieve organic search results and much more from Google.
Stars: ✭ 88 (+417.65%)
Mutual labels:  search-engine
jina-meme-search
Meme search engine built with Jina neural search framework. Search with captions or image files to find matching memes.
Stars: ✭ 21 (+23.53%)
Mutual labels:  search-engine
Apartment-Interest-Prediction
Predict people interest in renting specific NYC apartments. The challenge combines structured data, geolocalization, time data, free text and images.
Stars: ✭ 17 (+0%)
Mutual labels:  clustering
Clustering-Datasets
This repository contains the collection of UCI (real-life) datasets and Synthetic (artificial) datasets (with cluster labels and MATLAB files) ready to use with clustering algorithms.
Stars: ✭ 189 (+1011.76%)
Mutual labels:  clustering
portfolio allocation js
A JavaScript library to allocate and optimize financial portfolios.
Stars: ✭ 145 (+752.94%)
Mutual labels:  clustering
3HAN
An original implementation of "3HAN: A Deep Neural Network for Fake News Detection" (ICONIP 2017)
Stars: ✭ 29 (+70.59%)
Mutual labels:  text-classification
product-quantization
🙃Implementation of vector quantization algorithms, codes for Norm-Explicit Quantization: Improving Vector Quantization for Maximum Inner Product Search.
Stars: ✭ 40 (+135.29%)
Mutual labels:  clustering
dbscan
DBSCAN Clustering Algorithm C# Implementation
Stars: ✭ 38 (+123.53%)
Mutual labels:  clustering

The main goal of this research is to study natural language processing (NLP) principles for the Lithuanian language. It is interesting to see how classical NLP methods behave on Lithuanian, so in this work I implemented text classification, topic extraction, search queries, and clustering. Implementation details and further information are available in paper/paper.pdf.

Introduction

Data analysis needs textual data, so my work started with collecting raw data from the most popular news website, www.delfi.lt. I crawled articles from 5 categories: Criminals (227 articles), Music (120 articles), Movies (167 articles), Sports (136 articles), and Science (204 articles), for 854 articles in total.

Classification

Classification performance is measured with a confusion matrix in which rows are the true categories and columns are the predicted categories. This approach reaches above 90% recall and 90% precision.
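As a rough illustration of how precision and recall follow from a confusion matrix laid out this way, here is a minimal sketch in plain Python. The labels and predictions are invented toy data, not the results of this project, and the helper names `confusion_matrix` and `precision_recall` are hypothetical.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Build a matrix whose rows are true categories and columns are predicted ones."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def precision_recall(matrix):
    """Return a (precision, recall) pair for every class of a square confusion matrix."""
    n = len(matrix)
    scores = []
    for i in range(n):
        tp = matrix[i][i]
        predicted = sum(matrix[r][i] for r in range(n))  # column sum
        actual = sum(matrix[i])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        scores.append((precision, recall))
    return scores


# Toy data, invented for illustration only.
labels = ["Sports", "Music", "Science"]
y_true = ["Sports", "Sports", "Music", "Music", "Science", "Science"]
y_pred = ["Sports", "Music", "Music", "Music", "Science", "Science"]

cm = confusion_matrix(y_true, y_pred, labels)
scores = precision_recall(cm)
for label, (precision, recall) in zip(labels, scores):
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Precision is computed down a column (everything predicted as a class), recall along a row (everything that truly belongs to it), which matches the row and column convention described above.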

Topics extraction

The figure shows 6 components with 10 tokens each. From these results we can identify the most important words and intuitively guess the topic of each principal component. For example, the 4th principal component captures information about sports and music, whereas the 6th principal component captures information about criminals.

Main results are presented below:

[Figure: topic extraction results]
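One way to obtain the top tokens per component is a singular value decomposition of a term-document matrix. The sketch below assumes that setup; the README does not state the exact decomposition or weighting used. The vocabulary and counts are invented toy data, and `top_tokens` is a hypothetical helper name.

```python
import numpy as np


def top_tokens(term_doc, vocab, n_components, n_tokens):
    """Return the highest-weighted tokens for each of the leading SVD components.

    term_doc has one row per token and one column per document.
    """
    # Columns of u describe each component in token space.
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    topics = []
    for k in range(n_components):
        order = np.argsort(-np.abs(u[:, k]))[:n_tokens]
        topics.append([vocab[i] for i in order])
    return topics


# Toy term-document counts, invented for illustration only.
vocab = ["goal", "match", "team", "song", "album", "concert"]
term_doc = np.array([
    [4, 3, 0, 0],
    [3, 4, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
    [0, 0, 1, 1],
], dtype=float)

topics = top_tokens(term_doc, vocab, n_components=2, n_tokens=3)
for k, tokens in enumerate(topics, start=1):
    print(f"component {k}: {', '.join(tokens)}")
```

Tokens are ranked by absolute weight because the sign of a singular vector is arbitrary. In this toy matrix the first component groups the sports words and the second groups the music words.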

Search query

Search is based on the article at http://webhome.cs.uvic.ca/~thomo/svd.pdf, where latent semantic analysis (LSA) is applied to find related documents using not only exact query matches but also deeper relations between documents.

Example

Query = "švietim apdovanojam" (stemmed Lithuanian, roughly "education" and "award")

Result:

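  • ['Imasi mokslininkų algų: siūlo kelti iki 50 proc.'] (Taking on scientists' salaries: a proposal to raise them by up to 50 percent)
  • ['Įteiktos 6 Mokslo premijos'] (6 Science Prizes awarded)
  • ['Lietuvoje į susitikimą kviečia Nobelio premijos laureatas'] (A Nobel Prize laureate invites people to a meeting in Lithuania)
  • ['100 tūkst. eurų išdalins populiarinantiems mokslą'] (100 thousand euros to be distributed to those who popularize science)
  • ['V. Vaičaitis. Konkursinis mokslo finansavimas ar pasityčiojimas iš mokslininkų?'] (V. Vaičaitis. Competitive science funding, or a mockery of scientists?)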
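The LSA search idea in this section can be sketched as follows: decompose the term-document matrix with SVD, keep the top k dimensions, fold the query into the same latent space, and rank documents by cosine similarity. This is a minimal sketch under those assumptions, not the project's actual code. The vocabulary, documents, and the name `lsa_rank` are invented for illustration.

```python
import numpy as np


def lsa_rank(term_doc, query_vec, k):
    """Rank documents against a query in a k-dimensional LSA space.

    term_doc has one row per term and one column per document;
    query_vec holds the query's term counts in the same term order.
    """
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    u_k, s_k = u[:, :k], s[:k]
    docs_k = vt[:k, :].T               # one row of latent coordinates per document
    q_k = (query_vec @ u_k) / s_k      # fold the query into the latent space
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    sims = (docs_k @ q_k) / (norms + 1e-12)  # cosine similarity
    return np.argsort(-sims), sims


# Toy data, invented for illustration only.
# Rows: science, prize, award, football, match
term_doc = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
], dtype=float)
query = np.array([0, 0, 1, 0, 0], dtype=float)  # the single term "award"

ranking, sims = lsa_rank(term_doc, query, k=2)
print(ranking, np.round(sims, 3))
```

Document 0 does not contain the word "award", yet it scores as high as document 1 because both share "prize" and end up close together in the latent space. That is the kind of relation beyond exact query matches that the approach relies on.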

Clustering

In progress
