lingualytics / py-lingualytics

License: MIT
A text analytics library with support for codemixed data

Programming Languages

Python

Projects that are alternatives to or similar to py-lingualytics

Pytorch Sentiment Analysis
Tutorials on getting started with PyTorch and TorchText for sentiment analysis.
Stars: ✭ 3,209 (+8813.89%)
Mutual labels:  bert, pytorch-nlp
classy
classy is a simple-to-use library for building high-performance Machine Learning models in NLP.
Stars: ✭ 61 (+69.44%)
Mutual labels:  nlp-library, bert
Transformers
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Stars: ✭ 55,742 (+154738.89%)
Mutual labels:  nlp-library, bert
Urduhack
An NLP library for the Urdu language. It comes with many batteries-included features to help you process Urdu data in the easiest way possible.
Stars: ✭ 200 (+455.56%)
Mutual labels:  nlp-library
Fnlp
Toolkit for Chinese natural language processing
Stars: ✭ 2,468 (+6755.56%)
Mutual labels:  nlp-library
Bert-model-code-interpretation
A walkthrough of the data flow in modeling.py of the TensorFlow version of BERT
Stars: ✭ 19 (-47.22%)
Mutual labels:  bert
vietnamese-roberta
A Robustly Optimized BERT Pretraining Approach for Vietnamese
Stars: ✭ 22 (-38.89%)
Mutual labels:  bert
Nlp profiler
A simple NLP library that allows profiling of datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.
Stars: ✭ 181 (+402.78%)
Mutual labels:  nlp-library
pn-summary
A well-structured summarization dataset for the Persian language!
Stars: ✭ 29 (-19.44%)
Mutual labels:  bert
spacy-sentence-bert
Sentence transformers models for SpaCy
Stars: ✭ 88 (+144.44%)
Mutual labels:  bert
GLUE-bert4keras
GLUE benchmark code based on bert4keras
Stars: ✭ 59 (+63.89%)
Mutual labels:  bert
Multi Task Nlp
multi_task_NLP is a utility toolkit enabling NLP developers to easily train and infer a single model for multiple tasks.
Stars: ✭ 221 (+513.89%)
Mutual labels:  nlp-library
npo classifier
Automated coding using machine-learning and remapping the U.S. nonprofit sector: A guide and benchmark
Stars: ✭ 18 (-50%)
Mutual labels:  bert
Sudachipy
Python version of Sudachi, a Japanese tokenizer.
Stars: ✭ 207 (+475%)
Mutual labels:  nlp-library
TradeTheEvent
Implementation of "Trade the Event: Corporate Events Detection for News-Based Event-Driven Trading." In Findings of ACL2021
Stars: ✭ 64 (+77.78%)
Mutual labels:  bert
Pyarabic
pyarabic
Stars: ✭ 183 (+408.33%)
Mutual labels:  nlp-library
FinBERT-QA
Financial Domain Question Answering with pre-trained BERT Language Model
Stars: ✭ 70 (+94.44%)
Mutual labels:  bert
Cool-NLPCV
Some Cool NLP and CV Repositories and Solutions (a collection of open-source solutions, datasets, tools, and learning materials for common NLP tasks)
Stars: ✭ 143 (+297.22%)
Mutual labels:  bert
protonet-bert-text-classification
Fine-tune BERT for small-dataset text classification in a few-shot learning manner using ProtoNet
Stars: ✭ 28 (-22.22%)
Mutual labels:  bert
AiSpace
AiSpace: Better practices for deep learning model development and deployment for TensorFlow 2.0
Stars: ✭ 28 (-22.22%)
Mutual labels:  bert

Lingualytics: Indic analytics with codemix support

Lingualytics is a Python library for dealing with Indic text.
It is powered by libraries such as PyTorch, Transformers, Texthero, NLTK, and scikit-learn.

Check out our demo video!

[Lingualytics demo video]

[train-demo]

🌟 Features

  1. Preprocessing

    • Remove stopwords
    • Remove punctuation, with an option to add punctuation characters from your own language
    • Remove words shorter than a given character limit
  2. Representation

    • Find n-grams in a given text
  3. NLP

    • Classification using PyTorch
      • Train a classifier on your data to perform tasks like Sentiment Analysis
      • Evaluate the classifier with metrics like accuracy, F1 score, precision, and recall
      • Use the trained tokenizer to tokenize text

🧠 Pretrained Models

Check out some codemix-friendly models that we have trained using Lingualytics.
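
If these models are published on the Hugging Face Hub, they can be loaded directly with the transformers library. A minimal sketch, using a placeholder model identifier rather than an actual Lingualytics model name:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "your-org/your-codemix-model" is a placeholder identifier, not a real Lingualytics model.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-codemix-model")
model = AutoModelForSequenceClassification.from_pretrained("your-org/your-codemix-model")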

💾 Installation

Use the package manager pip to install lingualytics.

pip install lingualytics

🕹️ Usage

Preprocessing

from lingualytics.preprocessing import remove_lessthan, remove_punctuation, remove_stopwords
from lingualytics.stopwords import hi_stopwords,en_stopwords
from texthero.preprocessing import remove_digits
import pandas as pd
df = pd.read_csv(
   "https://github.com/lingualytics/py-lingualytics/raw/master/datasets/SAIL_2017/Processed_Data/Devanagari/validation.txt", header=None, sep='\t', names=['text','label']
)
# pd.set_option('display.max_colwidth', None)
# Clean the text: strip digits and punctuation, drop tokens shorter than
# 3 characters, and remove English and Hindi stopwords.
df['clean_text'] = df['text'].pipe(remove_digits) \
                             .pipe(remove_punctuation) \
                             .pipe(remove_lessthan, length=3) \
                             .pipe(remove_stopwords, stopwords=en_stopwords.union(hi_stopwords))
print(df)
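
The feature list also mentions adding punctuation characters from your own language. A minimal sketch of that option, assuming remove_punctuation accepts the extra characters through a punctuation keyword argument (the parameter name is an assumption; check the API reference):

from lingualytics.preprocessing import remove_punctuation
import pandas as pd
import string

s = pd.Series(['नमस्ते। कैसे हो?', 'hello, world.'])
# Assumed keyword argument: strip the Devanagari danda (।) along with ASCII punctuation.
print(s.pipe(remove_punctuation, punctuation=string.punctuation + '।'))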

Classification

Currently available datasets include SAIL_2017, used in the example below.

from lingualytics.learner import Learner

learner = Learner(model_type='bert',
                  model_name='bert-base-multilingual-cased',
                  dataset='SAIL_2017')
learner.fit()
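
Here fit() fine-tunes the multilingual BERT model on SAIL_2017, a code-mixed sentiment analysis dataset from the Sentiment Analysis in Indian Languages (SAIL) 2017 shared task.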

Custom Dataset

The training data path should contain three files:

  • train.txt
  • validation.txt
  • test.txt

Each file should contain one example per line, with the text and the label separated by a tab. Then change data_dir to the path of your custom dataset, as sketched below.
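
A minimal sketch of training on a custom dataset, assuming the Learner constructor accepts the data_dir argument mentioned above alongside the parameters shown in the Classification example (the exact signature may differ; see the API reference):

from lingualytics.learner import Learner

# data_dir should contain train.txt, validation.txt and test.txt,
# each holding tab-separated <text>\t<label> lines.
learner = Learner(model_type='bert',
                  model_name='bert-base-multilingual-cased',
                  data_dir='path/to/your/dataset')
learner.fit()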

Find topmost n-grams

from lingualytics.representation import get_ngrams
import pandas as pd
df = pd.read_csv(
   "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

ngrams = get_ngrams(df['text'],n=2)

print(ngrams[:10])
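
Since get_ngrams finds the topmost n-grams, the call above collects the corpus bigrams (n=2) and ngrams[:10] prints the ten most frequent ones.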

Documentation | API Reference

Documentation is a work in progress! Have a look at it here.

Development Roadmap

We plan to add the following functionality in the coming weeks:

  • Language Identification (LID)
  • POS Tagging (POS)
  • Named Entity Recognition (NER)
  • Sentiment Analysis (SA)
  • Question Answering (QA)
  • Natural Language Inference (NLI)
  • Topic Modelling (LDA)
  • Fuzzy text matching at scale
  • Word Sense Disambiguation, TF-IDF, Keyword Extraction
  • Data distribution over different languages

👪 Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

⚖️ License

MIT

📚 References

  1. Khanuja, Simran, et al. "GLUECoS: An Evaluation Benchmark for Code-Switched NLP." arXiv preprint arXiv:2004.12376 (2020).