ivan-bilan / The NLP Pandect
Licence: cc0-1.0
A comprehensive reference for all topics related to Natural Language Processing
Stars: ✭ 1,349
Projects that are alternatives to or similar to The NLP Pandect
Python Tutorial Notebooks
Python tutorials as Jupyter Notebooks for NLP, ML, AI
Stars: ✭ 52 (-96.15%)
Mutual labels: natural-language-processing, deeplearning
Trankit
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Stars: ✭ 311 (-76.95%)
Mutual labels: natural-language-processing, deeplearning
Fixy
Our goal is to build an open-source spelling assistant/checker that tackles many different problems in the Turkish NLP literature at once, proposes novel approaches, and fills gaps left by existing work. It corrects spelling mistakes in user-written text with a deep learning approach and also performs semantic analysis of the text, so that it can detect and fix errors that arise in that context.
Stars: ✭ 165 (-87.77%)
Mutual labels: natural-language-processing, deeplearning
Ai Series
📚 [.md & .ipynb] Series of Artificial Intelligence & Deep Learning, including Mathematics Fundamentals, Python Practices, NLP Application, etc. 💫 AI and Deep Learning in Practice: Mathematical Statistics | Machine Learning | Deep Learning | Natural Language Processing | Tools in Practice (Scikit, TensorFlow, PyTorch) | Industry Applications & Course Notes
Stars: ✭ 702 (-47.96%)
Mutual labels: natural-language-processing, deeplearning
Coursera Natural Language Processing Specialization
Programming assignments from all courses in the Coursera Natural Language Processing Specialization offered by deeplearning.ai.
Stars: ✭ 39 (-97.11%)
Mutual labels: natural-language-processing, deeplearning
Spago
Self-contained Machine Learning and Natural Language Processing library in Go
Stars: ✭ 854 (-36.69%)
Mutual labels: natural-language-processing, deeplearning
Learn Data Science For Free
This repository combines resources scattered all over the internet. It arranges the valuable ones in sequence to help beginners who are looking for a free, structured learning resource for Data Science. For Constant Updates Follow me in …
Stars: ✭ 4,757 (+252.63%)
Mutual labels: natural-language-processing, deeplearning
Ludwig
Data-centric declarative deep learning framework
Stars: ✭ 8,018 (+494.37%)
Mutual labels: natural-language-processing, deeplearning
Bidaf Keras
Bidirectional Attention Flow for Machine Comprehension implemented in Keras 2
Stars: ✭ 60 (-95.55%)
Mutual labels: natural-language-processing, deeplearning
Tageditor
🏖TagEditor - Annotation tool for spaCy
Stars: ✭ 92 (-93.18%)
Mutual labels: natural-language-processing
Toiro
A comparison tool of Japanese tokenizers
Stars: ✭ 95 (-92.96%)
Mutual labels: natural-language-processing
Abydos
Abydos NLP/IR library for Python
Stars: ✭ 91 (-93.25%)
Mutual labels: natural-language-processing
Micromlp
A micro neural network multilayer perceptron for MicroPython (used on ESP32 and Pycom modules)
Stars: ✭ 92 (-93.18%)
Mutual labels: deeplearning
Bdrar
Code for the ECCV 2018 paper "Bidirectional Feature Pyramid Network with Recurrent Attention Residual Modules for Shadow Detection"
Stars: ✭ 95 (-92.96%)
Mutual labels: deeplearning
Msr Nlp Projects
This is a list of open-source projects at Microsoft Research NLP Group
Stars: ✭ 92 (-93.18%)
Mutual labels: natural-language-processing
Sentence Similarity
PyTorch implementations of various deep learning models for paraphrase detection, semantic similarity, and textual entailment
Stars: ✭ 96 (-92.88%)
Mutual labels: natural-language-processing
Geotext
Geotext extracts country and city mentions from text
Stars: ✭ 91 (-93.25%)
Mutual labels: natural-language-processing
Lda Topic Modeling
A PureScript, browser-based implementation of LDA topic modeling.
Stars: ✭ 91 (-93.25%)
Mutual labels: natural-language-processing
Bond
BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision
Stars: ✭ 96 (-92.88%)
Mutual labels: natural-language-processing
Ngsim env
Learning human driver models from NGSIM data with imitation learning.
Stars: ✭ 96 (-92.88%)
Mutual labels: deeplearning
This pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.
Compendiums and awesome lists on the topic of NLP:
- Awesome NLP by keon [GitHub, 11582 stars]
- Speech and Natural Language Processing Awesome List by elaboshira [GitHub, 2003 stars]
- Awesome Deep Learning for Natural Language Processing (NLP) [GitHub, 899 stars]
- Text Mining and Natural Language Processing Resources by stepthom [GitHub, 354 stars]
- Made with ML List by madewithml.com
- Brainsources for #NLP enthusiasts by Philip Vollet
- Awesome AI/ML/DL - NLP Section [GitHub, 823 stars]
NLP Conferences, Paper Summaries and Paper Compendiums:
Papers and Paper Summaries
- 100 Must-Read NLP Papers [GitHub, 3022 stars]
- NLP Paper Summaries by dair-ai [GitHub, 1308 stars]
- Curated collection of papers for the NLP practitioner [GitHub, 1019 stars]
- Papers on Textual Adversarial Attack and Defense [GitHub, 770 stars]
- The Most Influential NLP Research of 2019
- Recent Deep Learning papers in NLU and RL by Valentin Malykh [GitHub, 286 stars]
- Some Notable Recent ML Papers and Future Trends by Aran Komatsuzaki [Blog, Oct. 2020]
- A Survey of Surveys (NLP & ML): Collection of NLP Survey Papers [GitHub, 1199 stars]
Conferences
- NLP top 10 conferences Compendium by soulbliss [GitHub, 347 stars]
- NLP Conferences Calendar
- ICLR 2020 Trends
- SpacyIRL 2019 Conference in Overview
- Paper Digest - Conferences and Papers in Overview
NLP Progress and NLP Tasks:
- NLP Progress by sebastianruder [GitHub, 17841 stars]
- NLP Tasks by Kyubyong [GitHub, 2885 stars]
- Reading list for Awesome Sentiment Analysis papers by declare-lab [GitHub, 294 stars]
- Awesome Sentiment Analysis by xiamx [GitHub, 810 stars]
NLP Datasets:
- NLP Datasets by niderhoff [GitHub, 4517 stars]
- Datasets by Huggingface [GitHub, 6838 stars]
- Big Bad NLP Database
- 25 Best Parallel Text Datasets for Machine Translation Training
- UWA Unambiguous Word Annotations - Word Sense Disambiguation Dataset
- 20 Best German Language Datasets for Machine Learning
Word and Sentence embeddings:
- Awesome Embedding Models by Hironsan [GitHub, 1425 stars]
- Awesome list of Sentence Embeddings by Separius [GitHub, 1778 stars]
- Awesome BERT by Jiakui [GitHub, 1612 stars]
Notebooks, Scripts and Repositories
- The Super Duper NLP Repo [Website, 2020]
Non-English resources and compendiums
- NLP Resources for Bahasa Indonesian [GitHub, 153 stars]
- Indic NLP Catalog [GitHub, 223 stars]
- Pre-trained language models for Vietnamese [GitHub, 322 stars]
- Natural Language Toolkit for Indic Languages (iNLTK) [GitHub, 695 stars]
- Indic NLP Library [GitHub, 341 stars]
- AI4Bharat-IndicNLP Portal
- ARBML - Implementation of many Arabic NLP and ML projects [GitHub, 162 stars]
- zemberek-nlp - NLP tools for Turkish [GitHub, 898 stars]
Pre-trained NLP models
- List of pre-trained NLP models [GitHub, 119 stars]
NLP Year in Review
2020
- Natural Language Processing in 2020: The Year In Review [Blog, December 2020]
- ML and NLP Research Highlights of 2020 [Blog, January 2021]
NLP-only podcasts
- NLP Highlights [Years: 2017 - now, Status: active]
Many NLP episodes
- TWIML AI [Years: 2016 - now, Status: active]
- Practical AI [Years: 2018 - now, Status: active]
- The Data Exchange [Years: 2019 - now, Status: active]
- Gradient Dissent [Years: 2020 - now, Status: active]
- Machine Learning Street Talk [Years: 2020 - now, Status: active]
Some NLP episodes
- The Super Data Science Podcast [Years: 2016 - now, Status: active]
- Data Hack Radio [Years: 2018 - now, Status: active]
- AI Game Changers [Years: 2020 - now, Status: active]
Newsletters
- NLP News by Sebastian Ruder
- dair.ai Newsletter by dair.ai
- This Week in NLP by Robert Dale
- Papers with Code
- The Batch by deeplearning.ai
- Paper Digest by PaperDigest
- NLP Cypher by QuantumStat
Meetups
- NLP Zurich [YouTube Recordings]
- NY-NLP (New York)
- Online NLP Meetup
- Hacking-Machine-Learning [YouTube Recordings]
YouTube Channels
- Yannic Kilcher
- HuggingFace
- Kaggle Reading Group
- Rasa Paper Reading
- Stanford CS224N: NLP with Deep Learning
- NLPxing
- ML Explained - A.I. Socratic Circles - AISC
- Deeplearning.ai
- Machine Learning Street Talk
General NLU
- GLUE - General Language Understanding Evaluation (GLUE) benchmark
- SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
- decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
- RACE - ReAding Comprehension dataset collected from English Examinations
- dialoglue - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue
- DynaBench - Dynabench is a research platform for dynamic data collection and benchmarking
Summarization
- WikiAsp - WikiAsp: Multi-document aspect-based summarization Dataset
Question Answering
- SQuAD - Stanford Question Answering Dataset (SQuAD)
- XQuad - XQuAD (Cross-lingual Question Answering Dataset) for cross-lingual question answering
- GrailQA - Strongly Generalizable Question Answering (GrailQA)
- CSQA - Complex Sequential Question Answering
Multilingual and Non-English Benchmarks
- XTREME - Massively Multilingual Multi-task Benchmark
- GLUECoS - A benchmark for code-switched NLP
- IndoNLU Benchmark - collection of resources for training, evaluating, and analyzing NLP for Bahasa Indonesia
- IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
- LinCE - Linguistic Code-Switching Evaluation Benchmark
Bio, Law, and other scientific domains
- BLURB - Biomedical Language Understanding and Reasoning Benchmark
- BLUE - Biomedical Language Understanding Evaluation benchmark
Transformer Efficiency
- Long-Range Arena - Long Range Arena for Benchmarking Efficient Transformers (Pre-print) [GitHub, 195 stars]
Other
- CodeXGLUE - A benchmark dataset for code intelligence
- CrossNER - CrossNER: Evaluating Cross-Domain Named Entity Recognition
- MultiNLI - Multi-Genre Natural Language Inference corpus
General
- A Recipe for Training Neural Networks by Andrej Karpathy [Keywords: research, training, 2019]
Embeddings
Repositories
- Pre-trained ELMo Representations for Many Languages [GitHub, 1308 stars]
- sense2vec - Contextually-keyed word vectors [GitHub, 1163 stars]
- wikipedia2vec [GitHub, 644 stars]
- StarSpace [GitHub, 3550 stars]
- fastText [GitHub, 22254 stars]
Blogs
- Language Models and Contextualised Word Embeddings by David S. Batista [Blog, 2018]
- An Essential Guide to Pretrained Word Embeddings for NLP Practitioners by AnalyticsVidhya [Blog, 2020]
- Polyglot Word Embeddings Discover Language Clusters [Blog, 2020]
- The Illustrated Word2vec by Jay Alammar [Blog, 2019]
Cross-lingual Word Embeddings
- vecmap - VecMap (cross-lingual word embedding mappings) [GitHub, 527 stars]
Byte Pair Encoding
- bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 893 stars]
- subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 1633 stars]
- python-bpe - Byte Pair Encoding for Python [GitHub, 141 stars]
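For readers new to the technique behind the libraries above, here is a minimal sketch of how Byte Pair Encoding merges can be learned from a word list. It is a toy illustration, not the implementation used by bpemb, subword-nmt, or python-bpe; the function name `learn_bpe`, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy sketch)."""
    # Each word becomes a tuple of symbols; "</w>" is an assumed end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself (an assumption).
        best = max(pairs, key=lambda pair: (pairs[pair], pair))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    print(learn_bpe(["abab", "abab", "ab"], num_merges=2))
```

Each learned merge fuses the currently most frequent adjacent pair, so frequent character sequences gradually become single subword units.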
Transformer-based Architectures
General
- The Transformer Family by Lilian Weng [Blog, 2020]
- Keeping up with the BERTs: a review of the main NLP benchmarks by Manuel Tonneau [Blog, 2020]
- Playing the lottery with rewards and multiple languages - about the effect of random initialization [ICLR 2020 Paper]
- Attention? Attention! by Lilian Weng [Blog, 2018]
- the transformer … “explained”? [Blog, 2019]
- Attention is all you need; Attentional Neural Network Models by Łukasz Kaiser [Talk, 2017]
- Understanding and Applying Self-Attention for NLP [Talk, 2018]
Transformer
- The Annotated Transformer by Harvard NLP [Blog, 2018]
- The Illustrated Transformer by Jay Alammar [Blog, 2018]
- Illustrated Guide to Transformers by Hong Jing [Blog, 2020]
- Sequential Transformer with Adaptive Attention Span by Facebook [Blog, 2019]
- Evolution of Representations in the Transformer by Lena Voita [Blog, 2019]
- Reformer: The Efficient Transformer [Blog, 2020]
- Longformer — The Long-Document Transformer by Viktor Karlsson [Blog, 2020]
- TRANSFORMERS FROM SCRATCH [Blog, 2019]
- Universal Transformers by Mostafa Dehghani [Blog, 2019]
- Transformers in Natural Language Processing — A Brief Survey by George Ho [Blog, May 2020]
- Lite Transformer - Lite Transformer with Long-Short Range Attention [GitHub, 397 stars]
BERT
- A Visual Guide to Using BERT for the First Time by Jay Alammar [Blog, 2019]
- The Dark Secrets of BERT by Anna Rogers [Blog, 2020]
- Understanding searches better than ever before [Blog, 2019]
- Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework [Blog, 2019]
- SemBERT - Semantics-aware BERT for Language Understanding [GitHub, 191 stars]
- BERTweet - BERTweet: A pre-trained language model for English Tweets [GitHub, 271 stars]
- Optimal Subarchitecture Extraction for BERT [GitHub, 413 stars]
- CharacterBERT: Reconciling ELMo and BERT [GitHub, 84 stars]
Other Transformer Variants
T5
- T5 Understanding Transformer-Based Self-Supervised Architectures [Blog, August 2020]
- T5: the Text-To-Text Transfer Transformer [Blog, 2020]
- multilingual-t5 - Multilingual T5 (mT5) is a massively multilingual pretrained text-to-text transformer model [GitHub, 579 stars]
BigBird
- Big Bird: Transformers for Longer Sequences original paper by Google Research [Paper, July 2020]
Reformer / Linformer / Longformer / Performers
- Reformer: The Efficient Transformer - [Paper, February 2020] [Video, October 2020]
- Longformer: The Long-Document Transformer - [Paper, April 2020] [Video, April 2020]
- Linformer: Self-Attention with Linear Complexity - [Paper, June 2020] [Video, June 2020]
- Rethinking Attention with Performers - [Paper, September 2020] [Video, September 2020]
- performer-pytorch - An implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 516 stars]
Switch Transformer
- Switch Transformers: Scaling to Trillion Parameter Models original paper by Google Research [Paper, January 2021]
GPT-family
General
- The Illustrated GPT-2 by Jay Alammar [Blog, 2019]
- The Annotated GPT-2 by Aman Arora
- OpenAI’s GPT-2: the model, the hype, and the controversy by Ryan Lowe [Blog, 2019]
- How to generate text by Patrick von Platen [Blog, 2020]
GPT-3
Learning Resources
- Zero Shot Learning for Text Classification by Amit Chaudhary [Blog, 2020]
- GPT-3 A Brief Summary by Leo Gao [Blog, 2020]
- GPT-3, a Giant Step for Deep Learning And NLP by Yoel Zeldes [Blog, June 2020]
- GPT-3 Language Model: A Technical Overview by Chuan Li [Blog, June 2020]
- Is it possible for language models to achieve language understanding? by Christopher Potts
Applications
- Awesome GPT-3 - list of all resources related to GPT-3 [GitHub, 2946 stars]
- GPT-3 Projects - a map of all GPT-3 start-ups and commercial projects
- OpenAI API - API Demo to use GPT-3 for commercial applications
Open-source Efforts
- GPT-Neo - in-progress GPT-3 open source replication
Other
- What is Two-Stream Self-Attention in XLNet by Xu LIANG [Blog, 2019]
- Visual Paper Summary: ALBERT (A Lite BERT) by Amit Chaudhary [Blog, 2020]
- Turing NLG by Microsoft
- Multi-Label Text Classification with XLNet by Josh Xin Jie Lee [Blog, 2019]
- ELECTRA [GitHub, 1676 stars]
- Performer - an implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 516 stars]
Distillation, Pruning and Quantization
- Distilling knowledge from Neural Networks to build smaller and faster models by FloydHub [Blog, 2019]
- David over Goliath: towards smaller models for cheaper, faster, and greener NLP by Manuel Tonneau [Blog, 2020]
Automated Summarization
- PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization by Google AI [Blog, June 2020]
- CTRLsum - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 21 stars]
Rule-based NLP
- LemmInflect - A python module for English lemmatization and inflection
Best Practices for NLP
- In Search of Best Practices for NLP Projects [Slides, Dec. 2020]
- EMNLP 2020: High Performance Natural Language Processing by Google Research [Slides, Recording, Nov. 2020]
- Practical Natural Language Processing - A Comprehensive Guide to Building Real-World NLP Systems [Book, June 2020]
Transformer-based Architectures
- Why BERT Fails in Commercial Environments by Intel AI [Blog, 2020]
- Fine Tuning BERT for Text Classification with FARM by Sebastian Guggisberg [Blog, 2020]
- Pretrain Transformers Models in PyTorch using Hugging Face Transformers [GitHub, 57 stars]
- Practical NLP for the Real World [Presentation, 2019]
- From Paper to Product – How we implemented BERT by Christoph Henkelmann [Talk, 2020]
Embeddings as a Service
- embedding-as-service [GitHub, 148 stars]
- Bert-as-service [GitHub, 8913 stars]
NLP Recipes Industrial Applications:
- NLP Recipes by microsoft [GitHub, 5399 stars]
- NLP with Python by susanli2016 [GitHub, 1886 stars]
- Basic Utilities for PyTorch NLP by PetrochukM [GitHub, 1867 stars]
NLP Applications in Bio, Finance, Legal and other industries
- Blackstone - A spaCy pipeline and model for NLP on unstructured legal text [GitHub, 452 stars]
- Sci spaCy - spaCy pipeline and models for scientific/biomedical documents [GitHub, 838 stars]
- FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks [GitHub, 132 stars]
- LexNLP - Information retrieval and extraction for real, unstructured legal text [GitHub, 433 stars]
- NerDL and NerCRF - Tutorial on Named Entity Recognition for Healthcare with SparkNLP
- Legal Text Analytics - A list of selected resources dedicated to Legal Text Analytics [GitHub, 192 stars]
Model and Data testing
- WildNLP - Corrupt an input text to test NLP models' robustness [GitHub, 64 stars]
- Great Expectations - Write tests for your data [GitHub, 3721 stars]
- CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 1245 stars]
- TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1263 stars]
General Speech Recognition
- wav2letter - Automatic Speech Recognition Toolkit [GitHub, 5655 stars]
- DeepSpeech - Baidu's DeepSpeech architecture [GitHub, 16623 stars]
- Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
- kaldi - Kaldi is a toolkit for speech recognition [GitHub, 10155 stars]
- awesome-kaldi - resources for using Kaldi [GitHub, 381 stars]
- ESPnet - End-to-End Speech Processing Toolkit [GitHub, 3540 stars]
Text to Speech
- FastSpeech - The Implementation of FastSpeech based on pytorch [GitHub, 587 stars]
Blogs
- Topic Modelling with PySpark and Spark NLP by Maria Obedkova [Spark, Blog, 2020]
Frameworks for Topic Modeling
Repositories
- Top2Vec [GitHub, 932 stars]
- Anchored Correlation Explanation Topic Modeling [GitHub, 266 stars]
- Topic Modeling in Embedding Spaces [GitHub, 294 stars] Paper
- TopicNet - A high-level interface for BigARTM library [GitHub, 105 stars]
- BERTopic - Leveraging BERT and a class-based TF-IDF to create easily interpretable topics [GitHub, 689 stars]
Text Rank
- PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 1471 stars]
- textrank - TextRank implementation for Python 3 [GitHub, 996 stars]
RAKE - Rapid Automatic Keyword Extraction
- rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 792 stars]
- yake - Single-document unsupervised keyword extraction [GitHub, 577 stars]
- RAKE-tutorial - A python implementation of the Rapid Automatic Keyword Extraction [GitHub, 346 stars]
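As a rough illustration of the idea behind the RAKE tools listed above, the sketch below splits text into candidate phrases at stopwords and punctuation, then scores each phrase by summing the degree-to-frequency ratio of its words. It is a simplified sketch, not the code of rake-nltk or RAKE-tutorial; the tiny `STOPWORDS` set and the function name `rake` are assumptions for this example.

```python
import re
from collections import defaultdict

# A deliberately tiny stopword list, assumed for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "is", "in", "to", "on", "with", "are"}


def rake(text, stopwords=STOPWORDS):
    """Return candidate phrases sorted by RAKE-style score, highest first."""
    phrases = []
    # Punctuation ends a candidate phrase, and so does any stopword.
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))

    # freq: number of occurrences of a word; degree: total length of the phrases containing it.
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # A phrase's score is the sum of degree/frequency over its words.
    scores = {phrase: sum(degree[w] / freq[w] for w in phrase) for phrase in set(phrases)}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for phrase, score in rake("Keyword extraction is the task of automatic keyword detection."):
        print(" ".join(phrase), score)
```

Longer phrases built from words that co-occur widely score highest, which is why multi-word terms tend to rank above single words.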
Other
- flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub, 4656 stars]
- BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub, 164 stars]
- keyBERT - Minimal keyword extraction with BERT [GitHub, 353 stars]
- Adding a custom tokenizer to spaCy and extracting keywords from Chinese texts by Haowen Jiang [Blog, Feb 2021]
NLP and ML Interpretability
- Language Interpretability Tool (LIT) [GitHub, 2380 stars]
- WhatLies - Toolkit to help visualise - what lies in word embeddings [GitHub, 236 stars]
- Interpret-Text - Interpretability techniques and visualization dashboards for NLP models [GitHub, 214 stars]
- InterpretML - Fit interpretable models. Explain blackbox machine learning [GitHub, 3491 stars]
- ecco - Tools to visualize and explore NLP language models [GitHub, 693 stars]
- NLP Profiler - A simple NLP library that allows profiling datasets with text columns [GitHub, 179 stars]
Ethics, Bias, and Equality in NLP
- Machine Learning as a Software Engineering Enterprise - NeurIPS 2020 Keynote [Presentation, Dec 2020]
- Computational Ethics for NLP - course resources from Carnegie Mellon University [Lecture Notes, Spring 2020]
- Ethics in NLP - resources from ACL's Ethics in NLP track
- The Institute for Ethical AI & Machine Learning
- Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models [Paper, Feb 2021]
Adversarial Attacks for NLP
- Privacy Considerations in Large Language Models [Blog, Dec 2020]
- DeepWordBug - Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers [GitHub, 46 stars]
General Purpose
- spaCy by Explosion AI [GitHub, 19664 stars]
- flair by Zalando [GitHub, 9971 stars]
- AllenNLP by AI2 [GitHub, 9705 stars]
- stanza (formerly Stanford NLP) [GitHub, 5201 stars]
- spaCy stanza [GitHub, 502 stars]
- nltk [GitHub, 9651 stars]
- gensim - framework for topic modeling [GitHub, 11751 stars]
- pororo - Platform of neural models for natural language processing [GitHub, 780 stars]
- NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2601 stars]
- polyglot - Multi-lingual NLP Framework [GitHub, 1773 stars]
- FARM [GitHub, 1108 stars]
- gobbli by RTI International [GitHub, 251 stars]
- headliner - training and deployment of seq2seq models [GitHub, 221 stars]
- SyferText - A privacy preserving NLP framework [GitHub, 164 stars]
- DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1023 stars]
- TextHero - Text preprocessing, representation and visualization [GitHub, 2097 stars]
- textblob - TextBlob: Simplified Text Processing [GitHub, 7553 stars]
- AdaptNLP - A high level framework and library for NLP [GitHub, 269 stars]
- textacy - NLP, before and after spaCy [GitHub, 1609 stars]
- texar - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2113 stars]
Data Augmentation
- WildNLP - Text manipulation library to test NLP models [GitHub, 64 stars]
- snorkel - Framework to generate training data [GitHub, 4490 stars]
- NLPAug - Data augmentation for NLP [GitHub, 1638 stars]
- SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 268 stars]
- faker - Python package that generates fake data for you [GitHub, 12133 stars]
Adversarial NLP Attacks
- TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1263 stars]
- CleverHans - adversarial example library for constructing NLP attacks and building defenses [GitHub, 4957 stars]
Non-English oriented
- textblob-de - TextBlob: Simplified Text Processing for German [GitHub, 80 stars]
- Kashgari - Transfer Learning with focus on Chinese [GitHub, 2032 stars]
- Underthesea - Vietnamese NLP Toolkit [GitHub, 814 stars]
Transformer-oriented
- transformers by HuggingFace [GitHub, 41314 stars]
- Adapter Hub and its documentation - Adapter modules for Transformers [GitHub, 323 stars]
- haystack - Transformers at scale for question answering & neural search. [GitHub, 1417 stars]
Dialog Systems and Speech
- DeepPavlov by MIPT [GitHub, 5021 stars]
- ParlAI by FAIR [GitHub, 6999 stars]
- rasa - Framework for Conversational Agents [GitHub, 10826 stars]
- wav2letter - Automatic Speech Recognition Toolkit [GitHub, 5655 stars]
- ChatterBot - conversational dialog engine for creating chat bots [GitHub, 10900 stars]
Word-embeddings oriented
- MUSE - A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 2690 stars]
- vecmap - A framework to learn cross-lingual word embedding mappings [GitHub, 527 stars]
Distributed NLP
- Spark NLP [GitHub, 1963 stars]
Machine Translation
- COMET - A Neural Framework for MT Evaluation [GitHub, 53 stars]
- marian-nmt - Fast Neural Machine Translation in C++ [GitHub, 764 stars]
- argos-translate - Open source neural machine translation in Python [GitHub, 453 stars]
- Opus-MT - Open neural machine translation models and web services [GitHub, 104 stars]
Entity and String Matching
- PolyFuzz - Fuzzy string matching, grouping, and evaluation [GitHub, 285 stars]
- pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 585 stars]
- fuzzywuzzy - Fuzzy String Matching in Python [GitHub, 7900 stars]
- jellyfish - approximate and phonetic matching of strings [GitHub, 1400 stars]
- textdistance - Compute distance between sequences [GitHub, 1900 stars]
- DeepMatcher - Python package for performing entity and text matching using deep learning [GitHub, 276 stars]
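Several of the libraries above build on edit distance. The sketch below shows one common variant, the Levenshtein distance, computed with dynamic programming, plus a normalized similarity score. It is a minimal illustration and not the API of fuzzywuzzy, jellyfish, or textdistance; the function names and the normalization by the longer string's length are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # Keep the shorter string as `b` so the working rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion from `a`
                current[j - 1] + 1,      # insertion into `a`
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalizing by the longer length is an assumed convention."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(round(similarity("kitten", "sitting"), 3))
```

Only two rows of the distance table are kept at a time, so memory use grows with the shorter string rather than with the product of both lengths.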
Discourse Analysis
- ConvoKit - Cornell Conversational Analysis Toolkit [GitHub, 244 stars]
General
- Learn NLP the practical way [Blog, Nov. 2019]
Books
- Dive into Deep Learning - An interactive deep learning book with code, math, and discussions
- Natural Language Processing and Computational Linguistics - Speech, Morphology and Syntax (Cognitive Science)
- Top NLP Books to Read 2020 - Blog post by Raymong Cheng [Blog, Sep 2020]
Courses
- NLP Course | For You - Great and interactive course on NLP
- Choosing the right course for a Practical NLP Engineer
- 12 Best Natural Language Processing Courses & Tutorials to Learn Online
Tutorials
- nlp-tutorial - A list of NLP(Natural Language Processing) tutorials built on PyTorch [GitHub, 1175 stars]
- nlp-tutorial - Natural Language Processing Tutorial for Deep Learning Researchers [GitHub, 8238 stars]
- Hands-On NLTK Tutorial [GitHub, 414 stars]
- Modern Practical Natural Language Processing [GitHub, 252 stars]
- r/LanguageTechnology - NLP Reddit forum
General
- NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks by HuggingFace [GitHub, 2207 stars]
Tokenization
- tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 4274 stars]
- SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 4832 stars]
- SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 84 stars]
Data Augmentation and Weak Supervision
Libraries and Frameworks
- WildNLP - Text manipulation library to test NLP models [GitHub, 64 stars]
- snorkel - Framework to generate training data [GitHub, 4490 stars]
- NLPAug - Data augmentation for NLP [GitHub, 1638 stars]
- SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 268 stars]
- TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 1263 stars]
Blogs and Tutorials
- A Visual Survey of Data Augmentation in NLP [Blog, 2020]
- Weak Supervision: A New Programming Paradigm for Machine Learning [Blog, March 2019]
Named Entity Recognition (NER)
- Datasets for Entity Recognition [GitHub, 870 stars]
- Datasets to train supervised classifiers for Named-Entity Recognition [GitHub, 219 stars]
- Bootleg - Self-Supervision for Named Entity Disambiguation at the Tail [GitHub, 81 stars]
Relation Extraction
- tacred-relation - TACRED: position-aware attention model for relation extraction [GitHub, 268 stars]
- tacrev - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 34 stars]
- tac-self-attention - Relation extraction with position-aware self-attention [GitHub, 57 stars]
Domain Adaptation
- Neural Adaptation in Natural Language Processing - curated list [GitHub, 135 stars]
Low Resource NLP
- CMU LTI Low Resource NLP Bootcamp 2020 - CMU Language Technologies Institute low resource NLP bootcamp 2020 [GitHub, 463 stars]
Spell Correction
- NeuSpell - A Neural Spelling Correction Toolkit [GitHub, 76 stars]
- SymSpellPy - Python port of SymSpell [GitHub, 412 stars]
- Speller100 by Microsoft [Blog, Feb 2021]
Automata Theory for NLP
- pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 585 stars]
Obscene words detection
- LDNOOBW - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 1275 stars]
Reinforcement Learning for NLP
- nlp-gym - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 80 stars]
AutoML
- TPOT - Python Automated Machine Learning tool [GitHub, 7835 stars]
- Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 1154 stars]
- HungaBunga - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 610 stars]
- AutoML Natural Language - Google's paid AutoML NLP service
License
CC0
Attributions
Resources
- All linked resources belong to original authors
Icons
- Akropolis by parkjisun from the Noun Project
- Book of Ester by Gilad Sotil from the Noun Project
- quill by Juan Pablo Bravo from the Noun Project
- acting by Flatart from the Noun Project
- olympic by supalerk laipawat from the Noun Project
- aristocracy by Eucalyp from the Noun Project
- Horn by Eucalyp from the Noun Project
- temple by Eucalyp from the Noun Project
- constellation by Eucalyp from the Noun Project
- ancient greek round pattern by Olena Panasovska from the Noun Project
- Harp by Vectors Point from the Noun Project
- Atlas by parkjisun from the Noun Project
- Parthenon by Eucalyp from the Noun Project
- papyrus by IconMark from the Noun Project
- papyrus by Smalllike from the Noun Project
- pegasus by Saeful Muslim from the Noun Project
Fonts