oroszgy / Awesome Hungarian Nlp

A curated list of NLP resources for Hungarian

Awesome NLP Resources for Hungarian

A curated list of free resources dedicated to Hungarian Natural Language Processing

Maintainers - György Orosz

Table of contents

  1. Tools
  2. Datasets
  3. Journals / Conferences / Institutes / Events
  4. Courses / Tutorials
  5. Blogs / Communities

1. Tools

Notations:

  • 👌 Easy to install and use
  • 🚀 Commercial-friendly license
  • 💯 Pretrained models are available or not needed

Word tokenization, sentence splitting

  • huntoken 👌🚀💯 Hungarian word and sentence splitter
  • quntoken 👌🚀💯 New Hungarian tokenizer based on quex, huntoken

Morphology

  • emMorph (Humor) 💯 Hungarian morphological analyzer based on Humor
  • emMorphPy 👌💯 A wrapper, lemmatizer and REST API implemented in Python for the emMorph (Humor) Hungarian morphological analyzer
  • hunmorph 🚀💯 is an open source tool and programming library for spell-checking, stemming and morphological analysis of agglutinative, German and other languages.
  • hunmorph-foma 🚀💯 Hungarian morphological analyzer and generator based on hunmorph.
  • hunspell 👌🚀💯 is an open-source spell-checker, stemmer and morphological analyzer
  • lara-hungarian-nlp 👌🚀💯 LARA is a lightweight Python NLP library for ChatBots in Hungarian.
  • Lemmagen 👌🚀💯 project aims to provide a standardized open source multilingual platform for lemmatisation. (Python package for v2 | C# project for v3)

PoS / Morphological taggers

  • hunpos 👌🚀💯 Hunpos is an open source reimplementation of TnT, the well known part-of-speech tagger by Thorsten Brants.
  • PurePos 👌🚀 Open source morphological tagger based on HunPos
  • purepos.py 👌🚀 Python wrapper for PurePos

Taggers / Chunkers

  • HunTag 👌🚀 A sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models
  • HunTag3 👌🚀 Improved version of the original HunTag
  • SzegedNER 👌🚀💯 Named Entity Recognition tool for Hungarian and English
  • DBpedia Spotlight 👌🚀💯 DBpedia Spotlight is a tool for automatically annotating mentions of DBpedia resources in text. (Docker image)
  • emBERT 👌🚀💯 is an emtsv module for pre-trained Transformer-based models. It provides tagging models based on Huggingface's transformers package.

Pipelines with Hungarian NLP components

  • magyarlanc 👌💯 A toolkit for the basic linguistic processing of Hungarian
  • magyarlanc_spark 👌💯 Spark wrapper for magyarlanc
  • spaCy 👌🚀💯 Industrial-strength Natural Language Processing (NLP) with Python and Cython (Hungarian models)
  • huNLP 👌💯 Unified Java and REST API for magyarlanc and szegedNER
  • hunlp-GATE 💯 GATE plugin containing Hungarian NLP tools as GATE processing resources
  • Trendminer Hungarian Processing Pipeline 🚀 Hungarian NLP pipeline for social media text analysis (TrendMiner project)
  • Google Syntaxnet 🚀💯 Neural Models of Syntax
  • UDPipe 👌🚀💯 is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files
  • polyglot 👌🚀💯 is a natural language pipeline that supports massive multilingual applications.
  • emtsv 👌💯 is a text processing system with inter-module communication via tsv + REST API
  • StanfordNLP 👌💯 is a Python NLP Library for Many Human Languages including Hungarian
  • spaCy StanfordNLP 👌💯 wraps the StanfordNLP library, so you can use Stanford's models as a spaCy pipeline
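
Several pipelines above exchange annotations as tab-separated text; emtsv, for instance, is described as communicating between modules via tsv. The sketch below shows one way to read such a token stream using only the Python standard library. It assumes a header row naming the columns, one token per row, and an empty line between sentences. The column names (form, lemma, xpostag), the tag values and the sample text are illustrative assumptions, not taken from any tool's actual output.

```python
# Hypothetical sample: header row, one token per line, blank line between
# sentences. Column names and tag values are placeholders for illustration.
SAMPLE_TSV = (
    "form\tlemma\txpostag\n"
    "A\ta\tDET\n"
    "kutya\tkutya\tNOUN\n"
    "ugat\tugat\tVERB\n"
    ".\t.\tPUNCT\n"
    "\n"
    "Esik\tesik\tVERB\n"
    ".\t.\tPUNCT\n"
)


def read_tsv_sentences(text):
    """Yield each sentence as a list of {column name: value} dicts."""
    lines = text.splitlines()
    header = lines[0].split("\t")
    sentence = []
    for line in lines[1:]:
        if not line.strip():
            # An empty line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(header, line.split("\t"))))
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


sentences = list(read_tsv_sentences(SAMPLE_TSV))
for tokens in sentences:
    print(" ".join(token["lemma"] for token in tokens))
```

Keeping each token as a dictionary keyed by column name means a later module can add a column without breaking code that only reads the earlier ones.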

Syntactic parsers

  • hunpars 🚀💯 A rule based Hungarian syntactical analyzer
  • HunParse 🚀💯 An NLTK-based parser using KR-style morphological annotation
  • Anagramma Parser A parser based on psycholinguistic principles
  • benepar A high-accuracy parser with models for 11 languages, implemented in Python. Based on Constituency Parsing with a Self-Attentive Encoder from ACL 2018.

Semantic analysis

  • SentimentAnalysisHUN 👌🚀💯 is an open-source sentiment analysis tool for Hungarian language, written in Python.
  • hun-date-parser 👌🚀💯 A tool for extracting datetime intervals from Hungarian sentences and turning datetime objects into Hungarian text.

Other

  • emLam 👌🚀💯 Preprocessing scripts for Hungarian Language Modeling
  • pywnxml 👌🚀💯 Python3 API for WordNet XML (Hungarian WordNet / BalkaNet / VisDic format)
  • Hun-appointment-chatbot 👌🚀💯 A simple Hungarian chatbot for booking an appointment using the Rasa framework.
  • neural-punctuator Automatic punctuation restoration with BERT models for English and Hungarian

2. Datasets

Corpora

  • Hungarian Webcorpus With over 1.48 billion words unfiltered (589 million words fully filtered), this is by far the largest Hungarian language corpus, and unlike the Hungarian National Corpus (125 million words), it is available in its entirety under a permissive Open Content license.
  • Hungarian Webcorpus 2.0 The new version of the Hungarian Webcorpus was built from Common Crawl and includes a little over 9 billion words.
  • OSCAR is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. (2339 million unique words)
  • emLam A Language Modeling Benchmark Corpus for Hungarian, similar to the One Billion Word corpus (Chelba, 2014) for English.
  • Leipzig corpora contain randomly selected sentences in the language of the corpus and are available in sizes from 10,000 sentences up to 1 million sentences. The sources are either newspaper texts or texts randomly collected from the web.
  • web2corpus Automatically create multilingual web corpus
  • CoNLL 2017: Automatically Annotated Raw Texts and Word Embeddings Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts in 45 languages, generated by UDPipe, together with word embeddings of dimension 100 computed from lowercased texts by word2vec
  • OpinHuBank OpinHuBank is a human-annotated corpus to aid the research of opinion mining and sentiment analysis in Hungarian
  • The Hungarian forum corpus for Opinion Mining This database is the first one dedicated to Opinion Mining in Hungarian. The data for further processing were gathered from the posts of the forum topic of the Hungarian government portal dealing with the referendum about dual citizenship.
  • Szeged Treebank The Szeged Treebank is the largest fully manually annotated treebank of the Hungarian language
  • Szeged Dependency Treebank The Szeged Dependency Treebank is a dependency-tree format version of the Szeged Treebank.
  • Universal Dependencies
  • Hungarian Named Entity Corpora The Named Entity Corpus for Hungarian is a subcorpus of the Szeged Treebank, which contains full syntactic annotations done manually by linguist experts.
  • NerKor is a gold standard named entity annotated corpus containing 1 million tokens.
  • hunNERwiki a silver standard corpus for Hungarian Named Entity Recognition
  • Mazsola database contains 28M sentences from the MNSZ1 corpus annotated with shallow syntactic analysis
  • Hungarian word sense disambiguated corpus containing 39 suitable word form samples for the purpose of word sense disambiguation
  • HunLearner is a learners' corpus of Hungarian containing written data from 35 students majoring in Hungarian studies at the University of Zagreb, Croatia. Texts were morphologically and syntactically analyzed by the magyarlanc tool.
  • Hunglish Corpus The Hunglish Corpus is a free sentence-aligned Hungarian-English parallel corpus of about 120 million words in 4 million sentence pairs.
  • SzegedParallel The English-Hungarian parallel corpus contains texts selected on the basis of grammatical and translational criteria.
  • HunOr A Hungarian-Russian parallel corpus comprising approximately 800 thousand words.
  • CoNLL 2017 Shared Task Hungarian data Automatic segmentation, tokenization and morphological and syntactic annotations of raw texts from the Common Crawl
  • CSS10 A Collection of Single Speaker Speech Datasets for 10 Languages including Hungarian
  • CC-100 Monolingual Datasets from Web Crawl Data
  • Hungarian-Russian Prisoner of War Database
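
Universal Dependencies treebanks and the CoNLL 2017 data listed above are distributed in the CoNLL-U format: ten tab-separated columns per token, comment lines starting with "#", and a blank line after each sentence. A minimal reader, assuming well-formed input, might look like the sketch below. The sample sentence and its annotations are hand-written for illustration and do not come from any of the corpora.

```python
# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)

# Hand-written illustrative sentence, not taken from a real treebank.
SAMPLE_CONLLU = (
    "# text = A kutya ugat.\n"
    "1\tA\ta\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tkutya\tkutya\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tugat\tugat\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)


def read_conllu(text):
    """Yield each sentence as a list of token dicts keyed by column name."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line terminates the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comment such as "# text = ..."; skipped here.
            continue
        else:
            columns = line.split("\t")
            if len(columns) != len(CONLLU_FIELDS):
                raise ValueError(f"expected 10 columns, got {len(columns)}")
            # Multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1")
            # are kept as ordinary rows in this sketch.
            sentence.append(dict(zip(CONLLU_FIELDS, columns)))
    if sentence:
        yield sentence


parsed = list(read_conllu(SAMPLE_CONLLU))
roots = [tok["form"] for tok in parsed[0] if tok["head"] == "0"]
print(roots)
```

For real projects, a maintained CoNLL-U parsing library is likely a better choice than this sketch, since it will also handle multiword tokens, empty nodes and feature parsing.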

Word vectors

Contextualized Embeddings

  • ELMo Representations Deep contextualized word representation trained for many languages
  • huBERT Hungarian BERT base models trained on Webcorpus 2.0 and the Hungarian Wikipedia

Linguistic Resources

  • morphdb.hu is an open source morphological database of Hungarian, consisting of a lexicon and morphological grammar that are based on well-founded theoretical decisions.
  • huwn Hungarian Wordnet
  • Hungarian Sentiment Lexicon The dictionaries were manually created on the basis of Wordnet-Affect lexicons.
  • 4lang Concept dictionary using Eilenberg machines
  • Named Entity lists for Hungarian
  • Mazsola ISZ lists 500K verb frames extracted from the Mazsola database
  • Manocska merges verb frames from existing databases
  • PrevLex List of phrasal verbs
  • panmorph Tagsets and description of Hungarian morphological analysers.
  • hun_ner_checklist CHECKLIST diagnostic test cases for Hungarian Named Entity Recognition

Linked Open Data

Speech

3. Journals / Conferences / Institutes / Events

Journals

Conferences

Institutes

4. Courses / Tutorials

TBD

5. Blogs / Communities
