mideind / Greynir

Licence: gpl-3.0
The greynir.is natural language processing website for Icelandic

Greynir

Natural Language Processing for Icelandic

Greynir is a natural language processing engine that extracts processable information from Icelandic text, allows natural language querying of that information and facilitates natural language understanding. Greynir is the core of Embla, a voice-driven virtual assistant app for smartphones and tablets.

Try Greynir (in Icelandic) at https://greynir.is

Greynir periodically scrapes chunks of text from Icelandic news sites on the web. It employs the Tokenizer and GreynirPackage modules (by the same authors) to tokenize the text and parse the token streams according to a hand-written context-free grammar for the Icelandic language. The resulting parse forests are disambiguated using scoring heuristics to find the best parse trees. The trees are then stored in a database and processed by grammatical pattern matching modules to obtain statements of fact and relations between stated facts.
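The disambiguation step can be pictured with a small sketch. It assumes a simplified forest in which each node carries a heuristic score and a list of alternative child sequences; the `ForestNode` class, the scores and the `best_tree` function are hypothetical and do not reflect Greynir's actual data structures or heuristics.

```python
from dataclasses import dataclass, field


@dataclass
class ForestNode:
    label: str
    score: int = 0  # hypothetical heuristic score of this node
    # Each alternative is one possible list of child nodes.
    alternatives: list = field(default_factory=list)


def best_tree(node):
    """Return (total_score, tree), keeping one alternative per node.

    A tree is represented as (label, [subtrees]).
    """
    if not node.alternatives:
        return node.score, (node.label, [])
    candidates = []
    for children in node.alternatives:
        reduced = [best_tree(child) for child in children]
        total = node.score + sum(score for score, _ in reduced)
        candidates.append((total, [tree for _, tree in reduced]))
    total, subtrees = max(candidates, key=lambda c: c[0])
    return total, (node.label, subtrees)


# An ambiguous word that could be read as a noun or as a verb.
noun = ForestNode("noun", score=2)
verb = ForestNode("verb", score=-1)
root = ForestNode(
    "S",
    alternatives=[
        [ForestNode("NP", alternatives=[[noun]])],
        [ForestNode("VP", alternatives=[[verb]])],
    ],
)
total, tree = best_tree(root)
print(total, tree)
```

The sketch picks the highest-scoring alternative bottom-up; a real reducer would also need to handle shared subtrees efficiently.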

An overview of the technology behind Greynir can be found in the paper "A Wide-Coverage Context-Free Grammar for Icelandic and an Accompanying Parsing System" by Vilhjálmur Þorsteinsson, Hulda Óladóttir and Hrafn Loftsson (Proceedings of Recent Advances in Natural Language Processing, pages 1397–1404, Varna, Bulgaria, Sep 2–4, 2019).

A parse tree as displayed by Greynir. Nouns and noun phrases are blue, verbs and verb phrases are red, adjectives are green, prepositional and adverbial phrases are grey, etc.

Greynir is most effective for text that is objective and factual, i.e. has a relatively high ratio of concrete concepts such as numbers, amounts, dates, person and entity names, etc.

Greynir is innovative in its ability to parse and disambiguate text written in a morphologically rich language, namely Icelandic, which does not lend itself easily to statistical parsing methods. Greynir uses grammatical feature agreement (cases, genders, persons, number (singular/plural), verb tenses, moods, etc.) to guide and disambiguate parses. Its highly optimized Earley-based parser, implemented in C++, is fast and compact enough to make real-time, while-you-wait analysis of web pages, as well as bulk processing, feasible.
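As a rough illustration of feature agreement, the sketch below keeps only those combinations of adjective and noun readings that share case, gender and number. The `Reading` type, the feature labels and the hand-picked readings for "góða hesta" are assumptions made for this example; Greynir itself enforces agreement through its grammar rather than through a function like this.

```python
from itertools import product
from typing import NamedTuple


class Reading(NamedTuple):
    lemma: str
    case: str
    gender: str
    number: str


def agreeing_pairs(adjective_readings, noun_readings):
    """Keep (adjective, noun) pairs that agree in case, gender and number."""
    return [
        (adj, noun)
        for adj, noun in product(adjective_readings, noun_readings)
        if (adj.case, adj.gender, adj.number)
        == (noun.case, noun.gender, noun.number)
    ]


# A hand-picked subset of possible readings, for illustration only.
adjective = [
    Reading("góður", "acc", "masc", "pl"),
    Reading("góður", "acc", "fem", "sg"),
]
noun = [
    Reading("hestur", "acc", "masc", "pl"),
    Reading("hestur", "gen", "masc", "pl"),
]

pairs = agreeing_pairs(adjective, noun)
for adj, n in pairs:
    print(adj.case, adj.gender, adj.number)
```

Of the four possible combinations, only the accusative masculine plural reading survives, which is how agreement narrows down an otherwise ambiguous phrase.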

Greynir's goal is to "understand" text to a usable extent by parsing it into structured, recursive trees that directly correspond to the original grammar. These trees can then be further processed and acted upon by sets of Python functions that are linked to grammar nonterminals.
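The idea of linking Python functions to grammar nonterminals can be sketched as follows. The `Node` class, the `handles` decorator, the nonterminal names and the sample sentence are all hypothetical; Greynir's processor modules use their own conventions, so this should be read as an illustration of the pattern only.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str  # nonterminal name, or a terminal category for leaves
    children: list = field(default_factory=list)
    text: str = ""  # surface text, for leaf nodes


HANDLERS = {}


def handles(nonterminal):
    """Register a function to be called for every node with this label."""
    def register(func):
        HANDLERS[nonterminal] = func
        return func
    return register


def process(node):
    """Visit the tree bottom-up, passing child results to each handler."""
    results = [process(child) for child in node.children]
    handler = HANDLERS.get(node.label)
    if handler is not None:
        return handler(node, results)
    return node.text if not node.children else results


@handles("PersonName")
def person_name(node, results):
    return ("name", " ".join(results))


@handles("Title")
def title(node, results):
    return ("title", " ".join(results))


@handles("Sentence")
def sentence(node, results):
    return dict(r for r in results if isinstance(r, tuple))


tree = Node("Sentence", [
    Node("PersonName", [Node("word", text="Karl"), Node("word", text="Helgason")]),
    Node("Title", [Node("word", text="ritari")]),
])
facts = process(tree)
print(facts)
```

Because handlers are looked up by nonterminal name, supporting a new kind of fact only requires registering another function.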

Greynir is currently able to parse about 90% of sentences in a typical news article from the web, and many well-written articles can be parsed completely. It presently has more than 600,000 parsed articles in its database, containing over 11 million parsed sentences. A recent version of this database is available via the GreynirCorpus project.

Greynir supports natural language querying of its databases. Users can ask about person names, titles and entity definitions and get appropriate replies. The HTML5 Web Speech API is supported to allow queries to be recognized from speech in enabled browsers, such as recent versions of Chrome. Similarity queries are also supported, yielding articles that are similar in content to a given search phrase or sentence.

Greynir may in due course be expanded, for instance:

  • to make logical inferences from statements in its database;
  • to find statements supporting or refuting a thesis; and/or
  • to discover contradictions between statements.

Implementation

Greynir is written in Python 3, except for its core Earley-based parser module, which is written in C++ and called via CFFI. Greynir requires Python 3.6 or later and runs on both CPython and PyPy, with the latter recommended for performance reasons.

Greynir works in stages, roughly as follows:

  1. Web scraper, built on BeautifulSoup and SQLAlchemy storing data in PostgreSQL.
  2. Tokenizer (the Tokenizer package), extended to use the BÍN database of Icelandic word forms for lemmatization and initial part-of-speech tagging.
  3. Parser (from GreynirPackage), using an improved version of the Earley algorithm to parse text according to an unconstrained hand-written context-free grammar for Icelandic that may yield multiple parse trees (a parse forest) in case of ambiguity.
  4. Parse forest reducer with heuristics to find the best parse tree.
  5. Information extractor that maps a parse tree via its grammar constituents to plug-in Python functions.
  6. Article indexer that transforms articles from bags-of-words to fixed-dimensional topic vectors using Tf-Idf and Latent Semantic Analysis.
  7. Query processor that supports a range of natural language queries (including queries about entities in Greynir's database).
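Stage 6 in the list above can be illustrated with a minimal Tf-Idf sketch in plain Python. It assumes non-empty, already tokenized documents and produces sparse term-weight dictionaries; the Latent Semantic Analysis step that reduces these to fixed-dimensional topic vectors is omitted, and the function names and sample documents are made up for this example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Map each tokenized document to a sparse {term: weight} vector."""
    n_docs = len(documents)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    ["bank", "raises", "rates"],
    ["bank", "lowers", "rates"],
    ["team", "wins", "match"],
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # positive: shared terms
print(cosine(vectors[0], vectors[2]))  # zero: no shared terms
```

Comparing vectors with cosine similarity is one common way to rank articles by how close they are to a search phrase.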

Greynir has an embedded web server that displays news articles recently scraped into its database, as well as names of people extracted from those articles along with their titles. The web UI enables the user to type in any URL and have Greynir scrape it, tokenize it and display the result as a web page. Queries can also be entered via the keyboard or using voice input. The server runs on the Flask framework, implements WSGI and can for instance be plugged into Gunicorn and nginx.

The tokenizer divides text chunks into sentences and recognizes entities such as dates, numbers, amounts and person names, as well as common abbreviations and punctuation.
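A toy version of this behaviour is sketched below. It does not use the API of the real Tokenizer package; the token kinds, the regular expressions and the tiny abbreviation list are assumptions chosen to keep the example short, and the real tokenizer covers far more cases.

```python
import re

ABBREVIATIONS = {"t.d.", "o.s.frv."}  # small illustrative list
DATE_RE = re.compile(r"\d{1,2}\.\d{1,2}\.\d{4}")  # e.g. 17.6.1944
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")  # e.g. 1.500 or 3,5


def classify(chunk):
    if DATE_RE.fullmatch(chunk):
        return "DATE"
    if NUMBER_RE.fullmatch(chunk):
        return "NUMBER"
    return "WORD"


def tokenize(text):
    """Yield (kind, text) pairs for whitespace-separated chunks."""
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            yield ("ABBREV", chunk)
            continue
        # Peel trailing punctuation off the chunk: "1944." -> "1944", "."
        core = chunk.rstrip(".,;:!?")
        if core:
            yield (classify(core), core)
        for ch in chunk[len(core):]:
            yield ("PUNCT", ch)


def split_sentences(tokens):
    """Group tokens into sentences, ending each at '.', '!' or '?'."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token[0] == "PUNCT" and token[1] in ".!?":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


text = "Fundurinn var 17.6.1944. Kaffi kostar t.d. 1.500 kr."
sentences = split_sentences(tokenize(text))
for sentence in sentences:
    print(sentence)
```

Note how the abbreviation "t.d." is kept as a single token, so its periods do not end the sentence.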

Grammar rules are laid out in a separate text file, Greynir.grammar, which is a part of GreynirPackage. The standard Backus-Naur form has been augmented with repeat specifiers for right-hand-side tokens (* for 0..n instances, + for 1..n instances, or ? for 0..1 instances). Also, the grammar allows for compact specification of rules with variants, for instance due to cases, numbers and genders. Thus, a single rule (e.g. NounPhrase/case/gender → Adjective/case noun/case/gender) is automatically expanded into multiple rules (12 in this case, 4 cases x 3 genders) with appropriate substitutions for right-hand-side tokens depending on their local variants.
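The expansion of variants can be sketched in a few lines of Python. The English variant labels, the underscore-joined naming of expanded symbols and the `expand_rule` function are assumptions for illustration, not the notation or code used in GreynirPackage. The sketch also assumes that every variant used on the right-hand side appears on the left-hand side.

```python
from itertools import product

# Assumed variant values; the labels are illustrative.
VARIANTS = {
    "case": ["nom", "acc", "dat", "gen"],
    "gender": ["masc", "fem", "neut"],
}


def expand_rule(lhs, rhs):
    """Expand a rule with /variant suffixes into concrete rules."""
    _, *lhs_variants = lhs.split("/")
    expanded = []
    for values in product(*(VARIANTS[v] for v in lhs_variants)):
        binding = dict(zip(lhs_variants, values))

        def concrete(symbol):
            # Substitute each /variant with the value bound for this rule.
            name, *variants = symbol.split("/")
            return "_".join([name] + [binding[v] for v in variants])

        expanded.append((concrete(lhs), [concrete(s) for s in rhs]))
    return expanded


rules = expand_rule(
    "NounPhrase/case/gender",
    ["Adjective/case", "noun/case/gender"],
)
for left, right in rules:
    print(left, "->", " ".join(right))
```

Each right-hand-side symbol receives only the variants it declares, so `Adjective/case` expands to four distinct symbols while the rule as a whole expands to twelve.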

The parser is an optimized C++ implementation of an Earley parser as enhanced by Scott and Johnstone, referencing Tomita. It parses ambiguous grammars without restriction and returns a compact Shared Packed Parse Forest (SPPF) of parse trees. If a parse fails, it identifies the token at which no parse was available.
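For readers unfamiliar with Earley parsing, here is a minimal recognizer in Python. It is the textbook algorithm only: it does not build an SPPF, does not include the Scott and Johnstone enhancements, and assumes a grammar without empty productions. The grammar, the token categories and the `earley_recognize` function are made up for this sketch. Like the parser described above, it reports the position of the token at which no parse was available.

```python
def earley_recognize(grammar, start, tokens):
    """Return (True, None) on success, else (False, failing_position)."""
    # An item is (lhs, rhs, dot, origin); chart[i] holds items ending at i.
    chart = [set() for _ in range(len(tokens) + 1)]
    chart[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i in range(len(tokens) + 1):
        worklist = list(chart[i])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            if dot < len(rhs):
                symbol = rhs[dot]
                if symbol in grammar:
                    # Predict: start every alternative of the nonterminal.
                    new_items = [(symbol, alt, 0, i) for alt in grammar[symbol]]
                elif i < len(tokens) and tokens[i] == symbol:
                    # Scan: the terminal matches the current token.
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
                    continue
                else:
                    continue
            else:
                # Complete: advance items that were waiting for this lhs.
                new_items = [
                    (l, r, d + 1, o)
                    for (l, r, d, o) in chart[origin]
                    if d < len(r) and r[d] == lhs
                ]
            for item in new_items:
                if item not in chart[i]:
                    chart[i].add(item)
                    worklist.append(item)
        if i < len(tokens) and not chart[i + 1]:
            return False, i  # token i could not be scanned
    accepted = any(
        lhs == start and dot == len(rhs) and origin == 0
        for lhs, rhs, dot, origin in chart[len(tokens)]
    )
    return accepted, (None if accepted else len(tokens))


# An ambiguous, left-recursive toy grammar over token categories.
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("noun",), ("det", "noun"), ("NP", "PP")],
    "VP": [("verb", "NP"), ("VP", "PP")],
    "PP": [("prep", "NP")],
}

ok = earley_recognize(
    GRAMMAR, "S", ["noun", "verb", "det", "noun", "prep", "det", "noun"]
)
bad = earley_recognize(GRAMMAR, "S", ["noun", "verb", "prep"])
print(ok, bad)
```

The first input is ambiguous (the prepositional phrase can attach to the noun phrase or the verb phrase), yet the recognizer handles it without backtracking. The second input fails at position 2, the unexpected preposition.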

The Greynir scraper is typically run in a cron job every 30 minutes to extract articles automatically from the web, parse them and store the resulting trees in a PostgreSQL database for further processing.

Scraper modules for new websites are plugged in by adding Python code to the scrapers/ directory. Currently, the scrapers/default.py module supports a wide range of popular Icelandic news sites.

Processor modules can be plugged into Greynir by adding Python code to the processors/ directory. The demo in processors/default.py extracts person names and titles from parse trees for storage in a database table.

Query (question answering) modules can be plugged into Greynir by adding Python code to the queries/ directory. Reference implementations for several query types can be found in that directory, for instance queries/builtin.py which supports questions about people and titles. Example query modules can be viewed in queries/examples.
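The plug-in pattern shared by the scrapers/, processors/ and queries/ directories can be sketched with the standard library. This is a generic illustration, assuming that each plug-in is a standalone .py file; the `load_plugins` function and the `handle` function in the sample plug-in are hypothetical and do not describe how Greynir actually discovers or calls its modules.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_plugins(directory):
    """Import every .py file in a directory; return {module_name: module}."""
    modules = {}
    for path in sorted(Path(directory).glob("*.py")):
        if path.name.startswith("_"):
            continue  # skip __init__.py and private helpers
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        modules[path.stem] = module
    return modules


# Demonstration with a throwaway directory containing one plug-in.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "example.py").write_text(
        "def handle(text):\n    return text.upper()\n"
    )
    plugins = load_plugins(tmp)

print(plugins["example"].handle("greynir"))
```

With this kind of loader, adding a new module is a matter of dropping a file into the directory; no central registry needs to be edited.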

File details

  • main.py: WSGI web server application and main module for command-line invocation
  • routes/*.py: Routes for the web application
  • query.py: Natural language query processor
  • queries/*.py: Question answering modules
  • db/*.py: Database models and functions via SQLAlchemy
  • scraper.py: Web scraper, collecting articles from a set of pre-selected websites (roots)
  • scrapers/*.py: Scraper code for various websites
  • settings.py: Management of global settings and configuration data
  • config/Greynir.conf: Editable configuration file
  • fetcher.py: Utility classes for fetching articles given their URLs
  • nertokenizer.py: A layer on top of the tokenizer for named entity recognition
  • processor.py: Information extraction from parse trees and token streams
  • article.py: Representation of an article through its life cycle
  • tree.py: Representation of parse trees for processing
  • vectors/builder.py: Article indexer and LSA topic vector builder
  • doc.py: Extract plain text from various document formats
  • geo.py: Geography and location-related utility functions
  • speech.py: Speech synthesis-related utility functions
  • tools/*.py: Various command line tools
  • util.py: Various utility functions

Installation and setup

Running Greynir

Once you have followed the setup and installation instructions above, change to the Greynir repository and activate the virtual environment:

cd Greynir
source venv/bin/activate

You should now be able to run Greynir.

Web application

python main.py

The web application runs on localhost:5000 by default; this can be changed in config/Greynir.conf.

Web scrapers

python scraper.py

If you are running the scraper on macOS, you may run into problems with Python's fork(). This can be fixed by setting the following environment variable in your shell:

export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

Interactive shell

You can launch an IPython REPL shell with a database session (s), the Greynir parser (r) and all SQLAlchemy database models preloaded. See Using the Greynir Shell for instructions.

Contributing

See Contributing to Greynir.

Copyright and licensing

Greynir is Copyright (C) 2021 Miðeind ehf. The original author of this software is Vilhjálmur Þorsteinsson.

This set of programs is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This set of programs is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

The full text of the GNU General Public License v3 is included with this software and is also available at https://www.gnu.org/licenses/gpl-3.0.html.

If you wish to use this set of programs in ways that are not covered under the GNU GPLv3 license, please contact us at [email protected] to negotiate a custom license. This applies for instance if you want to include or use this software, in part or in full, in other software that is not licensed under GNU GPLv3 or other compatible licenses.


Greynir uses the official BÍN (Beygingarlýsing íslensks nútímamáls) lexicon and database of Icelandic word forms to identify words and find their potential meanings and lemmas. The database is included in GreynirPackage in compressed form. BÍN is licensed under CC-BY-4.0, and credit is hereby given as follows:

Beygingarlýsing íslensks nútímamáls. Stofnun Árna Magnússonar í íslenskum fræðum. Höfundur og ritstjóri Kristín Bjarnadóttir.
