All Projects → onlyuser → NatLang

onlyuser / NatLang

Licence: other
NatLang is an English parser with an extensible grammar

Programming Languages

C++
36643 projects - #6 most used programming language
Yacc
648 projects
Makefile
30231 projects
Lex
420 projects
shell
77523 projects
c
50402 projects - #5 most used programming language

Projects that are alternatives of or similar to NatLang

language-grammars
Syntax highlighting for ABNF/BNF/EBNF, Yacc, and other language-related languages.
Stars: ✭ 14 (-30%)
Mutual labels:  flex, bnf
bit-css
用原子类赋予元素属性,减少甚至不写css
Stars: ✭ 19 (-5%)
Mutual labels:  flex
verbecc
Complete Conjugation of any Verb using Machine Learning for French, Spanish, Portuguese, Italian and Romanian
Stars: ✭ 45 (+125%)
Mutual labels:  linguistics
libpalaso
Palaso Library: A set of .Net libraries useful for developers of Language Software.
Stars: ✭ 36 (+80%)
Mutual labels:  linguistics
teaching
Teaching material relevant to KGs
Stars: ✭ 61 (+205%)
Mutual labels:  semantic-web
bioportal web ui
A Rails application for biological ontologies
Stars: ✭ 20 (+0%)
Mutual labels:  semantic-web
ontobio
python library for working with ontologies and ontology associations
Stars: ✭ 104 (+420%)
Mutual labels:  semantic-web
icon-generator
Generate icons and launch screens for your Adobe AIR projects.
Stars: ✭ 24 (+20%)
Mutual labels:  flex
simpPRU
Intuitive language for PRU which compiles down to PRU C
Stars: ✭ 54 (+170%)
Mutual labels:  flex
Wordbook
Wordbook is a dictionary application built for GNOME.
Stars: ✭ 56 (+180%)
Mutual labels:  wordnet
expletives
Expletives vomiting library...
Stars: ✭ 12 (-40%)
Mutual labels:  linguistics
flexui
A light UI for C++ with XML and CSS support
Stars: ✭ 21 (+5%)
Mutual labels:  flex
textaugment
TextAugment: Text Augmentation Library
Stars: ✭ 280 (+1300%)
Mutual labels:  wordnet
Processor
Ontology-driven Linked Data processor and server for SPARQL backends. Apache License.
Stars: ✭ 54 (+170%)
Mutual labels:  semantic-web
eliza-rs
A rust implementation of ELIZA - a natural language processing program developed by Joseph Weizenbaum in 1966.
Stars: ✭ 48 (+140%)
Mutual labels:  linguistics
langua
A suite of language tools
Stars: ✭ 29 (+45%)
Mutual labels:  linguistics
corese
Software platform implementing and extending the standards of the Semantic Web.
Stars: ✭ 55 (+175%)
Mutual labels:  semantic-web
mlconjug3
A Python library to conjugate verbs in French, English, Spanish, Italian, Portuguese and Romanian (more soon) using Machine Learning techniques.
Stars: ✭ 47 (+135%)
Mutual labels:  linguistics
semantic-python-overview
(subjective) overview of projects which are related both to python and semantic technologies (RDF, OWL, Reasoning, ...)
Stars: ✭ 406 (+1930%)
Mutual labels:  semantic-web
bnf-mode
A GNU Emacs major mode for editing BNF grammars
Stars: ✭ 34 (+70%)
Mutual labels:  bnf

Build Status

NatLang

Copyright (C) 2011-2017 mailto:[email protected]

About

NatLang is an English parser with an extensible grammar. It generates abstract syntax trees for all possible interpretations of an English sentence accepted by a grammar. The algorithm is completely deterministic. No training data is required.

This project is no longer being maintained. See new version here: parse-english

It works as follows:

  1. The user inputs a sentence.
  2. WordNet identifies one or more POS identities for each word in the sentence.
  3. All POS configurations of the sentence are evaluated using the Yacc-generated parser.
  4. A parse tree is generated for each successful parse.

For example:

The sentence "eats shoots and leaves" has four interpretations depending on the POS of "shoots" and "leaves".

eats  shoots and leaves
 |     |      |   |
(V)---(V)----(C)-(V)
   \        /   \
    *-(N)--*     (N)

Path #1: {V V C V}
Path #2: {V V C N}
Path #3: {V N C V}
Path #4: {V N C N}

The sentence "flying saucers are dangerous" has two interpretations depending on the POS of "flying".

flying saucers are   dangerous
 |      |       |     |
(Adj)--(N)-----(Aux)-(Adj)
      /
(V)--*

Path #1: {Adj N Aux Adj}
Path #2: {V   N Aux Adj}

TODO: Additional analyses passes can be applied to the generated trees for further processing.

A Motivating Example

input:

the quick brown fox jumps over the lazy dog

output:

picture alt

best solution (5th tree from left to right):

picture alt

Strategy for Eliminating Grammar Ambiguity

English is a non-context-free language. That means the same word used in different contexts can have different meanings. But Yacc cannot interpret the same lexer terminal differently if it is represented using the same lexer terminal ID. When this happens, it is necessary to split ambiguous terminals.

To do so:

  1. Identify lexer terminal with ambiguous meaning.
  2. Identify parser rules that use the lexer terminals with ambiguous meaning, and assign to each use case a different lexer terminal ID.
  3. Take advantage of stateful lexing to return a different lexer terminal ID for the same lexer terminal.

For example:

The sentence "She and I run and he jumps and shouts." has three conjugations "and".

A possible parse tree may look like this:

                (S)
                 |
         *-------*-------*
         |       |       |
        (S)      |      (S)
         |       |       |
     *---*---*   |   *---*----*
     |       |   |   |        |
    (NP)    (VP) |  (NP)     (VP)
     |       |   |   |        |
 *---*---*   |   |   |   *----*----*
 |   |   |   |   |   |   |     |   |
(N) (C) (N) (V) (C) (N) (V)   (C) (V)
 |   |   |   |   |   |   |     |   |
She and  I  run and  he jumps and shouts.

Yacc chokes on this input due to the ambiguity of "and".

The solution is to split "and" into multiple lexer terminals, each representing a different abstraction level in the grammar.

  • C_NP for noun-part level conjugations.
  • C_VP for verb-part level conjugations.
  • C_S for sentence level conjugations.

And to try all 27 permutations to see which ones work.

She and    I  run and    he jumps and   shouts.
 |   |     |   |   |     |   |     |     |
(N) (C#1) (N) (V) (C#2) (N) (V)   (C#3) (V)

           C#1  C#2  C#3
            |    |    |
Path #1:  {C_NP C_NP C_NP} -- fail!
Path #2:  {C_NP C_NP C_VP} -- fail!
Path #3:  {C_NP C_NP C_S}  -- fail!
Path #4:  {C_NP C_VP C_NP} -- fail!
Path #5:  {C_NP C_VP C_VP} -- fail!
Path #6:  {C_NP C_VP C_S}  -- fail!
Path #7:  {C_NP C_S  C_NP} -- fail!
Path #8:  {C_NP C_S  C_VP} -- success!
...
Path #27: {C_S  C_S  C_S}  -- fail!

Here, path #8's POS configuration results in a successful parse.

She and     I  run and    he jumps and    shouts.
 |   |      |   |   |     |   |     |      |
(N) (C_NP) (N) (V) (C_S) (N) (V)   (C_VP) (V)

Usage

./app/bin/NatLang -e "the quick brown fox jumps over the lazy dog" -d | dot -Tpng -oast_fox.png

Requirements

Unix tools and 3rd party components (accessible from $PATH):

gcc flex bison wordnet valgrind cppcheck doxygen graphviz ticpp

Environment variables:

  • $INCLUDE_PATH_EXTERN -- where "ticpp/ticpp.h" resides
  • $LIB_PATH_EXTERN -- where "libticppd.a" resides

Supported Language Features

  • Present tense
  • Progressive tense
  • Past tense
  • Past perfect tense

Limitations

  • Passive voice not supported.
  • Questions not supported.
  • Conditionals not supported.
  • WordNet does not provide POS look-up for inflected verb forms and mechanical words such as prepositions, leading to a reliance on hard-coded POS definitions in the lexer for some words.
  • A brute force algorithm tries all supported interpretations of a sentence. This is slow for long sentences.
  • BNF rules are suitable for specifying constituent-based phrase structure grammars, but are a poor fit for expressing non-local dependencies.

Make Targets

target action
all make binaries
test all + run tests
pure test + use valgrind to check for memory leaks
dot test + generate .png graph for tests
lint use cppcheck to perform static analysis on .cpp files
doc use doxygen to generate documentation
xml test + generate .xml for tests
import test + use ticpp to serialize-to/deserialize-from xml
clean remove all intermediate files

References

"Part-of-speech tagging"
http://en.wikipedia.org/wiki/Part-of-speech_tagging
"Princeton WordNet"
http://wordnet.princeton.edu/
"Syntactic Theory: A Unified Approach"
ISBN: 0340706104
"Enju - A fast, accurate, and deep parser for English"
http://www.nactem.ac.uk/enju/

Keywords

Lex, Yacc, Flex, Bison, NLP, Natural Language Processing, WordNet, Part-of-Speech Tagging, Yacc Grammar for English, English parser, parsing English, Linguistics, Phrase Structure Grammar, X-bar Theory, POS, PSG, BNF, CFG, GLR
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].