NatLang

About

NatLang is an English parser with an extensible grammar. It generates abstract syntax trees for all possible interpretations of an English sentence accepted by a grammar. The algorithm is completely deterministic. No training data is required.

This project is no longer being maintained. See new version here: parse-english

It works as follows:

The user inputs a sentence.
WordNet identifies one or more POS identities for each word in the sentence.
All POS configurations of the sentence are evaluated using the Yacc-generated parser.
A parse tree is generated for each successful parse.

For example:

The sentence "eats shoots and leaves" has four interpretations depending on the POS of "shoots" and "leaves".

eats  shoots and leaves
 |     |      |   |
(V)---(V)----(C)-(V)
   \        /   \
    *-(N)--*     (N)

Path #1: {V V C V}
Path #2: {V V C N}
Path #3: {V N C V}
Path #4: {V N C N}

The sentence "flying saucers are dangerous" has two interpretations depending on the POS of "flying".

flying saucers are   dangerous
 |      |       |     |
(Adj)--(N)-----(Aux)-(Adj)
      /
(V)--*

Path #1: {Adj N Aux Adj}
Path #2: {V   N Aux Adj}

TODO: Additional analyses passes can be applied to the generated trees for further processing.

A Motivating Example

input:

the quick brown fox jumps over the lazy dog

output:

best solution (5th tree from left to right):

Strategy for Eliminating Grammar Ambiguity

English is a non-context-free language. That means the same word used in different contexts can have different meanings. But Yacc cannot interpret the same lexer terminal differently if it is represented using the same lexer terminal ID. When this happens, it is necessary to split ambiguous terminals.

To do so:

Identify lexer terminal with ambiguous meaning.
Identify parser rules that use the lexer terminals with ambiguous meaning, and assign to each use case a different lexer terminal ID.
Take advantage of stateful lexing to return a different lexer terminal ID for the same lexer terminal.

For example:

The sentence "She and I run and he jumps and shouts." has three conjugations "and".

A possible parse tree may look like this:

                (S)
                 |
         *-------*-------*
         |       |       |
        (S)      |      (S)
         |       |       |
     *---*---*   |   *---*----*
     |       |   |   |        |
    (NP)    (VP) |  (NP)     (VP)
     |       |   |   |        |
 *---*---*   |   |   |   *----*----*
 |   |   |   |   |   |   |     |   |
(N) (C) (N) (V) (C) (N) (V)   (C) (V)
 |   |   |   |   |   |   |     |   |
She and  I  run and  he jumps and shouts.

Yacc chokes on this input due to the ambiguity of "and".

The solution is to split "and" into multiple lexer terminals, each representing a different abstraction level in the grammar.

C_NP for noun-part level conjugations.
C_VP for verb-part level conjugations.
C_S for sentence level conjugations.

And to try all 27 permutations to see which ones work.

She and    I  run and    he jumps and   shouts.
 |   |     |   |   |     |   |     |     |
(N) (C#1) (N) (V) (C#2) (N) (V)   (C#3) (V)

           C#1  C#2  C#3
            |    |    |
Path #1:  {C_NP C_NP C_NP} -- fail!
Path #2:  {C_NP C_NP C_VP} -- fail!
Path #3:  {C_NP C_NP C_S}  -- fail!
Path #4:  {C_NP C_VP C_NP} -- fail!
Path #5:  {C_NP C_VP C_VP} -- fail!
Path #6:  {C_NP C_VP C_S}  -- fail!
Path #7:  {C_NP C_S  C_NP} -- fail!
Path #8:  {C_NP C_S  C_VP} -- success!
...
Path #27: {C_S  C_S  C_S}  -- fail!

Here, path #8's POS configuration results in a successful parse.

She and     I  run and    he jumps and    shouts.
 |   |      |   |   |     |   |     |      |
(N) (C_NP) (N) (V) (C_S) (N) (V)   (C_VP) (V)

Usage

./app/bin/NatLang -e "the quick brown fox jumps over the lazy dog" -d | dot -Tpng -oast_fox.png

Requirements

Unix tools and 3rd party components (accessible from $PATH):

gcc flex bison wordnet valgrind cppcheck doxygen graphviz ticpp

Environment variables:

$INCLUDE_PATH_EXTERN -- where "ticpp/ticpp.h" resides
$LIB_PATH_EXTERN -- where "libticppd.a" resides

Supported Language Features

Present tense
Progressive tense
Past tense
Past perfect tense

Limitations

Passive voice not supported.
Questions not supported.
Conditionals not supported.
WordNet does not provide POS look-up for inflected verb forms and mechanical words such as prepositions, leading to a reliance on hard-coded POS definitions in the lexer for some words.
A brute force algorithm tries all supported interpretations of a sentence. This is slow for long sentences.
BNF rules are suitable for specifying constituent-based phrase structure grammars, but are a poor fit for expressing non-local dependencies.

Make Targets

target	action
all	make binaries
test	all + run tests
pure	test + use valgrind to check for memory leaks
dot	test + generate .png graph for tests
lint	use cppcheck to perform static analysis on .cpp files
doc	use doxygen to generate documentation
xml	test + generate .xml for tests
import	test + use ticpp to serialize-to/deserialize-from xml
clean	remove all intermediate files

References

"Part-of-speech tagging": http://en.wikipedia.org/wiki/Part-of-speech_tagging
"Princeton WordNet": http://wordnet.princeton.edu/
"Syntactic Theory: A Unified Approach": ISBN: 0340706104
"Enju - A fast, accurate, and deep parser for English": http://www.nactem.ac.uk/enju/

Keywords

Lex, Yacc, Flex, Bison, NLP, Natural Language Processing, WordNet, Part-of-Speech Tagging, Yacc Grammar for English, English parser, parsing English, Linguistics, Phrase Structure Grammar, X-bar Theory, POS, PSG, BNF, CFG, GLR

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

onlyuser / NatLang

Programming Languages

Labels

Projects that are alternatives of or similar to NatLang