erickrf / Nlpnet

License: MIT
A neural network architecture for NLP tasks, using Cython for fast performance. Currently, it can perform POS tagging, SRL and dependency parsing.


Projects that are alternatives to or similar to Nlpnet

Vncorenlp
A Vietnamese natural language processing toolkit (NAACL 2018)
Stars: ✭ 354 (-6.6%)
Mutual labels:  parsing, natural-language-processing, pos-tagging
Rdrpostagger
R package for Ripple Down Rules-based Part-Of-Speech Tagging (RDRPOS), covering more than 45 languages.
Stars: ✭ 31 (-91.82%)
Mutual labels:  natural-language-processing, pos-tagging
Jcseg
Jcseg is a lightweight NLP framework developed in Java. It provides CJK and English segmentation based on the MMSEG algorithm, along with keyword extraction, key sentence extraction and summary extraction based on the TEXTRANK algorithm. Jcseg has a built-in HTTP server and search modules for the latest Lucene, Solr and Elasticsearch.
Stars: ✭ 754 (+98.94%)
Mutual labels:  natural-language-processing, pos-tagging
Vntk
Vietnamese NLP Toolkit for Node
Stars: ✭ 170 (-55.15%)
Mutual labels:  natural-language-processing, pos-tagging
Pytorch Pos Tagging
A tutorial on how to implement models for part-of-speech tagging using PyTorch and TorchText.
Stars: ✭ 96 (-74.67%)
Mutual labels:  natural-language-processing, pos-tagging
Deep Generative Models For Natural Language Processing
DGMs for NLP. A roadmap.
Stars: ✭ 185 (-51.19%)
Mutual labels:  parsing, natural-language-processing
Udpipe
R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
Stars: ✭ 160 (-57.78%)
Mutual labels:  natural-language-processing, pos-tagging
Hanlp
Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyphrase extraction, automatic summarization, text classification and clustering, pinyin and simplified/traditional Chinese conversion, natural language processing.
Stars: ✭ 24,626 (+6397.63%)
Mutual labels:  natural-language-processing, pos-tagging
Python Tutorial Notebooks
Python tutorials as Jupyter Notebooks for NLP, ML, AI
Stars: ✭ 52 (-86.28%)
Mutual labels:  parsing, natural-language-processing
Self Attentive Parser
High-accuracy NLP parser with models for 11 languages.
Stars: ✭ 569 (+50.13%)
Mutual labels:  parsing, natural-language-processing
Malaya
Natural Language Toolkit for bahasa Malaysia, https://malaya.readthedocs.io/
Stars: ✭ 239 (-36.94%)
Mutual labels:  natural-language-processing, pos-tagging
Articutapi
API of Articut, a Chinese word segmentation tool (with semantic part-of-speech tagging). Word segmentation is the foundation of Chinese text processing. Articut uses no machine learning and no data models; relying only on modern written-Chinese grammar rules, it achieves an F1-measure above 94% and a recall above 96% on SIGHAN 2005.
Stars: ✭ 252 (-33.51%)
Mutual labels:  natural-language-processing, pos-tagging
Spacy Api Docker
spaCy REST API, wrapped in a Docker container.
Stars: ✭ 222 (-41.42%)
Mutual labels:  parsing, natural-language-processing
Nlpython
This repository contains code for Natural Language Processing in the Python scripting language. All the code relates to my book "Python Natural Language Processing".
Stars: ✭ 265 (-30.08%)
Mutual labels:  parsing, natural-language-processing
Awesome Text Generation
A curated list of recent models of text generation and application
Stars: ✭ 370 (-2.37%)
Mutual labels:  natural-language-processing
Tensorlayer Tricks
How to use TensorLayer
Stars: ✭ 357 (-5.8%)
Mutual labels:  natural-language-processing
Awesome Self Supervised Learning
A curated list of awesome self-supervised methods
Stars: ✭ 4,492 (+1085.22%)
Mutual labels:  natural-language-processing
Beginner nlp
A curated list of beginner resources in Natural Language Processing
Stars: ✭ 376 (-0.79%)
Mutual labels:  natural-language-processing
Nlp
[UNMAINTAINED] Extract values from strings and fill your structs with nlp.
Stars: ✭ 367 (-3.17%)
Mutual labels:  natural-language-processing
Reek
Code smell detector for Ruby
Stars: ✭ 3,693 (+874.41%)
Mutual labels:  parsing

.. image:: https://badges.gitter.im/Join%20Chat.svg
   :alt: Join the chat at https://gitter.im/erickrf/nlpnet
   :target: https://gitter.im/erickrf/nlpnet?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge

Gitter is a chat room for developers.

===============================================================
nlpnet --- Natural Language Processing with neural networks
===============================================================

nlpnet is a Python library for Natural Language Processing tasks based on neural networks. Currently, it performs part-of-speech tagging, semantic role labeling and dependency parsing. Most of the architecture is language independent, but some functions were specially tailored for working with Portuguese. This system was inspired by SENNA_.

.. _SENNA: http://ronan.collobert.com/senna/

Important: in order to use the trained models for Portuguese NLP, you will need to download the data from http://nilc.icmc.usp.br/nlpnet/models.html.

Dependencies
------------

nlpnet requires NLTK_ and numpy_. Additionally, it needs to download some data from NLTK. After installing it, call

.. code-block:: python

    >>> import nltk
    >>> nltk.download()

then go to the Models tab and select the Punkt tokenizer. It is used to split the text into sentences.
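If you prefer a non-interactive setup, the same data can be fetched directly; ``punkt`` is the NLTK identifier of the Punkt tokenizer models:

.. code-block:: python

    >>> import nltk
    >>> nltk.download('punkt')  # downloads only the Punkt sentence tokenizer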

Cython_ is used to generate C extensions and run faster. You probably won't need it, since the generated .c file is already provided with nlpnet, but you will need a C compiler. On Linux and Mac systems this shouldn't be a problem, but it may be on Windows, because setuptools_ requires the Microsoft C Compiler by default. If you don't have it already, it is usually easier to install MinGW_ instead and follow `the instructions here <http://docs.cython.org/src/tutorial/appendix.html>`_.

.. _NLTK: http://www.nltk.org
.. _numpy: http://www.numpy.org
.. _Cython: http://cython.org
.. _MinGW: http://www.mingw.org
.. _setuptools: http://pythonhosted.org/setuptools/

Basic usage
-----------

nlpnet can be used either as a Python library or through its standalone scripts. Both usages are explained below.

Library usage
~~~~~~~~~~~~~

You can use ``nlpnet`` as a library in Python code as follows:

.. code-block:: python

    >>> import nlpnet
    >>> tagger = nlpnet.POSTagger('/path/to/pos-model/', language='pt')
    >>> tagger.tag('O rato roeu a roupa do rei de Roma.')
    [[(u'O', u'ART'), (u'rato', u'N'), (u'roeu', u'V'), (u'a', u'ART'), (u'roupa', u'N'), (u'do', u'PREP+ART'), (u'rei', u'N'), (u'de', u'PREP'), (u'Roma', u'NPROP'), (u'.', 'PU')]]

In the example above, the ``POSTagger`` constructor receives as its first argument the directory where its trained model is located. The second argument is the two-letter language code (currently, only ``pt`` and ``en`` are supported). This only affects tokenization.

Calling an annotation tool is pretty straightforward. The provided ones are ``POSTagger``, ``SRLTagger`` and ``DependencyParser``, all of which have a ``tag`` method that receives strings with the text to be tagged (``DependencyParser`` also provides ``parse`` as an alias, which sounds more natural). The tagger splits the text into sentences and then tokenizes each one (hence the return value of the ``POSTagger`` is a list of lists).
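Since the result is a list of tagged sentences, each of which is a list of ``(token, tag)`` tuples, iterating over it is straightforward (a minimal sketch based on the output shown above):

.. code-block:: python

    >>> tagged_sents = tagger.tag('O rato roeu a roupa do rei de Roma.')
    >>> for sentence in tagged_sents:      # one inner list per sentence
    ...     for token, tag in sentence:    # (token, tag) tuples
    ...         print('%s_%s' % (token, tag))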

The output of the SRLTagger is slightly more complicated:

    >>> tagger = nlpnet.SRLTagger()
    >>> tagger.tag(u'O rato roeu a roupa do rei de Roma.')
    [<nlpnet.taggers.SRLAnnotatedSentence at 0x84020f0>]

Instead of a list of tuples, sentences are represented by instances of ``SRLAnnotatedSentence``. This class serves basically as a data holder, and has two attributes:

    >>> sent = tagger.tag(u'O rato roeu a roupa do rei de Roma.')[0]
    >>> sent.tokens
    [u'O', u'rato', u'roeu', u'a', u'roupa', u'do', u'rei', u'de', u'Roma', u'.']
    >>> sent.arg_structures
    [(u'roeu',
      {u'A0': [u'O', u'rato'],
       u'A1': [u'a', u'roupa', u'do', u'rei', u'de', u'Roma'],
       u'V': [u'roeu']})]

``arg_structures`` is a list containing all the predicate-argument structures in the sentence. The only one in this example is for the verb `roeu`. It is represented by a tuple with the predicate and a dictionary mapping semantic role labels to the tokens that constitute each argument.

Note that the verb appears as the first member of the tuple and also as the content of the label 'V' (which stands for verb). This is because some predicates are multiword expressions. In these cases, the "main" predicate word (usually the verb itself) appears as the first member of the tuple, while all the words of the predicate appear under the key 'V'.
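A short loop over these structures makes the layout concrete (a minimal sketch using only the attributes shown above):

.. code-block:: python

    >>> for predicate, arguments in sent.arg_structures:
    ...     print(predicate)                  # the "main" predicate word
    ...     for label, tokens in arguments.items():
    ...         print('  %s: %s' % (label, ' '.join(tokens)))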

Here's an example with the DependencyParser:

    >>> parser = nlpnet.DependencyParser('dependency', language='en')
    >>> parsed_text = parser.parse('The book is on the table.')
    >>> parsed_text
    [<nlpnet.taggers.ParsedSentence at 0x10e067f0>]
    >>> sent = parsed_text[0]
    >>> print(sent.to_conll())
    1       The     _       DT      DT      _       2       NMOD
    2       book    _       NN      NN      _       3       SBJ
    3       is      _       VBZ     VBZ     _       0       ROOT
    4       on      _       IN      IN      _       3       LOC-PRD
    5       the     _       DT      DT      _       6       NMOD
    6       table   _       NN      NN      _       4       PMOD
    7       .       _       .       .       _       3       P

The ``to_conll()`` method of ``ParsedSentence`` objects returns their representation in `CoNLL`_ notation. The tokens, labels and head indices are accessible through member variables:

    >>> sent.tokens
    [u'The', u'book', u'is', u'on', u'the', u'table', u'.']
    >>> sent.heads
    array([ 1,  2, -1,  2,  5,  3,  2])
    >>> sent.labels
    [u'NMOD', u'SBJ', u'ROOT', u'LOC-PRD', u'NMOD', u'PMOD', u'P']
    
The ``heads`` member variable is a numpy array. The i-th position in the array contains the index of the head of the i-th token, except for the root token, which has a head of -1. Notice that these indices are 0-based, while the ones shown in the ``to_conll()`` function are 1-based.
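As an illustration, the two conventions can be reconciled with a small loop (a sketch relying only on the ``tokens``, ``heads`` and ``labels`` attributes described here):

.. code-block:: python

    >>> for i, (token, head, label) in enumerate(zip(sent.tokens, sent.heads, sent.labels)):
    ...     head_token = '<root>' if head == -1 else sent.tokens[head]
    ...     print('%d\t%s\t%s\t%s' % (i + 1, token, head_token, label))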

.. _`CoNLL`: http://ilk.uvt.nl/conll/#dataformat

Standalone scripts
~~~~~~~~~~~~~~~~~~

nlpnet also provides scripts for tagging text, training new models and testing them. They are copied to the ``scripts`` subdirectory of your Python installation, which can be included in the system ``PATH`` variable. You can call them from the command line and provide text as input.

.. code-block:: bash

    $ nlpnet-tag.py pos --data /path/to/nlpnet-data/ --lang pt
    O rato roeu a roupa do rei de Roma.
    O_ART rato_N roeu_V a_ART roupa_N do_PREP+ART rei_N de_PREP Roma_NPROP ._PU

If ``--data`` is not given, the script will search for the trained models in the current directory. ``--lang`` defaults to ``en``. If your text is already tokenized, you may use the ``-t`` option; it assumes tokens are separated by whitespace.
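For example, assuming the script reads standard input as in the session above, pre-tokenized text could be piped in like this (a hypothetical invocation, not taken from the official documentation):

.. code-block:: bash

    $ echo "O rato roeu a roupa do rei de Roma ." | nlpnet-tag.py pos --data /path/to/nlpnet-data/ --lang pt -t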

With semantic role labeling:

.. code-block:: bash

    $ nlpnet-tag.py srl /path/to/nlpnet-data/
    O rato roeu a roupa do rei de Roma.
    O rato roeu a roupa do rei de Roma .
    roeu
        A1: a roupa do rei de Roma
        A0: O rato
        V: roeu

The first line was typed by the user, and the second one is the result of tokenization.

And dependency parsing:

.. code-block:: bash

    $ nlpnet-tag.py dependency --data dependency --lang en
    The book is on the table.
    1       The     _       DT      DT      _       2       NMOD
    2       book    _       NN      NN      _       3       SBJ
    3       is      _       VBZ     VBZ     _       0       ROOT
    4       on      _       IN      IN      _       3       LOC-PRD
    5       the     _       DT      DT      _       6       NMOD
    6       table   _       NN      NN      _       4       PMOD
    7       .       _       .       .       _       3       P

To learn more about training and testing new models, and about other functionality, refer to the documentation at http://nilc.icmc.usp.br/nlpnet

References
----------

The following references describe the design of nlpnet, as well as experiments carried out. Some improvements to the code have been implemented since their publication.

  • Fonseca, Erick and Aluísio, Sandra M. A Deep Architecture for Non-Projective Dependency Parsing. Proceedings of the NAACL-HLT Workshop on Vector Space Modeling for NLP, 2015.

  • Fonseca, Erick and Rosa, João Luís G. A Two-Step Convolutional Neural Network Approach for Semantic Role Labeling. Proceedings of the International Joint Conference on Neural Networks, 2013.

  • Fonseca, Erick, Rosa, João Luís G. and Aluísio, Sandra M. Evaluating word embeddings and a revised corpus for part-of-speech tagging in Portuguese. Journal of the Brazilian Computer Society, 2015.
