arne-cl / nltk-maxent-pos-tagger

Licence: other
maximum entropy based part-of-speech tagger for NLTK

nltk-maxent-pos-tagger

nltk-maxent-pos-tagger is a part-of-speech (POS) tagger based on Maximum Entropy (ME) principles written for NLTK. It is based on NLTK's Maximum Entropy classifier (nltk.classify.maxent.MaxentClassifier), which uses MEGAM for number crunching.

Part-of-Speech Tagging

nltk-maxent-pos-tagger uses the set of features proposed by Ratnaparkhi (1996), which are also used in his MXPOST implementation (Java).
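
To give an idea of what such a feature set looks like, here is a minimal sketch of Ratnaparkhi-style contextual features. It is not the code from mxpost.py: the function name, the feature names, the `<START>`/`<END>` padding symbols and the `rare_words` parameter are all assumptions made for illustration.

```python
def ratnaparkhi_features(words, prev_tags, i, rare_words=frozenset()):
    """Return a feature dict for words[i] (illustrative sketch only).

    words:      list of tokens in the sentence
    prev_tags:  tags already assigned to words[0:i]
    rare_words: hypothetical set of word forms treated as rare
    """
    def word_at(j):
        # Pad positions outside the sentence with boundary symbols.
        if j < 0:
            return "<START>"
        if j >= len(words):
            return "<END>"
        return words[j]

    def tag_at(j):
        return prev_tags[j] if 0 <= j < len(prev_tags) else "<START>"

    word = words[i]
    # Context features that apply to every word.
    features = {
        "prev_tag": tag_at(i - 1),
        "prev_two_tags": tag_at(i - 2) + " " + tag_at(i - 1),
        "prev_word": word_at(i - 1),
        "prev_prev_word": word_at(i - 2),
        "next_word": word_at(i + 1),
        "next_next_word": word_at(i + 2),
    }
    if word in rare_words:
        # Rare words: spelling features instead of the word identity.
        for n in range(1, min(4, len(word)) + 1):
            features["prefix_%d" % n] = word[:n]
            features["suffix_%d" % n] = word[-n:]
        features["has_number"] = any(c.isdigit() for c in word)
        features["has_uppercase"] = any(c.isupper() for c in word)
        features["has_hyphen"] = "-" in word
    else:
        # Frequent words: the word form itself is the feature.
        features["word"] = word
    return features


words = ["The", "well-known", "tagger", "works"]
feats = ratnaparkhi_features(words, ["DT"], 1, rare_words={"well-known"})
```

A feature dict of this kind is what an NLTK classifier expects as input for each token, paired with the token's tag during training.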

Installation

  1. Install Python and NLTK.

NLTK offers many data sets, which you can download and install from within a Python shell:

import nltk
nltk.download()

Download at least brown or treebank, as nltk-maxent-pos-tagger uses them for its demo() function.

  2. (Mac) Install MEGAM.

On Mac, it is easy to install MEGAM using brew:

brew tap homebrew/science
brew install megam

Usage

Have a look at the example given in the demo() function in mxpost.py. Basically, you just have to import the tagger and train it with labelled data to use it:

import mxpost
maxent_tagger = mxpost.MaxentPosTagger()
maxent_tagger.train(tagged_training_sentences)

for sentence in unlabeled_sentences:
    maxent_tagger.tag(sentence)
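
The snippet above leaves tagged_training_sentences and unlabeled_sentences undefined. The sketch below shows the shapes they would have, assuming the tagger follows the usual NLTK tagger convention (training data as a list of sentences, each a list of (word, tag) tuples; input to tag() as a list of token strings). The toy sentences and the `untag` helper are made up for illustration.

```python
def untag(tagged_sentence):
    """Drop the tags from a list of (word, tag) tuples."""
    return [word for word, _tag in tagged_sentence]


# Toy training data in the assumed NLTK format (hand-written, not from a corpus).
tagged_training_sentences = [
    [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")],
    [("Dogs", "NNS"), ("bark", "VBP"), (".", ".")],
]

# Plain token lists, as they would be passed to tag().
unlabeled_sentences = [untag(s) for s in tagged_training_sentences]
```

In practice the training sentences would come from a tagged corpus such as brown or treebank rather than from a hand-written list.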

Meta

Status: Beta. I wrote this in 2008 as a semester project for a class on NLP tools.
Licence: GPL Version 3
Original Author: Arne Neumann
Contributors: Arne Neumann, Andrew Drozdov

TODO

  1. speed / memory consumption
    As you might expect, a Python implementation is much slower and consumes much more RAM than similar tools written in Java or C/C++ (MXPOST, acopost, C&C etc.). That said, most of the time isn't spent in Python but rather in MEGAM (which is written in OCaml and therefore shouldn't have such issues). NLTK is currently only able to encode POS features explicitly when converting data for MEGAM. According to the MEGAM website, using implicit feature encoding should be much faster.

  2. accuracy
    I trained several taggers on the WSJ corpus (90% training / 10% test data). nltk-maxent-pos-tagger achieved an accuracy of 93.64% (100 iterations, rare feature cutoff = 5) while MXPOST reached 96.93% (100 iterations). Since both implementations use the same feature set, results shouldn't be that different. Unfortunately, there's no source code available for MXPOST, but comparing nltk-maxent-pos-tagger with OpenNLP's implementation should be helpful.
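
A comparison like this can be reproduced with a simple held-out evaluation. The following is a minimal sketch of a 90/10 split and a per-token accuracy measure. The helper names are hypothetical, the tagger is assumed to return a list of (word, tag) tuples per sentence, and `always_nn` is only a stand-in so the example runs without MEGAM.

```python
def split_corpus(tagged_sentences, train_percent=90):
    """Split a corpus into (train, test) at the given percentage."""
    cut = len(tagged_sentences) * train_percent // 100
    return tagged_sentences[:cut], tagged_sentences[cut:]


def token_accuracy(tag_sentence, gold_sentences):
    """Fraction of tokens whose predicted tag equals the gold tag.

    tag_sentence: callable taking a list of words and returning a list
                  of (word, tag) tuples (assumed tagger interface).
    """
    correct = total = 0
    for gold in gold_sentences:
        predicted = tag_sentence([word for word, _tag in gold])
        for (_w, predicted_tag), (_g, gold_tag) in zip(predicted, gold):
            correct += predicted_tag == gold_tag
            total += 1
    return correct / total if total else 0.0


def always_nn(words):
    """Stand-in tagger for illustration: tags every word as NN."""
    return [(word, "NN") for word in words]


toy_corpus = [[("dog", "NN"), ("barks", "VBZ")]] * 10
train_sents, test_sents = split_corpus(toy_corpus)
accuracy = token_accuracy(always_nn, test_sents)
```

With a trained tagger, its tag method would be passed in place of always_nn.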
