Stanza
======

|Master Build Status| |Documentation Status|

Stanza is the Stanford NLP group’s shared repository for Python infrastructure. The goal of Stanza is not to replace your modeling tools of choice, but to offer implementations for common patterns useful for machine learning experiments.

Usage
-----

You can install the package as follows:

::

    git clone git@github.com:stanfordnlp/stanza.git
    cd stanza
    pip install -e .

To use the package, import it in your Python code. For example:

::

    from stanza.text.vocab import Vocab
    v = Vocab('UNK')

To use the Python client for the CoreNLP server, first launch your `CoreNLP Java server <https://stanfordnlp.github.io/CoreNLP/corenlp-server.html>`__. Then, in your Python program:

::

    from stanza.nlp.corenlp import CoreNLPClient
    client = CoreNLPClient(server='http://localhost:9000', default_annotators=['ssplit', 'tokenize', 'lemma', 'pos', 'ner'])
    annotated = client.annotate('This is an example document. Here is a second sentence')
    for sentence in annotated.sentences:
        print('sentence', sentence)
        for token in sentence:
            print(token.word, token.lemma, token.pos, token.ner)

Please see the documentation for more use cases.

Documentation
-------------

Documentation is hosted on Read the Docs at http://stanza.readthedocs.org/en/latest/. Stanza is still in early development. Interfaces and code organization will probably change substantially over the next few months.

Development Guide
-----------------

To request or discuss additional functionality, please open a GitHub issue. We greatly appreciate pull requests!

Tests
~~~~~

Stanza has unit tests, doctests, and longer integration tests. We ask that all
contributors run the unit tests and doctests before submitting pull requests:

.. code:: python

    python setup.py test

Doctests are the easiest way to write a test for new functionality, and they
serve as helpful examples of how to use your code. See
`progress.py <stanza/research/progress.py>`__ for a simple example of an
easily testable module, or `summary.py <stanza/research/summary.py>`__ for a
more involved setup with a mocked filesystem.
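
For instance, a doctest is just a small usage example embedded in a
docstring; Python's standard ``doctest`` module runs it and checks the
output. A minimal, generic sketch (not a function from the Stanza codebase):

.. code:: python

    def n_grams(tokens, n):
        """Return the list of n-grams (as tuples) in ``tokens``.

        >>> n_grams(['the', 'cat', 'sat'], 2)
        [('the', 'cat'), ('cat', 'sat')]
        >>> n_grams(['hello'], 2)
        []
        """
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    if __name__ == '__main__':
        import doctest
        doctest.testmod()  # reports any example whose output does not match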

Adding a new module
~~~~~~~~~~~~~~~~~~~

If you are adding a new module, please remember to add it to ``setup.py`` as well as a corresponding ``.rst`` file in the ``docs`` directory.
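
For concreteness, here is a hypothetical sketch of the relevant part of
``setup.py`` (the real file may differ; ``stanza.newmodule`` below is a
made-up package name standing in for whatever you are adding):

.. code:: python

    # Hypothetical sketch only -- not the actual contents of setup.py.
    from setuptools import setup

    setup(
        name='stanza',
        packages=[
            'stanza',
            'stanza.text',
            'stanza.newmodule',  # the new package must be listed here,
                                 # or it won't ship with the distribution
        ],
    )

The matching ``docs/stanza.newmodule.rst`` file can be generated with the
``sphinx-apidoc`` command shown in the Documentation section below.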

Documentation
~~~~~~~~~~~~~

Documentation is generated via
`Sphinx <http://www.sphinx-doc.org/en/stable/>`__ from inline docstrings.
This means that docstrings in Python double as both interactive
documentation and standalone documentation. It also means that you must
format your docstrings in RST, which is very similar to Markdown. There are
many tutorials on the exact syntax; essentially, you only need to know the
function parameter syntax, which is described
`here <http://thomas-cokelaer.info/tutorials/sphinx/rest_syntax.html#auto-document-your-python-code>`__.
You can, of course, also look at the documentation of existing modules for
guidance. A good place to start is the ``text.dataset`` package.
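
As a quick illustration of that parameter syntax, here is a generic
docstring (not a function from the codebase) using Sphinx's RST fields:

.. code:: python

    def pad_sequence(tokens, length, pad_token='<pad>'):
        """Pad ``tokens`` with ``pad_token`` until it reaches ``length``.

        :param tokens: the sequence of tokens to pad
        :type tokens: list of str
        :param int length: the desired total length
        :param str pad_token: the token appended as padding
        :returns: a copy of ``tokens`` padded to at least ``length`` items
        :rtype: list of str
        """
        return tokens + [pad_token] * max(0, length - len(tokens))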

To set up your environment such that you can generate docs locally:

::

    pip install sphinx sphinx-autobuild

If you introduced a new module, please auto-generate the docs:

::

    sphinx-apidoc -F -o docs stanza
    cd docs && make html
    open _build/html/index.html

You will most likely need to manually edit the ``.rst`` file corresponding to your new module.

Our docs are `hosted on Readthedocs <https://readthedocs.org/projects/stanza/>`__. If you'd like admin access to the Readthedocs project, please contact Victor or Will.

Road Map
--------

-  common objects used in NLP

   -  [x] a Vocabulary object mapping from strings to integers/vectors

-  tools for running experiments on the NLP cluster

   -  [ ] a function for querying GPU device stats (to aid in selecting
      a GPU on the cluster); a rough sketch appears after this list
   -  [ ] a tool for plotting training curves from multiple jobs
   -  [ ] a tool for interacting with an already running job via edits
      to a text file

-  [x] an API for calling CoreNLP
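
As a rough idea of how the GPU-stats item above could look, here is a hedged
sketch (nothing like this exists in Stanza yet, and the reliance on
``nvidia-smi`` is an assumption about the cluster environment):

.. code:: python

    # Speculative sketch of a GPU-stats helper; not part of Stanza.
    import subprocess

    def gpu_memory_used():
        """Return (gpu_index, used_mib) pairs by parsing nvidia-smi output."""
        out = subprocess.check_output(
            ['nvidia-smi', '--query-gpu=index,memory.used',
             '--format=csv,noheader,nounits'])
        stats = []
        for line in out.decode('utf-8').strip().splitlines():
            index, used = line.split(',')
            stats.append((int(index), int(used)))
        return stats

    # e.g. pick the GPU with the least memory currently in use:
    # best_gpu = min(gpu_memory_used(), key=lambda pair: pair[1])[0]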

For Stanford NLP members
------------------------

Stanza is not meant to include every research project the group
undertakes. If you have a standalone project that you would like to
share with other people in the group, you can:

-  request your own private repo under the `stanfordnlp GitHub
   account <https://github.com/stanfordnlp>`__.
-  share your code on `CodaLab <https://codalab.stanford.edu/>`__.
-  For targeted questions, ask on `Stanford NLP
   Overflow <http://nlp.stanford.edu/local/qa/>`__ (use the ``stanza``
   tag).

Using ``git subtree``
~~~~~~~~~~~~~~~~~~~~~

That said, it can be useful to add functionality to Stanza while you work in a separate repo on a project that depends on Stanza. Since Stanza is under active development, you will want to version-control the Stanza code that your code uses. Probably the most effective way of accomplishing this is with ``git subtree``.

``git subtree`` includes the source tree of another repo (in this case, Stanza) as a directory within your repo (your cutting-edge research), and keeps track of some metadata that allows you to keep that directory in sync with the original Stanza code. The main advantage of ``git subtree`` is that you can modify the Stanza code locally, merge in updates, and push your changes back to the Stanza repo to share them with the group. (``git submodule`` doesn't allow this.)

It has some downsides to be aware of:

-  You have a copy of all of Stanza as part of your repo. For small
   projects, this could increase your repo size dramatically. (Note: you can
   keep the history of your repo from growing at the same rate as Stanza's
   by using squashed commits; it's only the size of the source tree that
   unavoidably bloats your project.)
-  Your repo's history will contain a merge commit every time you update
   Stanza from upstream. This can look ugly, especially in graphical viewers.

Still, ``git subtree`` can be configured to be fairly easy to use, and the consensus seems to be that it is `superior to submodule <https://codingkilledthecat.wordpress.com/2012/04/28/why-your-company-shouldnt-use-git-submodules/>`__.

Here's one way to configure subtree so that you can include Stanza in your repo and contribute your changes back to the master repo:

::

    # Add Stanza as a remote repo
    git remote add stanza http://<your github username>@github.com/stanfordnlp/stanza.git
    # Import the contents of the repo as a subtree
    git subtree add --prefix third-party/stanza stanza develop --squash
    # Put a symlink to the actual module somewhere where your code needs it
    ln -s third-party/stanza/stanza stanza
    # Add aliases for the two things you'll need to do with the subtree
    git config alias.stanza-update 'subtree pull --prefix third-party/stanza stanza develop --squash'
    git config alias.stanza-push 'subtree push --prefix third-party/stanza stanza develop'

After this, you can use the aliases to push and pull Stanza like so:

::

    git stanza-update
    git stanza-push

I [@futurulus] highly recommend a `topic branch/rebase workflow <https://randyfay.com/content/rebase-workflow-git>`__, which will keep your history fairly clean besides those pesky subtree merge commits:

::

    # Create a topic branch
    git checkout -b fix-stanza
    # <hack hack hack, make some commits>

    git checkout master
    # Update Stanza on master, should go smoothly because master doesn't
    # have any of your changes yet
    git stanza-update

    # Go back and replay your fixes on top of master changes
    git checkout fix-stanza
    git rebase master
    # You might need to resolve merge conflicts here

    # Add your rebased changes to master and push
    git checkout master
    git merge --ff-only fix-stanza
    git stanza-push
    # Done!
    git branch -d fix-stanza

.. |Master Build Status| image:: https://travis-ci.org/stanfordnlp/stanza.svg?branch=master
   :target: https://travis-ci.org/stanfordnlp/stanza
.. |Documentation Status| image:: https://readthedocs.org/projects/stanza/badge/?version=latest
   :target: http://stanza.readthedocs.org/en/latest/?badge=latest
