
explosion / Spacy Stanza

Licence: MIT
💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Spacy Stanza

Prodigy Recipes
🍳 Recipes for Prodigy, our fully scriptable annotation tool
Stars: ✭ 229 (-54.92%)
Mutual labels:  data-science, natural-language-processing, spacy
Jupyterlab Prodigy
🧬 A JupyterLab extension for annotating data with Prodigy
Stars: ✭ 97 (-80.91%)
Mutual labels:  data-science, natural-language-processing, spacy
Tageditor
🏖TagEditor - Annotation tool for spaCy
Stars: ✭ 92 (-81.89%)
Mutual labels:  data-science, natural-language-processing, spacy
Spacy
💫 Industrial-strength Natural Language Processing (NLP) in Python
Stars: ✭ 21,978 (+4226.38%)
Mutual labels:  data-science, natural-language-processing, spacy
Autogluon
AutoGluon: AutoML for Text, Image, and Tabular Data
Stars: ✭ 3,920 (+671.65%)
Mutual labels:  data-science, natural-language-processing
Learn Data Science For Free
This repository is a combination of different resources lying scattered all over the internet. The reason for making such a repository is to combine all the valuable resources in a sequential manner, so that it helps every beginner who is in search of a free and structured learning resource for Data Science. For constant updates follow me in …
Stars: ✭ 4,757 (+836.42%)
Mutual labels:  data-science, natural-language-processing
Adam qas
ADAM - A Question Answering System. Inspired by IBM Watson
Stars: ✭ 330 (-35.04%)
Mutual labels:  natural-language-processing, spacy
Data Science
Collection of useful data science topics along with code and articles
Stars: ✭ 315 (-37.99%)
Mutual labels:  data-science, natural-language-processing
Datacamp Python Data Science Track
All the slides, accompanying code and exercises all stored in this repo. 🎈
Stars: ✭ 250 (-50.79%)
Mutual labels:  data-science, natural-language-processing
Tensorlayer Tricks
How to use TensorLayer
Stars: ✭ 357 (-29.72%)
Mutual labels:  data-science, natural-language-processing
Nlp Python Deep Learning
NLP in Python with Deep Learning
Stars: ✭ 374 (-26.38%)
Mutual labels:  natural-language-processing, spacy
Medacy
🏥 Medical Text Mining and Information Extraction with spaCy
Stars: ✭ 287 (-43.5%)
Mutual labels:  natural-language-processing, spacy
Oie Resources
A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.
Stars: ✭ 283 (-44.29%)
Mutual labels:  data-science, natural-language-processing
Displacy
💥 displaCy.js: An open-source NLP visualiser for the modern web
Stars: ✭ 311 (-38.78%)
Mutual labels:  natural-language-processing, spacy
Awesome Distributed Deep Learning
A curated list of awesome Distributed Deep Learning resources.
Stars: ✭ 277 (-45.47%)
Mutual labels:  data-science, natural-language-processing
Spacy Streamlit
👑 spaCy building blocks and visualizers for Streamlit apps
Stars: ✭ 360 (-29.13%)
Mutual labels:  natural-language-processing, spacy
Mlinterview
A curated awesome list of AI Startups in India & Machine Learning Interview Guide. Feel free to contribute!
Stars: ✭ 410 (-19.29%)
Mutual labels:  data-science, natural-language-processing
D2l Vn
An interactive deep learning book with source code, math, and discussion. It covers several popular frameworks (TensorFlow, PyTorch & MXNet) and is used at 175 universities.
Stars: ✭ 402 (-20.87%)
Mutual labels:  data-science, natural-language-processing
Code search
Code For Medium Article: "How To Create Natural Language Semantic Search for Arbitrary Objects With Deep Learning"
Stars: ✭ 436 (-14.17%)
Mutual labels:  data-science, natural-language-processing
Machine Learning Resources
A curated list of awesome machine learning frameworks, libraries, courses, books and many more.
Stars: ✭ 226 (-55.51%)
Mutual labels:  data-science, natural-language-processing

spaCy + Stanza (formerly StanfordNLP)

This package wraps the Stanza (formerly StanfordNLP) library, so you can use Stanford's models in a spaCy pipeline. The Stanford models achieved top accuracy in the CoNLL 2017 and 2018 shared tasks, which involve tokenization, part-of-speech tagging, morphological analysis, lemmatization and labeled dependency parsing in 68 languages. As of v1.0, Stanza also supports named entity recognition for selected languages.

⚠️ Previous versions of this package were available as spacy-stanfordnlp.


Using this wrapper, you'll be able to use the following annotations, computed by your pretrained Stanza model (a short sketch of accessing them follows this list):

  • Statistical tokenization (reflected in the Doc and its tokens)
  • Lemmatization (token.lemma and token.lemma_)
  • Part-of-speech tagging (token.tag, token.tag_, token.pos, token.pos_)
  • Morphological analysis (token.morph)
  • Dependency parsing (token.dep, token.dep_, token.head)
  • Named entity recognition (doc.ents, token.ent_type, token.ent_type_, token.ent_iob, token.ent_iob_)
  • Sentence segmentation (doc.sents)
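
A minimal sketch of reading these annotations, assuming the English Stanza model has been downloaded (installation and a fuller example follow below):

import stanza
import spacy_stanza

stanza.download("en")                       # fetch the English Stanza model if needed
nlp = spacy_stanza.load_pipeline("en")

doc = nlp("She was reading in Berlin. The book was long.")
print([token.morph for token in doc])       # morphological analysis
print([sent.text for sent in doc.sents])    # sentence segmentation
print(doc.ents)                             # named entities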

⌛️ Installation

As of v1.0.0 spacy-stanza is only compatible with spaCy v3.x. To install the most recent version:

pip install spacy-stanza

For spaCy v2, install v0.2.x and refer to the v0.2.x usage documentation:

pip install "spacy-stanza<0.3.0"

Make sure to also download one of the pre-trained Stanza models.
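
For example, the default English model can be fetched from Python (the same call is used in the usage example below):

import stanza
stanza.download("en")  # downloads the default English Stanza model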

📖 Usage & Examples

⚠️ Important note: This package has been refactored to take advantage of spaCy v3.0. Previous versions that were built for spaCy v2.x worked considerably differently. Please see previous tagged versions of this README for documentation on prior versions.

Use spacy_stanza.load_pipeline() to create an nlp object that you can use to process a text with a Stanza pipeline and create a spaCy Doc object. By default, both the spaCy pipeline and the Stanza pipeline will be initialized with the same lang, e.g. "en":

import stanza
import spacy_stanza

# Download the stanza model if necessary
stanza.download("en")

# Initialize the pipeline
nlp = spacy_stanza.load_pipeline("en")

doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_)
print(doc.ents)

If language data for the given language is available in spaCy, the respective language class can be used as the base for the nlp object – for example, English(). This lets you use spaCy's lexical attributes like is_stop or like_num. The nlp object follows the same API as any other spaCy Language class – so you can visualize the Doc objects with displaCy, add custom components to the pipeline, use the rule-based matcher and do pretty much anything else you'd normally do in spaCy.

# Access spaCy's lexical attributes
print([token.is_stop for token in doc])
print([token.like_num for token in doc])

# Visualize dependencies
from spacy import displacy
displacy.serve(doc)  # or displacy.render if you're in a Jupyter notebook

# Process texts with nlp.pipe
for doc in nlp.pipe(["Lots of texts", "Even more texts", "..."]):
    print(doc.text)

# Combine with your own custom pipeline components
from spacy import Language
@Language.component("custom_component")
def custom_component(doc):
    # Do something to the doc here
    print(f"Custom component called: {doc.text}")
    return doc

nlp.add_pipe("custom_component")
doc = nlp("Some text")

# Serialize attributes to a numpy array
np_array = doc.to_array(['ORTH', 'LEMMA', 'POS'])
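
The rule-based matcher mentioned above also works on these Doc objects; a minimal sketch with an illustrative pattern:

from spacy.matcher import Matcher

# Match "born in" followed by a proper noun, e.g. "born in Hawaii"
matcher = Matcher(nlp.vocab)
matcher.add("BORN_IN", [[{"LOWER": "born"}, {"LOWER": "in"}, {"POS": "PROPN"}]])

doc = nlp("Barack Obama was born in Hawaii.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)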

Stanza Pipeline options

Additional options for the Stanza Pipeline can be provided as keyword arguments following the Pipeline API:

  • Provide the Stanza language as lang. For Stanza languages without spaCy support, use "xx" for the spaCy language setting:

    # Initialize a pipeline for Coptic
    nlp = spacy_stanza.load_pipeline("xx", lang="cop")
    
  • Provide Stanza pipeline settings following the Pipeline API:

    # Initialize a German pipeline with the `hdt` package
    nlp = spacy_stanza.load_pipeline("de", package="hdt")
    
  • Tokenize with spaCy rather than the statistical tokenizer (only for English):

    nlp = spacy_stanza.load_pipeline("en", processors= {"tokenize": "spacy"})
    
  • Provide any additional processor settings as additional keyword arguments:

    # Provide pretokenized texts (whitespace tokenization)
    nlp = spacy_stanza.load_pipeline("de", tokenize_pretokenized=True)
    

The spaCy config specifies all Pipeline options in the [nlp.tokenizer] block. For example, the config for the last example above, a German pipeline with pretokenized texts:

[nlp.tokenizer]
@tokenizers = "spacy_stanza.PipelineAsTokenizer.v1"
lang = "de"
dir = null
package = "default"
logging_level = null
verbose = null
use_gpu = true

[nlp.tokenizer.kwargs]
tokenize_pretokenized = true

[nlp.tokenizer.processors]
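
These settings can also be inspected programmatically on a loaded pipeline via nlp.config; a small sketch (the values mirror the block above):

# Inspect the tokenizer block stored in the spaCy config
nlp = spacy_stanza.load_pipeline("de", tokenize_pretokenized=True)
print(nlp.config["nlp"]["tokenizer"])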

Serialization

The full Stanza pipeline configuration is stored in the spaCy pipeline config, so you can save and load the pipeline just like any other nlp pipeline:

import spacy

# Save to a local directory
nlp.to_disk("./stanza-spacy-model")

# Reload the pipeline
nlp = spacy.load("./stanza-spacy-model")

Note that this does not save any Stanza model data by default. The Stanza models are very large, so for now, this package expects you to download the models separately with stanza.download() and have them available either in the default model directory or in the path specified as dir in the [nlp.tokenizer] block of the config.
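
A minimal sketch of keeping the models in a non-default directory (the path is illustrative; dir corresponds to the setting shown in the config above):

import stanza
import spacy_stanza

# Download the model into a custom directory and point the pipeline at it
stanza.download("en", model_dir="./stanza_resources")
nlp = spacy_stanza.load_pipeline("en", dir="./stanza_resources")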

Adding additional spaCy pipeline components

By default, the spaCy pipeline in the nlp object returned by spacy_stanza.load_pipeline() will be empty, because all stanza attributes are computed and set within the custom tokenizer, StanzaTokenizer. But since it's a regular nlp object, you can add your own components to the pipeline. For example, you could add your own custom text classification component with nlp.add_pipe("textcat", source=source_nlp), or augment the named entities with your own rule-based patterns using the EntityRuler component.
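
For example, an EntityRuler added after loading can supplement the Stanza entities with rule-based patterns (the label and pattern here are illustrative):

# Add an EntityRuler on top of the Stanza-provided entities
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Stanford NLP Group"}])

doc = nlp("The Stanford NLP Group maintains Stanza.")
print(doc.ents)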

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].