yash1994 / Dframcy

Licence: MIT
Dataframe Integration with spaCy.

DframCy

DframCy is a lightweight utility module that integrates pandas DataFrames with spaCy's linguistic annotation and training tasks. It provides clean APIs to convert spaCy's linguistic annotations and Matcher and PhraseMatcher results into pandas DataFrames. It also supports training and evaluating NLP pipelines from CSV/XLSX/XLS files without any changes to spaCy's underlying APIs.

Getting Started

DframCy is easy to install. You need the following:

Requirements

  • Python 3.6 or later
  • Pandas
  • spaCy >= 3.0.0

You also need to download a spaCy language model:

python -m spacy download en_core_web_sm

For more information, refer to spaCy's Models & Languages documentation.

Installation

This package can be installed from PyPI by running:

pip install dframcy

To build from source:

git clone https://github.com/yash1994/dframcy.git
cd dframcy
python setup.py install

Usage

Linguistic Annotations

Convert a document's linguistic annotations into a dataframe. The available annotations (used as dataframe column names) are listed in spaCy's Token API documentation.

import spacy
from dframcy import DframCy

nlp = spacy.load("en_core_web_sm")

dframcy = DframCy(nlp)
doc = dframcy.nlp(u"Apple is looking at buying U.K. startup for $1 billion")

# default columns: ["id", "text", "start", "end", "pos_", "tag_", "dep_", "head", "ent_type_"]
annotation_dataframe = dframcy.to_dataframe(doc)

# column names can also be passed explicitly (spaCy's linguistic annotation attributes)
annotation_dataframe = dframcy.to_dataframe(doc, columns=["text", "lemma_", "lower_", "is_punct"])

# for separate entity dataframe
token_annotation_dataframe, entity_dataframe = dframcy.to_dataframe(doc, separate_entity_dframe=True)

# custom attributes can also be included
from spacy.tokens import Token
fruit_getter = lambda token: token.text in ("apple", "pear", "banana")
Token.set_extension("is_fruit", getter=fruit_getter)
doc = dframcy.nlp(u"I have an apple")

annotation_dataframe = dframcy.to_dataframe(doc, custom_attributes=["is_fruit"])
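
To make the column mechanism concrete, here is a minimal sketch of how attribute names can be turned into table rows. It is not DframCy's implementation: tokens_to_rows and the SimpleNamespace stand-in tokens are hypothetical, and the sketch assumes only that each column name is an attribute looked up on every token. The resulting list of dictionaries has a shape that pandas.DataFrame accepts.

```python
from types import SimpleNamespace


def tokens_to_rows(tokens, columns):
    """Build one dict per token, keyed by the requested attribute names."""
    rows = []
    for token in tokens:
        # getattr mirrors "column name == token attribute name";
        # attributes missing from a token become None instead of raising.
        rows.append({name: getattr(token, name, None) for name in columns})
    return rows


# Stand-in tokens (hypothetical values, not real spaCy output).
tokens = [
    SimpleNamespace(text="I", lower_="i", is_punct=False),
    SimpleNamespace(text="have", lower_="have", is_punct=False),
    SimpleNamespace(text="apples", lower_="apples", is_punct=False),
    SimpleNamespace(text=".", lower_=".", is_punct=True),
]

rows = tokens_to_rows(tokens, columns=["text", "lower_", "is_punct"])
for row in rows:
    print(row)
```

Because the function relies only on attribute lookup, the same loop would also accept real spaCy tokens, though that path is not exercised here.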

Rule-Based Matching

# Token-based Matching
import spacy

nlp = spacy.load("en_core_web_sm")

from dframcy.matcher import DframCyMatcher, DframCyPhraseMatcher, DframCyDependencyMatcher
dframcy_matcher = DframCyMatcher(nlp)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
dframcy_matcher.add("HelloWorld", [pattern])
doc = dframcy_matcher.nlp("Hello, world! Hello world!")
matches_dataframe = dframcy_matcher(doc)

# Phrase Matching
dframcy_phrase_matcher = DframCyPhraseMatcher(nlp)
terms = [u"Barack Obama", u"Angela Merkel", u"Washington, D.C."]
patterns = [dframcy_phrase_matcher.nlp.make_doc(text) for text in terms]
dframcy_phrase_matcher.add("TerminologyList", patterns)
doc = dframcy_phrase_matcher.nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
                                u"converse in the Oval Office inside the White House in Washington, D.C.")
phrase_matches_dataframe = dframcy_phrase_matcher(doc)

# Dependency Matching
dframcy_dependency_matcher = DframCyDependencyMatcher(nlp)
pattern = [{"RIGHT_ID": "founded_id", "RIGHT_ATTRS": {"ORTH": "founded"}}]
dframcy_dependency_matcher.add("FOUNDED", [pattern])
doc = dframcy_dependency_matcher.nlp(u"Bill Gates founded Microsoft. And Elon Musk founded SpaceX")
dependency_matches_dataframe = dframcy_dependency_matcher(doc)
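
spaCy's Matcher returns each match as a (match_id, start, end) triple, where start and end are token offsets. The sketch below shows one way such triples can be flattened into rows. The function matches_to_rows, the column names, and the sample data are all hypothetical, and they say nothing about the columns DframCy actually produces.

```python
def matches_to_rows(matches, tokens, names):
    """Turn (match_id, start, end) triples into one dict per match."""
    rows = []
    for match_id, start, end in matches:
        rows.append(
            {
                "start": start,
                "end": end,
                # In spaCy the label would come from nlp.vocab.strings[match_id].
                "label": names[match_id],
                # Token texts joined by spaces; a spaCy Span keeps original spacing.
                "tokens": " ".join(tokens[start:end]),
            }
        )
    return rows


# Hand-written sample data standing in for a tokenized doc and matcher output.
tokens = ["Hello", ",", "world", "!", "Hello", "world", "!"]
names = {1: "HelloWorld"}
matches = [(1, 0, 3)]  # covers tokens 0 up to (not including) 3

print(matches_to_rows(matches, tokens, names))
```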

Command Line Interface

DframCy provides a command-line interface for converting a plain text file into linguistically annotated text in CSV or JSON format. Previous versions of DframCy also supported CLI utilities for training and evaluating spaCy models from CSV/XLS files. Since the v3 release, spaCy's training pipeline has become much more flexible and robust, so DframCy no longer adds an extra step just for format conversion (CSV/XLS to spaCy's binary format).

# convert
dframcy dframe -i plain_text.txt -o annotations.csv -f csv
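
Once the command has written annotations.csv, the file can be read back with Python's standard csv module. This sketch assumes the CSV has a header row. The sample rows and column names below are made up so the snippet runs without the CLI, and the real columns may differ.

```python
import csv
import os
import tempfile

# Stand-in for the CLI output: a small CSV with a header row (hypothetical columns).
sample = "text,pos_\nApple,PROPN\nis,AUX\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "annotations.csv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        handle.write(sample)

    # DictReader maps each data row to the header names.
    with open(path, newline="", encoding="utf-8") as handle:
        records = list(csv.DictReader(handle))

for record in records:
    print(record["text"], record["pos_"])
```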