
textpipe / Textpipe

License: MIT
Textpipe: clean and extract metadata from text

Programming Languages

Python

Projects that are alternatives of or similar to Textpipe

Padatious
A neural network intent parser
Stars: ✭ 124 (-56.34%)
Mutual labels:  text-processing, text-analysis
Dan Jurafsky Chris Manning Nlp
My solution to the Natural Language Processing course made by Dan Jurafsky, Chris Manning in Winter 2012.
Stars: ✭ 124 (-56.34%)
Mutual labels:  named-entity-recognition, text-processing
Stanza Old
Stanford NLP group's shared Python tools.
Stars: ✭ 142 (-50%)
Mutual labels:  text-processing, text-analysis
Textvec
Text vectorization tool to outperform TFIDF for classification tasks
Stars: ✭ 167 (-41.2%)
Mutual labels:  text-processing, text-analysis
teanaps
An open-source Python library for natural language processing and text analysis.
Stars: ✭ 91 (-67.96%)
Mutual labels:  named-entity-recognition, text-processing
Javascript Text Expander
Expands texts as you type, naturally
Stars: ✭ 58 (-79.58%)
Mutual labels:  text-processing, text-analysis
Open Semantic Search
Open Source research tool to search, browse, analyze and explore large document collections by Semantic Search Engine and Open Source Text Mining & Text Analytics platform (Integrates ETL for document processing, OCR for images & PDF, named entity recognition for persons, organizations & locations, metadata management by thesaurus & ontologies, search user interface & search apps for fulltext search, faceted search & knowledge graph)
Stars: ✭ 386 (+35.92%)
Mutual labels:  named-entity-recognition, text-analysis
Artificial Adversary
🗣️ Tool to generate adversarial text examples and test machine learning models against them
Stars: ✭ 348 (+22.54%)
Mutual labels:  text-processing, text-analysis
corpusexplorer2.0
Corpus linguistics has never been this easy...
Stars: ✭ 16 (-94.37%)
Mutual labels:  text-analysis, text-processing
ConTexto
A Python library for text mining and NLP
Stars: ✭ 43 (-84.86%)
Mutual labels:  text-analysis, text-processing
support-tickets-classification
This case study shows how to create a model for text analysis and classification and deploy it as a web service in Azure cloud in order to automatically classify support tickets. This project is a proof of concept made by Microsoft (Commercial Software Engineering team) in collaboration with Endava http://endava.com/en
Stars: ✭ 142 (-50%)
Mutual labels:  text-analysis, text-processing
frog
Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package.
Stars: ✭ 70 (-75.35%)
Mutual labels:  named-entity-recognition, text-processing
text-analysis
Weaving analytical stories from text data
Stars: ✭ 12 (-95.77%)
Mutual labels:  text-analysis, text-processing
knime-textprocessing
KNIME - Text Processing Extension (Labs)
Stars: ✭ 17 (-94.01%)
Mutual labels:  text-analysis, text-processing
TRUNAJOD2.0
An easy-to-use library to extract indices from texts.
Stars: ✭ 18 (-93.66%)
Mutual labels:  text-analysis, text-processing
daachorse
🐎 A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure.
Stars: ✭ 75 (-73.59%)
Mutual labels:  text-processing
redcoat
A lightweight web-based annotation tool for labelling entity recognition data.
Stars: ✭ 19 (-93.31%)
Mutual labels:  named-entity-recognition
Stanford-NER-Python
Stanford Named Entity Recognizer (NER) - Python Wrapper
Stars: ✭ 63 (-77.82%)
Mutual labels:  named-entity-recognition
ADEL
ADEL is a robust and efficient entity linking framework that is adaptive to text genres and language, entity types for the classification task and referent knowledge base for the linking task.
Stars: ✭ 15 (-94.72%)
Mutual labels:  named-entity-recognition
aylien textapi nodejs
AYLIEN's officially supported node.js client library for accessing Text API
Stars: ✭ 13 (-95.42%)
Mutual labels:  text-analysis

textpipe: clean and extract metadata from text


The textpipe logo

textpipe is a Python package for converting raw text into clean, readable text and extracting metadata from that text. It transforms raw text into readable text by removing HTML tags, and extracts metadata such as the number of words and the named entities in the text.

Vision: the zen of textpipe

  • Designed for use in production pipelines without adult supervision.
  • Rechargeable batteries included: provide sane defaults and clear examples to adapt.
  • A uniform interface with thin wrappers around state-of-the-art NLP packages.
  • As language-agnostic as possible.
  • Bring your own models.

Features

  • Clean raw text by removing HTML and other unreadable constructs
  • Identify the language of a text
  • Extract the number of words, the number of sentences, and the named entities from a text
  • Calculate the complexity of a text
  • Obtain text metadata by specifying a pipeline containing all desired elements
  • Obtain sentiment (a polarity and a subjectivity score)
  • Generate word counts
  • Compute a minhash for cheap similarity estimation of documents (see the sketch after this list)
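
A minimal sketch of how several of these features might be read from a single Doc object: only clean, language and nwords are confirmed by the usage example further below, while the commented-out attribute names are assumptions inferred from this feature list rather than confirmed API.

from textpipe import doc

# Build a Doc from raw text that contains an HTML construct
document = doc.Doc('The <b>quick</b> brown fox jumps over the lazy dog.')

print(document.clean)     # cleaned, readable text with HTML removed
print(document.language)  # detected language code
print(document.nwords)    # number of words

# Assumed attribute names -- verify against the textpipe source before relying on them:
# print(document.complexity)   # text complexity
# print(document.sentiment)    # (polarity, subjectivity)
# print(document.minhash)      # minhash for cheap document similarity estimation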

Installation

It is recommended that you install textpipe using a virtual environment.

  • Using venv:
python3 -m venv .venv
  • Using virtualenv:
virtualenv venv -p python3.6
  • Using virtualenvwrapper:
mkvirtualenv textpipe -p python3.6
  • Install textpipe using pip:
pip install textpipe
  • Install the required packages from requirements.txt:
pip install -r requirements.txt
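
To verify the installation, a quick sanity check along the lines of the usage example below can be run:

from textpipe import doc

# Should print the sample text with the HTML tag stripped
print(doc.Doc('Hello <b>world</b>!').clean)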

A note on the spaCy model download requirement

While the requirements.txt file that ships with the package calls for spaCy's en_core_web_sm model, you can change this depending on the model and language required for your intended use. See spaCy.io's page on the available models for more information.
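
For example, to work with another language you could download a different spaCy model and confirm that it loads. The sketch below uses only standard spaCy tooling; how a preloaded model is passed into textpipe depends on the textpipe version and is not shown here.

# Download an alternative model first (shell command):
#   python -m spacy download nl_core_news_sm
import spacy

# Confirm the alternative (Dutch) model loads and tokenizes a sample sentence.
nlp = spacy.load('nl_core_news_sm')
print([token.text for token in nlp('Dit is een voorbeeldzin.')])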

Usage example

>>> from textpipe import doc, pipeline
>>> sample_text = 'Sample text! <!DOCTYPE>'
>>> document = doc.Doc(sample_text)
>>> print(document.clean)
'Sample text!'
>>> print(document.language)
'en'
>>> print(document.nwords)
2

>>> pipe = pipeline.Pipeline(['CleanText', 'NWords'])
>>> print(pipe(sample_text))
{'CleanText': 'Sample text!', 'NWords': 3}

To extend the existing textpipe operations with your own custom operations:

from textpipe import pipeline

test_pipe = pipeline.Pipeline(['CleanText', 'NWords'])

# A custom operation receives the document plus optional context and settings
def custom_op(doc, context=None, settings=None, **kwargs):
    return 1

# Register the operation and append it as a step with its arguments
custom_argument = {'argument': 1}
test_pipe.register_operation('CUSTOM_STEP', custom_op)
test_pipe.steps.append(('CUSTOM_STEP', custom_argument))
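
Assuming the extended pipeline returns the same kind of dictionary as in the usage example above, running it would add a key for the custom step; the exact output below is illustrative:

>>> print(test_pipe(sample_text))
{'CleanText': 'Sample text!', 'NWords': 3, 'CUSTOM_STEP': 1}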

Contributing

See CONTRIBUTING for guidelines for contributors.

Changes

0.12.1

  • Bumps redis, tqdm, pylint

0.12.0

  • Bumps versions of many dependencies including textacy. Results for keyterm extraction changed.

0.11.9

  • Exposes arbitrary spaCy ents properties

0.11.8

  • Exposes spaCy's cats attribute

0.11.7

  • Bumps spaCy and redis versions

0.11.6

  • Fixes bug where gensim model is not cached in pipeline

0.11.5

  • Raise TextpipeMissingModelException instead of KeyError

0.11.4

  • Bumps spaCy and datasketch dependencies

0.11.1

  • Replaces codacy with pylint on CI
  • Fixes pylint issues

0.11.0

  • Adds wrapper around Gensim keyed vectors to construct document embeddings from Redis cache

0.9.0

  • Adds functionality to compute document embeddings using a Gensim word2vec model

0.8.6

  • Removes non-standard UTF characters before detecting language

0.8.5

  • Bump spaCy to 2.1.3

0.8.4

  • Fix broken install command

0.8.3

  • Fix broken install command

0.8.2

  • Fix copy-paste error in word vector aggregation (#118)

0.8.1

  • Fixes bugs in several operations that didn't accept kwargs

0.8.0

  • Bumps spaCy to 2.1

0.7.2

  • Pins spaCy and Pattern versions (with pinned lxml)

0.7.0

  • Changes the operation registry from a list to a dict
  • Makes global pipeline data available across operations via the context kwarg
  • Adds register_operation for loading custom operations into a pipeline
  • Adds support for custom steps (operations) with arguments