kavgan / phrase-at-scale

Licence: other

Detect common phrases in large amounts of text using a data-driven approach. The size of discovered phrases can be arbitrary, and the approach can be used for languages other than English.

Programming Languages

python

Phrase-At-Scale

Phrase-At-Scale provides a fast and easy way to discover phrases from large text corpora using PySpark.

Features

Discover the most common phrases in your text
Size of discovered phrases can be arbitrary (typically: bigrams and trigrams)
Adjust configuration to control quality of phrases
Can be used in languages other than English
Can be run locally using multiple threads, or in parallel on multiple machines
Annotate your corpora with the phrases discovered
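
To make the annotation feature concrete, here is a minimal pure-Python sketch of tagging text with a list of already-discovered phrases. It is an illustration only, not the repository's implementation: the function name `annotate` and the underscore-joined output format are assumptions made for this example.

```python
import re


def annotate(text, phrases):
    """Join each known phrase into one underscore-connected token.

    Hypothetical helper: the underscore format is an assumption,
    not necessarily what phrase_generator.py writes.
    """
    # Handle longer phrases first so "front desk staff" wins over "front desk".
    ordered = sorted(phrases, key=lambda p: len(p.split()), reverse=True)
    for phrase in ordered:
        words = phrase.split()
        # Match the words separated by any whitespace, on word boundaries.
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        replacement = "_".join(words)
        # A function replacement avoids backslash processing in the string.
        text = re.sub(pattern, lambda m, r=replacement: r, text,
                      flags=re.IGNORECASE)
    return text


print(annotate("The front desk staff was friendly",
               ["front desk", "front desk staff"]))
```

Matching is case-insensitive and the replacement uses the phrase as listed, so a capitalised match is rewritten in the casing of the phrase list.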

Quick Start

Run locally

To re-run phrase discovery using the default dataset:

Install Spark

Clone this repo and move into its top-level directory.

git clone git@github.com:kavgan/phrase-at-scale.git

Run the spark job:

<your_path_to_spark>/bin/spark-submit --master local[200] --driver-memory 4G phrase_generator.py

This will use settings (including input data files) as specified in config.py.

You should be able to monitor the progress of your job at http://localhost:4040/

Notes:

The above command runs the job on the local machine: the master is set to local[num_of_threads], so Spark uses the specified number of worker threads (200 in the example; lower it to suit your machine).
This job outputs 2 files:
1. the list of phrases under top-opinrank-phrases.txt
2. the annotated corpora under data/tagged-data/
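
If you prefer to launch the job from Python, the command above can be assembled as an argument list. This is a sketch under stated assumptions: the helper name `build_spark_submit_cmd`, the default of 4 threads, and the `/opt/spark` path are hypothetical; only the `spark-submit` flags shown in the command above are used.

```python
import os


def build_spark_submit_cmd(spark_home, threads=4, driver_memory="4G",
                           script="phrase_generator.py"):
    """Build the spark-submit argument list for a local run.

    Hypothetical helper; it mirrors the command shown above.
    """
    spark_submit = os.path.join(spark_home, "bin", "spark-submit")
    return [
        spark_submit,
        "--master", f"local[{threads}]",   # local mode with N worker threads
        "--driver-memory", driver_memory,
        script,
    ]


# "/opt/spark" is a placeholder; point it at your own Spark installation.
cmd = build_spark_submit_cmd("/opt/spark", threads=8)
print(" ".join(cmd))
# To actually launch the job from the repo's top-level directory:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a shell string also means the square brackets in `local[8]` are never interpreted as a glob pattern by the shell.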

Configuration

To change configuration, just edit the config.py file.

Config	Description
`input_file`	Path to your input data. This can be a single file or a folder of files. The default assumption is one text document (of any size) per line: one sentence per line, one paragraph per line, etc.
`output-folder`	Path where the annotated corpora are written. Can be a local path or an HDFS path.
`phrase-file`	Path to the file that will hold the list of discovered phrases.
`stop-file`	Stop-words file used to mark phrase boundaries.
`min-phrase-count`	Minimum number of occurrences for a phrase to be kept. Guidelines: use 50 for < 300 MB of text, 100 for < 2 GB, and larger values for much larger datasets.
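
To show how the stop-words file and the minimum phrase count interact, the following is a small single-machine sketch. It is not the PySpark job itself and makes simplifying assumptions: the names `candidate_phrases` and `discover_phrases` are invented for this example, tokenisation is a plain regular expression, and punctuation is ignored rather than treated as a boundary.

```python
import re
from collections import Counter


def candidate_phrases(line, stop_words, max_len=3):
    """Yield word n-grams (2..max_len words) that do not cross a stop word."""
    tokens = re.findall(r"[\w']+", line.lower())
    run = []  # current run of consecutive non-stop-words
    for tok in tokens + [None]:  # the trailing None flushes the last run
        if tok is not None and tok not in stop_words:
            run.append(tok)
            continue
        # A stop word (or end of line) closes the run: emit its n-grams.
        for n in range(2, max_len + 1):
            for i in range(len(run) - n + 1):
                yield " ".join(run[i:i + n])
        run = []


def discover_phrases(lines, stop_words, min_phrase_count=2):
    """Count candidates over all lines and keep the frequent ones."""
    counts = Counter()
    for line in lines:
        counts.update(candidate_phrases(line, stop_words))
    return {p: c for p, c in counts.items() if c >= min_phrase_count}


docs = [
    "the front desk was helpful",
    "front desk is slow",
    "a helpful front desk",
]
stops = {"the", "was", "is", "a"}
print(discover_phrases(docs, stops, min_phrase_count=2))
```

A lower `min_phrase_count` keeps rarer and noisier candidates, which is why the guidelines above scale it with corpus size.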

Dataset

The default configuration uses a subset of the OpinRank dataset, consisting of about 255,000 hotel reviews. You can use the following to cite the dataset:

@article{ganesan2012opinion,
  title={Opinion-based entity ranking},
  author={Ganesan, Kavita and Zhai, ChengXiang},
  journal={Information retrieval},
  volume={15},
  number={2},
  pages={116--150},
  year={2012},
  publisher={Springer} 
}

Contact

This repository is maintained by Kavita Ganesan. Please send me an e-mail or open a GitHub issue if you have questions.
