kavgan / phrase-at-scale

Licence: other
Detect common phrases in large amounts of text using a data-driven approach. Size of discovered phrases can be arbitrary. Can be used in languages other than English

Phrase-At-Scale

Phrase-At-Scale provides a fast and easy way to discover phrases from large text corpora using PySpark.

Features

  • Discover most common phrases in your text
  • Size of discovered phrases can be arbitrary (typically: bigrams and trigrams)
  • Adjust configuration to control quality of phrases
  • Can be used in languages other than English
  • Can be run locally using multiple threads, or in parallel on multiple machines
  • Annotate your corpora with the phrases discovered
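
To make the last feature concrete, here is a minimal sketch of what annotating a text with discovered phrases could look like. It is not the code in phrase_generator.py: the function name `annotate`, the underscore joiner, and the sample phrases are assumptions made for illustration, and the format of the real annotated output may differ.

```python
import re


def annotate(text, phrases, joiner="_"):
    """Rewrite each known phrase in `text` as a single joined token."""
    if not phrases:
        return text
    # Try longer phrases first so "front desk staff" wins over "front desk".
    ordered = sorted(phrases, key=lambda p: len(p.split()), reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: joiner.join(m.group(0).split()), text)


# Hypothetical phrase list, standing in for the discovered phrases.
phrases = ["front desk", "front desk staff", "walking distance"]
print(annotate("The front desk staff was great and within walking distance.", phrases))
```

Joining a phrase into one token means downstream tools that split on whitespace treat it as a single unit.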

Quick Start

Run locally

To re-run phrase discovery using the default dataset:

  1. Install Spark

  2. Clone this repo and move into its top-level directory.

    git clone git@github.com:kavgan/phrase-at-scale.git
    
  3. Run the spark job:

    <your_path_to_spark>/bin/spark-submit --master local[200] --driver-memory 4G phrase_generator.py 
    

This will use settings (including input data files) as specified in config.py.

  4. You should be able to monitor the progress of your job at http://localhost:4040/

Notes:

  • The command above runs Spark in local mode on a single machine; the number in local[num_of_threads] sets how many worker threads it uses.
  • This job outputs 2 files:
    1. the list of phrases under top-opinrank-phrases.txt
    2. the annotated corpora under data/tagged-data/
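
If you would rather derive the thread count from the machine than hard-code it, the command can be assembled in a few lines of Python. This is only a sketch: the helper name `spark_submit_command` and the `/opt/spark` path are hypothetical, and defaulting to one thread per CPU core is an assumption, not a recommendation from this project.

```python
import os
import shlex


def spark_submit_command(spark_home, script="phrase_generator.py",
                         threads=None, driver_memory="4G"):
    """Build the spark-submit argument list for a local run."""
    # Assumption: fall back to one worker thread per CPU core.
    n = threads or os.cpu_count() or 1
    return [
        os.path.join(spark_home, "bin", "spark-submit"),
        "--master", f"local[{n}]",
        "--driver-memory", driver_memory,
        script,
    ]


# "/opt/spark" is a placeholder for <your_path_to_spark>.
cmd = spark_submit_command("/opt/spark", threads=8)
print(shlex.join(cmd))  # shell-quoted form, safe to paste into a terminal
```

Passing the list to `subprocess.run(cmd)` avoids shell quoting altogether, which helps because some shells treat the square brackets in local[8] as glob characters.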

Configuration

To change configuration, just edit the config.py file.

Config | Description
--- | ---
input_file | Path to your input data. This can be a single file or a folder of files. The default assumption is one text document (of any size) per line: one sentence per line, one paragraph per line, etc.
output-folder | Path where the annotated corpora are written. Can be a local path or a path on HDFS.
phrase-file | Path to the file that will hold the list of discovered phrases.
stop-file | Stop-words file; stop words mark phrase boundaries.
min-phrase-count | Minimum number of occurrences a phrase needs in order to be kept. Guidelines: use 50 for < 300 MB of text, 100 for < 2 GB, and larger values for much larger datasets.
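
To show how stop-file and min-phrase-count interact, here is a small single-machine sketch in plain Python. It is one plausible reading of the settings above, not the PySpark job itself: the function names, the `max_words` limit, and the choice to treat punctuation as a boundary are all assumptions for illustration.

```python
import re
from collections import Counter


def candidate_phrases(line, stop_words, max_words=3):
    """Yield word n-grams (2..max_words) that do not cross a boundary."""
    run = []
    tokens = re.findall(r"\w+|[^\w\s]", line.lower())
    for token in tokens + [None]:  # the trailing None flushes the last run
        is_word = token is not None and re.fullmatch(r"\w+", token)
        if is_word and token not in stop_words:
            run.append(token)
            continue
        # A stop word, punctuation mark, or end of line closes the run.
        for n in range(2, max_words + 1):
            for i in range(len(run) - n + 1):
                yield " ".join(run[i:i + n])
        run = []


def discover_phrases(lines, stop_words, min_phrase_count):
    """Count candidates over all lines and keep the frequent ones."""
    counts = Counter()
    for line in lines:
        counts.update(candidate_phrases(line, stop_words))
    return {p: c for p, c in counts.items() if c >= min_phrase_count}


def suggested_min_phrase_count(corpus_bytes):
    """Guideline from the table: 50 below 300 MB, 100 below 2 GB."""
    if corpus_bytes < 300 * 10**6:
        return 50
    if corpus_bytes < 2 * 10**9:
        return 100
    return None  # "larger values": left to the user to choose


lines = [
    "The front desk staff was friendly",
    "front desk staff at the hotel",
    "the front desk was slow",
]
stop_words = {"the", "was", "at"}  # toy list, stands in for the stop-file
print(discover_phrases(lines, stop_words, min_phrase_count=2))
```

Raising min_phrase_count trims rare candidates: with a threshold of 3, only "front desk" survives in this toy corpus.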

Dataset

The default configuration uses a subset of the OpinRank dataset, consisting of about 255,000 hotel reviews. You can use the following to cite the dataset:

@article{ganesan2012opinion,
  title={Opinion-based entity ranking},
  author={Ganesan, Kavita and Zhai, ChengXiang},
  journal={Information retrieval},
  volume={15},
  number={2},
  pages={116--150},
  year={2012},
  publisher={Springer} 
}

Contact

This repository is maintained by Kavita Ganesan. Please send me an e-mail or open a GitHub issue if you have questions.
