USCDataScience / AgePredictor

License: Apache-2.0
Age classification from text using PAN16, blogs, Fisher Callhome, and Cancer Forum

Programming Languages

Java, Python, Shell

Author Age Prediction

This is an author age categorizer that leverages the Apache OpenNLP Maximum Entropy Classifier. It takes a text sample and classifies it into one of the following age categories: xx-18|18-24|25-34|35-49|50-64|65-xx.
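
To make the category labels concrete, here is a minimal Python sketch that maps a numeric age onto them. It is not part of AgePredictor: the function name and bucket table are hypothetical, and because the labels xx-18 and 18-24 both mention 18, the sketch assumes each lower bound is inclusive (so an age of exactly 18 lands in 18-24).

```python
# Hypothetical helper: map a numeric age to one of the README's category labels.
# Assumption: each label's lower bound is inclusive, so age 18 falls in "18-24".
AGE_BUCKETS = [
    (18, "xx-18"),  # ages below 18
    (25, "18-24"),
    (35, "25-34"),
    (50, "35-49"),
    (65, "50-64"),
]


def age_category(age: float) -> str:
    """Return the category label for a non-negative age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for upper_bound, label in AGE_BUCKETS:
        if age < upper_bound:
            return label
    return "65-xx"  # everything from 65 upwards


if __name__ == "__main__":
    for sample in (12, 18, 33.25, 64, 65):
        print(sample, age_category(sample))
```

If the project treats the boundaries differently, only the numbers in the table need to change.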

Pre-Requisites

  1. Download Apache Spark 2.0.0 and place it in the local directory for this checkout. Once downloaded, unpack it with tar xvzf spark-2.0.0-bin-hadoop2.7.tgz
  2. Set the Spark home directory: export SPARK_HOME="spark-2.0.0-bin-hadoop2.7"
  3. Run bin/download-opennlp.sh to download the Apache OpenNLP models referenced below.
  4. Run mvn clean install to build the assembly jars. The key one you need is age-predictor-assembly/target/age-predictor-assembly-1.1-SNAPSHOT-jar-with-dependencies.jar. If you do not see this jar, check your Maven and Java setup. It should build fine with the following Java and Maven versions:
openjdk version "13.0.2" 2020-01-14
OpenJDK Runtime Environment (build 13.0.2+8)
OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode, sharing)
$ mvn --V
Apache Maven 3.6.1 (d66c9c0b3152b2e69ee9bac180bb8fcc8e6af555; 2019-04-04T12:00:29-07:00)
Maven home: /usr/local/Cellar/maven/3.6.1/libexec
Java version: 13.0.2, vendor: Oracle Corporation, runtime: /Library/Java/JavaVirtualMachines/openjdk-13.0.2.jdk/Contents/Home
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.15.7", arch: "x86_64", family: "mac"

QuickStart

  1. Follow the training instructions below to build a model/en-ageClassify.bin file:
    • bin/authorage AgeClassifyTrainer -model model/en-ageClassify.bin -lang en -data data/sample_train.txt -encoding UTF-8
  2. Run the age prediction on the sample data:
    • bin/authorage AgePredict ./model/classify-unigram.bin ./model/regression-global.bin data/sample_test.txt < data/sample_test.txt
  3. Run the age prediction and grep the predictions out of the output:
    • bin/authorage AgePredict ./model/classify-unigram.bin ./model/regression-global.bin data/sample_test.txt < data/sample_test.txt 2>&1 | grep "Prediction"
    • If you see output like the following from the above command, you're good!
Prediction: 33.25378998833527
Prediction: 31.67628280063772
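
If you would rather collect the predicted ages in a script than grep for them, the following Python sketch pulls the numbers out of captured output. It is only an illustration: the function name is hypothetical, and it assumes each prediction sits on its own line in the form "Prediction: <decimal number>", as in the sample output above.

```python
import re
from typing import Iterable, List

# Matches lines such as "Prediction: 33.25378998833527".
# Assumption: the value is a plain decimal number, as in the sample output.
PREDICTION_RE = re.compile(r"^Prediction:\s*(\d+(?:\.\d+)?)\s*$")


def extract_predictions(lines: Iterable[str]) -> List[float]:
    """Return the predicted ages found in the given output lines, in order."""
    predictions = []
    for line in lines:
        match = PREDICTION_RE.match(line.strip())
        if match:
            predictions.append(float(match.group(1)))
    return predictions


if __name__ == "__main__":
    captured = [
        "some log line",
        "Prediction: 33.25378998833527",
        "Prediction: 31.67628280063772",
    ]
    print(extract_predictions(captured))
```

Lines that do not match the pattern, such as log messages, are skipped.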

Usage

How to train an Age Classifier

Note: The training data should contain one sample per line, with each line starting with the age, or age category, followed by a tab and the text associated with that age.

Usage: bin/authorage AgeClassifyTrainer [-factory factoryName] [-featureGenerators featuregens] [-tokenizer tokenizer] -model modelFile [-params paramsFile] -lang language -data sampleData [-encoding charsetName]

Arguments description:
	-factory factoryName
		a sub-class of DoccatFactory where to get implementation and resources.
	-featureGenerators featuregens
		comma-separated feature generator classes. Bag of words is the default.
	-tokenizer tokenizer
		tokenizer implementation. WhitespaceTokenizer is used if not specified.
	-model modelFile
		output model file.
	-params paramsFile
		training parameters file.
	-lang language
		language which is being processed.
	-data sampleData
		data to be used, usually a file name.
	-encoding charsetName
		encoding for reading and writing text, if absent the system default is used.

Example Usage:

bin/authorage AgeClassifyTrainer -model model/en-ageClassify.bin -lang en -data data/sample_train.txt -encoding UTF-8

Training data format: age and text separated by a tab on each line, as in <AGE><Tab><TEXT>.
Sample training data:

12	I am just 12 year old
25	I am little bigger
35	I am mature
45	I am getting old
60	I am old like wine
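
Before training, it can help to check that a file really follows the <AGE><Tab><TEXT> layout. The Python sketch below is one way to do that. It is a hypothetical helper, not part of AgePredictor. It keeps the label as a string because the first field may be either an age or an age category, and it assumes blank lines can be skipped.

```python
from typing import Iterable, Iterator, Tuple


def parse_training_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (label, text) pairs from '<AGE><Tab><TEXT>' lines.

    Hypothetical helper. Blank lines are skipped; a line without a tab
    or with an empty field raises ValueError naming the line number.
    """
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        # Split on the first tab only, so tabs inside the text are preserved.
        label, sep, text = line.partition("\t")
        if not sep or not label.strip() or not text.strip():
            raise ValueError(f"line {number}: expected <AGE><Tab><TEXT>")
        yield label.strip(), text.strip()


if __name__ == "__main__":
    sample = ["12\tI am just 12 year old\n", "25\tI am little bigger\n"]
    for label, text in parse_training_lines(sample):
        print(label, "->", text)
```

To check a real file, pass an open file object, for example open("data/sample_train.txt", encoding="utf-8").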

How to evaluate an Age Classifier Model

Usage: bin/authorage AgeClassifyEvaluator -model model [-misclassified true|false] -data sampleData [-encoding charsetName]

Arguments description:
	-model model
		the model file to be evaluated.
	-misclassified true|false
		if true will print false negatives and false positives.
	-data sampleData
		data to be used, usually a file name.
	-encoding charsetName
		encoding for reading and writing text, if absent the system default is used.

Example Usage:

bin/authorage AgeClassifyEvaluator -model model/en-ageClassify.bin -data data/sample_test.txt -encoding UTF-8

How to run the Age Classifier

Note: Each document must be followed by an empty line to be detected as a separate case from the others.

Usage: bin/authorage AgeClassify model < documents
Usage: bin/authorage AgePredict ./model/classify-unigram.bin ./model/regression-global.bin data/sample_test.txt < data/sample_test.txt
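
The empty-line rule from the note above is easy to get wrong when the input is assembled by hand, so here is a minimal Python sketch that builds such an input from a list of texts. The function name is hypothetical. It also assumes that blank lines inside a document should be dropped, since they would otherwise be read as a document boundary.

```python
from typing import Iterable


def to_classifier_input(documents: Iterable[str]) -> str:
    """Join documents so that each one is followed by an empty line.

    Hypothetical helper. Blank lines inside a document are dropped, and
    documents that are empty after that are skipped.
    """
    chunks = []
    for doc in documents:
        lines = [ln for ln in doc.splitlines() if ln.strip()]
        if lines:
            # Document text, a newline, then one empty line as the separator.
            chunks.append("\n".join(lines) + "\n\n")
    return "".join(chunks)


if __name__ == "__main__":
    print(to_classifier_input(["I am just 12 year old", "I am old like wine"]), end="")
```

The resulting string can be written to a file and redirected into the classifier as in the usage lines above.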

Downloads

For AgePredict to work, you need to download en-pos-maxent.bin, en-sent.bin, and en-token.bin from http://opennlp.sourceforge.net/models-1.5/ and place them in model/opennlp/
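
A quick way to confirm the three model files are in place before running AgePredict is sketched below in Python. The function name is hypothetical. The file names and the model/opennlp directory come from the text above, and the sketch only checks that the files exist without downloading anything.

```python
from pathlib import Path
from typing import List

# File names taken from the Downloads section above.
REQUIRED_MODELS = ("en-pos-maxent.bin", "en-sent.bin", "en-token.bin")


def missing_models(model_dir: str = "model/opennlp") -> List[str]:
    """Return the required model files that are not present in model_dir."""
    directory = Path(model_dir)
    return [name for name in REQUIRED_MODELS if not (directory / name).is_file()]


if __name__ == "__main__":
    missing = missing_models()
    if missing:
        print("Missing OpenNLP models:", ", ".join(missing))
    else:
        print("All OpenNLP models are in place.")
```

Run it from the root of the checkout so that the relative path resolves correctly.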

Citation

If you use this work, please cite:

@inproceedings{hong2017ensemble,
  title={Ensemble Maximum Entropy Classification and Linear Regression for Author Age Prediction},
  author={Hong, Joey and Mattmann, Chris and Ramirez, Paul},
  booktitle={Information Reuse and Integration (IRI), 2017 IEEE 18th International Conference on},
  organization={IEEE},
  year={2017}
}

Contributors

  • Chris A. Mattmann, JPL & USC
  • Joey Hong, Caltech
  • Madhav Sharan, JPL & USC

License

Apache License, version 2
