polywock / Text2gender

Predict the author's gender from their text.

Programming Languages

python

Projects that are alternatives of or similar to Text2gender

Textclassification Keras
Text classification models implemented in Keras, including: FastText, TextCNN, TextRNN, TextBiRNN, TextAttBiRNN, HAN, RCNN, RCNNVariant, etc.
Stars: ✭ 621 (+4335.71%)
Mutual labels:  text-classification
Tf Rnn Attention
Tensorflow implementation of attention mechanism for text classification tasks.
Stars: ✭ 735 (+5150%)
Mutual labels:  text-classification
Eda nlp
Data augmentation for NLP, presented at EMNLP 2019
Stars: ✭ 902 (+6342.86%)
Mutual labels:  text-classification
Wikipedia2vec
A tool for learning vector representations of words and entities from Wikipedia
Stars: ✭ 655 (+4578.57%)
Mutual labels:  text-classification
Text Classification Pytorch
Text classification using deep learning models in Pytorch
Stars: ✭ 683 (+4778.57%)
Mutual labels:  text-classification
Nlp In Practice
Starter code to solve real world text data problems. Includes: Gensim Word2Vec, phrase embeddings, Text Classification with Logistic Regression, word count with pyspark, simple text preprocessing, pre-trained embeddings and more.
Stars: ✭ 790 (+5542.86%)
Mutual labels:  text-classification
Multi Class Text Classification Cnn Rnn
Classify Kaggle San Francisco Crime Description into 39 classes. Build the model with CNN, RNN (GRU and LSTM) and Word Embeddings on Tensorflow.
Stars: ✭ 570 (+3971.43%)
Mutual labels:  text-classification
Nlp tensorflow project
Use tensorflow to implement some NLP projects, e.g. classification, chatbot, NER, attention, QA, etc.
Stars: ✭ 27 (+92.86%)
Mutual labels:  text-classification
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+47442.86%)
Mutual labels:  text-classification
Text Mining
Text Mining in Python
Stars: ✭ 18 (+28.57%)
Mutual labels:  text-classification
Eda nlp for chinese
An implementation of the EDA paper for Chinese corpora. An EDA data augmentation tool for Chinese text. NLP data augmentation. Paper reading notes.
Stars: ✭ 660 (+4614.29%)
Mutual labels:  text-classification
Text Classification
Implementation of papers for text classification task on DBpedia
Stars: ✭ 682 (+4771.43%)
Mutual labels:  text-classification
Chatbot cn
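A chatbot for the finance and legal domains (with some chit-chat capability). Its main modules include information extraction, NLU, NLG, and a knowledge graph. The front end is integrated using Django, and RESTful interfaces for the NLP and KG components are already provided.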
Stars: ✭ 791 (+5550%)
Mutual labels:  text-classification
Nlp Recipes
Natural Language Processing Best Practices & Examples
Stars: ✭ 5,783 (+41207.14%)
Mutual labels:  text-classification
Concise Ipython Notebooks For Deep Learning
Ipython Notebooks for solving problems like classification, segmentation, generation using latest Deep learning algorithms on different publicly available text and image data-sets.
Stars: ✭ 23 (+64.29%)
Mutual labels:  text-classification
Meta
A Modern C++ Data Sciences Toolkit
Stars: ✭ 600 (+4185.71%)
Mutual labels:  text-classification
Lightnlp
A deep learning framework for natural language processing based on Pytorch and torchtext.
Stars: ✭ 739 (+5178.57%)
Mutual labels:  text-classification
Text classification
all kinds of text classification models and more with deep learning
Stars: ✭ 7,179 (+51178.57%)
Mutual labels:  text-classification
Bert language understanding
Pre-training of Deep Bidirectional Transformers for Language Understanding: pre-train TextCNN
Stars: ✭ 933 (+6564.29%)
Mutual labels:  text-classification
Text Classification Benchmark
Text classification benchmark
Stars: ✭ 18 (+28.57%)
Mutual labels:  text-classification

Author gender classification from text.

Use at your own risk: this project is not well supported or documented.

Trained on Reddit posts from r/AskMen and r/AskWomen. If I can say so myself, this is a clever, albeit lazy, way to get labelled data. Training was done on posts taken directly from those two subreddits, which introduces its own set of biases. For example, women who post on r/AskWomen may write in a distinct style inside that subreddit but not outside of it. To rectify this, you could instead find "women" users from r/AskWomen but use their posts from outside r/AskWomen, ideally from a subreddit that both men and women visit, like r/AskReddit.
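
The alternative sampling idea above can be sketched roughly as follows. This is only an illustration and not code from this repository: the post dictionaries, their "author", "subreddit", and "body" keys, and the function name relabel_outside_posts are all hypothetical. Labelling an author by the subreddit they post in is a heuristic, not ground truth.

```python
from collections import defaultdict

# Subreddits used as a (noisy) label source, as described in the text above.
LABEL_SUBREDDITS = {"AskMen": "male", "AskWomen": "female"}


def relabel_outside_posts(posts, neutral_subreddit="AskReddit"):
    """Label authors by the Ask* subreddit they post in, then return
    (body, label) pairs taken only from their posts in the neutral subreddit.

    `posts` is an iterable of dicts with the hypothetical keys
    "author", "subreddit", and "body".
    """
    posts = list(posts)

    # Pass 1: collect every label each author has earned.
    labels_by_author = defaultdict(set)
    for post in posts:
        label = LABEL_SUBREDDITS.get(post["subreddit"])
        if label is not None:
            labels_by_author[post["author"]].add(label)

    # Pass 2: keep neutral-subreddit posts from unambiguously labelled authors.
    examples = []
    for post in posts:
        if post["subreddit"] != neutral_subreddit:
            continue
        labels = labels_by_author.get(post["author"], set())
        # Skip authors with no label or with conflicting labels.
        if len(labels) == 1:
            examples.append((post["body"], next(iter(labels))))
    return examples
```

Authors who post in both r/AskMen and r/AskWomen are dropped here rather than guessed at; that is a design choice of this sketch, and a stricter or looser rule may work better in practice.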

The accuracy on real-world data still needs further investigation.

| group | length | accuracy | examples |
| --- | --- | --- | --- |
| all | < 250 | 67.56% | 48481 |
| all | 200 to 500 | 66.02% | 30715 |
| all | 500 to 1000 | 69.22% | 13600 |
| all | 1000 to 2000 | 72.99% | 3654 |
| all | > 2000 | 76.96% | 599 |
| male | < 250 | 65.98% | 23527 |
| male | 200 to 500 | 65.2% | 15275 |
| male | 500 to 1000 | 66.51% | 6346 |
| male | 1000 to 2000 | 69.99% | 1656 |
| male | > 2000 | 73.08% | 286 |
| female | < 250 | 69.06% | 24954 |
| female | 200 to 500 | 66.83% | 15440 |
| female | 500 to 1000 | 71.59% | 7254 |
| female | 1000 to 2000 | 75.48% | 1998 |
| female | > 2000 | 80.51% | 313 |
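
To get a single headline figure from the table above, the per-bucket accuracies can be averaged with each bucket weighted by its example count. The sketch below does this for the first (combined male and female) group of rows. It assumes the length buckets are disjoint, which the table does not state explicitly; the function name weighted_accuracy is made up for this example.

```python
# (accuracy in percent, number of examples) for the combined rows of the table.
OVERALL_ROWS = [
    (67.56, 48481),
    (66.02, 30715),
    (69.22, 13600),
    (72.99, 3654),
    (76.96, 599),
]


def weighted_accuracy(rows):
    """Average the per-bucket accuracies, weighting each by its example count."""
    total_examples = sum(count for _, count in rows)
    if total_examples == 0:
        raise ValueError("no examples to average over")
    return sum(acc * count for acc, count in rows) / total_examples


if __name__ == "__main__":
    print(f"{weighted_accuracy(OVERALL_ROWS):.2f}%")
```

Because most examples are short posts, the weighted figure stays close to the accuracy of the shortest buckets (roughly 67.6%), well below the accuracy reported for the longest texts.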

Use

  1. Install pipenv and learn how to use it.

  2. Download the required dependencies.

    pipenv install

  3. Install required NLTK data.

    pipenv run python3 -m textblob.download_corpora lite

  4. Predict gender by piping in a text file. This prints a value between 0 and 1: male if above 0.5, otherwise female.

    cat some_text.txt | pipenv run python3 predict.py
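
The decision rule in step 4 can be written down as a small helper. This is a sketch rather than part of the project: the function name label_from_score is hypothetical, and it assumes the score has already been parsed into a float from whatever predict.py prints.

```python
def label_from_score(score, threshold=0.5):
    """Map a 0-to-1 score to a label: male if above the threshold, otherwise female."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score!r}")
    return "male" if score > threshold else "female"


print(label_from_score(0.73))  # prints "male"
```

A score of exactly 0.5 maps to "female" here, following the "above 0.5" wording in step 4.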

Train your own model (not required).

  1. Install the required developer dependencies (also ensure you have sqlite3 installed).

    pipenv install --dev

  2. Install required NLTK data.

    pipenv run python3 -m textblob.download_corpora lite

  3. Run pipenv run python3 download.py to download Reddit posts using the PushShift API. This runs indefinitely until you interrupt the process; I recommend collecting around 200k posts. The posts are saved to data.db using sqlite3, under a "posts" table.

  4. Run pipenv run python3 transform.py to transform the posts into training data. Output will be stored in data.db under the examples table.

  5. Run pipenv run python3 generate_model.py to train and test the model. The model weights will be saved to data/model_weights.json and data/model_biases.json.

  6. Predict gender by piping in a text file.

    cat some_text.txt | pipenv run python3 predict.py
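
Since the download step runs until interrupted, it can help to check how many rows have been collected so far. The sketch below counts the rows in a table of data.db using Python's built-in sqlite3 module. Only the file name and the table names ("posts", "examples") come from the steps above; the helper count_rows is hypothetical and makes no assumption about the columns.

```python
import sqlite3
from contextlib import closing


def count_rows(db_path, table):
    """Return the number of rows in `table`, or None if the table does not exist."""
    # Note: sqlite3.connect creates an empty database file if db_path is missing.
    with closing(sqlite3.connect(db_path)) as conn:
        exists = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table,),
        ).fetchone()
        if exists is None:
            return None
        # Table names cannot be bound as parameters, so quote the identifier.
        quoted = '"' + table.replace('"', '""') + '"'
        (count,) = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()
        return count
```

For example, count_rows("data.db", "posts") could be called from a second terminal to decide when to stop the download. Reading while another process is writing may occasionally fail with a "database is locked" error, in which case retrying is usually enough.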
