Tutorial: Machine Learning with Text in scikit-learn

Presented by Kevin Markham at PyCon on May 28, 2016. Watch the complete tutorial video on YouTube.

Description

Although numeric data is easy to work with in Python, most knowledge created by humans is actually raw, unstructured text. By learning how to transform text into data that is usable by machine learning models, you drastically increase the amount of data that your models can learn from. In this tutorial, we'll build and evaluate predictive models from real-world text using scikit-learn.

Objectives

By the end of this tutorial, attendees will be able to confidently build a predictive model from their own text-based data, including feature extraction, model building and model evaluation.

Required Software

Attendees will need to bring a laptop with scikit-learn and pandas (and their dependencies) already installed. Installing the Anaconda distribution of Python is an easy way to accomplish this. Both Python 2 and 3 are welcome.

I will be leading the tutorial using the IPython/Jupyter notebook, and have added a pre-written notebook to this repository. I have also created a Python script that is identical to the notebook, which you can use in the Python environment of your choice.

Tutorial Files

Prerequisite Knowledge

Attendees of this tutorial should be comfortable working in Python, should understand the basic principles of machine learning, and should have at least basic experience with both pandas and scikit-learn. However, no knowledge of advanced mathematics is required.

  • If you need a refresher on scikit-learn or machine learning, I recommend reviewing the notebooks and/or videos from my scikit-learn video series, focusing on videos 1-5 as well as video 9. Alternatively, you may prefer reading the tutorials from the scikit-learn documentation.
  • If you need a refresher on pandas, I recommend reviewing the notebook and/or videos from my pandas video series. Alternatively, you may prefer reading this 3-part tutorial.

Abstract

It can be difficult to figure out how to work with text in scikit-learn, even if you're already comfortable with the scikit-learn API. Many questions immediately come up: Which vectorizer should I use, and why? What's the difference between a "fit" and a "transform"? What's a document-term matrix, and why is it so sparse? Is it okay for my training data to have more features than observations? What's the appropriate machine learning model to use? And so on...

In this tutorial, we'll answer all of those questions, and more! We'll start by walking through the vectorization process in order to understand the input and output formats. Then we'll read a simple dataset into pandas, and immediately apply what we've learned about vectorization. We'll move on to the model building process, including a discussion of which model is most appropriate for the task. We'll evaluate our model a few different ways, and then examine the model for greater insight into how the text is influencing its predictions. Finally, we'll practice this entire workflow on a new dataset, and end with a discussion of which parts of the process are worth tuning for improved performance.
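As a rough illustration of the "fit" versus "transform" distinction and of what a document-term matrix looks like, here is a minimal sketch using scikit-learn's `CountVectorizer` with its default settings. The three-document corpus and the variable names are made up for this example and are not taken from the tutorial notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
train_docs = ["call you tonight", "call me a cab", "please call me please"]

vect = CountVectorizer()

# "fit" learns the vocabulary from the training documents
vect.fit(train_docs)

# "transform" uses that fitted vocabulary to build a document-term matrix:
# one row per document, one column per vocabulary term
train_dtm = vect.transform(train_docs)

# Columns are ordered alphabetically by term; the default tokenizer
# drops single-character tokens such as "a"
vocab = sorted(vect.vocabulary_)
print(vocab)
print(train_dtm.toarray())

# New text is transformed with the SAME vocabulary; words that were not
# seen during fit (here "don't") are ignored
new_dtm = vect.transform(["please don't call me"])
print(new_dtm.toarray())
```

The key point is that the vectorizer is fitted once, on training text only, and then reused to transform any later text so that the columns always line up.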

Detailed Outline

  1. Model building in scikit-learn (refresher)
  2. Representing text as numerical data
  3. Reading a text-based dataset into pandas
  4. Vectorizing our dataset
  5. Building and evaluating a model
  6. Comparing models
  7. Examining a model for further insight
  8. Practicing this workflow on another dataset
  9. Tuning the vectorizer (discussion)
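The outline above can be sketched end to end in a few lines. This is only an approximation under stated assumptions: the eight messages and their labels are invented, the column names `message` and `label` are hypothetical, Multinomial Naive Bayes is used as one reasonable model choice, and `train_test_split` is imported from `sklearn.model_selection`, which is where current scikit-learn releases keep it.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Invented toy dataset (a real dataset would be read with pd.read_csv or similar)
df = pd.DataFrame({
    "message": [
        "win a free prize now",
        "free cash offer click now",
        "claim your free prize today",
        "urgent offer win cash now",
        "are we meeting for lunch today",
        "see you at the office tomorrow",
        "can you call me after lunch",
        "thanks for the notes from the meeting",
    ],
    "label": ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"],
})

# Split BEFORE vectorizing so the vocabulary is learned from training text only
X_train, X_test, y_train, y_test = train_test_split(
    df["message"], df["label"], test_size=0.25, random_state=1, stratify=df["label"]
)

# Learn the vocabulary on the training set, then reuse it for the test set
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# Fit a Multinomial Naive Bayes classifier on the term counts
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

# Evaluate on the held-out messages
y_pred = nb.predict(X_test_dtm)
acc = accuracy_score(y_test, y_pred)
print("test accuracy:", acc)
```

With a dataset this small the accuracy figure is not meaningful; the sketch is only meant to show the order of the steps.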

About the Instructor

Kevin Markham is the founder of Data School and the former lead instructor for General Assembly's Data Science course in Washington, DC. He is passionate about teaching data science to people who are new to the field, regardless of their educational and professional backgrounds, and he enjoys teaching both online and in the classroom. Kevin's professional focus is supervised machine learning, which led him to create the popular scikit-learn video series for Kaggle. He has a degree in Computer Engineering from Vanderbilt University.

Recommended Resources

Text classification:

  • Read Paul Graham's classic post, A Plan for Spam, for an overview of a basic text classification system using a Bayesian approach. (He also wrote a follow-up post about how he improved his spam filter.)
  • Coursera's Natural Language Processing (NLP) course has video lectures on text classification, tokenization, Naive Bayes, and many other fundamental NLP topics. (Here are the slides used in all of the videos.)
  • Automatically Categorizing Yelp Businesses discusses how Yelp uses NLP and scikit-learn to solve the problem of uncategorized businesses.
  • How to Read the Mind of a Supreme Court Justice discusses CourtCast, a machine learning model that predicts the outcome of Supreme Court cases using text-based features only. (The CourtCast creator wrote a post explaining how it works, and the Python code is available on GitHub.)
  • Identifying Humorous Cartoon Captions is a readable paper about identifying funny captions submitted to the New Yorker Caption Contest.
  • In this PyData video (50 minutes), Facebook explains how they use scikit-learn for sentiment classification by training a Naive Bayes model on emoji-labeled data.

Naive Bayes and logistic regression:

scikit-learn:

  • The scikit-learn user guide includes an excellent section on text feature extraction that includes many details not covered in today's tutorial.
  • The user guide also describes the performance trade-offs involved when choosing between sparse and dense input data representations.
  • To learn more about evaluating classification models, watch video #9 from my scikit-learn video series (or just read the associated notebook).
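To make the sparse versus dense distinction concrete, the following sketch compares how many cells a small document-term matrix has with how many values are actually stored. The four-document corpus is invented for illustration, and the numbers only hint at the effect, which grows much larger on real text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented corpus in which no two documents share a word
docs = ["red apple", "green pear", "blue sky", "grey cloud"]

# fit_transform returns a SciPy sparse matrix, not a NumPy array
dtm = CountVectorizer().fit_transform(docs)

n_cells = dtm.shape[0] * dtm.shape[1]  # cells a dense array would hold
n_stored = dtm.nnz                     # non-zero values the sparse matrix stores
print("cells:", n_cells, "stored:", n_stored, "density:", n_stored / n_cells)

# Converting to dense is possible, but memory use grows with
# documents x vocabulary size, so it is best avoided on large corpora
dense = dtm.toarray()
print(dense.shape)
```

Many scikit-learn estimators, including `MultinomialNB` and `LogisticRegression`, accept the sparse matrix directly, so the dense conversion is usually only needed for inspection.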

pandas:
