justmarkham / Trump Lies

Tutorial: Web scraping in Python with Beautiful Soup

Projects that are alternatives of or similar to Trump Lies

Learnpythonforresearch
This repository provides everything you need to get started with Python for (social science) research.
Stars: ✭ 163 (-18.91%)
Mutual labels:  jupyter-notebook, data-science, pandas, web-scraping, tutorial
Pycon 2019 Tutorial
Data Science Best Practices with pandas
Stars: ✭ 410 (+103.98%)
Mutual labels:  jupyter-notebook, data-science, pandas, tutorial
Pandas Videos
Jupyter notebook and datasets from the pandas Q&A video series
Stars: ✭ 1,716 (+753.73%)
Mutual labels:  jupyter-notebook, data-science, pandas, tutorial
Dat8
General Assembly's 2015 Data Science course in Washington, DC
Stars: ✭ 1,516 (+654.23%)
Mutual labels:  jupyter-notebook, data-science, pandas, web-scraping
Functional intro to python
[tutorial] A functional, Data Science focused introduction to Python
Stars: ✭ 228 (+13.43%)
Mutual labels:  jupyter-notebook, data-science, pandas, tutorial
Data Science Hacks
Data Science Hacks consists of tips, tricks to help you become a better data scientist. Data science hacks are for all - beginner to advanced. Data science hacks consist of python, jupyter notebook, pandas hacks and so on.
Stars: ✭ 273 (+35.82%)
Mutual labels:  jupyter-notebook, data-science, dataset, pandas
Python Introducing Pandas
Introduction to pandas Treehouse course
Stars: ✭ 24 (-88.06%)
Mutual labels:  jupyter-notebook, data-science, pandas, tutorial
30 Days Of Python
Learn Python for the next 30 (or so) Days.
Stars: ✭ 1,748 (+769.65%)
Mutual labels:  pandas, web-scraping, tutorial
Dtale
Visualizer for pandas data structures
Stars: ✭ 2,864 (+1324.88%)
Mutual labels:  jupyter-notebook, data-science, pandas
Data Science For Marketing Analytics
Achieve your marketing goals with the data analytics power of Python
Stars: ✭ 127 (-36.82%)
Mutual labels:  jupyter-notebook, data-science, pandas
Py Quantmod
Powerful financial charting library based on R's Quantmod | http://py-quantmod.readthedocs.io/en/latest/
Stars: ✭ 155 (-22.89%)
Mutual labels:  jupyter-notebook, data-science, pandas
Python Data Science Handbook
A Chinese translation of Jake Vanderplas' "Python Data Science Handbook". 《Python数据科学手册》在线Jupyter notebook中文翻译
Stars: ✭ 102 (-49.25%)
Mutual labels:  jupyter-notebook, data-science, tutorial
Scipy con 2019
Tutorial Sessions for SciPy Con 2019
Stars: ✭ 142 (-29.35%)
Mutual labels:  jupyter-notebook, data-science, tutorial
Andrew Ng Notes
This is Andrew NG Coursera Handwritten Notes.
Stars: ✭ 180 (-10.45%)
Mutual labels:  jupyter-notebook, data-science, pandas
Seaborn Tutorial
This repository is my attempt to help Data Science aspirants gain necessary Data Visualization skills required to progress in their career. It includes all the types of plot offered by Seaborn, applied on random datasets.
Stars: ✭ 114 (-43.28%)
Mutual labels:  jupyter-notebook, data-science, pandas
Sigmoidal ai
Tutoriais de Python, Data Science, Machine Learning e Deep Learning - Sigmoidal
Stars: ✭ 103 (-48.76%)
Mutual labels:  jupyter-notebook, data-science, pandas
Machine Learning With Python
Practice and tutorial-style notebooks covering wide variety of machine learning techniques
Stars: ✭ 2,197 (+993.03%)
Mutual labels:  jupyter-notebook, data-science, pandas
Imodels
Interpretable ML package 🔍 for concise, transparent, and accurate predictive modeling (sklearn-compatible).
Stars: ✭ 194 (-3.48%)
Mutual labels:  jupyter-notebook, data-science, tutorial
Shape Detection
🟣 Object detection of abstract shapes with neural networks
Stars: ✭ 170 (-15.42%)
Mutual labels:  jupyter-notebook, dataset, tutorial
Web Database Analytics
Web scraping and related analytics using Python tools
Stars: ✭ 175 (-12.94%)
Mutual labels:  jupyter-notebook, data-science, web-scraping

Web scraping the President's lies in 16 lines of Python

This repository contains the Jupyter notebook and dataset from Data School's introductory web scraping tutorial. All that is required to follow along is a basic understanding of the Python programming language.

By the end of the tutorial, you will be able to scrape data from a static web page using the requests and Beautiful Soup libraries, and export that data into a structured text file using the pandas library.

You can also watch the tutorial on YouTube.

Motivation

On July 21, 2017, the New York Times updated an opinion article called Trump's Lies, detailing every public lie the President has told since taking office. Because this is a newspaper, the information was (of course) published as a block of text:

Screenshot of the article

This is a great format for human consumption, but it can't easily be understood by a computer. In this tutorial, we'll extract the President's lies from the New York Times article and store them in a structured dataset.

Screenshot of the DataFrame

Outline of the tutorial

  • What is web scraping?
  • Examining the New York Times article
    • Examining the HTML
    • Fact 1: HTML consists of tags
    • Fact 2: Tags can have attributes
    • Fact 3: Tags can be nested
  • Reading the web page into Python
  • Parsing the HTML using Beautiful Soup
    • Collecting all of the records
    • Extracting the date
    • Extracting the lie
    • Extracting the explanation
    • Extracting the URL
    • Recap: Beautiful Soup methods and attributes
  • Building the dataset
    • Applying a tabular data structure
    • Exporting the dataset to a CSV file
  • Summary: 16 lines of Python code
    • Appendix A: Web scraping advice
    • Appendix B: Web scraping resources
    • Appendix C: Alternative syntax for Beautiful Soup

16 lines of Python code

Just want to see the code? Here it is:

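# Fetch the article's HTML
import requests
r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')

# Parse the HTML and collect every record (one <span class="short-desc"> per lie)
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all('span', attrs={'class':'short-desc'})

# Extract the four fields from each record
records = []
for result in results:
    date = result.find('strong').text[0:-1] + ', 2017'  # drop the trailing character, add the year
    lie = result.contents[1][1:-2]                      # strip the opening quote and the closing quote plus space
    explanation = result.find('a').text[1:-1]           # strip the surrounding parentheses
    url = result.find('a')['href']                      # link to the supporting source
    records.append((date, lie, explanation, url))

# Build a DataFrame, convert the dates, and export to CSV
import pandas as pd
df = pd.DataFrame(records, columns=['date', 'lie', 'explanation', 'url'])
df['date'] = pd.to_datetime(df['date'])
df.to_csv('trump_lies.csv', index=False, encoding='utf-8')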

Want to understand the code? Read the tutorial!
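
To see what the slicing inside the loop does without fetching the page, here is a minimal sketch that runs the same four extraction steps on a single hard-coded record. The HTML string is hypothetical: its tag layout is only assumed to match what the loop expects (a strong tag for the date, a bare text node for the quote, and an a tag for the explanation and link), and the quote, explanation, and URL are placeholders rather than data from the article.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed to resemble one record on the page.
# The text and URL are placeholders, not content from the article.
html = (
    '<span class="short-desc"><strong>Jan. 21 </strong>'
    '“Example claim.” '
    '<span class="short-truth"><a href="https://example.com/source">'
    '(Example explanation.)</a></span></span>'
)

soup = BeautifulSoup(html, 'html.parser')
result = soup.find('span', attrs={'class': 'short-desc'})

# The <strong> text ends with one extra character, so drop it and add the year.
date = result.find('strong').text[0:-1] + ', 2017'

# contents[1] is the text node between <strong> and the inner <span>.
# Slice off the opening quote mark and the closing quote mark plus space.
lie = result.contents[1][1:-2]

# The link text is wrapped in parentheses, so drop the first and last character.
explanation = result.find('a').text[1:-1]

# Tag attributes are read with dictionary-style indexing.
url = result.find('a')['href']

print(date)         # Jan. 21, 2017
print(lie)          # Example claim.
print(explanation)  # Example explanation.
print(url)          # https://example.com/source
```

The slice offsets depend entirely on the surrounding punctuation and whitespace, so if the page's markup differs from this assumed layout, the indices would need to be adjusted.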
