Introduction to Data Science with R
François Briatte
Spring 2017. Work in progress, not taught right now.
Syllabus
Participate
- Create a GitHub account (free).
- To ask a question, use the issues.
- To share your notes, use the wiki.
For strictly personal questions, email me.
1. Setup
Code
The scripts below are short demos that will give you an idea of what you can accomplish from the command line or from R, using a selection of packages designed for data import/export, manipulation and visualization.
Read
- Peng 2016b, ch. 3 (optional)
- Urdan 2010, ch. 1
- Zumel and Mount 2014, ch. 1
See also
- Briatte, F. R as a Data Science Language
- Bryan, J. et al. Happy Git and GitHub for the useR
- Center for Government Excellence. Data-Science Cheatsheet
- DataCamp. DataChats
- Deleneuville, M. Les stratégies open data des 20 plus grandes villes françaises
- Free Software Foundation. What is Free Software?
- Gillespie, C. and Lovelace, R. Efficient Learning
- McNeill, M. Base R - Cheat Sheet
- Schrodt, P. 7 Reasons Political Science “Math Camp” is a Complete Waste of Your Time
- Ushey, K. What is a Function? (difficult)
- Wickham, H. Data Science: How is it Different to Statistics?
2. Data I/O
Code
The scripts below all show how to use dplyr for data manipulation, readr or readxl for data import, and ggplot2 for plotting. They also show how to use a few more packages that you might find useful.
- New York Times Brexit Coverage
  - Demo: data reshaping with tidyr (switching between 'long' and 'wide' formats).
  - Background: Dolšak, N. 2016. Manufacturing Dissent: How The New York Times Covered the Brexit Vote.
  - Source: Dolšak, N. and Prakash, A. 2016. The New York Times’ Coverage of the Brexit Vote.
- Journalists Killed Since 1992
  - Demos: date manipulation with lubridate; country name manipulation with countrycode.
  - Background: Committee to Protect Journalists. 2016. Journalists Killed: Methodology.
  - Source: Committee to Protect Journalists. 2016. Journalists Killed since 1992.
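As a rough illustration of what the two demos above cover, here is a minimal sketch of reshaping with tidyr and date parsing with lubridate. The data frame, its column names and its values are made up for the example and do not come from the course scripts. The sketch assumes tidyr 1.0.0 or later, which provides `pivot_longer()` and `pivot_wider()`; older scripts may use `gather()` and `spread()` instead.

```r
library(tidyr)
library(lubridate)

# Hypothetical 'wide' data: one row per outlet, one column per month
wide <- data.frame(
  outlet = c("A", "B"),
  `2016-05` = c(3, 1),
  `2016-06` = c(10, 4),
  check.names = FALSE  # keep the month labels as column names
)

# Wide to long: one row per outlet-month pair
long <- pivot_longer(
  wide,
  cols = -outlet,
  names_to = "month",
  values_to = "articles"
)

# Parse the month labels as dates (first day of each month)
long$date <- ymd(paste0(long$month, "-01"))

# Long back to wide: one column per month again
wide_again <- pivot_wider(
  long[, c("outlet", "month", "articles")],
  names_from = month,
  values_from = articles
)
```

The long format is usually the one expected by dplyr and ggplot2, while the wide format is often easier to read as a table.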
Read
- Grolemund 2014, App. A-B, ch. 1-2
- Grolemund and Wickham 2016, ch. 11
- Peng 2016a, ch. 3
- Peng 2016b, ch. 6-7, 8 (optional), 9, 12 (optional)
- Zumel and Mount 2014, ch. 2
See also
- Bryan, J. Sanesheets: A Rant About Spreadsheets (related: video presentation, slides, sources)
- Chen, X. et al. Awesome Public Datasets
- Damico, A.J. MonetDBLite Because Fast
- Gillespie, C. and Lovelace, R. Efficient Input/Output
- Hester, J. Database Best Practices. DBI, odbc and pool
- Leek, J. The Four Eras of Data
- Luraschi, J. Importing Modern Data into R
- MonetDB. MonetDB.R Tutorial
- Onuoha, M. On Missing Data Sets
- Shafranovich, Y. IETF RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files
- Tennison, J., Kellogg, G. and Herman, I. W3C: Model for Tabular Data and Metadata on the Web
3. Manipulation
Code
Read
- Grolemund 2014, ch. 3-5
- Peng 2016a, ch. 4-5
- Peng 2016b, ch. 13
- Urdan 2010, ch. 2-3, 6-7, 8-9 (optional), 13-14 (optional)
- Wickham 2014a, ch. 2-3 (optional)
- Wickham 2014b (optional)
- Zumel and Mount 2014, ch. 3-4, 5-9 (optional)
See also
- Bryan, J. Data Rectangling (video presentation)
- Bryan, J. Tidy Data Lesson using Lord of the Rings Data
- Gillespie, C. and Lovelace, R. Efficient Data Carpentry
- Kopacka, J. Basic Regular Expressions in R - Cheat Sheet
- Mount, J. The Case for Index-Free Data
- White, J.M. Modes, Medians and Means: A Unifying Perspective
- Robinson, D.G. broom: Converting Statistical Models to Tidy Data Frames
- RStudio. Data Wrangling with dplyr and tidyr - Cheat Sheet
- Silge, J. and Robinson, D. Tidy Text Mining
- Wickham, H. The Split-Apply-Combine Strategy for Data Analysis
- Wickham, H. Tidyverse
4. Visualization
Code
Read
- Grolemund and Wickham 2016, ch. 28
- Peng 2016a, ch. 6-7, 15-16
- Wickham 2010 (optional)
- Wickham, Cook and Hofmann 2015 (optional)
See also
The links below point to (mostly) ggplot2-related resources, but data visualization is much, much more than that: see the resources listed in awesome-visualization-research and awesome-dataviz.
- Emaasit, D. ggplot2 Extensions
- Healy, K. and Moody, J. Data Visualization in Sociology
- Hijmans, R. et al. GADM: Global Administrative Areas
- Nenadic, A. Generating Google Maps out of Google Spreadsheets
- Ognyanova, K. Network Visualization with R
- RStudio. Data Visualization with ggplot2 - Cheat Sheet
- Scavetta, R. Introducing the Grammar of Graphics Plotting Concept
- Tufte, E. Edward Tufte Notebooks
- Tufte, E. The Future of Data Analysis
- Unwin, A. Graphical Data Analysis with R
- Wickham, H. ggplot2 Documentation
References
- Grolemund, G. 2014. Hands-On Programming with R
- Grolemund, G. and Wickham, H. 2016. R for Data Science
- Peng, R.D. 2016a. Exploratory Data Analysis with R
- Peng, R.D. 2016b. R Programming for Data Science
- Urdan, T.C. 2010. Statistics in Plain English
- Wickham, H. 2010. A Layered Grammar of Graphics
- Wickham, H. 2014a. Advanced R
- Wickham, H. 2014b. Tidy Data
- Wickham, H., Cook, D. and Hofmann, H. 2015. Visualizing Statistical Models
- Zumel, N. and Mount, J. 2014. Practical Data Science with R