keberwein / mlbgameday

Licence: other
Multi-core processing of 'Gameday' data from Major League Baseball Advanced Media. Additional tools to parallelize large data sets and write them to a database.

mlbgameday

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Why mlbgameday?

mlbgameday is designed to facilitate extract, transform, and load (ETL) for MLBAM “Gameday” data. The package is optimized for parallel processing of data that may be larger than memory. Other packages in the R universe were built to perform statistics and visualizations on these data, but mlbgameday is concerned primarily with data collection. More uses of these data can be found in the pitchRx, openWAR, and baseballr packages.

Install

  • The stable version from CRAN:
install.packages("mlbgameday")
  • The latest development version from GitHub:
devtools::install_github("keberwein/mlbgameday")

Basic Usage

Although the package is optimized for parallel processing, it also works without a registered parallel backend. When querying only a single day's data, a parallel backend may not provide much additional performance. However, a parallel backend is recommended for larger data sets, where it can make the process several times faster, depending on the number of cores available.

library(mlbgameday)

innings_df <- get_payload(start = "2017-04-03", end = "2017-04-04")

Take a peek at the data.

head(innings_df$atbat, 1)
#>   num b s o start_tfs       start_tfs_zulu batter stand b_height pitcher
#> 1   1 2 2 1    170552 2017-04-03T17:05:52Z 543829     L     5-11  544931
#>   p_throws                                                  des
#> 1        R Dee Gordon lines out to left fielder Jayson Werth.  
#>                                                                des_es
#> 1 Dee Gordon batea línea de out a jardinero izquierdo Jayson Werth.  
#>   event_num   event     event_es home_team_runs away_team_runs inning
#> 1        11 Lineout Línea de Out              0              0      1
#>   next_ inning_side
#> 1     Y         top
#>                                                                                                                      url
#> 1 http://gd2.mlb.com/components/game/mlb//year_2017/month_04/day_03/gid_2017_04_03_miamlb_wasmlb_1/inning/inning_all.xml
#>         date                    gameday_link score
#> 1 2017-04-03 /gid_2017_04_03_miamlb_wasmlb_1  <NA>
#>                              play_guid event2 event2_es event3 event3_es
#> 1 76e23666-26f1-4339-967f-c6f759d864f4   <NA>      <NA>   <NA>      <NA>
#>      batter_name      pitcher_name
#> 1 Devaris Gordon Stephen Strasburg
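
The example above reads one element of the returned object with `$atbat`, which suggests a named list of data frames. The following is a minimal sketch, assuming that structure, for getting a quick overview of what came back. The `payload` object is a small hypothetical stand-in so the snippet runs without a network connection; the real tables have many more columns.

```r
# Hypothetical stand-in for the object returned by get_payload().
payload <- list(
  atbat = data.frame(num = 1:2,
                     pitcher_name = c("Stephen Strasburg", "Stephen Strasburg")),
  pitch = data.frame(num = c(1, 1, 2),
                     pitch_type = c("FF", "CU", "FF"))
)

# Names of the tables and the number of rows in each.
overview <- vapply(payload, nrow, integer(1))
print(overview)
```

With a real payload, replacing `payload` with `innings_df` should give the same kind of summary.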

Parallel Processing

The package's internal functions are optimized to work with the doParallel package. By default, R uses a single CPU core. The doParallel package lets us use several cores, which execute tasks simultaneously. For a standard regular season covering all teams, the function has to process more than 2,400 individual files, which can take quite some time depending on your system. Parallel processing speeds this up several times over, depending on how many processor cores we choose to use.

library(mlbgameday)
library(doParallel)

# First we need to register our parallel cluster.
# Set the number of cores to use as the machine's maximum number of cores minus 1 for background processes.
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)  
registerDoParallel(cl)

# Then run the get_payload function as normal.
innings_df <- get_payload(start = "2017-04-03", end = "2017-04-10")

# Don't forget to stop the cluster when finished.
# A cluster created with makeCluster() is stopped with stopCluster().
stopCluster(cl)
rm(cl)

Note: The mlbgameday package is intended for use on a single machine, using multiple cores. However, it may be possible to use a cluster of multiple machines as well. For more on parallel processing, please see the package vignettes.
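
To illustrate the general idea of spreading per-day work across cores, here is a rough sketch that uses only the base `parallel` package rather than the doParallel backend the package itself relies on. The `process_day()` function is a hypothetical stand-in for the download-and-parse step; it is not part of mlbgameday.

```r
library(parallel)

# Hypothetical stand-in for the per-day work; the real package downloads
# and parses Gameday files at this point.
process_day <- function(day) {
  data.frame(date = day, n_chars = nchar(day), stringsAsFactors = FALSE)
}

# One task per calendar day in the requested range.
days <- format(seq(as.Date("2017-04-03"), as.Date("2017-04-10"), by = "day"))

# Use two workers so the sketch is cheap to run on most machines.
cl <- makeCluster(2)
results <- tryCatch(
  parLapply(cl, days, process_day),
  finally = stopCluster(cl)  # always release the workers, even on error
)

# Combine the per-day pieces into one data frame.
combined <- do.call(rbind, results)
```

Wrapping the parallel call in `tryCatch(..., finally = ...)` is one way to make sure the workers are shut down even if a task fails.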

Databases

When collecting several seasons' worth of data, the data may become larger than memory. In that case, the mlbgameday package includes functionality to break the data into "chunks" and load them into a database. Database connections are provided by the DBI package, which supports most modern relational databases. Below is an example that creates a SQLite database in our working directory and populates it with MLBAM Gameday data. Although this technique is fast, it is also a system-intensive process. The authors of mlbgameday suggest loading no more than a single season per R session.

library(mlbgameday)
library(doParallel)
library(DBI)
library(RSQLite)

# First we need to register our parallel cluster.
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)  
registerDoParallel(cl)

# Create the database in our working directory.
con <- dbConnect(RSQLite::SQLite(), dbname = "gameday.sqlite3")

# Collect all games, including pre- and post-season, for the 2016 season.
get_payload(start = "2016-01-01", end = "2017-01-01", db_con = con)

# Don't forget to stop the cluster and close the database connection when finished.
stopCluster(cl)
dbDisconnect(con)
rm(cl, con)

For a more in-depth look at reading and writing to databases, please see the package vignettes.
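
If a full season is still too heavy for a given machine, one option is to split the date range on the user side and load one piece at a time. The sketch below is not the package's internal chunking; `month_chunks()` is a hypothetical helper written for illustration, and it assumes the start date falls on the first day of a month. Whether repeated calls with the same `db_con` append cleanly to existing tables is also an assumption worth checking against the vignettes.

```r
# Hypothetical helper: split a date range into month-sized pieces.
# Assumes `start` is the first day of a month.
month_chunks <- function(start, end) {
  starts <- seq(as.Date(start), as.Date(end), by = "month")
  ends   <- c(starts[-1] - 1, as.Date(end))
  data.frame(start = format(starts), end = format(ends),
             stringsAsFactors = FALSE)
}

chunks <- month_chunks("2016-04-01", "2016-06-30")
print(chunks)

# Each row could then be passed to get_payload(), for example:
# for (i in seq_len(nrow(chunks))) {
#   get_payload(start = chunks$start[i], end = chunks$end[i], db_con = con)
# }
```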

Gameday Data Sets

Those familiar with Carson Sievert's pitchRx package will probably recognize the default data format returned by the get_payload() function. The format was intentionally designed to resemble the data returned by pitchRx, for the benefit of those who keep persistent databases. The default data set returned is "inning_all"; however, there are several more options, including:

  • inning_hit

  • bis_boxscore

  • game_events

  • linescore

For example, the following will query the linescore data set.

library(mlbgameday)

linescore_df <- get_payload(start = "2017-04-03", end = "2017-04-04", dataset = "linescore")

Visualization

The mlbgameday package is data-centric and does not provide any built-in visualization tools. However, there are several excellent visualization packages available for the R language. Below is a short example of what can be done with ggplot2. For more examples, please see the package vignettes.

First, get the data.

library(mlbgameday)
library(dplyr)

# Grab some Gameday data. We're specifically looking for Jake Arrieta's no-hitter.
gamedat <- get_payload(start = "2016-04-21", end = "2016-04-21")

# Join the pitch table with the atbat table, then subset to only Arrieta's pitches.
pitches <- inner_join(gamedat$pitch, gamedat$atbat, by = c("num", "url")) %>%
    subset(pitcher_name == "Jake Arrieta")

library(ggplot2)

# basic example
ggplot() +
    geom_point(data=pitches, aes(x=px, y=pz, shape=type, col=pitch_type)) +
    coord_equal() + geom_path(aes(x, y), data = mlbgameday::kzone)

library(ggplot2)

# basic example with stand.
ggplot() +
    geom_point(data=pitches, aes(x=px, y=pz, shape=type, col=pitch_type)) +
    facet_grid(. ~ stand) + coord_equal() +
    geom_path(aes(x, y), data = mlbgameday::kzone)
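
Before plotting, it can help to check how many pitches fall into each group used by the color and facet mappings. The sketch below tabulates pitch type against batter stance with base R. The `pitches` data frame here is a tiny hypothetical stand-in that contains only the two columns used, so the snippet runs without downloading anything.

```r
# Hypothetical stand-in for the joined `pitches` data frame.
pitches <- data.frame(
  pitch_type = c("FF", "FF", "SL", "CU", "SL", "FF"),
  stand      = c("L", "R", "R", "L", "R", "R"),
  stringsAsFactors = FALSE
)

# Count pitches by type and batter stance.
pitch_counts <- table(pitches$pitch_type, pitches$stand,
                      dnn = c("pitch_type", "stand"))
print(pitch_counts)
```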

Acknowledgements

This package was inspired by the mlbgame Python library by Zach Panzarino, the pitchRx package by Carson Sievert and the openWAR package by Ben Baumer and Gregory Matthews.
