droher / boxball

Licence: Apache-2.0 license
Prebuilt Docker images with Retrosheet's complete baseball history data for many analytical frameworks. Includes Postgres, cstore_fdw, MySQL, SQLite, Clickhouse, Drill, Parquet, and CSV.

Introduction

Boxball creates prepopulated databases of the two most significant open source baseball datasets: Retrosheet and the Baseball Databank. Retrosheet contains information on every major-league pitch since 2000, every play since 1928, every box score since 1901, and every game since 1871. The Databank (based on the Lahman Database) contains yearly summaries for every player and team in history. In addition to the data and databases themselves, Boxball relies on the following tools:

  • Docker for repeatable builds and easy distribution
  • SQLAlchemy for abstracting away DDL differences between databases
  • Chadwick for translating Retrosheet's complex event files into a relational format

Follow the instructions below to install your distribution of choice. The full set of images is also available on Docker Hub.

The Retrosheet schema is extensively documented in the code; see the source here until I find a prettier solution.

If you find the project useful, please consider donating to:

Feel free to contact me with questions or comments!

Requirements

  • Docker (v18.06 or later; earlier versions may not work)
  • 2-20GB Disk space (depends on distribution choice)
  • 500MB-8GB RAM available to Docker (depends on distribution choice)

Distributions

Column-Oriented Databases

Postgres cstore_fdw (Recommended)

This distribution uses the cstore_fdw extension to turn PostgreSQL into a column-oriented database. You get the rich feature set of Postgres, but with a huge improvement in speed and disk usage. To install and run the database server:

docker run --name postgres-cstore-fdw -d -p 5433:5432 -e POSTGRES_PASSWORD="postgres" -v ~/boxball/postgres-cstore-fdw:/var/lib/postgresql/data doublewick/boxball:postgres-cstore-fdw-latest

Roughly an hour after the image is downloaded, the data will be fully loaded into the database, and you can connect to it as the user postgres with password postgres on port 5433 (either using the psql command line tool or a database client of your choice). The data will be persisted on your machine in ~/boxball/postgres-cstore-fdw (~1.5GB), which means you can stop/remove the container without having to reload the data when you turn it back on.
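Because loading takes a while, it can help to check whether the port is accepting connections before pointing a client at it. The following is a minimal sketch using only the Python standard library. The helper names `port_is_open` and `postgres_uri` are hypothetical, and the database name `postgres` is an assumption (it is the Postgres default, not something stated above). An open port does not by itself prove that every table has finished loading.

```python
import socket
from urllib.parse import quote


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def postgres_uri(user="postgres", password="postgres",
                 host="localhost", port=5433, database="postgres"):
    """Build a libpq-style connection URI.

    The default database name is an assumption, not taken from the README.
    """
    return (f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
            f"@{host}:{port}/{database}")


if __name__ == "__main__":
    if port_is_open("localhost", 5433):
        print("Connect with:", postgres_uri())
    else:
        print("Port 5433 is not accepting connections yet.")
```

The resulting URI can be passed directly to psql, or to any client library that accepts libpq connection URIs.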

Clickhouse

Clickhouse is a database developed by Yandex with some very impressive performance benchmarks. It uses less disk space than Postgres cstore_fdw, but significantly more RAM (~5GB). I've yet to run any query performance comparisons. To install and run the database server:

docker run --name clickhouse -d -p 8123:8123 -v ~/boxball/clickhouse:/var/lib/clickhouse doublewick/boxball:clickhouse-latest

Between 15 and 30 minutes after the image is downloaded, the data will be fully loaded into the database. You can then connect either by attaching to the container and using the clickhouse-client CLI, or by using a local database client on port 8123 as the user default. The data will be persisted on your machine in ~/boxball/clickhouse (~700MB), which means you can stop/remove the container without having to reload the data when you turn it back on.
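Port 8123 is ClickHouse's HTTP interface, so a query can also be sent without a dedicated client. Here is a minimal sketch using only the Python standard library. The function names are hypothetical, and it assumes the `default` user has no password.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def clickhouse_request(sql, host="localhost", port=8123, user="default"):
    """Build a POST request that carries the SQL text in the request body."""
    url = f"http://{host}:{port}/?{urlencode({'user': user})}"
    return Request(url, data=sql.encode("utf-8"))


def run_query(sql, **kwargs):
    """Send the query and return the raw response body as text."""
    with urlopen(clickhouse_request(sql, **kwargs), timeout=30) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    try:
        # Lists the databases visible to the user; adjust the query as needed.
        print(run_query("SHOW DATABASES"))
    except OSError as exc:
        print("ClickHouse is not reachable yet:", exc)
```

Sending the query in the POST body avoids having to URL-encode the SQL text.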

Drill

Drill is a framework that allows for SQL queries directly on files, without having to declare any schema. It is usually used on a computing cluster with massive datasets, but we use a single-node setup. To install and run:

docker run --name drill -id -p 8047:8047 -p 31010:31010 -v ~/boxball/drill:/data doublewick/boxball:drill-latest

Data will be available to query immediately after the image is downloaded. Use port 8047 to access the Web UI (which includes a SQL runner) and port 31010 to connect via a database client. You may also attach to the container and query from the command line. The data will be persisted on your machine in ~/boxball/drill (~700MB).
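The web port also serves Drill's REST API, which accepts SQL as a JSON document posted to `/query.json`. This is a minimal sketch using only the Python standard library. The function names are hypothetical, and it assumes the single-node setup has no authentication enabled.

```python
import json
from urllib.request import Request, urlopen


def drill_request(sql, host="localhost", port=8047):
    """Build a POST request for Drill's REST query endpoint."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return Request(
        f"http://{host}:{port}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
    )


def run_query(sql, **kwargs):
    """Send the query and return the parsed JSON response."""
    with urlopen(drill_request(sql, **kwargs), timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    try:
        result = run_query("SHOW SCHEMAS")
        for row in result.get("rows", []):
            print(row)
    except OSError as exc:
        print("Drill is not reachable yet:", exc)
```

`SHOW SCHEMAS` is a reasonable first query because it lists the storage locations Drill can see, without assuming any table names.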

Traditional (Row-oriented) Databases

Note: these frameworks are likely to be prohibitively slow when querying play-by-play data, and they take up significantly more disk space than their columnar counterparts.

Postgres

Similar configuration to the cstore_fdw extended version above, but stored in the conventional way.

docker run --name postgres -d -p 5432:5432 -e POSTGRES_PASSWORD="postgres" -v ~/boxball/postgres:/var/lib/postgresql/data doublewick/boxball:postgres-latest

Roughly 90 minutes after the image is downloaded, the data will be fully loaded into the database, and you can connect to it as the user postgres with password postgres on port 5432 (either using the psql command line tool or a database client of your choice). The data will be persisted on your machine in ~/boxball/postgres (~12GB), which means you can stop/remove the container without having to reload the data when you turn it back on.

MySQL

To install and run:

docker run --name mysql -d -p 3306:3306 -v ~/boxball/mysql:/var/lib/mysql doublewick/boxball:mysql-latest

Roughly two hours after the image is downloaded, the data will be fully loaded into the database, and you can connect to it as the user root on port 3306. The data will be persisted on your machine in ~/boxball/mysql (~12GB), which means you can stop/remove the container without having to reload the data when you turn it back on.

SQLite (with web UI)

To install and run:

docker run --name sqlite -d -p 8080:8080 -v ~/boxball/sqlite:/db doublewick/boxball:sqlite-latest

Roughly two minutes after the image is downloaded, the data will be fully loaded into the database. localhost:8080 will provide a web UI where you can write queries and perform schema exploration.
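Since the container's /db directory is mounted at ~/boxball/sqlite, the database file may also be readable directly from the host; that is an assumption, and the file name is not given here. This minimal sketch uses Python's built-in sqlite3 module to list the tables in such a file. The helper name `list_tables` is hypothetical, and the path is supplied by the caller.

```python
import sqlite3
import sys
from pathlib import Path


def list_tables(db_path):
    """Return the sorted names of all tables in the SQLite file at db_path."""
    # Open read-only so a mistyped path raises an error instead of
    # silently creating an empty database file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python list_tables.py /path/to/database/file
    for table in list_tables(sys.argv[1]):
        print(table)
```

Opening the file read-only also keeps ad hoc exploration from modifying the persisted data.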

Flat File Downloads

Parquet

Parquet is a columnar data format originally developed for the Hadoop ecosystem. It has solid support in Spark, Pandas, and many other frameworks. OneDrive

CSV

The original CSVs from the extract step (each CSV file is compressed in the ZSTD format). OneDrive

Acknowledgements

Ted Turocy's Chadwick Bureau developed the tools and repos that made this project possible. I am also grateful to Sean Lahman for creating his database, which I have been using for over 15 years. I was able to develop and host this project for free thanks to the generous open-source plans of JetBrains, CircleCI, GitHub, and Docker Hub.

Retrosheet represents the collective effort of thousands of baseball fans over 150 years of scorekeeping and data entry. I hope Boxball facilitates more historical research to continue this tradition.

Licence(s)

All code is released under the Apache 2.0 license. Baseball Databank data is distributed under the CC-SA 4.0 license. Retrosheet data is released under the condition that the below text appear prominently:

The information used here was obtained free of
charge from and is copyrighted by Retrosheet.  Interested
parties may contact Retrosheet at "www.retrosheet.org".