pybbda

pybbda is a package for Python Baseball Data and Analysis.

data

pybbda aims to provide a uniform framework for accessing baseball data from various sources. The data are exposed as pandas DataFrames.

The data sources it currently supports are:

  • Lahman data

  • Baseball Reference WAR

  • Fangraphs leaderboards and park factors

  • Retrosheet event data

  • Statcast pitch-by-pitch data

analysis

pybbda also provides analysis tools.

It currently supports:

  • Marcel projections

  • Batted ball trajectories

  • Run expectancy via Markov chains

The following are planned for a future release:

  • Simulations

  • and more...!

Installation

This package is available on PyPI, so you can install it with pip,

$ pip install -U pybbda

Or you can install the latest master branch directly from the GitHub repo using pip,

$ pip install git+https://github.com/bdilday/pybbda.git

or download the source,

$ git clone git@github.com:bdilday/pybbda.git
$ cd pybbda
$ pip install .

Requirements

This package explicitly supports Python 3.6 and Python 3.7. It aims to support Python 3.8, but this is not guaranteed. It explicitly does not support any versions prior to Python 3.6, including Python 2.7.

Installing data

This package ships without any data. Instead it provides tools to fetch and store data from a variety of sources.

To install data you can use the update tool in the pybbda.data.tools sub-module.

Example,

$ python -m pybbda.data.tools.update -h
usage: update.py [-h] [--data-root DATA_ROOT] --data-source
                 {Lahman,BaseballReference,Fangraphs,retrosheet,all} [--make-dirs]
                 [--overwrite] [--min-year MIN_YEAR] [--max-year MAX_YEAR]
                 [--num-threads NUM_THREADS]

optional arguments:
  -h, --help            show this help message and exit
  --data-root DATA_ROOT
                        Root directory for data storage
  --data-source {Lahman,BaseballReference,Fangraphs,retrosheet,all}
                        Update source
  --make-dirs           Make root dir if does not exist
  --overwrite           Overwrite files if they exist
  --min-year MIN_YEAR   Min year to download
  --max-year MAX_YEAR   Max year to download
  --num-threads NUM_THREADS
                        Number of threads to use for downloads

The data will be downloaded to --data-root, which defaults to the value of the PYBBDA_DATA_ROOT environment variable.

Detailed instructions are provided in the documentation.
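If you prefer to drive the update tool from a script, the sketch below assembles the same command line in Python. It only uses the flags listed in the help output above; the helper name build_update_command and the ~/.pybbda/data location are illustrative assumptions, not part of pybbda.

```python
import os
import sys


def build_update_command(data_source, data_root=None, min_year=None,
                         max_year=None, make_dirs=False):
    """Assemble an argument list for the pybbda update tool.

    Hypothetical helper (not part of pybbda): it only builds the
    command from flags shown in the tool's --help output.
    """
    cmd = [sys.executable, "-m", "pybbda.data.tools.update",
           "--data-source", data_source]
    if data_root is not None:
        cmd += ["--data-root", data_root]
    if min_year is not None:
        cmd += ["--min-year", str(min_year)]
    if max_year is not None:
        cmd += ["--max-year", str(max_year)]
    if make_dirs:
        cmd.append("--make-dirs")
    return cmd


# Assumed data location, chosen for illustration only.
data_root = os.path.expanduser("~/.pybbda/data")
cmd = build_update_command("Lahman", data_root=data_root, make_dirs=True)
print(" ".join(cmd))

# To actually download, pass the list to subprocess:
# import subprocess
# subprocess.run(cmd, check=True)
```

The download call is left commented out so the snippet can be inspected without pybbda installed or network access.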

Example Usage

After installing some or all of the data, you can start using the package.

The following is an example of accessing Lahman data. More examples are included in the documentation.

Lahman data

>>> from pybbda.data import LahmanData
>>> lahman_data = LahmanData()
>>> batting_df = lahman_data.batting
INFO:pybbda.data.sources.lahman.data:data:searching for file /home/bdilday/.pybbda/data/Lahman/Batting.csv
>>> batting_df.head()
    playerID  yearID  stint teamID lgID   G   AB   R   H  2B  3B  HR   RBI   SB   CS  BB   SO  IBB  HBP  SH  SF  GIDP
0  abercda01    1871      1    TRO  NaN   1    4   0   0   0   0   0   0.0  0.0  0.0   0  0.0  NaN  NaN NaN NaN   0.0
1   addybo01    1871      1    RC1  NaN  25  118  30  32   6   0   0  13.0  8.0  1.0   4  0.0  NaN  NaN NaN NaN   0.0
2  allisar01    1871      1    CL1  NaN  29  137  28  40   4   5   0  19.0  3.0  1.0   2  5.0  NaN  NaN NaN NaN   1.0
3  allisdo01    1871      1    WS3  NaN  27  133  28  44  10   2   2  27.0  1.0  1.0   0  2.0  NaN  NaN NaN NaN   0.0
4  ansonca01    1871      1    RC1  NaN  25  120  29  39  11   3   0  16.0  6.0  2.0   2  1.0  NaN  NaN NaN NaN   0.0
>>> batting_df.groupby("playerID").HR.sum().sort_values(ascending=False)
playerID
bondsba01    762
aaronha01    755
ruthba01     714
rodrial01    696
mayswi01     660
            ... 
mcconra01      0
mccolal01      0
mccluse01      0
mcclula01      0
aardsda01      0
Name: HR, Length: 19689, dtype: int64

CLI tools

Run expectancies

There is a CLI tool for computing run expectancies from Markov chains.

$ python -m pybbda.analysis.run_expectancy.markov.cli --help

This Markov chain uses a lineup of 9 batters instead of assuming each batter has the same characteristics. You can also assign running probabilities, although they apply to all batters equally.

You can assign batting-event probabilities using a sequence of probabilities, or by referencing a player-season with the format {playerID}_{season}, where playerID is the Lahman ID and season is a 4-digit year. For example, to refer to Rickey Henderson's 1982 season, use henderi01_1982.
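As an illustration of the player-season format, here is a minimal sketch of how such an id could be split into its two parts. The function parse_player_season is a hypothetical helper written for this example; it is not pybbda's own parser, which may validate ids differently.

```python
def parse_player_season(player_season):
    """Split '{playerID}_{season}' into (playerID, season).

    Hypothetical helper for illustration; not part of pybbda.
    """
    # Split on the last underscore so the season is always the tail.
    player_id, sep, season = player_season.rpartition("_")
    if not sep or not player_id or len(season) != 4 or not season.isdigit():
        raise ValueError(
            f"expected '{{playerID}}_{{season}}' with a 4-digit year, "
            f"got {player_season!r}"
        )
    return player_id, int(season)


print(parse_player_season("henderi01_1982"))
```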

The lineup is assigned by giving a lineup slot followed by either 5 probabilities or a player-season id. Slot 0 is a special code that assigns the given value to all nine batters. Any other slot number assigns the value to that specific slot only.
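The slot rule can be sketched as follows: slot 0 supplies the value for every batter, and any specific slots then replace individual entries. The helper build_lineup and its (slot, value) input are assumptions made for this example and do not reflect how the CLI is implemented internally.

```python
def build_lineup(assignments, num_slots=9):
    """Resolve (slot, value) pairs into a list of num_slots entries.

    Hypothetical helper for illustration; not part of pybbda.
    Slot 0 fills every position; slots 1..num_slots override one position.
    """
    lineup = [None] * num_slots
    # First pass: apply the slot-0 value to all positions.
    for slot, value in assignments:
        if slot == 0:
            lineup = [value] * num_slots
    # Second pass: apply the slot-specific values on top.
    for slot, value in assignments:
        if slot == 0:
            continue
        if not 1 <= slot <= num_slots:
            raise ValueError(f"slot must be between 0 and {num_slots}, got {slot}")
        lineup[slot - 1] = value
    return lineup


default_probs = (0.08, 0.15, 0.05, 0.005, 0.03)
lineup = build_lineup([
    (0, default_probs),
    (1, "henderi01_1982"),
    (4, "ruthba01_1927"),
])
for position, batter in enumerate(lineup, start=1):
    print(position, batter)
```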

The number of outs to model is 3 by default. It can be changed by setting the environment variable PYBBDA_MAX_OUTS.
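To give a feel for the kind of calculation involved, the following is a deliberately simplified sketch of a run-expectancy Markov chain, not the pybbda implementation. It assumes a single batter profile for every lineup slot, assumes the five probabilities mean walk, single, double, triple and home run (an ordering not stated here), and assumes runners advance exactly as many bases as the batter, with no extra bases taken.

```python
def advance_on_hit(bases, n):
    """Move every runner and the batter n bases; bases is a 3-bit mask."""
    shifted = (bases << n) | (1 << (n - 1))
    runs = bin(shifted >> 3).count("1")  # anything past third has scored
    return shifted & 0b111, runs


def advance_on_walk(bases):
    """Batter takes first; runners move only when forced."""
    for bit in (0b001, 0b010, 0b100):
        if not bases & bit:
            return bases | bit, 0
    return bases, 1  # bases loaded: one run is forced in


def expected_runs(probs, max_outs=3, tol=1e-12):
    """Expected runs from bases empty and 0 outs until max_outs.

    probs = (walk, single, double, triple, home_run); this ordering
    is an assumption made for the sketch.
    """
    p_bb, p_1b, p_2b, p_3b, p_hr = probs
    p_out = 1.0 - sum(probs)
    if p_out <= 0:
        raise ValueError("event probabilities must sum to less than 1")
    hits = ((1, p_1b), (2, p_2b), (3, p_3b), (4, p_hr))
    # value[outs][bases] = expected future runs; all zero at max_outs.
    value = [[0.0] * 8 for _ in range(max_outs + 1)]
    for outs in range(max_outs - 1, -1, -1):
        while True:  # fixed-point iteration over the 8 base states
            delta = 0.0
            for bases in range(8):
                total = p_out * value[outs + 1][bases]
                new_bases, runs = advance_on_walk(bases)
                total += p_bb * (runs + value[outs][new_bases])
                for n, p in hits:
                    new_bases, runs = advance_on_hit(bases, n)
                    total += p * (runs + value[outs][new_bases])
                delta = max(delta, abs(total - value[outs][bases]))
                value[outs][bases] = total
            if delta < tol:
                break
    return value[0][0]


print(expected_runs((0.08, 0.15, 0.05, 0.005, 0.03)))
```

Because of these simplifications, the number printed here should not be expected to match the CLI output in the examples that follow.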

Example: Use a default set of probabilities for all 9 slots, with runners taking no extra bases

$ python -m pybbda.analysis.run_expectancy.markov.cli -b 0 0.08 0.15 0.05 0.005 0.03 --running-probs 0 0 0 0 
mean score per 27 outs = 3.5227
std. score per 27 outs = 2.8009

Example: Use a default set of probabilities for all 9 slots with default probabilities for taking extra bases

$ python -m pybbda.analysis.run_expectancy.markov.cli -b 0 0.08 0.15 0.05 0.005 0.03
mean score per 27 outs = 4.2242
std. score per 27 outs = 3.0161

Example: Use a default set of probabilities for all 9 slots but let Rickey Henderson 1982 bat leadoff (using 27 outs, instead of 3)

$ PYBBDA_MAX_OUTS=27  python -m pybbda.analysis.run_expectancy.markov.cli -b 0 0.08 0.15 0.05 0.005 0.03 -i 1 henderi01_1982
WARNING:pybbda:__init__:Environment variable PYBBDA_DATA_ROOT is not set, defaulting to /home/bdilday/github/pybbda/pybbda/data/assets
INFO:pybbda.data.sources.lahman.data:data:searching for file /home/bdilday/github/pybbda/pybbda/data/assets/Lahman/Batting.csv
mean score per 27 outs = 4.3628
std. score per 27 outs = 3.0999

Example: Use a default set of probabilities for all 9 slots but let Rickey Henderson 1982 bat leadoff and Babe Ruth 1927 bat clean-up (using 27 outs, instead of 3)

$ PYBBDA_MAX_OUTS=27  python -m pybbda.analysis.run_expectancy.markov.cli -b 0 0.08 0.15 0.05 0.005 0.03 -i 1 henderi01_1982 -i 4 ruthba01_1927 
WARNING:pybbda:__init__:Environment variable PYBBDA_DATA_ROOT is not set, defaulting to /home/bdilday/github/pybbda/pybbda/data/assets
INFO:pybbda.data.sources.lahman.data:data:searching for file /home/bdilday/github/pybbda/pybbda/data/assets/Lahman/Batting.csv
mean score per 27 outs = 5.1420
std. score per 27 outs = 3.3996

Contributing

Contributions from the community are welcome. See the contributing guide.

License

GPLv2
