
transfermarkt-datasets

In a nutshell, this project aims for three things:

  1. Acquiring data from the transfermarkt website using the transfermarkt-scraper.
  2. Building a clean, public football (soccer) dataset from the data in 1.
  3. Automating 1 and 2 to keep these assets up to date and publicly available on some well-known data catalogs.
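The three stages above can be pictured as a simple pipeline. The functions below are purely illustrative stand-ins (none of these names exist in the project; the real stages are the scraper, the transfermarkt_datasets module, and the sync job):

```python
# Hypothetical stand-ins for the three stages; values are made up.

def acquire(season: int) -> list[dict]:
    # stand-in for scraping raw records for a season
    return [{"game_id": 1, "season": season}]

def prepare(raw: list[dict]) -> list[dict]:
    # stand-in for cleaning and validating raw records
    return [r for r in raw if "game_id" in r]

def publish(prepared: list[dict]) -> str:
    # stand-in for syncing prepared assets to data catalogs
    return f"published {len(prepared)} rows"

print(publish(prepare(acquire(2022))))  # -> published 1 rows
```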

Open in Streamlit · Kaggle · data.world


classDiagram
direction LR
competitions --|> games : competition_id
competitions --|> clubs : domestic_competition_id
clubs --|> players : current_club_id
clubs --|> club_games : opponent/club_id
clubs --|> game_events : club_id
players --|> appearances : player_id
players --|> game_events : player_id
players --|> player_valuations : player_id
games --|> appearances : game_id
games --|> game_events : game_id
games --|> clubs : home/away_club_id
games --|> club_games : game_id
class competitions {
    competition_id
}
class games {
    game_id
    home/away_club_id
    competition_id
}
class game_events {
    game_id
    player_id
}
class clubs {
    club_id
    domestic_competition_id
}
class club_games {
    club_id
    opponent_club_id
    game_id
}
class players {
    player_id
    current_club_id
}
class player_valuations {
    player_id
}
class appearances {
    appearance_id
    player_id
    game_id
}
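The keys in the diagram can be used to join the assets together. A minimal pandas sketch, using tiny made-up frames in place of the real clubs and games files:

```python
import pandas as pd

# Tiny stand-in frames mirroring the keys in the diagram above
# (values are made up for illustration).
games = pd.DataFrame({
    "game_id": [1, 2],
    "home_club_id": [10, 11],
    "away_club_id": [11, 10],
    "competition_id": ["ES1", "ES1"],
})
clubs = pd.DataFrame({
    "club_id": [10, 11],
    "name": ["Club A", "Club B"],
    "domestic_competition_id": ["ES1", "ES1"],
})

# Resolve the home club of each game via the home_club_id -> club_id key
games_with_home = games.merge(
    clubs, left_on="home_club_id", right_on="club_id", how="left"
)
print(games_with_home[["game_id", "name"]])
```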


setup

Set up your local environment to run the project with poetry.

  1. Install poetry
  2. Install python dependencies (poetry will create a virtual environment for you)
cd transfermarkt-datasets
poetry install

make

The Makefile in the root defines a set of useful targets that help you run the different parts of the project. Some examples are:

dvc_pull                       pull data from the cloud (aws s3)
docker_build                   build the project docker image and tag accordingly
acquire_local                  run the acquiring process locally (refreshes data/raw)
prepare_local                  run the prep process locally (refreshes data/prep)
sync                           run the sync process (refreshes data frontends)
streamlit_local                run streamlit app locally
dagit_local                    run dagit locally

Run make help to see the full list.

Once you've completed the setup, you should be able to run most of these from your machine.

data storage

ℹ️ Read access to the S3 DVC remote storage for the project is required to successfully run dvc pull. Contributors can grant themselves access by adding their AWS IAM user ARN to this whitelist.

All project data assets are kept inside the data folder. This is a DVC repository, so all files can be pulled from the remote storage with make dvc_pull.

path        description
data/raw    contains raw data per season as acquired with transfermarkt-scraper (check acquire)
data/prep   contains prepared datasets as produced by the transfermarkt_datasets module (check prepare)

data acquisition

In the scope of this project, "acquiring" is the process of collecting "raw data" as it is produced by transfermarkt-scraper. Acquired data lives in the data/raw folder and can be created or updated for a particular season by running make acquire_local

make acquire_local ARGS="--asset all --season 2022"

This runs the scraper with a set of parameters and collects the output in data/raw.
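For reference, scrapy-based scrapers such as transfermarkt-scraper typically emit newline-delimited JSON, one record per line. The record below is made up for illustration; the real schema depends on the scraper's output:

```python
import json

# A made-up raw line; real fields depend on transfermarkt-scraper's schema.
raw_line = '{"type": "game", "href": "/spielbericht/index/spielbericht/12345"}'

record = json.loads(raw_line)
print(record["type"])  # -> game
```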

data preparation

In the scope of this project, "preparing" is the process of transforming raw data to create a high quality dataset that can be conveniently consumed by analysts of all kinds. The transfermarkt_datasets module deals with the data preparation.

path                            description
transfermarkt_datasets/core     core classes and utils that are used to work with the dataset
transfermarkt_datasets/tests    unit tests for core classes
transfermarkt_datasets/assets   prepared asset definitions: one python file per asset
transfermarkt_datasets/dagster  dagster job definitions
transfermarkt_datasets/stage    temporary location for asset generation

dagster

The dataset preparation steps are rendered as a dagster job.

  • make prepare_local runs the dagster preparation job in process
  • make dagit_local spins up a dagit UI where the execution can be visualised

dagster

configuration

Different project configurations are defined in the config.yml file.

python api

transfermarkt_datasets provides a python api that can be used to work with the module from the python console. This is particularly convenient for working with the datasets from a notebook.

# import the module
from transfermarkt_datasets.core.dataset import Dataset

# instantiate the datasets handler
td = Dataset()

# build the datasets from raw data
td.discover_assets()
td.build_datasets()
# if prepared files already exist in data/prep, you can just load them
# > td.load_assets()

# inspect the results
td.asset_names # ["games", "players", ...]
td.assets["games"].prep_df # get the built asset in a dataframe

# get raw data in a dataframe
td.assets["games"].load_raw()
td.assets["games"].raw_df 

For more examples of using transfermarkt_datasets, check out the sample notebooks.
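Since the prepared assets are plain CSV files under data/prep, they can also be read without the module. The sketch below parses an in-memory stand-in; in the real project you would point pandas at a file such as data/prep/games.csv:

```python
import io
import pandas as pd

# Stand-in for a prepared asset; the real files live under data/prep/.
csv_text = "game_id,home_club_id,away_club_id\n1,10,11\n2,11,10\n"

games = pd.read_csv(io.StringIO(csv_text))
print(len(games))  # -> 2
```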

data publication

Prepared data is published to a couple of popular dataset websites. This is done by running make sync, which runs weekly as part of the data pipeline.

streamlit 🎈

There is a streamlit app for the project with documentation, a data catalog and sample analysis. The app is currently hosted on fly.io; you can check it out here.

For local development, you can also run the app on your machine. Provided you've completed the setup, run the following to spin up a local instance of the app

make streamlit_local

⚠️ Note that the app expects prepared data to exist in data/prep. Check out data storage for instructions about how to populate that folder.

infra

Define all the necessary infrastructure for the project in the cloud with Terraform.

contributing 🙏

Contributions to transfermarkt-datasets are most welcome. If you want to contribute new fields or assets to this dataset, instructions are quite simple:

  1. Fork the repo
  2. Set up your local environment
  3. Pull the raw data by either running dvc pull (requesting access is needed) or running make acquire_local (no access request needed)
  4. Start modifying assets or creating new ones in transfermarkt_datasets/assets. You can use make prepare_local to run and test your changes.
  5. If it's all looking good, create a pull request with your changes 🚀