
ml-tooling / ml-project-template

Licence: other
ML project template facilitating both research and production phases.

Programming Languages: Python, Dockerfile

ML Project Template

This repository contains a template project that can be easily adapted to all kinds of Machine Learning tasks. Typically, solving such a task entails two main phases, research and production, with very different focuses. The template intends to facilitate work on ML projects by guiding practitioners to adopt some best practices.

research: exploratory data analyses, model prototyping and experiments are dumped here in a structured way

production: distilled utils lib, training job and inference service are implemented here

It is recommended to simply clone this repo and customize it to the specific use-case at hand.


Repository Structure

  • research: Scripts and Notebooks for experimentation.
    • develop (Python): Experimental code to try out new ideas and experiments. Use Jupyter notebooks wherever you can. Naming convention: YYYY-MM-DD_userid_short-description. If you cannot use a notebook and have multiple scripts/files for an experiment, create a folder with the same naming convention. Each file should be handled by one person only.
    • deliver (Python): Refactored notebooks that contain valuable insights or results (e.g. visualizations, training runs). Notebooks should be refactored, documented, contain outputs, and use the following naming schema: YYYY-MM-DD_short-description. Notebooks in deliver should not be changed or rerun. If you want to rerun a deliver Notebook, please duplicate it into the develop folder.
    • templates (Python): Refactored notebooks that are reusable for a specific task (e.g. model training, data exploration). Notebooks should be refactored, documented, not contain any output, and use the following naming schema: short-description. If you would like to use a template notebook, duplicate it into the develop folder.
  • production: The production-ready solution(s) composed of libraries, services, and jobs.
    • python-utils-lib (Python): Utility functions that are distilled from the research phase and used across multiple scripts. Should only contain refactored and tested Python scripts/modules. Installable via pip.
    • training-job (Python/Docker): Combines required data exports, preprocessing and training scripts into a Docker container. This makes results reproducible and the production model retrainable in any environment.
    • inference-service (Python/Docker): Docker container that provides the final model prediction capabilities via a REST API.
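
The layout above can also be recreated programmatically, for example when only the empty folder skeleton is wanted. The following is a minimal sketch, not part of the template itself: the folder names come from the structure above, while the function name `scaffold_project` and the `LAYOUT` mapping are hypothetical.

```python
from pathlib import Path

# Folder names are taken from the repository structure described above.
# `LAYOUT` and `scaffold_project` are hypothetical names used for illustration.
LAYOUT = {
    "research": ["develop", "deliver", "templates"],
    "production": ["python-utils-lib", "training-job", "inference-service"],
}


def scaffold_project(root):
    """Create the research/production folder tree under `root`.

    Returns the list of component directories (existing ones are left as-is).
    """
    created = []
    for phase, components in LAYOUT.items():
        for component in components:
            path = Path(root) / phase / component
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created
```

Calling `scaffold_project("my-ml-project")` would create six empty component folders. Cloning the repository remains the recommended route, since it also brings the example code and Docker setup.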

Naming Conventions

Code Artifacts

  • develop notebooks/scripts: YYYY-MM-DD_userid_short-description
  • deliver notebooks/scripts: YYYY-MM-DD_short-description
  • template notebooks/scripts: short-description
  • services: -service suffix
  • jobs: -job suffix
  • libraries: -lib suffix
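
The develop and deliver conventions above can be generated and checked with a few lines of Python. This is only a sketch under stated assumptions: the helper names are hypothetical, and it assumes that user IDs consist of lowercase letters and digits and that descriptions use lowercase letters, digits and hyphens (the conventions themselves do not specify a character set).

```python
import re
from datetime import date

# Hypothetical helpers; the template does not ship them.
# Assumption: userid = [a-z0-9]+, short-description = [a-z0-9-]+.
DEVELOP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})_([a-z0-9]+)_([a-z0-9-]+)$")
DELIVER_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})_([a-z0-9-]+)$")


def develop_name(userid, description, day=None):
    """Build a develop name: YYYY-MM-DD_userid_short-description."""
    day = day or date.today()
    return f"{day.isoformat()}_{userid}_{description}"


def is_develop_name(stem):
    """True if `stem` (file name without extension) fits the develop schema."""
    return DEVELOP_RE.match(stem) is not None


def is_deliver_name(stem):
    """True if `stem` fits the deliver schema: YYYY-MM-DD_short-description."""
    return DELIVER_RE.match(stem) is not None
```

The patterns only check the shape of the date, not whether it is a valid calendar day. A stricter check could pass the first group to `date.fromisoformat`.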

Files

<dataset-desc>_<preprocessing-desc>_<training-desc>.<filetype>

Examples:

  • blogs-metadata.csv
  • blogs-metadata_cl-rs_ft-vec.vectors
  • categories2blogs_cl-rs-lm_tfidf-lsvm.model.zip
  • categories2blogs-questions_cl-rs-lm_tfidf-lsvm.model.zip

Name Identifier Descriptions:

Dataset Identifiers:

  • categories2blogs: Dataset containing blogs with the text content, blogs item URI, and connected primary tags.
  • blogs-metadata: Dataset containing all blogs and related metadata (properties).

Preprocessing Identifiers:

  • cl: Default text cleaning (lowercasing, regex cleaning).
  • rs: Remove Stopwords.
  • lm: Text lemmatization.

Training Identifiers:

  • ft-vec: Text vectorizer using Fasttext.
  • tfidf: Text vectorizer using TFIDF.
  • lsvm: Classifier using linear SVM.

Filetype Identifiers:

  • .model: Model file.
  • .vectors: Binary vectors file.
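
To illustrate how the file naming schema decomposes, the sketch below splits a file name into its dataset, preprocessing, training and filetype parts. The names `ArtifactName` and `parse_artifact_name` are hypothetical. The sketch assumes that underscores only separate the three descriptions and that the filetype starts at the first dot. The training part is kept as one string because identifiers such as ft-vec contain a hyphen themselves, so it cannot be split on hyphens without a list of known identifiers.

```python
from typing import NamedTuple, Optional, Tuple


class ArtifactName(NamedTuple):
    """Parts of <dataset-desc>_<preprocessing-desc>_<training-desc>.<filetype>."""
    dataset: str
    preprocessing: Tuple[str, ...]  # e.g. ("cl", "rs", "lm")
    training: Optional[str]         # raw string, e.g. "tfidf-lsvm"
    filetype: str                   # everything after the first dot


def parse_artifact_name(filename):
    """Split a file name following the schema above (hypothetical helper)."""
    stem, _, filetype = filename.partition(".")
    parts = stem.split("_")
    if len(parts) > 3 or not parts[0]:
        raise ValueError(f"not a valid artifact name: {filename!r}")
    dataset = parts[0]
    # Preprocessing identifiers (cl, rs, lm) are joined by hyphens.
    preprocessing = tuple(parts[1].split("-")) if len(parts) > 1 else ()
    training = parts[2] if len(parts) > 2 else None
    return ArtifactName(dataset, preprocessing, training, filetype)
```

For example, `parse_artifact_name("categories2blogs_cl-rs-lm_tfidf-lsvm.model.zip")` yields the dataset `categories2blogs`, the preprocessing steps `cl`, `rs` and `lm`, the training description `tfidf-lsvm`, and the filetype `model.zip`.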