sdv-dev / SDGym

Licence: MIT license
Benchmarking synthetic data generation methods.

This repository is part of The Synthetic Data Vault Project, a project from DataCebo.

Overview

Synthetic Data Gym (SDGym) is a framework to benchmark the performance of synthetic data generators based on SDV and SDMetrics.

Important Links
💻 Website Check out the SDV Website for more information about the project.
📙 SDV Blog Regularly published, useful content about Synthetic Data Generation.
📖 Documentation Quickstarts, User and Development Guides, and API Reference.
🐙 Repository The link to the GitHub Repository of this library.
📜 License The entire ecosystem is published under the MIT License.
⌨️ Development Status This software is in its Pre-Alpha stage.
Community Join our Slack Workspace for announcements and discussions.
Tutorials Run the SDV Tutorials in a Binder environment.

What is a Synthetic Data Generator?

A Synthetic Data Generator is a Python function (or method) that takes as input some data, which we call the real data, learns a model from it, and outputs new synthetic data that has the same structure and similar mathematical properties as the real one.

Please refer to the synthesizers documentation for instructions about how to implement your own Synthetic Data Generator and integrate with SDGym. You can also read about how to use the ones already included in SDGym and see how to run them.
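
To make the definition above concrete, here is a minimal sketch of a generator in that sense: it takes real data, learns a very simple model (the observed values of each column), and outputs synthetic rows with the same columns. It is a toy that uses only the standard library and plain lists of dictionaries. It is not the SDGym synthesizer interface, and the function names and sample values are made up for illustration.

```python
import random


def fit(real_rows):
    """Learn a very simple model: the observed values of each column."""
    columns = list(real_rows[0].keys())
    return {col: [row[col] for row in real_rows] for col in columns}


def sample(model, num_rows, seed=0):
    """Draw each column independently from its observed values."""
    rng = random.Random(seed)
    return [
        {col: rng.choice(values) for col, values in model.items()}
        for _ in range(num_rows)
    ]


def independent_resampler(real_rows, num_rows=None, seed=0):
    """Toy generator: real data in, synthetic data with the same columns out."""
    model = fit(real_rows)
    return sample(model, num_rows or len(real_rows), seed)


# Made-up real data, for illustration only.
real_rows = [
    {"age": 34, "city": "Boston"},
    {"age": 51, "city": "Madrid"},
    {"age": 27, "city": "Boston"},
]
synthetic_rows = independent_resampler(real_rows)
```

Because each column is sampled independently, this toy keeps the per-column distributions but loses any relationship between columns, which is the kind of difference a benchmark is meant to expose.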

Benchmark datasets

SDGym evaluates the performance of Synthetic Data Generators using single-table, multi-table, and time series datasets stored as CSV files alongside an SDV Metadata JSON file.

Further details about the list of available datasets and how to add your own datasets to the collection can be found in the datasets documentation.
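
As a rough illustration of the "CSV files alongside a metadata JSON file" layout, the sketch below reads a dataset folder with the standard library. The file name `metadata.json`, the `tables` mapping, and its `path` key are assumptions chosen for this example, not the documented SDV Metadata schema, so check the datasets documentation for the actual format. The sample table is made up.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_dataset(dataset_dir):
    """Read metadata.json and every CSV table it lists (hypothetical layout)."""
    dataset_dir = Path(dataset_dir)
    metadata = json.loads((dataset_dir / "metadata.json").read_text())
    tables = {}
    for table_name, table_meta in metadata["tables"].items():
        with open(dataset_dir / table_meta["path"], newline="") as csv_file:
            # Each row becomes a dict of column name -> string value.
            tables[table_name] = list(csv.DictReader(csv_file))
    return metadata, tables


# Build a tiny throwaway dataset so the sketch runs end to end.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "users.csv").write_text("user_id,age\n1,34\n2,51\n")
    (tmp_path / "metadata.json").write_text(
        json.dumps({"tables": {"users": {"path": "users.csv"}}})
    )
    metadata, tables = load_dataset(tmp_path)
```

Note that `csv.DictReader` returns every value as a string; a real loader would use the metadata to cast each column to its proper type.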

Install

SDGym can be installed using the following commands:

Using pip:

pip install sdgym

Using conda:

conda install -c pytorch -c conda-forge sdgym

For more installation options, please visit the SDGym installation guide.

Usage

Benchmarking your own Synthesizer

SDGym evaluates Synthetic Data Generators, which are Python functions (or classes) that take as input some data, which we call the real data, learn a model from it, and output new synthetic data that has the same structure and similar mathematical properties as the real one.

As an example, let us define a synthesizer function that applies the GaussianCopula model from SDV with a Gaussian distribution.

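from sdv.tabular import GaussianCopula


def gaussian_copula(real_data, metadata):
    # Use a Gaussian as the default distribution for the columns.
    gc = GaussianCopula(default_distribution='gaussian')
    # Fit on the first table listed in the metadata.
    table_name = metadata.get_tables()[0]
    gc.fit(real_data[table_name])
    # Return the synthetic data keyed by table name, like the input.
    return {table_name: gc.sample()}

ℹ️ You can learn how to create your own synthesizer function in the synthesizers documentation.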

We can now try to evaluate this function on the asia and alarm datasets:

import sdgym

scores = sdgym.run(synthesizers=gaussian_copula, datasets=['asia', 'alarm'])

ℹ️ You can learn about the different arguments of the sdgym.run function in the benchmark documentation.

The output of the sdgym.run function will be a pd.DataFrame containing the results obtained by your synthesizer on each dataset.

synthesizer      dataset  modality      metric           score       metric_time  model_time
gaussian_copula  asia     single-table  BNLogLikelihood   -2.842690  2.762427     0.752364
gaussian_copula  alarm    single-table  BNLogLikelihood  -20.223178  7.009401     3.173832
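
One way to read the columns of this table is through a toy benchmark loop: for every synthesizer and dataset pair, time the synthesizer, time the metric, and record one row. The sketch below is an illustration under simplifying assumptions, not SDGym's implementation. The synthesizers, the metric, and the datasets are made-up stand-ins that work on plain lists of numbers, and the modality column is left out.

```python
import random
import time


def uniform_synthesizer(real_values, rng):
    """Toy synthesizer: sample uniformly between the observed min and max."""
    low, high = min(real_values), max(real_values)
    return [rng.uniform(low, high) for _ in real_values]


def identity_synthesizer(real_values, rng):
    """Toy synthesizer: return a copy of the real data."""
    return list(real_values)


def mean_gap(real_values, synthetic_values):
    """Toy metric: negative absolute difference of the means (0 is best)."""
    real_mean = sum(real_values) / len(real_values)
    synthetic_mean = sum(synthetic_values) / len(synthetic_values)
    return -abs(real_mean - synthetic_mean)


def run_benchmark(synthesizers, datasets, metric, seed=0):
    """Time each synthesizer and metric call, one result row per pair."""
    rows = []
    for synthesizer in synthesizers:
        for dataset_name, real_values in datasets.items():
            rng = random.Random(seed)

            start = time.perf_counter()
            synthetic_values = synthesizer(real_values, rng)
            model_time = time.perf_counter() - start

            start = time.perf_counter()
            score = metric(real_values, synthetic_values)
            metric_time = time.perf_counter() - start

            rows.append({
                "synthesizer": synthesizer.__name__,
                "dataset": dataset_name,
                "metric": metric.__name__,
                "score": score,
                "metric_time": metric_time,
                "model_time": model_time,
            })
    return rows


# Made-up single-column datasets, for illustration only.
datasets = {"small": [1.0, 2.0, 3.0], "wide": [0.0, 50.0, 100.0]}
rows = run_benchmark([identity_synthesizer, uniform_synthesizer], datasets, mean_gap)
```

Keeping the model time and the metric time in separate columns makes it possible to tell a slow synthesizer apart from a slow evaluation.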

Benchmarking the SDGym Synthesizers

If you want to run the SDGym benchmark on the SDGym Synthesizers you can directly pass the corresponding class, or a list of classes, to the sdgym.run function.

For example, if you want to run the complete benchmark suite to evaluate all the existing synthesizers you can run (⚠️ this will take a lot of time to run!):

import sdgym
from sdgym.synthesizers import (
    CLBN, CopulaGAN, CTGAN, HMA1, Identity, Independent,
    MedGAN, PAR, PrivBN, SDV, TableGAN, TVAE,
    Uniform, VEEGAN)

all_synthesizers = [
    CLBN,
    CTGAN,
    CopulaGAN,
    HMA1,
    Identity,
    Independent,
    MedGAN,
    PAR,
    PrivBN,
    SDV,
    TVAE,
    TableGAN,
    Uniform,
    VEEGAN,
]
scores = sdgym.run(synthesizers=all_synthesizers)

For further details about all the arguments and options that the benchmark function offers, please refer to the benchmark documentation.

Additional References

  • Datasets used in SDGym are detailed in the datasets documentation.
  • How to write a synthesizer is detailed in the synthesizers documentation.
  • How to use the benchmark function is detailed in the benchmark documentation.
  • Detailed leaderboard results for all the releases are available here.



The Synthetic Data Vault Project was first created at MIT's Data to AI Lab in 2016. After 4 years of research and traction with enterprise, we created DataCebo in 2020 with the goal of growing the project. Today, DataCebo is the proud developer of SDV, the largest ecosystem for synthetic data generation & evaluation. It is home to multiple libraries that support synthetic data, including:

  • 🔄 Data discovery & transformation. Reverse the transforms to reproduce realistic data.
  • 🧠 Multiple machine learning models -- ranging from Copulas to Deep Learning -- to create tabular, multi table and time series data.
  • 📊 Measuring quality and privacy of synthetic data, and comparing different synthetic data generation models.

Get started using the SDV package -- a fully integrated solution and your one-stop shop for synthetic data. Or, use the standalone libraries for specific needs.
