BorealisAI / private-data-generation

Licence: other
A toolbox for differentially private data generation

Private Data Generation Toolbox

The goal of this toolbox is to make private generation of synthetic data samples accessible to machine learning practitioners. It currently implements five state-of-the-art generative models that can generate differentially private synthetic data. We evaluate the models on four public datasets from domains where the privacy of sensitive data is paramount. Users can benchmark the models on the existing datasets, or supply a new sensitive dataset as input and obtain a synthetic dataset as output, which can be distributed to third parties with strong differential privacy guarantees.

Models :

PATE-GAN : PATE-GAN : Generating Synthetic Data with Differential Privacy Guarantees. ICLR 2019

DP-WGAN : Implementation of private Wasserstein GAN using noisy gradient descent and moments accountant.

RON-GAUSS : Enhancing Utility in Non-Interactive Private Data Release, Proceedings on Privacy Enhancing Technologies (PETS), vol. 2019, no. 1, 2018

Private IMLE : Implementation of private Implicit Maximum Likelihood Estimation using noisy gradient descent and moments accountant.

Private PGM : Graphical-model based estimation and inference for differential privacy. Proceedings of the 36th International Conference on Machine Learning. 2019.

NOTE : Private IMLE code is released separately from this toolbox and can be found here : https://github.com/BorealisAI/IMLE. To run IMLE, do the following first:

git clone https://github.com/BorealisAI/IMLE.git  
cp -r IMLE <root>/models

Also make sure to follow the build instructions in <root>/models/IMLE/dci_code/Makefile

Dataset description :

Adult Census : The dataset comprises census attributes such as age, gender and native country, and the goal is to predict whether a person earns more than $50k a year. https://archive.ics.uci.edu/ml/datasets/adult

NHANES Diabetes : National Health and Nutrition Examination Survey (NHANES) questionnaire is used to predict the onset of type II diabetes. https://github.com/semerj/NHANES-diabetes/tree/master/data

Give Me Some Credit : Historical data are provided on 250,000 borrowers, and the task is to help in credit scoring by predicting the probability that somebody will experience financial distress in the next two years. https://www.kaggle.com/c/GiveMeSomeCredit/data

Home Credit Default Risk : Home Credit makes use of a variety of alternative data including telco and transactional information along with the client's past financial record to predict their clients' repayment abilities. https://www.kaggle.com/c/home-credit-default-risk/data

Adult Categorical : This dataset is the same as the Adult Census dataset, but the feature values for continuous attributes are put in buckets. We evaluate Private-PGM's performance on this dataset. https://github.com/ryan112358/private-pgm/tree/master/data

The datasets can be downloaded to the /data folder using the download_datasets.sh script and can be preprocessed using the scripts in the /preprocessing folder. Preprocessing is dataset-specific and mostly involves handling missing values, normalization, encoding of attribute values, and splitting the data into train and test sets.

Example :
sh download_datasets.sh adult
python preprocessing/preprocess_adult.py

Downstream classifiers :

The classifiers used are Logistic Regression, Multi-layer Perceptron, Gaussian Naive Bayes, Random Forest and Gradient Boosting, with default settings from sklearn.

Data Format :

The data needs to be in csv format and has to be partitioned into train and test sets before it is fed to the models. The generative models are learned using the training data. The downstream classifiers are trained either on the real training data or on synthetic data generated by the models. The classifiers are evaluated on the held-out test data.
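
As an illustration of the partitioning step, here is a minimal sketch that splits a single csv file (with a header row) into train and test files using only the Python standard library. The function name split_csv, the 80/20 default and the fixed seed are assumptions made for this example; the toolbox's own preprocessing scripts may split the data differently.

```python
import csv
import random


def split_csv(src_path, train_path, test_path, test_fraction=0.2, seed=0):
    """Split a CSV with a header row into train and test files."""
    with open(src_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)          # first line holds the column names
        rows = list(reader)

    rng = random.Random(seed)          # fixed seed for a reproducible split
    rng.shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    test_rows, train_rows = rows[:n_test], rows[n_test:]

    for path, part in ((train_path, train_rows), (test_path, test_rows)):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)    # every output file keeps the header
            writer.writerows(part)
    return len(train_rows), len(test_rows)
```

Both output files keep the header row, since the models expect the first line to contain the column names.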

Currently only two attribute types are supported :

  1. All attributes are continuous : supported models are ron-gauss, pate-gan, dp-wgan, imle

  2. All attributes are categorical : supported model is private-pgm . The categorical attribute values should be between 0 and max_category - 1.

If the data has both kinds of attributes, it needs to be pre-processed (discretization for continuous values / encoding for categorical attributes) before one of the models can be used. Missing values are not supported and need to be replaced appropriately by the user before usage.
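
The two conversions mentioned above can be sketched as follows. This is only an illustration: the helper names encode_categorical and discretize are hypothetical, and equal-width bucketing is one possible discretization, not necessarily the one used for the Adult Categorical dataset.

```python
def encode_categorical(values):
    """Map each distinct value to an integer code in 0 .. max_category - 1."""
    categories = sorted(set(values))
    code_of = {cat: code for code, cat in enumerate(categories)}
    return [code_of[v] for v in values], categories


def discretize(values, num_bins):
    """Put continuous values into equal-width buckets numbered 0 .. num_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0   # fall back to 1.0 for constant columns
    # The maximum value would land in bucket num_bins, so clamp it into the last bucket.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


codes, categories = encode_categorical(["Private", "State-gov", "Private"])
buckets = discretize([17.0, 38.0, 90.0], num_bins=4)
```

Note that the bucket edges here are derived from the minimum and maximum of the data itself. In a private pipeline it is safer to use publicly known bounds for each attribute instead.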

NOTE : Some imputation methods compute statistics using other data samples to fill missing values. Care needs to be taken to make the computed statistics differentially private and the cost must be added to the generative modeling privacy cost to compute the total privacy cost.
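
To make the accounting concrete, here is a minimal sketch of basic sequential composition, under which the epsilons and deltas of the individual steps simply add up. The function name total_privacy_cost and the imputation cost of 0.5 are hypothetical values chosen for illustration. Basic composition is a conservative bound, and tighter accounting methods exist.

```python
def total_privacy_cost(*costs):
    """Basic sequential composition: epsilons and deltas add up."""
    epsilon = sum(eps for eps, _ in costs)
    delta = sum(dlt for _, dlt in costs)
    return epsilon, delta


imputation_cost = (0.5, 0.0)     # hypothetical cost of a private statistic used for imputation
generation_cost = (5.0, 1e-5)    # values passed as --target-epsilon / --target-delta
print(total_privacy_cost(imputation_cost, generation_cost))
```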

The first line of the csv data file is assumed to contain the column names and the target column (labels) needs to be specified using the --target-variable flag when running the evaluation script as shown below.

How to:

python evaluate.py --target-variable=<> --train-data-path=<> --test-data-path=<> <model_name> --enable-privacy --target-epsilon=5 --target-delta=1e-5

Model names can be real-data, pate-gan, dp-wgan, ron-gauss, imle or private-pgm.
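
When running several models in a loop, it can help to assemble the command programmatically. The sketch below builds the argument list from the flags shown above. The helper build_evaluate_command is hypothetical and not part of the toolbox; the list it returns can be passed to subprocess.run.

```python
import shlex


def build_evaluate_command(model_name, target_variable, train_path, test_path,
                           epsilon=None, delta=None):
    """Assemble the argument list for evaluate.py from the flags shown above."""
    cmd = [
        "python", "evaluate.py",
        f"--target-variable={target_variable}",
        f"--train-data-path={train_path}",
        f"--test-data-path={test_path}",
        model_name,
    ]
    if epsilon is not None:
        # Privacy flags follow the model name, as in the command above.
        cmd += ["--enable-privacy", f"--target-epsilon={epsilon}"]
        if delta is not None:
            cmd.append(f"--target-delta={delta}")
    return cmd


cmd = build_evaluate_command(
    "pate-gan", "income",
    "./data/adult_processed_train.csv", "./data/adult_processed_test.csv",
    epsilon=5, delta=1e-5,
)
print(shlex.join(cmd))   # shell-quoted form of the command, for inspection
```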

Example:

After preprocessing the Adult data with preprocess_adult.py, we can train a differentially private Wasserstein GAN on it and evaluate the quality of the synthetic dataset using the following command :

python evaluate.py --target-variable='income' --train-data-path=./data/adult_processed_train.csv --test-data-path=./data/adult_processed_test.csv --normalize-data dp-wgan --enable-privacy --sigma=0.8 --target-epsilon=8

Example Output:

AUC scores of downstream classifiers on test data :
----------------------------------------
LR: 0.7411981709396546
----------------------------------------
Random Forest: 0.7540559254517339
----------------------------------------
Neural Network: 0.7311882809628891
----------------------------------------
GaussianNB: 0.7580265076488256
----------------------------------------
GradientBoostingClassifier: 0.747129484720164
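
For readers unfamiliar with the metric, the AUC reported above can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following is a small pure-Python sketch of that definition. The function name roc_auc is an assumption for this example, and the toolbox itself presumably relies on sklearn for the actual computation.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outranks a negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation.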

Synthetic data can be saved in the /data folder using the flag --save-synthetic

Some useful user args:

General args:

--downstream-task : classification or regression

--normalize-data : Apply sigmoid function to each value in the data

--categorical : Set if all attributes of the data are categorical

--target-variable : Attribute name denoting the target
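
As a rough illustration of what --normalize-data does according to the description above, the sketch below applies a sigmoid to every value so that each one falls between 0 and 1. The function names are hypothetical, and the toolbox's own implementation may differ in detail.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function mapping a real value into the range 0 to 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)            # avoids overflow in exp(-x) for large negative x
    return z / (1.0 + z)


def normalize_rows(rows):
    """Apply the sigmoid to every value of every row."""
    return [[sigmoid(v) for v in row] for row in rows]
```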

Privacy args:

--enable-privacy : Enables private data generation. Non-private mode can only be used for DP-WGAN and IMLE.

--target-epsilon : epsilon parameter of differential privacy

--target-delta : delta parameter of differential privacy

For more details refer to https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/dwork.pdf

Noisy gradient descent args:

--sigma : Gaussian noise variance multiplier. A larger sigma lets the model train for more epochs under the same privacy budget

--clip-coeff : The coefficient to clip the gradients to before adding noise for private SGD training

--micro-batch-size : Parameter to trade off speed vs efficiency. Gradients are averaged over a micro-batch and then clipped before noise is added
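
How these three arguments interact can be sketched in plain Python. This is a simplified illustration of noisy gradient descent and not the toolbox's actual training code. It assumes that gradients are clipped by their L2 norm and that the noise has standard deviation sigma * clip_coeff, as is common in private SGD. The function name noisy_gradient is hypothetical.

```python
import math
import random


def noisy_gradient(per_example_grads, micro_batch_size, clip_coeff, sigma, rng=random):
    """Average per-example gradients in micro-batches, clip, sum, add noise, average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    num_micro_batches = 0
    for start in range(0, len(per_example_grads), micro_batch_size):
        batch = per_example_grads[start:start + micro_batch_size]
        mean = [sum(g[i] for g in batch) / len(batch) for i in range(dim)]
        norm = math.sqrt(sum(v * v for v in mean))
        # Rescale so that the L2 norm of the micro-batch gradient is at most clip_coeff.
        scale = min(1.0, clip_coeff / norm) if norm > 0 else 1.0
        total = [t + scale * v for t, v in zip(total, mean)]
        num_micro_batches += 1
    # Assumption: Gaussian noise with standard deviation sigma * clip_coeff per coordinate.
    noisy = [t + rng.gauss(0.0, sigma * clip_coeff) for t in total]
    return [v / num_micro_batches for v in noisy]
```

With micro_batch_size=1 every example is clipped individually, which is slower. Larger micro-batches mean fewer clipping operations per step.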

Model specific args:

PATE-GAN:

--lap-scale : Inverse Laplace noise scale multiplier. A larger lap_scale will reduce the noise that is added per iteration of training

--num-teachers : Number of teacher discriminators

--teacher-iters : Teacher iterations during training per generator iteration

--student-iters : Student iterations during training per generator iteration

--num-moments : Number of higher moments to use for epsilon calculation
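
The role of --lap-scale and --num-teachers can be illustrated with a PATE-style noisy vote: each teacher discriminator votes real (1) or fake (0), Laplace noise with scale 1 / lap_scale is added to both vote counts, and the student is trained on the noisy majority label. The sketch below is an assumption-laden illustration of that mechanism and not the toolbox's code. Both function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_teacher_label(teacher_votes, lap_scale, rng=None):
    """Return 1 if the noisy count of 'real' votes beats the noisy count of 'fake' votes."""
    rng = rng or random.Random()
    scale = 1.0 / lap_scale                      # larger lap_scale -> less noise
    n_real = sum(teacher_votes) + laplace_noise(scale, rng)
    n_fake = len(teacher_votes) - sum(teacher_votes) + laplace_noise(scale, rng)
    return 1 if n_real > n_fake else 0
```

With a small lap_scale the noise can flip close votes, which protects individual training records at the cost of noisier labels for the student.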

IMLE:

--decay-step : Learning rate decay step

--decay-rate : Learning rate decay rate

--staleness : Number of iterations after which new synthetic samples are generated

--num-samples-factor : Number of synthetic samples generated per real data point
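
One common way to read --decay-step and --decay-rate is as a step schedule, where the learning rate is multiplied by the decay rate once every decay-step iterations. The sketch below shows that interpretation only. It is an assumption about how the two arguments combine, and the IMLE code should be consulted for the exact schedule.

```python
def decayed_learning_rate(base_lr, iteration, decay_step, decay_rate):
    """Step decay: multiply the rate by decay_rate once every decay_step iterations."""
    return base_lr * decay_rate ** (iteration // decay_step)
```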

DP-WGAN:

--clamp-lower : Lower clamp parameter for the weights of the NN in Wasserstein GAN

--clamp-upper : Upper clamp parameter for the weights of the NN in Wasserstein GAN
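
Weight clamping in a Wasserstein GAN keeps every weight of the critic network inside a fixed interval after each update. A minimal sketch on a plain list of weights follows. The function name clamp_weights and the bounds of -0.01 and 0.01 are illustrative assumptions, not the toolbox defaults.

```python
def clamp_weights(weights, clamp_lower, clamp_upper):
    """Clip every weight into the interval [clamp_lower, clamp_upper]."""
    return [min(max(w, clamp_lower), clamp_upper) for w in weights]


print(clamp_weights([-0.5, 0.0, 0.02, 0.3], -0.01, 0.01))  # [-0.01, 0.0, 0.01, 0.01]
```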