finos / datahub

License: Apache-2.0
DataHub - Synthetic data library

Programming languages: Python, JavaScript, CSS

Projects that are alternatives of or similar to datahub

Stocks
machine learning web app game where the user competes against the AI in picking stocks
Stars: ✭ 108 (+63.64%)
Mutual labels:  sklearn, pandas
sklearn-predict
Machine learning on data: predict trends and plot charts
Stars: ✭ 16 (-75.76%)
Mutual labels:  sklearn, pandas
Machine Learning Projects
This repository consists of all my Machine Learning Projects.
Stars: ✭ 135 (+104.55%)
Mutual labels:  sklearn, pandas
Data-Analyst-Nanodegree
Kai Sheng Teh - Udacity Data Analyst Nanodegree
Stars: ✭ 42 (-36.36%)
Mutual labels:  sklearn, pandas
Machine Learning
A machine learning journey starting from scratch
Stars: ✭ 209 (+216.67%)
Mutual labels:  sklearn, pandas
Lambda Packs
Precompiled packages for AWS Lambda
Stars: ✭ 997 (+1410.61%)
Mutual labels:  sklearn, pandas
Data Analysis
Mainly a summary of web-scraping and data-analysis projects, plus modeling, machine learning, and model evaluation.
Stars: ✭ 142 (+115.15%)
Mutual labels:  sklearn, pandas
Daily Stock Forecast
Daily Stock Forecasts using Machine Learning & Python
Stars: ✭ 341 (+416.67%)
Mutual labels:  sklearn, pandas
skippa
SciKIt-learn Pipeline in PAndas
Stars: ✭ 33 (-50%)
Mutual labels:  sklearn, pandas
Data Science Notebook
📖 Every great idea and action has a humble beginning
Stars: ✭ 196 (+196.97%)
Mutual labels:  sklearn, pandas
Ml Cheatsheet
A constantly updated python machine learning cheatsheet
Stars: ✭ 136 (+106.06%)
Mutual labels:  sklearn, pandas
xpandas
Universal 1d/2d data containers with Transformers functionality for data analysis.
Stars: ✭ 25 (-62.12%)
Mutual labels:  sklearn, pandas
Tensorflow Ml Nlp
Natural language processing with TensorFlow and machine learning (from logistic regression to a Transformer chatbot)
Stars: ✭ 176 (+166.67%)
Mutual labels:  sklearn, pandas
skutil
NOTE: skutil is now deprecated. See its sister project: https://github.com/tgsmith61591/skoot. Original description: A set of scikit-learn and h2o extension classes (as well as caret classes for python). See more here: https://tgsmith61591.github.io/skutil
Stars: ✭ 29 (-56.06%)
Mutual labels:  sklearn, pandas
ml-workflow-automation
Python Machine Learning (ML) project that demonstrates the archetypal ML workflow within a Jupyter notebook, with automated model deployment as a RESTful service on Kubernetes.
Stars: ✭ 44 (-33.33%)
Mutual labels:  sklearn, pandas
Breast-Cancer-Scikitlearn
simple tutorial on Machine Learning with Scikitlearn
Stars: ✭ 33 (-50%)
Mutual labels:  sklearn
Arch-Data-Science
Archlinux PKGBUILDs for Data Science, Machine Learning, Deep Learning, NLP and Computer Vision
Stars: ✭ 92 (+39.39%)
Mutual labels:  pandas
stream2segment
A Python project to download, process and visualize medium-to-massive amount of seismic waveforms and metadata
Stars: ✭ 18 (-72.73%)
Mutual labels:  pandas
neworder
A dynamic microsimulation framework for python
Stars: ✭ 15 (-77.27%)
Mutual labels:  pandas
pandas-stubs
Pandas type stubs. Helps you type-check your code.
Stars: ✭ 84 (+27.27%)
Mutual labels:  pandas

DataHub

Synthetic data generation

DataHub is a set of Python libraries dedicated to the production of synthetic data for use in tests, machine learning training, statistical analysis, and other use cases (see the wiki). DataHub uses existing datasets to generate synthetic models. If no existing data is available, it uses user-provided scripts and data rules to generate synthetic data with out-of-the-box helper datasets.
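As a rough illustration of the rule-driven mode described above, the sketch below generates records from a set of per-column rules using only the Python standard library. It is not DataHub's API: the `RULES` mapping, the `generate_rows` helper, and the column names are all hypothetical and exist only to show the idea.

```python
import random

# Hypothetical rule set (not part of DataHub): each column name maps to a
# function that draws one value from a random number generator.
RULES = {
    "account_id": lambda rng: f"ACC{rng.randrange(10**6):06d}",
    "currency": lambda rng: rng.choice(["USD", "EUR", "GBP"]),
    "balance": lambda rng: round(rng.uniform(0, 10_000), 2),
}


def generate_rows(rules, n_rows, seed=0):
    """Return n_rows synthetic records, applying every rule once per row."""
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    return [{col: rule(rng) for col, rule in rules.items()} for _ in range(n_rows)]


rows = generate_rows(RULES, n_rows=5)
```

A list of dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)` if a data frame is needed.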

Synthetic datasets are simply artificially manufactured sets, produced to a desired degree of accuracy. Real data does play a part in synthetic generation, depending on the realism you require. The product roadmap details the functionality planned in this respect.

DataHub's core is built predominantly around pandas data frames and object generation. A common question is: now that I have a data frame of synthetic data, what do I do with it? The pandas library offers an array of options here, so for the time being sinking to databases is out of scope for the core library. See the examples in the test folder for some common patterns.

Note: As we build out a config-based synthetic spec generator, we will bring this back into scope. Please see our roadmap and issue list and get involved in the discussion.
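Until then, plain pandas is enough to sink a data frame. The following is a minimal sketch, not taken from the DataHub test folder: the `trades` table and its columns are made up for illustration, and an in-memory SQLite database stands in for a real target.

```python
import sqlite3

import pandas as pd

# Stand-in for a synthetic data frame; the columns are illustrative only.
df = pd.DataFrame(
    {
        "trade_id": [1, 2, 3],
        "symbol": ["AAA", "BBB", "CCC"],
        "quantity": [100, 250, 75],
    }
)

# Sink to an in-memory SQLite database with DataFrame.to_sql.
conn = sqlite3.connect(":memory:")
df.to_sql("trades", conn, index=False, if_exists="replace")

# Read the rows back to confirm the round trip.
loaded = pd.read_sql_query("SELECT * FROM trades ORDER BY trade_id", conn)
conn.close()

# Flat-file alternative: to_csv returns a string when no path is given.
csv_text = df.to_csv(index=False)
```

For other databases, `to_sql` accepts a SQLAlchemy engine or connection in place of the `sqlite3` connection.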

Key documents

  1. For information on how to get started with DataHub, see our Getting Started Guide.
  2. For more technical information about DataHub and how to customize it, see the Developer Guide.
  3. For high-level project direction see Road Map, Requirements Gathering Approach and Delegated Action Groups.
  4. For Feature Development, Good First Issues, Help Wanted and Bug Tracking see DataHub GitHub Issues.
  5. This project uses Gravizo for all diagrams and charts as highlighted in DataHub Issue 41.

Overview of Synthetic data

  • Synthetic data is information that is artificially manufactured rather than generated by real-world events.
  • Synthetic data is created algorithmically and can be used as a stand-in for test datasets of production data.
  • Real data does play a part in synthetic data generation, depending on how realistic you want the output to be.
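To make the last point concrete, here is a small sketch of how real data can inform synthetic output: it fits a mean and standard deviation to a sample and then draws new values from a normal distribution with those parameters. This is an assumed, deliberately simple technique chosen for illustration, not DataHub's own method, and the `real_ages` values and `synthesize` helper are invented.

```python
import random
import statistics

# Small stand-in for a "real" sample; the values are made up for illustration.
real_ages = [23, 35, 41, 29, 52, 38, 47, 31]

# Fit a very simple model: the sample mean and standard deviation.
mu = statistics.mean(real_ages)
sigma = statistics.stdev(real_ages)


def synthesize(n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to the sample."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [rng.gauss(mu, sigma) for _ in range(n)]


synthetic_ages = synthesize(1000)
```

The synthetic values follow the overall shape of the sample without copying any individual record. A more realistic generator would also need to preserve correlations between columns.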

License

Copyright 2020 Citigroup

Distributed under the Apache License, Version 2.0.

SPDX-License-Identifier: Apache-2.0
