All Projects → sfu-db → Dataprep

sfu-db / Dataprep

Licence: mit
DataPrep — The easiest way to prepare data in Python

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Dataprep

Pandas Profiling
Create HTML profiling reports from pandas DataFrame objects
Stars: ✭ 8,329 (+1203.44%)
Mutual labels:  data-science, exploratory-data-analysis, eda
Complete Life Cycle Of A Data Science Project
Complete-Life-Cycle-of-a-Data-Science-Project
Stars: ✭ 140 (-78.09%)
Mutual labels:  data-science, exploratory-data-analysis, eda
100 Days Of Ml Code
A day to day plan for this challenge. Covers both theoritical and practical aspects
Stars: ✭ 172 (-73.08%)
Mutual labels:  data-science, exploratory-data-analysis, eda
Great expectations
Always know what to expect from your data.
Stars: ✭ 5,808 (+808.92%)
Mutual labels:  data-science, exploratory-data-analysis, eda
Data Describe
data⎰describe: Pythonic EDA Accelerator for Data Science
Stars: ✭ 269 (-57.9%)
Mutual labels:  data-science, exploratory-data-analysis, eda
Sweetviz
Visualize and compare datasets, target values and associations, with one line of code.
Stars: ✭ 1,851 (+189.67%)
Mutual labels:  data-science, exploratory-data-analysis, eda
Kaggle Competitions
There are plenty of courses and tutorials that can help you learn machine learning from scratch but here in GitHub, I want to solve some Kaggle competitions as a comprehensive workflow with python packages. After reading, you can use this workflow to solve other real problems and use it as a template.
Stars: ✭ 86 (-86.54%)
Mutual labels:  data-science, exploratory-data-analysis
Xda
R package for exploratory data analysis
Stars: ✭ 112 (-82.47%)
Mutual labels:  data-science, exploratory-data-analysis
skimpy
skimpy is a light weight tool that provides summary statistics about variables in data frames within the console.
Stars: ✭ 236 (-63.07%)
Mutual labels:  exploratory-data-analysis, eda
Sparkora
Powerful rapid automatic EDA and feature engineering library with a very easy to use API 🌟
Stars: ✭ 51 (-92.02%)
Mutual labels:  exploratory-data-analysis, eda
Exploratory Data Analysis Visualization Python
Data analysis and visualization with PyData ecosystem: Pandas, Matplotlib Numpy, and Seaborn
Stars: ✭ 78 (-87.79%)
Mutual labels:  exploratory-data-analysis, eda
Spark R Notebooks
R on Apache Spark (SparkR) tutorials for Big Data analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 109 (-82.94%)
Mutual labels:  data-science, exploratory-data-analysis
My Journey In The Data Science World
📢 Ready to learn or review your knowledge!
Stars: ✭ 1,175 (+83.88%)
Mutual labels:  data-science, eda
Autoeda Resources
A list of software and papers related to automatic and fast Exploratory Data Analysis
Stars: ✭ 268 (-58.06%)
Mutual labels:  exploratory-data-analysis, eda
Code
Compilation of R and Python programming codes on the Data Professor YouTube channel.
Stars: ✭ 287 (-55.09%)
Mutual labels:  data-science, exploratory-data-analysis
leila
Librería para la evaluación de calidad de datos, e interacción con el portal de datos.gov.co
Stars: ✭ 56 (-91.24%)
Mutual labels:  exploratory-data-analysis, eda
Inspectdf
🛠️ 📊 Tools for Exploring and Comparing Data Frames
Stars: ✭ 195 (-69.48%)
Mutual labels:  exploratory-data-analysis, eda
Lux
Python API for Intelligent Visual Data Discovery
Stars: ✭ 787 (+23.16%)
Mutual labels:  data-science, exploratory-data-analysis
olliePy
OlliePy is a python package which can help data scientists in exploring their data and evaluating and analysing their machine learning experiments by utilising the power and structure of modern web applications. The data scientist only needs to provide the data and any required information and OlliePy will generate the rest.
Stars: ✭ 46 (-92.8%)
Mutual labels:  exploratory-data-analysis, eda
Dataexplorer
Automate Data Exploration and Treatment
Stars: ✭ 362 (-43.35%)
Mutual labels:  data-science, eda

Documentation | Forum | Mail List

DataPrep lets you prepare your data using a single library with a few lines of code.

Currently, you can use DataPrep to:

Releases

Repo Version Downloads
PyPI
conda-forge

Installation

pip install -U dataprep

Connector

Connector is an intuitive, open-source API wrapper that speeds up development by standardizing calls to multiple APIs as a simple workflow.

Connector provides a simple wrapper to collect structured data from different Web APIs (e.g., Twitter, Spotify), making web data collection easy and efficient, without requiring advanced programming skills.

Do you want to leverage the growing number of websites that are opening their data through public APIs? Connector is for you!

Let's check out the several benefits that Connector offers:

  • A unified API: You can fetch data using one or two lines of code to get data from tens of popular websites.
  • Auto Pagination: Do you want to invoke a Web API that could return a large result set and need to handle it through pagination? Connector automatically does the pagination for you! Just specify the desired number of returned results (argument _count) without getting into unnecessary detail about a specific pagination scheme.
  • Smart API request strategy: Do you want to fetch results more quickly by making concurrent requests to Web APIs? Through the _concurrency argument, Connector simplifies concurrency, issuing API requests in parallel while respecting the API's rate limit policy.

How to fetch all publications of Andrew Y. Ng?

from dataprep.connector import connect
conn_dblp = connect("dblp", _concurrency = 5)
df = await conn_dblp.query("publication", author = "Andrew Y. Ng", _count = 2000)

Here, you can find detailed Examples.

Connector is designed to be easy to extend. If you want to connect with your own web API, you just have to write a simple configuration file to support it. This configuration file describes the API's main attributes like the URL, query parameters, authorization method, pagination properties, etc.

EDA

DataPrep.EDA is the fastest and the easiest EDA (Exploratory Data Analysis) tool in Python. It allows you to understand a Pandas/Dask DataFrame with a few lines of code in seconds.

Create Profile Reports, Fast

You can create a beautiful profile report from a Pandas/Dask DataFrame with the create_report function. DataPrep.EDA has the following advantages compared to other tools:

  • 10-100X Faster: DataPrep.EDA is 10-100X faster than Pandas-based profiling tools due to its highly optimized Dask-based computing module.
  • Interactive Visualization: DataPrep.EDA generates interactive visualizations in a report, which makes the report look more appealing to end users.
  • Big Data Support: DataPrep.EDA naturally supports big data stored in a Dask cluster by accepting a Dask dataframe as input.

The following code demonstrates how to use DataPrep.EDA to create a profile report for the titanic dataset.

from dataprep.datasets import load_dataset
from dataprep.eda import create_report
df = load_dataset("titanic")
create_report(df).show_browser()

Click here to see the generated report of the above code.

Innovative System Design

DataPrep.EDA is the only task-centric EDA system in Python. It is carefully designed to improve usability.

  • Task-Centric API Design: You can declaratively specify a wide range of EDA tasks in different granularities with a single function call. All needed visualizations will be automatically and intelligently generated for you.
  • Auto-Insights: DataPrep.EDA automatically detects and highlights the insights (e.g., a column has many outliers) to facilitate pattern discovery about the data.
  • How-to Guide : A how-to guide is provided to show the configuration of each plot function. With this feature, you can easily customize the generated visualizations.

Understand the Titanic dataset with Task-Centric API:

Click here to check all the supported tasks.

Check plot, plot_correlation, plot_missing and create_report to see how each function works.

Clean

DataPrep.Clean contains simple functions designed for cleaning and validating data in a DataFrame. It provides

  • A Unified API: each function follows the syntax clean_{type}(df, 'column name') (see an example below).
  • Speed: the computations are parallelized using Dask. It can clean 50K rows per second on a dual-core laptop (that means cleaning 1 million rows in only 20 seconds).
  • Transparency: a report is generated that summarizes the alterations to the data that occured during cleaning.

The following example shows how to clean and standardize a column of country names.

from dataprep.clean import clean_country
import pandas as pd
df = pd.DataFrame({'country': ['USA', 'country: Canada', '233', ' tr ', 'NA']})
df2 = clean_country(df, 'country')
df2
           country  country_clean
0              USA  United States
1  country: Canada         Canada
2              233        Estonia
3              tr          Turkey
4               NA            NaN

Type validation is also supported:

from dataprep.clean import validate_country
series = validate_country(df['country'])
series
0     True
1    False
2     True
3     True
4    False
Name: country, dtype: bool

Currently supports functions for: Column Headers | Country Names | Dates and Times | Email Addresses | Geographic Coordinates | IP Addresses | Phone Numbers | URLs | US Street Addresses

Documentation

The following documentation can give you an impression of what DataPrep can do:

Contribute

There are many ways to contribute to DataPrep.

  • Submit bugs and help us verify fixes as they are checked in.
  • Review the source code changes.
  • Engage with other DataPrep users and developers on StackOverflow.
  • Help each other in the DataPrep Community Discord and Mail list & Forum.
  • Twitter
  • Contribute bug fixes.
  • Providing use cases and writing down your user experience.

Please take a look at our wiki for development documentations!

Acknowledgement

Some functionalities of DataPrep are inspired by the following packages.

  • Pandas Profiling

    Inspired the report functionality and insights provided in dataprep.eda.

  • missingno

    Inspired the missing value analysis in dataprep.eda.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].