boxuancui / Dataexplorer

Licence: other
Automate Data Exploration and Treatment

DataExplorer

Background

Exploratory Data Analysis (EDA) is an important initial phase of data analysis and predictive modeling. During this phase, analysts and modelers take a first look at the data, generate relevant hypotheses, and decide on next steps. However, the EDA process can be tedious. This R package aims to automate most of the data handling and visualization, so that users can focus on studying the data and extracting insights.

Installation

The package can be installed directly from CRAN.

install.packages("DataExplorer")

The latest stable version (if any) can be found on GitHub and installed using the devtools package.

## Install devtools first if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer")

If you would like to install the latest development version, you may install the develop branch.

## Install devtools first if it is not already available
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("boxuancui/DataExplorer", ref = "develop")

Examples

The package is designed to be easy to use: almost everything can be done in one line of code. Please refer to the package manuals for more information. The package also ships with vignettes.

Report

To get a report for the airquality dataset:

library(DataExplorer)
create_report(airquality)

To get a report for the diamonds dataset with response variable price:

library(ggplot2)
create_report(diamonds, y = "price")

Visualization

Instead of running create_report, you may also run each function individually for your analysis, e.g.,

## View basic description for airquality data
introduce(airquality)
#> rows                    153
#> columns                 6
#> discrete_columns        0
#> continuous_columns      6
#> all_missing_columns     0
#> total_missing_values    44
#> complete_rows           111
#> total_observations      918
#> memory_usage            6,376
## Plot basic description for airquality data
plot_intro(airquality)

## View missing value distribution for airquality data
plot_missing(airquality)

## Frequency distribution of all discrete variables
plot_bar(diamonds)
## Distribution of all discrete variables, measured by the sum of `price` instead of frequency
plot_bar(diamonds, with = "price")

## View frequency distribution by a discrete variable
plot_bar(diamonds, by = "cut")

## View histogram of all continuous variables
plot_histogram(diamonds)

## View estimated density distribution of all continuous variables
plot_density(diamonds)

## View quantile-quantile plot of all continuous variables
plot_qq(diamonds)

## View quantile-quantile plot of all continuous variables by feature `cut`
plot_qq(diamonds, by = "cut")

## View overall correlation heatmap
plot_correlation(diamonds)

## View bivariate continuous distribution based on `cut`
plot_boxplot(diamonds, by = "cut")

## Scatterplot of `price` against all other continuous features
plot_scatterplot(split_columns(diamonds)$continuous, by = "price", sampled_rows = 1000L)

## Visualize principal component analysis
plot_prcomp(diamonds, maxcat = 5L)
#> 2 features with more than 5 categories ignored!
#> color: 7 categories
#> clarity: 8 categories
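
The figures reported by introduce(airquality) above can be cross-checked with base R alone. The following is a minimal sketch, not the package's implementation: `basic_profile` is a hypothetical helper, and classifying numeric columns as continuous and everything else as discrete is a simplifying assumption. It omits memory_usage.

```r
# Base-R cross-check of the counts shown by introduce(airquality).
# `basic_profile` is a hypothetical helper, not part of DataExplorer.
basic_profile <- function(data) {
  # Assumption: numeric columns are "continuous", all others "discrete"
  is_continuous <- vapply(data, is.numeric, logical(1))
  # Columns made up entirely of NA are counted separately
  all_missing <- vapply(data, function(x) all(is.na(x)), logical(1))
  data.frame(
    rows = nrow(data),
    columns = ncol(data),
    discrete_columns = sum(!is_continuous & !all_missing),
    continuous_columns = sum(is_continuous & !all_missing),
    all_missing_columns = sum(all_missing),
    total_missing_values = sum(is.na(data)),
    complete_rows = sum(complete.cases(data)),
    total_observations = nrow(data) * ncol(data)
  )
}

profile <- basic_profile(airquality)
print(t(profile))  # one metric per line, as in the listing above
```

For airquality this gives 153 rows, 6 columns, 44 missing values and 111 complete rows, which agrees with the output listed above.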

Feature Engineering

To make quick updates to your data:

## Group bottom 20% `clarity` by frequency
group_category(diamonds, feature = "clarity", threshold = 0.2, update = TRUE)

## Group bottom 20% `clarity` by `price`
group_category(diamonds, feature = "clarity", threshold = 0.2, measure = "price", update = TRUE)

## Dummify diamonds dataset
dummify(diamonds)
dummify(diamonds, select = "cut")

## Set values for missing observations
df <- data.frame("a" = rnorm(260), "b" = rep(letters, 10))
df[sample.int(260, 50), ] <- NA
set_missing(df, list(0L, "unknown"))

## Update columns
update_columns(airquality, c("Month", "Day"), as.factor)
update_columns(airquality, 1L, function(x) x^2)

## Drop columns
drop_columns(diamonds, 8:10)
drop_columns(diamonds, "clarity")
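
To make the set_missing example above more concrete, here is a rough base-R approximation of what filling continuous columns with 0 and discrete columns with "unknown" amounts to. It is a sketch under stated assumptions: `fill_missing` is a hypothetical helper, not a DataExplorer function, and unlike the package it always returns a modified copy and converts factors to character.

```r
# Hypothetical helper approximating set_missing(df, list(0L, "unknown")).
fill_missing <- function(data, continuous_value = 0, discrete_value = "unknown") {
  for (col in names(data)) {
    missing_idx <- is.na(data[[col]])
    if (!any(missing_idx)) next
    if (is.numeric(data[[col]])) {
      data[[col]][missing_idx] <- continuous_value
    } else {
      # Convert factors to character so the new label is always accepted
      if (is.factor(data[[col]])) data[[col]] <- as.character(data[[col]])
      data[[col]][missing_idx] <- discrete_value
    }
  }
  data
}

set.seed(1)  # only to make the blanked rows reproducible
df <- data.frame(a = rnorm(260), b = rep(letters, 10))
df[sample.int(260, 50), ] <- NA   # blank out 50 whole rows

filled <- fill_missing(df)
sum(is.na(df))      # 100 missing values: 50 rows x 2 columns
sum(is.na(filled))  # 0 after filling
```

The original `df` is left untouched here, which makes before-and-after comparisons easy.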

Articles

See the article wiki page.
