Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

Created with love in Canada, visit hostnodejs.com today

Feel like to post an Ad? Learn Details

All Projects → ujjwalkarn → Xda

ujjwalkarn / Xda

R package for exploratory data analysis

Programming Languages

7636 projects

Labels

data-science data-analysis exploratory-data-analysis

Projects that are alternatives of or similar to Xda

Visualize and compare datasets, target values and associations, with one line of code.

Stars: ✭ 1,851 (+1552.68%)

Mutual labels: data-science, data-analysis, exploratory-data-analysis

Spark R Notebooks

R on Apache Spark (SparkR) tutorials for Big Data analysis and Machine Learning as IPython / Jupyter notebooks

Stars: ✭ 109 (-2.68%)

Mutual labels: data-science, data-analysis, exploratory-data-analysis

Pandas Profiling

Create HTML profiling reports from pandas DataFrame objects

Stars: ✭ 8,329 (+7336.61%)

Mutual labels: data-science, data-analysis, exploratory-data-analysis

Ai Expert Roadmap

Roadmap to becoming an Artificial Intelligence Expert in 2021

Stars: ✭ 15,441 (+13686.61%)

Mutual labels: data-science, data-analysis

TSrepr: R package for time series representations

Stars: ✭ 75 (-33.04%)

Mutual labels: data-science, data-analysis

50% faster, 50% less RAM Machine Learning. Numba rewritten Sklearn. SVD, NNMF, PCA, LinearReg, RidgeReg, Randomized, Truncated SVD/PCA, CSR Matrices all 50+% faster

Stars: ✭ 1,204 (+975%)

Mutual labels: data-science, data-analysis

🍧 A repository that contains courses I have taken on DataCamp

Stars: ✭ 69 (-38.39%)

Mutual labels: data-science, data-analysis

Dex : The Data Explorer -- A data visualization tool written in Java/Groovy/JavaFX capable of powerful ETL and publishing web visualizations.

Stars: ✭ 1,238 (+1005.36%)

Mutual labels: data-science, data-analysis

A simple Spark-powered ETL framework that just works 🍺

Stars: ✭ 79 (-29.46%)

Mutual labels: data-science, data-analysis

Accelerate your ML and Data workflows to production. Flyte is a production grade orchestration system for your Data and ML workloads. It has been battle tested at Lyft, Spotify, freenome and others and truly open-source.

Stars: ✭ 1,242 (+1008.93%)

Mutual labels: data-science, data-analysis

Bayesian Cognitive Modeling In Pymc3

PyMC3 codes of Lee and Wagenmakers' Bayesian Cognitive Modeling - A Pratical Course

Stars: ✭ 93 (-16.96%)

Mutual labels: data-science, data-analysis

Apache Superset is a Data Visualization and Data Exploration Platform

Stars: ✭ 42,634 (+37966.07%)

Mutual labels: data-science, data-analysis

Data Analysis program and framework for materials science data analytics, based on the managing framework SIMPL framework.

Stars: ✭ 73 (-34.82%)

Mutual labels: data-science, data-analysis

Ml Da Coursera Yandex Mipt

Machine Learning and Data Analysis Coursera Specialization from Yandex and MIPT

Stars: ✭ 108 (-3.57%)

Mutual labels: data-science, data-analysis

My Journey In The Data Science World

📢 Ready to learn or review your knowledge!

Stars: ✭ 1,175 (+949.11%)

Mutual labels: data-science, data-analysis

数据接口：百度、谷歌、头条、微博指数,宏观数据，利率数据，货币汇率，千里马、独角兽公司，新闻联播文字稿，影视票房数据，高校名单，疫情数据…

Stars: ✭ 1,229 (+997.32%)

Mutual labels: data-science, data-analysis

fklearn: Functional Machine Learning

Stars: ✭ 1,305 (+1065.18%)

Mutual labels: data-science, data-analysis

Spark Py Notebooks

Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Stars: ✭ 1,338 (+1094.64%)

Mutual labels: data-science, data-analysis

A visualisation tool for the creation and analysis of graphs

Stars: ✭ 67 (-40.18%)

Mutual labels: data-science, data-analysis

Awesome Business Intelligence

Actively curated list of awesome BI tools. PRs welcome!

Stars: ✭ 1,157 (+933.04%)

Mutual labels: data-science, data-analysis

View All Similar Projects ➔

xda: R package for exploratory data analysis

This package contains several tools to perform initial exploratory analysis on any input dataset. It includes custom functions for plotting the data as well as performing different kinds of analyses such as univariate, bivariate and multivariate investigation which is the first step of any predictive modeling pipeline. This package can be used to get a good sense of any dataset before jumping on to building predictive models.

The package is constantly under development and more functionalities will be added soon. Pull requests to add more functions are welcome!

The functions currently included in the package are mentioned below:

numSummary(mydata) function automatically detects all numeric columns in the dataframe mydata and provides their summary statistics
charSummary(mydata) function automatically detects all character columns in the dataframe mydata and provides their summary statistics
Plot(mydata, dep.var) plots all independent variables in the dataframe mydata against the dependant variable specified by the dep.var parameter
removeSpecial(mydata, vec) replaces all special characters (specified by vector vec) in the dataframe mydata with NA
bivariate(mydata, dep.var, indep.var) performs bivariate analysis between dependent variable dep.var and independent variable indep.var in the dataframe mydata

More functions to be added soon.

Note: All functions mentioned above expect mydata to be a data.frame - please convert your input dataset to a data.frame before using any function from this package.

Installation

The best way to install xda package is to install devtools package first. To install devtools, please follow instructions here. Then, use the following commands to install xda:
```
library(devtools)
install_github("ujjwalkarn/xda")
```

Alternatively, you may also use the githubinstall package for installing xda:

install.packages("githubinstall")
library(githubinstall)
githubinstall("xda")

Usage

For examples below, the popular iris dataset and the warpbreaks dataset has been used. Please refer to the documentation of each function to understand how to use it. For example, to see the documenation for the numSummary() function, use ?numSummary.

## load the package into the current session

library(xda)

numSummary()

## to view a comprehensive summary for all numeric columns in the iris dataset

numSummary(iris)

## n = total number of rows for that variable
## nunique = number of unique values
## nzeroes = number of zeroes
## iqr = interquartile range
## noutlier = number of outliers
## miss = number of rows with missing value
## miss% = percentage of total rows with missing values ((miss/n)*100)
## 5% = 5th percentile value of that variable (value below which 5 percent of the observations may be found)
## the percentile values are helpful in detecting outliers

Output

> numSummary(iris)

                n mean    sd max min range nunique nzeros  iqr lowerbound upperbound noutlier kurtosis skewness mode miss miss%   1%   5% 25%  50% 75%  95%  99%
 Sepal.Length 150 5.84 0.828 7.9 4.3   3.6      35      0 1.30       3.15       8.35        0   -0.606    0.309  5.0    0     0 4.40 4.60 5.1 5.80 6.4 7.25 7.70
 Sepal.Width  150 3.06 0.436 4.4 2.0   2.4      23      0 0.50       2.05       4.05        4    0.139    0.313  3.0    0     0 2.20 2.34 2.8 3.00 3.3 3.80 4.15
 Petal.Length 150 3.76 1.765 6.9 1.0   5.9      43      0 3.55      -3.72      10.42        0   -1.417   -0.269  1.4    0     0 1.15 1.30 1.6 4.35 5.1 6.10 6.70
 Petal.Width  150 1.20 0.762 2.5 0.1   2.4      22      0 1.50      -1.95       4.05        0   -1.358   -0.101  0.2    0     0 0.10 0.20 0.3 1.30 1.8 2.30 2.50

charSummary()

## to view a comprehensive summary for all character columns in the warpbreaks dataset

charSummary(warpbreaks)

## n = total number of rows for that variable
## miss = number of rows with missing value
## miss% = percentage of total rows with missing values ((n/miss)*100)
## unique = number of unique levels of that variable
## top5levels:count = top 5 levels (unique values) in each column sorted by count
## for example, wool has 2 unique levels 'A' and 'B' each with count of 27

Output

> charSummary(warpbreaks)

          n miss miss% unique top5levels:count
 wool    54    0     0      2       A:27, B:27
 tension 54    0     0      3 H:18, L:18, M:18

bivariate()

## to perform bivariate analysis between 'Species' and 'Sepal.Length' in the iris dataset

bivariate(iris,'Species','Sepal.Length')

## bin_Sepal.Length = 'Sepal.Length' variable has been binned into 4 equal intervals (original range is [4.3,7.9])
## for each interval of 'Sepal.Length', the number of samples from each category of 'Species' is shown 
## i.e. 39 of the 50 samples of Setosa have Sepal.Length is in the range (4.3,5.2], and so on. 
## the number of intervals (4 in this case) can be customized (see documentation)

Output

> bivariate(iris,'Species','Sepal.Length')

   bin_Sepal.Length setosa versicolor virginica
 1        (4.3,5.2]     39          5         1
 2        (5.2,6.1]     11         29        10
 3          (6.1,7]      0         16        27
 4          (7,7.9]      0          0        12

Plot()

## to plot all other variables against the 'Petal.Length' variable in the iris dataset

Plot(iris,'Petal.Length')

## some interesting patterns can be seen in the plots below and these insights can be used for predictive modeling

Output

> Plot(iris,'Petal.Length')

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].

Stars: ✭ 112

Visit Git Page 🔗Visit User Page 🔗Visit Issues Page (3) 🔗