
machow / Siuba

Licence: MIT
Python library for using dplyr-like syntax with pandas and SQL

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Siuba

tutorials
Short programming tutorials pertaining to data analysis.
Stars: ✭ 14 (-97.69%)
Mutual labels:  dplyr, pandas, data-analysis
validada
Another library for defensive data analysis.
Stars: ✭ 29 (-95.21%)
Mutual labels:  pandas, data-analysis
GreyNSights
Privacy-Preserving Data Analysis using Pandas
Stars: ✭ 18 (-97.02%)
Mutual labels:  pandas, data-analysis
Data Science Hacks
Data Science Hacks is a collection of tips and tricks to help you become a better data scientist. The hacks are for everyone, from beginner to advanced, and cover Python, Jupyter notebooks, pandas, and more.
Stars: ✭ 273 (-54.88%)
Mutual labels:  data-analysis, pandas
Data-Analyst-Nanodegree
Kai Sheng Teh - Udacity Data Analyst Nanodegree
Stars: ✭ 42 (-93.06%)
Mutual labels:  pandas, data-analysis
data-analysis-using-python
Data Analysis Using Python: A Beginner’s Guide Featuring NYC Open Data
Stars: ✭ 81 (-86.61%)
Mutual labels:  pandas, data-analysis
Datagear
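A data visualization and analysis platform, developed in Java with a browser/server architecture, supporting multiple data sources including SQL, CSV, Excel, HTTP APIs, and JSON.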
Stars: ✭ 266 (-56.03%)
Mutual labels:  sql, data-analysis
datatile
A library for managing, validating, summarizing, and visualizing data.
Stars: ✭ 419 (-30.74%)
Mutual labels:  pandas, data-analysis
Ai Learn
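An AI learning roadmap collecting nearly 200 hands-on case studies and projects, with free accompanying teaching materials; beginner-friendly and geared toward job readiness. Covers popular areas including Python, mathematics, machine learning, data analysis, deep learning, computer vision, natural language processing, PyTorch tensorflow machine-learning, deep-learning data-analysis data-mining mathematics data-science artificial-intelligence python tensorflow tensorflow2 caffe keras pytorch algorithm numpy pandas matplotlib seaborn nlp cv, and more.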
Stars: ✭ 4,387 (+625.12%)
Mutual labels:  data-analysis, pandas
Pandas Summary
An extension to pandas dataframes describe function.
Stars: ✭ 361 (-40.33%)
Mutual labels:  data-analysis, pandas
Prettypandas
A Pandas Styler class for making beautiful tables
Stars: ✭ 376 (-37.85%)
Mutual labels:  data-analysis, pandas
Dominando-Pandas
This repository is dedicated to the process of learning the Pandas library.
Stars: ✭ 22 (-96.36%)
Mutual labels:  pandas, data-analysis
Product-Categorization-NLP
Multi-Class Text Classification for products based on their description with Machine Learning algorithms and Neural Networks (MLP, CNN, Distilbert).
Stars: ✭ 30 (-95.04%)
Mutual labels:  pandas, data-analysis
visions
Type System for Data Analysis in Python
Stars: ✭ 136 (-77.52%)
Mutual labels:  pandas, data-analysis
ipython-notebooks
A collection of Jupyter notebooks exploring different datasets.
Stars: ✭ 43 (-92.89%)
Mutual labels:  pandas, data-analysis
fairlens
Identify bias and measure fairness of your data
Stars: ✭ 51 (-91.57%)
Mutual labels:  pandas, data-analysis
Pydata Notebook
Chinese translation notes for Python for Data Analysis, 2nd Edition (2017)
Stars: ✭ 4,300 (+610.74%)
Mutual labels:  data-analysis, pandas
dataquest-guided-projects-solutions
My dataquest project solutions
Stars: ✭ 35 (-94.21%)
Mutual labels:  pandas, data-analysis
online-course-recommendation-system
Built on data from Pluralsight's course API fetched results. Works with model trained with K-means unsupervised clustering algorithm.
Stars: ✭ 31 (-94.88%)
Mutual labels:  pandas, data-analysis
Zat
Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark
Stars: ✭ 303 (-49.92%)
Mutual labels:  data-analysis, pandas

siuba

scrappy data analysis, with seamless support for pandas and SQL


siuba (小巴) is a port of dplyr and other R libraries. It supports a tabular data analysis workflow centered on 5 common actions:

  • select() - keep certain columns of data.
  • filter() - keep certain rows of data.
  • mutate() - create or modify an existing column of data.
  • summarize() - reduce one or more columns down to a single number.
  • arrange() - reorder the rows of data.

These actions can be preceded by a group_by(), which causes them to be applied individually to grouped rows of data. Moreover, many SQL concepts, such as distinct(), count(), and joins, are implemented. Inputs to these functions can be a pandas DataFrame or a SQL connection (currently postgres, redshift, or sqlite).

For more on the rationale behind tools like dplyr, see this tidyverse paper. For examples of siuba in action, see the siuba documentation.

Installation

pip install siuba

Examples

See the siuba docs or this live analysis for a full introduction.

Basic use

The code below uses the example DataFrame mtcars to compute the average horsepower (hp) for each number of cylinders (cyl).

from siuba import group_by, summarize, _
from siuba.data import mtcars

(mtcars
  >> group_by(_.cyl)
  >> summarize(avg_hp = _.hp.mean())
  )
Out[1]: 
   cyl      avg_hp
0    4   82.636364
1    6  122.285714
2    8  209.214286

There are three key concepts in this example:

concept        | example                 | meaning
---------------|-------------------------|--------
verb           | group_by(...)           | a function that operates on a table, like a DataFrame or SQL table
siu expression | _.hp.mean()             | an expression created with siuba._ that represents actions you want to perform
pipe           | mtcars >> group_by(...) | a syntax that allows you to chain verbs with the >> operator

See introduction to siuba.
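To make the pipe concept more concrete, here is a toy sketch of how a >> pipe can be built in plain Python using the reflected right-shift method __rrshift__. This is not siuba's implementation; the Verb class, keep_if, sort_by, and the sample rows are all hypothetical and exist only for illustration.

```python
class Verb:
    """Wraps a function plus its extra arguments so data can be piped into it."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def __rrshift__(self, data):
        # Python falls back to verb.__rrshift__(data) for `data >> verb`
        # when `data` does not itself handle >> with this operand.
        return self.func(data, *self.args)


def keep_if(predicate):
    """Toy verb: keep the rows for which predicate(row) is true."""
    return Verb(lambda rows, p: [r for r in rows if p(r)], predicate)


def sort_by(key):
    """Toy verb: sort rows by key(row)."""
    return Verb(lambda rows, k: sorted(rows, key=k), key)


# A few made-up rows standing in for a table.
rows = [{"cyl": 4, "hp": 93}, {"cyl": 8, "hp": 175}, {"cyl": 4, "hp": 62}]

result = rows >> keep_if(lambda r: r["cyl"] == 4) >> sort_by(lambda r: r["hp"])
print(result)
```

Because >> is left-associative, each step receives the output of the previous one, so the chain reads top to bottom in the order the operations happen.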

What is a siu expression (e.g. _.cyl == 4)?

A siu expression is a way of specifying what action you want to perform. This allows siuba verbs to decide how to execute the action, depending on whether your data is a local DataFrame or remote table.

from siuba import _

_.cyl == 4
Out[2]:
█─==
├─█─.
│ ├─_
│ └─'cyl'
└─4
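The idea of recording an action first and deciding later how to run it can be sketched in a few lines of plain Python. The toy below is not how siuba is implemented; the Sym class and the evaluate and to_sql helpers are hypothetical names, and only attribute access and == are supported. It shows one recorded expression being executed locally on a dict and also rendered as a SQL-like string.

```python
class Sym:
    """A tiny deferred expression: records operations instead of running them."""

    def __init__(self, kind, *parts):
        self.kind = kind      # one of "root", "attr", "eq"
        self.parts = parts

    def __getattr__(self, name):
        # Only reached for names not set in __init__; record a column access.
        if name.startswith("__"):
            raise AttributeError(name)
        return Sym("attr", self, name)

    def __eq__(self, other):
        # Record the comparison rather than performing it.
        return Sym("eq", self, other)


_ = Sym("root")


def evaluate(expr, row):
    """Run the recorded expression against a local row (a dict)."""
    if not isinstance(expr, Sym):
        return expr
    if expr.kind == "root":
        return row
    if expr.kind == "attr":
        parent, name = expr.parts
        return evaluate(parent, row)[name]
    if expr.kind == "eq":
        left, right = expr.parts
        return evaluate(left, row) == evaluate(right, row)
    raise ValueError(f"unknown node kind: {expr.kind}")


def to_sql(expr):
    """Render the same recorded expression as a SQL-like string."""
    if not isinstance(expr, Sym):
        return repr(expr)
    if expr.kind == "attr":
        return expr.parts[1]
    if expr.kind == "eq":
        left, right = expr.parts
        return f"{to_sql(left)} = {to_sql(right)}"
    raise ValueError(f"cannot render node kind: {expr.kind}")


expr = _.cyl == 4
print(evaluate(expr, {"cyl": 4, "hp": 93}))
print(to_sql(expr))
```

The same expr object feeds both functions, which mirrors the point above: the expression only describes the action, and the consumer chooses the execution strategy.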

You can also think of siu expressions as a shorthand for a lambda function.

from siuba import _
from siuba.data import mtcars

# lambda approach
mtcars[lambda _: _.cyl == 4]

# siu expression approach
mtcars[_.cyl == 4]
Out[3]: 
     mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
2   22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
7   24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2
..   ...  ...    ...  ...   ...    ...    ...  ..  ..   ...   ...
27  30.4    4   95.1  113  3.77  1.513  16.90   1   1     5     2
31  21.4    4  121.0  109  4.11  2.780  18.60   1   1     4     2

[11 rows x 11 columns]

See siu expression section here.

Using with a SQL database

A killer feature of siuba is that the same analysis code can be run on a local DataFrame or a SQL source.

In the code below, we set up an example database.

# Setup example data ----
from sqlalchemy import create_engine
from siuba.data import mtcars

# copy pandas DataFrame to sqlite
engine = create_engine("sqlite:///:memory:")
mtcars.to_sql("mtcars", engine, if_exists = "replace")

Next, we use the code from the first example, except now it is executed on a SQL table.

# Demo SQL analysis with siuba ----
from siuba import _, group_by, summarize
from siuba.sql import LazyTbl

# connect with siuba
tbl_mtcars = LazyTbl(engine, "mtcars")

(tbl_mtcars
  >> group_by(_.cyl)
  >> summarize(avg_hp = _.hp.mean())
  )
Out[4]: 
# Source: lazy query
# DB Conn: Engine(sqlite:///:memory:)
# Preview:
   cyl      avg_hp
0    4   82.636364
1    6  122.285714
2    8  209.214286
# .. may have more rows

See querying SQL introduction here.
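For intuition about what runs on the database side, the sketch below hand-writes a query with the same meaning as the group_by and summarize pipeline, using only Python's built-in sqlite3 module. The exact SQL that siuba generates is not shown in this README, so the query text here is an assumption about its general shape, and the handful of rows is illustrative rather than the full mtcars dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mtcars (cyl INTEGER, hp INTEGER)")

# A few rows for illustration only, not the full mtcars dataset.
conn.executemany(
    "INSERT INTO mtcars VALUES (?, ?)",
    [(4, 93), (4, 62), (6, 110), (8, 175), (8, 245)],
)

# Hand-written equivalent of: group_by(_.cyl) >> summarize(avg_hp = _.hp.mean())
query = """
    SELECT cyl, AVG(hp) AS avg_hp
    FROM mtcars
    GROUP BY cyl
    ORDER BY cyl
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

The grouping and averaging happen inside the database, so only the small summarized result travels back to Python.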

Example notebooks

Below are some examples I've kept as I've worked on siuba. For the most up-to-date explanations, see the siuba docs.

Testing

Tests use pytest. To run them, start the postgres database and then invoke pytest:

# start postgres db
docker-compose up
pytest siuba