dmitrykoval / vinum

Licence: BSD-3-Clause license
Vinum is a SQL processor for Python, designed for data analysis workflows and in-memory analytics.

Vinum

Vinum is a SQL query processor for Python, designed for data analysis workflows and in-memory analytics.

When should I use Vinum?

Vinum runs inside the host Python process and can execute any function available to the interpreter as a UDF. If you are doing data analysis or running ETL in Python, Vinum lets you execute efficient SQL queries that can call native Python UDFs.

Key Features:

  • Vinum runs inside the host Python process and uses a hybrid query execution model: whenever possible it uses natively compiled operators and executes interpreted Python code only where strictly necessary (i.e., for native Python UDFs).
  • Functions available within the host Python interpreter, including native Python and NumPy functions, can be used as UDFs.
  • Vinum's execution model doesn't require input datasets to fit into memory, because it operates on a stream of record batches. However, the final result is fully materialized in memory.
  • Written in a mix of C++ and Python and built from the ground up on top of Apache Arrow, which provides the foundation for moving data and enables minimal overhead when transferring data to and from NumPy and Pandas.
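
To illustrate why streaming record batches removes the need to hold the input in memory, here is a minimal standard-library sketch of a group-by count over a row stream. It is not Vinum's implementation: the helper names `iter_batches` and `count_by` and the inline CSV text are hypothetical, chosen only for illustration.

```python
import csv
import io
from collections import Counter
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most `batch_size` rows from any row iterator."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch


def count_by(rows, key, batch_size=2):
    """Group-by count over a row stream; only `counts` outlives each batch."""
    counts = Counter()
    for batch in iter_batches(rows, batch_size):
        counts.update(row[key] for row in batch)
    # The final result is fully materialized, as in the feature list above.
    return dict(counts)


# Hypothetical in-memory stand-in for a large CSV file.
csv_text = "mode,value\nair,300.1\nbus,2.8\nair,880\n"
result = count_by(csv.DictReader(io.StringIO(csv_text)), key="mode")
print(result)
```

At any moment only one batch and the running aggregate state are held in memory, so the input size is bounded by disk rather than RAM, while the (usually much smaller) result is returned in full.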

Architecture

https://github.com/dmitrykoval/vinum/raw/main/doc/source/_static/architecture.png

Vinum uses the PostgreSQL parser provided by the pglast project.

The query planner and executor are implemented in Python, while all physical operators are either implemented in C++ or use compiled vectorized kernels from Arrow or NumPy. The only exception is native Python UDFs, which run in interpreted Python.

The query execution model is based on the vectorized model described in the influential paper by P. A. Boncz, M. Zukowski, and N. Nes, "MonetDB/X100: Hyper-Pipelining Query Execution", CIDR, 2005.
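
As a rough illustration of the vectorized (batch-at-a-time) model, the sketch below chains three operators that each consume and produce column batches instead of single rows. It uses plain Python lists in place of Arrow arrays, and the operator names `scan`, `filter_eq`, and `project_log` are hypothetical; they are not Vinum's actual physical operators.

```python
import math


def scan(columns, batch_size):
    """Source operator: yield column batches (dict of name -> list)."""
    n = len(next(iter(columns.values())))
    for start in range(0, n, batch_size):
        yield {name: col[start:start + batch_size]
               for name, col in columns.items()}


def filter_eq(batches, column, value):
    """Filter operator: build one selection mask per batch, then apply it."""
    for batch in batches:
        mask = [v == value for v in batch[column]]
        yield {name: [v for v, keep in zip(col, mask) if keep]
               for name, col in batch.items()}


def project_log(batches, column):
    """Projection operator: compute log(column) for a whole batch at once."""
    for batch in batches:
        yield {column: batch[column],
               "log": [math.log(v) for v in batch[column]]}


data = {"value": [300.1, 2.8, 880.0], "mode": ["air", "bus", "air"]}
# Rough equivalent of: SELECT value, log(value) FROM t WHERE mode = 'air'
plan = project_log(filter_eq(scan(data, batch_size=2), "mode", "air"), "value")
result = list(plan)
```

Because each operator is invoked once per batch rather than once per row, the interpretation overhead is amortized over many values; in a real engine the inner loops would be compiled kernels rather than list comprehensions.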

Example of a query plan:

https://github.com/dmitrykoval/vinum/raw/main/doc/source/_static/query.png

Install

pip install vinum

Examples

Query Python dict

Create a Table from a Python dict and return the result of the query as a Pandas DataFrame.

>>> import vinum as vn
>>> data = {'value': [300.1, 2.8, 880], 'mode': ['air', 'bus', 'air']}
>>> tbl = vn.Table.from_pydict(data)
>>> tbl.sql_pd("SELECT value, np.log(value) FROM t WHERE mode='air'")
   value    np.log
0  300.1  5.704116
1  880.0  6.779922

Query pandas dataframe

>>> import pandas as pd
>>> import vinum as vn
>>> data = {'col1': [1, 2, 3], 'col2': [7, 13, 17]}
>>> pdf = pd.DataFrame(data=data)
>>> tbl = vn.Table.from_pandas(pdf)
>>> tbl.sql_pd('SELECT * FROM t WHERE col2 > 10 ORDER BY col1 DESC')
   col1  col2
0     3    17
1     2    13

Run query on a csv stream

For larger datasets, or datasets that won't fit into memory, stream_csv() is the recommended way to execute a query. Compressed files are also supported and can be streamed without prior extraction.

>>> import vinum as vn
>>> query = 'select passenger_count pc, count(*) from t group by pc'
>>> vn.stream_csv('taxi.csv.bz2').sql(query).to_pandas()
   pc  count
0   0    165
1   5   3453
...

Read and query csv

>>> import vinum as vn
>>> tbl = vn.read_csv('taxi.csv')
>>> res_tbl = tbl.sql('SELECT key, fare_amount, passenger_count FROM t '
...                   'WHERE fare_amount > 5 LIMIT 3')
>>> res_tbl.to_pandas()
                            key  fare_amount  passenger_count
0   2010-01-05 16:52:16.0000002         16.9                1
1  2011-08-18 00:35:00.00000049          5.7                2
2   2012-04-21 04:30:42.0000001          7.7                1

Compute Euclidean distance with numpy functions

Any NumPy function can be used via the 'np.*' namespace.

>>> import vinum as vn
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('SELECT *, np.sqrt(np.square(x) + np.square(y)) dist '
...            'FROM t ORDER BY dist DESC')
   x   y       dist
0  3  17  17.262677
1  2  13  13.152946
2  1   7   7.071068

Compute Euclidean distance with vectorized UDF

Register a UDF that performs vectorized operations on NumPy arrays.

>>> import numpy as np
>>> import vinum as vn
>>> vn.register_numpy('distance',
...                   lambda x, y: np.sqrt(np.square(x) + np.square(y)))
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('SELECT *, distance(x, y) AS dist '
...            'FROM t ORDER BY dist DESC')
   x   y       dist
0  3  17  17.262677
1  2  13  13.152946
2  1   7   7.071068

Compute Euclidean distance with python UDF

Register a Python lambda function as a UDF.

>>> import math
>>> import vinum as vn
>>> vn.register_python('distance', lambda x, y: math.sqrt(x**2 + y**2))
>>> tbl = vn.Table.from_pydict({'x': [1, 2, 3], 'y': [7, 13, 17]})
>>> tbl.sql_pd('SELECT x, y, distance(x, y) AS dist FROM t')
   x   y       dist
0  1   7   7.071068
1  2  13  13.152946
2  3  17  17.262677
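
The practical difference between the two UDF styles above is call granularity. The sketch below assumes that a plain Python UDF is invoked once per row while a vectorized UDF is invoked once per batch of values; that is an inference from the descriptions above, not a statement about Vinum's internals. The functions `distance_row` and `distance_batch` are hypothetical, and plain lists stand in for NumPy arrays.

```python
import math

# Count how often each style of function is invoked.
calls = {"row": 0, "batch": 0}


def distance_row(x, y):
    """Scalar UDF: receives one value per argument."""
    calls["row"] += 1
    return math.sqrt(x**2 + y**2)


def distance_batch(xs, ys):
    """Batch UDF: receives whole columns and returns a whole column."""
    calls["batch"] += 1
    return [math.sqrt(x**2 + y**2) for x, y in zip(xs, ys)]


xs, ys = [1, 2, 3], [7, 13, 17]
per_row = [distance_row(x, y) for x, y in zip(xs, ys)]  # one call per row
per_batch = distance_batch(xs, ys)                      # one call in total
print(calls)
```

Both approaches produce the same distances, but the per-row style pays the Python call overhead for every row, which is why a vectorized UDF is usually the better choice for large inputs.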

Group by z-score

>>> import numpy as np
>>> import vinum as vn
>>> def z_score(x: np.ndarray):
...     "Compute Standard Score"
...     mean = np.mean(x)
...     std = np.std(x)
...     return (x - mean) / std
...
>>> vn.register_numpy('score', z_score)
>>> tbl = vn.read_csv('taxi.csv')
>>> tbl.sql_pd('select to_int(score(fare_amount)) AS bucket, avg(fare_amount), count(*) '
...            'FROM t GROUP BY bucket ORDER BY bucket limit 3')
   bucket        avg  count_star
0      -1  -1.839000          10
1       0   8.817733       45158
2       1  25.155522        2376

Documentation

What Vinum is not

Vinum is not a Database Management System, and there are no plans to support insert/update/delete statements or transactions. If you need a DBMS designed for data analytics and OLAP, or don't need Python UDFs, consider the excellent DuckDB: it is built on a solid scientific foundation and is very fast.

Dependencies

Inspiration

Future plans

  • Support joins and nested queries.
  • Consider Gandiva for expression evaluation.
  • Parallel execution.