Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

Created with love in Canada, visit hostnodejs.com today

Feel like to post an Ad? Learn Details

vertica / Verticapy

Licence: apache-2.0

VerticaPy is a Python library that exposes sci-kit like functionality to conduct data science projects on data stored in Vertica, thus taking advantage Vertica’s speed and built-in analytics and machine learning capabilities.

Programming Languages

python

139335 projects - #7 most used programming language

Labels

machine-learning data-science data-visualization big-data

Projects that are alternatives of or similar to Verticapy

Pretzel

Javascript full-stack framework for Big Data visualisation and analysis

Stars: ✭ 26 (-55.93%)

Mutual labels: data-science, big-data, data-visualization

Pythondata

repo for code published on pythondata.com

Stars: ✭ 113 (+91.53%)

Mutual labels: data-science, big-data, data-visualization

My Journey In The Data Science World

📢 Ready to learn or review your knowledge!

Stars: ✭ 1,175 (+1891.53%)

Mutual labels: data-science, big-data, data-visualization

Just Dashboard

📊 📋 Dashboards using YAML or JSON files

Stars: ✭ 1,511 (+2461.02%)

Mutual labels: data-science, big-data, data-visualization

Vizuka

Explore high-dimensional datasets and how your algo handles specific regions.

Stars: ✭ 100 (+69.49%)

Mutual labels: data-science, big-data, data-visualization

Courses

Quiz & Assignment of Coursera

Stars: ✭ 454 (+669.49%)

Mutual labels: data-science, big-data, data-visualization

Biolitmap

Code for the paper "BIOLITMAP: a web-based geolocated and temporal visualization of the evolution of bioinformatics publications" in Oxford Bioinformatics.

Stars: ✭ 18 (-69.49%)

Mutual labels: data-science, data-visualization

Model Describer

model-describer : Making machine learning interpretable to humans

Stars: ✭ 22 (-62.71%)

Mutual labels: data-science, data-visualization

Socrat

A Dynamic Web Toolbox for Interactive Data Processing, Analysis, and Visualization

Stars: ✭ 26 (-55.93%)

Mutual labels: data-science, data-visualization

Data Science On Gcp

Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017

Stars: ✭ 864 (+1364.41%)

Mutual labels: data-science, data-visualization

Sciblog support

Support content for my blog

Stars: ✭ 694 (+1076.27%)

Mutual labels: data-science, big-data

Autodl

Automated Deep Learning without ANY human intervention. 1'st Solution for AutoDL [email protected]

Stars: ✭ 854 (+1347.46%)

Mutual labels: data-science, big-data

Dataconfs

A list of conferences connected with data worldwide.

Stars: ✭ 36 (-38.98%)

Mutual labels: data-science, data-visualization

Football Data

football (soccer) datasets

Stars: ✭ 18 (-69.49%)

Mutual labels: data-science, data-visualization

Awesome Python Data Science

Probably the best curated list of data science software in Python.

Stars: ✭ 812 (+1276.27%)

Mutual labels: data-science, data-visualization

Cookbook 2nd

IPython Cookbook, Second Edition, by Cyrille Rossant, Packt Publishing 2018

Stars: ✭ 704 (+1093.22%)

Mutual labels: data-science, data-visualization

Dataflowjavasdk

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.

Stars: ✭ 854 (+1347.46%)

Mutual labels: data-science, big-data

Soccergraphr

Soccer Analytics in R using OPTA data

Stars: ✭ 42 (-28.81%)

Mutual labels: data-science, data-visualization

Attaca

Robust, distributed version control for large files.

Stars: ✭ 41 (-30.51%)

Mutual labels: data-science, big-data

Data Science Lunch And Learn

Resources for weekly Data Science Lunch & Learns

Stars: ✭ 49 (-16.95%)

Mutual labels: data-science, data-visualization

View All Similar Projects ➔

📢 2020-06-27: VerticaPy is the new name for Vertica-ML-Python.

VerticaPy

VerticaPy is a Python library with scikit-like functionality used to conduct data science projects on data stored in Vertica, taking advantage Vertica’s speed and built-in analytics and machine learning features. It supports the entire data science life cycle, uses a ‘pipeline’ mechanism to sequentialize data transformation operations, and offers beautiful graphical options.

Nowadays, 'Big Data' is one of the main topics in the data science world, and data scientists are often at the center of any organization. The benefits of becoming more data-driven are undeniable and are often needed to survive in the industry.

Vertica was the first real analytic columnar database and is still the fastest in the market. However, SQL alone isn't flexible enough to meet the needs of data scientists.

Python has quickly become the most popular tool in this domain, owing much of its flexibility to its high-level of abstraction and impressively large and ever-growing set of libraries. Its accessibility has led to the development of popular and perfomant APIs, like pandas and scikit-learn, and a dedicated community of data scientists. Unfortunately, Python only works in-memory as a single-node process. This problem has led to the rise of distributed programming languages, but they too, are limited as in-memory processes and, as such, will never be able to process all of your data in this era, and moving data for processing is prohobitively expensive. On top of all of this, data scientists must also find convenient ways to deploy their data and models. The whole process is time consuming.

VerticaPy aims to solve all of these problems. The idea is simple: instead of moving data around for processing, VerticaPy brings the logic to the data.

3 years in the making, we're proud to bring you VerticaPy.

Main Advantages:

Easy Data Exploration.
Fast Data Preparation.
In-Database Machine Learning.
Easy Model Evaluation.
Easy Model Deployment.

Installation

To install VerticaPy with pip:

# Latest release version
[email protected]:~$ pip3 install verticapy

# Latest commit on master branch
[email protected]:~$ pip3 install git+https://github.com/vertica/[email protected]

To install VerticaPy from source, run the following command from the root directory:

[email protected]:~$ python3 setup.py install

Documentation

A well-detailed HTML documentation is available at:

https://www.vertica.com/python/documentation_last/

Connecting to the Database

VerticaPy is compatible with several clients.

Native Client (Recommended)

import vertica_python

# Connection using all the DSN information
conn_info = {'host': "10.211.55.14", 
             'port': 5433, 
             'user': "dbadmin", 
             'password': "XxX", 
             'database': "testdb"}
cur = vertica_python.connect(** conn_info).cursor()

# Connection using directly the DSN
from verticapy.utilities import to_vertica_python_format # This function will parse the odbc.ini file
dsn = "VerticaDSN"
cur = vertica_python.connect(** to_vertica_python_format(dsn)).cursor()

ODBC

import pyodbc

# Connection using all the DSN information
driver = "/Library/Vertica/ODBC/lib/libverticaodbc.dylib"
server = "10.211.55.14"
database = "testdb"
port = "5433"
uid = "dbadmin"
pwd = "XxX"
dsn = ("DRIVER={}; SERVER={}; DATABASE={}; PORT={}; UID={}; PWD={};").format(driver, server, database, port, uid, pwd)
cur = pyodbc.connect(dsn).cursor()

# Connection using directly the DSN
dsn = ("DSN=VerticaDSN")
cur = pyodbc.connect(dsn).cursor()

JDBC

import jaydebeapi

# Vertica Server Details
database = "testdb"
hostname = "10.211.55.14"
port = "5433"
uid = "dbadmin"
pwd = "XxX"

# Vertica JDBC class name
jdbc_driver_name = "com.vertica.jdbc.Driver"

# Vertica JDBC driver path
jdbc_driver_loc = "/Library/Vertica/JDBC/vertica-jdbc-9.3.1-0.jar"

# JDBC connection string
connection_string = 'jdbc:vertica://' + hostname + ':' + port + '/' + database
url = '{}:user={};password={}'.format(connection_string, uid, pwd)
conn = jaydebeapi.connect(jdbc_driver_name, connection_string, {'user': uid, 'password': pwd}, jars = jdbc_driver_loc)
cur = conn.cursor()

Quick Start

Install the library using the pip command.

[email protected]:~$ pip3 install verticapy

Create a vertica cursor.

from verticapy import vertica_conn
cur = vertica_conn("VerticaDSN").cursor()

Create the Virtual DataFrame of your relation.

from verticapy import vDataFrame
vdf = vDataFrame("my_relation", cursor = cur)

If you don't have data on hand, you can easily import well-known datasets.

from verticapy.learn.datasets import load_titanic
vdf = load_titanic(cursor = cur)

Examine your data:

vdf.describe()

# Output
               min       25%        50%        75%   
age           0.33      21.0       28.0       39.0   
body           1.0     79.25      160.5      257.5   
fare           0.0    7.8958    14.4542    31.3875   
parch          0.0       0.0        0.0        0.0   
pclass         1.0       1.0        3.0        3.0   
sibsp          0.0       0.0        0.0        1.0   
survived       0.0       0.0        0.0        1.0   
                   max    unique  
age               80.0        96  
body             328.0       118  
fare          512.3292       277  
parch              9.0         8  
pclass             3.0         3  
sibsp              8.0         7  
survived           1.0         2

Print the SQL query with the set_option function:

set_option("sql_on", True)
vdf.describe()

# Output
## Compute the descriptive statistics of all the numerical columns ##

SELECT 
  SUMMARIZE_NUMCOL("age","body","survived","pclass","parch","fare","sibsp") OVER ()
FROM public.titanic

With VerticaPy, it is now possible to solve a ML problem with few lines of code.

from verticapy.learn.model_selection import cross_validate
from verticapy.learn.ensemble import RandomForestClassifier

# Data Preparation
vdf["sex"].label_encode()["boat"].fillna(method = "0ifnull")["name"].str_extract(' ([A-Za-z]+)\.').eval("family_size", expr = "parch + sibsp + 1").drop(columns = ["cabin", "body", "ticket", "home.dest"])["fare"].fill_outliers().fillna()

# Model Evaluation
cross_validate(RandomForestClassifier("rf_titanic", cur, max_leaf_nodes = 100, n_estimators = 30), 
               vdf, 
               ["age", "family_size", "sex", "pclass", "fare", "boat"], 
               "survived", 
               cutoff = 0.35)

# Output
                           auc               prc_auc   
1-fold      0.9877114427860691    0.9530465915039339   
2-fold      0.9965555014605642    0.7676485351425721   
3-fold      0.9927239216549301    0.6419135521132449   
avg             0.992330288634        0.787536226253   
std           0.00362128464093         0.12779562393   
                     accuracy              log_loss   
1-fold      0.971291866028708    0.0502052541223871   
2-fold      0.983253588516746    0.0298167751798457   
3-fold      0.964824120603015    0.0392745694400433   
avg            0.973123191716       0.0397655329141   
std           0.0076344236729      0.00833079837099   
                     precision                recall   
1-fold                    0.96                  0.96   
2-fold      0.9556962025316456                   1.0   
3-fold      0.9647887323943662    0.9383561643835616   
avg             0.960161644975        0.966118721461   
std           0.00371376912311        0.025535200301   
                      f1-score                   mcc   
1-fold      0.9687259282082884    0.9376119402985075   
2-fold      0.9867172675521821    0.9646971010878469   
3-fold      0.9588020287309097    0.9240569687684576   
avg              0.97141507483        0.942122003385   
std            0.0115538960753       0.0168949813163   
                  informedness            markedness   
1-fold      0.9376119402985075    0.9376119402985075   
2-fold      0.9737827715355807    0.9556962025316456   
3-fold      0.9185148945422918    0.9296324823943662   
avg             0.943303202125        0.940980208408   
std            0.0229190954261       0.0109037699717   
                           csi  
1-fold      0.9230769230769231  
2-fold      0.9556962025316456  
3-fold      0.9072847682119205  
avg             0.928685964607  
std            0.0201579224026

Enjoy!

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].

Stars: ✭ 59

Visit Git Page 🔗Visit User Page 🔗Visit Issues Page (3) 🔗