treasure-data / pytd

License: Apache-2.0
Treasure Data Driver for Python

Programming Languages

Jupyter Notebook
Python

Projects that are alternatives of or similar to pytd

pandas twitter
Analyzing Trump's tweets using Python (Pandas + Twitter workshop)
Stars: ✭ 81 (+440%)
Mutual labels:  pandas
toucan-connectors
Connectors available to retrieve data in Toucan Toco small apps
Stars: ✭ 13 (-13.33%)
Mutual labels:  pandas
wax-ml
A Python library for machine-learning and feedback loops on streaming data
Stars: ✭ 36 (+140%)
Mutual labels:  pandas
hamilton
A scalable general purpose micro-framework for defining dataflows. You can use it to create dataframes, numpy matrices, python objects, ML models, etc.
Stars: ✭ 612 (+3980%)
Mutual labels:  pandas
fal
do more with dbt. fal helps you run Python alongside dbt, so you can send Slack alerts, detect anomalies and build machine learning models.
Stars: ✭ 567 (+3680%)
Mutual labels:  pandas
saddle
SADDLE: Scala Data Library
Stars: ✭ 23 (+53.33%)
Mutual labels:  pandas
pyjanitor
Clean APIs for data cleaning. A Python implementation of the R package janitor.
Stars: ✭ 970 (+6366.67%)
Mutual labels:  pandas
tutorials
Short programming tutorials pertaining to data analysis.
Stars: ✭ 14 (-6.67%)
Mutual labels:  pandas
cracking-the-pandas-cheat-sheet
Inflearn course - Cracking data analysis and visualization with just two documents
Stars: ✭ 62 (+313.33%)
Mutual labels:  pandas
cognipy
In-memory Graph Database and Knowledge Graph with Natural Language Interface, compatible with Pandas
Stars: ✭ 31 (+106.67%)
Mutual labels:  pandas
pandas-workshop
An introductory workshop on pandas with notebooks and exercises for following along.
Stars: ✭ 161 (+973.33%)
Mutual labels:  pandas
pybacen
A library developed for economic analysis in the Brazilian context (investments, micro- and macroeconomic indicators)
Stars: ✭ 40 (+166.67%)
Mutual labels:  pandas
xpandas
Universal 1d/2d data containers with Transformers functionality for data analysis.
Stars: ✭ 25 (+66.67%)
Mutual labels:  pandas
chatstats
💬📊 Fun data visualizations for Facebook Messenger chats
Stars: ✭ 18 (+20%)
Mutual labels:  pandas
DataProfiler
What's in your data? Extract schema, statistics and entities from datasets
Stars: ✭ 843 (+5520%)
Mutual labels:  pandas
Data-Science-101
Notes and tutorials on how to use python, pandas, seaborn, numpy, matplotlib, scipy for data science.
Stars: ✭ 19 (+26.67%)
Mutual labels:  pandas
weaverbird
A visual data pipeline builder with various backends
Stars: ✭ 65 (+333.33%)
Mutual labels:  pandas
Python-Data-Visualization
D-Lab's 3 hour introduction to data visualization with Python. Learn how to create histograms, bar plots, box plots, scatter plots, compound figures, and more, using matplotlib and seaborn.
Stars: ✭ 42 (+180%)
Mutual labels:  pandas
datasets
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
Stars: ✭ 13,870 (+92366.67%)
Mutual labels:  pandas
onelinerhub
2.5k code solutions with clear explanation @ onelinerhub.com
Stars: ✭ 645 (+4200%)
Mutual labels:  pandas

pytd


pytd provides user-friendly interfaces to Treasure Data’s REST APIs, Presto query engine, and Plazma primary storage.

This seamless connection allows your Python code to efficiently read and write large volumes of data from and to Treasure Data, making your day-to-day data analytics work more productive.

Installation

pip install pytd

Usage

Set your API key and endpoint in the environment variables TD_API_KEY and TD_API_SERVER, respectively, and create a client instance:

import pytd

client = pytd.Client(database='sample_datasets')
# or, hard-code your API key, endpoint, and/or query engine:
# >>> pytd.Client(apikey='1/XXX', endpoint='https://api.treasuredata.com/', database='sample_datasets', default_engine='presto')
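
If you prefer not to hard-code credentials, the same values can be supplied through the environment before the client is created. A minimal sketch with placeholder values (not real credentials):

import os
import pytd

# pytd.Client picks up TD_API_KEY and TD_API_SERVER when they are not passed explicitly
os.environ['TD_API_KEY'] = '1/XXX'
os.environ['TD_API_SERVER'] = 'https://api.treasuredata.com/'

client = pytd.Client(database='sample_datasets')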

Query in Treasure Data

Issue a Presto query and retrieve the result:

client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
# {'columns': ['symbol', 'cnt'], 'data': [['AAIT', 590], ['AAL', 82], ['AAME', 9252], ..., ['ZUMZ', 2364]]}
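
The return value is a plain dict with 'columns' and 'data' keys, so it can be converted into a pandas.DataFrame directly. A small sketch building on the query above:

import pandas as pd

res = client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1')
# Rebuild a DataFrame from the 'columns'/'data' dict returned by pytd
df = pd.DataFrame(res['data'], columns=res['columns'])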

In the case of Hive:

client.query('select hivemall_version()', engine='hive')
# {'columns': ['_c0'], 'data': [['0.6.0-SNAPSHOT-201901-r01']]} (as of Feb, 2019)

It is also possible to explicitly initialize pytd.Client for Hive:

client_hive = pytd.Client(database='sample_datasets', default_engine='hive')
client_hive.query('select hivemall_version()')

Write data to Treasure Data

Data represented as pandas.DataFrame can be written to Treasure Data as follows:

import pandas as pd

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 10]})
client.load_table_from_dataframe(df, 'takuti.foo', writer='bulk_import', if_exists='overwrite')

For the writer option, pytd supports three different ways to ingest data into Treasure Data:

  1. Bulk Import API: bulk_import (default)
    • Converts data into a CSV file and uploads it in batch fashion.
  2. Presto INSERT INTO query: insert_into
    • Inserts each row of the DataFrame by issuing an INSERT INTO query through the Presto query engine (see the sketch after the table below).
    • Recommended only for small volumes of data.
  3. td-spark: spark
    • A locally customized Spark instance writes the DataFrame directly to Treasure Data’s primary storage system.

Characteristics of each of these methods can be summarized as follows:

                                     bulk_import  insert_into  spark
  Scalable against data volume            ✓                      ✓
  Write performance for larger data                              ✓
  Memory efficient                        ✓                      ✓
  Disk efficient                                       ✓         ✓
  Minimal package dependency              ✓            ✓
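
For example, appending a handful of rows is a case where the Presto-based writer can be simpler than a bulk import. A minimal sketch reusing the client and df from above (the table name is the same illustrative one as before):

# insert_into issues one INSERT INTO statement per row via Presto,
# so keep it to small DataFrames
client.load_table_from_dataframe(df, 'takuti.foo', writer='insert_into', if_exists='append')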

Enabling Spark Writer

Since td-spark gives special access to the main storage system via PySpark, the following steps are required to use it:

  1. Contact [email protected] to activate the permission for your Treasure Data account. Note that the underlying component, Plazma Public API, limits its free tier to 100GB of reads and 100TB of writes.
  2. Install pytd with the [spark] option: pip install pytd[spark]

If you want to use an existing td-spark JAR file, create a SparkWriter with the td_spark_path option:

from pytd.writer import SparkWriter

writer = SparkWriter(td_spark_path='/path/to/td-spark-assembly.jar')
client.load_table_from_dataframe(df, 'mydb.bar', writer=writer, if_exists='overwrite')

Comparison between pytd, td-client-python, and pandas-td

Treasure Data offers three different Python clients on GitHub, and the following list summarizes their characteristics.

  1. td-client-python
    • Basic REST API wrapper; its capability is limited to what the Treasure Data REST APIs offer.
  2. pytd
    • Access to Plazma via td-spark as introduced above.
    • Efficient connection to Presto based on presto-python-client.
    • Multiple data ingestion methods and a variety of utility functions.
  3. pandas-td (deprecated)
    • Old tool optimized for pandas and Jupyter Notebook.
    • pytd offers a compatible function set (see below for details).

An optimal choice of package depends on your specific use case, but common guidelines can be listed as follows:

  • Use td-client-python if you want to execute basic CRUD operations from Python applications.
  • Use pytd for (1) analytical work relying on pandas and Jupyter Notebook, and (2) more efficient data access with less effort.
  • Do not use pandas-td; if you are still using it, migrate to pytd by following the guidance below as soon as possible.

How to replace pandas-td

pytd offers pandas-td-compatible functions that provide the same functionality more efficiently. If you are still using pandas-td, we recommend switching to pytd as follows.

First, install the package from PyPI:

pip install pytd
# or, `pip install pytd[spark]` if you wish to use `to_td`

Next, make the following modifications to the import statements.

Before:

import pandas_td as td
In [1]: %load_ext pandas_td.ipython

After:

import pytd.pandas_td as td
In [1]: %load_ext pytd.pandas_td.ipython
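
Once the imports are updated, the familiar pandas-td calls work as before. A minimal sketch using the compatibility functions (the engine URL and query are illustrative):

import pytd.pandas_td as td

# Connect to the Presto engine for a specific database
engine = td.create_engine('presto:sample_datasets')

# Read query results straight into a pandas.DataFrame
df = td.read_td('select symbol, count(1) as cnt from nasdaq group by 1', engine)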

Consequently, all pandas_td code should keep running correctly with pytd. If you notice any incompatible behavior, report an issue on the pytd GitHub repository.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].