All Projects β†’ hoxo-m β†’ dplyr.teradata

hoxo-m / dplyr.teradata

Licence: other
A Teradata Backend for dplyr

Programming Languages

r
7636 projects

Projects that are alternatives of or similar to dplyr.teradata

datawizard
Magic potions to clean and transform your data πŸ§™
Stars: ✭ 149 (+831.25%)
Mutual labels:  dplyr, r-package
r-resources-for-data-science
A biggest collection of free books and other resources for R programming
Stars: ✭ 24 (+50%)
Mutual labels:  r-language
Spark-for-data-engineers
Apache Spark for data engineers
Stars: ✭ 22 (+37.5%)
Mutual labels:  r-language
parsons
Create parsons problems to teach programming in learnr tutorials
Stars: ✭ 20 (+25%)
Mutual labels:  r-package
geoknife
R tools for geo-web processing of gridded data via the Geo Data Portal. geoknife slices up gridded data according to overlap with irregular features, such as watersheds, lakes, points, etc.
Stars: ✭ 64 (+300%)
Mutual labels:  r-package
tbltools
πŸ—œπŸ”’ Tools for Working with Tibbles
Stars: ✭ 34 (+112.5%)
Mutual labels:  r-package
modeltime.gluonts
GluonTS Deep Learning with Modeltime
Stars: ✭ 31 (+93.75%)
Mutual labels:  r-package
sl3
πŸ’ͺ πŸ€” Modern Super Learning with Machine Learning Pipelines
Stars: ✭ 93 (+481.25%)
Mutual labels:  r-package
WeightedTreemaps
Create Voronoi and Sunburst Treemaps from Hierarchical data
Stars: ✭ 33 (+106.25%)
Mutual labels:  r-package
middlechild
β›” ARCHIVED β›” Tools to Intercept, Validate and Consume Web/Network Traffic
Stars: ✭ 15 (-6.25%)
Mutual labels:  r-package
geojson
GeoJSON classes for R
Stars: ✭ 32 (+100%)
Mutual labels:  r-package
LandR
Landscape Ecosystem Modelling in R
Stars: ✭ 14 (-12.5%)
Mutual labels:  r-package
worldfootballR
A wrapper for extracting world football (soccer) data from FBref, Transfermark, Understat and fotmob
Stars: ✭ 188 (+1075%)
Mutual labels:  r-package
dashboard
A R package dashboard generator
Stars: ✭ 42 (+162.5%)
Mutual labels:  r-package
bittrex
A R Client for the Bittrex Crypto-Currency Exchange
Stars: ✭ 26 (+62.5%)
Mutual labels:  r-package
flipper
Make it easy to flip through R packages from CRAN, Bioconductor, and GitHub
Stars: ✭ 13 (-18.75%)
Mutual labels:  r-package
testit
A simple package for testing R packages
Stars: ✭ 45 (+181.25%)
Mutual labels:  r-package
ropenaq
β›” ARCHIVED β›” Accesses Air Quality Data from the Open Data Platform OpenAQ
Stars: ✭ 69 (+331.25%)
Mutual labels:  r-package
polyglot
πŸŽ“Use the R Console as an interactive learning environment
Stars: ✭ 26 (+62.5%)
Mutual labels:  r-package
NetBID
Data-driven Network-based Bayesian Inference of Drivers
Stars: ✭ 21 (+31.25%)
Mutual labels:  r-package

A Teradata Backend for dplyr

Koji Makiyama (@hoxo-m)

Travis-CI Build Status CRAN Version Coverage Status

1. Overview

The package provides a Teradata backend for dplyr.

It makes it possible to operate Teradata Database in the same way as manipulating data frames with dplyr.

library(dplyr.teradata)

# Establish a connection to Teradata
con <- dbConnect(odbc(), 
                 driver = "{Teradata Driver}", DBCName = "host_name_or_IP_address",
                 uid = "user_name", pwd = "*****")
my_table <- tbl(con, "my_table_name")

# Build a query
q <- my_table %>% 
  filter(between(date, "2017-01-01", "2017-01-03")) %>% 
  group_by(date) %>%
  summarise(n = n()) %>%
  arrange(date)

show_query(q)
#> <SQL>
#> SELECT "date", count(*) AS "n"
#> FROM "my_table_name"
#> WHERE ("date" BETWEEN '2017-01-01' AND '2017-01-03')
#> GROUP BY "date"
#> ORDER BY "date"

# Send the query and get its result on R
df <- q %>% collect
df
#> # A tibble: 3 x 2
#>          date        n
#>        <date>    <int>
#>  1 2017-01-01   123456
#>  2 2017-01-02  7891011
#>  3 2017-01-03 12131415

2. Installation

You can install the dplyr.teradata package from CRAN.

install.packages("dplyr.teradata")

You can also install the development version of the package from GitHub.

install.packages("remotes") # if you have not installed "remotes" package
remotes::install_github("hoxo-m/dplyr.teradata")

The source code for dplyr.teradata package is available on GitHub at

3. Motivation

The package provides a Teradata backend for dplyr. It makes it possible to build SQL for Teradata Database in the same way as manipulating data frames with the dplyr package. It also can send the queries and then receive its results on R.

Therefore, you can complete data analysis with Teradata only on R. It means that you are freed from troublesome switching of tools and switching thoughts that cause mistakes.

4. Usage

The package uses the odbc package to connect database and the dbplyr package to build SQL.

First, you need to establish an ODBC connection to Teradata. See:

# Establish a connection to Teradata
con <- dbConnect(odbc(), 
                 driver = "{Teradata Driver}", DBCName = "host_name_or_IP_address",
                 uid = "user_name", pwd = "*****")

Second, you need to specify a table to build SQL. See:

To specify a table, you can use tbl():

# Getting table
my_table <- tbl(con, "my_table_name")

# Getting table in schema
my_table <- tbl(con, in_schema("my_schema", "my_table_name"))

Third, you build queries. It can do in the same way as manipulating data frames with dplyr:

For example, you can use follows:

  • mutate() adds new columns that are functions of existing columns.
  • select() picks columns based on their names.
  • filter() picks rows based on their values.
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.
# Build a query
q <- my_table %>% 
  filter(between(date, "2017-01-01", "2017-01-03")) %>% 
  group_by(date) %>%
  summarise(n = n()) %>%
  arrange(date)

n() is a function in dplyr to return the number of rows in the current group but here it will be translated to count(*) as a SQL function.

If you want to show built queries, use show_query():

show_query(q)
#> <SQL>
#> SELECT "date", count(*) AS "n"
#> FROM "my_table_name"
#> WHERE ("date" BETWEEN '2017-01-01' AND '2017-01-03')
#> GROUP BY "date"
#> ORDER BY "date"

Finally, you send built queries and get its results on R using collect().

# Send the query and get its result on R
df <- q %>% collect
df
#> # A tibble: 3 x 2
#>          date        n
#>        <date>    <int>
#>  1 2017-01-01   123456
#>  2 2017-01-02  7891011
#>  3 2017-01-03 12131415

5. Translatable functions

The package mainly use dbplyr to translate manipulations into queries.

Translatable functions are the available functions in manipulations that it can translate into SQL functions.

For instance, n() is translated to count(*) in the above example.

To know translatable functions for Teradata, refer the following:

Here, we introduce the special translatable functions that it becomes available by dplyr.teradata.

5.1. Treat Boolean

Teradata does not have the boolean data type. So when you use boolean, you need to write some complex statements. The package has several functions to treat it briefly.

bool_to_int transforms boolean to integer.

mutate(is_positive = bool_to_int(x > 0L))
#> <SQL> CASE WHEN (`x` > 0) THEN 1 WHEN NOT(`x` > 0) THEN 0 END

count_if() or n_if() counts a number of rows satisfying a condition.

summarize(n = count_if(x > 0L))
#> <SQL> SUM(CASE WHEN (`x` > 0) THEN 1 WHEN NOT(`x` > 0) THEN 0 END)

5.2. to_timestamp()

When your tables has some columns stored UNIX time and you want to convert it to timestamp, you need to write complex SQL.

to_timestamp() is a translatable function that makes it easy.

mutate(ts = to_timestamp(unixtime_column))

Such as above manipulation is translated into SQL like following:

#> <SQL> CAST(DATE '1970-01-01' + (`unixtime_column` / 86400) AS TIMESTAMP(0)) + (`unixtime_column` MOD 86400) * (INTERVAL '00:00:01' HOUR TO SECOND)

5.3. cut()

cut() is very useful function that you can use in base R.

For example, you want to cut values of x into three parts of ranges by break points 2 and 4:

x <- 1:6
breaks <- c(0, 2, 4, 6)
cut(x, breaks)
#> [1] (0,2] (0,2] (2,4] (2,4] (4,6] (4,6]
#> Levels: (0,2] (2,4] (4,6]

dplyr.teradata has a translatable function similar to this:

breaks = c(0, 2, 4, 6)
mutate(y = cut(x, breaks))

In the result, it is translated to a CASE WHEN statement as follows:

#> <SQL> CASE
#>  WHEN x > 0 AND x <= 2 THEN '(0,2]'
#>  WHEN x > 2 AND x <= 4 THEN '(2,4]'
#>  WHEN x > 4 AND x <= 6 THEN '(4,6]'
#>  ELSE NULL
#> END

Arguments of base cut() are also available:

breaks = c(0, 2, 4, 6)
mutate(y = cut(x, breaks, labels = "-", include.lowest = TRUE))
#> <SQL> CASE
#>  WHEN x >= 0 AND x <= 2 THEN '0-2'
#>  WHEN x > 2 AND x <= 4 THEN '3-4'
#>  WHEN x > 4 AND x <= 6 THEN '5-6'
#>  ELSE NULL
#> END

6. Miscellaneous

6.1. Sampling Data

Teradata supports sampling rows from tables:

and dplyr has the same purpose verb slice_sample(). The package makes them work well.

For example, by the number of rows:

q <- my_table %>% slice_sample(n = 100L)

show_query(q)
#> <SQL>
#> SELECT *
#> FROM "my_table_name"
#> SAMPLE RANDOMIZED ALLOCATION 100

or by the proportion of rows:

q <- my_table %>% slice_sample(prop = 0.1)

show_query(q)
#> <SQL>
#> SELECT *
#> FROM "my_table_name"
#> SAMPLE RANDOMIZED ALLOCATION 0.1

It also supports sampling with replacement:

q <- my_table %>% slice_sample(n = 100L, replace = TRUE)

show_query(q)
#> <SQL>
#> SELECT *
#> FROM "my_table_name"
#> SAMPLE WITH REPLACEMENT RANDOMIZED ALLOCATION 100

and supports a random sample stratified by AMPs (it is much faster, especially for very large samples):

q <- my_table %>% slice_sample(n = 100L, randomized_allocation = FALSE)

show_query(q)
#> <SQL>
#> SELECT *
#> FROM "my_table_name"
#> SAMPLE 100

The package currently supports the verbs old versions.

# By the number of rows
q <- my_table %>% sample_n(100L)
# By the proportion of rows
q <- my_table %>% sample_frac(0.1)

6.2. blob_to_string()

The blob objects from databases sometimes prevents manipulations with dplyr.

You might want to convert them to string.

blob_to_string() is a function to make it easy:

x <- blob::as_blob("Good morning")
x
#> <blob[1]>
#> [1] blob[12 B]

# print raw data in blob
x[[1]]
#>  [1] 47 6f 6f 64 20 6d 6f 72 6e 69 6e 67

blob_to_string(x)
#> [1] "476f6f64206d6f726e696e67"

7. Related work

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].