
exasol / hadoop-etl-udfs

License: MIT
The Hadoop ETL UDFs are the main way to load data from Hadoop into EXASOL.

Programming Languages

java
68154 projects - #9 most used programming language
python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to hadoop-etl-udfs

DaFlow
Apache Spark-based data flow (ETL) framework that supports multiple types of read and write destinations and multiple categories of transformation rules.
Stars: ✭ 24 (+41.18%)
Mutual labels:  hive, hadoop, parquet
Drill
Apache Drill is a distributed MPP query layer for self-describing data
Stars: ✭ 1,619 (+9423.53%)
Mutual labels:  hive, hadoop, parquet
Eel Sdk
Big Data Toolkit for the JVM
Stars: ✭ 140 (+723.53%)
Mutual labels:  hive, hadoop, parquet
Hive Jdbc Uber Jar
Hive JDBC "uber" or "standalone" jar based on the latest Apache Hive version
Stars: ✭ 188 (+1005.88%)
Mutual labels:  hive, hadoop
r-exasol
The EXASOL package for R provides an interface to the EXASOL database.
Stars: ✭ 22 (+29.41%)
Mutual labels:  exasol, exasol-integration
Linkis
Linkis helps easily connect to various back-end computation/storage engines (Spark, Python, TiDB, ...), exposes various interfaces (REST, JDBC, Java, ...), and provides multi-tenancy, high performance, and resource control.
Stars: ✭ 2,323 (+13564.71%)
Mutual labels:  hive, udf
Movie recommend
A Spark-based movie recommendation system, including a web crawler, a web site, a back-office administration system, and the Spark recommendation engine itself
Stars: ✭ 2,092 (+12205.88%)
Mutual labels:  hive, hadoop
sqlalchemy exasol
SQLAlchemy dialect for EXASOL
Stars: ✭ 34 (+100%)
Mutual labels:  exasol, exasol-integration
Facebook Hive Udfs
Facebook's Hive UDFs
Stars: ✭ 213 (+1152.94%)
Mutual labels:  hive, hadoop
dpkb
A collection of big data material covering distributed storage engines, distributed compute engines, data warehouse construction, and more. Keywords: Hadoop, HBase, ES, Kudu, Hive, Presto, Spark, Flink, Kylin, ClickHouse
Stars: ✭ 123 (+623.53%)
Mutual labels:  hive, hadoop
hive to es
A small tool for synchronizing data from a Hive data warehouse to Elasticsearch
Stars: ✭ 21 (+23.53%)
Mutual labels:  hive, hadoop
xxhadoop
Data Analysis Using Hadoop/Spark/Storm/ElasticSearch/MachineLearning etc. This is My Daily Notes/Code/Demo. Don't fork, Just star !
Stars: ✭ 37 (+117.65%)
Mutual labels:  hive, hadoop
hadoopoffice
HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)
Stars: ✭ 56 (+229.41%)
Mutual labels:  hive, hadoop
Bigdata docker
Big Data Ecosystem Docker
Stars: ✭ 161 (+847.06%)
Mutual labels:  hive, hadoop
Presto
The official home of the Presto distributed SQL query engine for big data
Stars: ✭ 12,957 (+76117.65%)
Mutual labels:  hive, hadoop
spark-connector
A connector for Apache Spark to access Exasol
Stars: ✭ 13 (-23.53%)
Mutual labels:  exasol, exasol-integration
smart-data-lake
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Stars: ✭ 79 (+364.71%)
Mutual labels:  hive, hadoop
hive-bigquery-storage-handler
Hive Storage Handler for interoperability between BigQuery and Apache Hive
Stars: ✭ 16 (-5.88%)
Mutual labels:  hive, hadoop
Hadoopcryptoledger
Hadoop Crypto Ledger - Analyzing CryptoLedgers, such as Bitcoin Blockchain, on Big Data platforms, such as Hadoop/Spark/Flink/Hive
Stars: ✭ 126 (+641.18%)
Mutual labels:  hive, hadoop
dockerfiles
Multi docker container images for main Big Data Tools. (Hadoop, Spark, Kafka, HBase, Cassandra, Zookeeper, Zeppelin, Drill, Flink, Hive, Hue, Mesos, ... )
Stars: ✭ 29 (+70.59%)
Mutual labels:  hive, hadoop

Hadoop ETL UDFs


Overview

Hadoop ETL UDFs are the main way to transfer data between Exasol and Hadoop (HCatalog tables on HDFS). The SQL syntax for calling the UDFs is similar to that of Exasol's native IMPORT and EXPORT commands, but with additional UDF parameters for specifying the necessary and optional Hadoop properties.

A brief overview of features includes support for:

  • HCatalog Metadata (e.g., table location, columns, partitions).
  • Multiple file formats (e.g., Parquet, ORC, RCFile)
  • HDFS HA
  • Partitions
  • Parallelization

For a more detailed description of the features, please refer to the IMPORT and EXPORT sections below.

Getting Started

Before you can start using the Hadoop ETL UDFs, you have to deploy the UDFs in your Exasol database. Please follow the step-by-step deployment guide.

Using the UDFs

After deploying the UDFs, you can begin using them to easily transfer data to and from Hadoop.

IMPORT

The IMPORT UDFs load data into Exasol from Hadoop (HCatalog tables on HDFS). To import data, you just need to execute the SQL statement IMPORT INTO ... FROM SCRIPT ETL.IMPORT_HCAT_TABLE WITH ... with the appropriate parameters. This calls the ETL.IMPORT_HCAT_TABLE UDF, which was previously created during deployment.

For example, run the following statement to import data into an existing table.

CREATE TABLE sample_07 (code VARCHAR(1000), description VARCHAR (1000), total_emp INT, salary INT);

IMPORT INTO sample_07
FROM SCRIPT ETL.IMPORT_HCAT_TABLE WITH
 HCAT_DB         = 'default'
 HCAT_TABLE      = 'sample_07'
 HCAT_ADDRESS    = 'thrift://hive-metastore-host:9083'
 HCAT_USER       = 'hive'
 HDFS_USER       = 'hdfs';
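
The UDFs also accept optional properties, for example for restricting the import to certain partitions or tuning parallelism. The parameter names used below (PARTITIONS, PARALLELISM) are assumptions for illustration only; consult the IMPORT details for the authoritative parameter list and value syntax.

IMPORT INTO sample_07
FROM SCRIPT ETL.IMPORT_HCAT_TABLE WITH
 HCAT_DB         = 'default'
 HCAT_TABLE      = 'sample_07'
 HCAT_ADDRESS    = 'thrift://hive-metastore-host:9083'
 HCAT_USER       = 'hive'
 HDFS_USER       = 'hdfs'
 PARTITIONS      = 'year=2015'   -- hypothetical: import only the year=2015 partition
 PARALLELISM     = 'nproc()';    -- hypothetical: one importer process per node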

Please see the IMPORT details for a full description.

EXPORT

Note: This functionality is available in Exasol starting with version 6.0.3.

The EXPORT UDFs load data from Exasol into Hadoop (HCatalog tables on HDFS). To export data, you just need to execute the SQL statement EXPORT ... INTO SCRIPT ETL.EXPORT_HCAT_TABLE WITH ... with the appropriate parameters. This calls the ETL.EXPORT_HCAT_TABLE UDF, which was previously created during deployment.

For example, run the following statement to export data from an existing table.

CREATE TABLE TABLE1 (COL1 SMALLINT, COL2 INT, COL3 VARCHAR(50));

EXPORT TABLE1
INTO SCRIPT ETL.EXPORT_HCAT_TABLE WITH
 HCAT_DB         = 'default'
 HCAT_TABLE      = 'test_table'
 HCAT_ADDRESS    = 'thrift://hive-metastore-host:9083'
 HCAT_USER       = 'hive'
 HDFS_USER       = 'hdfs';
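
Because the UDF is invoked through Exasol's regular EXPORT command, you are not limited to exporting a whole table. Assuming the query form of EXPORT also accepts a SCRIPT destination (a sketch, not confirmed in this README), exporting the result of a SELECT would look like this:

EXPORT (
  SELECT COL1, COL2, COL3
    FROM TABLE1
   WHERE COL2 >= 100
)
INTO SCRIPT ETL.EXPORT_HCAT_TABLE WITH
 HCAT_DB         = 'default'
 HCAT_TABLE      = 'test_table'
 HCAT_ADDRESS    = 'thrift://hive-metastore-host:9083'
 HCAT_USER       = 'hive'
 HDFS_USER       = 'hdfs';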

Please see the EXPORT details for a full description.

Frequent Issues

  • If you cannot connect to certain parts of Hadoop, it is a good idea to test DNS hostname resolution and TCP/IP connectivity to all Hadoop hosts and ports (HCatalog, HDFS, and the Kerberos servers, if used). You can use the Python script in solution 325 for this. Note that the script is designed for testing HTTP connections, so you can ignore the HTTP check failures.

  • Google DataProc Integration issues.

  • Hive NULL values are imported as the literal string \N. For now, you can post-process the data after the import to convert them into proper NULLs, as in the sketch below.
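
For example, a minimal post-processing sketch for the sample_07 table imported above, assuming the Hive NULLs appear as the literal string '\N' in the imported VARCHAR columns:

-- Turn the literal '\N' markers left by the import into real NULLs.
-- Repeat for every VARCHAR column that may contain Hive NULL values.
UPDATE sample_07 SET code        = NULL WHERE code        = '\N';
UPDATE sample_07 SET description = NULL WHERE description = '\N';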
