ExpediaGroup / corc

Licence: Apache-2.0
An ORC File Scheme for the Cascading data processing platform.

Programming Languages

java

Projects that are alternatives of or similar to corc

webhdfs
Node.js WebHDFS REST API client
Stars: ✭ 88 (+528.57%)
Mutual labels:  hadoop
hive-bigquery-storage-handler
Hive Storage Handler for interoperability between BigQuery and Apache Hive
Stars: ✭ 16 (+14.29%)
Mutual labels:  hadoop
LogAnalyzeHelper
A data-cleansing program for a forum log analysis system (includes an IP rules library, UDF development, MapReduce programs, and log data)
Stars: ✭ 33 (+135.71%)
Mutual labels:  hadoop
bigdata-doc
Big data study notes, learning roadmaps, and curated technical case studies.
Stars: ✭ 37 (+164.29%)
Mutual labels:  hadoop
hive to es
A small tool for syncing data from a Hive data warehouse to Elasticsearch
Stars: ✭ 21 (+50%)
Mutual labels:  hadoop
Data-pipeline-project
Data pipeline project
Stars: ✭ 18 (+28.57%)
Mutual labels:  hadoop
TonY
TonY is a framework to natively run deep learning frameworks on Apache Hadoop.
Stars: ✭ 687 (+4807.14%)
Mutual labels:  hadoop
pyspark-ML-in-Colab
Pyspark in Google Colab: A simple machine learning (Linear Regression) model
Stars: ✭ 32 (+128.57%)
Mutual labels:  hadoop
iis
Information Inference Service of the OpenAIRE system
Stars: ✭ 16 (+14.29%)
Mutual labels:  hadoop
datalake-etl-pipeline
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Stars: ✭ 39 (+178.57%)
Mutual labels:  hadoop
learning-hadoop-and-spark
Companion to the Learning Hadoop and Learning Spark courses on LinkedIn Learning
Stars: ✭ 146 (+942.86%)
Mutual labels:  hadoop
HDFS-Netdisc
A Hadoop-based distributed cloud storage system 🌴
Stars: ✭ 56 (+300%)
Mutual labels:  hadoop
dockerfiles
Multi docker container images for main Big Data Tools. (Hadoop, Spark, Kafka, HBase, Cassandra, Zookeeper, Zeppelin, Drill, Flink, Hive, Hue, Mesos, ... )
Stars: ✭ 29 (+107.14%)
Mutual labels:  hadoop
openPDC
Open Source Phasor Data Concentrator
Stars: ✭ 109 (+678.57%)
Mutual labels:  hadoop
big-data-exploration
[Archive] Intern project - Big Data Exploration using MongoDB - This Repository is NOT a supported MongoDB product
Stars: ✭ 43 (+207.14%)
Mutual labels:  hadoop
dpkb
A compilation of big data topics, covering distributed storage engines, distributed compute engines, data warehouse construction, and more. Keywords: Hadoop, HBase, ES, Kudu, Hive, Presto, Spark, Flink, Kylin, ClickHouse
Stars: ✭ 123 (+778.57%)
Mutual labels:  hadoop
the-apache-ignite-book
All code samples, scripts and more in-depth examples for The Apache Ignite Book. Include Apache Ignite 2.6 or above
Stars: ✭ 65 (+364.29%)
Mutual labels:  hadoop
BigInsights-on-Apache-Hadoop
Example projects for 'BigInsights for Apache Hadoop' on IBM Bluemix
Stars: ✭ 21 (+50%)
Mutual labels:  hadoop
disk
A distributed network disk system built on Hadoop, HBase, and Spring Boot
Stars: ✭ 53 (+278.57%)
Mutual labels:  hadoop
qs-hadoop
Notes on learning the big data ecosystem
Stars: ✭ 18 (+28.57%)
Mutual labels:  hadoop
   O~~~   O~~    O~ O~~~   O~~~
 O~~    O~~  O~~  O~~    O~~   
O~~    O~~    O~~ O~~   O~~    
 O~~    O~~  O~~  O~~    O~~   
   O~~~   O~~    O~~~      O~~~

Use corc to read and write data in the Optimized Row Columnar (ORC) file format in your Cascading applications. The reading of ACID datasets is also supported.

Status ⚠️

This project is no longer in active development.

Start using

You can obtain corc from Maven Central:


Cascading Dependencies

Corc has been built and tested against Cascading 3.3.0.

Hive Dependencies

Corc is built with Hive 2.3.4. Several dependencies will need to be included when using Corc:

<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>2.3.4</version>
  <classifier>core</classifier>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-serde</artifactId>
  <version>2.3.4</version>
</dependency>
<dependency>
  <groupId>com.esotericsoftware.kryo</groupId>
  <artifactId>kryo</artifactId>
  <version>2.22</version>
</dependency>
<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <version>2.5.0</version>
</dependency>

Overview

Supported types

| Hive      | Cascading/Java           |
|-----------|--------------------------|
| STRING    | String                   |
| BOOLEAN   | Boolean                  |
| TINYINT   | Byte                     |
| SMALLINT  | Short                    |
| INT       | Integer                  |
| BIGINT    | Long                     |
| FLOAT     | Float                    |
| DOUBLE    | Double                   |
| TIMESTAMP | java.sql.Timestamp       |
| DATE      | java.sql.Date            |
| BINARY    | byte[]                   |
| CHAR      | String (HiveChar)        |
| VARCHAR   | String (HiveVarchar)     |
| DECIMAL   | BigDecimal (HiveDecimal) |
| ARRAY     | List<Object>             |
| MAP       | Map<Object, Object>      |
| STRUCT    | List<Object>             |
| UNIONTYPE | Sub-type                 |

Constructing an OrcFile instance

OrcFile provides two public constructors: one for sourcing and one for sinking. These exist mainly to give flexibility to those who wish to extend the class; it is advisable to construct instances via the SourceBuilder and SinkBuilder classes instead.

SourceBuilder

Create a builder:

SourceBuilder builder = OrcFile.source();

Specify the fields that should be read. If the declared schema is a subset of the complete schema, then column projection will occur:

builder.declaredFields(fields);
// or
builder.columns(structTypeInfo);
// or
builder.columns(structTypeInfoString);

Specify the complete schema of the underlying ORC Files. This is only required when reading ORC Files that back a transactional Hive table; in all other cases the default behaviour, which obtains the schema from the ORC Files being read, should be used:

builder.schemaFromFile();
// or
builder.schema(fields);
// or
builder.schema(structTypeInfo);
// or
builder.schema(structTypeInfoString);

ORC Files support predicate pushdown. This allows whole row groups to be skipped if they do not contain any rows that match the given SearchArgument:

Fields message = new Fields("message", String.class);
SearchArgument searchArgument = SearchArgumentFactory.newBuilder()
    .startAnd()
    .equals(message, "hello")
    .end()
    .build();

builder.searchArgument(searchArgument);

When passing objects to the SearchArgument.Builder, care should be taken to choose the correct type:

| Hive      | Java                                          |
|-----------|-----------------------------------------------|
| STRING    | String                                        |
| BOOLEAN   | Boolean                                       |
| TINYINT   | Byte                                          |
| SMALLINT  | Short                                         |
| INT       | Integer                                       |
| BIGINT    | Long                                          |
| FLOAT     | Float                                         |
| DOUBLE    | Double                                        |
| TIMESTAMP | java.sql.Timestamp                            |
| DATE      | org.apache.hadoop.hive.serde2.io.DateWritable |
| CHAR      | String (HiveChar)                             |
| VARCHAR   | String (HiveVarchar)                          |
| DECIMAL   | BigDecimal                                    |

When reading ORC Files that back a transactional Hive table, include the VirtualColumn#ROWID ("ROW__ID") virtual column. The column will be prepended to the record's Fields:

builder.prependRowId();

Finally, build the OrcFile:

OrcFile orcFile = builder.build();

SinkBuilder

OrcFile orcFile = OrcFile.sink()
    .schema(schema)
    .build();

The schema parameter can be one of Fields, StructTypeInfo or the String representation of the StructTypeInfo. When providing a Fields instance, care must be taken when deciding how best to specify the types as there is no one-to-one bidirectional mapping between Cascading types and Hive types. The TypeInfo is able to represent richer, more complex types. Consider your ORC File schema and the mappings to Fields types carefully.

Constructing a StructTypeInfo instance

List<String> names = new ArrayList<>();
names.add("col0");
names.add("col1");

List<TypeInfo> typeInfos = new ArrayList<>();
typeInfos.add(TypeInfoFactory.stringTypeInfo);
typeInfos.add(TypeInfoFactory.longTypeInfo);

StructTypeInfo structTypeInfo = (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(names, typeInfos);

or...

String typeString = "struct<col0:string,col1:bigint>";

StructTypeInfo structTypeInfo = (StructTypeInfo) TypeInfoUtils.getTypeInfoFromTypeString(typeString);

or, via the convenience builder...

StructTypeInfo structTypeInfo = new StructTypeInfoBuilder()
    .add("col0", TypeInfoFactory.stringTypeInfo)
    .add("col1", TypeInfoFactory.longTypeInfo)
    .build();

Reading transactional Hive tables

Corc also supports reading the ACID datasets that underpin transactional Hive tables. However, for this to work effectively with an active Hive table you must provide your own lock management. We intend to make this functionality available in the cascading-hive project. When reading the data you may optionally include the virtual RecordIdentifier column, also known as the ROW__ID column, with one of the following approaches:

  1. Add a field named 'ROW__ID' to your Fields definition. This must be of type org.apache.hadoop.hive.ql.io.RecordIdentifier. For convenience you can use the constant OrcFile#ROW__ID with some fields arithmetic: Fields myFields = Fields.join(OrcFile.ROW__ID, myFields);.
  2. Use the OrcFile.source().prependRowId() option. Be sure to exclude the RecordIdentifier column from your typeInfo instance. The ROW__ID field will be added to your tuple stream automatically.
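The two approaches might be sketched as follows. This is an untested sketch, not a definitive recipe: it assumes a simple struct<col0:string,col1:bigint> table schema and uses only the builder calls shown earlier in this README, in the same statement style:

```java
// Hypothetical schema for the transactional table (excluding ROW__ID).
StructTypeInfo structTypeInfo = new StructTypeInfoBuilder()
    .add("col0", TypeInfoFactory.stringTypeInfo)
    .add("col1", TypeInfoFactory.longTypeInfo)
    .build();

// Approach 1: declare ROW__ID explicitly in the Fields definition.
Fields myFields = new Fields("col0", String.class);
myFields = Fields.join(OrcFile.ROW__ID, myFields);
SourceBuilder builder1 = OrcFile.source();
builder1.declaredFields(myFields);
builder1.schema(structTypeInfo); // complete schema, required for ACID reads
OrcFile source1 = builder1.build();

// Approach 2: have corc prepend ROW__ID automatically. Note that the
// typeInfo must NOT itself contain the RecordIdentifier column.
SourceBuilder builder2 = OrcFile.source();
builder2.columns(structTypeInfo);
builder2.schema(structTypeInfo);
builder2.prependRowId();
OrcFile source2 = builder2.build();
```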

Usage

OrcFile can be used with Hfs, just like TextDelimited.

OrcFile orcFile = ...
String path = ...
Hfs hfs = new Hfs(orcFile, path);
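Putting it together, a minimal flow that copies one ORC dataset to another might look like the sketch below. The pipe and flow-connector wiring is standard Cascading boilerplate rather than corc API, the connector class assumes the Hadoop 2 MR1 platform, and the paths are placeholders; treat this as an untested illustration:

```java
// Source and sink schemes backed by ORC Files.
OrcFile sourceScheme = OrcFile.source()
    .columns(structTypeInfo)
    .schemaFromFile()
    .build();
OrcFile sinkScheme = OrcFile.sink()
    .schema(structTypeInfo)
    .build();

Hfs sourceTap = new Hfs(sourceScheme, "/path/to/input");
Hfs sinkTap = new Hfs(sinkScheme, "/path/to/output");

// A pass-through pipe from source to sink.
Pipe pipe = new Pipe("orc-copy");
FlowDef flowDef = FlowDef.flowDef()
    .addSource(pipe, sourceTap)
    .addTailSink(pipe, sinkTap);

// Standard Cascading flow connector for the Hadoop 2 MR1 platform.
new Hadoop2MR1FlowConnector(new Properties()).connect(flowDef).complete();
```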

Credits

Created by Dave Maughan & Elliot West, with thanks to: Patrick Duin, James Grant & Adrian Woodhead.

Legal

This project is available under the Apache 2.0 License.

Copyright 2015-2020 Expedia, Inc.
