
ExpediaGroup / datasqueeze

License: Apache-2.0
Hadoop utility to compact small files


Projects that are alternatives to or similar to datasqueeze

HDFS-Netdisc
A Hadoop-based distributed cloud storage system 🌴
Stars: ✭ 56 (+211.11%)
Mutual labels:  hadoop, hdfs, hadoop-filesystem
kafka-connect-fs
Kafka Connect FileSystem Connector
Stars: ✭ 107 (+494.44%)
Mutual labels:  hadoop, hdfs, hadoop-filesystem
Hdfs Shell
HDFS Shell is a HDFS manipulation tool to work with functions integrated in Hadoop DFS
Stars: ✭ 117 (+550%)
Mutual labels:  hadoop, hdfs
Dynamometer
A tool for scale and performance testing of HDFS with a specific focus on the NameNode.
Stars: ✭ 122 (+577.78%)
Mutual labels:  hadoop, hdfs
docker-hadoop
Docker image for main Apache Hadoop components (Yarn/Hdfs)
Stars: ✭ 59 (+227.78%)
Mutual labels:  hadoop, hdfs
Repository
A personal learning knowledge base covering data warehouse modeling, real-time computing, big data, Java, algorithms, and more.
Stars: ✭ 92 (+411.11%)
Mutual labels:  hadoop, hdfs
Bigdata Notes
A beginner's guide to big data ⭐
Stars: ✭ 10,991 (+60961.11%)
Mutual labels:  hadoop, hdfs
Bigdata docker
Big Data Ecosystem Docker
Stars: ✭ 161 (+794.44%)
Mutual labels:  hadoop, hdfs
Jsr203 Hadoop
A Java NIO file system provider for HDFS
Stars: ✭ 35 (+94.44%)
Mutual labels:  hadoop, hdfs
bigdata-doc
Big data study notes, a learning roadmap, and curated technical case studies.
Stars: ✭ 37 (+105.56%)
Mutual labels:  hadoop, hdfs
teraslice
Scalable data processing pipelines in JavaScript
Stars: ✭ 48 (+166.67%)
Mutual labels:  hadoop, hdfs
skein
A tool and library for easily deploying applications on Apache YARN
Stars: ✭ 128 (+611.11%)
Mutual labels:  hadoop, hdfs
Wifi
A big data query and analysis system based on information captured over Wi-Fi
Stars: ✭ 93 (+416.67%)
Mutual labels:  hadoop, hdfs
Camus
Mirror of Linkedin's Camus
Stars: ✭ 81 (+350%)
Mutual labels:  hadoop, hdfs
Ibis
A pandas-like deferred expression system, with first-class SQL support
Stars: ✭ 1,630 (+8955.56%)
Mutual labels:  hadoop, hdfs
Learning Spark
Learning Spark from scratch; big data study material
Stars: ✭ 37 (+105.56%)
Mutual labels:  hadoop, hdfs
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (+733.33%)
Mutual labels:  hadoop, hdfs
Hadoop For Geoevent
ArcGIS GeoEvent Server sample Hadoop connector for storing GeoEvents in HDFS.
Stars: ✭ 5 (-72.22%)
Mutual labels:  hadoop, hdfs
Bigdata Interview
🎯 🌟 [Big data interview questions] A collection of big data interview questions gathered from around the web, with my own answer summaries. Currently covers the Hadoop/Hive/Spark/Flink/HBase/Kafka/Zookeeper frameworks.
Stars: ✭ 857 (+4661.11%)
Mutual labels:  hadoop, hdfs
hive to es
A small tool for syncing data from a Hive data warehouse to Elasticsearch
Stars: ✭ 21 (+16.67%)
Mutual labels:  hadoop, hdfs

DataSqueeze


Overview

DataSqueeze is a Hadoop utility for compacting small files into larger files. It copies and compacts files from a source directory to a target directory, maintaining the directory structure of the source.

Documentation

This README is intended to provide detailed technical documentation for advanced users.

General operation

DataSqueeze supports two types of compaction:

  1. Normal Compaction - Compacts files from a source path to a target path.

    Below is a high-level summary of the steps the compaction utility performs during a typical normal compaction run.

     a. Fetch the source file paths to be compacted from the source path provided.
     b. Perform a MapReduce job with the following configuration:
         1. The mapper groups records by their parent directory, emitting the parent directory as the key.
         2. The reducer merges records that share the same key and writes the compacted data to the target
            directory provided by the user, retaining the source directory structure.
    
  2. In-Place Compaction - Performs compaction in place on the source path. This is not recommended on AWS S3, where the extra copy and move steps make performance very poor.

    Below is a high-level summary of the steps the compaction utility performs during a typical in-place compaction run.

     a. Fetch the file paths to be compacted from the source path provided.
     b. Perform a MapReduce job with the following configuration:
         1. The mapper groups records by their parent directory, emitting the parent directory as the key.
         2. The reducer merges records that share the same key and writes the compacted data out,
            retaining the source directory structure.
     c. Store the compacted files in a temp-compacted path.
     d. Move the original files from the source location to a temp location.
     e. Move the compacted files from the temp-compacted location back to the source location specified by the user.
    
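The grouping performed by the mapper in both modes can be illustrated with a small sketch. This is plain Java with no Hadoop dependencies, and `groupByParent` is a hypothetical helper written for illustration, not part of DataSqueeze; it uses modern Java (lambdas, `List.of`) for brevity. Files are bucketed by their parent directory, which is exactly the key the mapper emits.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParentDirGrouping {

    // Bucket file paths by their parent directory, mirroring the mapper's
    // behaviour of emitting the parent directory as the record key.
    static Map<Path, List<Path>> groupByParent(List<Path> files) {
        Map<Path, List<Path>> groups = new LinkedHashMap<>();
        for (Path file : files) {
            groups.computeIfAbsent(file.getParent(), k -> new ArrayList<>()).add(file);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Path> files = List.of(
                Paths.get("/data/2021/01/part-0001"),
                Paths.get("/data/2021/01/part-0002"),
                Paths.get("/data/2021/02/part-0001"));
        // Each reducer would then compact one bucket into a larger file,
        // preserving the /data/2021/01 and /data/2021/02 structure.
        groupByParent(files).forEach((dir, parts) ->
                System.out.println(dir + " -> " + parts.size() + " file(s)"));
    }
}
```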

Requirements

  • macOS or Linux
  • Java 7 or later
  • Maven 3.x (for building)
  • rpmbuild (for building RPMs)

Building DataSqueeze

DataSqueeze is a standard Maven project. Run the following in the project root folder:

mvn clean package

The compiled JAR can be found at datasqueeze/target/datasqueeze.jar.

To build an RPM, use the optional Maven profile -P rpm:

mvn clean package -P rpm

This requires rpmbuild to be installed; otherwise the build will fail.

Running DataSqueeze

There are two different ways of running DataSqueeze:

  1. CLI

    a. For TEXT/ORC/SEQ

        hadoop jar datasqueeze.jar com.expedia.dsp.data.squeeze.Utility
        -sp s3a://edwprod/user/ysontakke/compactiontest1/ -tp s3a://edwprod/user/ysontakke/compactionoutput_text_yash_1/
        -threshold 12345

    b. For AVRO

        hadoop jar datasqueeze.jar com.expedia.dsp.data.squeeze.Utility
        -sp s3a://edwprod/user/ysontakke/compactiontest1/ -tp s3a://edwprod/user/ysontakke/compactionoutput_text_yash_1/
        -threshold 12345 -fileType AVRO -schemaPath s3a://edwprod/user/ysontakke/compactionschema_text_yash_1/schema.avsc

    The CLI accepts the following parameters:

    * sp (SourcePath) - Source location for compaction
    * tp (TargetPath) - Target location for compaction. If no target path is provided, in-place compaction is performed
    * threshold - Optional. File-size threshold in bytes; files larger than the threshold are not compacted and are
      simply copied to the target directory. Defaults to 134217728 (128 MB)
    * maxReducers - Maximum number of reducers for the MapReduce job
    * fileType - Type of file to be compacted (AVRO / TEXT / SEQ / ORC); mandatory for AVRO
    * schemaPath - Path to the schema used for compaction (mandatory for AVRO)
    
  2. API - CompactionManager

        CompactionResponse compact() throws Exception;
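Only the `compact()` signature is documented above, so the following is a hedged sketch of how a caller might drive a compaction: the `CompactionManager` interface and `CompactionResponse` type here are stand-ins written for illustration (the real classes ship in the `com.expedia.dsp.data.squeeze` package and will differ), and the lambda stands in for a manager configured with the source path, target path, and threshold described in the CLI section.

```java
// Stand-in types for illustration only; the real CompactionManager and
// CompactionResponse belong to DataSqueeze and have richer behaviour.
interface CompactionManager {
    CompactionResponse compact() throws Exception;
}

class CompactionResponse {
    private final boolean successful;
    private final String targetPath;

    CompactionResponse(boolean successful, String targetPath) {
        this.successful = successful;
        this.targetPath = targetPath;
    }

    boolean isSuccessful() { return successful; }
    String getTargetPath() { return targetPath; }
}

public class ApiSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical manager; a real one would run the MapReduce job
        // described in the "General operation" section.
        CompactionManager manager =
                () -> new CompactionResponse(true, "/data/compacted");

        CompactionResponse response = manager.compact();
        System.out.println("success=" + response.isSuccessful()
                + " target=" + response.getTargetPath());
    }
}
```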

Tests

Currently, the tests for DataSqueeze cannot be made publicly available, but we are working on open-sourcing them.

Contributing

We gladly accept contributions to DataSqueeze in the form of issues, feature requests, and pull requests!

Licensing

Copyright © 2017-2021 Expedia, Inc.

DataSqueeze is licensed under the Apache 2.0 license; refer to LICENSE for the complete text.
