
mullerhai / Hsuntzu

HDFS compress tar zip snappy gzip uncompress untar codec hadoop spark

Programming Languages

scala

Projects that are alternatives to or similar to Hsuntzu

Rumble
⛈️ Rumble 1.11.0 "Banyan Tree"🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Stars: ✭ 58 (-57.04%)
Mutual labels:  hdfs
Repository
A personal learning knowledge base covering data warehouse modeling, real-time computing, big data, Java, algorithms, and more.
Stars: ✭ 92 (-31.85%)
Mutual labels:  hdfs
Hdfs Shell
HDFS Shell is a HDFS manipulation tool to work with functions integrated in Hadoop DFS
Stars: ✭ 117 (-13.33%)
Mutual labels:  hdfs
Cloud Note
A distributed cloud note-taking service (modeled on a well-known cloud notes product), with data stored in Redis and HBase
Stars: ✭ 71 (-47.41%)
Mutual labels:  hdfs
Compress.js
A simple JavaScript based client-side image compression algorithm
Stars: ✭ 86 (-36.3%)
Mutual labels:  compress
Html Minifier Terser
actively maintained fork of html-minifier - minify HTML, CSS and JS code using terser - supports ES6 code
Stars: ✭ 106 (-21.48%)
Mutual labels:  compress
Tiledb
The Universal Storage Engine
Stars: ✭ 1,072 (+694.07%)
Mutual labels:  hdfs
Apiproject
[https://www.sofineday.com], a Golang project development scaffold integrating best practices (gin + gorm + go-redis + mongo + CORS + JWT + the zap JSON logging library (with log collection to Kafka or Mongo) + Kafka message queue + WeChat/Alipay payments via gopay + API encryption + API reverse proxy + Go modules dependency management + headless crawling with chromedp + Makefile + binary compression + livereload hot reloading)
Stars: ✭ 124 (-8.15%)
Mutual labels:  compress
Wifi
A big data query and analysis system built on information captured over Wi-Fi
Stars: ✭ 93 (-31.11%)
Mutual labels:  hdfs
Ibis
A pandas-like deferred expression system, with first-class SQL support
Stars: ✭ 1,630 (+1107.41%)
Mutual labels:  hdfs
Tiledb Py
Python interface to the TileDB storage manager
Stars: ✭ 78 (-42.22%)
Mutual labels:  hdfs
Bigdata File Viewer
A cross-platform (Windows, MAC, Linux) desktop application to view common bigdata binary format like Parquet, ORC, AVRO, etc. Support local file system, HDFS, AWS S3, Azure Blob Storage ,etc.
Stars: ✭ 86 (-36.3%)
Mutual labels:  hdfs
Py7zr
7zip in python3 with ZStandard, PPMd, LZMA2, LZMA1, Delta, BCJ, BZip2, and Deflate compressions, and AES encryption.
Stars: ✭ 110 (-18.52%)
Mutual labels:  compress
Big Data Engineering Coursera Yandex
Big Data for Data Engineers Coursera Specialization from Yandex
Stars: ✭ 71 (-47.41%)
Mutual labels:  hdfs
Dynamometer
A tool for scale and performance testing of HDFS with a specific focus on the NameNode.
Stars: ✭ 122 (-9.63%)
Mutual labels:  hdfs
Flume Canal Source
Flume NG Canal source
Stars: ✭ 56 (-58.52%)
Mutual labels:  hdfs
Bigdata Notes
A beginner's guide to big data ⭐
Stars: ✭ 10,991 (+8041.48%)
Mutual labels:  hdfs
Slim
Surprisingly space efficient trie in Golang(11 bits/key; 100 ns/get).
Stars: ✭ 1,705 (+1162.96%)
Mutual labels:  compress
Elasticctr
ElasticCTR, the PaddlePaddle elastic-computing recommendation system, is an enterprise-grade open source recommendation solution based on Kubernetes. It combines high-accuracy CTR models continuously refined in Baidu's business scenarios, the large-scale distributed training capability of the open source PaddlePaddle framework, and an industrial-grade elastic scheduling service for sparse parameters, helping users deploy a recommendation system in a Kubernetes environment with one click. It offers high performance, industrial-grade deployment, and an end-to-end experience, and as an open source suite it also supports deep secondary development.
Stars: ✭ 123 (-8.89%)
Mutual labels:  hdfs
Tiny Html Minifier
Minify HTML in PHP with just a single class
Stars: ✭ 114 (-15.56%)
Mutual labels:  compress

HsunTzu


Overview

Version: Beta 2.0

Very fast compression, decompression, tarring, and untarring of original files on HDFS


LICENSE: Apache 2.0

"A craftsman who wishes to do his work well must first sharpen his tools." -- Xunzi (HSUNTZU)

This tool is mainly used on HDFS to compress and archive files and logs, and to perform the reverse operations such as decompression. It supports compressing multiple directories in parallel and all six compression formats currently available in HDFS, and it has been tested on PB-scale data without problems. It does not occupy the MapReduce job queue, runs comfortably in a shell REPL, and can also be integrated into a standalone project. Before using it, check your company's HDFS cluster environment: you need to configure the cluster address and related information, and when running a command you must supply the necessary arguments, such as the path of the files to compress, the output path, the operation type, the configuration file path, and the input and output compression formats. The project is still gaining new features, so you are welcome to try it, solve the pain points of archiving files on HDFS, and free up more HDFS space. Four operation types are currently supported: 1. compress original files; 2. tar original files; 3. untar tarballs back to original files; 4. compress or tar files across multiple directories in batch. Running on macOS or Windows is not recommended; a CentOS server environment is recommended.

Advantages

Fast: supports parallel execution | does not occupy the MapReduce job queue
Stable: does not crash or abort midway | steady, predictable resource usage
Accurate: guarantees data integrity | no loss, no duplication, no corruption
Versatile: supports all six compression formats available in HDFS and tars directly on HDFS without going through the local filesystem
Configurable, reusable, embeddable

Big data governance philosophy

Most people first ask what compression ratio this tool achieves. Roughly: Snappy is about 3-5x; gzip, default, and deflate are about 6-12x; bzip is about 4-7x; lz4 is about 4-5x. However, that is not the whole picture: the actual ratio depends heavily on the contents of your files, so quoting a compression ratio in isolation is misleading. We should not focus on one performance figure alone. When handling big data there are many practical concerns to weigh: cluster capacity planning; how much cluster performance the compress/decompress/tar process consumes; whether the archived data will still support splitting when it is computed on later; the time compression takes; future cross-cluster data movement, cluster expansion, data rebalancing, and so on. Much of the time this is a process of mutual trade-offs that needs careful, deliberate weighing before deciding which format to choose.
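As a rough capacity-planning illustration only, here is a minimal, self-contained Scala sketch that uses the approximate ratios quoted above as assumptions (not measured results) to estimate archive sizes:

// Rough estimate: raw size divided by an assumed compression ratio.
// The ratios are the rough figures quoted above; real results depend heavily on file content.
object ArchiveEstimate {
  val assumedRatios = Map("snappy" -> 4.0, "gzip" -> 9.0, "bzip2" -> 5.5, "lz4" -> 4.5)
  def estimate(rawTiB: Double): Unit =
    assumedRatios.foreach { case (codec, ratio) =>
      println(f"$codec%-7s -> roughly ${rawTiB / ratio}%.1f TiB for $rawTiB%.0f TiB of raw data")
    }
  def main(args: Array[String]): Unit = estimate(100.0) // e.g. 100 TiB of raw logs
}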

Answers to some common questions

Is this a new compression algorithm? ----> No. It is a complete wrapper over the six compression algorithms HDFS already has, plus wrapped calls to the Apache Commons Tar API; no new algorithm was developed. Rolling a new algorithm would be pointless anyway, since Hadoop would not support it.
Doesn't Hadoop already have these commands? ---> Hadoop only defines and exposes the APIs; it does not ship a packaged implementation itself. I have worked on big data governance for years, and if Hadoop already had this I would not have needed to reinvent the wheel. Compressing and tarring is exactly one of the pain points of big data processing; Hadoop's own HAR format only packs files, it does not compress them.
Just calling someone else's API takes no technical skill, right? --> Heh, good for you; I look forward to you open-sourcing your own version for everyone to use.
Isn't this thing just a half-finished product? ---> As long as you know how to use it, even a rough lump can shine in your project and bring value.
Will new features be added in the future? ---> Yes, I will keep maintaining it.
Is this tool reliable? ---> Of course. It has already been proven in PB-scale production environments at several large companies, including the data platforms of DiDi and 360.


Good tools are a prerequisite for the successful execution of a job

How To Use It ! ! !

Shell command format and arguments are shown below [you may need to edit the config file first]:

HadoopExecPath jar HsunTzuPro-beat-2.0.jar InputDir OutputDir OperateType ConfigFilePath InputCodec OutputCodec

hadoop launcher script   jar   HsunTzuPro-beat-2.0.jar   input directory (files to compress/tar/untar)   output directory (decompressed/untarred/compressed results)   operation type   config file path   input codec   output codec

First

You need to install JDK 8, Scala 2.12.1+, sbt 1.0.4+, and Hadoop 2.8.1+.

You can also adjust the versions in build.sbt and ./project/build.properties.

Get

git clone git@github.com:mullerhai/HsunTzu.git

cd ./HsunTzu

Compile

sbt clean compile

Package

sbt update

sbt assembly

Run

hadoop jar   HsunTzuPro-beat-2.0.jar   inputPath   outPath CompressType PropertiesFilePath inputCodec OutputCodec

You will see the log output on the console.

Things you need to know before running

CompressType: pass a number instead of the compress method to invoke

case "1" => exec.originFileToCompressFile

case "2" => exec.tarFileToOriginFile

case "3" => exec.compressFileToOriginFile

case "4" => exec.oneCompressConvertOtherCompress

Compression codec: pass a number instead of the codec class, e.g. 1 instead of snappyCodec (a combined sketch follows the list below)

case "0" => deflateCode

case "1" => snappyCodec

case "2" => gzipCodec

case "3" => lz4Codec

case "4" => bZip2Codec

case "5" => defaultCodec

case _ => deflateCodec
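For reference, here is a minimal, self-contained Scala sketch (illustrative only, not HsunTzu's actual source) of how those two numeric arguments might map to an operation and a standard Hadoop codec class; the fully qualified class names are the usual Hadoop built-in codecs and are an assumption about what the tool wraps.

object Dispatch {
  // Operation-type argument -> method name described above.
  def operation(code: String): String = code match {
    case "1" => "originFileToCompressFile"
    case "2" => "tarFileToOriginFile"
    case "3" => "compressFileToOriginFile"
    case "4" => "oneCompressConvertOtherCompress"
    case other => sys.error(s"unknown operation type: $other")
  }
  // Codec argument -> standard Hadoop codec class name (assumed mapping).
  def codecClass(code: String): String = code match {
    case "1" => "org.apache.hadoop.io.compress.SnappyCodec"
    case "2" => "org.apache.hadoop.io.compress.GzipCodec"
    case "3" => "org.apache.hadoop.io.compress.Lz4Codec"
    case "4" => "org.apache.hadoop.io.compress.BZip2Codec"
    case "5" => "org.apache.hadoop.io.compress.DefaultCodec"
    case _   => "org.apache.hadoop.io.compress.DeflateCodec" // 0 and anything else
  }
  def main(args: Array[String]): Unit =
    println(s"${operation("1")} writing ${codecClass("1")}")
}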

PropertiesFilePath

You need to create a properties file, e.g. /usr/local/info.properties.

You need to declare the file prefixes under the [files] key of your properties file to select which files to compress, decompress, or untar.

If you want to declare the HDFS address, port, and operating user, you also need to put them in the properties file.

Note that the keys must be exactly these: [hdfsAddr, hdfsPort, FsKey, FsUserKey, hadoopUser, HDFSPORTDOTSUFFIX, files]. An illustrative sketch of loading these keys follows the example below.

hdfsAddr=hdfs://192.168.255.161:9000

hdfsPort=9000

FsKey=fs.defaultFS

FsUserKey=HADOOP_USER_NAME

hadoopUser=linkedme_hadoop

HDFSPORTDOTSUFFIX= :9000/

files=biz,ad_status,ad_behavior
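The sketch below shows one way such a properties file could be consumed when connecting to HDFS; it is a minimal Scala illustration assuming the standard Hadoop client API, and HsunTzu's own loading code may differ.

import java.io.FileInputStream
import java.net.URI
import java.util.Properties
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object LoadProps {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.load(new FileInputStream("/usr/local/info.properties"))
    val conf = new Configuration()
    // FsKey holds "fs.defaultFS", so this points the Hadoop client at the configured cluster.
    conf.set(props.getProperty("FsKey"), props.getProperty("hdfsAddr"))
    // Connect as the configured user (FsUserKey names the HADOOP_USER_NAME variable).
    val fs = FileSystem.get(new URI(props.getProperty("hdfsAddr")), conf, props.getProperty("hadoopUser"))
    // The comma-separated prefixes in "files" select which files to process.
    val prefixes = props.getProperty("files").split(",")
    println(s"connected to ${fs.getUri}, file prefixes: ${prefixes.mkString(", ")}")
  }
}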

You need to pass six arguments after the jar name; depending on the operation, the last argument (the output codec) may not be used.

Run examples:

hadoop jar HsunTzuPro-beat-2.0.jar /facishare-data/taru/20170820 /facishare-data/gao 1 /usr/local/info.properties 1 0

This converts original files into compressed files, compressed with snappyCodec.

hadoop jar HsunTzuPro-beat-2.0.jar /facishare-data/taruns/taruns/tarun/20170820/ /facishare-data/xin 4 /usr/local/info.properties 0 1

This converts deflateCodec-compressed files into snappyCodec-compressed files (oneCompressConvertOtherCompress).

hadoop jar HsunTzuPro-beat-2.0.jar /facishare-data/taruns/taruns/tarun/20170820/ /facishare-data/xin 3 /usr/local/info.properties 3 0

This decompresses lz4Codec-compressed files back into the original files.

hadoop jar HsunTzuPro-beat-2.0.jar /facishare-data/taruns/taruns/tarun/20170820/ /facishare-data/xin 2 /usr/local/info.properties 0 0

This untars tarball files back into the original files.
