
mullerhai / Hsuntzu

HDFS compress tar zip snappy gzip uncompress untar codec hadoop spark

Programming Languages

scala

Projects that are alternatives to or similar to Hsuntzu

Rumble
⛈️ Rumble 1.11.0 "Banyan Tree"🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more
Stars: ✭ 58 (-57.04%)
Mutual labels:  hdfs
Repository
A personal learning knowledge base covering data warehouse modeling, real-time computing, big data, Java, algorithms, and more.
Stars: ✭ 92 (-31.85%)
Mutual labels:  hdfs
Hdfs Shell
HDFS Shell is a HDFS manipulation tool to work with functions integrated in Hadoop DFS
Stars: ✭ 117 (-13.33%)
Mutual labels:  hdfs
Cloud Note
A distributed cloud note-taking service (modeled on a well-known cloud notes product), with data stored in Redis and HBase
Stars: ✭ 71 (-47.41%)
Mutual labels:  hdfs
Compress.js
A simple JavaScript based client-side image compression algorithm
Stars: ✭ 86 (-36.3%)
Mutual labels:  compress
Html Minifier Terser
actively maintained fork of html-minifier - minify HTML, CSS and JS code using terser - supports ES6 code
Stars: ✭ 106 (-21.48%)
Mutual labels:  compress
Tiledb
The Universal Storage Engine
Stars: ✭ 1,072 (+694.07%)
Mutual labels:  hdfs
Apiproject
[https://www.sofineday.com], a Golang project development scaffold integrating best practices (gin + gorm + go-redis + mongo + CORS + JWT + the zap JSON logging library (with log collection to Kafka or Mongo) + Kafka message queue + WeChat/Alipay payments via gopay + API encryption + API reverse proxy + Go modules dependency management + headless crawling with chromedp + Makefile + binary compression + livereload hot reloading)
Stars: ✭ 124 (-8.15%)
Mutual labels:  compress
Wifi
A big data query and analysis system built on information captured over Wi-Fi
Stars: ✭ 93 (-31.11%)
Mutual labels:  hdfs
Ibis
A pandas-like deferred expression system, with first-class SQL support
Stars: ✭ 1,630 (+1107.41%)
Mutual labels:  hdfs
Tiledb Py
Python interface to the TileDB storage manager
Stars: ✭ 78 (-42.22%)
Mutual labels:  hdfs
Bigdata File Viewer
A cross-platform (Windows, MAC, Linux) desktop application to view common bigdata binary format like Parquet, ORC, AVRO, etc. Support local file system, HDFS, AWS S3, Azure Blob Storage ,etc.
Stars: ✭ 86 (-36.3%)
Mutual labels:  hdfs
Py7zr
7zip in python3 with ZStandard, PPMd, LZMA2, LZMA1, Delta, BCJ, BZip2, and Deflate compressions, and AES encryption.
Stars: ✭ 110 (-18.52%)
Mutual labels:  compress
Big Data Engineering Coursera Yandex
Big Data for Data Engineers Coursera Specialization from Yandex
Stars: ✭ 71 (-47.41%)
Mutual labels:  hdfs
Dynamometer
A tool for scale and performance testing of HDFS with a specific focus on the NameNode.
Stars: ✭ 122 (-9.63%)
Mutual labels:  hdfs
Flume Canal Source
Flume NG Canal source
Stars: ✭ 56 (-58.52%)
Mutual labels:  hdfs
Bigdata Notes
A beginner's guide to big data ⭐
Stars: ✭ 10,991 (+8041.48%)
Mutual labels:  hdfs
Slim
Surprisingly space efficient trie in Golang(11 bits/key; 100 ns/get).
Stars: ✭ 1,705 (+1162.96%)
Mutual labels:  compress
Elasticctr
ElasticCTR, the PaddlePaddle elastic-computing recommendation system, is an enterprise-grade open source recommendation solution based on Kubernetes. It combines high-accuracy CTR models continuously refined in Baidu's business scenarios, the large-scale distributed training capability of the open source PaddlePaddle framework, and an industrial-grade elastic scheduling service for sparse parameters, helping users deploy a recommendation system in a Kubernetes environment with one click. It offers high performance, industrial-grade deployment, and an end-to-end experience, and as an open source suite it also supports deep secondary development.
Stars: ✭ 123 (-8.89%)
Mutual labels:  hdfs
Tiny Html Minifier
Minify HTML in PHP with just a single class
Stars: ✭ 114 (-15.56%)
Mutual labels:  compress

HsunTzu


Overview

Version: Beta 2.0

Very fast compression, decompression, tarring, and untarring of original files on HDFS


LICENSE: Apache 2.0

"A craftsman who wishes to do his work well must first sharpen his tools." -- Xunzi (HSUNTZU)

This tool is mainly used on HDFS to compress and archive files and logs, and to perform the reverse operations such as decompression. It supports compressing multiple directories in parallel and all six compression formats currently available in HDFS, and it has been tested on PB-scale data without problems. It does not occupy the MapReduce job queue, runs comfortably in a shell REPL, and can also be integrated into a standalone project. Before using it, check your company's HDFS cluster environment: you need to configure the cluster address and related information, and when running a command you must supply the necessary arguments, such as the path of the files to compress, the output path, the operation type, the configuration file path, and the input and output compression formats. The project is still gaining new features, so you are welcome to try it, solve the pain points of archiving files on HDFS, and free up more HDFS space. Four operation types are currently supported: 1. compress original files; 2. tar original files; 3. untar tarballs back to original files; 4. compress or tar files across multiple directories in batch. Running on macOS or Windows is not recommended; a CentOS server environment is recommended.

Advantages

Fast: supports parallel execution | does not occupy the MapReduce job queue
Stable: does not crash or abort midway | steady, predictable resource usage
Accurate: guarantees data integrity | no loss, no duplication, no corruption
Versatile: supports all six compression formats available in HDFS and tars directly on HDFS without going through the local filesystem
Configurable, reusable, embeddable

Big data governance philosophy

Most people first ask what compression ratio this tool achieves. Roughly: Snappy is about 3-5x; gzip, default, and deflate are about 6-12x; bzip is about 4-7x; lz4 is about 4-5x. However, that is not the whole picture: the actual ratio depends heavily on the contents of your files, so quoting a compression ratio in isolation is misleading. We should not focus on one performance figure alone. When handling big data there are many practical concerns to weigh: cluster capacity planning; how much cluster performance the compress/decompress/tar process consumes; whether the archived data will still support splitting when it is computed on later; the time compression takes; future cross-cluster data movement, cluster expansion, data rebalancing, and so on. Much of the time this is a process of mutual trade-offs that needs careful, deliberate weighing before deciding which format to choose.
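As a rough capacity-planning illustration only, here is a minimal, self-contained Scala sketch that uses the approximate ratios quoted above as assumptions (not measured results) to estimate archive sizes:

// Rough estimate: raw size divided by an assumed compression ratio.
// The ratios are the rough figures quoted above; real results depend heavily on file content.
object ArchiveEstimate {
  val assumedRatios = Map("snappy" -> 4.0, "gzip" -> 9.0, "bzip2" -> 5.5, "lz4" -> 4.5)
  def estimate(rawTiB: Double): Unit =
    assumedRatios.foreach { case (codec, ratio) =>
      println(f"$codec%-7s -> roughly ${rawTiB / ratio}%.1f TiB for $rawTiB%.0f TiB of raw data")
    }
  def main(args: Array[String]): Unit = estimate(100.0) // e.g. 100 TiB of raw logs
}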

Answers to some common questions

Is this a new compression algorithm? ----> No. It is a complete wrapper over the six compression algorithms HDFS already has, plus wrapped calls to the Apache Commons Tar API; no new algorithm was developed. Rolling a new algorithm would be pointless anyway, since Hadoop would not support it.
Doesn't Hadoop already have these commands? ---> Hadoop only defines and exposes the APIs; it does not ship a packaged implementation itself. I have worked on big data governance for years, and if Hadoop already had this I would not have needed to reinvent the wheel. Compressing and tarring is exactly one of the pain points of big data processing; Hadoop's own HAR format only packs files, it does not compress them.
Just calling someone else's API takes no technical skill, right? --> Heh, good for you; I look forward to you open-sourcing your own version for everyone to use.
Isn't this thing just a half-finished product? ---> As long as you know how to use it, even a rough lump can shine in your project and bring value.
Will new features be added in the future? ---> Yes, I will keep maintaining it.
Is this tool reliable? ---> Of course. It has already been proven in PB-scale production environments at several large companies, including the data platforms of DiDi and 360.


Good tools are a prerequisite for the successful execution of a job

How To Use It ! ! !

Shell command format and arguments are shown below [you may need to edit the config file first]:

HadoopExecPath jar HsunTzuPro-beat-2.0.jar InputDir OutputDir OperateType ConfigFilePath InputCodec OutputCodec

hadoop launcher script   jar   HsunTzuPro-beat-2.0.jar   input directory (files to compress/tar/untar)   output directory (decompressed/untarred/compressed results)   operation type   config file path   input codec   output codec

First

You need to install JDK 8, Scala 2.12.1+, sbt 1.0.4+, and Hadoop 2.8.1+.

You can also adjust the versions in build.sbt and ./project/build.properties.

Get

git clone git@github.com:mullerhai/HsunTzu.git

cd ./HsunTzu

Compile

sbt clean compile

Package

sbt update

sbt assembly

Run

hadoop jar   HsunTzuPro-beat-2.0.jar   inputPath   outPath CompressType PropertiesFilePath inputCodec OutputCodec

You will see the log output on the console.

Things you need to know before running

CompressType: pass a number instead of the compress method to invoke

case "1" => exec.originFileToCompressFile

case "2" => exec.tarFileToOriginFile

case "3" => exec.compressFileToOriginFile

case "4" => exec.oneCompressConvertOtherCompress

Compression codec: pass a number instead of the codec class, e.g. 1 instead of snappyCodec (a combined sketch follows the list below)

case "0" => deflateCode

case "1" => snappyCodec

case "2" => gzipCodec

case "3" => lz4Codec

case "4" => bZip2Codec

case "5" => defaultCodec

case _ => deflateCodec
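For reference, here is a minimal, self-contained Scala sketch (illustrative only, not HsunTzu's actual source) of how those two numeric arguments might map to an operation and a standard Hadoop codec class; the fully qualified class names are the usual Hadoop built-in codecs and are an assumption about what the tool wraps.

object Dispatch {
  // Operation-type argument -> method name described above.
  def operation(code: String): String = code match {
    case "1" => "originFileToCompressFile"
    case "2" => "tarFileToOriginFile"
    case "3" => "compressFileToOriginFile"
    case "4" => "oneCompressConvertOtherCompress"
    case other => sys.error(s"unknown operation type: $other")
  }
  // Codec argument -> standard Hadoop codec class name (assumed mapping).
  def codecClass(code: String): String = code match {
    case "1" => "org.apache.hadoop.io.compress.SnappyCodec"
    case "2" => "org.apache.hadoop.io.compress.GzipCodec"
    case "3" => "org.apache.hadoop.io.compress.Lz4Codec"
    case "4" => "org.apache.hadoop.io.compress.BZip2Codec"
    case "5" => "org.apache.hadoop.io.compress.DefaultCodec"
    case _   => "org.apache.hadoop.io.compress.DeflateCodec" // 0 and anything else
  }
  def main(args: Array[String]): Unit =
    println(s"${operation("1")} writing ${codecClass("1")}")
}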

PropertiesFilePath

You need to create a properties file, e.g. /usr/local/info.properties.

You need to declare the file prefixes under the [files] key of your properties file to select which files to compress, decompress, or untar.

If you want to declare the HDFS address, port, and operating user, you also need to put them in the properties file.

Note that the keys must be exactly these: [hdfsAddr, hdfsPort, FsKey, FsUserKey, hadoopUser, HDFSPORTDOTSUFFIX, files]. An illustrative sketch of loading these keys follows the example below.

hdfsAddr=hdfs://192.168.255.161:9000

hdfsPort=9000

FsKey=fs.defaultFS

FsUserKey=HADOOP_USER_NAME

hadoopUser=linkedme_hadoop

HDFSPORTDOTSUFFIX= :9000/

files=biz,ad_status,ad_behavior
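The sketch below shows one way such a properties file could be consumed when connecting to HDFS; it is a minimal Scala illustration assuming the standard Hadoop client API, and HsunTzu's own loading code may differ.

import java.io.FileInputStream
import java.net.URI
import java.util.Properties
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object LoadProps {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.load(new FileInputStream("/usr/local/info.properties"))
    val conf = new Configuration()
    // FsKey holds "fs.defaultFS", so this points the Hadoop client at the configured cluster.
    conf.set(props.getProperty("FsKey"), props.getProperty("hdfsAddr"))
    // Connect as the configured user (FsUserKey names the HADOOP_USER_NAME variable).
    val fs = FileSystem.get(new URI(props.getProperty("hdfsAddr")), conf, props.getProperty("hadoopUser"))
    // The comma-separated prefixes in "files" select which files to process.
    val prefixes = props.getProperty("files").split(",")
    println(s"connected to ${fs.getUri}, file prefixes: ${prefixes.mkString(", ")}")
  }
}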

You need to pass six arguments after the jar name; depending on the operation, the last argument (the output codec) may not be used.

Run examples:

hadoop jar HsunTzuPro-beat-2.0.jar /facishare-data/taru/20170820 /facishare-data/gao 1 /usr/local/info.properties 1 0

This converts original files into compressed files, compressed with snappyCodec.

hadoop jar HsunTzuPro-beat-2.0.jar /facishare-data/taruns/taruns/tarun/20170820/ /facishare-data/xin 4 /usr/local/info.properties 0 1

This converts deflateCodec-compressed files into snappyCodec-compressed files (oneCompressConvertOtherCompress).

hadoop jar HsunTzuPro-beat-2.0.jar /facishare-data/taruns/taruns/tarun/20170820/ /facishare-data/xin 3 /usr/local/info.properties 3 0

This decompresses lz4Codec-compressed files back into the original files.

hadoop jar HsunTzuPro-beat-2.0.jar /facishare-data/taruns/taruns/tarun/20170820/ /facishare-data/xin 2 /usr/local/info.properties 0 0

This untars tarball files back into the original files.
