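A tool for syncing Hive data to Elasticsearch

Supports full (the default) and incremental imports;
Also supports writing SQL to produce an intermediate result table, which is then imported into ES;

Importing data through Impala is now supported, which greatly speeds up imports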

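Uses paged queries, so a large dataset does not exhaust memory;
At the company where I interned, the data analysts, product managers, and operations staff frequently needed reports, mostly for analysis and statistics. Elasticsearch is well suited to statistical analysis, and combined with Kibana it can generate reports directly. For these recurring requests, my usual approach is to import tables from the Hive data warehouse into ES, or to first filter out a result table with HQL or Impala SQL; ES then aggregates the data, for example with a Date Histogram by day, week, or month, or for a specific person.
Kibana then generates the various visualizations, so the final data is presented clearly.
The configuration is kept as simple as possible for ease of use.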
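
The per-day, per-week, or per-month aggregation described above can be sketched as an Elasticsearch search body. This is only an illustration, not code from this project: the function name and its parameters are hypothetical, the field names `@timestamp` and `name_in_es` are borrowed from the column mapping example in the configuration, and the term filter assumes the person field is indexed as an exact-match (keyword) field. The interval key is a parameter because recent Elasticsearch versions use `calendar_interval` while older ones use `interval`.

```python
import json


def date_histogram_body(field="@timestamp", interval="day",
                        person_field=None, person=None,
                        interval_key="calendar_interval"):
    """Build a search body that counts documents per calendar interval.

    All names here are illustrative assumptions, not part of hive_to_es.
    """
    body = {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "aggs": {
            "per_interval": {
                "date_histogram": {"field": field, interval_key: interval}
            }
        },
    }
    if person_field is not None:
        # Restrict the statistics to one person (assumes an exact-match field).
        body["query"] = {"term": {person_field: person}}
    return body


if __name__ == "__main__":
    weekly = date_histogram_body(interval="week",
                                 person_field="name_in_es", person="abc")
    print(json.dumps(weekly, indent=2))
```

The resulting dictionary can be sent as the body of a search request against the index that the table was imported into.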

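Elastic already provides an official Hive integration tool, but because the Hive version in use was too old while ES was already at the latest version, every attempt to use the Hive integration failed with errors. To meet the immediate need, I built this tool by hand.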

Script usage:

Environment: Python 2 or Python 3
Command: python hive_to_es.py config=<path to the .ini config file> [optional, tables to import: tables=table1,table2...]
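
The `key=value` argument style of the command above could be handled roughly as follows. This is a minimal sketch, not the script's actual parsing code; `parse_args` is a hypothetical helper, and treating an omitted `tables=` as "use the tables from the configuration file" is an assumption.

```python
import sys


def parse_args(argv):
    """Parse arguments such as: config=./conf.ini tables=student,score

    Returns (config_path, tables). An empty table list is taken to mean
    "use the tables listed in the configuration file" (an assumption).
    """
    params = {}
    for arg in argv:
        key, sep, value = arg.partition("=")
        if not sep:
            raise ValueError("expected key=value, got %r" % arg)
        params[key.strip()] = value.strip()
    if "config" not in params:
        raise ValueError("missing required argument: config=<path to .ini file>")
    tables = [t.strip() for t in params.get("tables", "").split(",") if t.strip()]
    return params["config"], tables


if __name__ == "__main__" and len(sys.argv) > 1:
    config_path, tables = parse_args(sys.argv[1:])
    print(config_path, tables)
```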

Configuration file usage: use a configuration file with the .ini suffix

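;Elasticsearch address (for multiple nodes, separate the addresses with commas ','), username, and password
[es]
hosts = 192.168.3.100:9200
username = elastic
password = 888888

;By default, the ES index to write to equals the database name in Hive or Impala
;A custom global index name can be configured here; all exported tables then go to this index by default
;default_index = tqc_ttt

;Data platform, hive by default
;by = impala

;Hive address, port, database name, user, and related settings
[hive]
host = 127.0.0.1
port = 10000
user = hiveuser
auth_mechanism = PLAIN
database = dbname

;Impala address, port, database name, and related settings
[impala]
host = 127.0.0.1
port = 21050
database = dbname


;Names of the tables to export to ES; each name is also used as the ES type name (configurable)
;If a new result table is produced by SQL and then imported into ES, its name can be chosen freely, but the SQL file path must then be configured below
[table]
tables = student,score,teacher,my_result_a,my_result_b

;Result table my_result_a, produced by SQL
[my_result_a]
;Path of the SQL file (HQL or Impala SQL) that produces the new result table to import into ES; SQL containing comments is not yet supported
sql_path = ./sql/hql_test1.sql

;Define another result table to export to ES
[my_result_b]
sql_path = ./sql/hql_test2.sql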
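
As an illustration of how a file laid out like the one above can be read, here is a minimal Python 3 sketch using the standard `configparser` module. It is not the script's own loader: `load_settings` and the shape of the returned dictionary are hypothetical, and the second host in the sample is made up to show the comma-separated form. Interpolation is disabled because values such as passwords or `LIKE 'abc%'` clauses may contain `%`.

```python
from configparser import ConfigParser

# A shortened copy of the configuration; the second host is a made-up example.
SAMPLE = """
[es]
hosts = 192.168.3.100:9200,192.168.3.101:9200
username = elastic
password = 888888
;by = impala

[hive]
host = 127.0.0.1
port = 10000
user = hiveuser
auth_mechanism = PLAIN
database = dbname

[table]
tables = student,score,my_result_a

[my_result_a]
sql_path = ./sql/hql_test1.sql
"""


def load_settings(text):
    """Read the global settings and the per-table SQL paths (hypothetical helper)."""
    parser = ConfigParser(interpolation=None)  # keep '%' in values literal
    parser.read_string(text)
    es = parser["es"]
    platform = es.get("by", fallback="hive")  # data platform, hive by default
    database = parser[platform]["database"]
    settings = {
        "es_hosts": [h.strip() for h in es["hosts"].split(",") if h.strip()],
        "platform": platform,
        # The index defaults to the database name unless default_index is set.
        "default_index": es.get("default_index", fallback=database),
        "tables": {},
    }
    for name in parser["table"]["tables"].split(","):
        name = name.strip()
        # A table with a sql_path is a result table built from a SQL file.
        sql_path = parser.get(name, "sql_path", fallback=None)
        settings["tables"][name] = {"sql_path": sql_path}
    return settings


if __name__ == "__main__":
    print(load_settings(SAMPLE))
```

Lines starting with `;` are treated as comments by `configparser`, so the commented-out `by = impala` leaves the platform at its default.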


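# For further configuration of an exported table or result table, the following optional settings are available

;The section header is the name of the table or result table to export
;[student]

;Set this target index if the default index is not used
;es_index = tqc_test
;Set this target type if the default type is not used; the default type is the table name
;es_type = tqc_test_type

;Restrict the exported columns
;columns = date,name,age,address,sex

;Choose one column as the id of the ES document
;id_column = student_id

;Column name mapping: here the name column of the Hive table maps to name_in_es in ES, the sex column maps to sex_in_es, and so on
;column_mapping = date=@timestamp,name=name_in_es,sex=sex_in_es

;WHERE clause that restricts rows by column values during export
;where = age>20 AND name LIKE 'abc%'

;Path of the SQL file (HQL or Impala SQL) that produces the new result table to import into ES; SQL containing comments is not yet supported
;sql_path = ./sql/hql_test1.sql

;Paging configuration: prevents a single query from fetching all the data and producing a result set too large for memory; when not configured, the default page size is 30000
;page_size = 1000

;Full vs. incremental: whether to clear all data under the type before importing. Default = true: clear the existing data in the type, then import the new data into ES (full refresh).
;overwrite = false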
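
To make the `columns`, `where`, `page_size`, `column_mapping`, and `id_column` options more concrete, here is a rough sketch of paged reading and row renaming. It is not taken from the script: all function names are hypothetical, an in-memory SQLite database stands in for Hive or Impala, and the `LIMIT ... OFFSET` form is an assumption, since paging syntax and ORDER BY requirements differ between engines.

```python
import sqlite3

DEFAULT_PAGE_SIZE = 30000  # default page size stated in the config comments


def parse_column_mapping(raw):
    """Turn 'date=@timestamp,name=name_in_es' into a dict."""
    mapping = {}
    for pair in raw.split(","):
        if pair.strip():
            source, _, target = pair.partition("=")
            mapping[source.strip()] = target.strip()
    return mapping


def iter_pages(cursor, table, columns, where=None,
               page_size=DEFAULT_PAGE_SIZE, order_by=None):
    """Yield lists of row dicts, fetching at most page_size rows per query.

    Table, column, and WHERE text come from the trusted config file, so they
    are concatenated into the SQL directly.
    """
    base = "SELECT " + ", ".join(columns) + " FROM " + table
    if where:
        base += " WHERE " + where
    # A stable order keeps pages from overlapping or skipping rows.
    base += " ORDER BY " + (order_by or columns[0])
    offset = 0
    while True:
        cursor.execute("%s LIMIT %d OFFSET %d" % (base.replace("%", "%%"),
                                                  int(page_size), offset))
        rows = cursor.fetchall()
        if not rows:
            break
        yield [dict(zip(columns, row)) for row in rows]
        if len(rows) < page_size:
            break  # a short page is the last page
        offset += len(rows)


def to_document(row, mapping, id_column=None):
    """Rename columns per the mapping and pick the document id, if configured."""
    doc = {mapping.get(col, col): value for col, value in row.items()}
    doc_id = row[id_column] if id_column else None
    return doc_id, doc


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for a Hive/Impala connection
    cur = conn.cursor()
    cur.execute("CREATE TABLE student (student_id INTEGER, name TEXT, age INTEGER)")
    cur.executemany("INSERT INTO student VALUES (?, ?, ?)",
                    [(i, "abc%d" % i, 18 + i) for i in range(1, 8)])
    mapping = parse_column_mapping("name=name_in_es")
    for page in iter_pages(cur, "student", ["student_id", "name", "age"],
                           where="age>20 AND name LIKE 'abc%'", page_size=2):
        for row in page:
            print(to_document(row, mapping, id_column="student_id"))
```

Only one page of rows is held in memory at a time, which is the point of the `page_size` setting; each `(id, document)` pair could then be handed to whatever bulk-indexing call the ES client provides.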

TODO: use multithreading


TQCCC / hive_to_es
