Data Science Ipython Notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+7609.09%)
Repository
Personal learning knowledge base covering data warehouse modeling, real-time computing, big data, Java, algorithms, and more.
Stars: ✭ 92 (-67.83%)
Data Algorithms Book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
Stars: ✭ 949 (+231.82%)
Avro Hadoop Starter
Example MapReduce jobs in Java, Hive, Pig, and Hadoop Streaming that work on Avro data.
Stars: ✭ 110 (-61.54%)
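Bigdata Interview
🎯 🌟 [Big data interview questions] A collection of big data interview questions gathered from around the web, together with my own summarized answers. Currently covers interview questions and notes on the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks.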
Stars: ✭ 857 (+199.65%)
Asakusafw
Asakusa Framework
Stars: ✭ 114 (-60.14%)
learning-hadoop-and-spark
Companion to the Learning Hadoop and Learning Spark courses on LinkedIn Learning
Stars: ✭ 146 (-48.95%)
Cascading
Cascading is a feature-rich API for defining and executing complex, fault-tolerant data processing flows locally or on a cluster. See https://github.com/Cascading/cascading for the release repository.
Stars: ✭ 318 (+11.19%)
Bigdata
💎🔥 Big data study notes
Stars: ✭ 488 (+70.63%)
gomrjob
gomrjob - a Go Framework for Hadoop Map Reduce Jobs
Stars: ✭ 39 (-86.36%)
Src
A light-weight distributed stream computing framework for Golang
Stars: ✭ 67 (-76.57%)
big data
A collection of tutorials on Hadoop, MapReduce, Spark, Docker
Stars: ✭ 34 (-88.11%)
XLearning-GPU
qihoo360 xlearning with GPU support; AI on Hadoop
Stars: ✭ 22 (-92.31%)
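TitanDataOperationSystem
The best big data project. The Titan Data Operation System is a full-stack, closed-loop system: a web system for data visualization, log ingestion through flume-kafka-flume, a data warehouse designed in Hive, Spark code that transforms data between warehouse tables and migrates ADS-layer tables to MySQL, and Azkaban for scheduling periodic jobs. Technologies used: Java/Scala, Hadoop, Spark, Hive, Kafka, Flume, Azkaban, SpringBoot, Bootstrap, Echart, etc.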
Stars: ✭ 62 (-78.32%)
mapreduce
An in-process MapReduce library to help you optimize service response time or concurrent task processing.
Stars: ✭ 93 (-67.48%)
knit
Deprecated, please use https://github.com/jcrist/skein or https://github.com/dask/dask-yarn instead
Stars: ✭ 53 (-81.47%)
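leaflet heatmap
A simple visualization of Huzhou call data. Assuming the data volume is too large for the browser to draw the heatmap directly, the heatmap rendering step is moved offline. Apache Spark processes the data in parallel and then renders the heatmap; leafletjs then loads the OpenStreetMap layer and the heatmap layer to provide good interactivity. With rendering currently implemented in Apache Spark, the parallel computation is slower than single-machine computation, possibly because Apache Spark is not well suited to this kind of computation or because I did not design the algorithm well. The Apache Spark heatmap rendering and computation code is at https://github.com/yuanzhaokang/ParallelizeHeatmap.git .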
Stars: ✭ 13 (-95.45%)
py-hdfs-mount
Mount HDFS with FUSE, works with Kerberos!
Stars: ✭ 13 (-95.45%)
infantry
Run MapReduce in the user's browser.
Stars: ✭ 14 (-95.1%)
big-data-lite
Samples for the Oracle Big Data Lite VM
Stars: ✭ 41 (-85.66%)
dtail
DTail is a distributed DevOps tool for tailing, grepping, catting logs and other text files on many remote machines at once.
Stars: ✭ 112 (-60.84%)
cmux
A set of commands for managing CDH clusters using the Cloudera Manager REST API.
Stars: ✭ 34 (-88.11%)
hadoop-docker-lite
Docker build project to set up a lightweight Hadoop cluster containing hadoop, pig, zookeeper, hbase, phoenix, storm, kafka, kafka manager
Stars: ✭ 24 (-91.61%)
Tdigest
t-Digest data structure in Python. Useful for percentiles and quantiles, including in distributed environments like PySpark
Stars: ✭ 274 (-4.2%)
cloud
Cloud computing: environment setup and configuration files for hadoop, hive, hue, oozie, sqoop, hbase, and zookeeper
Stars: ✭ 48 (-83.22%)
Addax
Addax is an open source universal ETL tool that supports most RDBMS and NoSQL databases on the planet, helping you transfer data from any one place to another.
Stars: ✭ 615 (+115.03%)
yuzhouwan
Code Library for My Blog
Stars: ✭ 39 (-86.36%)
platys-modern-data-platform
Support for generating modern platforms dynamically with services such as Kafka, Spark, Streamsets, HDFS, ....
Stars: ✭ 35 (-87.76%)
ros hadoop
Hadoop splittable InputFormat for ROS. Process rosbag with Hadoop, Spark, and other HDFS-compatible systems.
Stars: ✭ 92 (-67.83%)
Android Nosql
Lightweight, simple structured NoSQL database for Android
Stars: ✭ 284 (-0.7%)
Hadoop Mini Clusters
hadoop-mini-clusters provides an easy way to test Hadoop projects directly in your IDE
Stars: ✭ 265 (-7.34%)
ibis
IBIS is a workflow creation-engine that abstracts the Hadoop internals of ingesting RDBMS data.
Stars: ✭ 48 (-83.22%)
cobra-policytool
Manage Apache Atlas and Ranger configuration for your Hadoop environment.
Stars: ✭ 16 (-94.41%)
spark-util
Low-level helpers for Apache Spark libraries and tests
Stars: ✭ 16 (-94.41%)
clusterdock
clusterdock is a framework for creating Docker-based container clusters
Stars: ✭ 26 (-90.91%)
durablefunctions-mapreduce-dotnet
An implementation of MapReduce on top of C# Durable Functions over the NYC 2017 Taxi dataset to compute the average ride time per day
Stars: ✭ 20 (-93.01%)
bigdata-fun
A complete (distributed) BigData stack, running in containers
Stars: ✭ 14 (-95.1%)
connected-component
Map Reduce Implementation of Connected Component on Apache Spark
Stars: ✭ 68 (-76.22%)
flokkr
Documentation placeholder and utilities for all the other containers.
Stars: ✭ 30 (-89.51%)
fsbrowser
Fast desktop client for Hadoop Distributed File System
Stars: ✭ 27 (-90.56%)
swordfish
Open-source distributed workflow scheduling tool that also supports streaming tasks.
Stars: ✭ 35 (-87.76%)
hadoop-deployment-bash
Code for the deployment of Hadoop clusters, written in Bourne or Bourne Again shell.
Stars: ✭ 31 (-89.16%)
basin
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser.
Stars: ✭ 25 (-91.26%)
Guitar
A Simple and Efficient Distributed Multidimensional BI Analysis Engine.
Stars: ✭ 86 (-69.93%)