Segence / Docker Hadoop

A Docker container with a full Hadoop cluster setup with Spark and Zeppelin

Programming Languages

shell

Projects that are alternatives of or similar to Docker Hadoop

Alluxio
Alluxio, data orchestration for analytics and machine learning in the cloud
Stars: ✭ 5,379 (+9861.11%)
Mutual labels:  spark, hadoop
Szt Bigdata
Shenzhen Metro big data passenger flow analysis system 🚇🚄🌟
Stars: ✭ 826 (+1429.63%)
Mutual labels:  spark, hadoop
H2o 3
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Stars: ✭ 5,656 (+10374.07%)
Mutual labels:  spark, hadoop
God Of Bigdata
Focused on big data study and interview preparation; the road to big data mastery starts here. Flink/Spark/Hadoop/Hbase/Hive...
Stars: ✭ 6,008 (+11025.93%)
Mutual labels:  spark, hadoop
Interview Questions Collection
Interview questions organized by knowledge area, covering C++, Java, Hadoop, machine learning and more
Stars: ✭ 21 (-61.11%)
Mutual labels:  spark, hadoop
Data Science Ipython Notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Stars: ✭ 22,048 (+40729.63%)
Mutual labels:  spark, hadoop
Bigdataguide
Big data learning from scratch, including study videos for every stage and interview materials
Stars: ✭ 817 (+1412.96%)
Mutual labels:  spark, hadoop
Iceberg
Iceberg is a table format for large, slow-moving tabular data
Stars: ✭ 393 (+627.78%)
Mutual labels:  spark, hadoop
Bigdata Interview
🎯 🌟 [Big data interview questions] Big-data-related interview questions collected from around the web, together with the author's own answer summaries. Currently covers the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks
Stars: ✭ 857 (+1487.04%)
Mutual labels:  spark, hadoop
Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak, TeamCity and DevOps tools built on the major Linux distros: Alpine, CentOS, Debian, Fedora, Ubuntu
Stars: ✭ 847 (+1468.52%)
Mutual labels:  spark, hadoop
Marmaray
Generic Data Ingestion & Dispersal Library for Hadoop
Stars: ✭ 414 (+666.67%)
Mutual labels:  spark, hadoop
Learning Spark
Learn Spark from scratch; big data learning
Stars: ✭ 37 (-31.48%)
Mutual labels:  spark, hadoop
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Stars: ✭ 406 (+651.85%)
Mutual labels:  spark, hadoop
Pdf
Programming e-books covering C, C#, Docker, Elasticsearch, Git, Hadoop, HeadFirst, Java, Javascript, JVM, Kafka, Linux, Maven, MongoDB, MyBatis, MySQL, Netty, Nginx, Python, RabbitMQ, Redis, Scala, Solr, Spark, Spring, SpringBoot, SpringCloud, TCPIP, Tomcat, Zookeeper, as well as artificial intelligence, big data, concurrent programming, databases, data mining, new interview questions, architecture design, algorithms, computer science, design patterns, software testing, refactoring and optimization, and more categories
Stars: ✭ 12,009 (+22138.89%)
Mutual labels:  spark, hadoop
Big data architect skills
The skills a big data architect should master
Stars: ✭ 400 (+640.74%)
Mutual labels:  spark, hadoop
Useractionanalyzeplatform
A big data platform for analyzing e-commerce user behavior
Stars: ✭ 645 (+1094.44%)
Mutual labels:  spark, hadoop
Wedatasphere
WeDataSphere is a financial level one-stop open-source suitcase for big data platforms. Currently the source code of Scriptis and Linkis has already been released to the open-source community. WeDataSphere, Big Data Made Easy!
Stars: ✭ 372 (+588.89%)
Mutual labels:  spark, hadoop
Bigdl
Building Large-Scale AI Applications for Distributed Big Data
Stars: ✭ 3,813 (+6961.11%)
Mutual labels:  spark, hadoop
Kylo
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
Stars: ✭ 916 (+1596.3%)
Mutual labels:  spark, hadoop
Data Algorithms Book
MapReduce, Spark, Java, and Scala for Data Algorithms Book
Stars: ✭ 949 (+1657.41%)
Mutual labels:  spark, hadoop

Segence Hadoop Docker container

Overview

This Docker container contains a full Hadoop distribution with the following components:

  • Hadoop 2.9.2 (including YARN)
  • Oracle JDK 8
  • Scala 2.11.11
  • Spark 2.4.4
  • Zeppelin 0.8.2

Setting up a new Hadoop cluster

For all of the steps below, the Docker image segence/hadoop:latest has to be built locally or pulled from DockerHub.

  • Build the current image locally: ./build-docker-image.sh
  • Pull from DockerHub: docker pull segence/hadoop:latest
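
Either way, you can confirm that the image is available locally with a standard Docker command (not specific to this project):

docker images segence/hadoop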

The default SSH port of the Docker containers is 2222. This allows, in a standalone cluster setup, each namenode and datanode to run on a separate physical server (or virtual appliance). The servers can still use the default SSH port 22, while the datanodes running inside Docker containers use port 2222. This port is exposed automatically. It's important to note that all exposed ports have to be open on the firewall of the physical servers, so other nodes in the cluster can access them.
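
For example, on a RHEL/CentOS host using firewalld, the container SSH port could be opened like this (a minimal sketch; the exact ports and zone depend on your deployment and are not prescribed by this project):

firewall-cmd --permanent --add-port=2222/tcp
firewall-cmd --reload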

Hadoop host names have to follow the hadoop-* format. Automatic acceptance of SSH host key fingerprints is enabled for any host whose name matches hadoop-*, so Hadoop nodes in the cluster can establish SSH connections to each other without host keys having to be accepted manually. This means that the Docker hosts running the Hadoop namenode and datanode containers have to be resolvable by DNS name; for example, a Microsoft Windows Server DNS server can be used for this.
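
If a dedicated DNS server is not available for a quick test, static entries in /etc/hosts on every host can serve the same purpose (the IP addresses below are purely illustrative):

# run as root on every host in the cluster
cat >> /etc/hosts <<EOF
192.168.1.10  hadoop-namenode
192.168.1.11  hadoop-datanode1
EOF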

Setting up a local Hadoop cluster

This will set up a local Hadoop cluster using bridged networking with one namenode and one datanode.

  1. Go into the cluster-setup/local-cluster directory: cd cluster-setup/local-cluster
  2. Edit the slaves-config/slaves file if you want more slave (datanode) nodes than the single default one. If you add more slaves, also edit the docker-compose.yml file by adding more slave node configurations.
  3. Launch the new cluster: docker-compose up -d
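
Once the containers are up, you can check their state with the standard Docker Compose tooling:

docker-compose ps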

You can log into the namenode (master) by issuing docker exec -it hadoop-namenode bash and into the datanode (slave) by issuing docker exec -it hadoop-datanode1 bash.

By default, the HDFS replication factor is set to 1, because a local Docker cluster is assumed to start with a single datanode. To override this, simply change the HDFS_REPLICATION_FACTOR environment variable in the docker-compose.yml file (and add more datanodes). Note that adding more datanodes means every datanode UI port would have to be exposed on localhost; in that scenario, do not expose the datanode UI ports at all, to avoid port conflicts.
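
To double-check the replication factor the cluster actually picked up, the standard hdfs getconf command can be run inside the namenode container (container name as defined in the compose file):

docker exec -it hadoop-namenode su - hadoop -c "hdfs getconf -confKey dfs.replication"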

Setting up a standalone Hadoop cluster

Preparation

Create the hadoop user on the host system, e.g. useradd hadoop

RHEL/CentOS note

If you encounter the following error message when running a Docker container: WARNING: IPv4 forwarding is disabled. Networking will not work. then turn on packet forwarding (RHEL 7): /sbin/sysctl -w net.ipv4.ip_forward=1
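
The command above only changes the running kernel. To make the setting survive a reboot (a standard RHEL 7 approach, not specific to this project; the file name is arbitrary), persist it in a sysctl configuration file:

echo "net.ipv4.ip_forward = 1" > /etc/sysctl.d/99-ip-forward.conf
/sbin/sysctl -p /etc/sysctl.d/99-ip-forward.conf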

More info

You can use the script cluster-setup/standalone-cluster/setup-rhel.sh to achieve the above, as well as to create the required directories and change their ownership (as in points 1 and 2 below, so you can skip those if you used the RHEL setup script).

The cluster setup runs with host networking, so the Hadoop nodes will get the hostname and DNS settings directly from the host machine. Make sure IP addresses, DNS names and DNS resolution are correctly set up on the host machines.

Namenode setup

  1. Create the following directories on the host:
  • Directory for the HDFS data: /hadoop/data
  • Directory for MapReduce/Spark deployments: /hadoop/deployments
  2. Make the hadoop user own the directories: chown -R hadoop:hadoop /hadoop
  3. Create the file /hadoop/slaves-config/slaves listing each slave node host name on a separate line (see the sketch after this list)
  4. Copy the start-namenode.sh file onto the system (e.g. into /hadoop/start-namenode.sh)
  5. Launch the new namenode: /hadoop/start-namenode.sh [HDFS REPLICATION FACTOR], where HDFS REPLICATION FACTOR is the replication factor for HDFS blocks (defaults to 2).
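
As an end-to-end illustration, the namenode preparation might look like this (the slave host names are hypothetical, the replication factor of 2 simply makes the default explicit, and the commands are assumed to be run as root):

# assumes start-namenode.sh has already been copied to /hadoop (step 4 above)
mkdir -p /hadoop/data /hadoop/deployments /hadoop/slaves-config
chown -R hadoop:hadoop /hadoop
printf 'hadoop-datanode1\nhadoop-datanode2\n' > /hadoop/slaves-config/slaves
/hadoop/start-namenode.sh 2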

Datanode setup

  1. Create the directory for the HDFS data: /hadoop/data
  2. Make the hadoop user own the directories: chown -R hadoop:hadoop /hadoop
  3. Create the file /hadoop/slaves-config/slaves listing each slave node host name on a separate line
  4. Copy the start-datanode.sh file onto the system (e.g. into /hadoop/start-datanode.sh)
  5. Launch the new datanode with its ID: /hadoop/start-datanode.sh <NAMENODE HOST NAME> [HDFS REPLICATION FACTOR], where NAMENODE HOST NAME is the host name of the namenode and HDFS REPLICATION FACTOR is the replication factor for HDFS blocks (defaults to 2, and has to be consistent across all cluster nodes) — see the example after this list.
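
For instance, with a namenode registered in DNS as hadoop-namenode and the default replication factor of 2 (both values are illustrative):

# assumes steps 1-4 above have been completed on this datanode host
/hadoop/start-datanode.sh hadoop-namenode 2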

Starting the cluster

Once either a local or standalone cluster is provisioned, follow the steps below:

  1. Log in to the namenode, e.g. docker exec -it hadoop-namenode bash
  2. Become the hadoop user: su hadoop
  3. Format the HDFS namenode: ~/utils/format-namenode.sh

HDFS has to be restarted once before MapReduce jobs can be run successfully. This is because HDFS creates some data on its first run, and stopping it cleans up that state so that MapReduce jobs can subsequently be run through YARN.

Execute the following commands as root:

  1. Start Hadoop: service hadoop start
  2. Stop Hadoop: service hadoop stop
  3. Start Hadoop again: service hadoop start

Restarting HDFS has to be done after formatting the namenode, otherwise the first MapReduce job will always fail and the cluster will terminate.

To verify that the datanodes are correctly registered, issue the following on the namenode: hdfs dfsadmin -report
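
Similarly, to confirm that the NodeManagers have registered with YARN, the standard YARN CLI can be used on the namenode:

yarn node -list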

Interacting with HDFS

Local access

When you're logged into the namenode, simply use the hdfs dfs ... command to interact with the HDFS cluster. E.g. listing the contents of the root of the file system: hdfs dfs -ls /

Remote access

  1. Make sure you've got the same version of Hadoop downloaded and extracted on your local system
  2. Go into your Hadoop installation directory, and then into the bin directory.
  3. For example, to list the contents of your HDFS cluster, use: ./hdfs dfs -ls hdfs://localhost:8020/ (see the further example below)
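
Building on that, copying a local file into the cluster works the same way. This is a minimal sketch assuming the namenode's port 8020 is reachable from your machine (the file and directory names are hypothetical; replace localhost with the namenode's host name where needed):

./hdfs dfs -mkdir -p hdfs://localhost:8020/user/hadoop/example
./hdfs dfs -put ./example.txt hdfs://localhost:8020/user/hadoop/example/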

Web interfaces

List of web interfaces

Web UI                         URL
Hadoop Name Node               http://localhost:50070
Hadoop Data Node               http://localhost:50075
WebHDFS REST API               http://localhost:50070/webhdfs/v1
NodeManager UI on Data Node    http://localhost:8042
YARN Resource Manager          http://localhost:8088
Spark UI                       http://localhost:4040
Zeppelin UI                    http://localhost:9001

Change localhost to the IP address or host name of the namenode.

WebHDFS REST API
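
As a minimal illustration (using the default namenode address from the table above; the path is only an example), a directory listing can be requested with curl:

curl -i "http://localhost:50070/webhdfs/v1/user/hadoop?op=LISTSTATUS"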

Running sample jobs using YARN

Running the word count example

The script will create a directory called input with some sample files, upload them into the HDFS cluster, run a simple MapReduce job and print the results to the console.

  1. Log in to the namenode, e.g. docker exec -it hadoop-namenode bash
  2. Become the hadoop user: su hadoop
  3. Go into the home directory of the hadoop user: cd
  4. Run ~/utils/run-wordcount.sh
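
After the job finishes, the uploaded sample files remain in HDFS (the Spark example below reads them from the same location); you can list them with:

hdfs dfs -ls /user/hadoop/input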

Running a sample interactive Spark job

Running the word count example leaves some files in HDFS. The example Spark job below reads those files and simply splits the file contents on whitespace, then prints out the results.

Once you're on the namenode, issue spark-shell:

// read the word count input files from HDFS
val input = sc.textFile("/user/hadoop/input")
// split each line on whitespace
val splitContent = input.map(r => r.split(" "))
// print every split line (on YARN this prints on the executors, not the driver)
splitContent.foreach(line => println(line.toSeq))

To run it against the YARN cluster, launch the shell with spark-shell --master=yarn. Note that println then does not execute in the driver, because it is applied to elements of the RDD: it runs in the executors, which only happen to live in the same process as the driver in local mode. In general you should not expect foreach(println) to print results in the driver; collect the data to the driver first (e.g. with take or collect) if you need the output locally.

Running a notebook in Zeppelin

Zeppelin has to be run as the hadoop user, so make sure to start the service as that user. It won't run properly with all interpreters under a different user!

  1. Log in to the namenode, e.g. docker exec -it hadoop-namenode bash
  2. Become the hadoop user: su hadoop
  3. Go into the home directory of the hadoop user: cd
  4. Start Zeppelin: zeppelin-daemon.sh start
  5. Open the Zeppelin UI in your browser
  6. On the home page, click on 'Create new note'
  7. The snippet below is written in R. It loads the contents of the input directory used in the sample MapReduce job:
%r
df <- read.df(sqlContext, "/user/hadoop/input/", source = "text")
head(df)
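
If the Zeppelin UI (port 9001 in the table above) does not come up, the daemon's state can be checked with the standard Zeppelin daemon command, again as the hadoop user:

zeppelin-daemon.sh status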