ly16 / GooglePlay-Web-Crawler

Licence: other
A MapReduce project built with Hadoop, Nutch, AWS EMR, Pig, Tez, and Hive

Programming Languages

java
68154 projects - #9 most used programming language
PigLatin
29 projects

Projects that are alternatives of or similar to GooglePlay-Web-Crawler

Repository
Personal learning knowledge base covering data warehouse modeling, real-time computing, big data, Java, algorithms, and more.
Stars: ✭ 92 (+411.11%)
Mutual labels:  hive, hadoop, mapreduce
cloud
Cloud computing: environment setup and configuration files for hadoop, hive, hue, oozie, sqoop, hbase, and zookeeper
Stars: ✭ 48 (+166.67%)
Mutual labels:  hive, hadoop, pig
Bigdata
💎🔥 Big data study notes
Stars: ✭ 488 (+2611.11%)
Mutual labels:  hive, hadoop, mapreduce
Bigdata Notes
A beginner's guide to big data ⭐
Stars: ✭ 10,991 (+60961.11%)
Mutual labels:  hive, hadoop, mapreduce
web-click-flow
Offline log analysis of website clickstreams
Stars: ✭ 14 (-22.22%)
Mutual labels:  hive, hadoop, mapreduce
bigdata-doc
Big data study notes, learning roadmaps, and a collection of technical case studies.
Stars: ✭ 37 (+105.56%)
Mutual labels:  hive, hadoop, mapreduce
Avro Hadoop Starter
Example MapReduce jobs in Java, Hive, Pig, and Hadoop Streaming that work on Avro data.
Stars: ✭ 110 (+511.11%)
Mutual labels:  hive, hadoop, mapreduce
learning-hadoop-and-spark
Companion to Learning Hadoop and Learning Spark courses on Linked In Learning
Stars: ✭ 146 (+711.11%)
Mutual labels:  emr, hadoop, mapreduce
qs-hadoop
Learning the big data ecosystem
Stars: ✭ 18 (+0%)
Mutual labels:  hadoop, mapreduce
BigInsights-on-Apache-Hadoop
Example projects for 'BigInsights for Apache Hadoop' on IBM Bluemix
Stars: ✭ 21 (+16.67%)
Mutual labels:  hive, hadoop
hadoopoffice
HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)
Stars: ✭ 56 (+211.11%)
Mutual labels:  hive, hadoop
dockerfiles
Multi docker container images for main Big Data Tools. (Hadoop, Spark, Kafka, HBase, Cassandra, Zookeeper, Zeppelin, Drill, Flink, Hive, Hue, Mesos, ... )
Stars: ✭ 29 (+61.11%)
Mutual labels:  hive, hadoop
ETL-Starter-Kit
📁 Extract, Transform, Load (ETL) 👷 refers to a process in database usage and especially in data warehousing. This repository contains a starter kit featuring ETL related work.
Stars: ✭ 21 (+16.67%)
Mutual labels:  hive, pig
xxhadoop
Data Analysis Using Hadoop/Spark/Storm/ElasticSearch/MachineLearning etc. This is My Daily Notes/Code/Demo. Don't fork, Just star !
Stars: ✭ 37 (+105.56%)
Mutual labels:  hive, hadoop
Data-pipeline-project
Data pipeline project
Stars: ✭ 18 (+0%)
Mutual labels:  hadoop, mapreduce
hadoop-etl-udfs
The Hadoop ETL UDFs are the main way to load data from Hadoop into EXASOL
Stars: ✭ 17 (-5.56%)
Mutual labels:  hive, hadoop
the-apache-ignite-book
All code samples, scripts and more in-depth examples for The Apache Ignite Book. Include Apache Ignite 2.6 or above
Stars: ✭ 65 (+261.11%)
Mutual labels:  hive, hadoop
beekeeper
Service for automatically managing and cleaning up unreferenced data
Stars: ✭ 43 (+138.89%)
Mutual labels:  hive, s3
liquibase-impala
Liquibase extension to add Impala Database support
Stars: ✭ 23 (+27.78%)
Mutual labels:  hive, hadoop
aaocp
A big data project that analyzes user behavior logs
Stars: ✭ 53 (+194.44%)
Mutual labels:  hive, hadoop

GooglePlay Web Crawler

What is the Hadoop Ecosystem?

  • The core components of the Hadoop ecosystem are HDFS, YARN, and the engines and applications that run on top of them, such as MapReduce, Tez, Nutch, Pig, Hive, and Spark.
  • HDFS handles data storage. It consists of a NameNode, which keeps the file system metadata, and DataNodes, which store the data blocks.
  • YARN handles resource allocation. It consists of a ResourceManager and per-node NodeManagers.
  • Applications like Pig and Hive are higher-level language processors. They make it much easier to write and run MapReduce jobs.
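
To make the MapReduce model behind these tools more concrete, here is a minimal in-memory sketch in plain Java with no Hadoop dependency. It only imitates the map, shuffle, and reduce phases on a handful of records. The class name `MapReduceSketch`, the `appId,category` record layout, and the sample app IDs are made up for illustration and do not come from this project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {

    // Map phase: turn one input record ("appId,category") into a (category, 1) pair.
    static Map.Entry<String, Integer> map(String record) {
        String category = record.split(",")[1].trim();
        return Map.entry(category, 1);
    }

    // Shuffle phase: group all values that share the same key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values of one key into a single count.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases in sequence and returns category -> count.
    static Map<String, Integer> countByCategory(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            pairs.add(map(record));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample records, not real crawl output.
        List<String> records = List.of(
            "com.example.chat,Communication",
            "com.example.mail,Communication",
            "com.example.chess,Games");
        System.out.println(countByCategory(records));
    }
}
```

On a real cluster the same three phases run in parallel across many nodes. A Pig or Hive script typically expresses this kind of job as a single group-and-count statement, which is where the convenience of the higher-level tools comes from.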

How does the web crawler work?

  • Use a customized Nutch to crawl app metadata from GooglePlay
  • Inject the seed URL into nutchDB
  • Generate the list of URLs to crawl from nutchDB
  • Fetch app metadata from the HTML pages
  • Parse the extracted metadata and outlinks
  • Update nutchDB with the new outlinks
  • A Pig LoadFunc transforms nutchDB into a readable text file
  • Create tables and manage the data with Hive
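
The inject, generate, fetch, parse, and update steps above form a loop. The following is a rough in-memory sketch of that loop in plain Java, not the code of the customized Nutch. The class `CrawlLoopSketch`, the `fetchAndParse` helper, and the tiny `web` map that stands in for real pages are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CrawlLoopSketch {

    // Stand-in for the web: each URL maps to the outlinks found on its page.
    // A real crawler would download and parse HTML here instead.
    static List<String> fetchAndParse(Map<String, List<String>> web, String url) {
        return web.getOrDefault(url, List.of());
    }

    // Runs up to maxRounds generate/fetch/parse/update cycles and
    // returns every URL that was fetched, in fetch order.
    static Set<String> crawl(Map<String, List<String>> web, String seed, int maxRounds) {
        Set<String> known = new LinkedHashSet<>();   // stand-in for the URL database
        Set<String> fetched = new LinkedHashSet<>();
        known.add(seed);                             // inject the seed

        for (int round = 0; round < maxRounds; round++) {
            // generate: pick known URLs that have not been fetched yet
            List<String> fetchList = new ArrayList<>();
            for (String url : known) {
                if (!fetched.contains(url)) {
                    fetchList.add(url);
                }
            }
            if (fetchList.isEmpty()) {
                break;                               // nothing new to crawl
            }
            for (String url : fetchList) {
                List<String> outlinks = fetchAndParse(web, url); // fetch + parse
                fetched.add(url);
                known.addAll(outlinks);              // update with new outlinks
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        // Hypothetical link graph: app/a links to app/b and app/c, and so on.
        Map<String, List<String>> web = Map.of(
            "app/a", List.of("app/b", "app/c"),
            "app/b", List.of("app/a"),
            "app/c", List.of("app/d"));
        System.out.println(crawl(web, "app/a", 2));
    }
}
```

Each round only fetches URLs discovered in earlier rounds, so the number of rounds bounds how far the crawl moves away from the seed.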

Command Line

  • clone Nutch and check out release 1.12
git clone https://github.com/apache/nutch
cd nutch
git checkout release-1.12
  • customize Nutch
patch -p1 < /googleplaycrawler/googleplaycrawler.patch
  • run googleplaycrawler on a single-node Nutch cluster
echo "https://play.google.com/store/apps/details?id=com.facebook.orca" > seed

hadoop fs -put seed 

hadoop jar build/apache-nutch-1.12.job org.apache.nutch.googleplay.GooglePlayCrawler seed -numFetchers 10

  • check output

hadoop fs -text file:///xxxxxx/nutchdb/segments/xxxxx/parse_data/part-00000/data

  • fix the data skew in the job
patch -p1 < fixskew.patch
  • upload the seed file and run the web crawler on AWS EMR

  • AWS S3

-- register the loader jar and the jars it depends on
register target/nutchdbloader-0.0.1-SNAPSHOT.jar;
register /home/hadoop/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-aws-2.7.3.jar;
register nutch-1.12.jar;
-- load the parsed segment data from S3 with the custom LoadFunc
loaded = load 's3n://test/nutchdb/segments/*/parse_data/part-*/data' using com.example.NutchParsedDataLoader();
-- drop records whose first field is null
filtered = filter loaded by $0 is not null;
store filtered into 'output';
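
The data-skew step earlier in this list applies `fixskew.patch` without showing what the patch changes, so the following is only a guess at the kind of problem involved. One common cause of skew in a single-site crawl is partitioning work by host: every GooglePlay URL shares the same host, so all of them land in one partition. The sketch below compares host-based and URL-based hash partitioning in plain Java. The class `SkewSketch`, its methods, and the generated app IDs are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class SkewSketch {

    // Assigns a key to one of numPartitions buckets by its hash code.
    static int partition(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Counts how many URLs land in each partition when keyed by host only.
    static int[] byHost(List<String> urls, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String url : urls) {
            String host = URI.create(url).getHost();
            counts[partition(host, numPartitions)]++;
        }
        return counts;
    }

    // Counts how many URLs land in each partition when keyed by the full URL.
    static int[] byUrl(List<String> urls, int numPartitions) {
        int[] counts = new int[numPartitions];
        for (String url : urls) {
            counts[partition(url, numPartitions)]++;
        }
        return counts;
    }

    // Size of the largest partition, i.e. the work of the slowest task.
    static int max(int[] counts) {
        int best = 0;
        for (int c : counts) {
            best = Math.max(best, c);
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up app IDs on the same host as the seed URL.
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            urls.add("https://play.google.com/store/apps/details?id=com.example.app" + i);
        }
        System.out.println("largest partition by host: " + max(byHost(urls, 10)));
        System.out.println("largest partition by URL:  " + max(byUrl(urls, 10)));
    }
}
```

With host-based keys one task receives every URL while the others sit idle. Keying on the full URL spreads the same input across the partitions, at the cost of losing per-host grouping.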

Text results

