Alternatives and detailed information of GooglePlay-Web-Crawler

Multi docker container images for main Big Data Tools. (Hadoop, Spark, Kafka, HBase, Cassandra, Zookeeper, Zeppelin, Drill, Flink, Hive, Hue, Mesos, ... )

Stars: ✭ 29 (+61.11%)

Mutual labels: hive, hadoop

ETL-Starter-Kit

📁 Extract, Transform, Load (ETL) 👷 refers to a process in database usage and especially in data warehousing. This repository contains a starter kit featuring ETL related work.

Stars: ✭ 21 (+16.67%)

Mutual labels: hive, pig

xxhadoop

Data Analysis Using Hadoop/Spark/Storm/ElasticSearch/MachineLearning etc. This is My Daily Notes/Code/Demo. Don't fork, Just star !

Stars: ✭ 37 (+105.56%)

Mutual labels: hive, hadoop

Data-pipeline-project

Data pipeline project

Stars: ✭ 18 (+0%)

Mutual labels: hadoop, mapreduce

hadoop-etl-udfs

The Hadoop ETL UDFs are the main way to load data from Hadoop into EXASOL

Stars: ✭ 17 (-5.56%)

Mutual labels: hive, hadoop

the-apache-ignite-book

All code samples, scripts and more in-depth examples for The Apache Ignite Book. Include Apache Ignite 2.6 or above

Stars: ✭ 65 (+261.11%)

Mutual labels: hive, hadoop

beekeeper

Service for automatically managing and cleaning up unreferenced data

Stars: ✭ 43 (+138.89%)

Mutual labels: hive, s3

liquibase-impala

Liquibase extension to add Impala Database support

Stars: ✭ 23 (+27.78%)

Mutual labels: hive, hadoop

aaocp

A big data project for analyzing user behavior logs

Stars: ✭ 53 (+194.44%)

Mutual labels: hive, hadoop

GooglePlay Web Crawler

What is the Hadoop ecosystem?

The core components of Hadoop are HDFS and YARN, together with the engines and applications that run on top of them, such as MapReduce, Tez, Nutch, Pig, Hive, and Spark.
HDFS consists of a NameNode and DataNodes and handles data storage.
YARN consists of a ResourceManager and NodeManagers and handles resource allocation.
Applications such as Pig and Hive are higher-level language processors. They make it much easier to run MapReduce jobs.
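
To make concrete what Pig and Hive abstract away, here is a minimal in-memory sketch of the map, shuffle, and reduce phases, written in plain Python rather than against the Hadoop API. The records and the function names are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (key, 1) pairs, as a MapReduce mapper would."""
    for category, _app_id in records:
        yield category, 1


def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical (category, app id) records, for illustration only.
records = [("social", "app.a"), ("games", "app.b"), ("social", "app.c")]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)  # {'social': 2, 'games': 1}
```

In Hive, a count like this would typically be a single GROUP BY query over a table, which is the kind of simplification the paragraph above refers to.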

How does the web crawler work?

A customized build of Nutch crawls app metadata from Google Play:

Inject the seed URLs into nutchDB
Generate the URLs to crawl from nutchDB
Fetch app metadata from the HTML pages
Parse the extracted metadata and outlinks
Update nutchDB with the new outlinks
A Pig LoadFunc transforms nutchDB into readable text files
Create tables and manage the data with Hive
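
The first five steps form a loop, which can be sketched in plain Python as follows. This is a simplified in-memory model, not Nutch's implementation: `crawl`, `crawl_db`, and the stubbed `fetch` and `parse` callables are hypothetical names, and the toy `pages` dictionary stands in for real HTML pages.

```python
def crawl(seeds, fetch, parse, rounds=2):
    """Run a simplified inject/generate/fetch/parse/update loop."""
    # Inject: every seed starts as unfetched in the crawl database.
    crawl_db = {url: "unfetched" for url in seeds}
    parsed = {}
    for _ in range(rounds):
        # Generate: select the URLs that have not been fetched yet.
        batch = [u for u, status in crawl_db.items() if status == "unfetched"]
        if not batch:
            break
        for url in batch:
            html = fetch(url)                 # Fetch the page.
            metadata, outlinks = parse(html)  # Parse metadata and outlinks.
            parsed[url] = metadata
            crawl_db[url] = "fetched"
            # Update: newly discovered outlinks join the crawl database.
            for link in outlinks:
                crawl_db.setdefault(link, "unfetched")
    return crawl_db, parsed


# Toy "web": each page maps to (metadata, outlinks). Purely illustrative.
pages = {"a": ("App A", ["b", "c"]), "b": ("App B", ["a"]), "c": ("App C", [])}
crawl_db, parsed = crawl(
    seeds=["a"],
    fetch=lambda url: pages.get(url, ("", [])),
    parse=lambda page: page,
)
print(sorted(parsed))  # ['a', 'b', 'c']
```

Each round only fetches URLs discovered in earlier rounds, so the number of rounds bounds how far the crawl moves away from the seeds.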

Command Line

git clone

git clone https://github.com/apache/nutch
cd nutch
git checkout release-1.12

customize nutch

patch -p1 < /googleplaycrawler/googleplaycrawler.patch

run googleplaycrawler on a single Nutch cluster

echo "https://play.google.com/store/apps/details?id=com.facebook.orca" > seed

hadoop fs -put seed 

hadoop jar build/apache-nutch-1.12.job org.apache.nutch.googleplay.GooglePlayCrawler seed -numFetchers 10
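
The seed file above contains one app details URL. If you want to start from several apps, a small helper like the following could generate the file. This is only a sketch: the function names are hypothetical, and the URL pattern is assumed from the single seed shown above.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Base URL taken from the seed example; assumed to hold for other apps too.
DETAILS_URL = "https://play.google.com/store/apps/details"


def seed_url(package_id):
    """Build the details-page URL for one package id."""
    return DETAILS_URL + "?" + urlencode({"id": package_id})


def extract_package_id(url):
    """Return the value of the id query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get("id")
    return values[0] if values else None


def write_seed_file(path, package_ids):
    """Write one seed URL per line, the layout used by the seed file above."""
    with open(path, "w", encoding="utf-8") as handle:
        for package_id in package_ids:
            handle.write(seed_url(package_id) + "\n")


print(seed_url("com.facebook.orca"))
```

Calling `write_seed_file("seed", ["com.facebook.orca"])` would produce the same file as the `echo` command, ready for `hadoop fs -put`.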

check output


hadoop fs -text file:///xxxxxx/nutchdb/segments/xxxxx/parse_data/part-00000/data

fix the data skew in the job

patch -p1 < fixskew.patch
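
This README does not describe what fixskew.patch changes, so the following is only an illustration of how data skew can arise in a crawl like this one, not a description of the patch. It assumes, hypothetically, that work is partitioned by a hash of the URL's host. Since every URL here lives on play.google.com, all of them would then land in the same partition. The function names and the generated package ids are made up for the example.

```python
import zlib
from collections import Counter
from urllib.parse import urlparse


def partition(key, num_partitions):
    """Assign a key to a partition using a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def load_by_host(urls, num_partitions):
    """Count URLs per partition when the partition key is the host name."""
    return Counter(partition(urlparse(u).netloc, num_partitions) for u in urls)


def load_by_url(urls, num_partitions):
    """Count URLs per partition when the partition key is the full URL."""
    return Counter(partition(u, num_partitions) for u in urls)


# Hypothetical package ids on a single host, for illustration only.
urls = [
    f"https://play.google.com/store/apps/details?id=com.example.app{i}"
    for i in range(1000)
]
by_host = load_by_host(urls, 10)
by_url = load_by_url(urls, 10)
print("by host:", len(by_host), "partition(s) used, largest holds", max(by_host.values()))
print("by URL: ", len(by_url), "partition(s) used, largest holds", max(by_url.values()))
```

With a host-based key, one partition holds all 1000 URLs while the others sit idle. Keying on the full URL spreads the same URLs across partitions.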

upload the seeds file and run the web crawler on AWS EMR
AWS S3

-- Register the loader jar, the Hadoop AWS connector, and Nutch.
register target/nutchdbloader-0.0.1-SNAPSHOT.jar;
register /home/hadoop/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-aws-2.7.3.jar;
register nutch-1.12.jar;
-- Load the parsed segment data from S3 and drop records whose first field is null.
loaded = load 's3n://test/nutchdb/segments/*/parse_data/part-*/data' using com.example.NutchParsedDataLoader();
filtered = filter loaded by $0 is not null;
store filtered into 'output';

text results

ly16 / GooglePlay-Web-Crawler
