
commoncrawl / cc-pyspark

License: MIT
Process Common Crawl data with Python and Spark

Programming Languages

python

Projects that are alternatives of or similar to cc-pyspark

Live log analyzer spark
Spark application for the analysis of Apache access logs and anomaly detection, along with a Medium article.
Stars: ✭ 14 (-90.48%)
Mutual labels:  spark, pyspark
W2v
Word2Vec models with Twitter data using Spark. Blog:
Stars: ✭ 64 (-56.46%)
Mutual labels:  spark, pyspark
Sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
Stars: ✭ 954 (+548.98%)
Mutual labels:  spark, pyspark
Scriptis
Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis.
Stars: ✭ 696 (+373.47%)
Mutual labels:  spark, pyspark
Hnswlib
Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs
Stars: ✭ 108 (-26.53%)
Mutual labels:  spark, pyspark
Spark Tdd Example
A simple Spark TDD example
Stars: ✭ 23 (-84.35%)
Mutual labels:  spark, pyspark
Pysparkgeoanalysis
🌐 Interactive Workshop on GeoAnalysis using PySpark
Stars: ✭ 63 (-57.14%)
Mutual labels:  spark, pyspark
data-algorithms-with-spark
O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian
Stars: ✭ 34 (-76.87%)
Mutual labels:  spark, pyspark
Relation extraction
Relation Extraction using Deep learning(CNN)
Stars: ✭ 96 (-34.69%)
Mutual labels:  spark, pyspark
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+810.2%)
Mutual labels:  spark, pyspark
Pyspark Example Project
Example project implementing best practices for PySpark ETL jobs and applications.
Stars: ✭ 633 (+330.61%)
Mutual labels:  spark, pyspark
Eat pyspark in 10 days
pyspark🍒🥭 is delicious,just eat it!😋😋
Stars: ✭ 116 (-21.09%)
Mutual labels:  spark, pyspark
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Stars: ✭ 406 (+176.19%)
Mutual labels:  spark, pyspark
Sparkling Titanic
Training models with Apache Spark, PySpark for Titanic Kaggle competition
Stars: ✭ 12 (-91.84%)
Mutual labels:  spark, pyspark
basin
Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser
Stars: ✭ 25 (-82.99%)
Mutual labels:  spark, pyspark
Optimus
🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Stars: ✭ 986 (+570.75%)
Mutual labels:  spark, pyspark
aut
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
Stars: ✭ 111 (-24.49%)
Mutual labels:  spark, pyspark
spark-extension
A library that provides useful extensions to Apache Spark and PySpark.
Stars: ✭ 25 (-82.99%)
Mutual labels:  spark, pyspark
Spark python ml examples
Spark 2.0 Python Machine Learning examples
Stars: ✭ 87 (-40.82%)
Mutual labels:  spark, pyspark
Pyspark Cheatsheet
🐍 Quick reference guide to common patterns & functions in PySpark.
Stars: ✭ 108 (-26.53%)
Mutual labels:  spark, pyspark

Common Crawl Logo

Common Crawl PySpark Examples

This project provides examples of how to process the Common Crawl dataset with Apache Spark and Python, such as counting web server names sent in HTTP response headers or counting words in pages selected via the columnar URL index.

Further information about the examples and available options is shown via the command-line option --help.
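
To give a feel for how the examples are structured, below is a rough sketch of a job in the style of this project. It relies on the helper class CCSparkJob from sparkcc.py and its per-record hook; the exact names, signatures and the Content-Type counting logic are assumptions for illustration, so check sparkcc.py and the bundled examples (e.g., server_count.py) for the authoritative API.

from sparkcc import CCSparkJob


class ContentTypeCountJob(CCSparkJob):
    """Hypothetical example job: count the Content-Type values
       of HTTP responses in WARC files."""

    name = "ContentTypeCount"  # assumed: name used for the Spark application

    def process_record(self, record):
        # assumed hook: called once per (warcio) record of the input files
        if record.rec_type != 'response':
            return
        content_type = record.http_headers.get_header('Content-Type')
        yield (content_type or '(no content type)'), 1


if __name__ == '__main__':
    job = ContentTypeCountJob()
    job.run()  # assumed: parses command-line arguments and runs the Spark job

Such a job is submitted the same way as the shipped examples, e.g. via spark-submit together with --py-files sparkcc.py (see "Running in Spark cluster over large amounts of data" below).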

Setup

To develop and test locally, you will need to install

  • Spark, and
  • all required Python modules, by running

pip install -r requirements.txt

  • (optionally, and only if you want to query the columnar index) S3 support libraries so that Spark can load the columnar index from S3 (see "Installation of S3 Support Libraries" below)

Compatibility and Requirements

Tested with Spark 2.1.0 – 2.4.6 in combination with Python 2.7 or 3.5, 3.6, 3.7, and with Spark 3.0.0 in combination with Python 3.7 and 3.8

Get Sample Data

To develop locally, you'll need at least three data files – one for each format (WARC, WAT, WET) used in at least one of the examples. Running get-data.sh will download the sample data. It also writes input files containing

  • sample input as file:// URLs
  • all input of one monthly crawl as s3:// URLs

Note that the sample data is from an older crawl (CC-MAIN-2017-13, run in March 2017). If you want to use more recent data, please visit the Common Crawl site.
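
To get a quick look at the records before writing a Spark job, you can iterate over one of the downloaded sample files with the warcio module (one of the project's Python dependencies). A minimal sketch, assuming a hypothetical local path for the sample WARC file:

from warcio.archiveiterator import ArchiveIterator

# placeholder path - adjust to wherever get-data.sh stored the sample WARC file
warc_path = 'crawl-data/sample.warc.gz'

with open(warc_path, 'rb') as stream:
    for record in ArchiveIterator(stream):
        # print the target URI of every HTTP response record
        if record.rec_type == 'response':
            print(record.rec_headers.get_header('WARC-Target-URI'))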

Running locally

First, point the environment variable SPARK_HOME to your Spark installation. Then submit a job via

$SPARK_HOME/bin/spark-submit ./server_count.py \
	--num_output_partitions 1 --log_level WARN \
	./input/test_warc.txt servernames

This will count web server names sent in HTTP response headers for the sample WARC input and store the resulting counts in the SparkSQL table "servernames" in your warehouse location defined by spark.sql.warehouse.dir (usually in your working directory as ./spark-warehouse/servernames).

The output table can be accessed via SparkSQL, e.g.,

$SPARK_HOME/bin/pyspark
>>> df = sqlContext.read.parquet("spark-warehouse/servernames")
>>> for row in df.sort(df.val.desc()).take(10): print(row)
... 
Row(key=u'Apache', val=9396)
Row(key=u'nginx', val=4339)
Row(key=u'Microsoft-IIS/7.5', val=3635)
Row(key=u'(no server in HTTP header)', val=3188)
Row(key=u'cloudflare-nginx', val=2743)
Row(key=u'Microsoft-IIS/8.5', val=1459)
Row(key=u'Microsoft-IIS/6.0', val=1324)
Row(key=u'GSE', val=886)
Row(key=u'Apache/2.2.15 (CentOS)', val=827)
Row(key=u'Apache-Coyote/1.1', val=790)

See also

Running in Spark cluster over large amounts of data

As the Common Crawl dataset lives in the Amazon Public Datasets program, you can access and process it on Amazon AWS (in the us-east-1 AWS region) without incurring any transfer costs. The only cost that you incur is the cost of the machines running your Spark cluster.

  1. spinning up the Spark cluster: AWS EMR contains a ready-to-use Spark installation, but you'll also find multiple descriptions on the web of how to deploy Spark on a cheap cluster of AWS spot instances. See also launching Spark on a cluster.

  2. choose appropriate cluster-specific settings when submitting jobs and also check for relevant command-line options (e.g., --num_input_partitions or --num_output_partitions, see below)

  3. don't forget to deploy all dependencies in the cluster, see advanced dependency management

  4. the file sparkcc.py also needs to be deployed or added via the argument --py-files sparkcc.py to spark-submit, as in the sketch below. Note: some of the examples require further Python files as dependencies.
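
For illustration, a submission in that style might look as follows; the cluster-sizing options and the input listing all_warc_files.txt are placeholders that depend on your cluster and on the input files written by get-data.sh:

$SPARK_HOME/bin/spark-submit \
    --py-files sparkcc.py \
    --num-executors 8 --executor-memory 4g \
    ./server_count.py \
    --num_input_partitions 64 --num_output_partitions 8 \
    ./input/all_warc_files.txt servernames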

Command-line options

All examples show the available command-line options if called with the parameter --help or -h, e.g.

$SPARK_HOME/bin/spark-submit ./server_count.py --help

Overwriting Spark configuration properties

There are many Spark configuration properties which allow you to tune the job execution or output; see, for example, tuning Spark or EMR Spark memory tuning.

It's possible to overwrite Spark properties when submitting the job:

$SPARK_HOME/bin/spark-submit \
    --conf spark.sql.warehouse.dir=myWareHouseDir \
    ... (other Spark options, flags, config properties) \
    ./server_count.py \
    ... (program-specific options)

Installation of S3 Support Libraries

While WARC/WAT/WET files are read using boto3, accessing the columnar URL index (see option --query of CCIndexSparkJob) is done directly by the SparkSQL engine and requires that S3 support libraries are available. These libraries are usually provided when the Spark job is run on a Hadoop cluster running on AWS (e.g., EMR). However, they are not bundled with every Spark distribution and are usually absent when running Spark locally (not in a Hadoop cluster). In these situations, the easiest way is to add the libraries as required packages by adding --packages org.apache.hadoop:hadoop-aws:3.2.0 to the arguments of spark-submit. This lets Spark manage the dependencies: the hadoop-aws package and its transitive dependencies are downloaded as Maven dependencies. Note that the required version of the hadoop-aws package depends on the Hadoop version bundled with your Spark installation, e.g., Spark 3.0.0 is bundled with Hadoop 3.2.0 (spark-3.0.0-bin-hadoop3.2.tgz).

Please also note that:

  • the scheme of the URL referencing the columnar index depends on the actual S3 file system implementation: it's s3:// on EMR but s3a:// when using the s3a implementation provided by hadoop-aws (e.g., when running outside EMR).
  • data can be accessed anonymously using the org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider. This requires Hadoop 2.9 or newer.
  • without anonymous access valid AWS credentials need to be provided, e.g., by setting spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key in the Spark configuration.
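
As an illustration, the columnar index can also be read anonymously in a plain PySpark session outside of CCIndexSparkJob. The sketch below assumes that the hadoop-aws package and its dependencies are already on the classpath (e.g., added via --packages as described above); the query itself is only an example:

from pyspark.sql import SparkSession

# anonymous access to the columnar index via s3a
# (assumes hadoop-aws and its dependencies are on the classpath)
spark = (SparkSession.builder
         .appName("cc-index-anonymous-read")
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
         .getOrCreate())

df = spark.read.load("s3a://commoncrawl/cc-index/table/cc-main/warc/")
df.createOrReplaceTempView("ccindex")

# example: count WARC records of one crawl hosted under the .is top-level domain
spark.sql("""
    SELECT COUNT(*) AS n
    FROM ccindex
    WHERE crawl = 'CC-MAIN-2020-24' AND subset = 'warc' AND url_host_tld = 'is'
""").show()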

Example call to count words in 10 WARC records hosted under the .is top-level domain:

$SPARK_HOME/bin/spark-submit \
    --packages org.apache.hadoop:hadoop-aws:3.2.0 \
    --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
    ./cc_index_word_count.py \
    --query "SELECT url, warc_filename, warc_record_offset, warc_record_length, content_charset FROM ccindex WHERE crawl = 'CC-MAIN-2020-24' AND subset = 'warc' AND url_host_tld = 'is' LIMIT 10" \
    s3a://commoncrawl/cc-index/table/cc-main/warc/ \
    myccindexwordcountoutput \
    --num_output_partitions 1 \
    --output_format json

Columnar index and schema merging

The schema of the columnar URL index has been extended over time by adding new columns. If you want to query one of the new columns (e.g., content_languages), the following Spark configuration option needs to be set:

--conf spark.sql.parquet.mergeSchema=true

However, this option impacts the query performance, so use with care! Please also read cc-index-table about configuration options to improve the performance of Spark SQL queries.
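
For illustration, the same effect can be achieved when reading the table directly in PySpark by enabling schema merging as a read option and then querying one of the newer columns. The sketch assumes the S3 support libraries and credentials are set up as described above; the query is only an example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc-index-merge-schema").getOrCreate()

# merge the Parquet schemas of all partitions so that newer columns
# (here: content_languages) become visible - this slows down the read
df = (spark.read
      .option("mergeSchema", "true")
      .load("s3a://commoncrawl/cc-index/table/cc-main/warc/"))

df.createOrReplaceTempView("ccindex")
spark.sql("""
    SELECT content_languages, COUNT(*) AS pages
    FROM ccindex
    WHERE crawl = 'CC-MAIN-2020-24' AND subset = 'warc'
    GROUP BY content_languages
    ORDER BY pages DESC
    LIMIT 10
""").show()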

Alternatively, it's possible to configure the table schema explicitly instead of merging it at read time.

Credits

Examples are originally ported from Stephen Merity's cc-mrjob with the following changes and upgrades:

  • based on Apache Spark (instead of mrjob)
  • boto3, supporting multi-part download of data from S3
  • warcio, a Python 2 and Python 3 compatible module to access WARC files

Further inspirations are taken from

License

MIT License, as per LICENSE
