dbis-ilm / piglet

License: Apache-2.0
A compiler for Pig Latin to Spark and Flink.

Programming Languages

  • scala (5932 projects)
  • PigLatin (29 projects)

Projects that are alternatives of or similar to piglet

logparser
Easy parsing of Apache HTTPD and NGINX access logs with Java, Hadoop, Hive, Pig, Flink, Beam, Storm, Drill, ...
Stars: ✭ 139 (+504.35%)
Mutual labels:  pig, flink
flink-training-troubleshooting
No description or website provided.
Stars: ✭ 41 (+78.26%)
Mutual labels:  flink
dlink
Dinky is an out of the box one-stop real-time computing platform dedicated to the construction and practice of Unified Streaming & Batch and Unified Data Lake & Data Warehouse. Based on Apache Flink, Dinky provides the ability to connect many big data frameworks including OLAP and Data Lake.
Stars: ✭ 1,535 (+6573.91%)
Mutual labels:  flink
flink-demo
Flink Demo
Stars: ✭ 39 (+69.57%)
Mutual labels:  flink
open-stream-processing-benchmark
This repository contains the code base for the Open Stream Processing Benchmark.
Stars: ✭ 37 (+60.87%)
Mutual labels:  flink
coolplayflink
Flink: Stateful Computations over Data Streams
Stars: ✭ 14 (-39.13%)
Mutual labels:  flink
SANSA-Stack
Big Data RDF Processing and Analytics Stack built on Apache Spark and Apache Jena http://sansa-stack.github.io/SANSA-Stack/
Stars: ✭ 130 (+465.22%)
Mutual labels:  flink
emma
A quotation-based Scala DSL for scalable data analysis.
Stars: ✭ 61 (+165.22%)
Mutual labels:  flink
PiggyAuth
Safe & feature-rich auth plugin. Project has been discontinued
Stars: ✭ 33 (+43.48%)
Mutual labels:  pig
flink-streaming-source-analysis
Flink stream processing source code analysis
Stars: ✭ 47 (+104.35%)
Mutual labels:  flink
dockerfiles
Multi docker container images for main Big Data Tools. (Hadoop, Spark, Kafka, HBase, Cassandra, Zookeeper, Zeppelin, Drill, Flink, Hive, Hue, Mesos, ... )
Stars: ✭ 29 (+26.09%)
Mutual labels:  flink
seatunnel-example
seatunnel plugin developing examples.
Stars: ✭ 27 (+17.39%)
Mutual labels:  flink
hadoopoffice
HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)
Stars: ✭ 56 (+143.48%)
Mutual labels:  flink
Websockets-Vertx-Flink-Kafka
A simple request response cycle using Websockets, Eclipse Vert-x server, Apache Kafka, Apache Flink.
Stars: ✭ 14 (-39.13%)
Mutual labels:  flink
flink-connectors
Apache Flink connectors for Pravega.
Stars: ✭ 84 (+265.22%)
Mutual labels:  flink
flink-connector-kudu
A flink-connector-kudu based on the Apache Bahir Kudu connector; supports Flink 1.11.x DynamicTableSource/Sink, range partitioning, and more
Stars: ✭ 40 (+73.91%)
Mutual labels:  flink
Archived-SANSA-Query
SANSA Query Layer
Stars: ✭ 31 (+34.78%)
Mutual labels:  flink
cassandra.realtime
Different ways to process data into Cassandra in realtime with technologies such as Kafka, Spark, Akka, Flink
Stars: ✭ 25 (+8.7%)
Mutual labels:  flink
Real-time-Data-Warehouse
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
Stars: ✭ 52 (+126.09%)
Mutual labels:  flink
flink-learn
Learning Flink : Flink CEP,Flink Core,Flink SQL
Stars: ✭ 70 (+204.35%)
Mutual labels:  flink

Pig Latin Compiler for Apache Spark / Flink

The goal of this project is to build a compiler for the Pig Latin dataflow language on modern data analytics platforms such as Apache Spark and Apache Flink. The project is not intended as a replacement for or competitor to the official Pig compiler for Hadoop or its extensions such as PigSpork. Instead, we have the following goals:

  • We want to build a compiler from scratch that compiles natively to the Scala-based Spark/Flink APIs and avoids the machinery needed for MapReduce/Hadoop.
  • Although we aim to stay compatible with the original Pig compiler, we plan to add extensibility features that allow defining and using user-defined operators (not only UDFs), so that extensions for graph processing or machine learning can be integrated.

Installation

Clone & Update

Simply clone the git project.
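For example (assuming the repository is hosted on GitHub under dbis-ilm/piglet, as the page header suggests):

git clone https://github.com/dbis-ilm/piglet.git   # repository URL assumed from the project name
cd piglet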

Build

To build the project, in the project directory invoke

sbt package

This will build the (main) Pig compiler project as well as the shipped backends (i.e. sparklib and flinklib).

Several test cases are included and should pass: unit tests are executed with sbt test; integration tests, which compile and execute Pig scripts on Spark or Flink, are executed with sbt it:test.
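For example:

sbt test      # run unit tests
sbt it:test   # run integration tests (compile and execute Pig scripts on Spark/Flink)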

Note that building the compiler requires the most recent Spark and Flink jars, but they will be downloaded by sbt automatically.

If you want to use the compiler with the frontend scripts (see below), you have to build an assembly:

sbt assembly

Usage

We provide a simple wrapper script for processing Pig scripts. Just call it with

piglet --master local[4] --backend spark your_script.pig

To run this script, you have to specify the full path to the platform distribution jar in the environment variable SPARK_JAR for Spark (e.g. spark-assembly-1.5.2-hadoop2.6.0.jar) or FLINK_JAR for Flink (e.g. flink-dist_2.11-1.0.0.jar). For Flink, you also have to provide the path to the conf directory in FLINK_CONF_DIR.

An example for Spark could look like the following:

export SPARK_JAR=/opt/spark-1.6.0/assembly/target/scala-2.11/spark-assembly-1.6.0-hadoop2.6.0.jar
piglet --master local[4] --backend spark your_script.pig

The equivalent for Flink would be:

export FLINK_JAR=/opt/flink-1.0.0/build-target/lib/flink-dist_2.11-1.0.0.jar
export FLINK_CONF_DIR=/opt/flink-1.0.0/build-target/conf
piglet --master local[4] --backend flink your_script.pig

Note that for both Spark and Flink you need a distribution built for Scala 2.11 (see e.g. the Spark and Flink documentation), and the version used for building must also be used for execution. For Flink, you have to run the start script found in the bin directory (e.g. /opt/flink-1.0.0/build-target/bin/start-local.sh) before executing scripts.
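For example, a complete Flink run (reusing the paths from the example above) could look like this:

/opt/flink-1.0.0/build-target/bin/start-local.sh    # start a local Flink cluster first
piglet --master local[4] --backend flink your_script.pig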

The following options are supported (a combined example follows the list):

  • --master m specifies the master (local, yarn-client, yarn)
  • --compile compile and build jar, but do not execute
  • --profiling
  • --outdir dir specifies the output directory for the generated code
  • --backend b specifies the backend used to execute the script. Currently, we support
    • spark: Apache Spark in batch mode
    • sparks: Apache Spark Streaming
    • flink: Apache Flink in batch mode
    • flinks: Apache Flink Streaming
    • mapreduce: Apache Hadoop (the script is simply passed to the original Pig compiler)
  • --backend_dir dir specifies the directory from which backend jars are loaded (see Backends below)
  • --params key=value, ...
  • --update-config force overwriting the existing configuration file ~/.piglet/application.conf (see Configuration below)
  • --show-plan Print the resulting dataflow plan
  • --show-stats Show execution runtimes for (some) Piglet methods
  • --keep Keep generated files
  • --sequential If more than one input script is provided, do not merge them but execute them sequentially
  • --log-level l
  • --backend-args key=value, ...
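A combined invocation might look like the following sketch (the output directory, parameter name, and script name are placeholders):

piglet --backend spark --master yarn-client --outdir ./generated --params input=/data/input.csv --show-plan your_script.pig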

In addition, you can start an interactive Pig shell similar to Grunt:

piglet --interactive --backend spark

where Pig statements can be entered at the prompt and are executed as soon as a DUMP or STORE statement is entered. Furthermore, the schema can be printed using DESCRIBE.
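A session might look like the following sketch (the prompt, file name, and schema are hypothetical):

piglet --interactive --backend spark
> A = LOAD 'input.csv' USING PigStorage(',') AS (name: chararray, cnt: int);  -- hypothetical input file and schema
> DESCRIBE A;
> DUMP A;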

Docker

Piglet can also be run as a Docker container. However, the image is not yet available on Docker Hub, so it has to be built manually:

sbt clean package assembly
docker build -t dbis/piglet .

Currently, the Docker image supports the Spark backend only.

To start the container, run:

docker run -it --rm --name piglet dbis/piglet

This uses the container's entrypoint, which runs piglet. The above command will print the help message.

You can start the interactive mode using the -i option and enter your script:

docker run -it --rm --name piglet dbis/piglet -b spark -i

Alternatively, you can mount existing files into the container as volumes and run the script in batch mode:

docker run -it --rm --name piglet -v /tmp/test.pig:/test.pig dbis/piglet -b spark /test.pig

As mentioned before, the container provides an entrypoint that executes piglet. If you need a bash shell inside the container, override the entrypoint:

docker run -it --rm --name piglet --entrypoint /bin/bash dbis/piglet

Configuration

To configure the program, we ship a configuration file. When the program is started for the first time, it creates its home directory inside your home directory and copies the configuration file there. More specifically, it creates a folder ~/.piglet (on *nix-like systems) and copies the configuration file application.conf to this location.

If you update Piglet to a new version while a configuration file from a previous version still exists, a configuration exception might occur because configuration keys introduced by the new version are missing from the existing config file. In such cases, you can start piglet with the -u (--update-config) option. This will overwrite your old configuration (make sure you have a backup if needed). Alternatively, you can simply remove the existing ~/.piglet/application.conf; this also triggers the copy routine.
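For example (backend and script name as in the examples above):

piglet --update-config --backend spark your_script.pig
# alternatively, remove the old configuration file; it will be recreated on the next start:
rm ~/.piglet/application.conf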

We use the Typesafe Config library.

Backends

As stated before, we support various backends for executing scripts. You can add your own backend by creating a jar file that contains the necessary configuration information and classes and adding it to the classpath (e.g. using the BACKEND_DIR variable), as sketched below.
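A minimal sketch (the backend name and directory are hypothetical):

export BACKEND_DIR=/path/to/my-backends        # hypothetical directory containing your backend jar
piglet --backend mybackend your_script.pig     # 'mybackend' stands for your custom backend's name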

More detailed information on how to create backends can be found in backends.md.

Further Information

  • Details on the supported language features (statements, functions, etc.) are described here.
  • Documentation on how to set up integration with Zeppelin.
  • We use the Scala testing framework as well as the scoverage tool for test coverage. You can produce a coverage report by running sbt clean coverage test. The results can be found in target/scala-2.11/scoverage-report/index.html.