
confluentinc / Camus

Licence: apache-2.0
Mirror of LinkedIn's Camus

Programming Languages

java
68154 projects - #9 most used programming language

Projects that are alternatives of or similar to Camus

Kafka Connect Hdfs
Kafka Connect HDFS connector
Stars: ✭ 400 (+393.83%)
Mutual labels:  kafka, hadoop, confluent, hdfs
Hadoop For Geoevent
ArcGIS GeoEvent Server sample Hadoop connector for storing GeoEvents in HDFS.
Stars: ✭ 5 (-93.83%)
Mutual labels:  hadoop, hdfs, connector
Bigdata Interview
🎯 🌟 [Big data interview questions] A collection of big-data interview questions gathered from around the web, with the author's own answer summaries. Currently covers questions on the Hadoop/Hive/Spark/Flink/HBase/Kafka/Zookeeper frameworks.
Stars: ✭ 857 (+958.02%)
Mutual labels:  kafka, hadoop, hdfs
Repository
A personal learning knowledge base covering data warehouse modeling, real-time computing, big data, Java, algorithms, and more.
Stars: ✭ 92 (+13.58%)
Mutual labels:  kafka, hadoop, hdfs
Bigdata Notes
A beginner's guide to big data ⭐
Stars: ✭ 10,991 (+13469.14%)
Mutual labels:  kafka, hadoop, hdfs
God Of Bigdata
Focused on big-data learning and interview preparation; the road to big-data mastery starts here. Flink/Spark/Hadoop/HBase/Hive...
Stars: ✭ 6,008 (+7317.28%)
Mutual labels:  kafka, hadoop, hdfs
kafka-connect-fs
Kafka Connect FileSystem Connector
Stars: ✭ 107 (+32.1%)
Mutual labels:  hadoop, confluent, hdfs
Examples
Demo applications and code examples for Confluent Platform and Apache Kafka
Stars: ✭ 571 (+604.94%)
Mutual labels:  kafka, confluent, connector
Kafka Connect Elasticsearch
Kafka Connect Elasticsearch connector
Stars: ✭ 550 (+579.01%)
Mutual labels:  kafka, confluent
Kafka Connect Jdbc
Kafka Connect connector for JDBC-compatible databases
Stars: ✭ 698 (+761.73%)
Mutual labels:  kafka, confluent
Bigdataguide
Learn big data from scratch, including study videos and interview materials for every stage of learning.
Stars: ✭ 817 (+908.64%)
Mutual labels:  kafka, hadoop
Sparta
Real Time Analytics and Data Pipelines based on Spark Streaming
Stars: ✭ 513 (+533.33%)
Mutual labels:  kafka, hdfs
Bigdata
💎🔥 Big data study notes
Stars: ✭ 488 (+502.47%)
Mutual labels:  hadoop, hdfs
Stream Reactor
Streaming reference architecture for ETL with Kafka and Kafka-Connect. You can find more on http://lenses.io on how we provide a unified solution to manage your connectors, most advanced SQL engine for Kafka and Kafka Streams, cluster monitoring and alerting, and more.
Stars: ✭ 753 (+829.63%)
Mutual labels:  kafka, connector
Devops Python Tools
80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Function, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Stars: ✭ 406 (+401.23%)
Mutual labels:  hadoop, hdfs
Szt Bigdata
A big-data passenger-flow analysis system for the Shenzhen Metro 🚇🚄🌟
Stars: ✭ 826 (+919.75%)
Mutual labels:  kafka, hadoop
Dockerfiles
50+ DockerHub public images for Docker & Kubernetes - Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak, TeamCity and DevOps tools built on the major Linux distros: Alpine, CentOS, Debian, Fedora, Ubuntu
Stars: ✭ 847 (+945.68%)
Mutual labels:  kafka, hadoop
Jsr203 Hadoop
A Java NIO file system provider for HDFS
Stars: ✭ 35 (-56.79%)
Mutual labels:  hadoop, hdfs
Cp Docker Images
[DEPRECATED] Docker images for Confluent Platform.
Stars: ✭ 975 (+1103.7%)
Mutual labels:  kafka, confluent
Learning Spark
Learn Spark from scratch; big data learning materials.
Stars: ✭ 37 (-54.32%)
Mutual labels:  hadoop, hdfs

Intro


Camus is LinkedIn's Kafka->HDFS pipeline. It is a MapReduce job that does distributed data loads out of Kafka. It includes the following features:

  • Automatic discovery of topics
  • Avro schema management (in progress)
  • Date partitioning

It is used at LinkedIn where it processes tens of billions of messages per day. You can get a basic overview from this paper: Building LinkedIn’s Real-time Activity Data Pipeline.

There is a Google Groups mailing list that you can email or search if you have any questions.

For more detailed documentation on the main Camus components, please see Camus InputFormat and OutputFormat Behavior.

Brief Overview

All work is done within a single Hadoop job divided into three stages:

  1. The setup stage fetches available topics and partitions from ZooKeeper and the latest offsets from the Kafka nodes.

  2. The Hadoop job stage allocates topic pulls among a set number of tasks. Each task does the following:

    • Fetch events from the Kafka server and collect count statistics.
    • Move data files to the corresponding directories based on the timestamps of the events.
    • Produce count events and write them to HDFS. (TODO: determine the status of open-sourcing Kafka Audit.)
    • Store updated offsets in HDFS.
  3. The cleanup stage reads counts from all tasks, aggregates the values, and submits the results to Kafka for consumption by Kafka Audit.

Setup Stage

  1. The setup stage fetches Kafka broker URLs and topics from ZooKeeper (under /brokers/id and /brokers/topics). This data is transient and will be gone once the Kafka server is down.

  2. Topic offsets are stored in HDFS. Camus maintains its own state by storing the offset for each topic in HDFS. This data is persistent.

  3. The setup stage allocates all topics and partitions among a fixed number of tasks.

Hadoop Stage

1. Pulling the Data

Each Hadoop task takes as input a list of topic partitions with offsets generated by the setup stage. It uses them to initialize Kafka requests and fetch events from the Kafka brokers. Each task generates four types of output (via a custom MultipleOutputFormat): Avro data files, count statistics files, updated offset files, and error files.

  • Note that each task generates an error file even if no errors were encountered; in that case, the file is empty.

2. Committing the data

Once a task has completed successfully, all topics it pulled are committed to their final output directories. If a task doesn't complete successfully, none of its output is committed. This allows the Hadoop job to use speculative execution. Speculative execution happens when a task appears to be running slowly; in that case, the job tracker schedules the task on a different node and runs both the main task and the speculative task in parallel. Once one of the tasks completes, the other is killed. This prevents a single overloaded Hadoop node from slowing down the entire ETL.

3. Producing Audit Counts

Successful tasks also write audit counts to HDFS.

4. Storing the Offsets

Final offsets are written to HDFS and consumed by the subsequent job.

Job Cleanup

Once the Hadoop job has completed, the main client reads all the written audit counts and aggregates them. The aggregated results are then submitted to Kafka.

Setting up Camus

Building the project

You can build Camus with:

mvn clean package

Note that two jars are not currently in a public Maven repo. These jars (kafka-0.7.2 and avro-schema-repository-1.74-SNAPSHOT) are supplied in the lib directory, and Maven will automatically install them into your local Maven cache (usually ~/.m2).

First, Create a Custom Kafka Message to Avro Record Decoder

We hope to eventually provide a more out-of-the-box solution, but until then you will need to create a custom decoder for handling Kafka messages. You can do this by implementing the abstract class com.linkedin.batch.etl.kafka.coders.KafkaMessageDecoder. Internally, we use a schema registry that resolves an Avro schema from an identifier included in the Kafka byte payload. For more information on other options, you can email [email protected]. Once you have created a decoder, specify it in the properties as described below. You can also look at the existing decoders to see if they work for you, or use them as examples if you need to implement a new one.
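
As a rough illustration of the shape of a decoder, here is a minimal sketch that treats each Kafka message as a UTF-8 string. It is not the project's own code: the base class and method signatures shown (com.linkedin.camus.coders.MessageDecoder with init(...) and decode(...) returning a CamusWrapper) are assumptions based on later Camus versions, so check the abstract decoder class shipped with your build (the text above names com.linkedin.batch.etl.kafka.coders.KafkaMessageDecoder) for the exact API.

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import com.linkedin.camus.coders.CamusWrapper;
import com.linkedin.camus.coders.MessageDecoder;

// Illustrative sketch only: the base class and signatures here are assumptions
// based on later Camus releases; adapt to the abstract decoder in your build.
public class Utf8StringMessageDecoder extends MessageDecoder<byte[], String> {

    private Properties props;
    private String topicName;

    @Override
    public void init(Properties props, String topicName) {
        // Remember the job properties and topic; a real decoder might read
        // schema-registry settings or timestamp options from props here.
        this.props = props;
        this.topicName = topicName;
    }

    @Override
    public CamusWrapper<String> decode(byte[] payload) {
        // Interpret the raw payload as UTF-8 text and stamp it with the
        // current time; a real decoder would normally extract the timestamp
        // from the message itself so that date partitioning is accurate.
        String record = new String(payload, StandardCharsets.UTF_8);
        return new CamusWrapper<String>(record, System.currentTimeMillis());
    }
}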

Decoding JSON messages

Camus can also process JSON messages. Set camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder in camus.properties. Additionally, there are two related options: camus.message.timestamp.format (default value: "[dd/MMM/yyyy:HH:mm:ss Z]") and camus.message.timestamp.field (default value: "timestamp"), as in the example below.
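
For example, a camus.properties fragment for JSON input might look like the following. The format and field shown are just the defaults quoted above; adjust them to match how your events encode their timestamp.

camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
camus.message.timestamp.format=[dd/MMM/yyyy:HH:mm:ss Z]
camus.message.timestamp.field=timestamp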

Writing to different formats

By default, Camus writes Avro data, but you can also write other formats by implementing a RecordWriterProvider. For examples, see https://github.com/linkedin/camus/blob/master/camus-etl-kafka/src/main/java/com/linkedin/camus/etl/kafka/common/AvroRecordWriterProvider.java and https://github.com/linkedin/camus/blob/master/camus-etl-kafka/src/main/java/com/linkedin/camus/etl/kafka/common/StringRecordWriterProvider.java. You can specify which writer to use with etl.record.writer.provider.class, as in the snippet below.
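
For example, to write newline-delimited strings instead of Avro, using the StringRecordWriterProvider linked above:

# Write delimited strings instead of Avro (class path taken from the link above).
etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.StringRecordWriterProvider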

Configuration

Camus can be run from the command line as a Java app. You will need to set some properties, either by specifying a properties file on the classpath using -p (filename), an external properties file using -P (a path on the local filesystem or an hdfs: path), or on the command line itself using -D property=value. If the same property is set in more than one of these ways, the order of precedence is: command line, external file, classpath file.

Here is an abbreviated list of commonly used parameters. An example properties file is also available at https://github.com/linkedin/camus/blob/master/camus-example/src/main/resources/camus.properties; a short illustrative fragment follows the list below.

  • Top-level data output directory; sub-directories will be dynamically created for each topic pulled
    • etl.destination.path=
  • HDFS location where you want to keep execution files, i.e. offsets, error logs, and count files
    • etl.execution.base.path=
  • Where completed Camus job output directories are kept, usually a sub-dir in the base.path
    • etl.execution.history.path=
  • Filesystem for the above folders. This can be an hdfs:// or s3:// address.
    • fs.default.name=
  • List of at least one Kafka broker for Camus to pull metadata from
    • kafka.brokers=
  • All files in this dir will be added to the distributed cache and placed on the classpath for Hadoop tasks
    • hdfs.default.classpath.dir=
  • Max hadoop tasks to use, each task can pull multiple topic partitions
    • mapred.map.tasks=30
  • Max historical time that will be pulled from each partition based on event timestamp
    • kafka.max.pull.hrs=1
  • Events with a timestamp older than this will be discarded.
    • kafka.max.historical.days=3
  • Max minutes for each mapper to pull messages
    • kafka.max.pull.minutes.per.task=-1
  • Decoder class for Kafka Messages to Avro Records
    • camus.message.decoder.class=
  • If the whitelist has values, only whitelisted topics are pulled. Nothing on the blacklist is pulled
    • kafka.blacklist.topics=
    • kafka.whitelist.topics=
  • Class for writing records to HDFS/S3
    • etl.record.writer.provider.class=
  • Delimiter for writing string records (default is "\n")
    • etl.output.record.delimiter=
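
Putting a few of these together, a minimal camus.properties might look like the fragment below. All paths, broker addresses, and topic names are placeholders; the decoder and writer classes are the ones discussed above.

# Output and execution locations (placeholder paths).
etl.destination.path=/camus/topics
etl.execution.base.path=/camus/exec
etl.execution.history.path=/camus/exec/history
fs.default.name=hdfs://namenode:8020

# Kafka brokers and topic selection (placeholder host:port list and topic name).
kafka.brokers=broker1:9092,broker2:9092
kafka.whitelist.topics=my_topic
kafka.max.pull.hrs=1
kafka.max.historical.days=3

# Decoding and writing (see the sections above).
camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.StringRecordWriterProvider

# Task parallelism.
mapred.map.tasks=30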

Running Camus

Camus can be run from the command line using hadoop jar. Here is the usage:

usage: hadoop jar camus-example-<version>-SNAPSHOT.jar com.linkedin.camus.etl.kafka.CamusJob
 -D <property=value>   use value for given property
 -P <arg>              external properties filename
 -p <arg>              properties filename from the classpath
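
For example, to run with an external properties file on the local filesystem and override a single property on the command line (the jar version and file path are placeholders):

hadoop jar camus-example-<version>-SNAPSHOT.jar com.linkedin.camus.etl.kafka.CamusJob -P /path/to/camus.properties -D mapred.map.tasks=30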

License

Apache License 2.0 (see Licence above). The original README carries a FOSSA license-scan status badge.
