

Jaeger Spark dependencies

This is a Spark job that collects spans from storage, analyzes the links between services, and stores them for later presentation in the UI. Note that this job is required for a production deployment; the all-in-one distribution does not need it.

This job parses all traces on a given day, based on UTC. By default, it processes the current day, but other days can be explicitly specified.

This repository is based on zipkin-dependencies.

Quick-start

The Spark job can be run as a Docker container or as a Java executable:

Docker:

$ docker run --env STORAGE=cassandra --env CASSANDRA_CONTACT_POINTS=host1,host2 jaegertracing/spark-dependencies

Use `--env JAVA_OPTS` to set the trust store and other Java system properties (for example the `javax.net.ssl` properties).
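As a rough sketch (the trust store path, password, and hosts below are placeholders, and the trust store would also have to be mounted into the container), SSL settings can be passed through `JAVA_OPTS` like this:

$ docker run \
    --env STORAGE=cassandra \
    --env CASSANDRA_CONTACT_POINTS=host1,host2 \
    --env CASSANDRA_USE_SSL=true \
    --env JAVA_OPTS="-Djavax.net.ssl.trustStore=/path/to/truststore.jks -Djavax.net.ssl.trustStorePassword=changeit" \
    jaegertracing/spark-dependencies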

As jar file:

STORAGE=cassandra java -jar jaeger-spark-dependencies.jar

Usage

By default, this job parses all traces since midnight UTC. You can parse traces for a different day by passing an argument in YYYY-mm-dd format, like 2016-07-16, or by specifying the date via an environment variable (see `DATE` under Configuration).

# e.g. to run the job to process yesterday's traces on macOS
$ STORAGE=cassandra java -jar jaeger-spark-dependencies.jar `date -uv-1d +%F`
# or on Linux
$ STORAGE=cassandra java -jar jaeger-spark-dependencies.jar `date -u -d '1 day ago' +%F`
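The same can be expressed with the `DATE` variable described under Configuration; as a sketch (the date is only illustrative):

$ STORAGE=cassandra DATE=2016-07-16 java -jar jaeger-spark-dependencies.jar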

Configuration

jaeger-spark-dependencies applies configuration parameters through environment variables.

The following variables are common to all storage layers:

* `SPARK_MASTER`: Spark master to submit the job to; defaults to `local[*]`. See the sketch after this list.
* `DATE`: Date in YYYY-mm-dd format. Denotes the day for which dependency links will be created.
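A minimal sketch combining both variables, assuming a standalone Spark master reachable at `spark://spark-master:7077` (host, port, and date are placeholders):

$ SPARK_MASTER=spark://spark-master:7077 DATE=2016-07-16 STORAGE=cassandra java -jar jaeger-spark-dependencies.jar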

Cassandra

Cassandra is used when STORAGE=cassandra.

* `CASSANDRA_KEYSPACE`: The keyspace to use. Defaults to "jaeger_v1_dc1".
* `CASSANDRA_CONTACT_POINTS`: Comma-separated list of hosts/IP addresses that are part of the Cassandra cluster. Defaults to `localhost`.
* `CASSANDRA_LOCAL_DC`: The local DC to connect to (other nodes will be ignored).
* `CASSANDRA_USERNAME` and `CASSANDRA_PASSWORD`: Cassandra authentication. Will throw an exception on startup if authentication fails.
* `CASSANDRA_USE_SSL`: Requires `javax.net.ssl.trustStore` and `javax.net.ssl.trustStorePassword`; defaults to false.
* `CASSANDRA_CLIENT_AUTH_ENABLED`: If set, enables client authentication on SSL connections. Requires `javax.net.ssl.keyStore` and `javax.net.ssl.keyStorePassword`; defaults to false.

Example usage:

$ STORAGE=cassandra CASSANDRA_CONTACT_POINTS=localhost:9042 java -jar jaeger-spark-dependencies.jar
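For a cluster with authentication and SSL enabled, the invocation might look like the following sketch (the data center name, credentials, and trust store path are placeholders):

$ STORAGE=cassandra \
  CASSANDRA_CONTACT_POINTS=host1:9042,host2:9042 \
  CASSANDRA_LOCAL_DC=dc1 \
  CASSANDRA_USERNAME=cassandra CASSANDRA_PASSWORD=cassandra \
  CASSANDRA_USE_SSL=true \
  java -Djavax.net.ssl.trustStore=/path/to/truststore.jks \
       -Djavax.net.ssl.trustStorePassword=changeit \
       -jar jaeger-spark-dependencies.jar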

Elasticsearch

Elasticsearch is used when STORAGE=elasticsearch.

* `ES_NODES`: A comma-separated list of Elasticsearch hosts advertising HTTP. Defaults to localhost. Add a port section if not listening on port 9200. Only one of these hosts needs to be available to fetch the remaining nodes in the cluster. It is recommended to set this to all the master nodes of the cluster. Use URL format for SSL, for example "https://yourhost:8888".
* `ES_NODES_WAN_ONLY`: Set to true to only use the values set in `ES_NODES`, for example if your Elasticsearch cluster is in Docker. If you are using a cloud provider such as AWS Elasticsearch, set this to true. Defaults to false.
* `ES_USERNAME` and `ES_PASSWORD`: Elasticsearch basic authentication. Use when X-Pack security (formerly Shield) is in place. By default no username or password is provided to Elasticsearch.
* `ES_CLIENT_NODE_ONLY`: Set to true to disable Elasticsearch cluster `nodes.discovery` and enable `nodes.client.only`. If your Elasticsearch cluster's data nodes only listen on the loopback IP, set this to true. Defaults to false.
* `ES_INDEX_PREFIX`: Index prefix of Jaeger indices. Unset by default.
* `ES_TIME_RANGE`: How far in the past the job should look for spans; the maximum and default is `24h`. Any value accepted by [date-math](https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#date-math) can be used here, but the anchor is always `now`.

Example usage:

$ STORAGE=elasticsearch ES_NODES=http://localhost:9200 java -jar jaeger-spark-dependencies.jar
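A more complete sketch with basic authentication, a custom index prefix, and a narrower time range (all values are placeholders):

$ STORAGE=elasticsearch \
  ES_NODES=https://yourhost:9200 \
  ES_NODES_WAN_ONLY=true \
  ES_USERNAME=elastic ES_PASSWORD=changeme \
  ES_INDEX_PREFIX=tracing \
  ES_TIME_RANGE=12h \
  java -jar jaeger-spark-dependencies.jar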

Building locally

To build the job locally and run tests:

./mvnw clean install # if the build fails, try setting SPARK_LOCAL_IP=127.0.0.1
STORAGE=elasticsearch ES_NODES=http://localhost:9200 java -jar jaeger-spark-dependencies/target/jaeger-spark-dependencies-0.0.1-SNAPSHOT.jar
docker build -t jaegertracing/spark-dependencies:latest .

In tests it is possible to specify the version of the Jaeger images via the environment variable JAEGER_VERSION or the system property jaeger.version. By default the tests use the latest images.
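For example, a sketch of pinning the image version when running the tests (the version number is only illustrative):

$ JAEGER_VERSION=1.8.2 ./mvnw clean install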

License

Apache 2.0 License.
