GoogleCloudPlatform / redis-dataflow-realtime-analytics

Licence: Apache-2.0 License
Build a real-time website analytics dashboard on GCP using Dataflow, Cloud Memorystore (Redis) and Spring Boot

Programming Languages: Java, JavaScript, Python, Shell, HTML

Realtime Analytics using Dataflow and Cloud Memorystore (Redis)

In today’s fast-paced world, there is an emphasis on getting instant insights. Typical use cases include SaaS operators providing real-time metrics for their KPIs and marketeers who need quick insights into the performance of their offers or experiments on a website.

This solution demonstrates how to build a real-time website analytics dashboard on GCP.

Components

User events / Message bus provides system decoupling. Pub/Sub is a fully managed message/event bus and provides an easy way to handle the fast click-stream generated by typical websites. The click-stream contains signals that can be processed to derive insights in real time.

Metrics processing pipeline is required to process the click-stream from Pub/Sub into the metrics database. Dataflow will be used, which is a serverless, fully managed processing service supporting real-time streaming jobs.

Metrics Database needs to be an in-memory database to support real-time use cases. Some common web analytics metrics are unique visitors, number of active experiments, and the conversion rate of each experiment. The common theme is counting uniques, i.e. cardinality counting. From a marketeer's standpoint a good estimate is sufficient, and the HyperLogLog algorithm is an efficient solution to the count-unique problem that trades off some accuracy for much lower memory use.

Cloud Memorystore (Redis) provides a range of built-in functions for sets and cardinality measurement, removing the need to implement them in application code.
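To make the accuracy-for-memory trade-off concrete, here is a minimal HyperLogLog sketch in plain Python. It is an illustration only, not the implementation Redis uses behind its PFADD and PFCOUNT commands: the class name, the choice of SHA-1 as the hash, and the default precision of 12 bits are all assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Tiny HyperLogLog estimator for illustration (not Redis's implementation)."""

    def __init__(self, precision: int = 12) -> None:
        self.p = precision
        self.m = 1 << precision              # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # Derive a 64-bit hash from SHA-1 (hash choice is an assumption).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> int:
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(20000):
        hll.add(f"user-{i}")
    print("estimated unique visitors:", hll.count())
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%, while memory stays fixed no matter how many visitors are added.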

The analytics reporting and visualization component makes the reports easily available to the marketeer. A Spring dashboard application is used for demo purposes only. The application uses the Jedis client to read metrics from Redis, using the scard and sinterstore commands to identify user overlap and other cardinality values. It then renders graphs in a JavaScript-based web UI using the Google Charts library.
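The semantics of the two set commands mentioned above can be sketched without a Redis server. The class below is an in-memory stand-in written for this example, not the dashboard's code, and the key names (`experiment:A:users` and so on) are hypothetical.

```python
from typing import Dict, Set


class FakeRedisSets:
    """In-memory stand-in for the Redis SADD, SCARD and SINTERSTORE commands."""

    def __init__(self) -> None:
        self.store: Dict[str, Set[str]] = {}

    def sadd(self, key: str, *members: str) -> int:
        """Add members to the set at key; return how many were new."""
        target = self.store.setdefault(key, set())
        before = len(target)
        target.update(members)
        return len(target) - before

    def scard(self, key: str) -> int:
        """Return the cardinality of the set at key (0 if the key is missing)."""
        return len(self.store.get(key, set()))

    def sinterstore(self, dest: str, *keys: str) -> int:
        """Store the intersection of the given sets at dest; return its size."""
        sets = [self.store.get(k, set()) for k in keys]
        result = set.intersection(*sets) if sets else set()
        if result:
            self.store[dest] = result
        else:
            # Redis does not keep empty sets, so the destination is removed.
            self.store.pop(dest, None)
        return len(result)


if __name__ == "__main__":
    r = FakeRedisSets()
    r.sadd("experiment:A:users", "u1", "u2", "u3")   # hypothetical key names
    r.sadd("experiment:B:users", "u2", "u3", "u4")
    overlap = r.sinterstore("overlap:A:B", "experiment:A:users", "experiment:B:users")
    print("users in A:", r.scard("experiment:A:users"))
    print("users in both A and B:", overlap)
```

Letting Redis compute and store the intersection means the dashboard only has to read a single cardinality value instead of transferring both sets to the application.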

Video Tutorial

Part 1 Part 2

Quick Start

Setup Environment

  1. Clone this repository
    git clone https://github.com/GoogleCloudPlatform/redis-dataflow-realtime-analytics.git
    cd redis-dataflow-realtime-analytics
  2. Update the environment variables in set_variables.sh, then load them into the current shell
    source set_variables.sh
  3. Enable required Cloud products
    gcloud services enable \
    compute.googleapis.com \
    pubsub.googleapis.com \
    redis.googleapis.com \
    dataflow.googleapis.com \
    storage-component.googleapis.com
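Later commands fail in confusing ways if a variable is empty, so a quick check after sourcing the script can save time. The sketch below lists the variable names used by the commands in this guide; it assumes set_variables.sh exports them (variables that are set but not exported are invisible to child processes such as Python), and the script may define others not shown here.

```python
import os
from typing import List, Mapping, Sequence

# Variable names taken from the commands in this guide.
REQUIRED_VARS = (
    "PROJECT_ID",
    "REGION_ID",
    "ZONE_ID",
    "APP_EVENTS_TOPIC",
    "VPC_NETWORK_NAME",
    "REDIS_NAME",
    "TEMP_GCS_BUCKET",
)


def missing_variables(
    env: Mapping[str, str], required: Sequence[str] = REQUIRED_VARS
) -> List[str]:
    """Return the required names that are unset or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_variables(os.environ)
    if missing:
        print("Missing or empty variables:", ", ".join(missing))
    else:
        print("All required variables are set.")
```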

Create Pub/Sub Topic

Pub/Sub is a global message bus enabling easy message consumption in a decoupled fashion. Create a Pub/Sub topic to receive application instrumentation messages.

gcloud pubsub topics create $APP_EVENTS_TOPIC --project $PROJECT_ID

Create VPC network

Protecting the Redis instance is important because it does not itself provide protection against access from external entities.

  1. Create a separate VPC network. With external ingress blocked by a firewall, this provides basic security for the instance.
    gcloud compute networks create $VPC_NETWORK_NAME \
    --subnet-mode=auto \
    --bgp-routing-mode=regional
  2. Create a firewall rule to allow SSH and ICMP
    gcloud compute firewall-rules create allow-internal-ssh \
    --network $VPC_NETWORK_NAME \
    --allow tcp:22,icmp

Configure Cloud Memorystore

Cloud Memorystore provides a fully managed Redis database. Redis is a NoSQL in-memory database that offers comprehensive built-in functions for set operations, including efficient HyperLogLog (HLL) operations for cardinality measurement.

  1. Create Redis instance in Memorystore.
    gcloud redis instances create $REDIS_NAME \
    --size=1 \
    --region=$REGION_ID \
    --zone="$ZONE_ID" \
    --network=$VPC_NETWORK_NAME \
    --tier=standard

    Be patient, this can take some time.

  2. Capture the instance's IP address to configure the Dataflow pipeline and the visualization application
    export REDIS_IP="$(gcloud redis instances describe $REDIS_NAME --region=$REGION_ID \
    | grep host \
    | sed 's/host: //')"

Start Analytics pipeline

The analytic metrics pipeline will read click-stream messages from Pub/Sub and update metrics in the Redis database in real-time. The visualization application can then use the Redis database for the dashboard.

  1. Create a Cloud Storage bucket to serve as the temporary and staging area for the pipeline
    gsutil mb -l $REGION_ID -p $PROJECT_ID gs://$TEMP_GCS_BUCKET
  2. Launch the pipeline using Maven
    cd processor
    mvn clean compile exec:java \
      -Dexec.mainClass=com.google.cloud.solutions.realtimedash.pipeline.MetricsCalculationPipeline \
      -Dexec.cleanupDaemonThreads=false \
      -Dmaven.test.skip=true \
      -Dexec.args=" \
    --streaming \
    --project=$PROJECT_ID \
    --runner=DataflowRunner \
    --stagingLocation=gs://$TEMP_GCS_BUCKET/stage/ \
    --tempLocation=gs://$TEMP_GCS_BUCKET/temp/ \
    --inputTopic=projects/$PROJECT_ID/topics/$APP_EVENTS_TOPIC \
    --workerMachineType=n1-standard-4 \
    --region=$REGION_ID \
    --subnetwork=regions/$REGION_ID/subnetworks/$VPC_NETWORK_NAME \
    --redisHost=$REDIS_IP \
    --redisPort=6379"
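The kind of aggregation such a pipeline performs can be sketched in plain Python: bucket events by minute and collect unique users per metric key. This is not the code in MetricsCalculationPipeline. The event fields (`uid`, `experiment_id`, `timestamp`) and the key layout are hypothetical, chosen only to illustrate the idea; in the real pipeline each set update would go to Redis instead of a local dictionary.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, Mapping, Set


def minute_bucket(epoch_seconds: float) -> str:
    """Format a Unix timestamp as a UTC minute bucket, e.g. 19700101T0000."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y%m%dT%H%M")


def aggregate(events: Iterable[Mapping]) -> Dict[str, Set[str]]:
    """Group user ids into per-minute sets keyed by metric name."""
    metrics: Dict[str, Set[str]] = defaultdict(set)
    for event in events:
        bucket = minute_bucket(event["timestamp"])
        # Hypothetical key layout: one set of visitors per minute ...
        metrics[f"visitors:{bucket}"].add(event["uid"])
        # ... and one set per experiment per minute.
        experiment = event.get("experiment_id")
        if experiment:
            metrics[f"experiment:{experiment}:{bucket}"].add(event["uid"])
    return dict(metrics)


if __name__ == "__main__":
    sample = [
        {"uid": "u1", "experiment_id": "exp-a", "timestamp": 0},
        {"uid": "u1", "experiment_id": "exp-a", "timestamp": 30},
        {"uid": "u2", "experiment_id": "exp-b", "timestamp": 61},
    ]
    for key, users in sorted(aggregate(sample).items()):
        print(key, len(users))
```

Because each metric is a set of user ids, repeated events from the same user within a bucket do not inflate the count.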

Start the dummy website events generator

The dummy event generator is a Python executable that needs to keep running. To achieve this, launch the generator in a separate shell session.

  1. Create and initialize a new python3 virtual environment (you need to have the python3-venv package installed)
    python3 -m venv ~/generator-venv
    source ~/generator-venv/bin/activate  
    pip install -r loggen/requirements.txt   
  2. Run the logs generator
    python loggen/message_generator.py \
    --topic $APP_EVENTS_TOPIC \
    --project-id $PROJECT_ID \
    --enable-log true
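For readers curious what a synthetic click event might look like, here is a minimal sketch of building one as JSON bytes, the form a Pub/Sub message payload takes. It does not reproduce loggen/message_generator.py: the field names, experiment ids, and page paths are hypothetical, and publishing is left out so the example runs without any cloud dependency.

```python
import json
import random
import time
from typing import Optional

# Hypothetical values, chosen only for this sketch.
EXPERIMENTS = ("exp-a", "exp-b", "exp-c")
PAGES = ("/", "/pricing", "/checkout")


def make_event(rng: random.Random, now: Optional[float] = None) -> bytes:
    """Build one synthetic click event and encode it as UTF-8 JSON bytes."""
    event = {
        "uid": f"user-{rng.randint(1, 1000)}",
        "experiment_id": rng.choice(EXPERIMENTS),
        "page": rng.choice(PAGES),
        "timestamp": time.time() if now is None else now,
    }
    return json.dumps(event).encode("utf-8")


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    for _ in range(3):
        print(make_event(rng).decode("utf-8"))
```

Drawing user ids from a bounded range means the same users recur across events, which is what makes unique-visitor and overlap metrics meaningful on the dashboard.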

Run the Visualization Engine

Use the simple reporting application located in the dashboard/ folder, built using Spring Boot with a simple HTML+JS UI.

The application reads the metrics from the Redis database and makes them available to the dashboard UI. The application server needs to be on the same VPC network as the Redis server. To achieve this for demo purposes, we will use a proxy VM to tunnel the ports to the Cloud Shell VM, as it's not on the same network.

  1. Create a VM to act as proxy

    gcloud compute instances create proxy-server \
    --zone $ZONE_ID \
    --image-family debian-10 \
    --image-project debian-cloud \
    --network $VPC_NETWORK_NAME
  2. Start SSH port forwarding

    gcloud compute ssh proxy-server --zone $ZONE_ID -- -N -L 6379:$REDIS_IP:6379 -4 &
  3. Start the Visualization Spring Boot application.

    cd dashboard/
    mvn clean compile package spring-boot:run
  4. Click the web preview icon to access the application's web UI in the browser.

    a. Click "Preview on port 8080"
    b. On the dashboard, click "Auto Update" which will keep the dashboard fresh.
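If the dashboard shows no data, one thing worth checking is whether the SSH tunnel from step 2 is still listening locally. The helper below is a small sketch for that check; the function name is made up for this example, and the host and port in the usage section assume the forwarding command shown above (local port 6379).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if port_is_open("127.0.0.1", 6379):
        print("Tunnel is up: localhost:6379 accepts connections.")
    else:
        print("Nothing is listening on localhost:6379; restart the port forwarding.")
```

A successful connection only shows that something accepts TCP connections on the port; it does not confirm that Redis itself is answering on the far end.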

Sample Dashboard