
aws-samples / aws-batch-runtime-monitoring

License: MIT-0
Serverless application to monitor an AWS Batch architecture through dashboards.

Programming Languages

python

Projects that are alternatives of or similar to aws-batch-runtime-monitoring

awesome-aws-research
A curated list of awesome Amazon Web Services (AWS) libraries, open source repos, guides, blogs, and other resources for Academic Researchers new to AWS
Stars: ✭ 41 (+0%)
Mutual labels:  hpc, aws-batch
launcher-scripts
(DEPRECATED) A set of launcher scripts to be used with OAR and Slurm for running jobs on the UL HPC platform
Stars: ✭ 14 (-65.85%)
Mutual labels:  hpc
Pyslurm
Python Interface to Slurm
Stars: ✭ 222 (+441.46%)
Mutual labels:  hpc
suanPan
🧮 An Open Source, Parallel and Heterogeneous Finite Element Analysis Framework
Stars: ✭ 29 (-29.27%)
Mutual labels:  hpc
Batch Shipyard
Simplify HPC and Batch workloads on Azure
Stars: ✭ 240 (+485.37%)
Mutual labels:  hpc
lustre exporter
Prometheus exporter for use with the Lustre parallel filesystem
Stars: ✭ 25 (-39.02%)
Mutual labels:  hpc
Relion
Image-processing software for cryo-electron microscopy
Stars: ✭ 219 (+434.15%)
Mutual labels:  hpc
arrayfire-java
Java wrapper for ArrayFire
Stars: ✭ 34 (-17.07%)
Mutual labels:  hpc
az-hop
The Azure HPC On-Demand Platform provides an HPC Cluster Ready solution
Stars: ✭ 33 (-19.51%)
Mutual labels:  hpc
hp2p
Heavy Peer To Peer: an MPI-based benchmark for network diagnostics
Stars: ✭ 17 (-58.54%)
Mutual labels:  hpc
azurehpc
This repository provides easy automation scripts for building an HPC environment in Azure. It also includes examples to build an e2e environment and run some of the key HPC benchmarks and applications.
Stars: ✭ 102 (+148.78%)
Mutual labels:  hpc
tailwind-dashboard-template
Mosaic Lite is a free admin dashboard template built on top of Tailwind CSS and fully coded in React. Made by
Stars: ✭ 1,662 (+3953.66%)
Mutual labels:  dashboards
realcaffe2
The repo is obsolete. Use at your own risk.
Stars: ✭ 12 (-70.73%)
Mutual labels:  hpc
Occa
JIT Compilation for Multiple Architectures: C++, OpenMP, CUDA, HIP, OpenCL, Metal
Stars: ✭ 230 (+460.98%)
Mutual labels:  hpc
t8code
Parallel algorithms and data structures for tree-based AMR with arbitrary element shapes.
Stars: ✭ 37 (-9.76%)
Mutual labels:  hpc
Nek5000
our classic
Stars: ✭ 219 (+434.15%)
Mutual labels:  hpc
tutorials
Tutorials for the usage of the Uni.lu HPC platform
Stars: ✭ 94 (+129.27%)
Mutual labels:  hpc
Foundations of HPC 2021
This repository collects the materials from the course "Foundations of HPC", 2021, at the Data Science and Scientific Computing Department, University of Trieste
Stars: ✭ 22 (-46.34%)
Mutual labels:  hpc
Singularity-tutorial
Singularity 101
Stars: ✭ 31 (-24.39%)
Mutual labels:  hpc
aws-batch-example
Example use of AWS batch
Stars: ✭ 96 (+134.15%)
Mutual labels:  aws-batch

AWS Batch Runtime Monitoring Dashboards Solution

This SAM application deploys a serverless architecture that captures events from Amazon ECS, AWS Batch, and Amazon EC2 to visualize the behavior of your workloads running on AWS Batch and provide insights on your jobs and the instances used to run them.

This application is designed to be scalable: it collects data from events and API calls using Amazon EventBridge and does not make API calls to describe your resources. Data collected through events and API calls is partially aggregated in DynamoDB to reconcile information and generate Amazon CloudWatch metrics with the Embedded Metric Format. The application also deploys several dashboards displaying the job states, the Amazon EC2 instances belonging to your Amazon ECS clusters (AWS Batch Compute Environments), and ASGs across Availability Zones. It also collects calls to the RunTask API to visualize job placement across instances.
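
For context, an EventBridge rule that forwards these events to a Lambda function can be created as in the following sketch; the rule name and function ARN are hypothetical placeholders, while the detail-type shown is the standard one emitted by AWS Batch:

import json
import boto3

events = boto3.client("events")

# Match AWS Batch job state transitions.
pattern = {
    "source": ["aws.batch"],
    "detail-type": ["Batch Job State Change"],
}

# "JobStateRule" and the Lambda ARN below are placeholders, not names
# used by this project.
events.put_rule(Name="JobStateRule", EventPattern=json.dumps(pattern))
events.put_targets(
    Rule="JobStateRule",
    Targets=[{"Id": "1", "Arn": "arn:aws:lambda:us-east-1:123456789012:function:JobStateHandler"}],
)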

Dashboard

A series of dashboards is deployed as part of the SAM application. They provide information on your ASG scaling and the capacity in vCPUs and instances. You will also find dashboards providing statistics on Batch job state transitions, RunTask API calls, and job placement per Availability Zone, Instance Type, Job Queue, and ECS Cluster.

(Dashboard screenshots)

Architecture Diagram

(Architecture diagram)

Overview

This solution captures events from AWS Batch and Amazon ECS using Amazon EventBridge. Each event triggers an AWS Lambda function that adds metrics to Amazon CloudWatch and interacts with an Amazon DynamoDB table to tie a job to the instance it runs on (a handler sketch follows the list below).

  • ECS Instance Registration: captures when instances are added to or removed from an ECS cluster (a Compute Environment in Batch). It helps link together the EC2 instance ID and the ECS container instance ID.
  • ECS RunTask: the API called by AWS Batch to run jobs. Calls can succeed (the job is placed on the cluster) or fail, in which case the job is not placed due to a lack of free resources.
  • Batch Job Transitions: captures Batch job transitions between states and their placement across Job Queues and Compute Environments.
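
As a rough illustration, a handler for the job-transition events might look like the sketch below; the table name (JobsTable) and attribute names are hypothetical, not the ones used by this application:

import boto3

table = boto3.resource("dynamodb").Table("JobsTable")  # hypothetical name

def handler(event, context):
    # EventBridge delivers Batch "Job State Change" events with the job
    # details under the "detail" key.
    detail = event["detail"]
    table.put_item(
        Item={
            "JobId": detail["jobId"],
            "Status": detail["status"],  # e.g. RUNNABLE, RUNNING, SUCCEEDED
            "JobQueue": detail["jobQueue"],
        }
    )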

DynamoDB is used to retain the state of the instances joining the ECS clusters (which sit underneath our Batch Compute Environments) so we can identify the instances joining a cluster and the Availability Zone in which they are created. This also allows us to identify on which instances jobs are placed and whether they succeed or fail.
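
As a minimal sketch (the table and attribute names are assumptions, not the project's actual schema), recording an instance registration could look like:

import boto3

table = boto3.resource("dynamodb").Table("InstancesTable")  # hypothetical name

def register_instance(container_instance_arn, ec2_instance_id, availability_zone):
    # Map the ECS container instance to its EC2 instance and AZ so that
    # later RunTask events can be resolved to a physical instance.
    table.put_item(
        Item={
            "ContainerInstanceArn": container_instance_arn,
            "Ec2InstanceId": ec2_instance_id,
            "AvailabilityZone": availability_zone,
        }
    )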

When RunTask is called or Batch jobs transition between states, we can associate them with the Amazon EC2 instance on which they are running. Because RunTask API calls and Batch jobs are not associated directly with Amazon EC2, we use the container instance ID generated when an instance registers with a cluster to identify which Amazon EC2 instance is used to run a job (or place a task, in the case of ECS). This architecture does not make any explicit API calls such as DescribeJobs or DescribeInstances, which makes it relatively scalable and not subject to potential throttling on these APIs.
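
Resolving a job or task to its instance then becomes a DynamoDB lookup instead of a describe call; continuing the hypothetical schema above:

import boto3

table = boto3.resource("dynamodb").Table("InstancesTable")  # hypothetical name

def resolve_instance(container_instance_arn):
    # Look up the registration record written earlier; no DescribeInstances
    # call is needed, which avoids API throttling at scale.
    response = table.get_item(Key={"ContainerInstanceArn": container_instance_arn})
    item = response.get("Item")
    return (item["Ec2InstanceId"], item["AvailabilityZone"]) if item else (None, None)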

CloudWatch Embedded Metric Format (EMF) is used to collect the metrics displayed on the dashboards. Step Functions are used to build the logic around the Lambda functions handling the events, with data stored in DynamoDB and pushed to CloudWatch EMF.
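
An EMF record is just a structured JSON log line that CloudWatch turns into a metric, with no PutMetricData call. A hand-rolled example (the namespace and dimension names are placeholders, not the ones this application uses):

import json
import time

def emit_job_state_metric(job_queue, state):
    # CloudWatch parses this log line and creates a "JobCount" metric in the
    # given namespace, dimensioned by job queue and state.
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "BatchMonitoring",  # placeholder namespace
                "Dimensions": [["JobQueue", "State"]],
                "Metrics": [{"Name": "JobCount", "Unit": "Count"}],
            }],
        },
        "JobQueue": job_queue,
        "State": state,
        "JobCount": 1,
    }))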

How to run the SAM application

To build and deploy your application for the first time, run the following in your shell:

sam build
sam deploy --guided

You will be asked to provide a few parameters.

Parameters

When creating the stack for the first time you will need to provide the following parameters:

  • Stack Name: name of the CloudFormation stack that will be deployed.
  • AWS Region: region in which you will deploy the stack; the default region will be used if nothing is provided.

After the first launch, you can modify the function and deploy a new version with the same parameters through the following commands:

sam build
sam deploy # use sam deploy --no-confirm-changeset to force the deployment without validation

Cleanup

To remove the SAM application, go to the CloudFormation page of the AWS Console, select your stack and click Delete.

Adding Monitoring to Existing Auto-Scaling Groups

The Clusters Usage dashboard needs monitoring activated for each Auto Scaling group (ASG); this is not done by default. The ASGMonitoring Lambda function in the serverless application automatically adds monitoring for new ASGs created by AWS Batch. To add it for existing ones, you can run the following command in your terminal (with jq installed) or in AWS CloudShell:

aws autoscaling describe-auto-scaling-groups | \
  jq -r '.AutoScalingGroups[] | select(.MixedInstancesPolicy.LaunchTemplate.LaunchTemplateSpecification.LaunchTemplateName // "" | contains("Batch-lt")) | .AutoScalingGroupName' | \
  xargs -t -I {} aws autoscaling enable-metrics-collection \
    --metrics GroupInServiceCapacity GroupDesiredCapacity GroupInServiceInstances \
    --granularity "1Minute" \
    --auto-scaling-group-name {}
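
If you prefer Python over the CLI pipeline, a rough boto3 equivalent looks like this (the "Batch-lt" substring check mirrors the jq filter above):

import boto3

autoscaling = boto3.client("autoscaling")

# Enable 1-minute metrics on every ASG created from a Batch launch template.
paginator = autoscaling.get_paginator("describe_auto_scaling_groups")
for page in paginator.paginate():
    for asg in page["AutoScalingGroups"]:
        lt = (
            asg.get("MixedInstancesPolicy", {})
            .get("LaunchTemplate", {})
            .get("LaunchTemplateSpecification", {})
        )
        if "Batch-lt" in lt.get("LaunchTemplateName", ""):
            autoscaling.enable_metrics_collection(
                AutoScalingGroupName=asg["AutoScalingGroupName"],
                Metrics=["GroupInServiceCapacity", "GroupDesiredCapacity", "GroupInServiceInstances"],
                Granularity="1Minute",
            )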

Requirements

To run the serverless application you need to install the SAM CLI and have Python 3.8 installed on your host. Python 3.7 can be used as well by modifying the template.yaml file and replacing the Lambda function runtimes from python3.8 to python3.7. You can do this quickly with the following sed command in the repository directory:

sed -i 's/3\.8/3\.7/g' template.yaml

If you plan to use AWS CloudShell to deploy the SAM template, please modify the Lambda runtime to 3.7 as suggested above (unless 3.8 is available) and make your Python 3 command the default for python: alias python=/usr/bin/python3.7. You can check the Python version available in CloudShell with python3 --version.
