An aide-mémoire on monitoring using CloudWatch & CloudFormation on AWS

Setting up Auto Scaling rules, alarms and load balancer health checks can be confusing, so I wanted to take a little time to look at how the bits fit together to give an effective, proactive monitoring solution using just CloudWatch. Sorry this is a longish post, but at least it’s all in one place 🙂

AWS does provide a lot of information, but it is scattered about and wading through it can be time consuming. Hopefully this will be a useful introduction.

A few definitions are a good place to start.



Alarms:

An alarm is exactly what it says: a watcher that notifies you when an AWS resource has breached a threshold assigned to a specific metric. (Note that you can now publish custom metrics alongside the built-in CloudWatch metrics and use these for Auto Scaling actions as well.)

Health checks:

A health check is a check on the state of an instance that is part of an Auto Scaling group. If an instance is detected as having degraded performance, it is marked as unhealthy.

Auto Scaling Policy:

A policy defines what action the Auto Scaling group should take in response to an alarm.


Triggers:

A trigger is a combination of an Auto Scaling policy and an Amazon CloudWatch alarm. Alarms are created that monitor specific metrics gathered from EC2 instances. Pairing an alarm with a policy lets a breach of the metric's threshold initiate an Auto Scaling action.

Launch Configuration:

The definitions (parameters) needed to instantiate new EC2 instances. These include values such as which AMI to use, the instance size, the user data to be passed and the EBS volumes to be attached. A Launch Configuration is used together with an Auto Scaling group. An Auto Scaling group can only have one Launch Configuration attached at any one time, but you can replace the Launch Configuration.

Auto Scaling Group:

An Auto Scaling group manages a set of one or more instances. It works in conjunction with a Launch Configuration and triggers to enact scaling actions. The Launch Configuration tells it what the instances should look like and the triggers tell it how to react to particular situations.


Component breakdown

Alarm Parameters:

Alarm name: A name that typically reflects what the alarm is watching.

Alarm Action: An SNS notification or Auto Scaling policy.

Metric Name: The metric being monitored, e.g. CPU or memory usage.

Statistic: Metric data aggregations collected over a specified period of time.

Period: Length of time associated with a specific statistic. Periods are expressed in seconds; the minimum granularity is one minute, so period values are multiples of 60.

Evaluation Period: The number of periods over which data is compared to the specified threshold.

Threshold: The value that the metric is evaluated against.

Comparison Operator: The operation to use when comparing the specified statistic and threshold. The statistic value is used as the first operand. Valid values: GreaterThanOrEqualToThreshold, GreaterThanThreshold, LessThanThreshold, LessThanOrEqualToThreshold.

Dimensions: Name/value pairs that provide additional information to allow you to uniquely identify a metric.
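To make the relationship between statistic, threshold, comparison operator and evaluation periods a little more concrete, here is a rough sketch in Python. It is only a simplified model of how an alarm arrives at its state, not CloudWatch's actual implementation; the function name `alarm_state` and the use of `None` for a period with no data are my own assumptions for illustration.

```python
import operator

# The four CloudWatch comparison operators, mapped to Python equivalents.
COMPARISONS = {
    "GreaterThanThreshold": operator.gt,
    "GreaterThanOrEqualToThreshold": operator.ge,
    "LessThanThreshold": operator.lt,
    "LessThanOrEqualToThreshold": operator.le,
}


def alarm_state(datapoints, threshold, comparison, evaluation_periods):
    """Return "OK", "ALARM" or "INSUFFICIENT_DATA" (simplified model).

    datapoints holds one aggregated statistic per period, oldest first.
    None stands for a period in which no data arrived (an assumption of
    this sketch, not a CloudWatch convention).
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(v is None for v in recent):
        return "INSUFFICIENT_DATA"
    breached = COMPARISONS[comparison]
    # The statistic is the first operand, the threshold the second.
    if all(breached(value, threshold) for value in recent):
        return "ALARM"
    return "OK"


# One evaluation period, threshold of 90 percent CPU.
print(alarm_state([40, 95], 90, "GreaterThanThreshold", 1))
print(alarm_state([95, 40], 90, "GreaterThanThreshold", 1))
print(alarm_state([95, None], 90, "GreaterThanThreshold", 1))
```

The three calls give ALARM, OK and INSUFFICIENT_DATA respectively, because only the most recent period is evaluated when the evaluation period count is 1.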



Health check Parameters:

The healthiness of your instances is used by Auto Scaling to trigger the termination of an instance.

Healthy Threshold: Number of consecutive health check successes before declaring an instance healthy.

Unhealthy Threshold: Number of consecutive health check failures before declaring an instance unhealthy.

Interval: The interval in seconds between successive health checks.

Timeout: Amount of time in seconds during which no response indicates a failed health check. This value must be less than the interval value.

Target: TCP or HTTP check against an instance. This is used to determine the health of an instance.

For an HTTP check: any answer other than “200 OK” within the timeout period is considered unhealthy.

For a TCP check: attempts to open a TCP connection to the instance on the specified port. Failure to connect within the configured timeout is considered unhealthy.






Trigger Parameters:

Metric name: The metric being monitored.

Namespace: Conceptual containers for metrics. Namespaces ensure that metrics in different namespaces are isolated from each other.

Statistic: Metric data aggregations collected over a specified period of time.

Period: Length of time associated with a specific statistic. Periods are expressed in seconds; the minimum granularity is one minute, so period values are multiples of 60.

Unit: The statistic's unit of measurement. Example values: percent, bytes, seconds etc., depending on the metric being measured.

Upper Breach Scale Increment: The incremental amount to scale by when the upper threshold has been breached.

Lower Breach Scale Increment: The incremental amount to scale by when the lower threshold has been breached.

Auto Scaling Group name: Name of the Auto Scaling group the trigger is attached to.

Breach Duration: How long a breach must be sustained before it triggers an action.

Upper Threshold: The upper limit of the metric. The trigger fires if all data points in the last BreachDuration period exceed the upper threshold or fall below the lower threshold.

Lower Threshold: The lower limit of the metric. The trigger fires if all data points in the last BreachDuration period fall below the lower threshold or exceed the upper threshold.

Dimensions: Name/value pairs that provide additional information to allow you to uniquely identify a metric.






Auto Scaling Group Parameters:

Availability Zones: The availability zones that are available for the group to start an instance in. Example values: eu-west-1a, eu-west-1c.

Cooldown: The time in seconds after one scaling action completes before another scaling activity can start.

Desired Capacity: Specifies the number of instances the Auto Scaling group will endeavour to maintain.

Launch Configuration Name: The name of the associated Launch Configuration.

Load Balancer Names: Name of the Load Balancer the Auto Scaling group is attached to.

Max Size: Maximum number of instances that the Auto Scaling group can have associated with it.

Min Size: Minimum number of instances that the Auto Scaling group will have associated with it.
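As a rough illustration of how the minimum size, maximum size and cooldown constrain a scaling action, here is a small Python sketch. It assumes the adjustment is a relative change in the number of instances; both function names are made up for this example and are not part of any AWS tool or API.

```python
def apply_adjustment(current, adjustment, min_size, max_size):
    """Return the new instance count, kept between min_size and max_size.

    Assumes `adjustment` is a relative change (e.g. +1 or -1).
    """
    return max(min_size, min(max_size, current + adjustment))


def cooldown_elapsed(seconds_since_last_activity, cooldown):
    """True once enough time has passed for another scaling activity to start."""
    return seconds_since_last_activity >= cooldown


# A group sized between 1 and 3 instances.
print(apply_adjustment(2, 1, 1, 3))   # scale up from 2
print(apply_adjustment(3, 1, 1, 3))   # already at the maximum
print(apply_adjustment(1, -1, 1, 3))  # already at the minimum
print(cooldown_elapsed(120, 300))     # still inside a 300 second cooldown
```

However many times the scale-up or scale-down action fires, the group never leaves the range set by its minimum and maximum size.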



A policy definition:

Policies are usually created in pairs: one for scaling up and one for scaling down.

To create a policy that scales down by 1 from the command line:

# When scaling down, decrease capacity by 1

% as-put-scaling-policy my-group --name "scale-down" --adjustment -1 --type Absolute


To list policies from the command line and get their ARNs:

as-describe-policies autoscaling-group


Putting it all together

So now we know what the components are and the associated parameters that can be used to put together an appropriate monitoring solution using CloudWatch. To illustrate how to start putting things together I’ll use CloudFormation. You can also use the command line tools and the console to do much of what comes next.

Using Alarms:

Metrics are collected for EC2 instances, ELBs, EBS volumes and RDS, and you also have the flexibility to use custom metrics. Alarms can be set on any of these metrics. An alarm is always in one of three states: OK, ALARM or INSUFFICIENT_DATA. When a metric breaches a predetermined threshold the alarm moves to the ALARM state. An action can be attached to each transition from one state to another; the action can be publication to an SNS topic or an Auto Scaling action. The CloudFormation snippets below set up an alarm that fires when CPU utilisation breaches a defined threshold or the metric disappears, with the action being publication to an SNS topic that sends an email:

"AlarmTopic" : {
      "Type" : "AWS::SNS::Topic",
      "Properties" : {
        "Subscription" : [ {
          "Endpoint" : { "Ref" : "OperatorEmail" },
          "Protocol" : "email"
        } ]
      }
},

"CPUAlarmHigh" : {
      "Type" : "AWS::CloudWatch::Alarm",
      "Properties" : {
        "AlarmDescription" : "Alarm if CPU too high or metric disappears indicating instance is down",
        "AlarmActions" : [ { "Ref" : "AlarmTopic" } ],
        "InsufficientDataActions" : [ { "Ref" : "AlarmTopic" } ],
        "MetricName" : "CPUUtilization",
        "Namespace" : "AWS/EC2",
        "Statistic" : "Average",
        "Period" : "60",
        "EvaluationPeriods" : "1",
        "Threshold" : "90",
        "ComparisonOperator" : "GreaterThanThreshold",
        "Dimensions" : [ {
          "Name" : "AutoScalingGroupName",
          "Value" : { "Ref" : "AppServerGroup" }
        } ]
      }
}





Using Auto Scaling Groups and Load Balancers:

This snippet describes an Auto Scaling group that will at any one time manage between 1 and 3 instances while endeavouring to maintain 2 instances.

"AppServerGroup" : {
      "Type" : "AWS::AutoScaling::AutoScalingGroup",
      "Properties" : {
        "AvailabilityZones" : { "Fn::GetAZs" : "" },
        "LaunchConfigurationName" : { "Ref" : "AppServerLaunchConfig" },
        "MinSize" : "1",
        "MaxSize" : "3",
        "DesiredCapacity" : "2",
        "LoadBalancerNames" : [ { "Ref" : "AppServerLoadBalancer" } ]
      }
}




In the snippet above the Auto Scaling group has an associated Launch Configuration, which is mandatory for an Auto Scaling group. It is also associated with a Load Balancer, which we’ll come to in a minute. In the alarm example you may have noticed that the Dimensions parameter refers to this Auto Scaling group. That configuration gives us an alarm monitoring the state of the instances managed by the Auto Scaling group.

The Load Balancer associated with the Auto Scaling group described above looks like:

"AppServerLoadBalancer" : {
    "Type" : "AWS::ElasticLoadBalancing::LoadBalancer",
    "Properties" : {
        "AvailabilityZones" : { "Fn::GetAZs" : { "Ref" : "AWS::Region" } },
        "Listeners" : [ {
            "LoadBalancerPort" : "80",
            "InstancePort" : { "Ref" : "TomcatPort" },
            "Protocol" : "HTTP"
        } ],
        "HealthCheck" : {
            "Target" : { "Fn::Join" : [ "", [ "HTTP:", { "Ref" : "TomcatPort" }, "/welcome" ] ] },
            "HealthyThreshold" : "5",
            "Timeout" : "5",
            "Interval" : "30",
            "UnhealthyThreshold" : "2"
        }
    }
}







The Load Balancer has been defined with a health check, which in this example is an HTTP check made every 30 seconds (the Interval). A check fails if the instance does not return a “200 OK” within 5 seconds (the Timeout). If this happens on 2 consecutive checks (the UnhealthyThreshold) the instance is marked as unhealthy. The instance then needs to respond with a “200 OK” 5 times in succession (the HealthyThreshold) to be marked as healthy again. The combination of interval and thresholds determines how long it takes for a change to be recognised, so in theory an ailing instance could carry on trying to respond for a period of time before it meets the criteria to be marked as unhealthy.
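To put some numbers on that, here is a small Python sketch using the values from the load balancer snippet (Interval 30, UnhealthyThreshold 2, HealthyThreshold 5). It is a simplified model of the consecutive-check rule rather than the load balancer's real behaviour, and the function names are my own. The timings are approximate, since the real delay also depends on when a failure starts relative to the check schedule and on the timeout.

```python
def seconds_to_change_state(interval, threshold):
    """Approximate time for `threshold` consecutive checks, `interval` seconds apart."""
    return interval * threshold


def track_health(results, healthy_threshold, unhealthy_threshold, healthy=True):
    """Replay check results (True = "200 OK" within the timeout) and return the final state."""
    ok_streak = 0
    fail_streak = 0
    for ok in results:
        if ok:
            ok_streak += 1
            fail_streak = 0
            if ok_streak >= healthy_threshold:
                healthy = True
        else:
            fail_streak += 1
            ok_streak = 0
            if fail_streak >= unhealthy_threshold:
                healthy = False
    return healthy


print(seconds_to_change_state(30, 2))  # roughly how long before marked unhealthy
print(seconds_to_change_state(30, 5))  # roughly how long before marked healthy again
print(track_health([False], 5, 2))
print(track_health([False, False], 5, 2))
print(track_health([False, False, True, True, True, True], 5, 2))
```

With these settings an instance that stops responding stays in service for about a minute, and a recovered instance needs about two and a half minutes of successful checks before it is considered healthy again.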

You can also associate alarms with the Load Balancer, as in the snippet below, where an alarm has been defined that notifies you if there are too many unhealthy hosts:

"TooManyUnhealthyHostsAlarm" : {
      "Type" : "AWS::CloudWatch::Alarm",
      "Properties" : {
        "AlarmDescription" : "Alarm if there are too many unhealthy hosts.",
        "AlarmActions" : [ { "Ref" : "AlarmTopic" } ],
        "InsufficientDataActions" : [ { "Ref" : "AlarmTopic" } ],
        "MetricName" : "UnHealthyHostCount",
        "Namespace" : "AWS/ELB",
        "Statistic" : "Average",
        "Period" : "60",
        "EvaluationPeriods" : "1",
        "Threshold" : "0",
        "ComparisonOperator" : "GreaterThanThreshold",
        "Dimensions" : [ {
          "Name" : "LoadBalancerName",
          "Value" : { "Ref" : "AppServerLoadBalancer" }
        } ]
      }
}






Triggers and Auto Scaling Policies:

We’ve looked at defining alarms that publish to an SNS topic on a change of state. Now, as the last part of this post, we’ll have a look at how to effect an Auto Scaling action. This can be achieved by using a trigger or by using an Auto Scaling policy.

Triggers, when defined, are very similar to alarms but with extra Auto Scaling policies incorporated.

In the snippet below a trigger is defined that monitors the average CPU utilization of the EC2 instances managed by the Auto Scaling group.

"CPUBreachTrigger" : {
      "Type" : "AWS::AutoScaling::Trigger",
      "Properties" : {
         "AutoScalingGroupName" : { "Ref" : "AppServerGroup" },
         "Dimensions" : [ {
            "Name" : "AutoScalingGroupName",
            "Value" : { "Ref" : "AppServerGroup" }
         } ],
         "MetricName" : "CPUUtilization",
         "Namespace" : "AWS/EC2",
         "Period" : "60",
         "Statistic" : "Average",
         "UpperThreshold" : "90",
         "LowerThreshold" : "20",
         "BreachDuration" : "120",
         "UpperBreachScaleIncrement" : "1",
         "LowerBreachScaleIncrement" : "-1"
      }
}




In the example snippet, if the average CPU utilization breaches the upper or lower threshold and the breach is sustained for 120 seconds, the Auto Scaling group will scale up or down by 1 instance accordingly.
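The decision the trigger makes can be sketched in a few lines of Python. This is only my simplified reading of the rule "all data points in the last BreachDuration period breach a threshold", using the values from the snippet (Period 60, BreachDuration 120, thresholds 90 and 20); the function name and argument layout are invented for illustration and are not how Auto Scaling is implemented.

```python
def trigger_adjustment(datapoints, period, breach_duration,
                       upper_threshold, lower_threshold,
                       upper_increment, lower_increment):
    """Return the scaling increment for the latest breach window, or 0 for no action.

    datapoints holds one averaged value per period, oldest first.
    """
    # Number of consecutive periods that make up the breach duration.
    points_needed = max(1, breach_duration // period)
    window = datapoints[-points_needed:]
    if len(window) < points_needed:
        return 0  # not enough data yet to cover the breach duration
    if all(value > upper_threshold for value in window):
        return upper_increment
    if all(value < lower_threshold for value in window):
        return lower_increment
    return 0


# Average CPU per 60 second period, most recent last.
print(trigger_adjustment([50, 95, 96], 60, 120, 90, 20, 1, -1))  # sustained high CPU
print(trigger_adjustment([95, 50], 60, 120, 90, 20, 1, -1))      # breach not sustained
print(trigger_adjustment([10, 15], 60, 120, 90, 20, 1, -1))      # sustained low CPU
```

A single spike above 90 percent does nothing; the breach has to cover both 60 second periods in the 120 second window before the group gains or loses an instance.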

Having defined a set of Auto Scaling policies via the command line as described earlier in this post, a policy can apparently be referenced by an alarm, using its ARN, as the alarm's action on changing state. However, I was unable to figure out how to do this via CloudFormation, as you cannot create an Auto Scaling policy that is not attached to an Auto Scaling group, and you cannot create a standalone policy to attach later. So as things stand today, doing this from the command line requires creating the Auto Scaling group first and then using a command similar to the one below to attach the policy:

# When scaling up, increase capacity by 1

C:\> as-put-scaling-policy AppServerGroup --name "scale-up" --adjustment 1 --type Absolute


I am hoping the ability to create Auto Scaling policies as part of a CloudFormation template will be added as future functionality to the CloudFormation API.