Windows in the cloud: a first-class citizen

The perception is that Windows instances in the cloud are often treated as second-class citizens. This just isn't true. Both Opscode and Puppet Labs have made great strides in making their configuration management tools 'Windows friendly' (disclaimer: I've used Chef with Windows but have no actual experience with Puppet). To add to this, Amazon Web Services introduced CloudFormation-friendly Windows base AMIs. The combination of these AMIs with the more Windows-friendly configuration management tools means you really can treat Windows instances as you would Linux instances and manage both with the same tools.

You can use PowerShell as you normally would, so the learning curve isn't as steep as you'd expect as a Windows administrator. Go on, give it a go.

If you have an estate made up of both Windows and Linux, starting from a point where you can use the same tools to manage both environments makes life easier for your Operations/DevOps team, or whatever label you place on the team that makes sure your systems are up and running each day.

One tool to manage them all

 

I've been waiting to give the chef-client MSI a try ever since I noticed it had been released. I wanted to see if it really has made the numerous (albeit fairly straightforward) steps to get chef-client working on Windows 2008 R2 that much easier. After all, the easier it becomes, the more converts there will be as the barriers to adoption are removed.

Running the MSI is simple. It takes care of installing Ruby (version 1.9.2-p290) and chef-client. Now all you need to do is set up a couple of files to allow your client to authenticate with your Chef server, as detailed quite nicely here:

http://wiki.opscode.com/display/chef/Installing+Chef+Client+on+Windows

That's it, you're good to go. First impressions: a big thumbs up.

I then had a quick look into how things have improved in the Windows recipe development department. I started by checking out

the Opscode-supported windows cookbook.

This is looking really promising, as installing roles and features, and more importantly installing MSIs, can be treated in the same way as you would install services and packages on Linux. That means you could have one person capable of writing high-level recipes for both platforms. You will always need someone who understands the target OS, but it means you can get admin staff using Chef (yes, I know I'm talking about the hosted version) and it doesn't really matter which OS they are more comfortable with. Opscode, in my humble opinion, have removed a layer of obstruction to adoption with the work they've done here.

Blathering on

Last night (8th September 2011) I had the privilege of being interviewed, together with my ex-colleague and good friend James, by Richard & Carl of .NET Rocks! to talk about DevOps: how we see it evolving, what we think it means, and the effect it is having on the industry, just for starters. The fact that this is on a show aimed at .NET developers is a statement in itself.

It's due to be broadcast on 13th September 2011. It will be interesting to see what folks think, as it's not the usual type of subject covered. James and I are both passionate about the subject, and I must confess I kept forgetting during the interview that I wasn't there just to listen to James; his eloquence on the subject, with or without the DevOps label, is always compelling. Please feel free to blame Howard of endjin for suggesting letting James and me loose on your ears though :-)

The interview is here:  .NET Rocks! Show 697

Targeting Windows 2008 R2 nodes using Chef

Just a quick note.

I'd advise sticking to Ruby 1.8.7-p344 on the target node if you are targeting Windows 2008 R2. I recently revisited targeting Windows 2008 R2 and found that using the latest version of Ruby, 1.9.2-p180, on the Windows 2008 R2 target node and attempting to run chef-client after installing the chef gem is a proverbial pain. I'm not sure if Opscode are looking into this, but it's easy to reproduce the pain :-)

Using AWS for DR when your solution is not in the Cloud

In my previous post in this series on resilience of non-cloudy solutions, I discussed how to establish exactly what is acceptable to the business in order to achieve an appropriate DR solution. In this post I will look, at a fairly high level, at how to exploit AWS to help provide a cost-effective DR solution when your solution does not actually use AWS resources and is probably not designed in the decoupled manner that would make it easy to deploy to the cloud.

Yes I know, I know, I can't help it; the cloud is here after all :-)

Please note that I've needed to keep this at a high level by necessity; if I were to start exploring the detailed configuration options, I'd still be writing this post by Christmas. Needless to say, this post just scratches the surface, but hopefully it provides some food for thought.

You will, or should, have local resilience in your solution, consisting of multiple application and web servers, clustered database servers and load balancers.

The easiest DR solution to implement, but the most costly, is to replicate this setup, albeit with perhaps fewer servers and a single database server instance, at an alternative physical location, and to put processes in place to replicate data across to the second location.

This typical configuration looks something like this:

std dc replicated

There are plenty of variations on this, but in the end it entails physically maintaining a distinct location which replicates the application architecture and associated security controls. Resources need to be in place to support that location, keep the components regularly updated and act upon all the usual best practices to validate the solution. There's no point finding out the solution doesn't work when you need it.

 

At this point you should hopefully be thinking that is a lot of investment for something that will only rarely be used. So here's where AWS can help keep those costs down.

 

The first model, which I've called the 'halfway house', may be an option for those who are unable to make use of the full range of AWS resources available and who, for whatever reason, are unable or unwilling to store their data there. It still requires two maintained DCs, but it saves costs by having the resilient application and web servers be AWS instances. The cool thing here is that those resilient servers/instances are not actually operational unless needed (you would have prepped AMIs and would hopefully use them in conjunction with a configuration management tool to ensure they are fully up to date when launched). You will not have the overhead of watering and feeding them that you would have if you were 100% responsible for the infrastructure. The core AWS components that make this work are EC2, VPC and ELB. If you wanted, there is also the potential to use Route 53 to manage the DNS aspects needed for external routing. There are issues with this model, though, such as the possibility of a lack of capacity when you need to spin up those instances (although the use of multiple AZs and regions should overcome that fear), the overhead of managing three sets of resources, and latency issues, to name just three that come to mind.

The 'halfway house' will look something like this:

 

Part use of AWS

Making use of AWS VPC means you can create virtual networks built upon the AWS infrastructure, which gives you a great range of networking configurations. For example, in the diagram above I've shown two groups of instances: one that is externally accessible and another set that is basically an extension of your private LAN. There are far too many possible scenarios with just these features of AWS, and obviously every application is different (see why I made sure this post was kept at a high level?).

The nirvana, though, for really seeing the costs tumble is to get rid of DC 2 and use AWS as the recovery site; as a bonus, it can also be used for extra processing needs on demand. This not only reduces the support overhead and saves cost, as you are no longer committed to paying for a second location with all the associated kit necessary to make it a viable alternative site, but it also provides a wide variety of failover and recovery options that you just won't get when you have to commit to infrastructure up front (hopefully that pre-empts the question about why not a private cloud: you need your own platform).

This model, which I've called the 'Big Kahuna', can look a little like this:

 

big khauna

With the 'Big Kahuna' you should make use of any of the AWS resources available. In the flavour above I'm using S3 to store regular snapshots, transaction logs etc. from my primary database. Why not replicate directly? Well, S3 is cheap storage, and in the scenario I'm illustrating my RTO and RPO values allow enough delay between failure and recovery for me to reconstruct the database, when needed, from the data stored in my S3 bucket. Regular reconstruction exercises should occur, though, as part of the regular validation of the failover processes. AMIs and a configuration management solution (as it's me, it will be Chef) are used to provision up-to-date application and web servers. Route 53 is used to facilitate DNS management, and where I need to ensure traffic is kept internal I'm making use of VPC.

The introduction of RDS for Oracle means it is viable for enterprises to use AWS as the failover solution. There may be concerns over performance, but this is a DR situation: if you are not in a position to re-engineer for the cloud, then reduced performance should form part of the business impact discussions with internal business sponsors.

AWS has services such as dedicated instances, which may be the only way your security and networking guys will allow you to exploit AWS resources, but you would need to do your sums to see if it makes sense to do so. Personally, I'd focus on trying to understand the 'reasons' for this. There are a number of valid areas where this would be required, but I suspect cost isn't really going to be any sort of driving force there.

The devil is in the detail when designing a failover solution utilising AWS as part of your DR. If you are planning a new solution, make sure you talk to the software architect about best practices when designing for the cloud; they're still applicable to on-premise solutions too.

Data is really where all the pain points are, and it will likely dictate the model and ultimate configuration.

If you are trying to retrofit an existing solution, then the options open to you may not be that many, and it's likely you will have to start off with some form of the 'halfway house'.

Also, don't forget you can just try stuff out at minimal cost. Wondering if a particular scenario would work? Just try it out; you can delete everything after you've tried it.

The cost-effectiveness of the solution is directly related to the use you make of AWS resources. I even have a graph to illustrate (@jamessaull would be proud of me).

 

awsdr graph

This graph is based on very rough comparative costs, starting off with no AWS resources, as in the first situation I discussed, and working my way down to the 'Big Kahuna'. You can easily do your own sums: AWS pricing is on their site (they even provide a calculator), and you know how much those servers, licences, networking hardware, hardware maintenance and support contracts cost you.

How to deal with the 'I want no downtime' response

With a little time between jobs, and having to dodge the rain, I found myself thinking about some stuff I've not given a lot of deep thought recently, so here's the first of a couple of posts on resiliency as applied to non-cloudy solutions.

I recall numerous occasions where I would design a highly resilient solution that would provide as many nines as I'd dare to commit to, only for the proposal to be back on my desk some weeks later with the words 'too expensive, where can we save money on it?'

The main reason I would come up against this is that I would ask the business sponsor two key questions related to disaster recovery as part of the business impact analysis:

What is their RTO? Recovery Time Objective: the duration of time from the point of failure within which a business process must be restored after a disaster or disruption.

What is their RPO? Recovery Point Objective: the acceptable amount of data loss, measured in time.

These two questions would inevitably elicit the response 'no data loss and restore service as quickly as possible', so I would go off and design a platform to get as close as possible to those unattainable desires.

I soon learnt that a different approach was needed to prevent the paper bounce game, eventually coming up with gold, silver and bronze resiliency options as part of the platform proposal. We would include what we felt were viable resiliency options, graded according to arbitrary RTO and RPO levels, together with the associated costs to deliver on them.

This at least meant the business sponsor had somewhere to start from in terms of a monetary value, and they could see that wanting the gold standard was going to cost them far more than if they really thought it through. This got them thinking about what they really wanted in terms of RTO and RPO, and we would then discuss what options were open to them.

For example, does the accounting system need to be restored within two hours with less than half an hour's loss of data? In my experience it's only at month-end processing that this level of recovery is really needed. During the majority of the month, data is held elsewhere and so can be easily recreated, and the month's accounts file is still open, so this isn't an issue in terms of processing. Is anyone working on the system at weekends? This is the sort of thought process the business needs to go through when requesting new systems and trying to figure out what they want in terms of resiliency.

When faced with stark questions about RTOs and RPOs, the natural response is 'I want my system to be totally resilient with no downtime'; they have no idea what this means in terms of resources and thus potential cost, so why not save time by giving them some options? You may be lucky and one of the options will be an exact fit, but if not, the sponsor at least has an idea of what it costs to provide their all-singing, all-dancing requirements. It's more likely they will say something like "the silver option may do but we need an enhanced level of support at month end", or maybe "we need to make sure we have a valid tested backup from the previous night at month end".

The beauty of this approach is that straight away you've engaged them in conversation and prevented a paper bounce. It's not a new approach, and it's one service teams are used to when dealing with external suppliers, but there is no reason not to use the same methods with internal requests.

CloudFormation deletion policies: an important addition

The CloudFormation team made a forum announcement on the 31st May detailing the latest enhancements. In the list was the feature I'd been waiting for: the introduction of resource deletion policies. Until this feature was introduced, I had been loath to use CloudFormation to create certain resources.

Why was I concerned? Well, it boils down to the fact that we are all subject to human error. You can just imagine the poor person who decides to remove a stack for valid reasons, say they were doing rolling upgrades, so they have brought up a replacement stack and want to remove the existing one, but have forgotten that when they deployed the original stack oh so many months ago it also created their initial database infrastructure (I'm using RDS to illustrate the point here, but it could just as easily have been a NoSQL deployment on an EC2 instance), and it would be goodbye to all their data.

So how does it work?

The DeletionPolicy is an attribute you can add to your resource definitions which tells CloudFormation how to handle the deletion of that resource. The default behaviour is to just delete it.

The three values that a DeletionPolicy can have are:

Delete – the default behaviour, but it may be prudent to add this attribute explicitly to all your resources as part of your self-documentation.

Retain – directs CloudFormation to keep the resource and any associated data/content after the stack has been deleted.

The above two values are applicable to any resource.

Snapshot – only applicable to resources that support snapshots, namely EBS volumes and RDS. The actual resource will be deleted, but the snapshot will exist after the stack has been deleted.
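As a rough sketch of how this looks in a template (the resource name and property values here are invented purely for illustration, not taken from the announcement), the attribute sits at the same level as the resource's Type and Properties:

"MyDatabase" : {
    "Type" : "AWS::RDS::DBInstance",
    "DeletionPolicy" : "Snapshot",
    "Properties" : {
        "Engine" : "MySQL",
        "DBInstanceClass" : "db.m1.small",
        "AllocatedStorage" : "5",
        "MasterUsername" : "admin",
        "MasterUserPassword" : { "Ref" : "DBPassword" }
    }
}

With DeletionPolicy set to Snapshot, deleting the stack removes the RDS instance but leaves a final snapshot behind; set it to Retain and the instance itself is kept.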

A quick mention of some of the other new features that have caught my eye:

Parameter validation – pretty self-evident why this was a must-have feature :-)

Wait condition – provides the ability to pause stack creation until some predefined action has completed or a timeout has occurred. As an example, this could be used to fully automate the creation of a master/slave set-up where, say, the master's IP address is needed before the slaves can join the party. (Rough sketches of these features follow below.)

Ability to create S3 buckets and S3-hosted websites – I love the idea of creating your S3-hosted website via a simple script.
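To give a feel for these three, here are some rough sketches of how they might look in a template; the resource names, parameter names and values are assumptions made up for illustration, not taken from the announcement. Parameter validation is declared on the parameter itself:

"InstanceType" : {
    "Type" : "String",
    "Default" : "m1.small",
    "AllowedValues" : [ "m1.small", "m1.large", "m1.xlarge" ],
    "Description" : "EC2 instance type to launch"
}

A wait condition pairs a handle with the condition itself; in the master/slave example, the master instance (here a hypothetical resource called MasterInstance) would post to the handle's pre-signed URL from its user data once it knows its address, and stack creation pauses until that signal or the timeout:

"MasterReadyHandle" : {
    "Type" : "AWS::CloudFormation::WaitConditionHandle"
},

"MasterReady" : {
    "Type" : "AWS::CloudFormation::WaitCondition",
    "DependsOn" : "MasterInstance",
    "Properties" : {
        "Handle" : { "Ref" : "MasterReadyHandle" },
        "Timeout" : "900"
    }
}

And an S3 bucket configured as a simple website (with a Retain deletion policy so the content survives stack deletion):

"WebsiteBucket" : {
    "Type" : "AWS::S3::Bucket",
    "DeletionPolicy" : "Retain",
    "Properties" : {
        "AccessControl" : "PublicRead",
        "WebsiteConfiguration" : {
            "IndexDocument" : "index.html",
            "ErrorDocument" : "error.html"
        }
    }
}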

An aide-mémoire on monitoring using CloudWatch & CloudFormation on AWS

It can be confusing when it comes to setting up auto scaling rules, alarms and load balancing health checks, so I wanted to take a little time to look at how to fit the bits together to get an effective, proactive monitoring solution using just CloudWatch. Sorry, this is a longish post, but at least it's all in one place :-)

AWS does provide a lot of information, but it is scattered about and wading through it can be time-consuming, so hopefully this will be a useful introduction.

A few definitions are a good place to start.

Definitions

Alarms:

An alarm is exactly what it says: a watcher that provides notification that an AWS resource has breached one of the thresholds assigned against a specific metric. (Note that you are now able to publish custom metrics as well as use the built-in CloudWatch metrics, and to use these for Auto Scaling actions too.)

Health checks:

A health check is a check on the state of an instance which is part of an Auto Scaling group. If an instance is detected as having degraded performance, it is marked as unhealthy.

Auto Scaling Policy:

A policy defines what action the AutoScaling group should take in response to an alarm.

Triggers:

A trigger is a combination of an Auto Scaling policy and an Amazon CloudWatch alarm. Alarms are created that monitor specific metrics gathered from EC2 instances. Pairing the alarm with a policy can initiate an Auto Scaling action when the metric breaches a specific threshold.

Launch Configuration:

The definitions (parameters) needed to instantiate new EC2 instances. These include values such as which AMI to use, the instance size, user data to be passed and EBS volumes to be attached. A launch configuration is used together with an Auto Scaling group. An Auto Scaling group can only have one launch configuration attached to it at any one time, but you can replace the launch configuration.

AutoScaling Group:

An Auto Scaling group manages a set of one or more instances. It works in conjunction with a launch configuration and triggers to enact scaling actions. The launch configuration tells it what the instances should look like, and the triggers tell it how to react to particular situations.
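Since the CloudFormation snippets later in this post refer to a launch configuration called AppServerLaunchConfig without showing it, here's a rough sketch of what one might look like (the AppServerAMI parameter, key name and security group reference are assumptions for the example, not part of the templates discussed later):

"AppServerLaunchConfig" : {
    "Type" : "AWS::AutoScaling::LaunchConfiguration",
    "Properties" : {
        "ImageId" : { "Ref" : "AppServerAMI" },
        "InstanceType" : "m1.small",
        "KeyName" : { "Ref" : "KeyName" },
        "SecurityGroups" : [ { "Ref" : "AppServerSecurityGroup" } ]
    }
}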

 

Component breakdown

Alarm Parameters:

Alarm name – a name that typically reflects what the alarm is watching. Example: CPUHighAlarm

Alarm Action – an SNS notification or an Auto Scaling policy.

Metric Name – the metric being monitored, e.g. CPU or memory usage. Example: CPUUtilization

Statistic – the metric data aggregation collected over a specified period of time. Example: Average

Period – the length of time associated with a specific statistic. Periods are expressed in seconds; the minimum granularity is one minute, and period values are expressed as multiples of 60. Example: 60

Evaluation Period – the number of periods over which data is compared to the specified threshold. Example: 1

Threshold – the value that the metric is evaluated against. Example: 30

ComparisonOperator – the operation used when comparing the specified Statistic and Threshold; the Statistic value is used as the first operand. Valid values: GreaterThanOrEqualToThreshold, GreaterThanThreshold, LessThanThreshold, LessThanOrEqualToThreshold. Example: GreaterThanThreshold

Dimensions – name/value pairs that provide additional information to allow you to uniquely identify a metric.

 

 

Health check Parameters:

The health of your instances is used by Auto Scaling to trigger the termination of an instance.

Healthy Threshold – the number of consecutive health check successes before declaring an instance healthy. Example: 5

Unhealthy Threshold – the number of consecutive health check failures before declaring an instance unhealthy. Example: 2

Interval – the interval in seconds between successive health checks. Example: 120

Timeout – the amount of time in seconds during which no response indicates a failed health check. This value must be less than the interval value. Example: 60

Target – the TCP or HTTP check made against an instance, used to determine its health. For an HTTP check, any answer other than "200 OK" within the timeout period is considered unhealthy; for a TCP check, a connection is attempted to the instance on the specified port, and failure to connect within the configured timeout is considered unhealthy. Examples: HTTP:80/home/index.html, TCP:8080

 

Trigger Parameters:

Metric name – the metric being monitored. Example: CPUUtilization

Name Space – a conceptual container for metrics, ensuring that metrics in different namespaces are isolated from each other. Examples: AWS/EC2, AWS/AutoScaling, AWS/EBS, AWS/RDS, AWS/ELB

Statistic – the metric data aggregation collected over a specified period of time. Examples: Average, Minimum, Maximum, Sum

Period – the length of time associated with a specific statistic. Periods are expressed in seconds; the minimum granularity is one minute, and period values are expressed as multiples of 60. Example: 300

Unit – the statistic's unit of measurement. Examples: Percent, Bytes, Seconds etc., depending on the metric being measured

Upper Breach Scale Increment – the incremental amount to scale by when the upper threshold has been breached. Example: 1

Lower Breach Scale Increment – the incremental amount to scale by when the lower threshold has been breached. Example: -1

Auto Scaling Group name – the name of the Auto Scaling group the trigger is attached to. Example: WebServerGroup

Breach Duration – defines how long the breach can be sustained before it triggers an action. Example: 500

Upper Threshold – the upper limit of the metric. The trigger fires if all data points in the last BreachDuration period exceed the upper threshold or fall below the lower threshold. Example: 90

Lower Threshold – the lower limit of the metric. The trigger fires if all data points in the last BreachDuration period fall below the lower threshold or exceed the upper threshold. Example: 20

Dimension – name/value pairs that provide additional information to allow you to uniquely identify a metric. Examples: Name: AutoScalingGroupName, Value: WebServerGroup; Name: Webserver, Value: ProductionServer

 

Auto Scaling Group Parameters

AvailabilityZones – the availability zones that are available for the group to start an instance in. Example: eu-west-1a, eu-west-1c

CoolDown – the time in seconds after one scaling action completes before another scaling activity can start. Example: 60

DesiredCapacity – the number of instances the Auto Scaling group will endeavour to maintain. Example: 2

LaunchConfigurationName – the name of the associated launch configuration. Example: LaunchMyInstances

LoadBalancerName – the name of the load balancer the Auto Scaling group is attached to. Example: LoadBalancerforMyInstances

MaxSize – the maximum number of instances that the Auto Scaling group can have associated with it. Example: 3

MinSize – the minimum number of instances that the Auto Scaling group will have associated with it. Example: 1

 

A policy definition:

Policies are usually paired: one for scaling up and one for scaling down.

To create a policy that scales down by 1  from the command line:

# When scaling down, decrease capacity by 1

% as-put-scaling-policy my-group --name "scale-down" --adjustment -1 --type Absolute

 

To list policies from the command line and get the ARN:

as-describe-policies autoscaling-group

 

Putting it all together

So now we know what the components are and the parameters that can be used to put together an appropriate monitoring solution using CloudWatch. To illustrate how to start putting things together I'll use CloudFormation, although you can use the command line tools and the console to do much of what comes next.

Using Alarms:

Metrics can be collected for EC2 instances, ELBs, EBS volumes and RDS, and there is the flexibility to use custom metrics too :-). Alarms can be set for any one of these metrics. Alarms exist in three states: OK, ALARM or INSUFFICIENT_DATA. When a metric breaches a predetermined threshold, the alarm is set to the ALARM state, and an action can be triggered on the transition from one state to another. The defined alarm action can be publication to an SNS notification topic or an Auto Scaling action. The CloudFormation snippets below illustrate setting up an alarm that fires when CPU utilisation breaches a defined threshold or the metric disappears, with the defined action being publication to an SNS topic that sends an email:

"AlarmTopic" : {
    "Type" : "AWS::SNS::Topic",
    "Properties" : {
        "Subscription" : [ {
            "Endpoint" : { "Ref" : "OperatorEmail" },
            "Protocol" : "email"
        } ]
    }
}

"CPUAlarmHigh" : {
    "Type" : "AWS::CloudWatch::Alarm",
    "Properties" : {
        "AlarmDescription" : "Alarm if CPU too high or metric disappears indicating instance is down",
        "AlarmActions" : [ { "Ref" : "AlarmTopic" } ],
        "InsufficientDataActions" : [ { "Ref" : "AlarmTopic" } ],
        "MetricName" : "CPUUtilization",
        "Namespace" : "AWS/EC2",
        "Statistic" : "Average",
        "Period" : "60",
        "EvaluationPeriods" : "1",
        "Threshold" : "90",
        "ComparisonOperator" : "GreaterThanThreshold",
        "Dimensions" : [ {
            "Name" : "AutoScalingGroupName",
            "Value" : { "Ref" : "AppServerGroup" }
        } ]
    }
}

 

 

Using Auto Scaling Groups and Load Balancers:

This snippet describes an Auto Scaling group that will at any one time manage between one and three instances, while endeavouring to maintain two instances.

"AppServerGroup" : {
    "Type" : "AWS::AutoScaling::AutoScalingGroup",
    "Properties" : {
        "AvailabilityZones" : { "Fn::GetAZs" : "" },
        "LaunchConfigurationName" : { "Ref" : "AppServerLaunchConfig" },
        "MinSize" : "1",
        "MaxSize" : "3",
        "DesiredCapacity" : "2",
        "LoadBalancerNames" : [ { "Ref" : "AppServerLoadBalancer" } ]
    }
},

 

In the snippet above, the Auto Scaling group has an associated launch configuration, which is mandatory for an Auto Scaling group. It is also associated with a load balancer, which we'll come to in a minute. In the alarm example you may have noticed that the Dimensions parameter refers to this Auto Scaling group, so that configuration gives us an alarm monitoring the state of the instances managed by the group.

The load balancer associated with the Auto Scaling group described above looks like this:

"AppServerLoadBalancer" : {
    "Type" : "AWS::ElasticLoadBalancing::LoadBalancer",
    "Properties" : {
        "AvailabilityZones" : { "Fn::GetAZs" : { "Ref" : "AWS::Region" } },
        "Listeners" : [ {
            "LoadBalancerPort" : "80",
            "InstancePort" : { "Ref" : "TomcatPort" },
            "Protocol" : "HTTP"
        } ],
        "HealthCheck" : {
            "Target" : { "Fn::Join" : [ "", [ "HTTP:", { "Ref" : "TomcatPort" }, "/welcome" ] ] },
            "HealthyThreshold" : "5",
            "Timeout" : "5",
            "Interval" : "30",
            "UnhealthyThreshold" : "2"
        }
    }
},

 

 

The load balancer has been defined with a health check, which in this example is an HTTP check. A check fails if the instance does not respond with a "200 OK" within the five-second timeout, and checks are made every 30 seconds (the interval). If two consecutive checks fail, the instance is marked as unhealthy; it then needs to respond successfully with a "200 OK" five times in succession to be marked as healthy again. The combination of interval and thresholds determines how long it takes for an instance's status to change, so in theory an unhealthy instance could carry on receiving traffic for a period (here up to around 60 seconds: two failed checks 30 seconds apart) until it meets the criteria to be marked as unhealthy.

You can also associate alarms with the load balancer, as in the snippet below, where an alarm has been defined that notifies you if there are too many unhealthy hosts:

"TooManyUnhealthyHostsAlarm" : {
    "Type" : "AWS::CloudWatch::Alarm",
    "Properties" : {
        "AlarmDescription" : "Alarm if there are too many unhealthy hosts.",
        "AlarmActions" : [ { "Ref" : "AlarmTopic" } ],
        "InsufficientDataActions" : [ { "Ref" : "AlarmTopic" } ],
        "MetricName" : "UnHealthyHostCount",
        "Namespace" : "AWS/ELB",
        "Statistic" : "Average",
        "Period" : "60",
        "EvaluationPeriods" : "1",
        "Threshold" : "0",
        "ComparisonOperator" : "GreaterThanThreshold",
        "Dimensions" : [ {
            "Name" : "LoadBalancerName",
            "Value" : { "Ref" : "AppServerLoadBalancer" }
        } ]
    }
}

 

Triggers and Auto Scaling Policies:

We've looked at defining alarms that publish to an SNS topic on a change of state; now, as the last part of this post, we'll look at how to effect an Auto Scaling action. This can be achieved by using a trigger or by using an Auto Scaling policy.

Triggers, when defined, are very similar to alarms but with the Auto Scaling policies incorporated.

In the snippet below, a trigger is defined that monitors the average CPU utilisation of the EC2 instances managed by the Auto Scaling group.

"CPUBreachTrigger" : {
    "Type" : "AWS::AutoScaling::Trigger",
    "Properties" : {
        "AutoScalingGroupName" : { "Ref" : "AppServerGroup" },
        "Dimensions" : [ {
            "Name" : "AutoScalingGroupName",
            "Value" : { "Ref" : "AppServerGroup" }
        } ],
        "MetricName" : "CPUUtilization",
        "Namespace" : "AWS/EC2",
        "Period" : "60",
        "Statistic" : "Average",
        "UpperThreshold" : "90",
        "LowerThreshold" : "20",
        "BreachDuration" : "120",
        "UpperBreachScaleIncrement" : "1",
        "LowerBreachScaleIncrement" : "-1"
    }
},

 

In the example snippet, if the average CPU utilisation breaches the upper or lower threshold and the breach is sustained for 120 seconds, the Auto Scaling group will scale up or down by one instance accordingly.

Having defined a set of Auto Scaling policies via the command line as described earlier in this post, a policy can apparently be referenced by an alarm, using its ARN, as the alarm's action on changing state. I was unable to figure out how you could do this via CloudFormation, though, as you cannot create an Auto Scaling policy that is not attached to an Auto Scaling group, and you cannot create a standalone policy that can be attached later. So, as things stand today, doing this would require creating the Auto Scaling group and then using a command similar to the one below to attach the policy:

# When scaling up, increase capacity by 1

C:\> as-put-scaling-policy AppServerGroup --name "scale-up" --adjustment 1 --type Absolute

 

I am hoping the ability to create Auto Scaling policies as part of a CloudFormation template will be added to the CloudFormation API in the future.

A few observations on the NIST DRAFT Cloud Computing Synopsis and Recommendations

NIST have produced a DRAFT Cloud Computing Synopsis and Recommendations.

The draft presents an overview of major classes of cloud technology, and provides guidelines and recommendations on how organisations  should consider the relative opportunities and risks of cloud computing.

Firstly, by 'organisations' I'm sure they really mean US government departments and, by association, companies providing services to them. I thought it worth close scrutiny as a result.

The document starts off with definitions of types of clouds and service models, where they note that IaaS-type services are more portable. Personally, I think these definitions are a little outdated, as the blurring of lines makes them redundant. With the introduction of 'bring your own PaaS'/containerised PaaS, this statement does not stand up to close inspection.

I've only got to page 10 of an 84-page document and already I have serious concerns about its applicability, but whatever, I'll move on.

The commercial terms of service section covers promises (SLAs), limitations and end-user obligations.

SLAs: nothing new here; basically you get SLAs that promise certain levels of availability, compensation for failure to perform, data protection and legal care of subscriber information with anything, not just cloud services. The same observation applies to the limitations section. The recommendations are due diligence recommendations which you'd follow anyway, regardless of what services you were engaging, cloud or not. The only interesting item in this section is this statement:

“Negotiated SLA. If the terms of the default SLA do not address all subscriber needs, the subscriber Should discuss modifications of the SLA with the provider prior to use.”

Making modifications to SLAs for the commodity computing resources the public cloud provides will, I suspect, be one conversation that takes too long and probably ends with no acceptable outcome, unless the supplier is willing to ring-fence resources and really can provide a superior service on request.

I do, however, love the section on general cloud environments; it cuts right to the chase with its opening paragraph:

Many individuals and organizations have made general statements about cloud computing, its advantages, and its weaknesses. It is important to understand, however, that the term “cloud computing” encompasses a variety of systems and technologies as well as service and deployment models, and business models. A number of claims that are sometimes made about cloud computing, e.g., that it “scales”, or that it converts capital expenses to operational expenses, are only true for some kinds of cloud systems.

Pretty much telling you to do your homework and not listen to the Marketing claims.

This section is actually quite a good read and will, I feel, be seen as controversial by some.

It describes, as well as I've seen it done, the general cloud environment concepts such as public, private and hybrid. It should also help alleviate the fear in some quarters of the IT department about their relevance, as it points out that you will still need IT skills. Yes, that is obvious, but it needs stating, as this message is not coming across through all the marketing, and we need to get rid of the FUD.

The statement about the significant-to-high costs of migrating to the cloud, plus the limitations on scalability of the onsite private model, will be an interesting one for 'private cloud' suppliers to negotiate when faced with it. Interestingly enough, an outsourced 'private cloud' is seen as having moderate-to-significant migration costs and more flexibility in terms of scaling.

The community cloud model I've struggled with as a concept, but I've started to appreciate where it can apply, such as colleges, educational institutes and charities, though I would still look at the public cloud model first (that's a discussion for another day).

The section on the public cloud I feel is out of date and lacking in detail. The generalisations made here are not applicable to certain providers and types of cloud solution. When they talk about the private cloud it's different, as those are the assets of the organisation, so it's possible, within limits, to lump them all together. As an example of my issues with this section, take the statement on the risks of multi-tenancy and data removal. I for one have come across inappropriate data removal processes with traditional data-centre-hosted solutions, so this concern isn't a new one and shouldn't be presented as such. The security of the public cloud will in many cases far exceed that found on premise, and with the maturity of some providers, lessons have been learnt and controls are in place to mitigate the perceived risks around multi-tenancy. Public cloud migration gets a low up-front cost rating, which makes sense, as there is no initial capital investment, unlike the 'private' model flavours.

The document then goes on to look at the other cloud environments, which is where I feel it struggles, and it is already out of date.

The SaaS model is obviously well understood; after all, it has been around a while, and the depth of detail and relative lack of concerns voiced reflects this. Note that the statement on data deletion says "Require that cloud providers offer a mechanism for reliably deleting data on a subscriber's request", whereas for the public cloud described in the general cloud environments section the tone reflects more concern: "As an example of this limitation, a subscriber cannot currently verify that data has been completely deleted from a provider's systems".

Unfortunately, the PaaS section is woefully out of date, quite abstract, and does not take into account the newer players on the market like Cloud Foundry and OpenShift, or even, at a stretch, Elastic Beanstalk from AWS. To make matters worse, Microsoft's Azure doesn't even fit their definition because of all the work involved in using its VM Role. If you look at a statement in the benefits section you can see what I mean:

PaaS providers are able to Manage the lower layers and relieve PaaS subscribers of the responsibility for selecting, installing,maintaining, or operating the platform components. Infrastructure charges are implicitly present in PaaS offerings because PaaS consumes infrastructure resources in some form, but the infrastructure charges are bundled in the rates charged for the PaaS execution environment resources (e.g., CPU, bandwidth,storage).”

This is the case with Microsoft Azure, but if you use, say, OpenShift it is not, as you are responsible for (as much as I hate using the terms PaaS and IaaS, they fit here) the underlying IaaS that the PaaS container needs to run on if you decide to use a non-multi-tenanted flavour. The lack-of-portability section is also not right, as in theory, with containerised PaaS, you can now move your application from one cloud to another if you are using the right set of building blocks, e.g. MongoDB or MySQL, or even take your PaaS container with you. This section is just plain inappropriate; I could go on and on here, but I still have another 40 pages to go.

The IaaS section has a nice description of hypervisors, but the scope-of-control diagram will only apply to those providers who stick purely to the model described, so AWS will not fall into this camp.

What I found odd when reading this, though, was that the relatively appropriate recommendations for considering these specific models didn't quite sit well with the public cloud recommendations they inevitably go with. That will confuse people, I reckon.

Many of the issues outlined in the open issues section I also found odd, as most are not specific to cloud computing. Latency is a concern wherever an application is hosted, and when designing for the cloud, measures to mitigate latency should be taken into account. One thing you can do is place your clients closer by putting them into the cloud too, just a thought; plus we are all getting used to the whole latency issue, what with the ever-moving march of technology introducing gadgets like smartphones. Network dependencies: well, I don't even know where to start with this one; every organisation experiences some sort of outage during the course of its business, and there is a fundamental reliance on the network, cloud or not.

My recommendation would have been: look at your internal uptime and reliability, ask yourself how you scale today when you need extra compute power, and then use these answers when making decisions.

The problem with generalisations is the very fact they are generalisations.

I do like the concept of having a standard format for SLAs which could be used in competitive tenders, as I have spent far too much time in the past comparing SLAs. The problem with this, though, is that no two cloud providers are alike in the services they offer. Whereas comparing traditional data centre hosters is easy, that isn't so easy today. Nice idea though.

The portability of workloads is, I feel, overstated; it's a matter of design. If you want absolute portability then don't use any of the non-portable features, which means you have to expend effort in not doing so, and you really do need to think about whether it makes sense not to take advantage of some pretty cool, albeit provider-specific, services. You could save development time and get access to some awesome availability and durability services by deciding not to be 'totally' portable, but have an exit strategy if you need one.

If you are just using virtual machines, there are already providers who offer the technology to let you migrate across clouds, so the statement on interoperability between clouds is another out-of-date one.

The other concerns all seem to be standard due diligence statements which the established cloud providers have already addressed in some form or other. I was interested in the support-for-forensics one (just because of my interest in IT security).

There are some typical costs included, which I find a strange thing to put in, as it really does depend on what you are going to do. There is also an appendix covering roles and responsibilities.

This may have a 'May 2011' date on it, but it is out of date already and fails to really understand the marketplace. It does, however, have a good description of the general cloud models; once it starts talking about SaaS, PaaS and IaaS it is on woollier ground, as the market isn't that clear-cut any more. The disconnect between the general cloud environments and the specific cloud models sections is confusing. I would hope this draft is revisited soon to make it a more relevant guideline, or else there are going to be some long, drawn-out, difficult conversations if this is where organisations start their journey to the cloud.

Loosely coupled Chef Cookbooks

I've been working on a MongoDB installation and configuration cookbook which allows me to install and, if required, apply custom configurations. It allows me to install and configure either a standalone MongoDB instance or a replica set.

Developing this cookbook (still a work in progress) has led me to take a loosely coupled approach, in that I did not want to force a dependency on any previous recipe. This means a number of rules need to be followed to use the cookbook properly, rather than imposing any constraints.

So why did I come to the conclusion that flexibility, and thus loose coupling, was a requirement for this particular cookbook?

The use of a replica set, and the fact that you may want to seed the MongoDB set-up with data from a backup, gave me food for thought. When spiking the various configuration scenarios, I found that if I updated my current master via a data dump, where I had decided to stop the master mongodb instance while I copied the data into the data folder, the data wasn't being replicated. This was because one of the slaves had taken over as the master, and you can't really force a master (without pain) in a replica set (maybe 10gen can advise on that one, although I guess if I'd made sure the owner of the files was correct before the copy I may not have hit the 'mongo having a fit' stage), so I needed to cater for that.

Seeding mongo before a replica set is created seems like a nice approach to me anyway.

For a standalone MongoDB instance I'm probably not worried about RAID, so the recipe to create the RAID device shouldn't be a constraint; and while I'm at it, why shouldn't you be able to set up a replica set that just uses instances with local storage?

I want to add recipes to create a job that takes regular backups, and maybe one to do a restore, but I may not always want to use them.

Suddenly the list of things I want my Mongo cookbook to do is growing. So I have done what is required to deliver the functionality needed by the client I was doing the Chef work for, and now I can pimp my cookbook until I'm happy enough to share it with the community.

(The whole ‘just enough’ ethos is something I mean to talk about here but not now)

I want the recipes to be easy to use and understood by people new to Chef and also to MongoDB, as I do not believe that just because you have a sophisticated tool like Chef your cookbooks should be overly complicated. Keeping it simple makes maintenance easy and encourages others to expand upon it appropriately if they follow the rules. MongoDB is very easy to get up and running, so why use a tool to make it suddenly obtuse?

So, the rules to date:

Obviously you need to have Mongo installed as a starting point. I couldn't really mandate the use of the installation recipe, as it may be an existing set-up. (I have to modify this so it brings down a specified version rather than just the latest version from the 10gen repository.)

Each recipe is to be used to carry out a single function, e.g. install MongoDB, write the configuration file, start MongoDB etc. Combining functions is discouraged.

Each subsequent recipe can be run independently of the others or be combined into a role; this meant making sure I had a recipe to start MongoDB, so it could be dropped in as, say, part of a role or workflow.

Templates and variables are used to encourage flexibility.

When I get to a point where I feel the cookbook is pimped appropriately, I'll post a dissection and some guidance.
