Using AWS for DR when your solution is not in the Cloud

In my previous post in this series on the resilience of non-cloudy solutions, I discussed how to approach pinning down exactly what was acceptable to the business in order to achieve an appropriate DR solution. In this post I will look, at a fairly high level, at how to exploit AWS to help provide a cost-effective DR solution when your solution does not actually use AWS resources and is probably not designed in the decoupled manner that would make it easy to deploy to the cloud.

Yes, I know, I can't help it; the cloud is here after all 🙂

Please note that by necessity I've kept this at a high level; if I were to attempt to explore the detailed configuration options I'd still be writing this post by Christmas. Needless to say this post just scratches the surface, but hopefully it provides some food for thought.

You will, or should, have local resilience in your solution, consisting of multiple application and web servers, clustered database servers and load balancers.

The easiest DR solution to implement, but the most costly, is to replicate this set-up, albeit with perhaps fewer servers and maybe a single database server instance, at an alternative physical location, and to put processes in place to replicate data across to the second location.

This typical configuration will look something like this:

[Diagram: standard DC replicated to a second location]

There are plenty of variations on this, but in the end it entails physically maintaining a distinct location that replicates the application architecture and associated security controls. Resources need to be in place to support that location, the components need to be kept regularly updated, and all the usual best practices need to be followed to validate the solution. There's no point finding out the solution doesn't work when you need it.

 

At this point you should hopefully be thinking that is a lot of investment for something that will only rarely be used. So here's where AWS can help keep those costs down.

 

The first model, which I've called the 'halfway house', may be an option for those who are unable to make full use of the AWS resources available and who, for whatever reason, are unable or unwilling to store their data there. It still requires two maintained DCs, but saves costs by having the resilient application and web servers be AWS instances. The cool thing here is that those resilient servers/instances are not actually running unless needed (you would have prepped AMIs, ideally used in conjunction with a configuration management tool to ensure they are fully up to date when launched). You will not have the overhead of watering and feeding them that you would have if you were 100% responsible for the infrastructure. The core AWS components that make this work are EC2, VPC and ELB. If you wanted, there is also the potential to use Route 53 to manage the DNS aspects needed for external routing. There are issues with this model though, such as the possibility of a lack of capacity when you need to spin up those instances (although the use of multiple Availability Zones and regions should overcome that fear), the overhead associated with managing three sets of resources, and latency issues, just to name three that come to mind.

The 'halfway house' will look something like this:

 

[Diagram: the 'halfway house', with partial use of AWS]

Making use of AWS VPC means that you can create virtual networks built upon the AWS infrastructure, which gives you a great range of networking configurations. For example, in the diagram above I've shown two groups of instances: one that is externally accessible and another that is basically an extension of your private LAN. There are far too many scenarios possible with just these features of AWS, and obviously every application is different (see why I made sure this post was kept at a high level?).
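
As a purely illustrative sketch (my addition, not part of the original diagrams; the resource names and CIDR ranges are invented), that public/private split could be expressed in a CloudFormation template along these lines:

"DRVpc" : {
  "Type" : "AWS::EC2::VPC",
  "Properties" : { "CidrBlock" : "10.0.0.0/16" }
},
"PublicSubnet" : {
  "Type" : "AWS::EC2::Subnet",
  "Properties" : { "VpcId" : { "Ref" : "DRVpc" }, "CidrBlock" : "10.0.0.0/24" }
},
"PrivateSubnet" : {
  "Type" : "AWS::EC2::Subnet",
  "Properties" : { "VpcId" : { "Ref" : "DRVpc" }, "CidrBlock" : "10.0.1.0/24" }
}

The externally accessible instances would sit in PublicSubnet behind an ELB, while PrivateSubnet would be the piece linked back to your own LAN.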

The nirvana, though, if you really want to see the costs tumble, is to get rid of DC 2 and use AWS as the recovery site; as a bonus it can also be used for those extra processing needs on an on-demand basis. This not only reduces the support overhead and saves cost, as you are no longer committed to paying for a second location with all the associated kit necessary to make it a viable alternative site, but it also provides a wide variety of failover and recovery options that you just won't get when you have to commit to infrastructure up front (hopefully that pre-empts the question about why not a private cloud: you need your own platform).

This model, which I've called the 'Big Kahuna', can look a little like this:

 

[Diagram: the 'Big Kahuna', with AWS as the recovery site]

With the 'Big Kahuna' you should make use of any of the AWS resources available. In the flavour above I'm using S3 to store regular snapshots, transaction logs etc. from my primary database. Why not replicate directly? Well, S3 is cheap storage, and in the scenario I'm illustrating my RTO and RPO values allow enough delay between failure and recovery that I can reconstruct the database when needed from the data stored in my S3 bucket. Regular reconstruction exercises should occur though, as part of the regular validation of the failover processes. AMIs and a configuration management solution (as it's me, it will be Chef) are used to provision up-to-date application and web servers. Use is made of Route 53 to facilitate DNS management, and where I need to ensure that traffic is kept internal I'm making use of VPC.

The introduction of RDS for Oracle means it is viable to use AWS as the failover solution for enterprises. There may be concerns over performance, but this is a DR situation, so if you are not in a position to re-engineer for the cloud then reduced performance should be part of the business impact discussions with your internal business sponsors.

AWS has services such as dedicated instances which may be the only way your security and networking guys will allow you to exploit AWS resources, but you would need to do your sums to see if it makes sense to do so. Personally I'd focus on trying to understand the 'reasons' for this. There are a number of valid areas where this would be required, but I suspect cost isn't really going to be any sort of driving force there.

The devil is in the detail when designing a failover solution utilising AWS as part of your DR. If you are planning a new solution, make sure you talk to the software architect about best practices when designing for the cloud; they're still applicable for on-premise solutions too.

Data is really where all the pain points are, and it will likely dictate the model and ultimate configuration.

If you are trying to retrofit an existing solution then the options open to you may not be that many, and it's likely you will have to start off with some form of the 'halfway house'.

Also don't forget you can just try stuff out at minimal cost. Wondering if a particular scenario would work? Just try it out: you can delete everything after you've finished.

The cost effectiveness of the solution is directly related to the use you make of AWS resources to effect it. I even have a graph to illustrate (@jamessaull would be proud of me).

 

[Graph: relative DR cost versus degree of AWS usage]

This graph is based on very rough comparative costs, starting with no AWS resources, as in the first situation I discussed, and working down through to the 'Big Kahuna'. You can easily do your own sums: AWS pricing is on their site (they even provide a calculator), and you know how much it costs for those servers, licences, networking hardware, hardware maintenance, support etc.


CloudFormation deletion policies: an important addition

The CloudFormation team made a forum announcement on the 31st May detailing the latest enhancements. In the list was the feature I'd been waiting on: the introduction of resource deletion policies. Up until the introduction of this feature I had been loath to use CloudFormation to create certain resources.

Why was I concerned? Well, it boils down to the fact that we are all subject to human error. You can just imagine the poor person who decides to remove a stack for valid reasons, say they were doing rolling upgrades, have brought up a replacement stack and want to remove the existing one, but have forgotten that when they deployed their original stack oh so many months ago it also created their initial database infrastructure (I'm using RDS to illustrate the point here, but it could just as easily have been a NoSQL deployment on an EC2 instance), and it would be goodbye to all their data.

So how does it work?

The DeletionPolicy is an attribute that you can add to your resource definitions; it tells CloudFormation how to handle the deletion of that resource (there's a sketch of its use after the list of states below). The default behaviour is to just delete it.

The three states that a DeletionPolicy can have are:

Delete – the default behaviour, but it may be prudent to add the attribute explicitly to all your resources as part of your self-documentation

Retain – directs CloudFormation to keep the resource and any associated data/content after the stack is deleted

The above two states are applicable to any resource.

Snapshot – only applicable to resources that support snapshots, namely EBS volumes and RDS. The actual resource will be deleted, but the snapshots will still exist after the stack has been deleted
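
To illustrate (a minimal sketch of my own; the resource name and property values are invented, not from the announcement), the attribute sits alongside the resource's Type and Properties:

"AppDatabase" : {
  "Type" : "AWS::RDS::DBInstance",
  "DeletionPolicy" : "Snapshot",
  "Properties" : {
    "AllocatedStorage" : "5",
    "DBInstanceClass" : "db.m1.small",
    "Engine" : "MySQL",
    "MasterUsername" : "admin",
    "MasterUserPassword" : { "Ref" : "DBPassword" }
  }
}

With this in place, deleting the stack removes the DB instance but leaves a final snapshot behind, so the data in my horror story above would have survived.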

A quick mention of some of the other new features released that have caught my eye :

Parameter validation – pretty self-evident why this was a must-have feature 🙂

Wait condition – provides the ability to pause stack creation until some predefined action or timeout has occurred. This could be used, for example, to fully automate the creation of a master/slave set-up where, say, the master IP address is needed to allow the slaves to join the party (see the sketch after this list)

Ability to create S3 buckets and S3-hosted websites – I love the idea of creating your S3-hosted website via a simple script
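
To make the wait condition idea concrete, here is a minimal sketch (my own illustration; the 'MasterInstance' resource and the timeout value are invented). Stack creation pauses at the wait condition until something signals the handle's pre-signed URL, or the timeout expires:

"WaitHandle" : {
  "Type" : "AWS::CloudFormation::WaitConditionHandle"
},
"MasterWaitCondition" : {
  "Type" : "AWS::CloudFormation::WaitCondition",
  "DependsOn" : "MasterInstance",
  "Properties" : {
    "Handle" : { "Ref" : "WaitHandle" },
    "Timeout" : "1200"
  }
}

The master instance would POST its details (its IP address, say) to the handle's URL as part of its startup, and the slave resources could then depend on MasterWaitCondition.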

An aide-mémoire on monitoring using CloudWatch & CloudFormation on AWS

It can be confusing when it comes to setting up auto scaling rules, alarms and load balancing health checks, so I wanted to take a little time to look at how the bits fit together to give an effective proactive monitoring solution using just CloudWatch. Sorry, this is a longish post, but at least it's all in one place 🙂

AWS does provide a lot of information, but it is scattered about and wading through it can be time consuming; hopefully this will be a useful introduction.

A few definitions are a good place to start.

Definitions

Alarms:

An alarm is exactly what it says: a watcher that provides notification that an AWS resource has breached one of the thresholds assigned against a specific metric. (Note you are now able to expose custom metrics as well as CloudWatch metrics and use these for Auto Scaling actions too.)

Health checks:

A health check is a check on the state of an instance that is part of an Auto Scaling group. If an instance is detected as having degraded performance, it is marked as unhealthy.

Auto Scaling Policy:

A policy defines what action the AutoScaling group should take in response to an alarm.

Triggers:

A trigger is a combination of an Auto Scaling policy and an Amazon CloudWatch alarm. Alarms are created that monitor specific metrics gathered from EC2 instances. Pairing the alarm with a policy can initiate an Auto Scaling action when the metric breaches a specific threshold.

Launch Configuration:

The definitions (parameters) needed to instantiate new EC2 instances. These include values such as which AMI to use, the instance size, user data to be passed, and EBS volumes to be attached. A Launch Configuration is used together with an Auto Scaling group. An Auto Scaling group can only have one Launch Configuration attached to it at any one time, but you can replace the Launch Configuration.

AutoScaling Group:

An Auto Scaling group manages a set of one or more instances. It works in conjunction with a Launch Configuration and triggers to enact scaling actions. The Launch Configuration tells it what the instances should look like, and the triggers tell it how to react to particular situations.

 

Component breakdown

Alarm Parameters:

Parameter – Description (example value)

Alarm name – a name that typically reflects what the alarm is watching (e.g. CPUHighAlarm)

Alarm Action – an SNS notification or an Auto Scaling policy

Metric Name – the metric being monitored, e.g. CPU or memory usage (e.g. CPUUtilization)

Statistic – the metric data aggregation collected over a specified period of time (e.g. Average)

Period – the length of time associated with a specific statistic. Periods are expressed in seconds; the minimum granularity for a period is one minute, so period values are expressed as multiples of 60 (e.g. 60)

Evaluation Period – the number of periods over which data is compared to the specified threshold (e.g. 1)

Threshold – the value that the metric is being evaluated against (e.g. 30)

ComparisonOperator – the operation to use when comparing the specified Statistic and Threshold; the Statistic value is used as the first operand. Valid values: GreaterThanOrEqualToThreshold, GreaterThanThreshold, LessThanThreshold, LessThanOrEqualToThreshold (e.g. GreaterThanThreshold)

Dimensions – name/value pairs that provide additional information to allow you to uniquely identify a metric

 

 

Health check Parameters:

The health of your instances is used by Auto Scaling to trigger the termination of an instance.

Parameter – Description (example value)

Healthy Threshold – the number of consecutive health check successes before declaring an instance healthy (e.g. 5)

Unhealthy Threshold – the number of consecutive health check failures before declaring an instance unhealthy (e.g. 2)

Interval – the interval in seconds between successive health checks (e.g. 120)

Timeout – the amount of time in seconds during which no response indicates a failed health check; this value must be less than the Interval value (e.g. 60)

Target – the TCP or HTTP check made against an instance, used to determine its health. For an HTTP check, any answer other than "200 OK" within the timeout period is considered unhealthy. For a TCP check, an attempt is made to open a TCP connection to the instance on the specified port; failure to connect within the configured timeout is considered unhealthy (e.g. HTTP:80/home/index.html or TCP:8080)

 

Trigger Parameters:

Parameter – Description (example values)

Metric name – the metric being monitored (e.g. CPUUtilization)

Name Space – conceptual containers for metrics; ensures that metrics in different namespaces are isolated from each other (e.g. AWS/EC2, AWS/AutoScaling, AWS/EBS, AWS/RDS, AWS/ELB)

Statistic – the metric data aggregation collected over a specified period of time (Average, Minimum, Maximum or Sum)

Period – the length of time associated with a specific statistic. Periods are expressed in seconds; the minimum granularity for a period is one minute, so period values are expressed as multiples of 60 (e.g. 300)

Unit – the statistic's unit of measurement (Percent, Bytes, Seconds etc., depending on the metric being measured)

Upper Breach Scale Increment – the incremental amount to scale by when the upper threshold has been breached (e.g. 1)

Lower Breach Scale Increment – the incremental amount to scale by when the lower threshold has been breached (e.g. -1)

Auto Scaling Group name – the name of the Auto Scaling group the trigger is attached to (e.g. WebServerGroup)

Breach Duration – defines how long a breach can be sustained before it triggers an action (e.g. 500)

Upper Threshold – the upper limit of the metric; the trigger fires if all data points in the last BreachDuration period exceed it (e.g. 90)

Lower Threshold – the lower limit of the metric; the trigger fires if all data points in the last BreachDuration period fall below it (e.g. 20)

Dimension – name/value pairs that provide additional information to allow you to uniquely identify a metric (e.g. Name: AutoScalingGroupName, Value: WebServerGroup; Name: Webserver, Value: ProductionServer)

 

Auto Scaling Group Parameters

Parameter – Description (example values)

AvailabilityZones – the Availability Zones in which the group can start an instance (e.g. eu-west-1a, eu-west-1c)

Cooldown – the time in seconds after one scaling action completes before another scaling activity can start (e.g. 60)

DesiredCapacity – the number of instances the Auto Scaling group will endeavour to maintain (e.g. 2)

LaunchConfigurationName – the name of the associated Launch Configuration (e.g. LaunchMyInstances)

LoadBalancerName – the name of the load balancer the Auto Scaling group is attached to (e.g. LoadBalancerforMyInstances)

MaxSize – the maximum number of instances that the Auto Scaling group can have associated with it (e.g. 3)

MinSize – the minimum number of instances that the Auto Scaling group will have associated with it (e.g. 1)

 

A policy definition:

Policies are usually paired: one for scaling up and one for scaling down.

To create a policy that scales down by 1 from the command line:

# When scaling down, decrease capacity by 1
as-put-scaling-policy my-group --name "scale-down" --adjustment=-1 --type ChangeInCapacity

 

To list policies from the command line (to get the ARN):

as-describe-policies autoscaling-group

 

Putting it all together

So now we know what the components are, and the associated parameters that can be used to put together an appropriate monitoring solution using CloudWatch. To illustrate how to start putting things together I'll use CloudFormation; you can use the command line tools and the console to do much of what comes next.

Using Alarms:

Metrics can be collected for EC2 instances, ELBs, EBS volumes and RDS, and there is the flexibility to use custom metrics 🙂. Alarms can be set against any one of these metrics. Alarms exist in 3 states: OK, ALARM, or INSUFFICIENT_DATA. When a metric breaches a predetermined threshold it moves to the ALARM state, and an action can be set on the transition from one state to another. The defined alarm action can be publication to an SNS notification topic or an Auto Scaling action. The CloudFormation snippets below illustrate setting up an alarm that monitors when CPU utilisation breaches a defined threshold, or the metric disappears, with the defined action being publication to an SNS topic that sends an email:

"AlarmTopic" : {
  "Type" : "AWS::SNS::Topic",
  "Properties" : {
    "Subscription" : [ {
      "Endpoint" : { "Ref" : "OperatorEmail" },
      "Protocol" : "email"
    } ]
  }
}

"CPUAlarmHigh" : {
  "Type" : "AWS::CloudWatch::Alarm",
  "Properties" : {
    "AlarmDescription" : "Alarm if CPU too high or metric disappears indicating instance is down",
    "AlarmActions" : [ { "Ref" : "AlarmTopic" } ],
    "InsufficientDataActions" : [ { "Ref" : "AlarmTopic" } ],
    "MetricName" : "CPUUtilization",
    "Namespace" : "AWS/EC2",
    "Statistic" : "Average",
    "Period" : "60",
    "EvaluationPeriods" : "1",
    "Threshold" : "90",
    "ComparisonOperator" : "GreaterThanThreshold",
    "Dimensions" : [ {
      "Name" : "AutoScalingGroupName",
      "Value" : { "Ref" : "AppServerGroup" }
    } ]
  }
}

Using Auto Scaling Groups and Load Balancers:

This snippet describes an Auto Scaling group that will at any one time manage between 1 and 3 instances, while endeavouring to maintain 2.

"AppServerGroup" : {
  "Type" : "AWS::AutoScaling::AutoScalingGroup",
  "Properties" : {
    "AvailabilityZones" : { "Fn::GetAZs" : "" },
    "LaunchConfigurationName" : { "Ref" : "AppServerLaunchConfig" },
    "MinSize" : "1",
    "MaxSize" : "3",
    "DesiredCapacity" : "2",
    "LoadBalancerNames" : [ { "Ref" : "AppServerLoadBalancer" } ]
  }
},

 

In the snippet above the Auto Scaling group has an associated Launch Configuration, which is mandatory for an Auto Scaling group. It is also associated with a load balancer, which we'll come to in a minute. In the alarm example you may have noticed that the Dimensions parameters refer to the Auto Scaling group above; that configuration gives us an alarm monitoring the state of the instances managed by the group.
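
The post doesn't show the AppServerLaunchConfig resource itself, so here is a minimal sketch of what such a Launch Configuration might look like (the AMI id, instance type and key name are illustrative):

"AppServerLaunchConfig" : {
  "Type" : "AWS::AutoScaling::LaunchConfiguration",
  "Properties" : {
    "ImageId" : "ami-45cefa31",
    "InstanceType" : "t1.micro",
    "KeyName" : { "Ref" : "KeyName" }
  }
},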

The load balancer associated with the Auto Scaling group described above looks like this:

"AppServerLoadBalancer" : {
  "Type" : "AWS::ElasticLoadBalancing::LoadBalancer",
  "Properties" : {
    "AvailabilityZones" : { "Fn::GetAZs" : { "Ref" : "AWS::Region" } },
    "Listeners" : [ {
      "LoadBalancerPort" : "80",
      "InstancePort" : { "Ref" : "TomcatPort" },
      "Protocol" : "HTTP"
    } ],
    "HealthCheck" : {
      "Target" : { "Fn::Join" : [ "", [ "HTTP:", { "Ref" : "TomcatPort" }, "/welcome" ] ] },
      "HealthyThreshold" : "5",
      "Timeout" : "5",
      "Interval" : "30",
      "UnhealthyThreshold" : "2"
    }
  }
},

 

 

The load balancer has been defined with a health check, which in this example is an HTTP check. A check is marked as failed if the instance does not respond with a "200 OK" within the 5 second timeout. Checks are made every 30 seconds: if 2 consecutive checks fail, the instance is marked as unhealthy, and it then needs to respond successfully with a "200 OK" 5 times in succession to be marked as healthy again. The combination of Interval and thresholds determines how quickly a failing instance is caught; with these values an instance could be failing for around a minute (two checks, 30 seconds apart) before it is actually marked as unhealthy.

You can also associate alarms with the load balancer, as in the snippet below, where an alarm has been defined that notifies you if there are too many unhealthy hosts:

"TooManyUnhealthyHostsAlarm" : {
  "Type" : "AWS::CloudWatch::Alarm",
  "Properties" : {
    "AlarmDescription" : "Alarm if there are too many unhealthy hosts.",
    "AlarmActions" : [ { "Ref" : "AlarmTopic" } ],
    "InsufficientDataActions" : [ { "Ref" : "AlarmTopic" } ],
    "MetricName" : "UnHealthyHostCount",
    "Namespace" : "AWS/ELB",
    "Statistic" : "Average",
    "Period" : "60",
    "EvaluationPeriods" : "1",
    "Threshold" : "0",
    "ComparisonOperator" : "GreaterThanThreshold",
    "Dimensions" : [ {
      "Name" : "LoadBalancerName",
      "Value" : { "Ref" : "AppServerLoadBalancer" }
    } ]
  }
}

 

Triggers and Auto Scaling Policies:

We've looked at defining alarms that publish to an SNS topic on a change of state. Now, as the last part of this post, we'll look at how to effect an Auto Scaling action. This can be achieved by using a trigger or by using an Auto Scaling policy.

Triggers, when defined, are very similar to alarms but with the Auto Scaling policies incorporated.

In the snippet below a trigger is defined that monitors the average CPU utilization of the EC2 instances managed by the Auto Scaling group.

"CPUBreachTrigger" : {
  "Type" : "AWS::AutoScaling::Trigger",
  "Properties" : {
    "AutoScalingGroupName" : { "Ref" : "AppServerGroup" },
    "Dimensions" : [ {
      "Name" : "AutoScalingGroupName",
      "Value" : { "Ref" : "AppServerGroup" }
    } ],
    "MetricName" : "CPUUtilization",
    "Namespace" : "AWS/EC2",
    "Period" : "60",
    "Statistic" : "Average",
    "UpperThreshold" : "90",
    "LowerThreshold" : "20",
    "BreachDuration" : "120",
    "UpperBreachScaleIncrement" : "1",
    "LowerBreachScaleIncrement" : "-1"
  }
},

 

In the example snippet, if the average CPU utilization breaches the upper or lower threshold and the breach is sustained for 120 seconds, the Auto Scaling group will scale up or down by 1 instance accordingly.

Having defined a set of Auto Scaling policies via the command line, as described earlier in this post, a policy can apparently be referenced by an alarm, using its ARN, as the alarm's action on changing state. I was unable to figure out how you could do this via CloudFormation though, as you cannot create an Auto Scaling policy that is not attached to an Auto Scaling group, and you cannot create a standalone policy that can be attached later. So as things stand today, doing this would mean creating the Auto Scaling group and then attaching the policy from the command line with something like the below:

# When scaling up, increase capacity by 1
C:\> as-put-scaling-policy AppServerGroup --name "scale-up" --adjustment 1 --type ChangeInCapacity

 

I am hoping the ability to create Auto Scaling policies as part of a CloudFormation template will be added to the CloudFormation API as future functionality.

OpenShift Flex: a peek behind the scenes

In my first look at OpenShift I described what the initial getting started experience was like. Now I'd like to go over what it's like to deploy a JBoss application using the 'deploy your own application' route. I had to endure being talked at by my Significant Other about the intricacies of JBoss for this, mind, and I now know a little more about JBoss than I ever wanted or needed to know (I didn't know much about it before, though). As a result this post sneaks a look behind the scenes and focuses on what is happening at the AWS end, as well as on some JBoss specifics which I guess most users may not want to get into. In this post we also list some of our gripes and some questions we have. Anyway, enough of that, here comes the good stuff 🙂

The first thing you need to get your head around is that the JBoss server is part of your application.

We created a load-balanced cluster which, if you're interested in what is going on with your AWS estate, consisted of a minimum of two EC2 instances with a load balancer in front of them. The EC2 instances are launched with a generated EC2 security group which has your cluster name embedded in it.

[Screenshot: the generated EC2 security group]

My first question was why are both instances in the same availability zone?

This unfortunately ended up with my Significant Other and me educating each other: me explaining about AZs and him explaining how JBoss clusters work. We decided to park that conversation, as I think we were both making each other's heads hurt at some point (our work lives do not normally collide).

The reason we started a cluster is that we wanted to show that when you create a clustered JBoss application you actually get a JBoss cluster, and not just two standalone JBoss instances behind a load balancer. I'll show this later in this post.

The application we are using is the JBoss Seam booking application that goes with the getting started with JBoss guide. This application doesn't actually work when deployed (I know why, as I was told, but that explanation would distract from this post), but as we were interested in poking around it served its purpose (and I had already downloaded it).

To deploy an application you also need to deploy the components that go with it.

[Screenshot: deploying the application components]

Basically you need to go through all the tabs, but you can just deploy the components and deploy the files (in our case an EAR file) later; OpenShift doesn't seem to mind that.

Gripe: the annoying yellow circle that appears is a counter telling you how many files have been modified. In the screenshot above you can see that the '20' matches the number of configuration files modified. It rapidly became as annoying as the paper clip in Word; it needs to go.

We deployed the components first, which consisted of JBoss 6 and MySQL. At the moment only a community version of JBoss is supported. What I want to know is: is there going to be a supported version of JBoss, so that we could see a pay-as-you-go support service like the Red Hat instances on EC2?

Anyway, after we deployed the components we logged on to the JMX console on one of the nodes so we could show you that it really is a JBoss cluster, and what happens after you deploy the files that make up your application.

The terminology is slightly confusing, as JBoss usually has applications deployed to it, and now it is a component, but that component is actually part of an application. Hopefully by the end of this post you'll understand it all.

So, pre application file deployment but post component deployment: when logging onto the JMX console, as you can see from the screenshots below, a JBoss cluster/group exists with both our EC2 instances in it.

I have included a screenshot of the EC2 instances so you can match up the IP addresses with the walkthrough.

[Screenshot: JBoss cluster view in the JMX console]

[Screenshot: both EC2 instances, in the same availability zone]

[Screenshot: JMX console before application deployment]

Those of you familiar with JBoss will see there is no application deployed as yet, so we went back to the Flex console and deployed our application EAR file.

The JMX console then looked like this:

[Screenshot: JMX console after application deployment]

and from the JMX console on our second instance it looked like this:

[Screenshot: JMX console on the second instance]

To get the password to log onto the JBoss console we actually had to log onto the instance via ssh  to get the randomly generated admin password from the jmx-console-users.properties file.

It would be good if there was a way to change this password from the Flex console; there is already the ability to change ports via the console. We feel this is a must-have enhancement.

[Screenshot: the generated admin password in jmx-console-users.properties]

After we'd had a poke around I wanted to delete the application, so I stopped the application and then clicked delete, but got this error:

[Screenshot: error when deleting the application]

The application (the deployed files and components, i.e. JBoss, is no more) was deleted from both nodes, so it did work.

This still leaves the instances and load balancer running on AWS, though, so next it was delete the cluster. Another error, although it did successfully delete the cluster and the underlying AWS infrastructure.

[Screenshot: error when deleting the cluster]

So there you have it, a quick peek behind the deployment of a JBoss application using OpenShift.

Some other comments we have for Red Hat:

As we can log onto the instances, it needs to be made absolutely clear what an end user is actually allowed to do in terms of tweaking. If you 'over tweak', what support will be provided?

The Getting Started documentation needs to reflect the actuality. For example, in the 'Getting Started with JBoss on OpenShift Flex' guide there is no indication that you cannot have spaces in your application name, contrary to what the screenshots in the guide show.

You do not find out until you have tried to create a cluster (a Flex cluster) that there is a 10-character maximum constraint on the cluster name. Some on-screen help is needed before you hit submit.

The UI still needs a lot of work: the scrolling around is annoying, and buttons are not obvious if your browser/laptop resolution isn't high enough, although I have been told that even on a dev-size screen it's still rubbish (I know they are working on this, but it's worth listing).

 

A quick introductory tour of OpenShift

I'd had a play with the pre-beta of what is now known as OpenShift back when it was called Makara Cloud, but it was very much a pre-beta. With my Significant Other needing to look at OpenShift, and the flavour of OpenShift he wanted to look at needing EC2, another visit was in order. So the first of my visits was just to see how the getting started experience has evolved from those early pre-beta days.

OpenShift comes in a number of flavours: Express, Flex and Power. In this post we are looking at Flex.

Express – the free shared model for PHP, Ruby and Python apps

Flex – for Java EE and PHP apps that make use of middleware components like JBoss and Tomcat

Power – basically, you write custom apps in, say, C for the underlying instance

The FAQ provides info on the differences. The only one you need an AWS account for is Flex.

After signing up you get a confirmation email and off you go. Using Flex requires that you provide the portal with your cloud details.

[Screenshot: providing your cloud account details]

This is tied to using Amazon EC2 instances. I'm assuming that more cloud providers will be available at some point; EC2 suits me fine though, so onwards with the tour.

After entering your credentials you get informative messages telling you what's happening as it connects your cloud account to Flex.

[Screenshot: linking your cloud account to Flex]

You then need to define your server cluster.

[Screenshot: defining the server cluster]

But at last, a cloud solution that is available in the eu-west region on release. Nice not to be a second-class citizen for a change just because I'm based in Europe.

While it creates the cluster, Red Hat have taken the opportunity to sneak in a survey.

[Screenshot: the survey]

Well, why not, I guess; it stops me getting bored anyway.

For this post I was trying to get a handle on the initial sign-up experience. My Significant Other will be using JBoss on it, so hopefully I can entice some feedback about what it's like to use in anger in another post.

Anyway, the next step is to either deploy your own application or try one of the demo applications. I opted to try the demo Spring & Hibernate auto-insurance application, autotrack. I clicked the submit button and off it went.

Behind the scenes it created a large EBS-backed instance and a load balancer against our AWS account. It wasn't obvious where I could change the instance size.

Once the app is deployed you get access to the application dashboard. Now this is neat, and I cannot do it justice in this brief intro, but here are a few screenshots so you can see the sort of info you can get from the dashboard.

[Screenshot: the application dashboard]

I'm very interested in deployment processes, so the Deploy Changes tab is pretty cool, to me anyway.

[Screenshot: the Deploy Changes tab]

The performance tab provides transactional as well as instance metrics in an easily personalised dashboard.

[Screenshot: the performance tab]

The logs tab provides a view of the application server logs.

The initial verdict:

If you use OpenShift Flex you still need to understand the underlying IaaS platform it will be deployed to. This isn't an Azure-type PaaS model but a bring-your-own PaaS container, and you need somewhere to deploy that PaaS container to. I would like to see more flexibility around the cluster (underlying IaaS) set-up. How easy is it to move from IaaS to IaaS? I don't know; it's difficult to assess with only AWS EC2 being an option. And if it isn't that easy, then why not make use of some of the services that made you select a particular cloud vendor in the first place? RDS as an option would be nice.

The console is nice and friendly though, and it was a painless experience to sign up and get my first OpenShift demo application up and running. Express is the free shared model where Red Hat have abstracted the IaaS, but its target market (Ruby, Python, PHP) already has a lot of choice, what with Heroku, CloudFoundry and numerous other players who offer to run apps like Drupal. I think that if you are developing Java EE applications and really do not want to worry about the underlying IaaS, then this and CloudFoundry may well be your first ports of call.

I'll revisit for a deeper dive soon though, as I have a number of unanswered questions I need to explore, and my Significant Other has yet to start exploring this, which should provide a good test.

A pictorial representation of AWS EBS Architecture

I'm not known for creating pretty pictures, and this is definitely not a pretty one, but hopefully it will help visualise how AWS EBS fits together. I'm hoping someone will feel so appalled at my terrible diagram that they'll feel obliged to come up with a pretty one.

I drafted this after reading the incredibly detailed post mortem on the EBS problems AWS experienced in the US-East region on the 21st April 2011, in which they explained the EBS architecture.

I have pulled out the following points from the post mortem to help understand how a normally functioning EBS cluster works:

An EBS cluster exists within an Availability Zone

An EBS Cluster manages a set of EBS Nodes

The EBS Nodes store replicas of EBS volume data and serve read & write requests to EC2

EBS nodes communicate with other EBS nodes, with EC2 instances, and with the EBS control plane services via a high bandwidth network

A secondary, lower-capacity network is also in use as a back-up, allowing EBS nodes to reliably communicate with other nodes in the EBS cluster and providing overflow capacity for data replication

If an EBS node loses connectivity to a node to which it is replicating data, it assumes the other node failed. To preserve durability, it must find a new node to which it can replicate its data (this is called re-mirroring). As part of the re-mirroring process, the EBS node searches its EBS cluster for another node with enough available server space, establishes connectivity with the server, and propagates the volume data

The control plane services accept user requests and propagate them to the appropriate EBS cluster. There is one set of EBS control plane services per EC2 region, but the control plane itself is highly distributed across the Availability Zones to provide availability and fault tolerance. These control plane services also act as the authority to the EBS clusters when they elect primary replicas for each volume in the cluster (for consistency, there must only be a single primary replica for each volume at any time)

PowerShell & user data to start a Chef run

This script can be used to configure an instance on AWS at startup to collect the user data, which would be the run list in JSON. It assumes that Ruby and the prerequisite gems have already been installed.

#chef-clientrun.ps1

# Install the chef gem - this ensures only the latest stable version is installed
$installchef = "gem install chef --no-rdoc --no-ri"

# Download the userdata (the run list in json)
$webclient = new-object system.net.webclient
$awsurl = "http://169.254.169.254/latest/user-data"
$targetfile = "c:\chef\etc\runlist.json"
$webclient.DownloadFile($awsurl, $targetfile)

# Run chef-client, passing the json file which contains the run list
$runchef = "C:\Ruby192\bin\chef-client -j " + $targetfile

invoke-expression $installchef
invoke-expression $runchef

Using CloudFormation to kick off a chef run

Once you decide to use CloudFormation to create your AWS resources you are no longer able to use the knife command to kick off EC2 server creation, so you will have to get the instance itself to start the Chef run by doing a chef-client run.

The solution described in this post is simple to implement. It requires a little scripting at the instance end, baked into a base AMI, plus the use of userdata.

I will use a Linux AWS AMI as my starting point.

The first thing to do is set up your target AMI to be able to  use userdata.

The script below shows the salient parts of an rc.local I have used to facilitate a chef run when an instance is created from the AMI:

gem install chef --no-rdoc --no-ri
# grab the userdata, then use it to construct the name of the json file
# the json file contains the run list and is passed to the chef-client run
export USERDATA=`/usr/local/bin/aws-get-ec2-userdata`
echo userdata = $USERDATA
export ROLE=$(echo $USERDATA | cut -f 1 -d ":")
chef-client -j /etc/chef/$ROLE.json

The /usr/local/bin/aws-get-ec2-userdata script uses curl (just like the sample templates from AWS) to fetch the userdata from http://169.254.169.254/latest/user-data, which is then stored in the environment variable USERDATA. The first value, which represents the role we want to apply to the node, is extracted and saved as the environment variable ROLE, which is then used to pass in the json file containing that role in its run list.

The corresponding part of a CloudFormation script that creates the EC2 instance resource and passes the userdata looks like this:

"Ec2Instance" : {
  "Type" : "AWS::EC2::Instance",
  "Properties" : {
    "KeyName" : { "Ref" : "KeyName" },
    "AvailabilityZone" : { "Fn::FindInMap" : [ "RegionMap", { "Ref" : "AWS::Region" }, "AvailabilityZone" ] },
    "ImageId" : { "Fn::FindInMap" : [ "RegionMap", { "Ref" : "AWS::Region" }, "AMI" ] },
    "InstanceType" : { "Ref" : "InstanceType" },
    "UserData" : {
      "Fn::Base64" : {
        "Fn::Join" : [ ":", [
          { "Ref" : "ChefRole" },
          { "Ref" : "EBsVolsreqd" }
        ] ]
      }
    }
  }
}

The userdata needs to be base64 encoded, hence the "Fn::Base64" encapsulating this property. The "Fn::Join" : [ ":", ... joins the values passed into a single value, with each value separated by ":" (so a ChefRole of webserver and an EBsVolsreqd of 2 would arrive as webserver:2, for example).

The line export ROLE=$(echo $USERDATA | cut -f 1 -d ":") in the rc.local uses that delimiter to identify each value, and as ChefRole is the first parameter passed it uses this to set the variable ROLE.

When the stack is started you can accept the default role or change it to an appropriate value.

[Screenshot: specifying the stack parameters]

After the stack is complete you can then check that it has created the node by looking at the system log:

[Screenshot: the instance system log]

and/or by using the Chef console (I use the Opscode hosted platform):

[Screenshot: the node in the Opscode console]

I think this is a nice, straightforward way to achieve a fully automated end-to-end deployment using AWS EC2, CloudFormation and Chef, from the base O/S through to the applications that need to be deployed.

An intro to writing AWS CloudFormation Templates

AWS recently introduced the ability to create and document your AWS environment via CloudFormation. You need to create a JSON template that describes your environment; this template is then used to create your AWS stack. The AWS documentation is pretty good, but the getting started guide starts you off by using an example template. I think a walkthrough showing you the component parts that make up a template, and how to put them together, is a much better starting point if you want to create a stack for your own AWS environment, so here you are.

There are 6 basic components of a JSON CloudFormation template that you need to become familiar with.

Format version (optional): The AWS CloudFormation template version against which the template was written

Description (optional): a JSON text string describing the template or part of the template.

Parameters (optional): a string or comma-separated list whose values can be overridden at run time. Parameters are dereferenced in the resources and outputs sections of the template; e.g. you can declare a parameter called InstanceType with a default of t1.micro that can be overridden at stack instantiation with an alternative instance type.

*******************************************

"InstanceType" : {
  "Type" : "String",
  "Default" : "t1.micro",
  "Description" : " 't1.micro' | 'm1.large' | 'c1.xlarge' "
},

*******************************************

Mappings (optional): allow the passing of conditional parameter values, used with the function Fn::FindInMap. A mapping is similar in usage to a case statement and is used in conjunction with parameters. Each mapping has a unique name in a template and consists of a number of key-attribute pairs; each attribute is a literal string.

************************************

"Mappings" : {
  "RegionMap" : {
    "us-east-1" : {
      "AMI" : "ami-8e1fece7",
      "AvailabilityZone" : "us-east-1b"
    },
    "eu-west-1" : {
      "AMI" : "ami-45cefa31",
      "AvailabilityZone" : "eu-west-1a"
    }
  }
},

*********************************

When using the function Fn::FindInMap you need to pass it the map name, the reference key and the name of the value to return.
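
For example, the Resources snippet below uses it like this to look up the Availability Zone for whichever region the stack is being created in:

"AvailabilityZone" : { "Fn::FindInMap" : [ "RegionMap", { "Ref" : "AWS::Region" }, "AvailabilityZone" ] }

Here "RegionMap" is the map name, the current region is the reference key, and "AvailabilityZone" is the name of the value to return.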

Resources: the AWS resources, such as EC2 instances, RDS instances etc., that are declared as members of the AWS stack. Resources declared in the Resources section contain a Properties section, in which you declare both required and optional properties for the resource.

**********************************

"Resources" : {
  "Ec2Instance" : {
    "Type" : "AWS::EC2::Instance",
    "Properties" : {
      "KeyName" : { "Ref" : "KeyName" },
      "AvailabilityZone" : { "Fn::FindInMap" : [ "RegionMap", { "Ref" : "AWS::Region" }, "AvailabilityZone" ] },
      "ImageId" : { "Fn::FindInMap" : [ "RegionMap", { "Ref" : "AWS::Region" }, "AMI" ] },
      "InstanceType" : { "Ref" : "InstanceType" }
    }
  }
},

********************************

Outputs (optional):  Messages that can be returned as  part of the cfn-describe-stacks command

**********************

"PublicIP" : {
  "Description" : "Public IP address of the newly created EC2 instance",
  "Value" : { "Fn::GetAtt" : [ "Ec2Instance", "PublicIp" ] }
}

****************************************

So, putting all the above together to create a template that starts an EC2 instance, you get the following (I've added some bits to make it a workable solution):

******************************

{
  "AWSTemplateFormatVersion" : "2010-09-09",

  "Description" : "Creates an EC2 instance running the AWS Linux 64 bit AMI from the EU region.",

  "Parameters" : {
    "KeyName" : {
      "Description" : "Name of EC2 KeyPair to enable SSH access to the instance",
      "Default" : "AWSNet-EU",
      "Type" : "String"
    },
    "InstanceType" : {
      "Type" : "String",
      "Default" : "t1.micro",
      "Description" : " 't1.micro' | 'm1.large' | 'c1.xlarge' "
    }
  },

  "Mappings" : {
    "RegionMap" : {
      "us-east-1" : {
        "AMI" : "ami-8e1fece7",
        "AvailabilityZone" : "us-east-1b"
      },
      "eu-west-1" : {
        "AMI" : "ami-45cefa31",
        "AvailabilityZone" : "eu-west-1a"
      }
    }
  },

  "Resources" : {
    "Ec2Instance" : {
      "Type" : "AWS::EC2::Instance",
      "Properties" : {
        "KeyName" : { "Ref" : "KeyName" },
        "AvailabilityZone" : { "Fn::FindInMap" : [ "RegionMap", { "Ref" : "AWS::Region" }, "AvailabilityZone" ] },
        "ImageId" : { "Fn::FindInMap" : [ "RegionMap", { "Ref" : "AWS::Region" }, "AMI" ] },
        "InstanceType" : { "Ref" : "InstanceType" }
      }
    }
  },

  "Outputs" : {
    "InstanceId" : {
      "Description" : "InstanceId of the newly created EC2 instance",
      "Value" : { "Ref" : "Ec2Instance" }
    },
    "AZ" : {
      "Description" : "Availability Zone of the newly created EC2 instance",
      "Value" : { "Fn::GetAtt" : [ "Ec2Instance", "AvailabilityZone" ] }
    },
    "PublicIP" : {
      "Description" : "Public IP address of the newly created EC2 instance",
      "Value" : { "Fn::GetAtt" : [ "Ec2Instance", "PublicIp" ] }
    }
  }
}

*******************************************************************

 

 


Setting up a windows AMI for use with Elastic Bamboo

This is a revision of a post I originally published at http://consultingblogs.emc.com/gracemollison/, but it's quite useful until Atlassian get round to supporting Windows as part of Jira Studio.

Jira Studio uses Elastic Bamboo as its build controller, which in turn uses elastic agents (AWS EC2 instances that are spun up by the build controller to run the build) rather than the standard remote agents.

The diagram from Atlassian below illustrates the Elastic Bamboo configuration:


You will need an Amazon AWS account and will have to create a custom Windows AMI to be used to create your elastic agents. This post outlines what I did to get this working for a recent project. Please note that this is unsupported by Atlassian, but hopefully they'll be supporting Windows soon.

Please note that this post assumes you have some basic knowledge of using AWS.

Start a Windows 2008 instance from an Amazon base Windows 2008 AMI in the us-east-1 region. It needs to be in the US East region as this is where Jira Studio expects to find the build agent.

All actions below are carried out from the instance you have started.

Turn off windows firewall

Install the latest JDK. (You will need tools.jar in your JAVA_HOME path hence why the JDK is required)

Install whatever components you need to be able to undertake a build, e.g. Visual Studio, msbuild, SDKs etc. and other bits

Use the link here as a  guide : http://confluence.atlassian.com/display/BAMBOO/Creating+a+Custom+Elastic+Image

Set up the Amazon ec2 API tools as outlined in the Atlassian guide ( section 5.4). Specific guidance for windows can be obtained from the AWS documentation.

Check the version of Bamboo by clicking on Administration, then expanding System \ System information:


Scroll down to see the Bamboo version

Download the bamboo-elastic-agent that matches the version of Bamboo being used in Jira Studio: http://www.atlassian.com/software/bamboo/BambooDownloadCenter.jspa (make sure you click 'show all').

Create the folder c:\bamboo-elastic-agent and unzip bamboo-elastic-agent-2.6.2.zip to this folder.

Download the latest zip of Ant from http://www.apache.org/dist/ant/binaries/ and unzip it into c:\ant\

Download the latest zip of Maven and unzip it, e.g. into c:\apache-maven-2.0.11

Set up environment variables and paths

Example summary of relevant environment variables (depends on the versions you have downloaded):

ANT_HOME  C:\ant

EC2_CERT  c:\ec2-api-tools\YOUR-cert.pem

EC2_HOME  c:\ec2-api-tools

EC2_PRIVATE_KEY  c:\ec2-api-tools\YOUR-pk.pem

JAVA_HOME  C:\Program Files\Java\jdk1.6.0_23 (point this at the JDK, not the JRE, so that tools.jar can be found)

MAVEN_HOME  c:\apache-maven-2.0.11

Example path variables:

…….  c:\ant\bin;C:\Program Files\Java\jdk1.6.0_23\bin;c:\apache-maven-2.0.11\bin;C:\ec2-api-tools\bin;c:\bamboo-elastic-agent\bin

Create a batch file that consists of two lines:

Line 1: sets the Java classpath (this was obtained by using a simple PowerShell script to scrape the lib folder under bamboo-elastic-agent and setting the result as a CLASSPATH variable)

Line 2: runs the actual elastic agent

Note there are over a hundred jar files, but to date Atlassian have been unable to tell me which ones are actually needed, hence the snippet of the batch file I used rather than the full list.

SET CLASSPATH=acegi-security-1.0.4.jar;activation-1.1.1.jar;activeio-core-3.1.0.jar;activemq-core-5.2.0.jar;activemq-ra-5.2.0………

java -server -Xms32m -Xmx512m -XX:MaxPermSize=256m -classpath %CLASSPATH% com.atlassian.bamboo.agent.elastic.client.ElasticAgentBootstrap > c:\bamboo-elastic-agent\bamboo-elastic-agent.log 2>&1

Test that everything is set up okay by running the batch file interactively. You should see output similar to:

………  java.io.FileNotFoundException: http://169.254.169.254/2008-02-01/user-data
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at java.net.URL.openStream(Unknown Source)
……….

oo.agent.elastic.client.ElasticAgentBootstrap

0 [main] INFO com.atlassian.bamboo.agent.elastic.client.ElasticAgentBootstrap  –

Using tunnnelling. Registering ‘httpt’ and ‘httpst’ protocols.

577 [com.sun.sungrid.service.tunnel.server.TunnelServer] INFO com.sun.sungrid.service.tunnel.server.TunnelServer  – Waiting for tunnel connection.

The key points are that it is trying to fetch the userdata and attempting to create the tunnel. The agent needs to be started by the Bamboo controller, hence the errors.

The agent needs to start automatically when the instance starts, so the batch file needs to be wrapped as a service. NSSM works well for this: https://iain.cx/src/nssm/

Set the service to start automatically:

Create an AMI based on this instance

Log onto Jira Studio and register the AMI, set up the capabilities, test that it will spin up the instance and test a basic build.

Jira Studio recognises both Visual Studio and msbuild so you just need to add the paths in when setting up the capabilities: