Using AWS for DR when your solution is not in the Cloud

In my previous post in this series on the resilience of non-cloudy solutions I discussed how to pin down exactly what is acceptable to the business in order to arrive at an appropriate DR solution. In this post I will look, at a fairly high level, at how to exploit AWS to help provide a cost-effective DR solution when your solution does not actually use AWS resources and is probably not designed in the decoupled manner that would make it easy to deploy to the cloud.

Yes, I know, I can’t help it; the cloud is here after all.

Please note that by necessity I’ve kept this at a high level; if I were to attempt to explore the detailed configuration options I’d still be writing this post by Christmas. Needless to say this post just scratches the surface, but hopefully it provides some food for thought.

You will, or should, have local resilience in your solution, consisting of multiple application and web servers, clustered database servers and load balancers.

The easiest DR solution to implement, but the most costly, is to replicate this setup, albeit with perhaps fewer servers and maybe a single database server instance, at an alternative physical location, and to put processes in place to replicate data across to that second location.

A typical configuration will look something like this:

[Diagram: standard DC replicated to a second location]

There are plenty of variations on this, but in the end it entails physically maintaining a distinct location that replicates the application architecture and associated security controls. Resources need to be in place to support that location, the components need to be updated regularly, and all the usual best practices need to be followed to validate the solution. There’s no point finding out the solution doesn’t work at the moment you need it.


At this point you should hopefully be thinking that this is a lot of investment for something that will only rarely be used. So here’s where AWS can help keep those costs down.


The first model, which I’ve called the ‘halfway house’, may be an option for those who are unable to make use of the full range of AWS resources and who, for whatever reason, are unable or unwilling to store their data there. It still requires two maintained DCs, but it saves costs by having the resilient application and web servers run as AWS instances. The cool thing here is that those resilient servers/instances are not actually operational unless needed (you would have prepped AMIs and, hopefully, use them in conjunction with a configuration management tool to ensure they are fully up to date when launched). You will not have the overhead of watering and feeding them that you would have if you were 100% responsible for the infrastructure. The core AWS components that make this work are EC2, VPC and ELB. If you wanted, there is also the potential to use Route 53 to manage the DNS aspects needed for external routing. There are issues with this model though, such as the possibility of a lack of capacity when you need to spin up those instances (although the use of multiple Availability Zones and regions should overcome that fear), the overhead associated with managing three sets of resources, and latency, just to name three that come to mind.
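To make that a little more concrete, here is a minimal sketch of what invoking that standby tier might look like. I’m assuming the boto3 Python SDK purely for illustration, and the AMI ID, key pair, subnet, region and ELB name are all placeholders rather than anything prescriptive.

    import boto3

    ec2 = boto3.resource("ec2", region_name="eu-west-1")
    elb = boto3.client("elb", region_name="eu-west-1")

    # Spin up the standby web/app servers from the pre-baked AMI only when DR is invoked.
    instances = ec2.create_instances(
        ImageId="ami-0123456789abcdef0",      # placeholder: your prepped AMI
        InstanceType="m5.large",              # placeholder instance size
        MinCount=2,
        MaxCount=2,
        KeyName="dr-keypair",
        SubnetId="subnet-0123456789abcdef0",  # VPC subnet reserved for the DR tier
    )
    for instance in instances:
        instance.wait_until_running()

    # Put the freshly launched instances behind the (classic) load balancer.
    elb.register_instances_with_load_balancer(
        LoadBalancerName="dr-web-elb",
        Instances=[{"InstanceId": instance.id} for instance in instances],
    )

Your configuration management tool would then take over to bring the software on those instances fully up to date.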

The ‘halfway house’ will look something like this:


[Diagram: partial use of AWS, the ‘halfway house’]

Making use of AWS VPC means that you can create virtual networks built on the AWS infrastructure, which gives you a great range of networking configurations. For example, in the diagram above I’ve shown two groups of instances: one that is externally accessible and another that is basically an extension of your private LAN. There are far too many possible scenarios with just these features of AWS, and obviously every application is different (see why I made sure this post was kept at a high level?).
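As a rough illustration of that split, here is a sketch (again assuming boto3, with made-up CIDR ranges) that carves a VPC into a public subnet for the externally accessible instances and a private subnet that you would reach over a VPN link back to your own network.

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    vpc = ec2.create_vpc(CidrBlock="10.10.0.0/16")["Vpc"]
    public = ec2.create_subnet(VpcId=vpc["VpcId"], CidrBlock="10.10.1.0/24")["Subnet"]
    private = ec2.create_subnet(VpcId=vpc["VpcId"], CidrBlock="10.10.2.0/24")["Subnet"]

    # An Internet gateway and a default route make the first subnet externally reachable.
    igw = ec2.create_internet_gateway()["InternetGateway"]
    ec2.attach_internet_gateway(InternetGatewayId=igw["InternetGatewayId"], VpcId=vpc["VpcId"])
    rt = ec2.create_route_table(VpcId=vpc["VpcId"])["RouteTable"]
    ec2.create_route(RouteTableId=rt["RouteTableId"],
                     DestinationCidrBlock="0.0.0.0/0",
                     GatewayId=igw["InternetGatewayId"])
    ec2.associate_route_table(RouteTableId=rt["RouteTableId"], SubnetId=public["SubnetId"])

    # The private subnet gets no Internet route; it is reached over the VPN connection
    # back to the on-premise LAN, effectively extending your private network.
    print(vpc["VpcId"], public["SubnetId"], private["SubnetId"])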

The nirvana, though, if you really want to see the costs tumble, is to get rid of DC 2 and use AWS as the recovery site. As a bonus it can also be used for those extra processing needs on an on-demand basis. This not only reduces the support overhead and saves cost, as you are no longer committed to paying for a second location with all the associated kit necessary to make it a viable alternative site, but it also provides a wide variety of failover and recovery options that you just won’t get when you have to commit to infrastructure up front (hopefully that pre-empts the question about why not a private cloud: you would still need your own platform).

This model, which I’ve called the ‘Big Kahuna’, can look a little like this:


[Diagram: the ‘Big Kahuna’, AWS as the recovery site]

With the ‘Big Kahuna’ you should make use of any of the AWS resources available. In the flavour above I’m using S3 to store regular snapshots, transaction logs and so on from my primary database. Why not replicate directly? Well, S3 is cheap storage, and in the scenario I’m illustrating my RTO and RPO values allow enough delay between failure and recovery for me to reconstruct the database, when needed, from the data stored in my S3 bucket. Regular reconstruction exercises should take place, though, as part of the ongoing validation of the failover processes. AMIs and a configuration management solution (as it’s me, it will be Chef) are used to provision up-to-date application and web servers. Route 53 is used to facilitate DNS management, and where I need to ensure that traffic is kept internal I’m making use of VPC.
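By way of example, the S3 side of that might look a little like the following sketch. The bucket name, backup file paths and key layout are assumptions for illustration only.

    import boto3
    from datetime import datetime, timezone

    s3 = boto3.client("s3", region_name="eu-west-1")
    bucket = "example-dr-db-backups"  # placeholder bucket name

    # Push the nightly dump and the latest transaction logs into S3 under a dated prefix.
    stamp = datetime.now(timezone.utc).strftime("%Y/%m/%d/%H%M")
    s3.upload_file("/backups/orders_db.dmp", bucket, f"snapshots/{stamp}/orders_db.dmp")
    s3.upload_file("/backups/orders_db_redo.log", bucket, f"txlogs/{stamp}/orders_db_redo.log")

    # During a failover (or a regular DR test) find the most recent snapshot to restore
    # onto the freshly provisioned database server.
    latest = s3.list_objects_v2(Bucket=bucket, Prefix="snapshots/")
    print(sorted(obj["Key"] for obj in latest.get("Contents", []))[-1])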

The introduction of RDS for Oracle means it is now viable for enterprises to use AWS as the failover solution. There may be concerns over performance, but this is a DR situation; if you are not in a position to re-engineer for the cloud, then reduced performance during failover should simply form part of the business impact discussions with your internal sponsors.
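If you did go down the RDS route, provisioning the failover database could be as simple as the sketch below. The identifier, instance class, credentials and licence model are placeholders, and you should check the current engine names and sizes rather than take these values as given.

    import boto3

    rds = boto3.client("rds", region_name="eu-west-1")

    # Create the standby Oracle instance; the dump held in S3 is then restored into it.
    rds.create_db_instance(
        DBInstanceIdentifier="dr-orders-db",
        Engine="oracle-se2",                 # placeholder engine name
        LicenseModel="bring-your-own-license",
        DBInstanceClass="db.m5.large",       # placeholder instance class
        AllocatedStorage=200,
        MasterUsername="admin",
        MasterUserPassword="change-me-please",
        MultiAZ=False,  # a single AZ may be acceptable for DR; your RTO decides
    )

    # Wait until it is available before kicking off the restore.
    waiter = rds.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier="dr-orders-db")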

AWS has services such as Dedicated Instances, which may be the only way your security and networking guys will allow you to exploit AWS resources, but you would need to do your sums to see whether that makes sense. Personally I’d focus on trying to understand the ‘reasons’ behind such a requirement. There are a number of valid cases where it would be needed, but I suspect cost isn’t really going to be the driving force there.

The devil is in the detail when designing a failover solution that uses AWS as part of your DR. If you are planning a new solution, make sure you talk to the software architect about best practices for designing for the cloud; they are still applicable to on-premise solutions too.

Data is really where all the pain points are, and it will likely dictate the model and the ultimate configuration.

If you are trying to retrofit DR onto an existing solution then the options open to you may not be that many, and it’s likely you will have to start off with some form of the ‘halfway house’.

Also, don’t forget you can just try stuff out at minimal cost. Wondering whether a particular scenario would work? Just try it out; you can delete everything after you’ve finished.

The cost effectiveness of the solution is directly related to how much use you make of AWS resources to deliver it. I even have a graph to illustrate this (@jamessaull would be proud of me).


[Graph: rough comparative DR costs, from no AWS usage down to the ‘Big Kahuna’]

This graph is based on very rough comparative costs, starting with no AWS resources, as in the first situation I discussed, and working my way down through to the ‘Big Kahuna’. You can easily do your own sums: AWS pricing is on their site (they even provide a calculator), and you know how much those servers, licences, networking hardware, hardware maintenance and support contracts cost you.
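If it helps, the sums really are back-of-an-envelope stuff. Every figure in the snippet below is a made-up placeholder to show the shape of the comparison, not real pricing; substitute your own quotes and the numbers from the AWS calculator.

    # All monthly figures are invented placeholders purely to illustrate the comparison.
    idle_dc_per_month = 12000.0        # second DC: rent, power, kit, licences, support
    halfway_house_per_month = 7000.0   # smaller second DC plus on-demand AWS instances
    big_kahuna_per_month = 1500.0      # S3 storage, periodic DR tests, Route 53, no second DC

    for name, cost in [("Two DCs", idle_dc_per_month),
                       ("Halfway house", halfway_house_per_month),
                       ("Big Kahuna", big_kahuna_per_month)]:
        print(f"{name:15s} roughly {cost * 12:,.0f} per year")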

How to deal with the ‘I want no downtime’ response

With a little time between jobs, and having to dodge the rain, I found myself thinking about some stuff I’ve not given much deep thought recently, so here’s the first of a couple of posts on resiliency as applied to non-cloudy solutions.

I recall numerous occasions where I would design a highly resilient solution that would provide as many nines as I’d dare commit to, only for the proposal to be back on my desk some weeks later with the words ‘too expensive, where can we save money on it?’.

The main reason I would come up against this is that I would ask the business sponsor two key questions related to disaster recovery as part of the business impact analysis:

What is their RTO? Recovery Time Objective: the duration of time, from the point of failure, within which a business process must be restored after a disaster or disruption.

What is their RPO? Recovery Point Objective: the acceptable amount of data loss, measured in time.

These two questions would inevitably elicit the response ‘no data loss, and restore service as quickly as possible’, so I would go off and design a platform to get as close as I could to those unattainable desires.

I soon learnt that a different approach was needed to prevent the paper-bounce game, eventually coming up with gold, silver and bronze resiliency options as part of the platform proposal. We would include what we felt were viable resiliency options, graded according to arbitrary RTO and RPO levels, together with the associated costs to deliver on them.
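Purely by way of illustration, the sort of menu I mean can be captured very simply; the RTO, RPO and cost figures below are arbitrary examples rather than recommendations.

    # Hypothetical tiers with invented figures, just to show the shape of the offer.
    resiliency_options = {
        "Gold":   {"rto_hours": 2,  "rpo_hours": 0.5, "annual_cost": 250_000},
        "Silver": {"rto_hours": 8,  "rpo_hours": 4,   "annual_cost": 120_000},
        "Bronze": {"rto_hours": 48, "rpo_hours": 24,  "annual_cost": 40_000},
    }

    for tier, opt in resiliency_options.items():
        print(f"{tier}: restore within {opt['rto_hours']}h, "
              f"lose at most {opt['rpo_hours']}h of data, "
              f"around {opt['annual_cost']:,} a year")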

This at least meant the business sponsor had somewhere to start from in terms of a monetary value, and could see that wanting the gold standard was going to cost them a lot more than if they really thought it through. This got them thinking about what they really wanted in terms of RTO and RPO, and we would then discuss what options were open to them.

For example, does the accounting system need to be restored within two hours with less than half an hour’s loss of data? In my experience it’s only at month-end processing that this level of recovery is really needed. During the majority of the month the data is held elsewhere and so can easily be recreated, and the month’s accounts file is still open, so this isn’t an issue in terms of processing. Is anyone working on the system at weekends? This is the sort of thought process the business needs to go through when requesting new systems and trying to figure out what they want in terms of resiliency.

When faced with stark questions about RTOs and RPOs, the natural response is ‘I want my system to be totally resilient with no downtime’; they have no idea what this means in terms of resources and thus potential cost, so why not save time by giving them some options? You may be lucky and one of the options will be an exact fit, but if not, the sponsor at least has an idea of what it costs to provide their all-singing, all-dancing requirements. It’s more likely they will say something like “The Silver option may do but we need an enhanced level of support at month end” or maybe “we need to make sure we have a valid, tested backup from the previous night at month end”.

The beauty of this approach is that straight away you’ve engaged them in conversation and prevented a paper bounce. It’s not a new approach, and it’s one service teams are used to when dealing with external suppliers, but there is no reason not to use the same methods with internal requests.