For AWS administrators who need to build a multi-regional strategy to maintain the resilience of their cloud deployments in the event of an outage, various approaches come with trade-offs between cost, complexity, and efficiency.
A multi-regional infrastructure setup is a complex undertaking. Organizations need to plan for a variety of factors, including additional services, maintenance, and staff needed to manage complexity. Moving data between regions can increase costs and complexity.
“The two primary benefits of a multi-region strategy are end-user proximity performance and resiliency,” said Tim Banks, principal cloud economist at The Duckbill Group, an AWS cost management consultancy.
The importance of staying active-active
AWS provides a range of backup and recovery options for users. At a minimum, administrators should replicate data across multiple regions, with an appropriate backup plan. This protects operations if an entire AWS Region goes down.
Active-active architecture, which is growing in popularity, supports high availability of data and business processes. It’s also the most expensive option, said Mike Nolan, principal architect at IT consultancy SPR.
In an active-active approach, systems are always running in several regions. If AWS loses a region, the deployment running in one or more other regions takes over. This arrangement means that recovery point objectives (RPOs) and recovery time objectives (RTOs), two critical benchmarks of any disaster recovery (DR) plan, are never at risk.
Eventually, the failed region comes back online and the active-active configuration must be resynchronized. Data in the region that experienced the problem needs to catch up. Working out that catch-up path takes planning and depends on the particular software and systems used in the deployment.
Active-passive strategies are less resilient and less complex, but they are also less expensive. Common active-passive strategies include backup and restore, pilot light, and warm standby. Backup and restore, in which the backup deployment does not exist until it is needed, is the least expensive. Pilot light and warm standby are more complex active-passive variants. They keep a bootstrapped or running application infrastructure, along with up-to-date data, ready for launching an active deployment.
An active-passive strategy typically replicates data only through backups. AWS administrators can use infrastructure as code and DevOps pipelines to quickly deploy infrastructure and restore applications when needed. In the event of a disaster, this automation brings the core network and server platforms online. Data is then restored, and the application and configuration information held in the pipelines makes the applications operational. Even with automation, recovery takes longer in an active-passive strategy than in an active-active approach.
Consider a range of thresholds as part of your overall resilience strategy. While active-active DR delivers the best RPOs and RTOs, it’s not always the right choice.
“Any size organization will struggle to justify a single active-active approach. This adds complexity and increases cost in non-critical systems,” Nolan said.
Organizations can create system tiers based on combinations of RTOs and RPOs for different aspects of their overall AWS usage. The core network and security sit at tier 0 because they are critical to all aspects of getting systems working again. From there, the lower the tier number, the stricter the RTO and RPO. Tier 1 can include any customer-facing, revenue-generating system the business needs to operate, for example. Use tiers to quantify the value of various disaster recovery strategies for workloads.
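As a rough illustration of the tiering idea, the sketch below assigns a workload to the most critical tier whose RTO and RPO limits it fits within. The tier limits, workload names, and the `assign_tier` helper are hypothetical and chosen only for illustration; real thresholds would come from the business.

```python
from dataclasses import dataclass

# Hypothetical tier limits in minutes: (tier, max RTO, max RPO).
# These numbers are illustrative assumptions, not AWS or article values.
TIER_LIMITS = [
    (0, 15, 5),
    (1, 60, 15),
    (2, 240, 60),
]
LOWEST_TIER = 3  # everything that tolerates longer outages or more data loss


@dataclass(frozen=True)
class Workload:
    name: str
    rto_minutes: int  # how quickly the workload must be back online
    rpo_minutes: int  # how much recent data the workload can afford to lose


def assign_tier(workload: Workload) -> int:
    """Return the lowest-numbered tier whose RTO and RPO limits both cover the workload."""
    for tier, max_rto, max_rpo in TIER_LIMITS:
        if workload.rto_minutes <= max_rto and workload.rpo_minutes <= max_rpo:
            return tier
    return LOWEST_TIER


if __name__ == "__main__":
    for w in (
        Workload("core-network", 10, 1),
        Workload("storefront", 60, 15),
        Workload("reporting", 1440, 1440),
    ):
        print(w.name, "-> tier", assign_tier(w))
```

A mapping like this makes it easier to argue that only the lowest-numbered tiers justify the cost of an active-active deployment.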
Keys to application resilience
A lot goes into protecting applications from downtime caused by a regional AWS outage.
Start with the entry point of an application. “The gateway to your application is DNS,” Banks said. To keep traffic flowing to available services, DNS must always point to reachable targets.
Next, configure health checks and monitoring for downtime, errors, or other service degradations. Automate responses to these outages and performance issues.
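One possible way to combine DNS and health checks is a pair of Amazon Route 53 failover records. The sketch below only builds the change batch as a plain dictionary. The domain, IP addresses, and health check ID are placeholders, and `failover_change_batch` is a hypothetical helper, not part of any AWS SDK.

```python
from typing import Optional


def failover_change_batch(
    domain: str,
    primary_ip: str,
    secondary_ip: str,
    primary_health_check_id: str,
    ttl: int = 60,
) -> dict:
    """Build a Route 53 change batch with PRIMARY and SECONDARY failover A records."""

    def record(role: str, ip: str, health_check_id: Optional[str] = None) -> dict:
        record_set = {
            "Name": domain,
            "Type": "A",
            "SetIdentifier": f"{domain}-{role.lower()}",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": ttl,        # a short TTL lets clients pick up a failover sooner
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id is not None:
            # Route 53 stops answering with this record when the check fails.
            record_set["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record_set}

    return {
        "Comment": "Failover routing between two regions (illustrative)",
        "Changes": [
            record("PRIMARY", primary_ip, primary_health_check_id),
            record("SECONDARY", secondary_ip),
        ],
    }


if __name__ == "__main__":
    # All values below are placeholders from documentation address ranges.
    batch = failover_change_batch(
        "app.example.com.", "192.0.2.10", "198.51.100.10", "hypothetical-health-check-id"
    )
    print(batch)
```

With boto3 installed and credentials configured, a batch like this could be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, together with the `HostedZoneId` of the zone.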
In the event of a regional outage, you won’t be the only AWS customer claiming capacity elsewhere. Make sure you’ve pre-provisioned enough resources to handle failover traffic in other regions, Banks said.
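A quick way to reason about pre-provisioned capacity is to ask how much each surviving region must carry if one region disappears. The sketch below assumes traffic is spread evenly across regions and that capacity can be counted in identical instances. Both are simplifying assumptions, and `instances_per_region` is a hypothetical helper.

```python
import math


def instances_per_region(peak_instances_total: int, regions: int, regions_lost: int = 1) -> int:
    """Instances each region needs so the survivors can absorb the full peak load."""
    surviving = regions - regions_lost
    if surviving < 1:
        raise ValueError("at least one region must survive the outage")
    # Round up: a fractional instance still has to be provisioned as a whole one.
    return math.ceil(peak_instances_total / surviving)


if __name__ == "__main__":
    # Hypothetical workload: 12 instances at peak, deployed across 3 regions.
    steady_state = math.ceil(12 / 3)
    with_headroom = instances_per_region(12, regions=3)
    print(f"steady state: {steady_state} per region, with failover headroom: {with_headroom}")
```

The gap between the two figures is the capacity that has to be reserved or kept running ahead of time rather than requested during the outage.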
Preparing for application-level failure is vital, Nolan said. Ensure users aren’t left hanging if any part of a system is faulty or unavailable. Consider chaos engineering, the practice of intentionally and unpredictably breaking parts of an application environment, as a way to verify that your approach to resiliency meets service-level agreements.
Approaches to data replication
Administrators also need a plan to ensure data consistency, reliability, and integrity. Active-active strategies across regions require a synchronous replication approach. In contrast, basic active-passive DR strategies can rely on asynchronous data replication when RTOs and RPOs are less stringent.
Consider the appropriate use of database technology and whether your availability requirements justify the cost and complexity of choosing an active-active strategy for your architecture, Nolan said.
Conventional relational database management systems often require an enterprise-level license to enable the multi-master replication model that multi-regional active-active applications need. Licensing for a relational database management system, or RDBMS, can be a significant cost factor in your overall approach to protecting AWS workloads.
NoSQL databases are better suited for these multi-regional scenarios without high licensing costs. They are often integrated as offerings from AWS, such as Amazon DynamoDB. However, NoSQL’s transaction model is different from RDBMS offerings. NoSQL databases have what is sometimes called eventual consistency, while RDBMSs are considered always consistent.
“This can have big implications for how your data is presented to users and your ability to meet user needs,” Nolan said. It also influences your approach to application architecture for AWS workloads.
Optimize network infrastructure
Assess the networking and infrastructure elements needed to provide disaster recovery and resiliency in a multi-region strategy. An infrastructure-as-code (IaC) approach can automate some aspects of environment setup and enforce best practices. And yet, organizations typically give too little thought to their IaC approach to DR, Nolan said.
Consider how often each layer of your infrastructure architecture changes relative to the systems running in the regions. Avoid bundling static aspects of infrastructure with frequently changing parts. VPC subnets don’t tend to change often, and changes to them may have implications for network addressing. In contrast, security groups change frequently, especially in evolving systems. Do not bundle these configurations into a single set of scripts. Make sure the automation doesn’t impede your ability to change the deployment if needed.
At the same time, be careful not to hard code things that change from region to region. Even the most resilient infrastructure will struggle, Banks said, if the database client in your code has a hard-coded regional endpoint.
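To make the point about hard-coded endpoints concrete, here is a minimal sketch that derives a regional DynamoDB endpoint from configuration instead of embedding it in the code. It assumes the deployment sets an `AWS_REGION` environment variable and targets the standard AWS partition, where endpoints follow the `dynamodb.<region>.amazonaws.com` pattern. `dynamodb_endpoint` is a hypothetical helper.

```python
import os
from typing import Optional


def dynamodb_endpoint(region: Optional[str] = None) -> str:
    """Return the regional DynamoDB endpoint, resolving the region at runtime."""
    # Assumption: the deployment pipeline or runtime sets AWS_REGION per region.
    region = region or os.environ.get("AWS_REGION")
    if not region:
        raise RuntimeError("no region configured; set AWS_REGION or pass a region")
    # Standard-partition endpoint pattern; other partitions use different suffixes.
    return f"https://dynamodb.{region}.amazonaws.com"


if __name__ == "__main__":
    # The same code yields a different endpoint in each region it is deployed to.
    print(dynamodb_endpoint("us-east-1"))
    print(dynamodb_endpoint("eu-west-1"))
```

In practice the AWS SDKs resolve endpoints from the configured region on their own, so the main design choice is to pass the region in from configuration and never pin a client to one region's URL.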
Also take inventory of your network infrastructure, said Gavin McMurdo, chief technical adviser at IStreamPlanet, a video streaming service. Each AWS offering comes with its own burst or sustained network throughput limits. It can be tempting to ignore these limits, which usually don’t matter for something like a storage or database service. But they could become a big issue in a DR scenario, McMurdo noted, when traffic suddenly shifts to another region.
Examine how the dedicated network fiber ports are connected. McMurdo deemed it essential to work with AWS technical advisors to ensure that the fiber terminates in different devices, to avoid a single point of failure. Sometimes fiber ports are moved around by AWS personnel as they consolidate and troubleshoot outages. A direct conversation about network design can reduce the risk of AWS staff introducing a single point of failure into the process of resolving another problem, he said.
Specific AWS services
Nolan recommended shortlisting several AWS services to implement in an active-active architecture, broken down here by usage:
- Amazon CloudFront for regional or global content delivery networks
- EC2 Auto Scaling for traditional IaaS systems and application scenarios
- AWS Lambda for PaaS application scenarios
- S3, RDS, DynamoDB, and Amazon DocumentDB for data storage and access solutions, which have snapshot capabilities
- Elastic Block Store and Elastic File System snapshots for attached disks and shared file systems
Know the costs
An active-active multi-region architecture on AWS will cost significantly more than a single active region, Duckbill’s Banks said. Besides the operating cost of additional compute and storage resources, data transfer is not free.
Consider using Capacity Reservations or Reserved Instances. With Reserved Instances, an organization makes a financial commitment, paid all upfront, partially upfront, or monthly. Capacity Reservations are essentially an attempt to call dibs on existing capacity in an Availability Zone, Banks said, and they don’t require a fixed financial commitment.
Additionally, to design, monitor, and maintain a complex infrastructure, an organization must invest staff time. These expenses do not appear on the AWS bill, but they do exist.
A good practice in a multi-region setup is to minimize the amount of data that has to go back and forth between regions.
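A back-of-the-envelope estimate can show why trimming cross-region traffic matters. The sketch below multiplies daily replicated volume by a per-gigabyte rate. The rate and volumes are placeholder assumptions, not current AWS prices, and `monthly_transfer_cost` is a hypothetical helper.

```python
def monthly_transfer_cost(gb_per_day: float, rate_per_gb: float, days: int = 30) -> float:
    """Estimate the monthly inter-region data transfer cost in dollars."""
    return round(gb_per_day * days * rate_per_gb, 2)


if __name__ == "__main__":
    # Placeholder rate; check the current AWS pricing pages for real figures.
    ASSUMED_RATE_PER_GB = 0.02

    # Hypothetical comparison: replicating everything vs. only changed data.
    full_copy = monthly_transfer_cost(500, ASSUMED_RATE_PER_GB)
    deltas_only = monthly_transfer_cost(50, ASSUMED_RATE_PER_GB)
    print(f"full replication: ${full_copy}, deltas only: ${deltas_only}")
```

Even with a small per-gigabyte rate, the cost scales linearly with the volume moved, which is why designs that replicate only what changed tend to be cheaper to run.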