The Importance of Automated Replication and High-Availability

All cloud providers offer continuous network and data replication, but it’s your responsibility to manage the replication and availability across regions.

 

This article was originally posted by Paul Sinaï on LinkedIn.

 

We live in a world of unpredictable disasters, forcing IT organizations to plan for application resiliency and continuous service uptime. Short periods of service downtime can result in substantial financial consequences for near-real-time services like Netflix, Uber, and Lyft, which must be up and running 24/7. But all organizations still need to ensure business continuity even during regular maintenance and upgrade downtimes.

 

Today’s cloud infrastructure providers offer tools to address disaster recovery or continuity – at a cost. The cost includes setting up and maintaining a reliable service replication system with several separate physical sites. It also includes implementing trustworthy failover processes, which become costly over time. Mirroring services across several regions also leads to higher processing costs. Maintaining consistency across data centers creates additional bandwidth expenses, while the configuration of private networks across different regions might become a technical and process challenge even for a not-so-complex network.

 

From the inception of our product, we designed our Cloud Management Platform with a high-availability architecture that has no single point of failure. It is based on load balancers, continuous health evaluation, and dynamic DNS configurations. It works as one logical global service that stays up even in the event of a disaster.

 

The ForePaaS Platform has three layers. The first is a Control Plane that orchestrates all ForePaaS clusters on any cloud (public or private). The second is the cluster itself, where processing, storage, and hosting take place. The third is a subset of a cluster dedicated to a single customer.

 

Figure: The Cloud Management Platform runs three servers in each of its three regions. They are all connected to clusters of servers on different cloud providers and private clouds. Each cluster hosts several tenants, each represented by a house with a key inside.

The Cloud Management Platform is a key component of automated service replication. It offloads heavy cluster management workloads such as deployments, upgrades, and copies. Even if the entire control plane is unavailable, our users can still use their applications, data pipelines, ML models, and databases. For obvious reasons, however, new deployments and environment updates are suspended during the disruption.

 

This year, one of our cloud providers suffered a failure when one of its data centers caught fire. The fire quickly spread to other data centers. To operate safely and avoid any casualties, the firefighters cut the power supply to the whole site, shutting down the services of the entire region. This is obviously not the first disruption of this kind. We all remember the AWS S3 disruption in 2017, although it happened for a very different reason.

 

This year’s cloud provider disruption could have been catastrophic for ForePaaS. But our customers’ end-users didn’t notice any interruption. All end-user sessions were automatically rerouted to other data centers resulting in no service drop whatsoever.

 

Our Cloud Management Platform automatically replicates end-user sessions, including log-in, deployments, upgrades, and close to ten other services, not only across different zones but also across regions. Most importantly, it automatically acts as a circuit breaker in case of failure.
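To make the circuit-breaker idea concrete, here is a minimal sketch in Python. It is not ForePaaS's implementation: the class name, the failure threshold, and the cool-down period are all assumptions chosen for illustration. The breaker stops sending requests to a data center after a few consecutive failures, then lets a probe request through once a cool-down has elapsed.

```python
import time


class CircuitBreaker:
    """Illustrative circuit breaker for one data center (hypothetical names and defaults)."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after_s = reset_after_s          # cool-down before a probe is allowed
        self.clock = clock                          # injectable clock, handy for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def allow_request(self):
        """Return True if traffic may be sent to this data center right now."""
        if self.opened_at is None:
            return True
        # Half-open: once the cool-down has elapsed, let a probe request through.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self):
        """A healthy response closes the circuit and clears the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        """Count a failure; open (or re-open) the circuit once the threshold is reached."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

A router would call allow_request() before forwarding traffic and report each outcome with record_success() or record_failure(), so an unresponsive data center is skipped until a probe succeeds.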

 

When a user logs in at the nearest data center, ForePaaS immediately replicates the session across the other data centers in real time. In the event of an interruption, the platform automatically detects that the data center is not responding, and no requests are sent to it until it returns to a healthy state. The process is seamless for end-users: they don't see that their session was shifted from one data center to another. In normal operation, users get lower latency because they connect to the region nearest to them.
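The routing behavior described above can be sketched as follows. This is a toy model under stated assumptions, not the platform's code: the Region and SessionStore classes, the region names, and the latency figures are hypothetical, and real session replication is far more involved than copying a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    name: str
    latency_ms: float     # hypothetical latency measured from the user's position
    healthy: bool = True  # outcome of the most recent health check


def pick_region(regions):
    """Return the lowest-latency healthy region, or None if every region is down."""
    candidates = [r for r in regions if r.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.latency_ms)


@dataclass
class SessionStore:
    """Toy stand-in for cross-region session replication."""
    replicas: dict = field(default_factory=dict)  # region name -> {session_id: data}

    def write(self, regions, session_id, data):
        # Copy the session to every region so any of them can serve it later.
        for region in regions:
            self.replicas.setdefault(region.name, {})[session_id] = dict(data)

    def read(self, region, session_id):
        return self.replicas.get(region.name, {}).get(session_id)


# Hypothetical regions: the user normally lands on the nearest one.
regions = [Region("eu-west", 12.0), Region("eu-central", 25.0), Region("na-east", 95.0)]
store = SessionStore()
store.write(regions, "s1", {"user": "alice"})

regions[0].healthy = False       # the nearest region stops responding
fallback = pick_region(regions)  # next-nearest healthy region
session = store.read(fallback, "s1")
```

Because the session was already written to every region at log-in, the fallback region can serve it immediately, which is why the user does not notice the switch.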

 

When we designed the control plane in 2015, it was challenging to build this kind of active-active replication across three different regions on two continents, especially for a start-up with limited resources. Cross-region VPC peering or spanning didn't exist on the major cloud providers. One cloud provider caught our attention: OVHcloud. They had a network feature called vRack, which required little configuration to work across regions without a VPN or expensive dedicated connections. The best part: the cross-region private traffic was unlimited! So, we decided to launch our Cloud Management Platform on OVHcloud, while our workloads were, and still are, on several other clouds, including AWS, Azure, and GCP.

 

During the 2021 incident, all the servers located in one region became unavailable, and all the existing traffic targeted at the impacted data center was rerouted to the two others. We didn't lose any data because we replicate data and services across three regions rather than relying on one. Of course, this comes at the cost of tripling the public cloud invoice.

 

This architecture is not the only one possible, and a middle ground such as an Active-Passive deployment can do the job perfectly well for other use cases. Companies should base their replication decision on the cost of a failure, because one will happen at some point in time.
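One rough way to frame that decision is to compare the expected yearly cost of each option: what you pay for infrastructure plus what an outage would cost you, weighted by how likely it is. The sketch below uses entirely made-up figures (infrastructure costs, a 5% yearly chance of a regional outage, recovery times, and downtime cost per hour); substitute your own estimates.

```python
def expected_annual_cost(infra_cost, outage_probability, outage_hours, downtime_cost_per_hour):
    """Infrastructure spend plus the probability-weighted cost of one regional outage."""
    return infra_cost + outage_probability * outage_hours * downtime_cost_per_hour


# All numbers below are hypothetical placeholders, not measurements.
DOWNTIME_COST_PER_HOUR = 100_000
OUTAGE_PROBABILITY = 0.05  # assumed yearly chance of losing a region

options = {
    # name: (yearly infrastructure cost, hours of downtime if a region is lost)
    "single-region": (100_000, 48),
    "active-passive": (180_000, 1),
    "active-active": (300_000, 0),
}

costs = {
    name: expected_annual_cost(infra, OUTAGE_PROBABILITY, hours, DOWNTIME_COST_PER_HOUR)
    for name, (infra, hours) in options.items()
}
cheapest = min(costs, key=costs.get)
```

With these placeholder figures the active-passive option comes out cheapest. Raising the assumed outage probability or recovery time for active-passive shifts the balance toward active-active, which is the point: the right answer depends on what a failure costs you.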

 
