The Importance of Automated Replication and High-Availability

The Importance of Automated Replication and High-Availability

All cloud providers offer continuous network and data replication, but it’s your responsibility to manage the replication and availability across regions. What are continuous availability solutions. The importance of high availability and disaster recovery.

 

How important are automated replication and high-availability, and what to do about it. Let’s explore.

 

The need for replication and high-availability

 

We live in a world of unpredictable disasters, forcing IT organizations to plan for application resiliency and continuous service uptime. Short periods of service downtime can result in substantial financial consequences for near-real-time services like Netflix, Uber, and Lyft, which must be up and running 24/7. But all organizations still need to ensure business continuity, replication and high-availability, even during regular maintenance and upgrade downtimes.

 

Today’s cloud infrastructure providers offer tools to address disaster recovery or continuity – at a cost. The cost includes setting up and maintaining a reliable service replication system with several separate physical sites. It also includes implementing trustworthy failover processes, which become costly over time. Mirroring services across several regions also leads to higher processing costs. Maintaining consistency across data centers creates additional bandwidth expenses, while the configuration of private networks across different regions might become a technical and process challenge even for a not-so-complex network.

 

Our Cloud Management Platform for replication and high-availability

 

From the inception of our product, we designed our Cloud Management Platform with replication and high-availability in mind, and with no-single-point-of-failure architecture. It is based on load-balancers, continuous health evaluation, and dynamic DNS configurations. It works as one logical global service with continuous up-time in case of disaster.

 

The ForePaaS Platform has three layers: the first one is a Control Plane to orchestrate all ForePaaS clusters on any cloud (public or private). The second is the cluster itself where the processing, storage, and hosting are done. The third layer is a subset of a cluster dedicated to a customer.

 

The ForePaaS Platform has three layers: the first one is a Control Plane to orchestrate all ForePaaS clusters on any cloud (public or private). The second is the cluster itself where the processing, storage, and hosting are done. The third layer is a subset of a cluster dedicated to a customer.

 

The Importance of Automated Replication and High-Availability

 

 

The Cloud Management Platform is a key component to automated service replication. It offloads the heavy cluster management workload like deployments, upgrades, and copy. If the entire control plane is unavailable, our users can still use their applications, data pipelines, ML models, and databases. For obvious reasons however, new deployments and environment updates are suspended during the disruption.

 

A replication and high-availability example

 

This year, one of our cloud provider’s data centers encountered a failure when one of their data centers caught on fire. The fire quickly spread to other data centers. To operate securely and avoid any human casualty, the firemen turned off the power supply to the whole data center, shutting down the services of the entire region. This is obviously not the first time this kind of lack of replication and high-availability and business disturbance has happened. We all remember the AWS S3 disruption in 2017, which happened although for a very different reason.

 

This year’s cloud provider disruption could have been catastrophic for ForePaaS. But our customers’ end-users didn’t notice any interruption. All end-user sessions were automatically rerouted to other data centers resulting in no service drop whatsoever.

 

Our Cloud Management Platform automatically replicates end-user sessions, including log-in, deployments, upgrades, and close to ten other services, not only in different zones but also regions. Most importantly, it acts automatically as a circuit breaker in case of failure.

 

When a user logs in at the nearest data center, the ForePaaS Platform replication and high-availability features immediately replicate the session across other data centers – in real-time. In the event of an interruption, the platform automatically notices that the data center is not responding, and no requests are sent to this data center until it comes back in a healthy state. The process is seamless to the end-users – They don’t see that their session was shifted from one data center to the other. In normal operation mode, users have a lower latency because they connect to the nearest region from their position.

 

When we designed the control plane in 2015, it was challenging to build this kind of active-active replication across three different regions on two continents, especially for a start-up with limited resources. Cross-region VPC peering, or spanning didn’t exist on the major cloud providers. One cloud provider retained our attention, OVHcloud. They had a network feature called vRack, which required few configurations to work on different regions without VPN or expensive dedicated connections. The best part: the cross-region private traffic was unlimited! So, we decided to launch our Cloud Management Platform on OVHcloud while our workloads were and still are on several other clouds, including AWS, Azure, and GCP.

 

During the 2021 incident, all the servers located in one region became unavailable, and all the existing traffic targeted toward the impacted data center were rerouted to the two other ones thanks to our replication and high-availability services. We didn’t suffer a data loss because we don’t rely on one region to replicate the data and services but three. Of course, this comes with the cost of tripling the public cloud invoice.

 

Beyond high-availability and replication

 

This architecture is not the only one possible, and some middle ground like Active-Passive deployment can perfectly do the job for other use cases. Companies should evaluate their replication service decision based on the cost of a failure – because one will happen at some point in time.

 

For more articles on cloud infrastructure, data, analytics, machine learning, and data science, follow me on Towards Data Science.

 

Get started with ForePaaS for FREE!

LET’S GO

End-to-end Machine Learning and Analytics Platform

Discover how to make your journey towards successful ML/ Analytics – Painless