Data Resilience: Atlas / MongoDB Atlas Architecture for Resilience
Catastrophic technical failures such as hardware malfunctions, regional outages, and infrastructure disasters can take down entire systems, making them a critical threat to data.
Embedding best practices directly into your architecture is one of the most effective ways to prevent technical failures from impacting your business. Best practices will look different depending on your business needs. Should you keep all of your data in a single region to reduce cost and simplify your deployment? Or do you need to distribute data across multiple regions and cloud providers to create redundancy layers that keep your application running even when components fail? Whichever scenario applies, MongoDB's architecture has you covered. In this video, we'll look at how MongoDB Atlas is designed for data resilience with scalability, high availability, security, and compliance at its core.
We'll examine the different configuration options available from the default setup to advanced multi cloud deployments so you can choose the right architecture for you. Depending on the resilience requirements of your application, Atlas architecture offers several default features, but also provides flexible options to meet your needs. By default, every MongoDB Atlas cluster maintains read and write availability when one node fails, and read availability when two out of three nodes fail or during availability zone outages.
This means that your application maintains high availability by continuing to serve read requests to your users, even when parts of your infrastructure are unavailable. This happens automatically when you create a new cluster. Atlas creates at least three data nodes per replica set and distributes them across separate availability zones within a single provider's region. In this example, these three nodes are distributed across three different zones within one AWS region. Since each region contains multiple availability zones, essentially independent data centers with their own power and networking, no single facility failure can take down your database.
So what happens if a primary node goes down? Automated failover is MongoDB's safety mechanism that kicks in when this happens. When the primary node fails, MongoDB automatically detects the issue and promotes one of the secondary nodes to become the new primary.
This behavior can be fine tuned through client configuration options to meet your requirements.
As long as you've configured the driver properly, your application stays available with minimal interruption.
For more information on how to do this, check out MongoDB's documentation on building a resilient application. So back to our example. If the primary node in zone one fails, one of the secondaries in either zone two or zone three will become the new primary.
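To make the client configuration point concrete, here is a minimal sketch of a connection string that sets a few failover-related driver options. The option names (retryWrites, retryReads, w, serverSelectionTimeoutMS) are standard MongoDB connection string options; the host name, database name, and the helper function itself are hypothetical and only for illustration, and the values shown are a starting point rather than a recommendation.

```python
from urllib.parse import urlencode


def build_resilient_uri(host: str, database: str = "") -> str:
    """Build a mongodb+srv URI with failover-related options (illustrative helper)."""
    options = {
        # Retry a failed write once, for example after a primary election.
        "retryWrites": "true",
        # Retry a failed read once after a transient network or election error.
        "retryReads": "true",
        # Acknowledge writes only after a majority of nodes have them,
        # so acknowledged writes survive a failover.
        "w": "majority",
        # How long the driver waits to find a suitable server (such as a
        # newly elected primary) before raising an error, in milliseconds.
        "serverSelectionTimeoutMS": "30000",
    }
    return f"mongodb+srv://{host}/{database}?{urlencode(options)}"


# Hypothetical host and database; credentials are deliberately omitted.
uri = build_resilient_uri("cluster0.example.mongodb.net", "app")
print(uri)
```

Recent drivers already enable retryable reads and writes by default, so spelling the options out mainly documents the intent. Check your driver's documentation for its actual defaults.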
Resilience against a single node failure is available by default. But what if your application demands even more? What if you need to withstand complete regional failures? That's where multi region and multi provider clusters come in, allowing you to design clusters that match your needs. With multi region clusters, your data nodes are distributed across multiple geographic regions within the same cloud provider.
If an entire region goes down due to a major power outage or natural disaster, your cluster automatically fails over to nodes in another region. The most common multi region cluster topologies include at least three regions, like our example here, so that your cluster can automatically elect a new primary capable of serving reads and writes even if one region fails. Multi cloud clusters take resilience a step further by distributing your nodes across different cloud providers, AWS, Google Cloud, and Azure. This protects you from provider level outages, which, while rare, do happen.
You can deploy nodes across providers in the same geographic area to maintain low latency access and meet regulatory requirements, like in this example, where data is distributed across all three providers in the same region for high availability.
Or you can spread them globally for maximum protection. This is your best in class availability option, which protects against everything from facility failures to entire cloud provider outages. Keep in mind that greater resilience means greater cost, so you'll need to balance the level of protection your application needs against your budget. To recap, multi region clusters protect against regional outages and are ideal when you need resilience within a single cloud provider.
Multi cloud clusters protect against provider level outages, offering an additional layer of protection for mission critical applications.
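The reason three regions (or providers) are the common recommendation comes down to majority voting: a replica set can only elect a primary while more than half of its voting nodes are reachable. The sketch below checks that rule for a given node layout. It assumes every node is an electable voting member, and the region names and node counts are made up for illustration.

```python
def can_elect_primary(total_voting: int, healthy_voting: int) -> bool:
    """A primary can be elected only with a strict majority of voting nodes."""
    return healthy_voting > total_voting // 2


def survives_any_single_region_loss(nodes_per_region: dict) -> bool:
    """True if losing any one region still leaves a voting majority."""
    total = sum(nodes_per_region.values())
    return all(
        can_elect_primary(total, total - lost)
        for lost in nodes_per_region.values()
    )


# Hypothetical layouts: five nodes over three regions vs. three nodes over two.
three_regions = {"region-a": 2, "region-b": 2, "region-c": 1}
two_regions = {"region-a": 2, "region-b": 1}

print(survives_any_single_region_loss(three_regions))  # True
print(survives_any_single_region_loss(two_regions))    # False
```

In the two-region layout, losing the region that holds two of the three nodes leaves a single node, which is not a majority, so the cluster could serve reads but not elect a new primary.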
You can also combine these strategies to create multi cloud, multi region clusters for the highest level of availability and protection against infrastructure failure.
While multi region architecture protects against infrastructure failure, we also need to protect against performance degradation caused by resource heavy tasks. This is where workload isolation comes in.
Workload isolation is the practice of separating different types of database operations into dedicated resources. This is a powerful Atlas feature that can enhance performance, resilience, and compliance. There are a number of ways to isolate a workload. For example, you could implement one of MongoDB's multi tenancy options. In more advanced replica set configurations, you could use replica set tags to isolate certain types of workloads to specific members of a replica set, or zone sharding to distribute data across specific shards based on predefined zone rules.
Another approach is to route specific workloads or applications, such as analytics queries, operational reads, or backup operations, to designated nodes, specific cloud providers, or even entire regions, rather than having everything compete for resources on your primary infrastructure. Take this example, where a customer has been running their production application on a multi region AWS cluster with a secondary read only analytics node in Google Cloud.
This node (labeled "analytics") processes data using analytics software running on Google Cloud.
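One way an analytics application might target a node like this is through read preference options in its connection string. The sketch below assumes the analytics node carries the replica set tag nodeType:ANALYTICS, which is the tag we understand Atlas to assign to analytics nodes; confirm this against the Atlas documentation for your cluster. The host name, database name, and helper function are hypothetical.

```python
from urllib.parse import urlencode


def build_analytics_uri(host: str, database: str = "") -> str:
    """Build a URI that routes reads to tagged analytics nodes (illustrative helper)."""
    options = {
        # Never read from the primary, so production writes are not affected.
        "readPreference": "secondary",
        # Assumed tag for Atlas analytics nodes; verify it for your deployment.
        "readPreferenceTags": "nodeType:ANALYTICS",
    }
    # safe=":" keeps the tag's colon readable instead of encoding it as %3A.
    return f"mongodb+srv://{host}/{database}?{urlencode(options, safe=':')}"


# Hypothetical host and database; credentials are deliberately omitted.
analytics_uri = build_analytics_uri("cluster0.example.mongodb.net", "reports")
print(analytics_uri)
```

The production application would keep its own connection string without these options, so the two workloads never share a read target.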
This isolation prevents resource contention, so heavy analytical queries won't slow down your production application traffic.
It also reduces the risk of cascading failures.
For example, if a problematic analytical query causes issues on an analytics node or in a specific region, your primary workload will remain unaffected.
As you can see, MongoDB Atlas architecture offers different options for achieving data resilience and high availability. From the default three node configuration to multi region deployments to multi cloud clusters, Atlas gives you the tools to build a system that matches your needs.
To deploy multi region or multi cloud clusters in MongoDB Atlas, you need to use a dedicated cluster tier, meaning M10 or higher.
The specific tier you choose depends on your performance and storage requirements, but all dedicated tiers offer you full flexibility for multi region and multi cloud configurations. You can configure your cluster architecture when you first create it or modify it later as your resilience requirements evolve.
For more information on configuring a cluster, check out MongoDB's Atlas documentation.
Now that we know how to protect against failures and workload issues, how do we know our cluster is resilient?
Fortunately, we don't need to wait until something goes wrong since Atlas allows us to test our cluster resilience.
Atlas' test resilience feature lets you simulate failures with the click of a button. This means you can continuously test your mission-critical applications throughout the year just like your regular CI/CD (continuous integration/continuous deployment) process.
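If you run a simple probe against your application while a simulated failure is in progress, you can turn the results into a number to track over time. The sketch below is one hypothetical way to do that: it takes a list of probe results and reports the longest stretch during which requests failed. The probe format and the sample data are assumptions for illustration, not output from Atlas.

```python
def longest_outage_seconds(probes: list) -> float:
    """Return the longest failure window in a time-sorted list of probes.

    Each probe is a (timestamp_seconds, succeeded) tuple. A window runs from
    the first failed probe to the next successful one. A failure window that
    is still open at the end of the list is not counted.
    """
    longest = 0.0
    outage_start = None
    for timestamp, succeeded in probes:
        if not succeeded and outage_start is None:
            outage_start = timestamp  # first failure of a new window
        elif succeeded and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None  # window closed by a successful probe
    return longest


# Made-up probe results, one per second, with failures from t=2 to t=4.
sample_probes = [
    (0.0, True), (1.0, True), (2.0, False), (3.0, False),
    (4.0, False), (5.0, True), (6.0, True),
]
print(longest_outage_seconds(sample_probes))  # 3.0
```

Comparing this number across test runs gives you a rough view of whether driver or topology changes shorten the interruption your users would see.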
When a real incident happens, you'll know exactly how your system responds.
Nice work. Let's recap what we covered in this video.
First, we explored Atlas cluster architectures for high availability, including default cluster configuration, multi region deployments, and multi cloud configurations. We discussed how automated failover maintains uptime by seamlessly switching to healthy nodes when failures occur.
You learned that workload isolation can prevent resource contention and ensure consistent performance across different application needs.
We also covered Atlas' test resilience feature for testing your cluster's resilience so you can proactively identify potential issues before they impact production. Keep in mind, this is an overview of essential features, not an exhaustive review. For complete information on all Atlas capabilities and configuration options, refer to the official documentation, including the Atlas Architecture Center.
