Data Resilience: Self-Managed / Data Resilience Scenario
Let's bring everything we've learned about data resilience together and build a complete resilience strategy for a real world scenario. While some of the steps will be abbreviated for the demonstration, this should give you a good picture of how to implement a data resilience strategy.
In this video, we'll design and implement a resilient MongoDB deployment for Leafy Bank, a fintech company specializing in online payment processing and investment services.
Leafy Bank's primary mission is to safeguard customer data while maintaining uninterrupted service.
As a financial company, they operate in a highly regulated environment and must implement data resiliency measures to comply with industry standards and ensure customer trust.
Understanding Leafy Bank's requirements is the first step to creating a strong data resilience strategy. So let's examine what they need.
Leafy Bank has two critical requirements that will drive our architectural decisions.
First, they need the ability to remain fully operational in the event of a full cloud provider region outage. This ensures continuous service for customers even during major infrastructure failures.
Second, all personally identifiable information must be encrypted so that only authorized users can see the data. This helps Leafy Bank comply with financial regulations, protect customer privacy, and maintain trust while mitigating the risk of data breaches. Next, we know that Leafy Bank can tolerate losing no more than one minute of data and must restore operations within fifteen minutes to minimize business disruption.
This translates to a recovery point objective, or RPO, of one minute, and recovery time objective, or RTO, of fifteen minutes.
Finally, backups must be retained for a total of seven years. Warm data must be stored for two years, while cold data must be stored for an additional five years. Access to this data must be restricted to privileged users, meaning no one can edit, modify, or delete backup copies without the proper permissions. These backups must be retained in multiple locations, including a long term archival located off-site for regulatory compliance and disaster recovery.
Let's start with the first requirement: remaining fully operational in the event of a full cloud provider region outage. To survive a regional outage, we'll deploy a replica set whose members span multiple cloud provider regions.
For Leafy Bank, we'll deploy two members in US East, two members in US West, and one member in EU Central.
This distribution ensures we can survive a complete regional failure. Of course, multi region resilience comes with trade offs in complexity and latency that need to be carefully considered. For this demonstration, we've already provisioned the necessary servers across three regions. For information on how to do this, refer to the MongoDB Ops Manager documentation for detailed guidance on server provisioning and the MongoDB agent installation process.
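The distribution just described can be sketched as a replica set configuration. This is a minimal illustration with hypothetical hostnames and priorities; in practice, Ops Manager generates and manages the real configuration for you.

```python
from collections import Counter

# Sketch of a five member replica set spanning three regions.
# Hostnames and priorities are hypothetical placeholders.
replica_set_config = {
    "_id": "leafy-shard-0",
    "members": [
        {"_id": 0, "host": "use1-node1.leafybank.internal:27017", "tags": {"region": "US_EAST"}},
        {"_id": 1, "host": "use1-node2.leafybank.internal:27017", "tags": {"region": "US_EAST"}},
        {"_id": 2, "host": "usw2-node1.leafybank.internal:27017", "tags": {"region": "US_WEST"}},
        {"_id": 3, "host": "usw2-node2.leafybank.internal:27017", "tags": {"region": "US_WEST"}},
        {"_id": 4, "host": "euc1-node1.leafybank.internal:27017", "tags": {"region": "EU_CENTRAL"}},
    ],
}

# With five voting members, a majority (three) must survive the loss of
# any single region for the cluster to keep electing a primary.
regions = Counter(m["tags"]["region"] for m in replica_set_config["members"])
for region, count in regions.items():
    survivors = len(replica_set_config["members"]) - count
    assert survivors >= 3, f"Losing {region} would break the voting majority"
```

Note how the two-two-one split is what makes this work: losing either two-member region still leaves three voting members, a majority of five.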
With our servers ready, let's create the cluster in Ops Manager.
On our cluster creation page, let's set our cluster name and shard prefix. We'll use leafy bank for the cluster name and leafy shard for the prefix. We've set the version of MongoDB to 8.2.2-ent, where the ent suffix indicates the enterprise version, and have supplied the location for our data directories and log files on the nodes. To add our nodes to the shard replica set, we just need to expand the shard settings and select the nodes from the drop down.
Since this is a sharded cluster, we'll also need to select the node for the mongos instance.
For more information on sharded clusters and the roles each node plays, check out our skill on sharding.
With the configuration set, we can click the create cluster button.
Great. The sharded cluster is provisioned, but we have a warning that monitoring is not enabled. To fix this, we need to go to the servers view in Ops Manager and enable monitoring at the server level.
While we're doing this, we can also enable backups on the servers for later use. That takes care of the first requirement. Now let's look at the second: encrypting all personally identifiable information, or PII. We'll focus on configuring TLS to keep the PII data encrypted in transit. For TLS encryption, we'll need certificates for each server, which we've already provisioned for this demonstration.
Check out our networking security self managed skill for more information about how to set up TLS encryption for your self managed deployment. With the certificates on the nodes, we can go back to Ops Manager and click on security.
On the security page, we'll click here to edit the TLS settings for the cluster.
Now we can put the file path for our certificate under the TLS certificate authority file path and TLS cluster certificate authority file path.
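Under the hood, the settings we enter here correspond to mongod's net.tls configuration options. Here's a sketch of what gets applied to each node, with hypothetical file paths standing in for the ones we provisioned:

```python
# Sketch of the net.tls options Ops Manager applies to each mongod process.
# The file paths below are hypothetical placeholders for this demonstration.
tls_config = {
    "net": {
        "tls": {
            "mode": "requireTLS",  # reject any connection that is not TLS
            "certificateKeyFile": "/etc/mongodb/tls/server.pem",
            "CAFile": "/etc/mongodb/tls/ca.pem",  # CA used to validate presented certificates
        }
    }
}

# requireTLS means unencrypted clients are refused outright, rather than
# merely preferring TLS when both sides support it.
assert tls_config["net"]["tls"]["mode"] == "requireTLS"
```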
After we click save, Ops Manager will configure the TLS settings across the cluster, and we will see TLS enabled on our cluster page. This is great, but when we're handling sensitive information like PII, we need an extra layer of protection.
To do this, we can use client side field level encryption, or CSFLE, with a remote key management system, or KMS. With CSFLE, sensitive fields will be encrypted before they even leave Leafy Bank's application.
This means your PII stays protected everywhere while traveling over the network, in database memory, in storage and backups, and even in system logs. So while CSFLE adds complexity through key management requirements, it will deliver end to end encryption, protecting sensitive data. To implement CSFLE, Leafy Bank can start by identifying which fields need protection, such as social security numbers or account information, then set up encryption keys using a key management service. MongoDB provides client libraries for popular programming languages that make it easy to mark these sensitive fields for automatic encryption. Nice work.
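For illustration, marking fields for automatic encryption is typically done with a schema map. The sketch below uses MongoDB's CSFLE algorithm identifiers, but the namespace and field names are hypothetical, and a real map would also include a keyId referencing a data encryption key managed through the KMS:

```python
# Sketch of a CSFLE schema map marking hypothetical PII fields for automatic
# client side encryption. A real map also carries a keyId for each field,
# referencing a data encryption key created through the KMS.
DETERMINISTIC = "AEAD_AES_256_CBC_HMAC_SHA_512-Deterministic"  # supports equality queries
RANDOMIZED = "AEAD_AES_256_CBC_HMAC_SHA_512-Random"            # stronger, not queryable

schema_map = {
    "leafybank.customers": {
        "bsonType": "object",
        "properties": {
            "ssn": {"encrypt": {"bsonType": "string", "algorithm": DETERMINISTIC}},
            "accountNumber": {"encrypt": {"bsonType": "string", "algorithm": RANDOMIZED}},
        },
    }
}

# Fields not listed in the map, like a customer's display name, stay plaintext.
```

The trade off between the two algorithms is the usual one: deterministic encryption lets the application query by social security number, while randomized encryption leaks less but cannot be queried.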
Let's move on to the third requirement, achieving an RPO of one minute and an RTO of fifteen minutes or less.
We can achieve this by using Ops Manager to configure a continuous backup schedule for our cluster. We'll start by navigating to the continuous backup section in Ops Manager and clicking on start next to our Leafy Bank deployment. We click start again, and Ops Manager will begin taking snapshots automatically.
For our financial data, we'll customize the snapshot schedule to meet regulatory requirements.
To do this, we'll edit snapshot schedule.
We'll configure retention for our hourly snapshots to five days, daily snapshots to fifteen days, weekly for eight weeks, and monthly snapshots to seven years to meet financial record keeping regulations. Perfect. Let's move on to our fourth requirement.
We've already satisfied the first portion by setting our monthly snapshots to be retained for seven years. Ops Manager allows us to store snapshots in a file system, a block store, or an AWS S3 bucket.
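As a quick sanity check, the retention schedule we configured can be expressed as data and validated against the requirement. The field names here are illustrative, not the Ops Manager API:

```python
# Illustrative sketch of the snapshot retention schedule configured above.
# Continuous backup also tails the oplog between snapshots, which is what
# makes point in time restores, and hence the one minute RPO, feasible.
retention_schedule = {
    "hourly_snapshot_retention_days": 5,
    "daily_snapshot_retention_days": 15,
    "weekly_snapshot_retention_weeks": 8,
    "monthly_snapshot_retention_years": 7,
}

REQUIRED_RETENTION_YEARS = 7  # financial record keeping requirement
assert retention_schedule["monthly_snapshot_retention_years"] >= REQUIRED_RETENTION_YEARS
```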
Now all we need to do is set up the off-site archival storage and restrict access to privileged users only. For the off-site storage, we can download snapshots from Ops Manager and move them to our preferred archival location.
When we're ready to do that, we can navigate to the continuous backup page in Ops Manager and select view all snapshots from the drop down next to the backup we want to download.
On the snapshots page, we can click on restore or download for the snapshot, and then on the sidebar, we click on the download button.
We then choose whether the restore link can be used once or infinitely and how long it is good for.
Once we hit the finalize request button, we can download our backup and move it to our archival location. To avoid unnecessarily large databases, it's important to distinguish between moderately accessed, or warm, data and historical, or cold, data that is rarely if ever accessed. For Leafy Bank, warm data such as transactions from the past six to twelve months can be stored using several methods.
One approach is to use secondary storage, like cost effective HDDs. Cold data, including historical transactions older than twelve months, can be stored in external storage solutions such as AWS S3. To tackle the last piece of the requirement, we need to make sure that users can only interact with our cluster according to the principle of least privilege. So let's implement role based access controls.
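Before we do, the warm versus cold rule described above can be sketched as a simple classifier. The twelve-month cutoff follows the guideline just mentioned; the function and its name are illustrative:

```python
from datetime import date, timedelta

# Sketch of the tiering rule: transactions from roughly the last twelve
# months stay on warm storage (e.g. secondary HDDs); anything older moves
# to cold archival storage such as AWS S3.
WARM_WINDOW_DAYS = 365  # assumed cutoff for this illustration

def storage_tier(transaction_date: date, today: date) -> str:
    """Classify a transaction as 'warm' or 'cold' by its age."""
    age = today - transaction_date
    return "warm" if age <= timedelta(days=WARM_WINDOW_DAYS) else "cold"

today = date(2025, 6, 1)
assert storage_tier(date(2025, 1, 15), today) == "warm"
assert storage_tier(date(2023, 3, 1), today) == "cold"
```

In a real archival pipeline, a scheduled job would apply a rule like this to select documents for export before deleting them from the warm tier.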
If we go to the MongoDB users section on the security page, we can add additional users by clicking here.
That will open a dialog to allow us to configure the users we need. We want to create three specific users for our application.
First, we need an admin user ("adminUser") with root privileges to be able to handle any situation. This user will be the only user that can modify backups to satisfy the last portion of our backup requirements. Next, we need an application user ("paymentApp") for Leafy Bank's payment processing system.
This user gets read and write access to the transactions database, but nothing else.
Finally, we need a reporting user ("analyticsUser") for Leafy Bank's analytics team. This user gets read only access to prevent accidental data modifications during analysis. When we click on the add new user button, we get a dialog box like we see here.
We'll need to add the user's name and the database that we are setting the permissions on.
Then we choose the roles from the dropdown and provide the password for the user's authentication.
In this example, we are creating the application user ("paymentApp") and granting them read and write permissions on the transactions database.
To save this user, we just click the add user button.
We can repeat this process for the other users as well. When finished, we can see that all three users are now listed on the MongoDB user security page with the correct permissions in place. Great job so far. We've done a lot of work to configure MongoDB to meet all of Leafy Bank's requirements.
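For reference, the three users map to createUser command documents like these. The usernames match the ones we just created; the passwords are placeholders, never hard-code real credentials:

```python
# Sketch of the three users as createUser command documents.
# Passwords are placeholders for this demonstration.
users = [
    {"createUser": "adminUser", "pwd": "<secret>",
     "roles": [{"role": "root", "db": "admin"}]},
    {"createUser": "paymentApp", "pwd": "<secret>",
     "roles": [{"role": "readWrite", "db": "transactions"}]},
    {"createUser": "analyticsUser", "pwd": "<secret>",
     "roles": [{"role": "read", "db": "transactions"}]},
]

# Least privilege check: only the admin user holds the root role, so only
# that account can touch backup configuration or other administrative state.
root_holders = [u["createUser"] for u in users
                if any(r["role"] == "root" for r in u["roles"])]
assert root_holders == ["adminUser"]
```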
With that completed, let's set up our monitoring alerts. Although not explicitly stated in the list of business requirements, this is a best practice and allows you to address issues early and avoid having to restore data in the first place.
From our cluster page, we can click on metrics. And then from the cluster metrics page, we can click on the shard to get to the metrics page for the individual nodes in the shard. From here, we can click on the add alert button to take us to the project alerts page. This page shows us all of the existing alerts and lets us add additional alerts.
For our financial services application, we'll configure several critical alerts including replication lag, disk space percentage used, and we'll want to keep an eye on our connection count. To learn more about these metrics, check out our documentation and monitoring skill. Excellent work. You successfully designed and implemented a complete data resilience strategy for Leafy Bank.
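As a reference point, those alert conditions can be expressed as data. The metric names and thresholds below are assumptions for illustration, not Ops Manager's exact identifiers, and should be tuned against your own baselines:

```python
# Illustrative alert conditions; names and thresholds are assumptions.
alert_conditions = [
    {"metric": "replication_lag_seconds", "threshold": 60},   # mirrors the one minute RPO
    {"metric": "disk_space_used_percent", "threshold": 90},
    {"metric": "connection_count", "threshold": 5000},
]

def should_alert(metric: str, observed: float) -> bool:
    """Return True when an observed value exceeds its configured threshold."""
    return any(c["metric"] == metric and observed > c["threshold"]
               for c in alert_conditions)

assert should_alert("replication_lag_seconds", 75)
assert not should_alert("disk_space_used_percent", 40)
```

Tying the replication lag threshold to the RPO is deliberate: if a secondary falls more than a minute behind, the one minute recovery point objective is already at risk.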
Let's review what we've accomplished. We deployed a five member geo distributed replica set across US East, US West, and EU Central, ensuring operations continue during regional failures.
We enabled TLS encryption in transit to protect all PII. We configured automated backups with point in time recovery and customized retention policies to meet financial regulatory requirements.
We implemented proactive monitoring with alerts for replication lag, disk space, and connection counts.
We configured role based access controls with separate users for administration, application access, and analytics.
This comprehensive approach ensures Leafy Bank can maintain customer trust, meet regulatory requirements, and provide uninterrupted service even when facing regional outages or security threats. You now have the practical experience to design and implement similar resilience strategies for your own deployments.
