Data Resilience: Self-Managed / Managing Failover and Backups

Even the most carefully designed systems can experience disruptions, whether from hardware failures, human error, ransomware attacks, or natural disasters. So you have to be prepared to restore your data when disaster strikes. In this video, we'll learn how MongoDB Ops Manager helps you manage failovers in a self-managed deployment. We'll also briefly discuss monitoring and alerting. Next, we'll focus on how to create automated backups, as well as how to implement workload isolation for self-managed MongoDB deployments. Let's dive in.

Ops Manager enables you to deploy, monitor, back up, and scale MongoDB on your own infrastructure. When preparing a disaster recovery strategy, businesses typically start with two critical metrics: your recovery point objective, or RPO, and your recovery time objective, or RTO. Your RPO answers the question, how much data can we afford to lose? It defines the maximum acceptable amount of data loss measured in time, whether that's seconds, minutes, hours, or even days. Your RTO answers, how quickly do we need to be back up and running? It defines the maximum acceptable downtime before your business is significantly impacted. By clearly defining your RPO and RTO for each workload, you can select the right combination of MongoDB features to meet your needs.

Before a failover happens, Ops Manager can help us monitor our replica set health, as well as configure alerts to notify us of potential issues before they trigger failovers. We should also periodically validate our failover mechanisms in testing environments to verify that elections complete successfully and that applications handle primary changes gracefully. Effective monitoring prevents small issues from becoming major outages. Ops Manager provides centralized observability for self-managed deployments, tracking critical metrics that indicate system health.
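One of those metrics is replication lag, which can be derived from the replica set's status document. Below is a minimal sketch, assuming a Python application and a status document shaped like the output of MongoDB's `replSetGetStatus` command; the helper name and the sample data are illustrative, and in production the document would come from running that command against a live deployment:

```python
from datetime import datetime

def summarize_replica_health(status):
    """Summarize member health and replication lag from a
    replSetGetStatus-style document (illustrative helper)."""
    members = status["members"]
    # Find the primary's last applied operation time.
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    report = {}
    for m in members:
        # Lag = how far this member's last applied op trails the primary's.
        lag = (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        report[m["name"]] = {
            "state": m["stateStr"],
            "healthy": m["health"] == 1,
            "lag_seconds": lag,
        }
    return report

# Illustrative status document; on a live deployment this would come from
# the replSetGetStatus command (e.g. via your driver's admin command API).
status = {
    "members": [
        {"name": "node1:27017", "stateStr": "PRIMARY", "health": 1,
         "optimeDate": datetime(2024, 1, 1, 12, 0, 30)},
        {"name": "node2:27017", "stateStr": "SECONDARY", "health": 1,
         "optimeDate": datetime(2024, 1, 1, 12, 0, 28)},
        {"name": "node3:27017", "stateStr": "SECONDARY", "health": 0,
         "optimeDate": datetime(2024, 1, 1, 12, 0, 5)},
    ]
}

report = summarize_replica_health(status)
print(report["node3:27017"])  # unhealthy member, 25 seconds behind
```

A summary like this is the kind of signal you'd wire into an alert: a member that is unhealthy or whose lag keeps growing deserves attention before it affects a failover.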
To get an overview of the general health of our cluster, or to observe a failover, we navigate to the Ops Manager dashboard and select the list view for our deployment. The list view displays the current status of each node in a replica set. Healthy nodes show a green indicator, while nodes experiencing issues appear yellow or red. Beyond basic health status, Ops Manager's Metrics tab provides detailed performance data across a variety of areas. Check out our Skills content on monitoring and performance tools for a detailed look at important metrics to monitor and best practices for setting up alerts.

Even with a robust monitoring and alerting strategy, we can't prevent every problem. The next critical layer of protection is a solid backup strategy. Backups defend against catastrophic events, like regional outages or accidental data deletion. Ops Manager's automated backup system captures point-in-time snapshots of our data, enabling recovery to specific moments.

To begin the configuration, navigate to Continuous Backup in the sidebar under the Database heading. From here, we'll locate the process we want to back up from the list and click Start in the Status column. A pane will pop up for us to review the backup's storage engine. Once we click Start, automated backups are active.

Ops Manager automatically creates snapshots on a schedule. By default, base snapshots are taken every six hours and retained for two days. Daily snapshots are kept for seven days, weekly snapshots for four weeks, and monthly snapshots for thirteen months. Adjusting your snapshot frequency allows you to tighten your RPO. For example, a six-hour interval means a maximum potential data loss of six hours of writes. Administrators can adjust both frequency and retention periods through the Edit Snapshot Schedule option, balancing recovery flexibility with storage costs. With these automated snapshots in place, the actual recovery process is straightforward.
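The same schedule can also be adjusted programmatically through Ops Manager's API, which exposes a snapshot schedule resource per backed-up cluster. The sketch below only builds the request rather than sending it; the endpoint path and field names reflect the Ops Manager API's snapshot schedule resource but should be verified against your Ops Manager version's documentation, and the host, project ID, and cluster ID are placeholders:

```python
import json

# Placeholder identifiers -- substitute your own Ops Manager details.
OPS_MANAGER_URL = "https://opsmanager.example.com"
PROJECT_ID = "PROJECT-ID"
CLUSTER_ID = "CLUSTER-ID"

# Snapshot schedule fields matching the defaults described above
# (verify field names and allowed values against your version's API docs).
payload = {
    "snapshotIntervalHours": 6,          # base snapshot every six hours
    "snapshotRetentionDays": 2,          # keep base snapshots two days
    "dailySnapshotRetentionDays": 7,
    "weeklySnapshotRetentionWeeks": 4,
    "monthlySnapshotRetentionMonths": 13,
}

url = (f"{OPS_MANAGER_URL}/api/public/v1.0/groups/"
       f"{PROJECT_ID}/backupConfigs/{CLUSTER_ID}/snapshotSchedule")
body = json.dumps(payload)

# To apply the change, send `body` as a PATCH request to `url`,
# authenticating with an Ops Manager API key (HTTP digest auth), e.g.:
#   requests.patch(url, data=body, auth=HTTPDigestAuth(user, api_key),
#                  headers={"Content-Type": "application/json"})
print(url)
```

Lowering `snapshotIntervalHours` tightens your RPO at the cost of more snapshot storage, which is exactly the trade-off the Edit Snapshot Schedule UI exposes.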
When we need to restore data, we just navigate to the Continuous Backup page for our project. From the Overview tab, we can select our deployment and then either click the options button to choose View All Snapshots, or click the Restore or Download button that appears when we hover over the Status column. This opens the snapshots window, which displays available restore points, including stored snapshots and point-in-time options. Ops Manager offers multiple restore types: we can restore from an existing snapshot, from a specific point in time (a date and time), or from an oplog timestamp. For point-in-time restores, Ops Manager creates a custom snapshot that includes all operations up to your selected time.

There are a few other important considerations before using Ops Manager to restore a cluster. When Ops Manager restores a cluster, it removes all existing data from the target host and replaces it with the data from our snapshot. For sharded clusters, we must restore all shards; Ops Manager will not restore a single shard on its own. During an automated restore, we select the cluster to restore to and specify the target project and cluster. Ops Manager displays the required storage space and replaces all existing data on the target cluster while preserving backup data and snapshots. Once everything looks okay, we just click the Restore button and then confirm our changes. Once that's done, a modal will appear letting us know our restore is in progress. For large databases, restores may take several hours. Ops Manager lets you monitor the progress of the restore via the modal, giving you a predictable timeline to communicate to stakeholders. Once complete, we can verify our data integrity by running sample queries against the restored deployment.

Beyond disaster recovery, another common operational challenge is managing different types of workloads on your MongoDB deployment.
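Those post-restore spot checks can be scripted. Here's a minimal sketch, assuming a Python application: it compares expected per-collection document counts against the restored database. To keep the sketch runnable without a live deployment, it exercises the check against small in-memory stand-ins; in practice, `db` would be a database handle from your MongoDB driver, and all names here are illustrative:

```python
def verify_restore(db, expected_counts):
    """Compare expected per-collection document counts against a
    restored database; returns a dict of mismatches (empty = OK)."""
    mismatches = {}
    for coll_name, expected in expected_counts.items():
        actual = db[coll_name].count_documents({})
        if actual != expected:
            mismatches[coll_name] = {"expected": expected, "actual": actual}
    return mismatches

# In-memory stand-ins so the sketch runs without a live deployment;
# a real driver's database handle exposes the same count_documents call.
class FakeCollection:
    def __init__(self, docs):
        self._docs = docs
    def count_documents(self, query):
        return len(self._docs)

class FakeDatabase:
    def __init__(self, collections):
        self._collections = collections
    def __getitem__(self, name):
        return FakeCollection(self._collections.get(name, []))

db = FakeDatabase({"orders": [{"_id": 1}, {"_id": 2}], "users": [{"_id": 1}]})
result = verify_restore(db, {"orders": 2, "users": 3})
print(result)  # {'users': {'expected': 3, 'actual': 1}}
```

Counts are a coarse first check; you'd typically follow up with sample queries against known documents and application-level smoke tests.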
While backups protect against data loss, workload isolation protects your production performance from resource-intensive operations. Workload isolation separates resource-intensive, non-critical operations, such as analytics queries, from your production traffic. MongoDB's read preference settings let us direct queries to specific replica set members. By configuring our analytics queries to read from secondary nodes, we can preserve primary node resources for production writes. We can set this in our application code when we initialize the client, as seen here. This directs all read operations through this client to secondary nodes. Production write operations still go to the primary, but your analytics workload no longer competes for primary node resources.

For more advanced isolation, we can dedicate specific secondary nodes to analytics workloads. In Ops Manager, we can navigate to our deployment and click Modify to adjust the member priorities. We'll set our analytics node's priority to zero and configure the member type to be hidden. This configuration ensures your resource-intensive analytics workloads remain completely isolated from your primary database operations. Once we click Save, this node will exclusively serve analytics queries. Note that because hidden members are invisible to client applications, analytics tools reach a hidden node by connecting to it directly rather than through normal read preference routing.

Other ways of isolating a workload include implementing one of MongoDB's multitenancy options. In more advanced replica set configurations, you could also use replica set tags to isolate certain types of workloads to specific members of a replica set, or zone sharding to distribute data across specific shards based on predefined zone rules.

Fantastic! In this video, you learned how MongoDB Ops Manager helps you manage failovers and backups. We created automated backups and outlined the restoration process for our cluster. We also implemented workload isolation by leveraging read preference settings and priority adjustments to dedicate resources for analytics.
We also explored how replica sets enable automatic failover and used Ops Manager to perform proactive monitoring of key metrics, like replication lag and disk space.
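As a reference for the read-preference pattern covered above: the on-screen code isn't reproduced in this transcript, but directing a client's reads to secondaries is typically done through the connection string's `readPreference` option, which the official drivers honor. A minimal sketch, with placeholder hosts and replica set name:

```python
from urllib.parse import urlencode

# Placeholder hosts and replica set name -- substitute your own.
hosts = "db1.example.com:27017,db2.example.com:27017,db3.example.com:27017"

# readPreference=secondary routes this client's read operations to
# secondary members; writes always go to the primary regardless.
options = urlencode({
    "replicaSet": "rs0",
    "readPreference": "secondary",
})
analytics_uri = f"mongodb://{hosts}/analyticsdb?{options}"
print(analytics_uri)

# Drivers also accept the same option at client construction time,
# e.g. in PyMongo: MongoClient(analytics_uri)
```

Using a separate, secondary-preferring client for analytics keeps the isolation explicit in application code: production traffic uses the default client, while reporting jobs use this one.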