Cluster Reliability / Atlas Considerations

7:38
Up to this point, we've discussed some issues you may encounter while managing your cluster: issues with replication and sharding, and with restoring from backups. We've mostly been looking at these issues from the perspective of self-managed clusters. So let's take a few minutes to talk about some features of MongoDB Atlas which can help you overcome these issues or avoid them before they occur. In this video, we'll be discussing Atlas' automated monitoring features, auto-scaling, and how to back up and restore with Atlas.
As we've discussed previously in this skill, one of the best ways to identify issues and prevent them from recurring is to keep an eye on key metrics. It's important to establish a baseline for what is typical for your instance, so you can tell when something has changed. It's also important to have alerts set up, so you know when metrics exceed certain thresholds or important events occur. Fortunately, MongoDB Atlas can help you with all of this.
Let's take a look at the metrics first. To view metrics for your cluster, you can click on view monitoring next to the cluster name on the projects page. If you're already looking at your cluster, you can click on cluster metrics in the left frame. Once you're on the cluster metrics page, you'll see this layout with a chart. You can use the menu above the chart to change the granularity of the sample data, the date and time charted, which nodes you're charting, and what the chart displays. On the cluster metrics page, you can track many metrics which are helpful in assessing the health and performance of your cluster. To learn more, review the metrics mentioned in the previous videos, our skill on monitoring tools, and our documentation.
Atlas also provides alerts, which can be customized and configured to let you know when certain events occur or thresholds are crossed. Any triggered alerts will be sent out via email, SMS, or whatever other method you may have configured.
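To make the idea of a baseline plus alert thresholds concrete, here is a minimal Python sketch. It is not how Atlas implements alerts. The function names, the sample CPU values, the three-standard-deviation rule, and the 80 percent limit are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize historical samples as (mean, standard deviation)."""
    return mean(samples), stdev(samples)


def check_metric(value, baseline, max_sigma=3.0, hard_limit=None):
    """Return a list of alert messages for one new sample.

    Two independent checks are made: how far the value sits from the
    baseline, and whether it exceeds a fixed threshold.
    """
    avg, sd = baseline
    alerts = []
    if sd > 0 and abs(value - avg) > max_sigma * sd:
        alerts.append(f"deviates from baseline: {value} vs typical {avg:.1f}")
    if hard_limit is not None and value > hard_limit:
        alerts.append(f"exceeds threshold: {value} > {hard_limit}")
    return alerts


# Hypothetical CPU utilization samples (percent) collected during normal operation.
history = [30, 32, 29, 31, 33, 30, 28, 31]
baseline = build_baseline(history)

normal_alerts = check_metric(31, baseline, hard_limit=80)
spike_alerts = check_metric(92, baseline, hard_limit=80)
print(normal_alerts)
print(spike_alerts)
```

A value close to the baseline produces no alerts, while a value far outside it trips both checks. This is the kind of signal a configured alert would then deliver by email or SMS.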
And when you look on your organization page, you'll see active alerts listed. You can view them by clicking here. On the project alerts page, we can see an alert and acknowledge it if we so choose.
Okay. We've looked at our metrics and alerts. Now let's discuss auto-scaling. Throughout this skill, we've discussed issues and situations, like replication lag or unbalanced workloads in sharded clusters, where scaling your cluster is a good strategy for mitigating the negative impact or remediating the issue. With auto-scaling, Atlas can automatically take action to scale your cluster up or down based on sustained CPU, memory, or IOPS usage. Remember, auto-scaling is designed to scale the cluster based on sustained resource usage, not to respond instantly to large, sudden spikes. Please make sure you leave overhead to handle fluctuations in traffic.
If auto-scaling is not enabled on your cluster and you'd like to enable it, you can edit the configuration of your cluster. Once on the configuration page, you'll find auto-scaling options under the cluster tier menu. From here, you can enable auto-scaling, toggle whether or not the cluster can be auto-scaled down, and set your minimum and maximum cluster tiers. You can also enable storage scaling if it's not already enabled. Note that both auto-scaling and auto storage scaling are enabled by default unless you explicitly disabled them when you created the cluster.
Scaling resources automatically can significantly reduce the performance impact of issues discussed in this skill. However, what happens if the problem affects the entire system? What if the entire cluster is compromised? For example, corrupted data, accidental deletions, and ransomware are all disasters you can recover from by restoring from backup data. Let's take a closer look at how to back up and restore with MongoDB Atlas. First, let's click on backup on the left frame.
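As a brief aside on the auto-scaling behavior described above, the following Python sketch illustrates the difference between sustained usage and a short spike. It is only an illustration. The `next_tier` function, the 75 and 50 percent thresholds, and the rule that every sample in the window must cross the threshold are assumptions, not the actual criteria Atlas applies. The tier names are used only as labels.

```python
# Ordered list of tiers used as labels for this illustration.
TIERS = ["M10", "M20", "M30", "M40"]


def next_tier(current, samples, min_tier, max_tier,
              high=75.0, low=50.0, allow_scale_down=True):
    """Pick a tier from a window of utilization samples (percent).

    Scale up only if every sample in the window is above `high`;
    scale down only if every sample is below `low`. A single spike
    inside an otherwise quiet window therefore changes nothing.
    """
    idx = TIERS.index(current)
    lo_idx, hi_idx = TIERS.index(min_tier), TIERS.index(max_tier)
    if samples and min(samples) > high and idx < hi_idx:
        return TIERS[idx + 1]
    if allow_scale_down and samples and max(samples) < low and idx > lo_idx:
        return TIERS[idx - 1]
    return current


sustained = next_tier("M20", [80, 85, 90, 82], "M10", "M40")
spike = next_tier("M20", [30, 35, 95, 32], "M10", "M40")
quiet = next_tier("M20", [20, 25, 22, 30], "M10", "M40")
pinned = next_tier("M20", [20, 25, 22, 30], "M10", "M40", allow_scale_down=False)
print(sustained, spike, quiet, pinned)
```

The window containing one spike leaves the tier unchanged, which is why you still need headroom for sudden fluctuations in traffic. The `allow_scale_down` flag and the minimum and maximum tiers mirror the options mentioned for the configuration page.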
This takes us to the backup page where we can see all the databases being backed up and view their most recent snapshot, when the next snapshot is scheduled to be taken, and the oldest snapshot for each instance. If we click on our cluster, we can see data for all the snapshots for this cluster, including when they were created, the cloud provider and region, retention period, and frequency of the backup. By clicking on the chevron next to a snapshot, we can get more details like the cluster type, whether the config server is embedded or dedicated, and the total size of the data that would be restored if we restored from the snapshot. We can also see the version and the encryption key.
Beyond your scheduled backup policy, we can create an on-demand snapshot by clicking take snapshot now. This brings up a modal window which will allow you to set the retention period for the snapshot and write a description. Click take snapshot and a message will tell you that an email will be sent when the snapshot is in progress. Here we can see that the snapshot is currently being taken. Remember that with storage auto-scaling enabled, MongoDB Atlas will automatically scale your cluster to ensure that you have enough disk space available to keep your backups functioning and your cluster healthy.
Once the snapshot is complete, we can recreate the scenario we discussed at the beginning of this course. Let's accidentally drop the entire customers collection. We can confirm that the data is gone by running findOne or show collections. Oh, no. Our collection is gone. Fortunately, we've been backing up our data and just happen to have a recent snapshot to restore from.
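The check we just performed can be sketched in Python as well. The helper below is hypothetical. It assumes an object that behaves like a PyMongo `Database`, meaning it offers `list_collection_names()` and item access that returns something with `find_one()`. So that the example runs without a live cluster, it uses small in-memory stand-ins, which are not part of any MongoDB library.

```python
def collection_status(db, name):
    """Report whether `name` exists in `db` and whether it holds any document."""
    exists = name in db.list_collection_names()
    sample = db[name].find_one() if exists else None
    return {"exists": exists, "has_documents": sample is not None}


# In-memory stand-ins used only so this sketch runs without a database.
class FakeCollection:
    def __init__(self, docs):
        self.docs = docs

    def find_one(self):
        return self.docs[0] if self.docs else None


class FakeDatabase:
    def __init__(self, collections):
        self.collections = collections

    def list_collection_names(self):
        return list(self.collections)

    def __getitem__(self, name):
        return self.collections.get(name, FakeCollection([]))

    def drop_collection(self, name):
        self.collections.pop(name, None)


db = FakeDatabase({"customers": FakeCollection([{"name": "Ada"}])})
before = collection_status(db, "customers")
db.drop_collection("customers")  # the "accidental" drop
after = collection_status(db, "customers")
print(before)
print(after)
```

Running the same helper again after a restore would be one way to confirm that the collection and its data are back.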
Note that if you're making use of continuous cloud backup, you can also conduct a point-in-time restore, which allows you to select a date and time. Atlas will then restore from the most recent snapshot prior to that point and replay the backed-up oplog operations, recreating the changes made since the snapshot to bring your data back to that specific point in time. In our example, we have an extremely recent snapshot, and because it's a test environment, no data has changed since it was taken, so we'll conduct a standard restore.
Click on the three-dot menu and then restore, and a modal window will appear. Here, you can specify the target project and cluster you wish to restore to. You might address performance issues by restoring to a different cluster with more compute resources. In this case, we'll restore to the same location. But in a production environment, you may wish to restore to a different cluster to reduce downtime or for other reasons. Type "I agree" in the warning field if you agree. The warning is there to advise you that all existing data in the cluster you're restoring to will be deleted, as we're restoring the entire cluster. Once the restore has successfully completed, our cluster will be back online and in exactly the same state it was in when it was backed up. We can confirm that the collection and the data therein have been restored by once again using the findOne method.
As you can see, Atlas makes identifying issues simple with monitoring and alerts. It's also much easier to mitigate and remediate issues with tools like auto-scaling, auto storage scaling, and the easy-to-use backup and restore functions. Having said that, it's of course important to know how to manually check your cluster's performance and to understand what your baseline looks like. If you'd like to know more, please see our other badges, online documentation, and other MongoDB University content. See you next time.
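For readers who want to see the point-in-time restore idea from this video in code, here is a simplified Python sketch: start from the most recent snapshot at or before the target time, then replay logged operations up to that time. The snapshot and log formats, the integer timestamps, and the function name are assumptions made for illustration. They do not reflect the real oplog format or how Atlas performs a restore.

```python
import copy


def point_in_time_restore(snapshots, oplog, target_ts):
    """Rebuild state at target_ts from the latest earlier snapshot plus log replay."""
    candidates = [s for s in snapshots if s["ts"] <= target_ts]
    if not candidates:
        raise ValueError("no snapshot at or before the target time")
    base = max(candidates, key=lambda s: s["ts"])
    state = copy.deepcopy(base["data"])  # never mutate the stored snapshot
    for entry in sorted(oplog, key=lambda e: e["ts"]):
        # Replay only operations made after the snapshot, up to the target time.
        if base["ts"] < entry["ts"] <= target_ts:
            if entry["op"] == "insert":
                state[entry["id"]] = entry["doc"]
            elif entry["op"] == "delete":
                state.pop(entry["id"], None)
    return state


# Hypothetical snapshots and a simplified operation log.
snapshots = [
    {"ts": 100, "data": {1: {"name": "Ada"}}},
    {"ts": 200, "data": {1: {"name": "Ada"}, 2: {"name": "Grace"}}},
]
oplog = [
    {"ts": 150, "op": "insert", "id": 2, "doc": {"name": "Grace"}},
    {"ts": 250, "op": "insert", "id": 3, "doc": {"name": "Alan"}},
    {"ts": 300, "op": "delete", "id": 1},  # the "accidental" deletion
]

just_before_delete = point_in_time_restore(snapshots, oplog, 260)
after_delete = point_in_time_restore(snapshots, oplog, 300)
print(sorted(just_before_delete))
print(sorted(after_delete))
```

Choosing a target time just before the unwanted operation recovers the data without losing the legitimate changes made after the last snapshot, which is the advantage over a standard snapshot restore.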