Cluster Reliability / Restoring Data from Backups
Throughout this skill, we've discussed various issues you may encounter and the ways that you can mitigate their impact and remediate your cluster to resolve them. But what do you do if something occurs that you cannot fix?
Imagine this. You're trying to remove some documents that are no longer needed, and you begin to write out a deleteMany query. Your terminal autocompletes the parentheses. You move your hand to hit the backspace key, but suddenly someone bumps into you, moving your arm forward, and your finger hits the Enter key.
You've just asked MongoDB to delete every document in the collection. You move quickly, trying to see if you can stop replication, but it's a low-latency environment and a low-traffic period, and nothing else is queued in the oplog.
The operation replicates swiftly to your secondary nodes, and your replica set does its job, faithfully reproducing the write operation on each node. There's nothing to roll back, because this was a logical delete operation that has been replicated successfully.
That's why backup and recovery are so important.
Sometimes, our database may suffer from unexpected events like attacks, bugs, or hardware or network issues, and we need to be ready to restore our data quickly.
This scenario might sound a little far-fetched, but this kind of thing does happen. In this video, we'll discuss backup strategies, best practices for configuring backups, how to troubleshoot backups, and how to restore. Let's get started.
When planning your backup strategy, you have three options: continuous backup, file system snapshots, or command line tools like mongodump and mongorestore. Let's discuss the pros and cons of each so you can select the right strategy for your use case.
For most production environments, continuous backup on Ops Manager or Cloud Manager, or continuous cloud backup on MongoDB Atlas, will be the best choice. This approach allows for point-in-time recovery, where you select a snapshot and then replay the write operations recorded in the oplog to bring your cluster back to a specific point in time. Importantly, this method is supported by MongoDB, and continuous backup is easy to schedule and monitor with Ops Manager, Cloud Manager, or Atlas.
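To make the idea of point-in-time recovery concrete, here is a minimal toy model in Python. It is only a sketch of the concept, not how Ops Manager, Cloud Manager, or Atlas implement it. The Snapshot and OplogEntry classes and the restore_to_point_in_time function are hypothetical names made up for illustration, timestamps are plain integers, and an update is simplified to a full document replacement.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Snapshot:
    taken_at: int                 # hypothetical integer timestamp
    data: Dict[str, dict]         # maps _id to the stored document


@dataclass
class OplogEntry:
    ts: int                       # when the write was recorded
    op: str                       # "i" insert, "u" update, "d" delete
    doc_id: str
    doc: Optional[dict] = None    # new document for inserts and updates


def restore_to_point_in_time(
    snapshots: List[Snapshot], oplog: List[OplogEntry], target_ts: int
) -> Dict[str, dict]:
    """Start from the latest snapshot at or before target_ts, then replay
    the recorded writes that fall after the snapshot and up to target_ts."""
    candidates = [s for s in snapshots if s.taken_at <= target_ts]
    if not candidates:
        raise ValueError("no snapshot at or before the target time")
    base = max(candidates, key=lambda s: s.taken_at)

    # Copy the snapshot so the replay does not modify it.
    state = {doc_id: dict(doc) for doc_id, doc in base.data.items()}

    for entry in sorted(oplog, key=lambda e: e.ts):
        if not (base.taken_at < entry.ts <= target_ts):
            continue  # already in the snapshot, or after the target time
        if entry.op in ("i", "u"):
            state[entry.doc_id] = dict(entry.doc)
        elif entry.op == "d":
            state.pop(entry.doc_id, None)
    return state
```

In the accidental delete scenario from earlier, you would pick a target time just before the delete, so the replay stops before that write is applied.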
While this option is strongly recommended, you should be sure to leave enough storage overhead for data migration and groom jobs. We'll discuss those in more detail a little later. Another option is to back up via file system snapshots.
This approach can be fast and is typically reliant on functionality that's native to the operating system you're using.
On the other hand, file system snapshots require you to use db.fsyncLock() to lock the system, and thus can require significant scheduled maintenance. They also offer no point-in-time recovery options. Perhaps most importantly, file system snapshots are not suited to backing up sharded clusters.
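As a rough illustration of the lock, snapshot, unlock sequence, here is a small Python sketch. It assumes a PyMongo MongoClient (any object with the same admin.command interface would work), and it issues the fsync and fsyncUnlock database commands, which are the commands behind db.fsyncLock() and db.fsyncUnlock() in the shell. The take_snapshot callable is a hypothetical placeholder for whatever snapshot mechanism your operating system or storage layer provides.

```python
from contextlib import contextmanager


@contextmanager
def fsync_locked(client):
    """Flush pending writes and block new ones while the body runs.

    The lock is released in a finally block so that a failed snapshot
    does not leave the node locked.
    """
    client.admin.command("fsync", lock=True)
    try:
        yield
    finally:
        client.admin.command("fsyncUnlock")


def snapshot_with_lock(client, take_snapshot):
    """Run take_snapshot() (a hypothetical callable) under the fsync lock."""
    with fsync_locked(client):
        return take_snapshot()
```

Because writes are blocked for as long as the lock is held, the snapshot step should be as short as possible, which is part of why this approach tends to need scheduled maintenance.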
We could use mongodump and mongorestore. These command line tools are simple to use, allow you to restore to different hardware, and they're available immediately when you install MongoDB. However, this method isn't ideal for replica sets that are under load. It doesn't offer point-in-time recovery, and it can be slow and resource intensive, especially for large data sets. This method is really best suited to development and testing environments. Since continuous backup is ideal for applications with strong point-in-time recovery needs, that's the approach we'll show you. For additional information on all backup strategies, check out our online documentation and MongoDB University.
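For a development or test environment, a dump and restore could be scripted along these lines. This is a minimal sketch that assumes mongodump and mongorestore are installed and on the PATH. The helper names, the connection string, and the output directory are placeholders chosen for illustration. The --uri, --out, and --gzip options are standard options of these tools.

```python
import subprocess
from typing import List


def mongodump_args(uri: str, out_dir: str) -> List[str]:
    """Build a mongodump command that writes a compressed dump to out_dir."""
    return ["mongodump", f"--uri={uri}", f"--out={out_dir}", "--gzip"]


def mongorestore_args(uri: str, dump_dir: str) -> List[str]:
    """Build a mongorestore command that loads a compressed dump from dump_dir."""
    return ["mongorestore", f"--uri={uri}", "--gzip", dump_dir]


def run(args: List[str]) -> None:
    """Run the command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(args, check=True)


if __name__ == "__main__":
    # Placeholder values: replace with your own deployment and path.
    source = "mongodb://localhost:27017"
    dump_dir = "/tmp/dev-dump"
    run(mongodump_args(source, dump_dir))
    run(mongorestore_args(source, dump_dir))
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems if the connection string contains special characters.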
Now let's look at the details of managing backups and restoring our data. First, let's talk about backup size. It's best to keep your replica set size below two terabytes if possible. Again, this is a best practice. It's not a practical limitation. But remember, the larger the replica set or shard, the longer it will take to perform a full backup and to restore. If your replica set is larger than two terabytes, you should shard the database and ensure that each shard is less than two terabytes in size.
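The two terabyte guideline translates into simple arithmetic. Here is a small sketch that estimates the minimum number of shards needed to keep each shard below the limit. It assumes data is spread evenly across shards, which real clusters only approximate, and it uses decimal terabytes. The function name and the TB constant are just for illustration.

```python
TB = 1000 ** 4  # decimal terabyte, in bytes (an assumption for this sketch)


def min_shards(total_bytes: int, max_bytes_per_shard: int = 2 * TB) -> int:
    """Smallest shard count that keeps every shard strictly below the limit,
    assuming an even distribution of data across shards."""
    if total_bytes < 0 or max_bytes_per_shard <= 0:
        raise ValueError("sizes must be non-negative, and the limit positive")
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    return total_bytes // max_bytes_per_shard + 1


if __name__ == "__main__":
    for size_tb in (1, 3, 5, 8):
        print(f"{size_tb} TB of data: at least {min_shards(size_tb * TB)} shard(s)")
```

In practice you would leave extra headroom for growth and for uneven distribution, so treat the result as a lower bound.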
Where very rapid restores are required, it can be helpful to consider even smaller replica sets or shards.
Restore time, specifically, is something to keep in mind. Restoring from a backup is something you hope you never have to do. But when you have to do it, you certainly want it to complete as quickly as possible.
You'll also want to confirm that your network bandwidth and write speed on your backup storage and your replica set are sufficient to allow for fast backup and fast restore operations. You should ensure that there is enough free space available on your block store to allow groom jobs to run.
Groom jobs remove unused blocks in block stores to reclaim storage space. The groom job must first copy all used blocks to a new target block store and update references to the blocks before dropping the original source database.
This means that in order for groom jobs to be able to run, you'll need to have enough space available on your backup storage location to accommodate two copies of the used blocks, which can be significant. It's a best practice to leave fifteen to twenty-five percent of your disk space available at all times.
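Both space requirements can be checked with a few lines of arithmetic. This sketch assumes you already know the size of the used blocks and the free and total space on the backup storage. How you obtain those numbers depends on your environment, and the function names are made up for this example.

```python
def can_run_groom_job(used_block_bytes: int, free_bytes: int) -> bool:
    """A groom job copies every used block to a new block store before
    dropping the old one, so there must be room for a second copy."""
    return free_bytes >= used_block_bytes


def meets_free_space_guideline(
    total_bytes: int, free_bytes: int, min_free_percent: int = 15
) -> bool:
    """Check the free-space guideline using integer math.

    min_free_percent defaults to 15, the low end of the suggested range.
    """
    return free_bytes * 100 >= total_bytes * min_free_percent


if __name__ == "__main__":
    total, used_blocks, free = 1000, 400, 450  # example values, in GB
    print("room for groom job:", can_run_groom_job(used_blocks, free))
    print("meets 15% guideline:", meets_free_space_guideline(total, free))
    print("meets 25% guideline:", meets_free_space_guideline(total, free, 25))
```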
Finally, be sure to identify potential backup issues by monitoring and confirming your backup activity. Confirm your backups are successful using monitoring tools and logs and perform test restores to make sure that you can restore from your backups. By default, you'll be alerted if a backup has failed. Whether you're using Ops Manager, Cloud Manager, or Atlas, be sure to review your backups.
While some alerts are transitory, others may indicate a persistent problem. You can always review our documentation or reach out to our support team for help.
As I mentioned, you should examine the log files to gain insight into your backup activity.
In addition to the mongod and mongos logs, MongoDB also keeps logs for the backup agent. See this slide and the code summary for the default locations on Linux and Windows.
You can look in the logs around the time of the failed backup to attempt to gain some additional context into what might be causing the issue. If you've identified that there's an issue with your backup, you can take the following steps to resolve it. First, make sure that the MongoDB Agent is running on the deployment being backed up. In a Linux environment, you can confirm this by running ps aux | grep mms. If the agent is not running, you can start it with sudo systemctl start mongodb-mms-automation-agent.service. If the problem persists, contact MongoDB support and include the diagnostic archives and the backup daemon logs. Check our documentation for the default log locations for your operating system.
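If you want to script the agent check, something like the following could work on Linux. It is a sketch that mirrors ps aux | grep mms. The function names are hypothetical, and matching on the substring mms is the same loose heuristic the grep uses, so it may also match unrelated processes.

```python
import subprocess
from typing import List


def agent_processes(ps_output: str, needle: str = "mms") -> List[str]:
    """Return the lines of ps output that mention the needle,
    skipping any grep process that matched its own search term."""
    return [
        line
        for line in ps_output.splitlines()
        if needle in line and "grep" not in line
    ]


def agent_is_running() -> bool:
    """Run ps aux and report whether any matching process was found."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return bool(agent_processes(result.stdout))


if __name__ == "__main__":
    if agent_is_running():
        print("MongoDB Agent appears to be running.")
    else:
        print("No agent process found. To start it:")
        print("  sudo systemctl start mongodb-mms-automation-agent.service")
```

The script only prints the start command rather than running it, since starting a service needs elevated privileges and is usually something you want to do deliberately.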
So you've confirmed your backup is functioning, and now you need to restore data. How can you do this with minimal impact to your application's performance?
Restoring an entire cluster is time consuming. Sometimes, you just want to recover a particular database or collection. While it's not an automated granular restore process, MongoDB does allow you to recover data from a queryable snapshot and manually write it onto a database. As the name implies, queryable snapshots may be queried to compare data in the snapshots against your current production data. While snapshots are read-only, you can use commands like mongodump and mongorestore to copy the collection or database and write it to your cluster.
See our online documentation for more details on how to do this.
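Before copying anything back, it can help to see exactly how the snapshot differs from production. The following sketch compares two lists of documents by _id. It assumes you have already read the documents from the queryable snapshot and from production into plain Python dictionaries, for example with a driver, and the function name is made up for illustration.

```python
from typing import Dict, List


def compare_snapshot_to_production(
    snapshot_docs: List[dict], production_docs: List[dict]
) -> Dict[str, List[dict]]:
    """Group snapshot documents by how they differ from production.

    "missing": present in the snapshot but absent from production.
    "changed": present in both, but with different contents.
    """
    production_by_id = {doc["_id"]: doc for doc in production_docs}
    missing, changed = [], []
    for doc in snapshot_docs:
        current = production_by_id.get(doc["_id"])
        if current is None:
            missing.append(doc)
        elif current != doc:
            changed.append(doc)
    return {"missing": missing, "changed": changed}
```

Documents in the changed group need a closer look, since the difference may be a legitimate update made after the snapshot was taken rather than damage you want to undo.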
If you need to restore a replica set or an entire sharded cluster, you can navigate to the deployment and then click Continuous Backup, which will take you to the Overview tab.
Select your deployment, then click the three-dot menu for the snapshot you wish to restore from. Next, click Restore. You can choose a snapshot, a point in time, or an oplog timestamp.
Select the cluster to which you wish to restore. Note that if you're conducting a test restore or you wish to restore with little to no downtime, you can restore to another cluster rather than your active cluster.
Confirm that you agree, and finally, click Restore.
Great job. We've successfully initiated a restore. Let's quickly recap what we've learned here today. Follow the best practices we discussed in this lesson to ensure that your backup functions as intended.
Should a restore be deemed necessary, consider what data needs to be restored and then select a restore strategy that best aligns with that goal. When contacting support, gather the project diagnostic archives, backup daemon logs, and agent verbose logs.
Great work. We've almost reached the end of this skill. Our final video will be concerned with information specific to users of MongoDB Atlas. See you there.
