Cluster Reliability / Troubleshooting Replication Issues

We've seen how conducting an initial sync can help recover from replication problems. An initial sync is necessary when certain conditions are met, such as when replication lag exceeds the oplog window. But what if an issue occurs during an initial sync? What if your cluster regularly experiences failovers, with a node repeatedly going offline? In this video, we'll explore these scenarios in detail so you can identify, mitigate, and recover from initial sync and failover issues.

Let's say you're a database administrator tasked with ensuring your company's MongoDB replica set functions properly. High availability is crucial to your objectives, and one of your nodes is experiencing serious replication lag. You managed to resolve the issue causing the lag, but by that point the node in question had fallen outside the oplog window and now requires an initial sync. You start the initial sync and estimate it should take several hours, based on the size of the data being transferred and the average data transfer rate between the syncing node and its sync source. The time period you estimated elapses and your initial sync still hasn't finished. How do you know if the sync has stalled? And more importantly, what do you do?

First, you'll need to be able to identify whether the sync is still making progress. You can use rs.status() to check the state of each replica set member. While the initial sync is running, the syncing member will report a state of STARTUP2. This will change to RECOVERING while it performs some checks to confirm everything is ready, and then to SECONDARY once it has rejoined the replica set as a secondary node. During an initial sync, running rs.status() on the syncing node will also provide initial sync status information. Here, we'll look at the fields in initialSyncAttempts. durationMillis specifies the duration of the initial sync attempt; here, we can see this attempt lasted just under a minute. status indicates the outcome of the attempt.
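The state progression above can be sketched with a small helper that reads an rs.status()-shaped document. This is a minimal sketch: the sample document and host names are illustrative, not real shell output.

```javascript
// Sketch: summarize replica set member states from an rs.status()-shaped
// document. During an initial sync a member reports STARTUP2, then
// RECOVERING while final checks run, then SECONDARY once it rejoins.
function summarizeMemberStates(status) {
  return status.members.map((m) => `${m.name}: ${m.stateStr}`);
}

// Illustrative sample; real output comes from running rs.status() in mongosh.
const sampleStatus = {
  set: "rs0",
  members: [
    { name: "node1:27017", stateStr: "PRIMARY" },
    { name: "node2:27017", stateStr: "SECONDARY" },
    { name: "node3:27017", stateStr: "STARTUP2" }, // still performing initial sync
  ],
};

console.log(summarizeMemberStates(sampleStatus));
// e.g. [ 'node1:27017: PRIMARY', 'node2:27017: SECONDARY', 'node3:27017: STARTUP2' ]
```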
Here, we see there was an error fetching the oplog. syncSource indicates which of the other nodes in the replica set is being used as the source for this sync attempt. And operationsRetried reports the total number of operation retry attempts; here, we see 120 retries. Given the short duration of this sync attempt, that indicates a significant problem.

So we have a problem. Let's discuss potential causes of slowed or stalled initial syncs. First, network latency between the syncing node and the sync source will slow the initial sync, and if the connection is unstable enough, it could cause the sync to fail. Hardware bottlenecks could also be responsible: insufficient memory or CPU resources on the secondary node could lead to excessive page faults. Next, look at the sync source's workload. A heavy load, perhaps from read preferences directing queries to secondaries, or from many operations syncing from the primary, can slow down the initial sync. Finally, if the mongod process on the syncing node is stopped or restarted during the initial sync, the sync will fail. If this occurs on the source node during the clone or oplog-fetch stage of the sync, the sync can also stall or fail.

If you suspect that your initial sync has stalled completely and isn't just taking a while to complete, search your logs for error messages like "error fetching oplog during initial sync," "OplogStartMissing," or "not found in oplog."

Of course, once you've identified an issue with the initial sync, you'll want to take steps to mitigate the impact. Here are two things you can do. To speed up the initial sync, you can seed it using a snapshot of the data from the source node. This will accelerate data transfer and reduce the amount of data traveling over the network.
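The warning signs described above can be combined into a simple heuristic over the initialSyncAttempts fields. This is a sketch under stated assumptions: the retry and duration thresholds, and the sample values, are illustrative choices, not MongoDB defaults.

```javascript
// Sketch: flag a suspicious initial sync attempt using the
// initialSyncAttempts fields surfaced by rs.status() on the syncing node.
function looksStalled(attempt) {
  // Status strings like these appear when the attempt lost its oplog position.
  const failedFetch =
    /error fetching oplog|OplogStartMissing|not found/i.test(attempt.status);
  // Many retries packed into a short attempt suggests a real problem,
  // not just a slow transfer. Thresholds here are illustrative.
  const heavyRetries =
    attempt.operationsRetried > 100 && attempt.durationMillis < 60_000;
  return failedFetch || heavyRetries;
}

// Illustrative sample matching the scenario in the video.
const attempt = {
  durationMillis: 58_000,    // just under a minute
  status: "OplogStartMissing: error fetching oplog during initial sync",
  syncSource: "node1:27017", // node used as the source for this attempt
  operationsRetried: 120,    // total operation retry attempts
};

console.log(looksStalled(attempt)); // true for this sample
```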
And to ensure the secondary has enough time to sync with its source, you can increase the size of your oplog so that logged write operations are preserved for the duration of the initial sync.

Now that you've mitigated the issue, you'll want to take steps to ensure it doesn't happen again. Let's remediate our cluster. If you suspect hardware bottlenecks are the cause, take advantage of the downtime to scale the node's resources. Consider scaling horizontally by sharding your cluster to reduce the data size of initial syncs in the future. You can always stop and retry the initial sync, but this is only advisable if you can confirm it has stalled; check the logs to be sure it's no longer progressing. Finally, if you still need assistance, reach out to our support team.

Now that you know how to handle issues with an initial sync, let's talk about frequent or recurring failovers. With three members in your replica set, you should be able to tolerate a failover and still have a functioning primary and secondary. Occasional failovers are not cause for concern, but if they occur regularly, frequent failovers indicate a larger issue that should be addressed. To identify recurring failovers, regularly monitor the health of your cluster via rs.status(), and use monitoring and metrics from Ops Manager, Cloud Manager, or MongoDB Atlas to get insight into your cluster's health. Here, we'll create an alert to let us know whenever a failover occurs and an election is triggered. We'll select "replica set" as our category type and "replica set elected new primary" as our trigger. If I start to receive these alerts on a regular basis, I'll know there's an underlying issue that needs to be addressed.

To remediate your cluster and resolve recurring failovers, you'll want to investigate your network configuration and the latency between each node. Take the time to address any hardware issues that might be causing failovers, and consider scaling to accommodate an increased workload.
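One way to turn those election alerts into a concrete "is this frequent?" check is to count election events within a recent window. This is a minimal sketch: the event timestamps would come from your monitoring alerts, and the 24-hour window and threshold of three elections are illustrative assumptions, not recommended values.

```javascript
// Sketch: decide whether failovers are "frequent" by counting election
// events inside a trailing window, measured from the latest event.
function frequentFailovers(electionTimesMs, windowMs, threshold) {
  const cutoff = Math.max(...electionTimesMs) - windowMs;
  const recent = electionTimesMs.filter((t) => t >= cutoff).length;
  return recent >= threshold;
}

const hour = 3_600_000;
const now = Date.now();
// Three elections inside the last 24 hours: worth investigating.
const elections = [now - 20 * hour, now - 5 * hour, now - 1 * hour];
console.log(frequentFailovers(elections, 24 * hour, 3)); // true
```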
If the bottleneck is occurring on a single node, scaling that node vertically might be the solution. If the workload is simply too much for your cluster, scaling horizontally by sharding will help distribute the workload more evenly. Finally, once you've resolved the underlying issue or issues, be sure to continue monitoring your cluster's health and performance regularly via the tools and metrics we discussed earlier. Remember, it's important to know your baseline so you can easily notice when your cluster is not performing as expected.

Let's quickly recap what we've learned. Use rs.status() and monitoring metrics to identify both initial sync issues and recurring failovers. Additionally, search your logs for messages containing "oplog" or "initial sync" to identify potential initial sync issues. Mitigate downtime from initial sync issues by seeding the sync from a recent snapshot. Finally, to remediate your cluster, consider scaling it to accommodate increased workload and addressing any network issues. Outstanding! Now you know how to handle issues with initial syncs and frequent failovers.