Cluster Reliability / Troubleshooting Replication Issues
Now that we know how to identify replication lag and its causes, let's learn how to overcome it to ensure optimal performance.
We'll begin by exploring ways you can reduce replication lag depending on the cause of the lag, whether it's a network or a hardware bottleneck.
We'll also show you what to do if replication lag causes a node to exceed the oplog window. Let's start with replication lag caused by network latency.
Whether you're hosting your replica set on your own hardware or using a cloud provider, one way to reduce latency is to host nodes in close geographical proximity. Be sure to consider your company's compliance requirements before doing so.
You could also employ dedicated network connections. This will improve throughput and make it easier to locate where a problem is occurring since there will be fewer hops between nodes. Keep in mind that troubleshooting your network could be fairly complex and it is your responsibility if you're hosting on your own hardware.
If you're making use of MongoDB Atlas, network connectivity will automatically be managed for you.
Okay. We've discussed some actions you can take to reduce the likelihood of network bottlenecks.
Now let's discuss how to remediate replication lag caused by a hardware bottleneck, which can manifest as slow storage or overtaxed compute resources.
If replication lag is caused by a hardware bottleneck, vertically scaling your replica set member nodes will help. Depending on the bottleneck, you can add more CPU, memory, and disk resources.
Or if scaling vertically is not an option, scale horizontally by sharding your clusters.
You can learn how to do so in our sharding skill and documentation.
Replication lag can cause one or more secondary nodes to fall so far behind the primary that they exceed the Oplog window. This means they can no longer catch up with the primary and you'll need to perform an initial sync on the node or nodes in question.
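Before resyncing, it helps to confirm how far behind a member actually is. As a hedged sketch (the connection string and port are illustrative assumptions, not from the source), you can compare each secondary's lag against the oplog window using mongosh's built-in replication helpers:

```shell
# Illustrative only; hostname and port are assumptions, not from the source.

# Show how far each secondary is behind the primary
mongosh "mongodb://localhost:27017" --eval 'rs.printSecondaryReplicationInfo()'

# Show the oplog window: the time span between the first and last oplog entries.
# If a secondary's lag exceeds this window, it can no longer catch up normally.
mongosh "mongodb://localhost:27017" --eval 'rs.printReplicationInfo()'
```

If the reported lag is larger than the oplog window, the member is stale and will need an initial sync.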
To perform an initial sync, stop the MongoDB process on the replica set member or members that require a new initial sync by running db.shutdownServer(). At this point, you can optionally back up the entire dbPath directory and its subdirectories, but you should at least back up the diagnostic.data directory for use in troubleshooting.
If you contact MongoDB Support, this will come in handy and could save a lot of time. Next, delete the data files and subdirectories from the member's dbPath directory. At this point, you can save time by restoring the data from a snapshot of one of your other nodes.
Depending on the size and complexity of your dataset, this could reduce the sync time from days down to minutes. The snapshot must be recent enough to fall within the Oplog window or the sync will fail.
You can't use a mongodump backup for this; it must be a snapshot backup or a direct copy of the data directory. Finally, restart the MongoDB process via the mongod command. Upon starting up with an empty dbPath directory, the mongod process will automatically perform an initial sync.
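The steps above can be sketched as follows. This is a minimal outline, assuming a default-style dbPath of /data/db and a config file at /etc/mongod.conf; adjust both to match your deployment:

```shell
# Illustrative sketch; paths, port, and config file location are assumptions.

# 1. Cleanly shut down the stale member (connect to that member directly)
mongosh --port 27017 --eval 'db.getSiblingDB("admin").shutdownServer()'

# 2. Back up at least the diagnostic.data directory before deleting anything
cp -r /data/db/diagnostic.data /backup/diagnostic.data

# 3. Remove the data files and subdirectories from the member's dbPath
rm -rf /data/db/*

# 4. Restart the mongod process; with an empty dbPath it will
#    automatically perform an initial sync from another member
mongod --config /etc/mongod.conf
```

If you restore a recent snapshot into the dbPath between steps 3 and 4 instead of leaving it empty, the member resumes normal replication from the snapshot's last oplog entry rather than performing a full initial sync.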
If you opted to restore a snapshot within the Oplog window timeframe, normal replication will resume from the last entry in the Oplog. We'll discuss potential issues with initial syncs and how to check on their progress later in this skill. But for now, please keep in mind that initial syncs can take a long time to complete, depending on the network latency and the size of the data set, as well as the write activity on the primary.
The larger the data set, the longer it will take. In addition to the size of the data set, the number of collections and the number of indexes in each collection can also impact the speed of an initial sync, as additional compute resources and disk space must be used for more complex datasets. Finally, if you want to avoid an initial sync for a node that's falling behind, you could increase the oplog size by using the replSetResizeOplog command.
When running this admin command, specify the size of the oplog in megabytes, formatted as a double-precision floating-point value.
You'll need to set your oplog to a value greater than 990 megabytes, but since we're expanding our oplog, this won't be an issue for our example.
Finally, the command must be run on each node. Make sure that you resize the oplog on your secondary nodes first and then on your primary. Please note that while increasing the oplog size could buy you some time and reduce the chances of exceeding the oplog window, it will not resolve the underlying issue causing the replication lag. Also, resizing the oplog should be done with care: bear in mind the impact on your disk space, and note that the resize operation holds a lock on the oplog while it runs.
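As a sketch of the resize step (the target size of 2000 MB and the connection details are example values, not from the source), run the admin command on each member in turn, secondaries first:

```shell
# Illustrative only; size and port are example values.
# replSetResizeOplog takes the new size in megabytes as a double.
mongosh --port 27017 --eval \
  'db.adminCommand({ replSetResizeOplog: 1, size: 2000.0 })'

# Confirm the new oplog size and window afterwards
mongosh --port 27017 --eval 'rs.printReplicationInfo()'
```

Repeat on every secondary, then run the same command against the primary last.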
Finally, if you need assistance with any of these steps or find that further action is required, our support team is always happy to assist you. Should you decide to contact support, please provide the following: the name and configuration information of the affected cluster, such as the size of the cluster, the MongoDB version, whether it's a replica set or a sharded cluster, and any other information that might be relevant; the date and time of the incident, as precisely as you can pinpoint it; a copy of the mongod.log file from each of the affected nodes; and an archived copy of the diagnostic.data subdirectory of the dbPath for every node in the affected cluster, or for the primary of every shard in a sharded cluster. Okay, let's recap what we've learned. We've discussed actions to mitigate replication lag by addressing network and hardware bottlenecks.
We also discussed how to conduct a new initial sync when a secondary exceeds the oplog window. Remember, to remediate these issues, there are many options at your disposal. You can host nodes in close proximity and establish dedicated network connections. You can also consider scaling vertically or horizontally to reduce the workload on any given node. If the oplog window is exceeded, you may perform an initial sync to bring a stale replica set member back up to active status.
In cases where the replication lag cause is temporary, you can increase the size of the Oplog window to allow more room for your replica to catch up. Great work. Now you can not only identify replication lag and its potential causes, but you can take steps to remediate your cluster and ensure your database performs as expected. See you in the next video.
