Cluster Reliability / Troubleshooting Replication Issues

In our troubleshooting methodology video, we discussed the following scenario: during a major sale that generated more traffic than anticipated, you received complaints from customers who were seeing a significant delay between placing orders and receiving order confirmations. This is just one example of how replication lag can impact user experience. In this lesson, we'll discuss how to identify replication lag and the various issues that can cause it. Let's start with a brief review of how replication works in MongoDB. MongoDB replica sets are designed to ensure high availability and data redundancy. A replica set consists of multiple nodes, each with specific roles. One primary node receives and executes all write operations, such as inserts, updates, and deletes. In most cases, your primary node will also receive read operations. The primary node records write operations in a collection called the oplog. This data is then sent to the secondary nodes. Secondary nodes apply the write operations listed in the oplog to their own copy of the database, ensuring the data remains consistent across the replica set. MongoDB replica set members send heartbeats to each other at regular two-second intervals to ensure they're able to communicate. If a particular node is unreachable via this heartbeat signal after a set period of time, it's marked as inaccessible by the other nodes. By default, this period of time is set to ten seconds. When this happens to a primary node, an election is called to select the next primary. Replication lag occurs when one or more of the secondary nodes falls behind the primary in replicating the write operations stored in the oplog. A little replication lag isn't necessarily an issue, but if it accumulates, it can impact your application's response times or even cause timeouts. It can also potentially cause stale data reads, depending on your read and write concerns. 
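The idea above can be sketched in a few lines of JavaScript. This is an illustration only, not driver or mongosh code: replication lag is simply the difference between the timestamp of the last write applied on the primary and the last write applied on a secondary. The optime values below are hypothetical.

```javascript
// Illustrative sketch: replication lag as the gap between the
// primary's last applied write and a secondary's last applied write.
// Both dates below are hypothetical sample values.
function replicationLagSeconds(primaryOptimeDate, secondaryOptimeDate) {
  // Subtracting two Dates in JavaScript yields milliseconds.
  return (primaryOptimeDate - secondaryOptimeDate) / 1000;
}

const primaryOptime = new Date("2024-05-01T12:00:10Z");   // hypothetical
const secondaryOptime = new Date("2024-05-01T12:00:08Z"); // hypothetical

console.log(replicationLagSeconds(primaryOptime, secondaryOptime)); // 2
```

A lag of a second or two is normal; the concern discussed in this lesson is lag that keeps growing.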
Additionally, if replication lag accumulates on a secondary node, it could be in danger of exceeding the oplog window, in which case you'll need to perform an initial sync. The oplog window represents the amount of time the oplog can retain changes before overwriting the oldest entries; it tracks the time span between the newest and oldest entries in the oplog. It should be noted that this time window is an estimate based on the current size of the oplog, the maximum oplog size setting, and the average workload. If a secondary falls too far behind and its last replicated operation is older than the oldest entry in the oplog, it must undergo an initial sync, which can be time consuming and can impact performance. By default, the oplog size is limited to the smaller of five percent of free disk space or fifty gigabytes. MongoDB automatically truncates the oldest entries to maintain this limit. Administrators can increase the oplog size using the replSetResizeOplog command. Increasing the oplog size can help you avoid the need to perform initial syncs on secondaries that are experiencing replication lag, but if the lag is significant enough to require this, there's likely an underlying issue that still needs to be addressed. As the oplog is a capped collection, increasing the oplog size will immediately allocate the designated space on your database storage, so be aware of your storage needs. Additionally, you can set a minimum retention period using storage.oplogMinRetentionHours in the MongoDB configuration file. This ensures entries are only truncated if they exceed the set size and are older than the specified number of hours, which helps prevent costly initial syncs by temporarily allowing the oplog to exceed its maximum size. Please be aware of the additional space that may be required when enabling the minimum oplog retention period. 
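The oplog window calculation can be sketched as follows. This is a simplified illustration with hypothetical dates; in mongosh, the equivalent first/last entry timestamps are reported by helpers such as db.getReplicationInfo(), and the resize command mentioned above is shown in a comment with example values.

```javascript
// Illustrative sketch: the oplog window is the time span between the
// oldest and newest entries currently in the oplog. Sample dates are
// hypothetical.
function oplogWindowHours(oldestEntryDate, newestEntryDate) {
  return (newestEntryDate - oldestEntryDate) / (1000 * 60 * 60);
}

const oldest = new Date("2024-05-01T00:00:00Z"); // hypothetical oldest entry
const newest = new Date("2024-05-06T12:00:00Z"); // hypothetical newest entry
console.log(oplogWindowHours(oldest, newest)); // 132 (5.5 days)

// To grow the oplog and set a minimum retention period, an administrator
// would run something like this in mongosh against the admin database
// (size is in megabytes; both values here are examples, not recommendations):
//   db.adminCommand({ replSetResizeOplog: 1, size: 16000, minRetentionHours: 24 })
```

If the largest lag among your secondaries approaches this window, an initial sync is imminent unless you intervene.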
Now we understand what replication lag is and how it can impact your user experience, including potential data loss and time-intensive initial syncs. We've also discussed preventing initial syncs by changing the oplog size and setting minimum retention hours. With that in mind, what are some potential causes of replication lag? Here are a few examples. Network connectivity between your nodes may be intermittent or impeded, causing failed heartbeats, creating a bottleneck for incoming oplog write operations, or triggering frequent failovers and elections, which could lead to further replication lag. Additionally, heavy workloads can also cause replication lag, especially if the hardware hosting your nodes is taxed by the workload. Being familiar with potential causes of replication lag will help you know where to look should you experience them. This is also another example of how regularly monitoring your cluster, and having a good idea of your baseline metrics when your cluster is functioning as expected, can help you notice when something has changed. Now that you understand what replication lag is, its potential consequences, and some common causes, how do you confirm it's actually happening and find out exactly what is causing it? Let's open Ops Manager and click on our replica set, then click on the Metrics tab. We see the metrics for our primary node and both of our secondary nodes. We can look at replication lag, replication oplog window, and other relevant metrics to get a good picture of how our cluster is performing. Here, we can see my test cluster is experiencing little to no replication lag, and my oplog window is over five days. This is ideal. In addition to the replication lag and oplog window metrics we just looked at, you'll want to pay special attention to the oplog gigabytes per hour, network, and replication headroom metrics. 
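The replication headroom metric mentioned above can be understood, roughly, as the oplog window minus a secondary's current lag: how much time the secondary has left before it falls off the oplog and requires an initial sync. The sketch below illustrates that relationship; the numbers are hypothetical, and the real metric as surfaced by Ops Manager is computed from live server data.

```javascript
// Illustrative sketch: replication headroom as the difference between
// the primary's oplog window and a secondary's replication lag.
// All values below are hypothetical sample inputs, in hours.
function replicationHeadroomHours(oplogWindowHrs, secondaryLagHrs) {
  return oplogWindowHrs - secondaryLagHrs;
}

console.log(replicationHeadroomHours(120, 2));     // 118: plenty of headroom
console.log(replicationHeadroomHours(120, 119.5)); // 0.5: initial sync risk
```

Watching headroom trend toward zero gives you advance warning, which is exactly why baseline monitoring matters.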
It's also a good idea to track your disk IOPS, disk latency, and disk queue depth, as higher-than-baseline values for these metrics could indicate hardware bottlenecks. Now let's look at some useful methods available to you in the MongoDB Shell to help you identify replication lag. Let's run rs.printSecondaryReplicationInfo() to get a report of the replica set's status from the perspective of the secondary nodes. We see that our server named m1 is only two seconds behind the primary. However, our server named m2 is two hours and ten seconds behind the primary. This may indicate a problem, and it's worth investigating. Let's dive deeper into the data. Let's look at the overall health of the replica set with rs.status(). rs.status() shows the results for the entire replica set, but we'll look at just one node here. First, we can see the ID, name, and the health of the node. A health value of 1 indicates the node is healthy and responding to heartbeats from other members of the replica set. Next, we can see that the state is listed as 2, indicating this is a secondary node. This is echoed in the stateStr field. Not far below is information about the timestamp of the latest operation applied to the member, as well as when the last heartbeat was received. By comparing the timestamp of the latest operation and the last heartbeat received on the primary to the same information on the secondary, we can see that m1, our first secondary, has no recorded latency for its ping response. Interestingly, m2, our lagging secondary node, also shows no latency for its ping response, and its last heartbeat received was more or less in time with m1's. However, the timestamp of its latest applied operation is over two hours behind m1's. This indicates that the culprit here may be a hardware bottleneck rather than network connectivity issues. 
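The comparison we just walked through can be expressed as a small script. This is a sketch, not mongosh output: the member documents below are hypothetical, modeled loosely on the shape of rs.status().members and on the m1/m2 example in this lesson, and the function computes each secondary's lag against the primary's optimeDate.

```javascript
// Illustrative sketch: hypothetical member documents shaped like the
// members array in rs.status() output, with only the fields we need.
const members = [
  { name: "m0:27017", stateStr: "PRIMARY",   health: 1, optimeDate: new Date("2024-05-01T12:00:00Z") },
  { name: "m1:27017", stateStr: "SECONDARY", health: 1, optimeDate: new Date("2024-05-01T11:59:58Z") },
  { name: "m2:27017", stateStr: "SECONDARY", health: 1, optimeDate: new Date("2024-05-01T09:59:50Z") },
];

// Compute each secondary's lag behind the primary, in seconds.
function secondaryLags(memberDocs) {
  const primary = memberDocs.find((m) => m.stateStr === "PRIMARY");
  return memberDocs
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => ({
      name: m.name,
      lagSeconds: (primary.optimeDate - m.optimeDate) / 1000,
    }));
}

console.log(secondaryLags(members));
// m1 lags 2 seconds; m2 lags 7210 seconds (two hours and ten seconds)
```

A healthy heartbeat alongside a large optimeDate gap is the same signal we read off rs.status() above: the node can talk to the primary but can't keep up applying writes.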
You can find additional information on rs.printSecondaryReplicationInfo(), rs.status(), and other command-line tools in our online documentation. Additionally, if you need any further clarification, you can look at your log files. Within the context of identifying the cause of replication lag, we'll likely want to focus on the mongod.log file on the secondary node that is experiencing the lag, and possibly the mongod.log file on the primary node. What you'll want to search for will vary depending on your situation, but given the common causes of replication lag we've mentioned in this video, here are some helpful words and phrases to search for in the logs. If you suspect the replication lag is due to performance bottlenecks, you could try searching for "slow query" or "page fault". For network issues, you could search for "connection refused" or "timeout". Other helpful search terms might include "server heartbeat failed", "can't see majority", or "error running oplog fetcher". We've covered a lot of ground, so let's quickly recap what we've learned. Use tools like rs.status(), rs.printSecondaryReplicationInfo(), and Ops Manager or Cloud Manager to obtain useful information for assessing how your replica set is performing. Establish a baseline for your cluster's metrics so you can identify replication lag by comparing past performance to present performance. Search log files for relevant terms to positively identify an issue that may be contributing to replication lag. Great work. Now you should be able to quickly identify replication lag when it occurs and pinpoint what's causing it or contributing to it. Next, we'll look at how we can mitigate the effects of replication lag and remediate your cluster so it doesn't continue to be an issue.
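A simple keyword filter over log lines illustrates the kind of search described above. This is a sketch with hypothetical log excerpts; real mongod log entries are structured JSON and will look different, but a case-insensitive substring match on the suggested phrases works the same way in practice with grep or a log viewer.

```javascript
// Illustrative sketch: filter log lines for phrases commonly associated
// with replication lag. Search terms come from the lesson; the sample
// log lines below are hypothetical, not real mongod output.
const searchTerms = [
  "slow query",
  "page fault",
  "connection refused",
  "timeout",
  "heartbeat failed",
  "can't see majority",
  "oplog fetcher",
];

function matchingLines(logLines, terms) {
  return logLines.filter((line) =>
    terms.some((term) => line.toLowerCase().includes(term.toLowerCase()))
  );
}

const sampleLines = [ // hypothetical log excerpts
  "2024-05-01T12:00:01Z I COMMAND Slow query: find orders ...",
  "2024-05-01T12:00:02Z I NETWORK connection accepted",
  "2024-05-01T12:00:03Z W REPL Heartbeat failed for member m2:27017",
];

console.log(matchingLines(sampleLines, searchTerms).length); // 2
```

Here the slow-query line points at a performance bottleneck and the failed heartbeat at a connectivity problem, mirroring the two broad cause categories covered in this lesson.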