Code Recap: Mitigating and Recovering from Replication Lag Issues

Performing an initial sync on a stale node

The steps below perform an initial sync on a node that has fallen outside the oplog window, or that otherwise needs to be resynchronized with the other members of the replica set.
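Before resyncing, it can help to confirm that a member really has fallen outside the oplog window. The following is a minimal sketch, not an official procedure: memberLag and exceedsOplogWindow are hypothetical helper names, and the input objects are assumed to have the shape returned by rs.status() (members[].name, stateStr, optimeDate) and db.getReplicationInfo() (timeDiff, in seconds) in mongosh. Comparing lag against timeDiff is an approximation of the real condition, which is that the member's last applied operation is older than the oldest entry in its sync source's oplog.

```javascript
// Hypothetical helpers. Input shapes are assumed to mirror rs.status()
// and db.getReplicationInfo() output in mongosh.

// Returns [{ name, stateStr, lagSeconds }] for every non-primary member that
// reports an optimeDate, measured against the primary's optimeDate.
function memberLag(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    throw new Error("no PRIMARY found in replica set status");
  }
  return status.members
    .filter((m) => m !== primary && m.optimeDate instanceof Date)
    .map((m) => ({
      name: m.name,
      stateStr: m.stateStr,
      // Subtracting two Date objects yields milliseconds.
      lagSeconds: (primary.optimeDate - m.optimeDate) / 1000,
    }));
}

// Approximation: true when the lag is longer than the time span
// currently covered by the oplog (replInfo.timeDiff, in seconds).
function exceedsOplogWindow(lagSeconds, replInfo) {
  return lagSeconds > replInfo.timeDiff;
}

// In mongosh, connected to the primary (not executed here):
//   const info = db.getReplicationInfo();
//   memberLag(rs.status()).forEach((m) =>
//     print(m.name, m.stateStr, m.lagSeconds, exceedsOplogWindow(m.lagSeconds, info)));
```

A member whose lag exceeds the window cannot catch up from the oplog alone, which is the case the initial sync below addresses.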

db.shutdownServer() method

Use the db.shutdownServer() method to stop the mongod process on the node in question. The method must be run against the admin database:

db.getSiblingDB("admin").shutdownServer()

Back up diagnostic data

Back up the diagnostic.data subdirectory of your dbPath directory (optionally, you can back up the entire dbPath directory instead):

/var/lib/mongo/diagnostic.data

Delete data in dbPath and all subdirectories

Delete the entire contents of the dbPath directory, including all subdirectories, but keep the directory itself so that mongod can start against it:

rm -r /var/lib/mongo/*

Restart the mongod process

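To begin the initial sync, restart the mongod process with the same options or configuration file it was running with before, so that it uses the same dbPath and replica set name: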

mongod
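Once the process is back up, the node's member state shows how the sync is going: a member performing an initial sync reports STARTUP2 and moves to SECONDARY once it has caught up. The sketch below classifies a member from an rs.status()-shaped document. describeSyncState is a hypothetical helper name and the host name in the usage comment is a placeholder.

```javascript
// Hypothetical helper: classify one member's initial-sync progress from a
// document shaped like rs.status() output (members[].name, members[].stateStr).
function describeSyncState(status, memberName) {
  const member = status.members.find((m) => m.name === memberName);
  if (!member) {
    return "not found";
  }
  switch (member.stateStr) {
    case "STARTUP2":
      return "initial sync in progress";
    case "SECONDARY":
      return "synced";
    default:
      // Any other state (for example RECOVERING) is reported as-is.
      return "other: " + member.stateStr;
  }
}

// In mongosh, connected to any member (not executed here; the host is a placeholder):
//   print(describeSyncState(rs.status(), "node3.example.net:27017"));
```

Polling this periodically is usually enough to tell when the node is ready to serve reads again. The mongod log on the syncing node carries more detailed progress messages.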

Resize the oplog with replSetResizeOplog

replSetResizeOplog Admin Command

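Use the replSetResizeOplog admin command to change the size of the oplog, or its minimum retention period, dynamically and without restarting the mongod process. The size value is given in megabytes and minRetentionHours in hours (fractional values are allowed). The command affects only the member it is run on.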

db.adminCommand(
  {
    replSetResizeOplog: <int>,
    size: <double>,
    minRetentionHours: <double>
  }
)
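As a concrete illustration of the template above, the sketch below builds a command document with real values. buildResizeOplogCommand is a hypothetical helper, and the limits it checks (a 990 MB minimum size and a non-negative retention period) follow the MongoDB documentation as understood here, so verify them against your server version. The values 16000 and 24 are arbitrary examples.

```javascript
// Hypothetical helper: builds the document passed to db.adminCommand()
// for replSetResizeOplog, with basic sanity checks on the arguments.
function buildResizeOplogCommand(sizeMB, minRetentionHours) {
  // Assumption: the server rejects oplog sizes below 990 megabytes.
  if (!Number.isFinite(sizeMB) || sizeMB < 990) {
    throw new RangeError("size is in megabytes and must be at least 990");
  }
  const command = { replSetResizeOplog: 1, size: sizeMB };
  // minRetentionHours is optional; include it only when supplied.
  if (minRetentionHours !== undefined) {
    if (!Number.isFinite(minRetentionHours) || minRetentionHours < 0) {
      throw new RangeError("minRetentionHours must be a non-negative number");
    }
    command.minRetentionHours = minRetentionHours;
  }
  return command;
}

// Example: a 16000 MB oplog that keeps at least 24 hours of entries.
const resizeCommand = buildResizeOplogCommand(16000, 24);
console.log(JSON.stringify(resizeCommand));

// In mongosh on the member being resized (not executed here):
//   db.adminCommand(resizeCommand)
// Depending on the shell, size may need to be wrapped as Double(16000).
```

Validating the arguments before sending the command avoids a round trip that the server would reject anyway.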

Information to provide when contacting Support

When contacting Support, please provide:

  • The name and configuration (replica set, sharded cluster, etc.) of the affected cluster
  • The date and time of the incident
  • The mongod.log files from the affected nodes
  • The contents of the diagnostic.data subdirectory of the dbPath