Code Recap: Troubleshooting Initial Sync and Failover Issues

Checking replica set health and initial sync status with rs.status()

rs.status() Method

Use the rs.status() method to return the replica set status from the point of view of the member where the method is run. The output includes the state of each member. During and after an initial sync, a node moves through the following states:

  • STARTUP2 indicates that the node is undergoing the initial sync
  • RECOVERING indicates that the node has finished the sync and is running checks prior to re-joining the replica set
  • SECONDARY indicates that the node has successfully re-joined the replica set as a secondary node

rs.status()

Example Output:

{
      _id: 2,
      name: 'atlas-zr1do5-shard-00-02.xwgj1.mongodb.net:27017',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 333139,
      optime: { ts: Timestamp({ t: 1747042492, i: 1 }), t: Long('16') },
      optimeDate: ISODate('2025-05-12T09:34:52.000Z'),
      optimeWritten: { ts: Timestamp({ t: 1747042492, i: 1 }), t: Long('16') },
      optimeWrittenDate: ISODate('2025-05-12T09:34:52.000Z'),
      lastAppliedWallTime: ISODate('2025-05-12T09:34:52.385Z'),
      lastDurableWallTime: ISODate('2025-05-12T09:34:52.385Z'),
      lastWrittenWallTime: ISODate('2025-05-12T09:34:52.385Z'),
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      electionTime: Timestamp({ t: 1746709383, i: 1 }),
      electionDate: ISODate('2025-05-08T13:03:03.000Z'),
      configVersion: 1,
      configTerm: 16,
      self: true,
      lastHeartbeatMessage: ''
    }
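
As a minimal sketch, the member states listed earlier could be condensed into one line per member. The names STATE_NOTES and summarizeMembers are hypothetical and not part of mongosh. The helper assumes it receives a status document with a members array whose entries carry name, health, and stateStr, as in the example output.

```javascript
// Hypothetical helper (not part of mongosh): condenses a document shaped like
// the rs.status() result into one summary line per replica set member.
const STATE_NOTES = {
  STARTUP2: "initial sync in progress",
  RECOVERING: "sync finished, running checks before re-joining",
  SECONDARY: "re-joined the replica set as a secondary",
};

function summarizeMembers(status) {
  // Assumes status.members is an array; falls back to an empty list otherwise.
  return (status.members || []).map((m) => {
    // States not covered in this recap get a generic note.
    const note = STATE_NOTES[m.stateStr] || "state not covered in this recap";
    // health is 1 when the member is reachable and 0 when it is not.
    const health = m.health === 1 ? "up" : "down";
    return `${m.name} [${health}] ${m.stateStr}: ${note}`;
  });
}
```

In mongosh, this could be called as summarizeMembers(rs.status()).forEach((line) => print(line)).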

When rs.status() is run on a node that is undergoing an initial sync, the output includes an initialSyncStatus document. Its initialSyncAttempts array records each sync attempt and its status.

Example Output:

rs1 [direct: secondary] test> rs.status()
…
"initialSyncAttempts" : [
   {
      "durationMillis" : 59539,
      "status" : "InvalidOptions: error fetching oplog during initial sync :: caused by :: Error while getting the next batch in the oplog fetcher :: caused by :: readConcern afterClusterTime value must not be greater than the current clusterTime. Requested clusterTime: { ts: Timestamp(0, 1) }; current clusterTime: { ts: Timestamp(0, 0) }",
      "syncSource" : "m1.example.net:27017",
      "rollBackId" : 1,
      "operationsRetried" : 120,
      "totalTimeUnreachableMillis" : 52601
   }
]
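
The attempt fields shown above can be reduced to a short diagnosis, sketched below. The function name summarizeSyncAttempts is hypothetical. The sketch assumes the attempts sit under initialSyncStatus.initialSyncAttempts, that the numeric fields are plain JavaScript numbers, and that the status string begins with the error name followed by a colon, as in the example.

```javascript
// Hypothetical helper (not part of mongosh): summarizes each entry of
// initialSyncStatus.initialSyncAttempts from a document shaped like rs.status().
function summarizeSyncAttempts(status) {
  const syncStatus = status.initialSyncStatus || {};
  const attempts = syncStatus.initialSyncAttempts || [];
  return attempts.map((a, i) => {
    const duration = a.durationMillis || 0;
    const unreachable = a.totalTimeUnreachableMillis || 0;
    return {
      attempt: i + 1,
      syncSource: a.syncSource,
      seconds: duration / 1000,
      // Assumption: the status string starts with the error name,
      // for example "InvalidOptions: error fetching oplog ...".
      error: String(a.status).split(":")[0],
      operationsRetried: a.operationsRetried,
      // Share of the attempt during which the sync source was unreachable.
      unreachablePct: duration > 0 ? Math.round((100 * unreachable) / duration) : 0,
    };
  });
}
```

For the attempt shown above, this yields the error name InvalidOptions and an unreachable share of about 88 percent, which suggests looking at connectivity to the sync source m1.example.net:27017 before retrying.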