Jan 23, 2019

MongoDB Replica Set Scenarios and Internals – Part II (Elections)


In this blog post, we will walk through the internals of the election process in MongoDB®, following on from a previous post on the internals of the replica set. You can read Part 1 here.

For this post, I refer to the same configuration we discussed before.

Elections: As the term suggests, in MongoDB there is a freedom to “vote”: the individual nodes of a cluster can vote and elect the primary member for their replica set.

Why elections? This is the process through which MongoDB maintains high availability: if the current primary becomes unavailable, the remaining members elect a new one.

When do elections take place?

  1. When a node does not find a primary node within the election timeout limit. By default this value is 10 seconds, and from MongoDB version 3.2 it can be changed to suit your needs. The parameter that sets this value is settings.electionTimeoutMillis, and it can be seen in the logs as:

settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: 60000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('5ba8ed10d4fddccfedeb7492') } }

From the mongo shell, the value of electionTimeoutMillis can be found in the replica set configuration as:

rplint:SECONDARY> rs.conf()
{
	"_id" : "rplint",
	"version" : 3,
	"protocolVersion" : NumberLong(1),
	"members" : [
		{
			"_id" : 0,
			"host" : "m103:25001",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 1,
			"host" : "192.168.103.100:25002",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 2,
			"host" : "192.168.103.100:25003",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		}
	],
	"settings" : {
		"chainingAllowed" : true,
		"heartbeatIntervalMillis" : 2000,
		"heartbeatTimeoutSecs" : 10,
		"electionTimeoutMillis" : 10000,
		"catchUpTimeoutMillis" : 60000,
		"getLastErrorModes" : {
		},
		"getLastErrorDefaults" : {
			"w" : 1,
			"wtimeout" : 0
		},
		"replicaSetId" : ObjectId("5c20ff87272eff3a5e28573f")
	}
}

More precisely, the value of electionTimeoutMillis can be found at:

rplint:SECONDARY> rs.conf().settings.electionTimeoutMillis
10000
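
If the default does not suit your environment, the timeout can be adjusted with a reconfiguration. Here is a minimal sketch from the mongo shell, run against the primary; the value of 5000 ms is purely illustrative:

// Lower the election timeout for the whole replica set.
cfg = rs.conf()
cfg.settings.electionTimeoutMillis = 5000  // illustrative value; the default is 10000
rs.reconfig(cfg)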

2.  When another node takes over the priority of the existing primary node, for example during planned maintenance driven through the replica set configuration settings. The priority of a member node can be changed as explained here; a short sketch follows the output below.

The priority of all three members can be seen from the replica set configuration like this:

rplint:SECONDARY> rs.conf().members[0].priority
1
rplint:SECONDARY>
rplint:SECONDARY>
rplint:SECONDARY> rs.conf().members[2].priority
1
rplint:SECONDARY> rs.conf().members[1].priority
1
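
To hand the primary role to a specific member, raise its priority and reconfigure the set. A minimal sketch, run on the current primary; the member index and the priority value of 2 are only illustrative:

// Make member 1 the preferred primary by giving it a higher priority.
cfg = rs.conf()
cfg.members[1].priority = 2  // illustrative; all members currently have priority 1
rs.reconfig(cfg)             // a priority takeover election follows once member 1 is caught up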

How do elections work in a MongoDB replica set cluster?

Before the real election, the node runs a dry election. A dry election? Yes: the node first runs a dry run, and only if it wins the dry run does an actual election begin. Here’s how:

  1. The candidate node asks every other node whether it would vote for it, using the replSetRequestVotes command, without incrementing its own term.
  2. The primary node steps down if it finds a candidate whose term is higher than its own. Otherwise the dry election fails, and the replica set continues to run as it did before.
  3. If the dry election succeeds, then an actual election begins.
  4. For the real election, the candidate increments its term and then votes for itself.
  5. The VoteRequester sends the replSetRequestVotes command through the ScatterGatherRunner, and each node responds with its vote.
  6. The candidate that receives votes from a majority of the voting nodes wins the election.
  7. Once the candidate wins, it transitions to primary. Through heartbeats it sends a notification to all the other nodes.
  8. The new primary then checks whether it needs to catch up from the former primary node.
  9. A node that receives the replSetRequestVotes command checks its own term and then votes, but only after the ReplicationCoordinator receives confirmation from the TopologyCoordinator.
  10. The TopologyCoordinator grants the vote only after the following checks:
    1. The config version must match,
    2. The replica set name must match,
    3. An arbiter voter must not see any healthy primary of greater or equal priority.
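
You can observe the outcome of this process from the mongo shell. A minimal sketch, assuming the replica set runs election protocol version 1; the fields used are those reported by rs.status():

// Show the current election term of the replica set.
rs.status().term
// List the member(s) currently reporting themselves as PRIMARY.
rs.status().members.filter(function (m) { return m.stateStr === "PRIMARY"; })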

An example

A primary (port 25002) transitions to secondary after receiving the rs.stepDown() command:

2019-01-03T03:05:29.972+0000 I COMMAND  [conn124] Attempting to step down in response to replSetStepDown command
2019-01-03T03:05:29.976+0000 I REPL     [conn124] transition to SECONDARY
driver: { name: "NetworkInterfaceASIO-Replication", version: "3.4.15" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "14.04" } }
2019-01-03T03:05:40.874+0000 I REPL     [ReplicationExecutor] Member m103:25001 is now in state PRIMARY
2019-01-03T03:05:41.459+0000 I REPL     [rsBackgroundSync] sync source candidate: m103:25001
2019-01-03T03:05:41.459+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to m103:25001
2019-01-03T03:05:41.460+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to m103:25001, took 1ms (1 connections now open to m103:25001)
2019-01-03T03:05:41.461+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to m103:25001
2019-01-03T03:05:41.462+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to m103:25001, took 1ms (2 connections now open to m103:25001)

A dry election at the candidate node (port 25001) succeeds after no primary is found within the election timeout:

2019-01-03T03:05:31.498+0000 I REPL     [rsBackgroundSync] could not find member to sync from
2019-01-03T03:05:36.493+0000 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to 192.168.103.100:25002: InvalidSyncSource: Sync source was cleared. Was 192.168.103.100:25002
2019-01-03T03:05:39.390+0000 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2019-01-03T03:05:39.390+0000 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected. current term: 35
2019-01-03T03:05:39.391+0000 I REPL     [ReplicationExecutor] VoteRequester(term 35 dry run) received a yes vote from 192.168.103.100:25002; response message: { term: 35, voteGranted: true, reason: "", ok: 1.0 }

The dry election succeeds, so the node increments its term by 1 (here from 35 to 36) and runs the real election. After winning it, the node transitions to primary and enters catch-up mode.

2019-01-03T03:05:39.391+0000 I REPL [ReplicationExecutor] dry election run succeeded, running for election in term 36
2019-01-03T03:05:39.394+0000 I REPL [ReplicationExecutor] VoteRequester(term 36) received a yes vote from 192.168.103.100:25003; response message: { term: 36, voteGranted: true, reason: "", ok: 1.0 }
2019-01-03T03:05:39.395+0000 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 36
2019-01-03T03:05:39.395+0000 I REPL [ReplicationExecutor] transition to PRIMARY
2019-01-03T03:05:39.395+0000 I REPL [ReplicationExecutor] Entering primary catch-up mode.

Other nodes also receive information about the new primary.

2019-01-03T03:05:31.498+0000 I REPL [rsBackgroundSync] could not find member to sync from
2019-01-03T03:05:36.493+0000 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to 192.168.103.100:25002: InvalidSyncSource: Sync source was cleared. Was 192.168.103.100:25002
2019-01-03T03:05:41.499+0000 I REPL [ReplicationExecutor] Member m103:25001 is now in state PRIMARY

This is how MongoDB maintains high availability: when the existing primary node fails, a new primary is elected from the remaining members of the replica set.




May 25, 2016

MongoDB 3.2: elections just got better!


Introduction

In this blog, we’ll review MongoDB 3.2 elections and how they work, as well as what is really new and different in the election protocol.

MongoDB 3.2 revamped its election protocol for increased stability! Exciting times with smarter and faster elections are here. With this release, replication (and the election protocol in particular) has been improved. Some of the changes include:

  • The addition of electionTimeoutMillis
  • WriteConcern now implies j:true
    • The old j:true meant just the primary node
    • The new j:true means all involved nodes must ACK the journal
    • With j:true, the journal commit interval (in ms) is thirded, and synchronization occurs every 10 ms (MMAP) or 33 ms (WiredTiger) by default
  • The optime in rs.status() is now an object, not a timestamp
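
As an illustration of the journal acknowledgement change, here is a minimal sketch of a write that waits for journal acknowledgement; the collection name and document are only illustrative:

// Insert a document and wait until a majority of voting members have
// acknowledged the write and journaled it (w: "majority" with j: true).
db.example.insert(
    { x: 1 },
    { writeConcern: { w: "majority", j: true } }
)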

You’ll need to enable the new election protocol (protocolVersion: 1) when upgrading MongoDB from an earlier version, while new replica sets get it enabled by default.
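
A minimal sketch of that upgrade step, run on the primary of a replica set that is still on the old protocol after the binaries have been upgraded:

// Switch the replica set to the new election protocol.
cfg = rs.conf()
cfg.protocolVersion = 1
rs.reconfig(cfg)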

Election Protocol: what is an election?

Mongo uses a consensus protocol. This means that all nodes must agree on who is the most current when handling:

  • Hardware failure
  • Network split
  • Time shifts

The new protocol allows for faster elections, using a term (electionId) to separate voting rounds. This guarantees there aren’t double (and conflicting) votes, while also reducing the time you have to wait to know that a vote completed.

How does it do it?

Elections now have “term” or “vote” identifiers (IDs). Terms are used to separate voting rounds, and every vote attempt increments the ID. This prevents a node from double voting in the same term, and makes it easier for nodes to know whether a re-vote is needed, where before it could take up to 5 minutes!

The protocol timeouts have some new features and behaviors:

  • Now configurable
  • Randomness added to each node
  • Less chance of all nodes timing out at the same time

Normal election process

Below I’m going to walk you through a typical replica set operation. The configuration looks like the following:

[Diagram: MongoDB 3.2 elections]

In this topology:

  • There are three members
  • All of them are heartbeating to each other
  • There is no arbiter, so you get full high availability (HA)

The following diagram provides a more detailed picture of the interactions:

[Diagram: MongoDB 3.2 elections]

Notice how each secondary pulls replication from the primary: the secondary does all the work. A heartbeat is still shared by all the nodes.

Now let’s see what happens when our primary crashes. It just did!

[Diagram: MongoDB 3.2 elections]

Nodes will still try to heartbeat to it until two heartbeats have failed in a short period.

[Diagram: MongoDB 3.2 elections]

After the failure, things happen quickly.

  1. Secondaries give up on heartbeats
  2. They then vote with each other on who is newest in the oplog
  3. If they have > 50% of the total voting population, they select a new winner

A new Primary is selected, and the heartbeat system is cleaned up.

[Diagram: MongoDB 3.2 elections]

Replication now gets restarted. If the failed node comes back online, it is treated as a secondary once it “catches up” via the oplog.

Stepdown Election Process

The stepdown election process is the same as above, with the following caveats:

  • It’s MUCH faster, as the existing primary “starts” an election
  • There is less chance of the old primary not having data replicated
  • It kills writes while the election is taking place
  • The election process doesn’t wait for heartbeat timeouts

Generally speaking, you should always try to use the stepdown election process. Timeouts are for crashes and failures, not general use.
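
A minimal sketch of triggering a stepdown election from the mongo shell, run against the current primary; the 60-second value is only illustrative:

// Ask the current primary to step down and not seek re-election for
// 60 seconds; the remaining members hold an election right away.
rs.stepDown(60)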

 
