<div dir="ltr"><font face="verdana, sans-serif">Hi<br><br>I was working on the new feature to make backend node failover quorum aware, and halfway through the implementation I also added a majority consensus feature for it.</font><div><font face="verdana, sans-serif"><br>Please find attached the first version of the patch for review. It makes backend node failover consider the watchdog cluster quorum status and seek majority consensus before performing failover.<br></font></div><div><font face="verdana, sans-serif"><br></font></div><div><font face="verdana, sans-serif"><b><font size="4">Changes in the Failover mechanism with watchdog.</font></b><br>For this new feature I have modified Pgpool-II&#39;s existing failover mechanism with watchdog.<br>Previously, as you know, when Pgpool-II needed to perform a node operation (failover, failback, promote-node) with the watchdog, the watchdog propagated the failover request to all the Pgpool-II nodes in the watchdog cluster. As soon as a node received the request, it initiated the local failover, and that failover was synchronised on all nodes using distributed locks.<br><br><b>Now only the Master node performs the failover.</b><br></font><div><font face="verdana, sans-serif">The attached patch changes the mechanism of synchronised failover: now only the Pgpool-II of the master watchdog node performs the failover, and all other standby nodes sync the backend statuses after the master Pgpool-II has finished the failover. <br><br><b>Overview of new failover mechanism.</b><br>-- If a failover request is received by a standby watchdog node (from its local Pgpool-II), the request is forwarded to the master watchdog and the FAILOVER_RES_WILL_BE_DONE return code is returned to the Pgpool-II main process. 
Upon receiving FAILOVER_RES_WILL_BE_DONE from the watchdog for the failover request, the requesting Pgpool-II moves on without doing anything further for that particular failover command.<br><br>-- When the master watchdog receives a failover request from a standby node, it validates the request and applies the consensus rules, and then triggers the failover on the local Pgpool-II.<br><br>-- When the master watchdog node receives a failover request from its local Pgpool-II (on the IPC channel), the watchdog process informs the requesting Pgpool-II process to proceed with the failover (provided all failover rules are satisfied).<br><br>-- After the failover is finished on the master Pgpool-II, the failover function calls <i>wd_failover_end</i>(), which sends the backend sync required message to all standby watchdogs.<br><br>-- Upon receiving the sync required message from the master watchdog node, every Pgpool-II syncs the new status of each backend node from the master watchdog.<br><br><b>No More Failover locks</b><br>Since this new failover mechanism no longer needs synchronisation or guards against multiple Pgpool-II nodes executing the failover_command, the patch removes all the distributed locks from the failover function. This makes the failover simpler and faster.<br><br><b>New kind of Failover operation NODE_QUARANTINE_REQUEST</b><br>The patch adds a new kind of backend node operation, NODE_QUARANTINE, which is effectively the same as NODE_DOWN, except that with NODE_QUARANTINE the failover_command is not triggered.<br>A NODE_DOWN_REQUEST is automatically converted to a NODE_QUARANTINE_REQUEST when failover is requested on a backend node but the watchdog cluster does not hold the quorum.</font></div><div><font face="verdana, sans-serif">This means that in the absence of quorum the failed backend nodes are quarantined, and when the quorum becomes available again Pgpool-II performs the failback operation on all 
quarantined nodes.</font></div><div><font face="verdana, sans-serif">Likewise, when the failback is performed on a quarantined backend node, the failover function does not trigger the failback_command.<br><br><font size="4"><b>Controlling the Failover behaviour.</b><br></font>The patch adds three new configuration parameters that let the user configure the failover behaviour.<br><br><b>failover_when_quorum_exists</b><br>When enabled, the failover command is only executed when the watchdog cluster holds the quorum. When the quorum is absent and failover_when_quorum_exists is enabled, the failed backend nodes are quarantined until the quorum becomes available again.</font></div><div><font face="verdana, sans-serif">Disabling it restores the old behaviour of failover commands.<br><br><b>failover_require_consensus<br></b>This new configuration parameter can be used to make sure we get the majority vote before performing the failover on the node. When <i>failover_require_consensus</i> is enabled, the failover is only performed after receiving the failover request from the majority of Pgpool-II nodes.</font></div><div><font face="verdana, sans-serif">For example, in a three-node cluster the failover will not be performed until at least two nodes ask for failover on the particular backend node.<br><br>It is also worth mentioning that <i>failover_require_consensus</i> only works when failover_when_quorum_exists is enabled.<br><br><br><b>enable_multiple_failover_requests_from_node</b><br>This parameter works in connection with the <i>failover_require_consensus</i> config. 
When enabled, a single Pgpool-II node can vote for failover multiple times.</font></div><div><font face="verdana, sans-serif">For example, in a three-node cluster, if one Pgpool-II node sends the failover request for a particular backend node twice, that is counted as two votes in favour of failover, and the failover will be performed even if we do not get a vote from the other two nodes.<br><br></font></div><div><font face="verdana, sans-serif">When <i>enable_multiple_failover_requests_from_node</i> is disabled, only the first vote from each Pgpool-II is accepted, and all subsequent votes from that node are marked as duplicates and rejected.</font></div><div><font face="verdana, sans-serif">So in that case we require a majority of votes from distinct nodes to execute the failover.<br>Again, <i>enable_multiple_failover_requests_from_node</i> only becomes effective when both <i>failover_when_quorum_exists</i> and <i>failover_require_consensus</i> are enabled.<br><br><br><font size="4"><b>Controlling the failover: The Coding perspective.</b></font><br>Although the failover functions are now quorum and consensus aware, there is still a way to bypass the quorum conditions and the consensus requirement.<br><br></font></div><div><font face="verdana, sans-serif">For this the patch uses the existing request_details flags in POOL_REQUEST_NODE to control the behaviour of failover.<br><br></font></div><div><font face="verdana, sans-serif">Here are the newly added flag values.<br><br><b>REQ_DETAIL_WATCHDOG</b>:<br>Setting this flag while issuing the failover command will not send the failover request to the watchdog. This flag is unlikely to be useful anywhere other than where it is already used.</font></div><div><font face="verdana, sans-serif">It is mainly used to keep a failover command that originated from the watchdog from being sent back to the watchdog. 
Otherwise we can end up in an infinite loop.<br><br><b>REQ_DETAIL_CONFIRMED</b>:<br>Setting this flag bypasses the <i>failover_require_consensus</i> configuration and immediately performs the failover if quorum is present. This flag can be used to issue a failover request that originates from a PCP command.<br><br><b>REQ_DETAIL_UPDATE</b>:<br>This flag is used for the command that fails back the quarantined nodes. Setting this flag will not trigger the failback_command.<br><br><b>Some conditional flags used:</b></font></div><div><font face="verdana, sans-serif">I was not sure about the configuration of each type of failover operation. We have three main failover operations: NODE_UP_REQUEST, NODE_DOWN_REQUEST, and PROMOTE_NODE_REQUEST.</font></div><div><font face="verdana, sans-serif">So I was wondering whether we need to give users a configuration option to enable or disable quorum checking and consensus for each individual failover operation type.</font></div><div><font face="verdana, sans-serif">For example: is it a practical configuration for a user to want quorum enforced when performing a NODE_DOWN operation but not for NODE_UP?</font></div><div><font face="verdana, sans-serif">So in this patch I use three compile time defines to enable or disable the checks for each individual failover operation, until we can decide on the best solution.<br></font>
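<div><font face="verdana, sans-serif">To make the interaction between the request flags, the two quorum/consensus parameters and the per-operation switches easier to discuss, here is a rough Python sketch of the decision as I read the description above. It is only an illustration, not code from the patch: the <i>ReqDetail</i> enum, the <i>CONSENSUS_ENABLED_OPS</i> set (standing in for the compile time defines) and the <i>needs_consensus</i>() function are hypothetical names, and the flag values are not the real ones.</font></div>
<pre>
```python
from enum import Flag, auto


class ReqDetail(Flag):
    """Hypothetical stand-ins for the REQ_DETAIL_* request flags."""
    NONE = 0
    WATCHDOG = auto()
    CONFIRMED = auto()
    UPDATE = auto()


# Assumption: models the compile time defines as a set of operation names.
# An operation listed here behaves as if its *_REQUIRE_CONSENSUS define is set.
CONSENSUS_ENABLED_OPS = {"NODE_DOWN_REQUEST", "PROMOTE_NODE_REQUEST"}


def needs_consensus(op, flags, failover_when_quorum_exists,
                    failover_require_consensus):
    """Return True when the request should wait for a majority vote."""
    if op not in CONSENSUS_ENABLED_OPS:
        return False  # checking is compiled out for this operation type
    if not failover_when_quorum_exists:
        return False  # consensus only works when quorum checking is enabled
    if ReqDetail.CONFIRMED in flags:
        return False  # e.g. requests issued through a PCP command
    return failover_require_consensus
```
</pre>
<div><font face="verdana, sans-serif">With these assumptions a plain NODE_DOWN_REQUEST waits for consensus when both parameters are enabled, while the same request carrying the confirmed flag does not.</font></div>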


<p class="gmail-p1"><span class="gmail-s1">NODE_UP_REQUIRE_CONSENSUS: defining it enables the quorum checking feature for NODE_UP_REQUESTs </span></p>

<p class="gmail-p1"><span class="gmail-s1">NODE_DOWN_REQUIRE_CONSENSUS: </span>defining it enables the quorum checking feature for NODE_DOWN_REQUESTs</p>

<p class="gmail-p1"><span class="gmail-s1">NODE_PROMOTE_REQUIRE_CONSENSUS: </span>defining it enables the quorum checking feature for PROMOTE_NODE_REQUESTs</p></div><div><font face="verdana, sans-serif"><br></font></div><div><font face="verdana, sans-serif" size="4"><b>Some Points for Discussion:</b></font></div><div><font face="verdana, sans-serif"><br></font></div><font style="font-family:verdana,sans-serif"><b>Do we really need to check the Req_info-&gt;switching flag before enqueuing a failover request?</b><br></font><span style="font-family:verdana,sans-serif">While working on the patch I was wondering why we disallow enqueuing a failover command when a failover is already in progress. For example, in the <i>pcp_process_command</i>() function, if we see the <i>Req_info-&gt;switching</i> flag set we bail out with an error instead of enqueuing the command. Is it really necessary?</span><br style="font-family:verdana,sans-serif"><div><br></div><div><font face="verdana, sans-serif"><b>Do we need more granular control over each failover operation:</b></font></div><div><font face="verdana, sans-serif">As described in the section &quot;Some conditional flags used&quot;, I would like opinions on whether we need configuration parameters in pgpool.conf to enable or disable quorum and consensus checking for individual failover types.</font></div><div><font face="verdana, sans-serif"><br></font></div><div><font face="verdana, sans-serif"><b>Which failover should be marked as Confirmed:</b></font></div><div><span style="font-family:verdana,sans-serif">As described in the REQ_DETAIL_CONFIRMED section above, we can mark a failover request as not needing consensus. Currently the requests from PCP commands are fired with this flag. 
But I was </span><font face="verdana, sans-serif">wondering whether there are more places where we may need to use the flag.</font></div><div><font face="verdana, sans-serif">For example, I currently use the same confirmed flag when failover is triggered because of <i>replication_stop_on_mismatch</i>.</font></div><div><span style="font-family:verdana,sans-serif"><br></span></div><div><span style="font-family:verdana,sans-serif">I think we should consider this flag for each place that triggers a failover, </span><span style="font-family:verdana,sans-serif">such as when the failover is triggered</span></div><div><span style="font-family:verdana,sans-serif">because of health_check failure,</span></div><div><span style="font-family:verdana,sans-serif">because of replication mismatch,</span></div><div><span style="font-family:verdana,sans-serif">because of backend_error, </span></div><div><font face="verdana, sans-serif">etc.</font></div><div><font face="verdana, sans-serif"><br></font></div><div><font face="verdana, sans-serif"><b>Node Quarantine behaviour.</b></font></div><div><font face="verdana, sans-serif">What do you think about the node quarantine used by this patch? Can you think of any problem it could cause?</font></div><div><font face="verdana, sans-serif"><br></font></div><div><font face="verdana, sans-serif"><b>What should be the default value for each newly added config parameter?</b></font></div><div><br></div><div><font face="verdana, sans-serif"><br></font></div><div><font face="verdana, sans-serif"><br></font></div><div><font face="verdana, sans-serif"><b>TODOs</b><br></font><br></div><div>-- Updating the documentation is still todo. 
I will do that once every aspect of the feature is finalised.</div><div>-- Some code warnings and cleanups are still not done.</div><div>-- I am still a little short on testing.</div><div>-- Regression test cases for the feature still need to be added.</div><div><br></div><div><br></div><div>Thoughts and suggestions are most welcome.</div><div><br></div><div>Thanks</div><div>Best regards</div><div>Muhammad Usama</div><div><br></div><div><br></div></div></div>