<div dir="ltr">Hi Ishii-San,<div><br></div><div><div class="gmail_extra"><div class="gmail_quote">On Tue, Nov 28, 2017 at 5:55 AM, Tatsuo Ishii <span dir="ltr">&lt;<a href="mailto:ishii@sraoss.co.jp" target="_blank">ishii@sraoss.co.jp</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Usama,<br>

<br>

While writing presentation material for Pgpool-II 3.7, I am not sure<br>

I understand the quorum consensus behavior.<br>

<br>

&gt; *enable_multiple_failover_requests_from_node*<br>

&gt; This parameter works in connection with *failover_require_consensus*<br>

<span>&gt; config. When enabled a single Pgpool-II node can vote for failover multiple<br>

&gt; times.<br>

<br>

</span>In what situation could a Pgpool-II node send multiple failover<br>

requests? My guess is in the following scenario:<br>

<br>

1) Pgpool-II watchdog standby health check process detects the failure<br>

   of backend A and sends a failover request to the master Pgpool-II.<br>

<br>

2) Since the vote does not satisfy the quorum consensus, failover does<br>

   not occur. Only backend_info-&gt;quarantine is set and<br>

   backend_info-&gt;backend_status is set to CON_DOWN.<br>

<br>

3) Pgpool-II watchdog standby health check process detects the failure<br>

   of backend A again, then sends a failover request to the master<br>

   Pgpool-II again. If enable_multiple_failover_requests_from_node is<br>

   set, failover will happen.<br>

<br>

But after thinking more, I realized that in step 3, since<br>

backend_status is already set to CON_DOWN, health check will not be<br>

performed against backend A. So the watchdog standby will not send<br>

multiple votes.</blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Apparently I am missing something here.<br>

<br>

Can you please tell me in what scenario a watchdog sends<br>

multiple votes for failover?<br>

<span><br></span></blockquote><div><br></div>Basically, when <font face="monospace, monospace">allow_multiple_failover_requests_from_node</font> is set, the watchdog</div><div class="gmail_quote">does not perform the quarantine operation and the node status is not changed to DOWN.</div><div class="gmail_quote">So it is possible for the node to send multiple votes for node failover.</div><div class="gmail_quote">Also, even when <font face="monospace, monospace">allow_multiple_failover_requests_from_node</font> is not set,</div><div class="gmail_quote">Pgpool-II does not quarantine the node straight away after the first failover request while the watchdog</div><div class="gmail_quote">is waiting for consensus. What happens is that when the watchdog receives a failover request</div><div class="gmail_quote">that requires consensus, it returns <font face="monospace, monospace">FAILOVER_RES_CONSENSUS_MAY_FAIL</font>,</div><div class="gmail_quote">and when the main pgpool process receives this return code for the failover request from the watchdog,</div><div class="gmail_quote">it just ignores the request without changing the backend node status to down and relies on the watchdog</div><div class="gmail_quote">to handle that failover request. Meanwhile, pgpool continues with its normal duties.<div><br></div><div>Now, when the same pgpool sends a failover request for the same backend node a second time,</div><div>the behaviour depends on the setting of <font face="monospace, monospace">allow_multiple_failover_requests_from_node</font>.<br><br></div><div>1- When <font face="monospace, monospace">allow_multiple_failover_requests_from_node</font> = off<br>    the watchdog returns <font face="monospace, monospace">FAILOVER_RES_CONSENSUS_MAY_FAIL</font>, and the Pgpool main process quarantines</div><div>     the backend node and sets its status to DOWN when it receives this code from the watchdog.<br><br>2- When <font face="monospace, monospace">allow_multiple_failover_requests_from_node</font> = on<br>    
the watchdog returns <font face="monospace, monospace">FAILOVER_RES_BUILDING_CONSENSUS</font>, and the Pgpool main process does not</div><div>    quarantine the backend node; its status remains unchanged, so the health check effectively</div><div>    keeps executing on that backend node.<br><br></div><div> Thanks</div><div>Best Regards</div><div>Muhammad Usama</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span>

Best regards,<br>

--<br>

Tatsuo Ishii<br>

SRA OSS, Inc. Japan<br>

English: <a href="http://www.sraoss.co.jp/index_en.php" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_<wbr>en.php</a><br>

Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.<wbr>jp</a><br>

<br>

</span>From: Muhammad Usama &lt;<a href="mailto:m.usama@gmail.com" target="_blank">m.usama@gmail.com</a>&gt;<br>

Subject: New Feature with patch: Quorum and Consensus for backend failover<br>

Date: Tue, 22 Aug 2017 00:18:27 +0500<br>

Message-ID: &lt;CAEJvTzUbz-d8dfsJdLt=XNYWdOMx<wbr>Kf06sp+p=uAbxyjvG=<a href="mailto:vS3A@mail.gmail.com" target="_blank">vS3A@mail.<wbr>gmail.com</a>&gt;<br>

<span><br>

&gt; Hi<br>

&gt;<br>

&gt; I was working on a new feature to make the backend node failover quorum<br>

&gt; aware, and halfway through the implementation I also added a<br>

&gt; majority consensus feature for the same.<br>

&gt;<br>

&gt; So please find the first version of the patch for review that makes the<br>

&gt; backend node failover consider the watchdog cluster quorum status and seek<br>

&gt; the majority consensus before performing failover.<br>

&gt;<br>

</span>&gt; *Changes in the Failover mechanism with watchdog.*<br>

<span>&gt; For this new feature I have modified Pgpool-II&#39;s existing failover<br>

&gt; mechanism with watchdog.<br>

&gt; Previously, as you know, when Pgpool-II needed to perform a node<br>

&gt; operation (failover, failback, promote-node) with the watchdog, the<br>

&gt; watchdog propagated the failover request to all the Pgpool-II nodes<br>

&gt; in the watchdog cluster, and as soon as a node received the request<br>

&gt; it initiated the local failover, and that failover was<br>

&gt; synchronised on all nodes using distributed locks.<br>

&gt;<br>

</span>&gt; *Now Only the Master node performs the failover.*<br>

<span>&gt; The attached patch changes the mechanism of synchronised failover, and now<br>

&gt; only the Pgpool-II of master watchdog node performs the failover, and all<br>

&gt; other standby nodes sync the backend statuses after the master Pgpool-II is<br>

&gt; finished with the failover.<br>

&gt;<br>

</span><span>&gt; *Overview of new failover mechanism.*<br>

</span><span>&gt; -- If a failover request is received by the standby watchdog node (from the<br>

&gt; local Pgpool-II), that request is forwarded to the master watchdog and the<br>

&gt; Pgpool-II main process gets back the FAILOVER_RES_WILL_BE_DONE<br>

&gt; return code. Upon receiving FAILOVER_RES_WILL_BE_DONE from the<br>

&gt; watchdog for the failover request, the requesting Pgpool-II moves forward<br>

&gt; without doing anything further for that particular failover command.<br>

&gt;<br>

&gt; -- Now when the failover request from standby node is received by the<br>

&gt; master watchdog, after performing the validation and applying the consensus<br>

&gt; rules, the failover request is triggered on the local Pgpool-II.<br>

&gt;<br>

&gt; -- When a failover request is received by the master watchdog node from<br>

&gt; the local Pgpool-II (on the IPC channel), the watchdog process informs the<br>

&gt; requesting Pgpool-II process to proceed with failover (provided all<br>

&gt; failover rules are satisfied).<br>

&gt;<br>

&gt; -- After the failover is finished on the master Pgpool-II, the failover<br>

</span>&gt; function calls the *wd_failover_end*() which sends the backend sync<br>

<span>&gt; required message to all standby watchdogs.<br>

&gt;<br>

&gt; -- Upon receiving the sync required message from master watchdog node all<br>

&gt; Pgpool-II nodes sync the new statuses of each backend node from the master<br>

&gt; watchdog.<br>

&gt;<br>

</span>&gt; *No More Failover locks*<br>

<span>&gt; Since with this new failover mechanism we do not require any<br>

&gt; synchronisation or guards against the execution of failover_commands by<br>

&gt; multiple Pgpool-II nodes, the patch removes all the distributed locks<br>

&gt; from the failover function. This makes the failover simpler and faster.<br>

&gt;<br>

</span><span>&gt; *New kind of Failover operation NODE_QUARANTINE_REQUEST*<br>

</span><span>&gt; The patch adds a new kind of backend node operation, NODE_QUARANTINE, which<br>

&gt; is effectively the same as NODE_DOWN, but with node_quarantine the<br>

&gt; failover_command is not triggered.<br>

&gt; The NODE_DOWN_REQUEST is automatically converted to a<br>

&gt; NODE_QUARANTINE_REQUEST when failover is requested on the backend node<br>

&gt; but the watchdog cluster does not hold the quorum.<br>

&gt; This means that in the absence of quorum the failed backend nodes are<br>

&gt; quarantined, and when the quorum becomes available again Pgpool-II<br>

&gt; performs the failback operation on all quarantined nodes.<br>

&gt; Again, when the failback is performed on a quarantined backend node, the<br>

&gt; failover function does not trigger the failback_command.<br>

&gt;<br>

</span>&gt; *Controlling the Failover behaviour.*<br>

<span>&gt; The patch adds three new configuration parameters to configure the failover<br>

&gt; behaviour from user side.<br>

&gt;<br>

</span>&gt; *failover_when_quorum_exists*<br>

<span>&gt; When enabled the failover command will only be executed when the watchdog<br>

&gt; cluster holds the quorum. And when the quorum is absent and<br>

&gt; failover_when_quorum_exists is enabled, the failed backend nodes will be<br>

&gt; quarantined until the quorum becomes available again.<br>

&gt; Disabling it restores the old behaviour of failover commands.<br>

&gt;<br>

&gt;<br>

</span>&gt; *failover_require_consensus*<br>

&gt; This new configuration parameter can be used to<br>

<span>&gt; make sure we get the majority vote before performing the failover on the<br>

</span>&gt; node. When *failover_require_consensus* is enabled then the failover is<br>

<span>&gt; only performed after receiving the failover request from the majority of<br>

&gt; Pgpool-II nodes.<br>

&gt; For example, in a three-node cluster the failover will not be performed until<br>

&gt; at least two nodes ask for performing the failover on the particular<br>

&gt; backend node.<br>

&gt;<br>

</span>&gt; It is also worthwhile to mention here that *failover_require_consensus*<br>

<span>&gt; only works when failover_when_quorum_exists is enabled.<br>

&gt;<br>

&gt;<br>

</span>&gt; *enable_multiple_failover_requests_from_node*<br>

&gt; This parameter works in connection with *failover_require_consensus*<br>

<span>&gt; config. When enabled a single Pgpool-II node can vote for failover multiple<br>

&gt; times.<br>

&gt; For example, in a three-node cluster, if one Pgpool-II node sends the<br>

&gt; failover request for a particular node twice, that is counted as two<br>

&gt; votes in favour of failover, and the failover will be performed even if we<br>

&gt; do not get a vote from the other two nodes.<br>

&gt;<br>

</span>&gt; And when *enable_multiple_failover_requests_from_node* is disabled, only<br>

<span>&gt; the first vote from each Pgpool-II will be accepted and all<br>

&gt; subsequent votes will be marked as duplicates and rejected.<br>

&gt; So in that case we will require majority votes from distinct nodes to<br>

&gt; execute the failover.<br>

</span>&gt; Again, *enable_multiple_failover_requests_from_node* only becomes<br>

<span>&gt; effective when both *failover_when_quorum_exists* and<br>

&gt; *failover_require_consensus* are enabled.<br>

&gt;<br>

&gt;<br>

&gt; *Controlling the failover: The Coding perspective.*<br>

</span><span>&gt; Although the failover functions are made quorum and consensus aware,<br>

&gt; there is still a way to bypass the quorum conditions and the requirement of<br>

&gt; consensus.<br>

&gt;<br>

&gt; For this the patch uses the existing request_details flags in<br>

&gt; POOL_REQUEST_NODE to control the behaviour of failover.<br>

&gt;<br>

&gt; Here are the newly added flags values.<br>

&gt;<br>

</span>&gt; *REQ_DETAIL_WATCHDOG*:<br>

<span>&gt; Setting this flag while issuing the failover command will not send the<br>

&gt; failover request to the watchdog. But this flag may not be useful in any<br>

&gt; other place than where it is already used.<br>

&gt; Mostly this flag can be used to keep a failover command that already<br>

&gt; originated from the watchdog from going back to the watchdog. Otherwise we can end up<br>

&gt; in an infinite loop.<br>

&gt;<br>

</span>&gt; *REQ_DETAIL_CONFIRMED*:<br>

&gt; Setting this flag will bypass the *failover_require_consensus*<br>

<span>&gt; configuration and immediately perform the failover if quorum is present.<br>

&gt; This flag can be used to issue the failover request originated from PCP<br>

&gt; command.<br>

&gt;<br>

</span>&gt; *REQ_DETAIL_UPDATE*:<br>

<span>&gt; This flag is used for the command where we are failing back the quarantined<br>

&gt; nodes. Setting this flag will not trigger the failback_command.<br>

&gt;<br>

</span><span>&gt; *Some conditional flags used:*<br>

</span><span>&gt; I was not sure about the configuration of each type of failover operation.<br>

&gt; As we have three main failover operations NODE_UP_REQUEST,<br>

&gt; NODE_DOWN_REQUEST, and PROMOTE_NODE_REQUEST<br>

&gt; So I was wondering whether we need to give users a configuration option<br>

&gt; to enable/disable quorum checking and consensus for each individual<br>

&gt; failover operation type.<br>

&gt; For example: is it a practical configuration where a user would want to<br>

&gt; ensure quorum while performing a NODE_DOWN operation but does not want it<br>

&gt; for NODE_UP?<br>

&gt; So in this patch I use three compile-time defines to enable or disable this for each<br>

&gt; individual failover operation, until we decide on the best solution.<br>

&gt;<br>

&gt; NODE_UP_REQUIRE_CONSENSUS: defining it will enable quorum checking feature<br>

&gt; for NODE_UP_REQUESTs<br>

&gt;<br>

&gt; NODE_DOWN_REQUIRE_CONSENSUS: defining it will enable quorum checking<br>

&gt; feature for NODE_DOWN_REQUESTs<br>

&gt;<br>

&gt; NODE_PROMOTE_REQUIRE_CONSENSUS: defining it will enable quorum checking<br>

&gt; feature for PROMOTE_NODE_REQUESTs<br>

&gt;<br>

</span><span>&gt; *Some Point for Discussion:*<br>

&gt;<br>

</span>&gt; *Do we really need to check ReqInfo-&gt;switching flag before enqueuing<br>

&gt; failover request.*<br>

<span>&gt; While working on the patch I was wondering why do we disallow enqueuing the<br>

&gt; failover command when the failover is already in progress? For example in<br>

</span>&gt; *pcp_process_command*() function if we see the *Req_info-&gt;switching* flag<br>

<span>&gt; set we bail out with an error instead of enqueuing the command. Is it<br>

&gt; really necessary?<br>

&gt;<br>

</span>&gt; *Do we need more granular control over each failover operation:*<br>

<span>&gt; As described in section &quot;Some conditional flags used&quot;, I would like opinions on whether<br>

&gt; we need configuration parameters in pgpool.conf to enable or disable quorum<br>

&gt; and consensus checking on individual failover types.<br>

&gt;<br>

</span>&gt; *Which failover should be marked as Confirmed:*<br>

<span>&gt; As defined in the above section on REQ_DETAIL_CONFIRMED, we can mark a<br>

&gt; failover request as not needing consensus; currently the requests from the PCP<br>

&gt; commands are fired with this flag. But I was wondering whether there may be more<br>

&gt; places where we may need to use the flag.<br>

&gt; For example I currently use the same confirmed flag when failover is<br>

</span>&gt; triggered because of *replication_stop_on_mismatch*.<br>

<span>&gt;<br>

&gt; I think we should consider this flag for each place where failover is triggered, e.g. when the<br>

&gt; failover is triggered<br>

&gt; because of health_check failure<br>

&gt; because of replication mismatch<br>

&gt; because of backend_error<br>

&gt; etc.<br>

&gt;<br>

</span>&gt; *Node Quarantine behaviour.*<br>

<span>&gt; What do you think about the node quarantine used by this patch? Can you<br>

&gt; think of some problem which can be caused by this?<br>

&gt;<br>

</span>&gt; *What should be the default values for each of the newly added config parameters?*<br>

&gt;<br>

&gt;<br>

&gt;<br>

&gt; *TODOs*<br>

<div class="gmail-m_6349959011663528566HOEnZb"><div class="gmail-m_6349959011663528566h5">&gt;<br>

&gt; -- Updating the documentation is still to do. I will do that once every aspect<br>

&gt; of the feature is finalised.<br>

&gt; -- Some code warnings and cleanups are still not done.<br>

&gt; -- I am still a little short on testing<br>

&gt; -- Regression test cases for the feature<br>

&gt;<br>

&gt;<br>

&gt; Thoughts and suggestions are most welcome.<br>

&gt;<br>

&gt; Thanks<br>

&gt; Best regards<br>

&gt; Muhammad Usama<br>

</div></div></blockquote></div><br></div></div></div>
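<div>P.S. To make the vote counting discussed in this thread concrete, here is a rough Python sketch. It is not the watchdog code: the class <font face="monospace, monospace">FailoverVotes</font>, its fields, the returned strings and the node names are all made up for illustration, and it assumes that "majority" means strictly more than half of the Pgpool-II nodes in the watchdog cluster.</div>
<pre>
```python
from dataclasses import dataclass, field


@dataclass
class FailoverVotes:
    """Toy tally of failover requests for one backend node (illustration only)."""

    cluster_size: int     # number of Pgpool-II nodes in the watchdog cluster
    allow_multiple: bool  # stands in for allow_multiple_failover_requests_from_node
    votes: list = field(default_factory=list)

    def request(self, pgpool_node: str) -> str:
        """Register one failover request and report how it was treated."""
        if pgpool_node in self.votes and not self.allow_multiple:
            # A repeated vote from the same node is rejected as a duplicate.
            return "duplicate"
        self.votes.append(pgpool_node)
        # Assumed rule: strictly more than half of the cluster must have voted.
        if len(self.votes) * 2 > self.cluster_size:
            return "failover"
        return "building consensus"
```
</pre>
<div>In a three-node cluster with the parameter off, a second request from the same pgpool is treated as a duplicate, so a second distinct node has to ask before the failover runs. With the parameter on, the second request from the same node already reaches the majority.</div>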