<div dir="ltr">Hi Ishii-San<div><br></div><div><div class="gmail_extra"><div class="gmail_quote">On Tue, Nov 28, 2017 at 10:06 PM, Tatsuo Ishii <span dir="ltr"><<a href="mailto:ishii@sraoss.co.jp" target="_blank">ishii@sraoss.co.jp</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Usama,<br>
<div><div class="gmail-h5"><br>
> Hi Ishii-San,<br>
><br>
> On Tue, Nov 28, 2017 at 5:55 AM, Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp">ishii@sraoss.co.jp</a>> wrote:<br>
><br>
>> Hi Usama,<br>
>><br>
>> While writing presentation material for Pgpool-II 3.7, I am not sure<br>
>> I understand the quorum consensus behavior.<br>
>><br>
>> > *enable_multiple_failover_<wbr>requests_from_node*<br>
>> > This parameter works in connection with *failover_require_consensus*<br>
>> > config. When enabled a single Pgpool-II node can vote for failover<br>
>> multiple<br>
>> > times.<br>
>><br>
>> In what situation could a Pgpool-II node send multiple failover<br>
>> requests? My guess is in the following scenario:<br>
>><br>
>> 1) A Pgpool-II watchdog standby health check process detects the failure<br>
>> of backend A and sends a failover request to the master Pgpool-II.<br>
>><br>
>> 2) Since the vote does not satisfy the quorum consensus, failover does<br>
>> not occur. Just backend_info->quarantine is set and<br>
>> backend_info->backend_status is set to CON_DOWN.<br>
>><br>
>> 3) The Pgpool-II watchdog standby health check process detects the failure<br>
>> of backend A again, then sends a failover request to the master<br>
>> Pgpool-II again. If enable_multiple_failover_<wbr>requests_from_node is<br>
>> set, failover will happen.<br>
>><br>
>> But after thinking more, I realized that in step 3, since<br>
>> backend_status is already set to CON_DOWN, the health check will not be<br>
>> performed against backend A. So the watchdog standby will not send<br>
>> multiple votes.<br>
><br>
><br>
>> Apparently I am missing something here.<br>
>><br>
>> Can you please tell me in what scenario a watchdog sends<br>
>> multiple votes for failover?<br>
>><br>
>><br>
> Basically, when allow_multiple_failover_<wbr>requests_from_node is set, the<br>
> watchdog does not perform the quarantine operation and the node status is<br>
> not changed to DOWN.<br>
> So it is possible for the node to send multiple votes for node failover.<br>
> Also, even when allow_multiple_failover_<wbr>requests_from_node is not set,<br>
> Pgpool-II does not quarantine the node straight away after the first<br>
> failover request while the watchdog is waiting for consensus. What happens<br>
> is, when the watchdog receives a failover request that requires a<br>
> consensus, it returns FAILOVER_RES_CONSENSUS_MAY_<wbr>FAIL, and when the main<br>
> pgpool process receives this return code for the failover request from the<br>
> watchdog, it just ignores the request without changing the backend node<br>
> status to down, and relies on the watchdog to handle that failover<br>
> request; meanwhile pgpool continues with its normal duties.<br>
><br>
> Now, when the same pgpool sends the failover request for the same backend<br>
> node a second time,<br>
> the behaviour depends upon the setting of the<br>
> allow_multiple_failover_<wbr>requests_from_node configuration.<br>
><br>
> 1- When allow_multiple_failover_<wbr>requests_from_node = off<br>
> the watchdog returns FAILOVER_RES_CONSENSUS_MAY_<wbr>FAIL, and the Pgpool main<br>
> process quarantines<br>
> the backend node and sets its status to DOWN when it receives this code<br>
> from the watchdog.<br>
><br>
> 2- When allow_multiple_failover_<wbr>requests_from_node = on<br>
> the watchdog returns FAILOVER_RES_BUILDING_<wbr>CONSENSUS, and the Pgpool main<br>
> process does not<br>
> quarantine the backend node; its status remains unchanged, and<br>
> effectively the health check<br>
> keeps executing on that backend node.<br>
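The two cases above can be sketched in C as a single decision function. This is only an illustrative simplification, not the actual Pgpool-II watchdog code; the two return-code names come from the discussion above, everything else is assumed.

```c
#include <stdbool.h>

/* Illustrative sketch only: the real watchdog code in Pgpool-II is far
 * more involved. Return-code names follow the discussion above. */
typedef enum
{
	FAILOVER_RES_CONSENSUS_MAY_FAIL,	/* caller quarantines the node */
	FAILOVER_RES_BUILDING_CONSENSUS		/* caller leaves the node status UP */
} FAILOVER_RES;

/* Reply chosen by the watchdog for a *repeat* failover request from the
 * same Pgpool-II node while consensus is still being built. */
FAILOVER_RES
reply_to_repeat_request(bool allow_multiple_failover_requests_from_node)
{
	if (allow_multiple_failover_requests_from_node)
		return FAILOVER_RES_BUILDING_CONSENSUS;	/* duplicate counts as another vote */

	return FAILOVER_RES_CONSENSUS_MAY_FAIL;	/* duplicate rejected; node gets quarantined */
}
```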
<br>
</div></div>So when allow_multiple_failover_<wbr>requests_from_node = on, Pgpool-II<br>
never sets the backend node status to DOWN?<br></blockquote><div><br></div><div>Basically the backend status is set to down either when a failover or a quarantine operation is performed on the node.</div><div>So with <font face="monospace, monospace">allow_multiple_failover_<wbr>requests_from_node</font> = TRUE, the pgpool main process does not perform the quarantine operation</div><div>while the watchdog is building the consensus for failover. But as soon as the consensus is built, the failover is executed and</div><div>the node status is set to down. So effectively, from the first failover request until the consensus is built and the actual</div><div>failover is performed, the node status remains UP and the node can send multiple failover requests.</div><div><br></div><div><br></div><div>Please see the flow diagrams below of the scenarios for the <font face="monospace, monospace">allow_multiple_failover_<wbr>requests_from_node</font> ON and OFF cases</div><div>for further clarification.</div><div><br></div><div><b>SCENARIO1</b>: when <span style="font-family:monospace,monospace">allow_multiple_failover_</span><wbr style="font-family:monospace,monospace"><span style="font-family:monospace,monospace">requests_from_node = TRUE</span></div><div><span style="font-family:monospace,monospace"><img src="cid:ii_jal5vem90_160083b5a0a42712" width="493" height="562"><br><br></span></div><div><div><b>SCENARIO2</b>: when <span style="font-family:monospace,monospace">allow_multiple_failover_</span><wbr style="font-family:monospace,monospace"><span style="font-family:monospace,monospace">requests_from_node = FALSE</span></div></div><div><span style="font-family:monospace,monospace"><img src="cid:ii_jal5w1re1_160083bceea9beb8" width="493" height="562"><br><br></span></div><div><br></div><div>Note that what makes the difference is the return code from the watchdog on the second failover request in the two scenarios.</div><div><br></div><div>Please let me know if you need further
information/clarifications.</div><div><br></div><div>Thanks</div><div>Best regards</div><div>Muhammad Usama</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
But the manual says:<br>
"For example, in a three node watchdog cluster, if one Pgpool-II node<br>
sends two failover requests for a particular backend node failover,<br>
Both requests will be counted as a separate vote in the favor of the<br>
failover and Pgpool-II will execute the failover, even if it does not<br>
get the vote from any other Pgpool-II node."<br>
<br>
I am confused.<br>
<div class="gmail-HOEnZb"><div class="gmail-h5"><br>
Best regards,<br>
--<br>
Tatsuo Ishii<br>
SRA OSS, Inc. Japan<br>
English: <a href="http://www.sraoss.co.jp/index_en.php" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_<wbr>en.php</a><br>
Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.<wbr>jp</a><br>
<br>
> Thanks<br>
> Best Regards<br>
> Muhammad Usama<br>
><br>
> Best regards,<br>
>> --<br>
>> Tatsuo Ishii<br>
>> SRA OSS, Inc. Japan<br>
>> English: <a href="http://www.sraoss.co.jp/index_en.php" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_<wbr>en.php</a><br>
>> Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.<wbr>jp</a><br>
>><br>
>> From: Muhammad Usama <<a href="mailto:m.usama@gmail.com">m.usama@gmail.com</a>><br>
>> Subject: New Feature with patch: Quorum and Consensus for backend failover<br>
>> Date: Tue, 22 Aug 2017 00:18:27 +0500<br>
>> Message-ID: <CAEJvTzUbz-d8dfsJdLt=<wbr>XNYWdOMxKf06sp+p=uAbxyjvG=<wbr>vS3A@mail.gmail.com><br>
>><br>
>> > Hi<br>
>> ><br>
>> > I was working on the new feature to make the backend node failover quorum<br>
>> > aware, and halfway through the implementation I also added the<br>
>> > majority consensus feature for the same.<br>
>> ><br>
>> > So please find the first version of the patch for review that makes the<br>
>> > backend node failover consider the watchdog cluster quorum status and<br>
>> seek<br>
>> > the majority consensus before performing failover.<br>
>> ><br>
>> > *Changes in the Failover mechanism with watchdog.*<br>
>> > For this new feature I have modified the Pgpool-II's existing failover<br>
>> > mechanism with watchdog.<br>
>> > Previously, as you know, when Pgpool-II needed to perform a node<br>
>> > operation (failover, failback, promote-node) with the watchdog, the<br>
>> > watchdog used to propagate the failover request to all the Pgpool-II<br>
>> > nodes in the watchdog cluster, and as soon as the request was received<br>
>> > by a node, it used to initiate the local failover; that failover was<br>
>> > synchronised on all nodes using the distributed locks.<br>
>> ><br>
>> > *Now Only the Master node performs the failover.*<br>
>> > The attached patch changes the mechanism of synchronised failover: now<br>
>> > only the Pgpool-II of the master watchdog node performs the failover,<br>
>> > and all other standby nodes sync the backend statuses after the master<br>
>> > Pgpool-II is finished with the failover.<br>
>> ><br>
>> > *Overview of new failover mechanism.*<br>
>> > -- If the failover request is received by a standby watchdog node (from<br>
>> > the local Pgpool-II), that request is forwarded to the master watchdog,<br>
>> > and the Pgpool-II main process gets the FAILOVER_RES_WILL_BE_DONE<br>
>> > return code. Upon receiving FAILOVER_RES_WILL_BE_DONE from the<br>
>> > watchdog for the failover request, the requesting Pgpool-II moves forward<br>
>> > without doing anything further for the particular failover command.<br>
>> ><br>
>> > -- Now, when the failover request from a standby node is received by the<br>
>> > master watchdog, after performing the validation and applying the consensus<br>
>> > rules, the failover request is triggered on the local Pgpool-II.<br>
>> ><br>
>> > -- When the failover request is received by the master watchdog node from<br>
>> > the local Pgpool-II (on the IPC channel), the watchdog process informs the<br>
>> > requesting Pgpool-II process to proceed with the failover (provided all<br>
>> > failover rules are satisfied).<br>
>> ><br>
>> > -- After the failover is finished on the master Pgpool-II, the failover<br>
>> > function calls *wd_failover_end*(), which sends the backend sync<br>
>> > required message to all standby watchdogs.<br>
>> ><br>
>> > -- Upon receiving the sync required message from the master watchdog<br>
>> > node, all Pgpool-II nodes sync the new statuses of each backend node<br>
>> > from the master watchdog.<br>
>> ><br>
>> > *No More Failover locks*<br>
>> > Since with this new failover mechanism we no longer require any<br>
>> > synchronisation or guards against the execution of failover_commands by<br>
>> > multiple Pgpool-II nodes, the patch removes all the distributed locks<br>
>> > from the failover function. This makes the failover simpler and faster.<br>
>> ><br>
>> > *New kind of Failover operation NODE_QUARANTINE_REQUEST*<br>
>> > The patch adds a new kind of backend node operation, NODE_QUARANTINE,<br>
>> > which is effectively the same as NODE_DOWN, but with node_quarantine the<br>
>> > failover_command is not triggered.<br>
>> > A NODE_DOWN_REQUEST is automatically converted to a<br>
>> > NODE_QUARANTINE_REQUEST when the failover is requested on a backend<br>
>> > node but the watchdog cluster does not hold the quorum.<br>
>> > This means that in the absence of quorum the failed backend nodes are<br>
>> > quarantined, and when the quorum becomes available again Pgpool-II<br>
>> > performs the failback operation on all quarantined nodes.<br>
>> > And again, when the failback is performed on a quarantined backend node,<br>
>> > the failover function does not trigger the failback_command.<br>
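The down-vs-quarantine conversion described above can be sketched like this. A hedged sketch with assumed names; the real request handling in Pgpool-II differs.

```c
#include <stdbool.h>

/* Sketch of the rule described above: a NODE_DOWN_REQUEST becomes a
 * NODE_QUARANTINE_REQUEST when the watchdog cluster lacks the quorum,
 * so the failover_command is not triggered. Names are illustrative. */
typedef enum { NODE_DOWN_REQUEST, NODE_QUARANTINE_REQUEST } NODE_REQUEST;

NODE_REQUEST
effective_request(NODE_REQUEST req, bool cluster_holds_quorum)
{
	if (req == NODE_DOWN_REQUEST && !cluster_holds_quorum)
		return NODE_QUARANTINE_REQUEST;	/* quarantine instead of full failover */

	return req;
}
```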
>> ><br>
>> > *Controlling the Failover behaviour.*<br>
>> > The patch adds three new configuration parameters for configuring the<br>
>> > failover behaviour from the user side.<br>
>> ><br>
>> > *failover_when_quorum_exists*<br>
>> > When enabled, the failover command will only be executed when the watchdog<br>
>> > cluster holds the quorum. When the quorum is absent and<br>
>> > failover_when_quorum_exists is enabled, the failed backend nodes get<br>
>> > quarantined until the quorum becomes available again.<br>
>> > Disabling it restores the old behaviour of failover commands.<br>
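For clarity, the quorum condition can be thought of as a simple majority of live watchdog nodes. This is an assumed simplification, not how Pgpool-II literally computes it:

```c
#include <stdbool.h>

/* Assumed simplification: the watchdog cluster holds the quorum when
 * more than half of the configured nodes are alive and connected. */
bool
cluster_holds_quorum(int total_nodes, int alive_nodes)
{
	return alive_nodes > total_nodes / 2;
}
```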
>> ><br>
>> ><br>
>> > *failover_require_consensus*<br>
>> > This new configuration parameter can be used to<br>
>> > make sure we get a majority vote before performing the failover on a<br>
>> > node. When *failover_require_consensus* is enabled, the failover is<br>
>> > only performed after receiving failover requests from the majority of<br>
>> > Pgpool-II nodes.<br>
>> > For example, in a three-node cluster the failover will not be performed<br>
>> > until at least two nodes ask for performing the failover on the particular<br>
>> > backend node.<br>
>> ><br>
>> > It is also worthwhile to mention here that *failover_require_consensus*<br>
>> > only works when failover_when_quorum_exists is enabled.<br>
>> ><br>
>> ><br>
>> > *enable_multiple_failover_<wbr>requests_from_node*<br>
>> > This parameter works in connection with the *failover_require_consensus*<br>
>> > config. When enabled, a single Pgpool-II node can vote for failover<br>
>> > multiple times.<br>
>> > For example, in a three-node cluster, if one Pgpool-II node sends the<br>
>> > failover request for a particular node twice, that is counted as two<br>
>> > votes in favour of the failover, and the failover will be performed even<br>
>> > if we do not get a vote from the other two nodes.<br>
>> ><br>
>> > And when *enable_multiple_failover_<wbr>requests_from_node* is disabled, only<br>
>> > the first vote from each Pgpool-II node is accepted and all other<br>
>> > subsequent votes are marked duplicate and rejected.<br>
>> > So in that case we require a majority of votes from distinct nodes to<br>
>> > execute the failover.<br>
>> > Again, *enable_multiple_failover_<wbr>requests_from_node* only becomes<br>
>> > effective when both *failover_when_quorum_exists* and<br>
>> > *failover_require_consensus* are enabled.<br>
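The vote-counting rule above can be sketched as follows. This is illustrative only, not the actual Pgpool-II implementation; the parameter name is from the mail, the function and array names are assumptions.

```c
#include <stdbool.h>

#define MAX_WD_NODES 64

/* Sketch of the consensus rule described above. voter_ids[i] is the
 * watchdog node that cast the i-th failover request for the backend. */
bool
consensus_reached(const int *voter_ids, int nvotes, int total_nodes,
				  bool enable_multiple_failover_requests_from_node)
{
	int needed = total_nodes / 2 + 1;	/* majority of the cluster */

	if (enable_multiple_failover_requests_from_node)
		return nvotes >= needed;	/* duplicates count as separate votes */

	/* otherwise only distinct voters count; duplicates are rejected */
	bool seen[MAX_WD_NODES] = {false};
	int distinct = 0;
	for (int i = 0; i < nvotes; i++)
	{
		if (!seen[voter_ids[i]])
		{
			seen[voter_ids[i]] = true;
			distinct++;
		}
	}
	return distinct >= needed;
}
```

So in a three-node cluster two requests from one node suffice only when the parameter is on; otherwise two distinct nodes must vote.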
>> ><br>
>> ><br>
>> > *Controlling the failover: The Coding perspective.*<br>
>> > Although the failover functions are made quorum and consensus aware,<br>
>> > there is still a way to bypass the quorum conditions and the requirement<br>
>> > of consensus.<br>
>> ><br>
>> > For this the patch uses the existing request_details flags in<br>
>> > POOL_REQUEST_NODE to control the behaviour of failover.<br>
>> ><br>
>> > Here are the newly added flags values.<br>
>> ><br>
>> > *REQ_DETAIL_WATCHDOG*:<br>
>> > Setting this flag while issuing the failover command will not send the<br>
>> > failover request to the watchdog. This flag may not be useful in any<br>
>> > place other than where it is already used.<br>
>> > Mostly this flag is used to keep a failover command that already<br>
>> > originated from the watchdog from going back to the watchdog. Otherwise<br>
>> > we could end up in an infinite loop.<br>
>> ><br>
>> > *REQ_DETAIL_CONFIRMED*:<br>
>> > Setting this flag will bypass the *failover_require_consensus*<br>
>> > configuration and immediately perform the failover if the quorum is present.<br>
>> > This flag can be used for failover requests originating from a PCP<br>
>> > command.<br>
>> ><br>
>> > *REQ_DETAIL_UPDATE*:<br>
>> > This flag is used for the commands where we are failing back the<br>
>> > quarantined nodes. Setting this flag will not trigger the failback_command.<br>
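The three request_details flags above could be modelled as a bitmask, as sketched here. The flag names are from the mail; the numeric values and the helper function are illustrative assumptions, not the real Pgpool-II definitions.

```c
#include <stdbool.h>

/* Sketch of the request_details bit flags described above.
 * Flag names from the mail; values are illustrative only. */
#define REQ_DETAIL_WATCHDOG   0x01	/* do not forward the request to the watchdog */
#define REQ_DETAIL_CONFIRMED  0x02	/* bypass failover_require_consensus */
#define REQ_DETAIL_UPDATE     0x04	/* failback of a quarantined node; no failback_command */

/* Does this failover request still need the majority consensus? */
bool
needs_consensus(unsigned int request_details, bool failover_require_consensus)
{
	if (request_details & REQ_DETAIL_CONFIRMED)
		return false;		/* e.g. a request issued by a PCP command */

	return failover_require_consensus;
}
```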
>> ><br>
>> > *Some conditional flags used:*<br>
>> > I was not sure about the configuration of each type of failover<br>
>> > operation.<br>
>> > As we have three main failover operations, NODE_UP_REQUEST,<br>
>> > NODE_DOWN_REQUEST, and PROMOTE_NODE_REQUEST,<br>
>> > I was thinking: do we need to give users a configuration option,<br>
>> > if they want to enable/disable quorum checking and consensus for an<br>
>> > individual failover operation type?<br>
>> > For example: is it a practical configuration where a user would want to<br>
>> > ensure quorum while performing a NODE_DOWN operation but does not want it<br>
>> > for NODE_UP?<br>
>> > So in this patch I use three compile-time defines to enable/disable the<br>
>> > individual failover operations, until we decide on the best solution.<br>
>> ><br>
>> > NODE_UP_REQUIRE_CONSENSUS: defining it will enable the quorum checking<br>
>> > feature for NODE_UP_REQUESTs<br>
>> ><br>
>> > NODE_DOWN_REQUIRE_CONSENSUS: defining it will enable the quorum checking<br>
>> > feature for NODE_DOWN_REQUESTs<br>
>> ><br>
>> > NODE_PROMOTE_REQUIRE_<wbr>CONSENSUS: defining it will enable the quorum<br>
>> > checking feature for PROMOTE_NODE_REQUESTs<br>
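A minimal sketch of how such a compile-time switch could gate the check per request type; the define name is from the mail, the helper function is an illustrative assumption.

```c
/* Sketch: quorum/consensus checking compiled in per request type.
 * Comment out the define to disable it for NODE_DOWN_REQUESTs. */
#define NODE_DOWN_REQUIRE_CONSENSUS

int
node_down_requires_consensus(void)
{
#ifdef NODE_DOWN_REQUIRE_CONSENSUS
	return 1;
#else
	return 0;
#endif
}
```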
>> ><br>
>> > *Some Point for Discussion:*<br>
>> ><br>
>> > *Do we really need to check the Req_info->switching flag before enqueuing<br>
>> > a failover request?*<br>
>> > While working on the patch I was wondering why we disallow enqueuing the<br>
>> > failover command when a failover is already in progress. For example, in<br>
>> > the *pcp_process_command*() function, if we see the *Req_info->switching*<br>
>> > flag set, we bail out with an error instead of enqueuing the command. Is<br>
>> > it really necessary?<br>
>> ><br>
>> > *Do we need more granular control over each failover operation?*<br>
>> > As described in the section "Some conditional flags used", I want opinions<br>
>> > on whether we need configuration parameters in pgpool.conf to<br>
>> > enable/disable quorum and consensus checking on individual failover types.<br>
>> ><br>
>> > *Which failover requests should be marked as confirmed?*<br>
>> > As described in the above section on REQ_DETAIL_CONFIRMED, we can mark a<br>
>> > failover request as not needing consensus; currently the requests from<br>
>> > the PCP commands are fired with this flag. But I was wondering whether<br>
>> > there may be more places where we need to use the flag.<br>
>> > For example, I currently use the same confirmed flag when failover is<br>
>> > triggered because of *replication_stop_on_mismatch*<wbr>.<br>
>> ><br>
>> > I think we should consider this flag for each source of failover, like<br>
>> > when the failover is triggered<br>
>> > because of a health_check failure,<br>
>> > because of a replication mismatch,<br>
>> > because of a backend_error,<br>
>> > etc.<br>
>> ><br>
>> > *Node Quarantine behaviour.*<br>
>> > What do you think about the node quarantine behaviour used by this patch?<br>
>> > Can you think of some problem that could be caused by it?<br>
>> ><br>
>> > *What should be the default values for each newly added config<br>
>> > parameter?*<br>
>> ><br>
>> ><br>
>> ><br>
>> > *TODOs*<br>
>> ><br>
>> > -- Updating the documentation is still a todo. Will do that once every<br>
>> > aspect of the feature is finalised.<br>
>> > -- Some code warnings and cleanups are still not done.<br>
>> > -- I am still a little short on testing.<br>
>> > -- Regression test cases for the feature<br>
>> ><br>
>> ><br>
>> > Thoughts and suggestions are most welcome.<br>
>> ><br>
>> > Thanks<br>
>> > Best regards<br>
>> > Muhammad Usama<br>
>><br>
</div></div></blockquote></div><br></div></div></div>