[pgpool-hackers: 2622] Re: New Feature with patch: Quorum and Consensus for backend failover

Tatsuo Ishii ishii at sraoss.co.jp
Wed Nov 29 02:06:36 JST 2017


Usama,

> Hi Ishii-San,
> 
> On Tue, Nov 28, 2017 at 5:55 AM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> Hi Usama,
>>
>> While writing presentation material for Pgpool-II 3.7, I realized I am
>> not sure I understand the quorum consensus behavior.
>>
>> > *enable_multiple_failover_requests_from_node*
>> > This parameter works in connection with *failover_require_consensus*
>> > config. When enabled a single Pgpool-II node can vote for failover
>> multiple
>> > times.
>>
>> In what situation could a Pgpool-II node send multiple failover
>> requests? My guess is the following scenario:
>>
>> 1) The Pgpool-II watchdog standby health check process detects the
>>    failure of backend A and sends a failover request to the master
>>    Pgpool-II.
>>
>> 2) Since the vote does not satisfy the quorum consensus, failover does
>>    not occur. Only backend_info->quarantine is set and
>>    backend_info->backend_status is set to CON_DOWN.
>>
>> 3) The Pgpool-II watchdog standby health check process detects the
>>    failure of backend A again and sends a failover request to the
>>    master Pgpool-II again. If enable_multiple_failover_requests_from_node
>>    is set, failover will happen.
>>
>> But after thinking about it more, I realized that in step 3, since
>> backend_status is already set to CON_DOWN, the health check will not be
>> performed against backend A. So the watchdog standby will not send
>> multiple votes.
> 
> 
>> Apparently I am missing something here.
>>
>> Can you please tell me in what scenario a watchdog sends multiple
>> votes for failover?
>>
>>
> Basically, when allow_multiple_failover_requests_from_node is set, the
> watchdog does not perform the quarantine operation and the node status
> is not changed to DOWN, so it is possible for the node to send multiple
> votes for a node failover.
> Also, even when allow_multiple_failover_requests_from_node is not set,
> Pgpool-II does not quarantine the node straightaway after the first
> failover request while the watchdog is waiting for consensus. What
> happens is: when the watchdog receives a failover request that requires
> a consensus, it returns FAILOVER_RES_CONSENSUS_MAY_FAIL, and when the
> main pgpool process receives this return code for its failover request,
> it simply ignores the request without changing the backend node status
> to down and relies on the watchdog to handle that failover request.
> Meanwhile pgpool continues with its normal duties.
> 
> Now when the same pgpool sends a failover request for the same backend
> node a second time, the behaviour depends on the setting of
> allow_multiple_failover_requests_from_node:
> 
> 1- When allow_multiple_failover_requests_from_node = off,
>    the watchdog returns FAILOVER_RES_CONSENSUS_MAY_FAIL, and the Pgpool
>    main process quarantines the backend node and sets its status to
>    DOWN when it receives this code from the watchdog.
> 
> 2- When allow_multiple_failover_requests_from_node = on,
>    the watchdog returns FAILOVER_RES_BUILDING_CONSENSUS, and the Pgpool
>    main process does not quarantine the backend node; its status
>    remains unchanged, so the health check effectively keeps executing
>    on that backend node.
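> 
> A minimal sketch of that decision in the main process (illustrative
> only; wd_result, node_id and the helper are hypothetical names, not
> the actual Pgpool-II source):
> 
>     switch (wd_result)
>     {
>         case FAILOVER_RES_CONSENSUS_MAY_FAIL:
>             /* allow_multiple_failover_requests_from_node = off:
>              * quarantine the node and set its status to DOWN. */
>             quarantine_backend_node(node_id); /* hypothetical helper */
>             break;
>         case FAILOVER_RES_BUILDING_CONSENSUS:
>             /* allow_multiple_failover_requests_from_node = on:
>              * leave the status unchanged so the health check keeps
>              * running and this node can vote again. */
>             break;
>     }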

So when allow_multiple_failover_requests_from_node = on, Pgpool-II
never sets the backend node status to DOWN?

But the manual says:
"For example, in a three node watchdog cluster, if one Pgpool-II node
sends two failover requests for a particular backend node failover,
Both requests will be counted as a separate vote in the favor of the
failover and Pgpool-II will execute the failover, even if it does not
get the vote from any other Pgpool-II node."

I am confused.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

>  Thanks
> Best Regards
> Muhammad Usama
> 
> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>>
>> From: Muhammad Usama <m.usama at gmail.com>
>> Subject: New Feature with patch: Quorum and Consensus for backend failover
>> Date: Tue, 22 Aug 2017 00:18:27 +0500
>> Message-ID: <CAEJvTzUbz-d8dfsJdLt=XNYWdOMxKf06sp+p=uAbxyjvG=vS3A at mail.
>> gmail.com>
>>
>> > Hi
>> >
>> > I have been working on a new feature to make the backend node
>> > failover quorum aware, and halfway through the implementation I also
>> > added a majority consensus feature for it.
>> >
>> > So please find the first version of the patch for review. It makes
>> > the backend node failover consider the watchdog cluster quorum status
>> > and seek majority consensus before performing a failover.
>> >
>> > *Changes in the Failover mechanism with watchdog.*
>> > For this new feature I have modified Pgpool-II's existing failover
>> > mechanism with watchdog.
>> > Previously, as you know, when Pgpool-II needed to perform a node
>> > operation (failover, failback, promote-node) with the watchdog, the
>> > watchdog propagated the failover request to all the Pgpool-II nodes
>> > in the watchdog cluster, and as soon as a node received the request
>> > it initiated the local failover; that failover was synchronised on
>> > all nodes using distributed locks.
>> >
>> > *Now Only the Master node performs the failover.*
>> > The attached patch changes the mechanism of synchronised failover:
>> > now only the Pgpool-II of the master watchdog node performs the
>> > failover, and all other standby nodes sync the backend statuses after
>> > the master Pgpool-II has finished the failover.
>> >
>> > *Overview of the new failover mechanism.*
>> > -- If the failover request is received by a standby watchdog node
>> > (from the local Pgpool-II), that request is forwarded to the master
>> > watchdog and the Pgpool-II main process gets the
>> > FAILOVER_RES_WILL_BE_DONE return code. Upon receiving
>> > FAILOVER_RES_WILL_BE_DONE from the watchdog for the failover request,
>> > the requesting Pgpool-II moves forward without doing anything further
>> > for that particular failover command.
>> >
>> > -- When the master watchdog receives the failover request from a
>> > standby node, it validates the request, applies the consensus rules,
>> > and triggers the failover on the local Pgpool-II.
>> >
>> > -- When the failover request is received by the master watchdog node
>> > from the local Pgpool-II (on the IPC channel), the watchdog process
>> > informs the requesting Pgpool-II process to proceed with the failover
>> > (provided all failover rules are satisfied).
>> >
>> > -- After the failover is finished on the master Pgpool-II, the
>> > failover function calls *wd_failover_end*(), which sends the "backend
>> > sync required" message to all standby watchdogs.
>> >
>> > -- Upon receiving the "sync required" message from the master
>> > watchdog node, all standby Pgpool-II nodes sync the new status of
>> > each backend node from the master watchdog.
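>> >
>> > As a condensed sketch of the master-side ordering above (only
>> > wd_failover_end() is named in the patch; the other names here are
>> > illustrative):
>> >
>> >     /* Sketch, not patch code: the master runs the failover locally,
>> >      * then asks the standbys to sync the backend statuses. */
>> >     static void failover_on_master(void)
>> >     {
>> >         perform_failover();   /* illustrative stand-in */
>> >         wd_failover_end();    /* notify standby watchdogs to sync */
>> >     }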
>> >
>> > *No More Failover locks*
>> > Since with this new failover mechanism we no longer require any
>> > synchronisation or guards against the execution of failover_commands
>> > by multiple Pgpool-II nodes, the patch removes all the distributed
>> > locks from the failover function. This makes the failover simpler and
>> > faster.
>> >
>> > *New kind of Failover operation NODE_QUARANTINE_REQUEST*
>> > The patch adds a new kind of backend node operation, NODE_QUARANTINE,
>> > which is effectively the same as NODE_DOWN, except that with node
>> > quarantine the failover_command is not triggered.
>> > A NODE_DOWN_REQUEST is automatically converted to a
>> > NODE_QUARANTINE_REQUEST when failover is requested on a backend node
>> > but the watchdog cluster does not hold the quorum.
>> > This means that in the absence of quorum the failed backend nodes are
>> > quarantined, and when the quorum becomes available again Pgpool-II
>> > performs the failback operation on all quarantined nodes.
>> > Likewise, when the failback is performed on a quarantined backend
>> > node, the failover function does not trigger the failback_command.
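>> >
>> > The conversion is essentially (a hedged sketch; the quorum helper is
>> > an illustrative stand-in for the patch's actual check):
>> >
>> >     /* Sketch: demote a node-down request to quarantine when the
>> >      * watchdog cluster lacks quorum; no failover_command runs. */
>> >     if (request == NODE_DOWN_REQUEST && !cluster_holds_quorum())
>> >         request = NODE_QUARANTINE_REQUEST;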
>> >
>> > *Controlling the Failover behaviour.*
>> > The patch adds three new configuration parameters to let the user
>> > configure the failover behaviour (a combined configuration example
>> > follows the three parameter descriptions below).
>> >
>> > *failover_when_quorum_exists*
>> > When enabled, the failover command will only be executed when the
>> > watchdog cluster holds the quorum. When the quorum is absent and
>> > failover_when_quorum_exists is enabled, the failed backend nodes will
>> > be quarantined until the quorum becomes available again.
>> > Disabling it restores the old behaviour of failover commands.
>> >
>> >
>> > *failover_require_consensus*
>> > This new configuration parameter can be used to make sure we get a
>> > majority vote before performing the failover on a node. When
>> > *failover_require_consensus* is enabled, the failover is only
>> > performed after receiving failover requests from the majority of
>> > Pgpool-II nodes.
>> > For example, in a three-node cluster the failover will not be
>> > performed until at least two nodes ask for the failover of the
>> > particular backend node.
>> >
>> > It is also worth mentioning that *failover_require_consensus* only
>> > works when failover_when_quorum_exists is enabled.
>> >
>> >
>> > *enable_multiple_failover_requests_from_node*
>> > This parameter works in connection with the
>> > *failover_require_consensus* config. When enabled, a single Pgpool-II
>> > node can vote for failover multiple times.
>> > For example, in a three-node cluster, if one Pgpool-II node sends the
>> > failover request for a particular node twice, that is counted as two
>> > votes in favour of the failover, and the failover will be performed
>> > even if we do not get a vote from the other two nodes.
>> >
>> > When *enable_multiple_failover_requests_from_node* is disabled, only
>> > the first vote from each Pgpool-II will be accepted and all
>> > subsequent votes will be marked duplicate and rejected.
>> > So in that case we require majority votes from distinct nodes to
>> > execute the failover.
>> > Again, *enable_multiple_failover_requests_from_node* only becomes
>> > effective when both *failover_when_quorum_exists* and
>> > *failover_require_consensus* are enabled.
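>> >
>> > Putting the three parameters together, a hedged pgpool.conf sketch
>> > for a three-node watchdog cluster (values illustrative, not
>> > suggested defaults):
>> >
>> >     # 3-node watchdog cluster; majority = 2 votes
>> >     failover_when_quorum_exists = on   # no quorum -> quarantine only
>> >     failover_require_consensus = on    # need 2 of 3 failover votes
>> >     enable_multiple_failover_requests_from_node = off
>> >                                        # votes must come from
>> >                                        # distinct nodes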
>> >
>> >
>> > *Controlling the failover: The Coding perspective.*
>> > Although the failover functions are made quorum and consensus aware,
>> > there is still a way to bypass the quorum conditions and the
>> > requirement of consensus.
>> >
>> > For this the patch uses the existing request_details flags in
>> > POOL_REQUEST_NODE to control the behaviour of failover.
>> >
>> > Here are the newly added flag values (a usage sketch follows the
>> > three descriptions).
>> >
>> > *REQ_DETAIL_WATCHDOG*:
>> > Setting this flag while issuing the failover command prevents the
>> > failover request from being sent to the watchdog. This flag may not
>> > be useful anywhere other than where it is already used: mainly it
>> > keeps a failover command that originated from the watchdog from being
>> > sent back to the watchdog, since otherwise we could end up in an
>> > infinite loop.
>> >
>> > *REQ_DETAIL_CONFIRMED*:
>> > Setting this flag will bypass the *failover_require_consensus*
>> > configuration and immediately perform the failover if quorum is
>> > present. This flag can be used for failover requests originating from
>> > PCP commands.
>> >
>> > *REQ_DETAIL_UPDATE*:
>> > This flag is used for the command that fails back the quarantined
>> > nodes. Setting this flag will not trigger the failback_command.
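>> >
>> > For illustration, a hedged sketch of marking a request (only
>> > POOL_REQUEST_NODE, request_details and the REQ_DETAIL_* flags are
>> > from the patch; the rest is a stand-in, not actual code):
>> >
>> >     POOL_REQUEST_NODE req;
>> >     req.request_details = 0;
>> >     /* e.g. a PCP-originated request skips the consensus voting */
>> >     req.request_details |= REQ_DETAIL_CONFIRMED;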
>> >
>> > *Some conditional flags used:*
>> > I was not sure about the configuration for each type of failover
>> > operation. We have three main failover operations: NODE_UP_REQUEST,
>> > NODE_DOWN_REQUEST, and PROMOTE_NODE_REQUEST. So I was wondering
>> > whether we need to give users a configuration option to
>> > enable/disable quorum checking and consensus for each individual
>> > failover operation type.
>> > For example: is it a practical configuration where a user would want
>> > to ensure quorum while performing a NODE_DOWN operation but not want
>> > it for NODE_UP?
>> > So in this patch I use three compile-time defines to enable/disable
>> > it for the individual failover operations, until we decide on the
>> > best solution (see the sketch after this list):
>> >
>> > NODE_UP_REQUIRE_CONSENSUS: defining it enables the quorum checking
>> > feature for NODE_UP_REQUESTs
>> >
>> > NODE_DOWN_REQUIRE_CONSENSUS: defining it enables the quorum checking
>> > feature for NODE_DOWN_REQUESTs
>> >
>> > NODE_PROMOTE_REQUIRE_CONSENSUS: defining it enables the quorum
>> > checking feature for PROMOTE_NODE_REQUESTs
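>> >
>> > For instance, enabling quorum checking only for node-down requests
>> > would look like this (the define is from the patch; the guarded
>> > fragment is illustrative):
>> >
>> >     #define NODE_DOWN_REQUIRE_CONSENSUS
>> >
>> >     bool need_consensus = false;
>> >     #ifdef NODE_DOWN_REQUIRE_CONSENSUS
>> >     if (request == NODE_DOWN_REQUEST)
>> >         need_consensus = true;   /* route through watchdog voting */
>> >     #endif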
>> >
>> > *Some Points for Discussion:*
>> >
>> > *Do we really need to check the ReqInfo->switching flag before
>> > enqueuing a failover request?*
>> > While working on the patch I was wondering why we disallow enqueuing
>> > a failover command when a failover is already in progress. For
>> > example, in the *pcp_process_command*() function, if we see the
>> > *Req_info->switching* flag set, we bail out with an error instead of
>> > enqueuing the command. Is this really necessary?
>> >
>> > *Do we need more granular control over each failover operation?*
>> > As described in the section "Some conditional flags used", I would
>> > like opinions on whether we need configuration parameters in
>> > pgpool.conf to enable/disable quorum and consensus checking on
>> > individual failover types.
>> >
>> > *Which failovers should be marked as Confirmed:*
>> > As described in the REQ_DETAIL_CONFIRMED section above, we can mark a
>> > failover request as not needing consensus; currently the requests
>> > from the PCP commands are fired with this flag. But I was wondering
>> > whether there may be more places where we need to use the flag.
>> > For example, I currently use the same confirmed flag when failover is
>> > triggered because of *replication_stop_on_mismatch*.
>> >
>> > I think we should consider this flag for each place failover is
>> > triggered, for instance when it is triggered:
>> > because of a health_check failure,
>> > because of a replication mismatch,
>> > because of a backend error,
>> > etc.
>> >
>> > *Node Quarantine behaviour.*
>> > What do you think about the node quarantine behaviour used by this
>> > patch? Can you think of any problem it could cause?
>> >
>> > *What should be the default value for each newly added config
>> > parameter?*
>> >
>> >
>> >
>> > *TODOs*
>> >
>> > -- Updating the documentation is still to do. I will do that once
>> > every aspect of the feature is finalised.
>> > -- Some code warnings and cleanups are still pending.
>> > -- I am still a little short on testing.
>> > -- Regression test cases for the feature are needed.
>> >
>> >
>> > Thoughts and suggestions are most welcome.
>> >
>> > Thanks
>> > Best regards
>> > Muhammad Usama
>>

