[pgpool-hackers: 2629] Re: New Feature with patch: Quorum and Consensus for backend failover

Tatsuo Ishii ishii at sraoss.co.jp
Fri Dec 1 08:33:12 JST 2017


Usama,

Thank you very much for the diagrams and explanations.
It was complex:-)

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> Hi Ishii-San
> 
> On Tue, Nov 28, 2017 at 10:06 PM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> Usama,
>>
>> > Hi Ishii-San,
>> >
>> > On Tue, Nov 28, 2017 at 5:55 AM, Tatsuo Ishii <ishii at sraoss.co.jp>
>> wrote:
>> >
>> >> Hi Usama,
>> >>
>> >> While writing a presentation material for Pgpool-II 3.7, I am not sure
>> >> I understand the quorum consensus behavior.
>> >>
>> >> > *enable_multiple_failover_requests_from_node*
>> >> > This parameter works in connection with *failover_require_consensus*
>> >> > config. When enabled a single Pgpool-II node can vote for failover
>> >> multiple
>> >> > times.
>> >>
>> >> In what situation could a Pgpool-II node send multiple failover
>> >> requests? My guess is the following scenario:
>> >>
>> >> 1) A Pgpool-II watchdog standby health check process detects the failure
>> >>    of backend A and sends a failover request to the master Pgpool-II.
>> >>
>> >> 2) Since the vote does not satisfy the quorum consensus, failover does
>> >>    not occur. Just backend_info->quarantine is set and
>> >>    backend_info->backend_status is set to CON_DOWN.
>> >>
>> >> 3) The Pgpool-II watchdog standby health check process detects the
>> >>    failure of backend A again, then sends a failover request to the
>> >>    master Pgpool-II again. If enable_multiple_failover_requests_from_node
>> >>    is set, failover will happen.
>> >>
>> >> But after thinking more, I realized that in step 3, since
>> >> backend_status is already set to CON_DOWN, the health check will not be
>> >> performed against backend A. So the watchdog standby will not send
>> >> multiple votes.
>> >
>> >
>> >> Apparently I am missing something here.
>> >>
>> >> Can you please tell what is the scenario in that a watchdog sends
>> >> multiple votes for failover?
>> >>
>> >>
>> > Basically, when allow_multiple_failover_requests_from_node is set, the
>> > watchdog does not perform the quarantine operation and the node status
>> > is not changed to DOWN.
>> > So it is possible for the node to send multiple votes for node failover.
>> > Also, even when allow_multiple_failover_requests_from_node is not set,
>> > Pgpool-II does not quarantine the node straightaway after the first
>> > failover request while the watchdog is waiting for consensus. What
>> > happens is, when the watchdog receives a failover request that requires
>> > a consensus, it returns FAILOVER_RES_CONSENSUS_MAY_FAIL, and when the
>> > main pgpool process receives this return code for the failover request,
>> > it just ignores the request without changing the backend node status to
>> > down and relies on the watchdog to handle that failover request;
>> > meanwhile pgpool continues with its normal duties.
>> >
>> > Now when the same pgpool sends a failover request for the same backend
>> > node a second time, the behaviour depends upon the setting of the
>> > allow_multiple_failover_requests_from_node configuration.
>> >
>> > 1- When allow_multiple_failover_requests_from_node = off,
>> >    the watchdog returns FAILOVER_RES_CONSENSUS_MAY_FAIL, and the Pgpool
>> >    main process quarantines the backend node and sets its status to
>> >    DOWN when it receives this code from the watchdog.
>> >
>> > 2- When allow_multiple_failover_requests_from_node = on,
>> >    the watchdog returns FAILOVER_RES_BUILDING_CONSENSUS, and the Pgpool
>> >    main process does not quarantine the backend node; its status
>> >    remains unchanged, so effectively the health check keeps executing
>> >    on that backend node.
>>
>> So when allow_multiple_failover_requests_from_node = on, Pgpool-II
>> never sets the backend node status to DOWN?
>>
> 
> Basically, the backend status is set to down either when a failover or a
> quarantine operation is performed on the node. So with
> allow_multiple_failover_requests_from_node = TRUE, the pgpool main
> process does not perform the quarantine operation while the watchdog is
> building the consensus for failover. But as soon as consensus is built,
> the failover is executed and the node status is set to down. So
> effectively, from the first failover request until the consensus is
> built and the actual failover is performed, the node status remains UP
> and the node can send multiple failover requests.
> 
> 
> Please see the below flow diagrams of the scenarios with
> allow_multiple_failover_requests_from_node ON and OFF cases
> for further clarification.
> 
> *SCENARIO1*: when allow_multiple_failover_requests_from_node = TRUE
> 
> *SCENARIO2*: when allow_multiple_failover_requests_from_node = FALSE
> 
> Note that the return code from the watchdog on the second failover
> request is what makes the difference between the two scenarios.
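The difference between the two scenarios can be sketched as follows. This is an illustrative simplification in Python, not pgpool's C source; the function name and its argument are hypothetical, while the FAILOVER_RES_* codes come from the discussion above:

```python
# Illustrative sketch: the master watchdog's reply to a *repeated* failover
# request from the same node, depending on
# allow_multiple_failover_requests_from_node.

FAILOVER_RES_CONSENSUS_MAY_FAIL = "CONSENSUS_MAY_FAIL"
FAILOVER_RES_BUILDING_CONSENSUS = "BUILDING_CONSENSUS"

def reply_to_repeated_request(allow_multiple_votes: bool) -> str:
    """Return code sent while consensus is still being built."""
    if allow_multiple_votes:
        # The node may keep voting: no quarantine, health checks keep running.
        return FAILOVER_RES_BUILDING_CONSENSUS
    # Only one vote per node counts: the requester quarantines the backend.
    return FAILOVER_RES_CONSENSUS_MAY_FAIL
```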
> 
> Please let me know if you want further information/clarifications.
> 
> Thanks
> Best regards
> Muhammad Usama
> 
>>
>> But the manual says:
>> "For example, in a three node watchdog cluster, if one Pgpool-II node
>> sends two failover requests for a particular backend node failover,
>> Both requests will be counted as a separate vote in the favor of the
>> failover and Pgpool-II will execute the failover, even if it does not
>> get the vote from any other Pgpool-II node."
>>
>> I am confused.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>>
>> >  Thanks
>> > Best Regards
>> > Muhammad Usama
>> >
>> > Best regards,
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS, Inc. Japan
>> >> English: http://www.sraoss.co.jp/index_en.php
>> >> Japanese:http://www.sraoss.co.jp
>> >>
>> >> From: Muhammad Usama <m.usama at gmail.com>
>> >> Subject: New Feature with patch: Quorum and Consensus for backend
>> failover
>> >> Date: Tue, 22 Aug 2017 00:18:27 +0500
>> >> Message-ID: <CAEJvTzUbz-d8dfsJdLt=XNYWdOMxKf06sp+p=uAbxyjvG=vS3A at mail.gmail.com>
>> >>
>> >> > Hi
>> >> >
>> >> > I was working on the new feature to make the backend node failover
>> quorum
>> >> > aware and on the half way through the implementation I also added the
>> >> > majority consensus feature for the same.
>> >> >
>> >> > So please find the first version of the patch for review that makes
>> the
>> >> > backend node failover consider the watchdog cluster quorum status and
>> >> seek
>> >> > the majority consensus before performing failover.
>> >> >
>> >> > *Changes in the Failover mechanism with watchdog.*
>> >> > For this new feature I have modified Pgpool-II's existing failover
>> >> > mechanism with watchdog.
>> >> > Previously, as you know, when Pgpool-II needed to perform a node
>> >> > operation (failover, failback, promote-node) with the watchdog, the
>> >> > watchdog propagated the failover request to all the Pgpool-II nodes
>> >> > in the watchdog cluster, and as soon as the request was received by a
>> >> > node, it initiated the local failover, and that failover was
>> >> > synchronised on all nodes using distributed locks.
>> >> >
>> >> > *Now only the master node performs the failover.*
>> >> > The attached patch changes the mechanism of synchronised failover:
>> >> > now only the Pgpool-II of the master watchdog node performs the
>> >> > failover, and all other standby nodes sync the backend statuses after
>> >> > the master Pgpool-II has finished the failover.
>> >> >
>> >> > *Overview of new failover mechanism.*
>> >> > -- If the failover request is received to the standby watchdog
>> node(from
>> >> > local Pgpool-II), That request is forwarded to the master watchdog and
>> >> the
>> >> > Pgpool-II main process is returned with the FAILOVER_RES_WILL_BE_DONE
>> >> > return code. And upon receiving the FAILOVER_RES_WILL_BE_DONE from the
>> >> > watchdog for the failover request the requesting Pgpool-II moves
>> forward
>> >> > without doing anything further for the particular failover command.
>> >> >
>> >> > -- Now, when the failover request from a standby node is received by
>> >> > the master watchdog, after performing the validation and applying the
>> >> > consensus rules, the failover request is triggered on the local
>> >> > Pgpool-II.
>> >> >
>> >> > -- When the failover request is received by the master watchdog node
>> >> > from the local Pgpool-II (on the IPC channel), the watchdog process
>> >> > informs the requesting Pgpool-II process to proceed with the failover
>> >> > (provided all failover rules are satisfied).
>> >> >
>> >> > -- After the failover is finished on the master Pgpool-II, the
>> >> > failover function calls *wd_failover_end*(), which sends the
>> >> > backend-sync-required message to all standby watchdogs.
>> >> >
>> >> > -- Upon receiving the sync-required message from the master watchdog
>> >> > node, all Pgpool-II nodes sync the new status of each backend node
>> >> > from the master watchdog.
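The steps above can be sketched roughly as follows. This is an illustrative Python sketch, not pgpool's C implementation; the function and action names are hypothetical, while FAILOVER_RES_WILL_BE_DONE and the sync message come from the description above:

```python
# Illustrative sketch of the master-only failover flow: a standby only
# forwards requests; the master validates, fails over locally, then tells
# all standbys to sync backend statuses (wd_failover_end()).

def process_failover_request(is_master: bool, consensus_ok: bool) -> list[str]:
    """Return the sequence of actions taken for one failover request."""
    if not is_master:
        # A standby forwards the request and its local Pgpool-II moves on.
        return ["forward_to_master", "return FAILOVER_RES_WILL_BE_DONE"]
    if not consensus_ok:
        # The master holds the request until validation/consensus succeed.
        return ["wait_for_consensus"]
    # Only the master Pgpool-II actually executes the failover.
    return ["run_local_failover", "broadcast_backend_sync_required"]
```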
>> >> >
>> >> > *No More Failover locks*
>> >> > Since with this new failover mechanism we no longer require any
>> >> > synchronisation or guards against the execution of failover_commands
>> >> > by multiple Pgpool-II nodes, the patch removes all the distributed
>> >> > locks from the failover function. This makes the failover simpler
>> >> > and faster.
>> >> >
>> >> > *New kind of failover operation: NODE_QUARANTINE_REQUEST*
>> >> > The patch adds a new kind of backend node operation, NODE_QUARANTINE,
>> >> > which is effectively the same as NODE_DOWN, but with node quarantine
>> >> > the failover_command is not triggered.
>> >> > The NODE_DOWN_REQUEST is automatically converted to a
>> >> > NODE_QUARANTINE_REQUEST when failover is requested on a backend node
>> >> > but the watchdog cluster does not hold the quorum.
>> >> > This means that in the absence of quorum the failed backend nodes are
>> >> > quarantined, and when the quorum becomes available again Pgpool-II
>> >> > performs the failback operation on all quarantined nodes.
>> >> > And again, when the failback is performed on a quarantined backend
>> >> > node, the failover function does not trigger the failback_command.
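The quarantine conversion described above can be sketched like this. It is an illustrative Python sketch with hypothetical helper names; the request names come from the patch description:

```python
# Illustrative sketch: without quorum, a NODE_DOWN request is downgraded to
# NODE_QUARANTINE, which behaves like NODE_DOWN but never triggers the
# failover_command (and failing back a quarantined node skips
# failback_command).

def effective_request(request: str, have_quorum: bool) -> str:
    if request == "NODE_DOWN_REQUEST" and not have_quorum:
        return "NODE_QUARANTINE_REQUEST"
    return request

def runs_failover_command(request: str) -> bool:
    # Quarantine marks the node down without running any user command.
    return request != "NODE_QUARANTINE_REQUEST"
```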
>> >> >
>> >> > *Controlling the failover behaviour.*
>> >> > The patch adds three new configuration parameters to configure the
>> >> > failover behaviour from the user side.
>> >> >
>> >> > *failover_when_quorum_exists*
>> >> > When enabled, the failover command will only be executed when the
>> >> > watchdog cluster holds the quorum. When the quorum is absent and
>> >> > failover_when_quorum_exists is enabled, the failed backend nodes will
>> >> > be quarantined until the quorum becomes available again.
>> >> > Disabling it will restore the old behaviour of failover commands.
>> >> >
>> >> >
>> >> > *failover_require_consensus*
>> >> > This new configuration parameter can be used to make sure we get a
>> >> > majority vote before performing the failover on a node. When
>> >> > *failover_require_consensus* is enabled, the failover is only
>> >> > performed after receiving failover requests from a majority of
>> >> > Pgpool-II nodes.
>> >> > For example, in a three-node cluster the failover will not be
>> >> > performed until at least two nodes ask for the failover on the
>> >> > particular backend node.
>> >> >
>> >> > It is also worthwhile to mention here that
>> >> > *failover_require_consensus* only works when
>> >> > failover_when_quorum_exists is enabled.
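The majority rule in the example above can be written out as a small sketch. This is illustrative Python, not pgpool source, and the function names are hypothetical:

```python
# Illustrative sketch: majority consensus for failover_require_consensus.
# With a 3-node watchdog cluster, at least 2 distinct nodes must vote.

def majority(cluster_size: int) -> int:
    # Strictly more than half of all watchdog nodes.
    return cluster_size // 2 + 1

def consensus_reached(voting_nodes: set[str], cluster_size: int) -> bool:
    # Each distinct Pgpool-II node contributes one vote.
    return len(voting_nodes) >= majority(cluster_size)
```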
>> >> >
>> >> >
>> >> > *enable_multiple_failover_requests_from_node*
>> >> > This parameter works in connection with *failover_require_consensus*.
>> >> > When enabled, a single Pgpool-II node can vote for failover multiple
>> >> > times.
>> >> > For example, in a three-node cluster, if one Pgpool-II node sends the
>> >> > failover request for a particular node twice, that is counted as two
>> >> > votes in favour of the failover, and the failover will be performed
>> >> > even if we do not get a vote from the other two nodes.
>> >> >
>> >> > When *enable_multiple_failover_requests_from_node* is disabled, only
>> >> > the first vote from each Pgpool-II will be accepted and all
>> >> > subsequent votes will be marked duplicate and rejected.
>> >> > So in that case we will require majority votes from distinct nodes to
>> >> > execute the failover.
>> >> > Again, *enable_multiple_failover_requests_from_node* only becomes
>> >> > effective when both *failover_when_quorum_exists* and
>> >> > *failover_require_consensus* are enabled.
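Putting the three parameters together, a pgpool.conf fragment for the strictest behaviour might look like the following. The values are only an example of one possible setup, not recommended defaults (the defaults were still under discussion in this thread):

```
failover_when_quorum_exists = on
failover_require_consensus = on
enable_multiple_failover_requests_from_node = off
```

With this combination, a failed backend is only detached after a majority of distinct watchdog nodes request it, and is merely quarantined while the cluster lacks quorum.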
>> >> >
>> >> >
>> >> > *Controlling the failover: the coding perspective.*
>> >> > Although the failover functions are made quorum and consensus aware,
>> >> > there is still a way to bypass the quorum conditions and the
>> >> > requirement of consensus.
>> >> >
>> >> > For this, the patch uses the existing request_details flags in
>> >> > POOL_REQUEST_NODE to control the behaviour of failover.
>> >> >
>> >> > Here are the newly added flag values.
>> >> >
>> >> > *REQ_DETAIL_WATCHDOG*:
>> >> > Setting this flag while issuing the failover command will not send
>> >> > the failover request to the watchdog. This flag may not be useful
>> >> > anywhere other than where it is already used.
>> >> > Mostly, this flag can be used to stop a failover command that already
>> >> > originated from the watchdog from going back to the watchdog;
>> >> > otherwise we could end up in an infinite loop.
>> >> >
>> >> > *REQ_DETAIL_CONFIRMED*:
>> >> > Setting this flag will bypass the *failover_require_consensus*
>> >> > configuration and immediately perform the failover if quorum is
>> >> > present. This flag can be used for failover requests originating from
>> >> > a PCP command.
>> >> >
>> >> > *REQ_DETAIL_UPDATE*:
>> >> > This flag is used for the command where we are failing back
>> >> > quarantined nodes. Setting this flag will not trigger the
>> >> > failback_command.
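The effect of these flags can be sketched as bit tests. The flag names come from the patch, but the bit values and helper functions here are hypothetical, written in Python rather than the C of the actual request_details field:

```python
# Illustrative sketch: how the request_details bit flags alter how a
# failover request is handled. Bit values are made up for this example.

REQ_DETAIL_WATCHDOG  = 0x1  # do not send this request to the watchdog
REQ_DETAIL_CONFIRMED = 0x2  # skip failover_require_consensus (quorum still checked)
REQ_DETAIL_UPDATE    = 0x4  # failing back a quarantined node; no failback_command

def needs_consensus(request_details: int) -> bool:
    return not (request_details & REQ_DETAIL_CONFIRMED)

def goes_to_watchdog(request_details: int) -> bool:
    # Requests that originated from the watchdog set REQ_DETAIL_WATCHDOG
    # so they are not forwarded again (avoiding an infinite loop).
    return not (request_details & REQ_DETAIL_WATCHDOG)
```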
>> >> >
>> >> > *Some conditional flags used:*
>> >> > I was not sure about the configuration of each type of failover
>> >> > operation. We have three main failover operations: NODE_UP_REQUEST,
>> >> > NODE_DOWN_REQUEST, and PROMOTE_NODE_REQUEST.
>> >> > So I was thinking about whether we need to give a configuration
>> >> > option to the users, if they want to enable/disable quorum checking
>> >> > and consensus for each individual failover operation type.
>> >> > For example: is it a practical configuration where a user would want
>> >> > to ensure quorum while performing a NODE_DOWN operation but would
>> >> > not want it for NODE_UP?
>> >> > So in this patch I use three compile-time defines to enable/disable
>> >> > the individual failover operations, while we decide on the best
>> >> > solution.
>> >> >
>> >> > NODE_UP_REQUIRE_CONSENSUS: defining it will enable the quorum
>> >> > checking feature for NODE_UP_REQUESTs
>> >> >
>> >> > NODE_DOWN_REQUIRE_CONSENSUS: defining it will enable the quorum
>> >> > checking feature for NODE_DOWN_REQUESTs
>> >> >
>> >> > NODE_PROMOTE_REQUIRE_CONSENSUS: defining it will enable the quorum
>> >> > checking feature for PROMOTE_NODE_REQUESTs
>> >> >
>> >> > *Some points for discussion:*
>> >> >
>> >> > *Do we really need to check the Req_info->switching flag before
>> >> > enqueuing a failover request?*
>> >> > While working on the patch I was wondering why we disallow enqueuing
>> >> > a failover command when a failover is already in progress. For
>> >> > example, in the *pcp_process_command*() function, if we see the
>> >> > *Req_info->switching* flag set, we bail out with an error instead of
>> >> > enqueuing the command. Is it really necessary?
>> >> >
>> >> > *Do we need more granular control over each failover operation?*
>> >> > As described in the section "Some conditional flags used", I want
>> >> > opinions on whether we need configuration parameters in pgpool.conf
>> >> > to enable/disable quorum and consensus checking on individual
>> >> > failover types.
>> >> >
>> >> > *Which failover requests should be marked as confirmed?*
>> >> > As defined in the above section on REQ_DETAIL_CONFIRMED, we can mark
>> >> > a failover request as not needing consensus; currently the requests
>> >> > from the PCP commands are fired with this flag. But I was wondering
>> >> > whether there may be more places where we need to use the flag.
>> >> > For example, I currently use the same confirmed flag when failover is
>> >> > triggered because of *replication_stop_on_mismatch*.
>> >> >
>> >> > I think we should consider this flag for each place a failover is
>> >> > triggered, e.g. when the failover is triggered
>> >> > because of a health_check failure,
>> >> > because of a replication mismatch,
>> >> > because of a backend error,
>> >> > etc.
>> >> >
>> >> > *Node quarantine behaviour.*
>> >> > What do you think about the node quarantine used by this patch? Can
>> >> > you think of some problem which could be caused by it?
>> >> >
>> >> > *What should the default values be for each newly added config
>> >> > parameter?*
>> >> >
>> >> >
>> >> >
>> >> > *TODOs*
>> >> >
>> >> > -- Updating the documentation is still a todo. I will do that once
>> >> > every aspect of the feature is finalised.
>> >> > -- Some code warnings and cleanups are still not done.
>> >> > -- I am still a little short on testing.
>> >> > -- Regression test cases for the feature.
>> >> >
>> >> >
>> >> > Thoughts and suggestions are most welcome.
>> >> >
>> >> > Thanks
>> >> > Best regards
>> >> > Muhammad Usama
>> >>
>>

