[pgpool-hackers: 2624] Re: New Feature with patch: Quorum and Consensus for backend failover

Muhammad Usama m.usama at gmail.com
Wed Nov 29 23:47:58 JST 2017


Hi Ishii-San

On Tue, Nov 28, 2017 at 10:06 PM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> Usama,
>
> > Hi Ishii-San,
> >
> > On Tue, Nov 28, 2017 at 5:55 AM, Tatsuo Ishii <ishii at sraoss.co.jp>
> wrote:
> >
> >> Hi Usama,
> >>
> >> While writing presentation material for Pgpool-II 3.7, I am not sure
> >> I understand the quorum consensus behavior.
> >>
> >> > *enable_multiple_failover_requests_from_node*
> >> > This parameter works in connection with *failover_require_consensus*
> >> > config. When enabled, a single Pgpool-II node can vote for failover
> >> > multiple times.
> >>
> >> In what situation could a Pgpool-II node send multiple failover
> >> requests? My guess is the following scenario:
> >>
> >> 1) Pgpool-II watchdog standby health check process detects the failure
> >>    of backend A and sends a failover request to the master Pgpool-II.
> >>
> >> 2) Since the vote does not satisfy the quorum consensus, failover does
> >>    not occur. Only backend_info->quarantine is set and
> >>    backend_info->backend_status is set to CON_DOWN.
> >>
> >> 3) Pgpool-II watchdog standby health check process detects the failure
> >>    of backend A again, then sends a failover request to the master
> >>    Pgpool-II again. If enable_multiple_failover_requests_from_node is
> >>    set, failover will happen.
> >>
> >> But after thinking more, I realized that in step 3, since
> >> backend_status is already set to CON_DOWN, the health check will not be
> >> performed against backend A. So the watchdog standby will not send
> >> multiple votes.
> >
> >
> >> Apparently I am missing something here.
> >>
> >> Can you please tell me the scenario in which a watchdog sends
> >> multiple votes for failover?
> >>
> >>
> > Basically, when allow_multiple_failover_requests_from_node is set, the
> > watchdog does not perform the quarantine operation and the node status
> > is not changed to DOWN, so it is possible for the node to send multiple
> > votes for a node failover.
> > Also, even when allow_multiple_failover_requests_from_node is not set,
> > Pgpool-II does not quarantine the node straight away after the first
> > failover request while the watchdog is waiting for consensus. What
> > happens is, when the watchdog receives a failover request that requires
> > a consensus, it returns FAILOVER_RES_CONSENSUS_MAY_FAIL, and when the
> > main pgpool process receives this return code for the failover request
> > from the watchdog, it just ignores the request without changing the
> > backend node status to down and relies on the watchdog to handle that
> > failover request; meanwhile pgpool continues with its normal duties.
> >
> > Now when the same pgpool sends a failover request for the same backend
> > node a second time, the behaviour depends upon the setting of the
> > allow_multiple_failover_requests_from_node configuration.
> >
> > 1- When allow_multiple_failover_requests_from_node = off
> >    The watchdog returns FAILOVER_RES_CONSENSUS_MAY_FAIL, and the
> >    Pgpool-II main process quarantines the backend node and sets its
> >    status to DOWN when it receives this code from the watchdog.
> >
> > 2- When allow_multiple_failover_requests_from_node = on
> >    The watchdog returns FAILOVER_RES_BUILDING_CONSENSUS, and the
> >    Pgpool-II main process does not quarantine the backend node; its
> >    status remains unchanged and the health check effectively keeps
> >    executing on that backend node.
>
> So when allow_multiple_failover_requests_from_node = on, Pgpool-II
> never sets the backend node status to DOWN?
>

Basically, the backend status is set to DOWN either when the failover or
the quarantine operation is performed on the node. With
allow_multiple_failover_requests_from_node = on, the pgpool main process
does not perform the quarantine operation while the watchdog is building
the consensus for failover, but as soon as the consensus is built the
failover is executed and the node status is set to DOWN. So effectively,
from the first failover request until the consensus is built and the
actual failover is performed, the node status remains UP and the node can
keep sending failover requests for that backend.
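
To make the vote counting concrete, here is a rough C-style sketch of how
the master watchdog could tally failover votes while it builds consensus.
This is only an illustration of the behaviour described in this thread,
not the actual watchdog code; MAX_WD_NODES, FailoverVotes and
register_failover_vote() are made-up names for the example.

    #include <stdbool.h>

    #define MAX_WD_NODES 3          /* example: a three node watchdog cluster */

    /* Illustrative sketch only -- not the actual pgpool-II watchdog code. */
    typedef struct FailoverVotes
    {
        int  backend_id;            /* backend the failover votes are for    */
        int  votes;                 /* votes received so far                 */
        bool voted[MAX_WD_NODES];   /* which watchdog node has already voted */
    } FailoverVotes;

    /* Count one failover request as a vote; returns true when enough votes
     * have been gathered and the failover can be executed. */
    static bool
    register_failover_vote(FailoverVotes *fv, int from_node,
                           bool allow_multiple_requests, int majority)
    {
        if (fv->voted[from_node] && !allow_multiple_requests)
        {
            /* Duplicate vote from the same node: rejected, so the
             * consensus must come from distinct nodes. */
            return false;
        }

        fv->voted[from_node] = true;
        fv->votes++;                /* with the option on, a repeated request
                                     * from the same node counts again       */

        return fv->votes >= majority;   /* e.g. 2 votes in a 3 node cluster */
    }

Until enough votes are gathered, the requesting node's backend status
stays UP, which is why the same node can keep voting in the = on case.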


Please see the flow diagrams below for the
allow_multiple_failover_requests_from_node = on and = off scenarios for
further clarification.

*SCENARIO1*: when allow_multiple_failover_requests_from_node = on

[flow diagram: scenario1.png, attached]

*SCENARIO2*: when allow_multiple_failover_requests_from_node = off

[flow diagram: scenario2.png, attached]

Note that the return code sent by the watchdog for the second failover
request is what makes the difference between the two scenarios.
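
And here is a similarly simplified C sketch of how the main pgpool process
could act on the two return codes for the second failover request,
matching the scenarios above. Again it is only a sketch, not the real
pgpool-II source; the enum stands in for the internal watchdog result
codes, and quarantine_backend_node() and continue_health_check() are
made-up helper names.

    /* Illustrative sketch only -- not the actual pgpool-II source. */
    typedef enum
    {
        FAILOVER_RES_CONSENSUS_MAY_FAIL,    /* stand-ins for the internal */
        FAILOVER_RES_BUILDING_CONSENSUS     /* watchdog result codes      */
    } WdFailoverResult;

    static void
    quarantine_backend_node(int node_id)    /* hypothetical helper (stub) */
    {
        (void) node_id;     /* real code would set the status to DOWN
                             * without running failover_command          */
    }

    static void
    continue_health_check(int node_id)      /* hypothetical helper (stub) */
    {
        (void) node_id;     /* real code would simply leave the node UP  */
    }

    static void
    act_on_watchdog_reply(WdFailoverResult wd_result, int backend_node_id)
    {
        switch (wd_result)
        {
            case FAILOVER_RES_CONSENSUS_MAY_FAIL:
                /* allow_multiple_failover_requests_from_node = off:
                 * quarantine the backend node, as in SCENARIO2 above. */
                quarantine_backend_node(backend_node_id);
                break;

            case FAILOVER_RES_BUILDING_CONSENSUS:
                /* allow_multiple_failover_requests_from_node = on:
                 * no quarantine, the status stays UP and the health
                 * check keeps running, so the node can vote again.    */
                continue_health_check(backend_node_id);
                break;
        }
    }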

Please let me know if you want further information/clarifications.

Thanks
Best regards
Muhammad Usama

>
> But the manual says:
> "For example, in a three node watchdog cluster, if one Pgpool-II node
> sends two failover requests for a particular backend node failover,
> Both requests will be counted as a separate vote in the favor of the
> failover and Pgpool-II will execute the failover, even if it does not
> get the vote from any other Pgpool-II node."
>
> I am confused.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
>
> >  Thanks
> > Best Regards
> > Muhammad Usama
> >
> > Best regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS, Inc. Japan
> >> English: http://www.sraoss.co.jp/index_en.php
> >> Japanese:http://www.sraoss.co.jp
> >>
> >> From: Muhammad Usama <m.usama at gmail.com>
> >> Subject: New Feature with patch: Quorum and Consensus for backend
> failover
> >> Date: Tue, 22 Aug 2017 00:18:27 +0500
> >> Message-ID: <CAEJvTzUbz-d8dfsJdLt=XNYWdOMxKf06sp+p=uAbxyjvG=vS3A at mail.
> >> gmail.com>
> >>
> >> > Hi
> >> >
> >> > I was working on a new feature to make the backend node failover
> >> > quorum aware, and halfway through the implementation I also added a
> >> > majority consensus feature for the same.
> >> >
> >> > So please find the first version of the patch for review; it makes
> >> > the backend node failover consider the watchdog cluster quorum status
> >> > and seek majority consensus before performing a failover.
> >> >
> >> > *Changes in the Failover mechanism with watchdog.*
> >> > For this new feature I have modified Pgpool-II's existing failover
> >> > mechanism with watchdog. Previously, as you know, when Pgpool-II
> >> > needed to perform a node operation (failover, failback, promote-node)
> >> > with the watchdog, the watchdog propagated the failover request to
> >> > all the Pgpool-II nodes in the watchdog cluster, and as soon as the
> >> > request was received by a node, it initiated the local failover; that
> >> > failover was synchronised on all nodes using distributed locks.
> >> >
> >> > *Now Only the Master node performs the failover.*
> >> > The attached patch changes the mechanism of synchronised failover:
> >> > now only the Pgpool-II of the master watchdog node performs the
> >> > failover, and all other standby nodes sync the backend statuses after
> >> > the master Pgpool-II has finished the failover.
> >> >
> >> > *Overview of new failover mechanism.*
> >> > -- If a failover request is received by a standby watchdog node
> >> > (from the local Pgpool-II), that request is forwarded to the master
> >> > watchdog and the Pgpool-II main process gets the
> >> > FAILOVER_RES_WILL_BE_DONE return code. Upon receiving
> >> > FAILOVER_RES_WILL_BE_DONE from the watchdog for the failover request,
> >> > the requesting Pgpool-II moves on without doing anything further for
> >> > that particular failover command.
> >> >
> >> > -- Now when the failover request from a standby node is received by
> >> > the master watchdog, after performing the validation and applying the
> >> > consensus rules, the failover request is triggered on the local
> >> > Pgpool-II.
> >> >
> >> > -- When the failover request is received by the master watchdog node
> >> > from the local Pgpool-II (on the IPC channel), the watchdog process
> >> > informs the requesting Pgpool-II process to proceed with the failover
> >> > (provided all failover rules are satisfied).
> >> >
> >> > -- After the failover is finished on the master Pgpool-II, the
> >> > failover function calls *wd_failover_end*(), which sends the
> >> > backend-sync-required message to all standby watchdogs.
> >> >
> >> > -- Upon receiving the sync-required message from the master watchdog
> >> > node, all Pgpool-II nodes sync the new status of each backend node
> >> > from the master watchdog.
> >> >
> >> > *No More Failover locks*
> >> > Since with this new failover mechanism we no longer require any
> >> > synchronisation or guards against the execution of failover_commands
> >> > by multiple Pgpool-II nodes, the patch removes all the distributed
> >> > locks from the failover function. This makes the failover simpler and
> >> > faster.
> >> >
> >> > *New kind of Failover operation NODE_QUARANTINE_REQUEST*
> >> > The patch adds a new kind of backend node operation, NODE_QUARANTINE,
> >> > which is effectively the same as NODE_DOWN, but with node quarantine
> >> > the failover_command is not triggered.
> >> > A NODE_DOWN_REQUEST is automatically converted to a
> >> > NODE_QUARANTINE_REQUEST when failover is requested on a backend node
> >> > but the watchdog cluster does not hold the quorum.
> >> > This means that in the absence of quorum the failed backend nodes are
> >> > quarantined, and when the quorum becomes available again Pgpool-II
> >> > performs the failback operation on all quarantined nodes.
> >> > And when the failback is performed on a quarantined backend node, the
> >> > failover function does not trigger the failback_command.
> >> >
> >> > *Controlling the Failover behaviour.*
> >> > The patch adds three new configuration parameters to configure the
> >> > failover behaviour from the user side.
> >> >
> >> > *failover_when_quorum_exists*
> >> > When enabled, the failover command will only be executed when the
> >> > watchdog cluster holds the quorum. When the quorum is absent and
> >> > failover_when_quorum_exists is enabled, the failed backend nodes get
> >> > quarantined until the quorum becomes available again.
> >> > Disabling it will restore the old behaviour of the failover commands.
> >> >
> >> >
> >> > *failover_require_consensus*
> >> > This new configuration parameter can be used to make sure we get a
> >> > majority vote before performing the failover on a node. When
> >> > *failover_require_consensus* is enabled, the failover is only
> >> > performed after receiving failover requests from a majority of
> >> > Pgpool-II nodes.
> >> > For example, in a three-node cluster the failover will not be
> >> > performed until at least two nodes ask for the failover of the
> >> > particular backend node.
> >> >
> >> > It is also worthwhile to mention here that *failover_require_consensus*
> >> > only works when failover_when_quorum_exists is enabled.
> >> >
> >> >
> >> > *enable_multiple_failover_requests_from_node*
> >> > This parameter works in connection with *failover_require_consensus*
> >> > config. When enabled, a single Pgpool-II node can vote for failover
> >> > multiple times.
> >> > For example, in a three-node cluster, if one Pgpool-II node sends the
> >> > failover request for a particular backend node twice, that would be
> >> > counted as two votes in favour of the failover, and the failover will
> >> > be performed even if we do not get a vote from the other two nodes.
> >> >
> >> > And when *enable_multiple_failover_requests_from_node* is disabled,
> >> > only the first vote from each Pgpool-II will be accepted and all
> >> > subsequent votes will be marked duplicate and rejected.
> >> > So in that case we will require a majority of votes from distinct
> >> > nodes to execute the failover.
> >> > Again, *enable_multiple_failover_requests_from_node* only becomes
> >> > effective when both *failover_when_quorum_exists* and
> >> > *failover_require_consensus* are enabled.
> >> >
> >> >
> >> > *Controlling the failover: The Coding perspective.*
> >> > Although the failover functions are made quorum and consensus aware,
> >> > there is still a way to bypass the quorum conditions and the
> >> > requirement of consensus.
> >> >
> >> > For this, the patch uses the existing request_details flags in
> >> > POOL_REQUEST_NODE to control the behaviour of failover.
> >> >
> >> > Here are the newly added flag values.
> >> >
> >> > *REQ_DETAIL_WATCHDOG*:
> >> > Setting this flag while issuing the failover command will not send
> >> > the failover request to the watchdog. This flag may not be useful in
> >> > any place other than where it is already used.
> >> > Mostly this flag is used to keep a failover command that already
> >> > originated from the watchdog from being sent back to the watchdog;
> >> > otherwise we could end up in an infinite loop.
> >> >
> >> > *REQ_DETAIL_CONFIRMED*:
> >> > Setting this flag will bypass the *failover_require_consensus*
> >> > configuration and immediately perform the failover if the quorum is
> >> > present. This flag can be used for failover requests originating from
> >> > a PCP command.
> >> >
> >> > *REQ_DETAIL_UPDATE*:
> >> > This flag is used for the command that fails back the quarantined
> >> > nodes. Setting this flag will not trigger the failback_command.
> >> >
> >> > *Some conditional flags used:*
> >> > I was not sure about the configuration of each type of failover
> >> > operation. We have three main failover operations: NODE_UP_REQUEST,
> >> > NODE_DOWN_REQUEST, and PROMOTE_NODE_REQUEST.
> >> > So I was wondering whether we need to give users a configuration
> >> > option to enable/disable quorum checking and consensus for each
> >> > individual failover operation type.
> >> > For example: is it a practical configuration where a user would want
> >> > to ensure quorum while performing a NODE_DOWN operation but does not
> >> > want it for NODE_UP?
> >> > So in this patch I use three compile-time defines to enable/disable
> >> > the individual failover operations, until we decide on the best
> >> > solution.
> >> >
> >> > NODE_UP_REQUIRE_CONSENSUS: defining it will enable the quorum
> >> > checking feature for NODE_UP_REQUESTs.
> >> >
> >> > NODE_DOWN_REQUIRE_CONSENSUS: defining it will enable the quorum
> >> > checking feature for NODE_DOWN_REQUESTs.
> >> >
> >> > NODE_PROMOTE_REQUIRE_CONSENSUS: defining it will enable the quorum
> >> > checking feature for PROMOTE_NODE_REQUESTs.
> >> >
> >> > *Some Points for Discussion:*
> >> >
> >> > *Do we really need to check the Req_info->switching flag before
> >> > enqueuing a failover request?*
> >> > While working on the patch I was wondering why we disallow enqueuing
> >> > the failover command when a failover is already in progress. For
> >> > example, in the *pcp_process_command*() function, if we see the
> >> > *Req_info->switching* flag set, we bail out with an error instead of
> >> > enqueuing the command. Is it really necessary?
> >> >
> >> > *Do we need more granular control over each failover operation?*
> >> > As described in the section "Some conditional flags used", I want
> >> > opinions on whether we need configuration parameters in pgpool.conf
> >> > to enable/disable quorum and consensus checking for individual
> >> > failover types.
> >> >
> >> > *Which failovers should be marked as Confirmed?*
> >> > As described in the REQ_DETAIL_CONFIRMED section above, we can mark a
> >> > failover request as not needing consensus; currently the requests
> >> > from the PCP commands are fired with this flag. But I was wondering
> >> > whether there may be more places where we need to use the flag.
> >> > For example, I currently use the same confirmed flag when the
> >> > failover is triggered because of *replication_stop_on_mismatch*.
> >> >
> >> > I think we should consider this flag for each place a failover is
> >> > triggered, e.g. when the failover is triggered
> >> > because of a health_check failure,
> >> > because of a replication mismatch,
> >> > because of a backend_error,
> >> > etc.
> >> >
> >> > *Node Quarantine behaviour.*
> >> > What do you think about the node quarantine used by this patch? Can
> >> > you think of any problems that could be caused by it?
> >> >
> >> > *What should be the default values for each newly added config
> >> > parameter?*
> >> >
> >> >
> >> >
> >> > *TODOs*
> >> >
> >> > -- Updating the documentation is still a todo. Will do that once
> >> > every aspect of the feature is finalised.
> >> > -- Some code warnings and cleanups are still not done.
> >> > -- I am still a little short on testing.
> >> > -- Regression test cases for the feature.
> >> >
> >> >
> >> > Thoughts and suggestions are most welcome.
> >> >
> >> > Thanks
> >> > Best regards
> >> > Muhammad Usama
> >>
>
[Attachments: scenario1.png
<http://www.sraoss.jp/pipermail/pgpool-hackers/attachments/20171129/5a703c2f/attachment-0002.png>
and scenario2.png
<http://www.sraoss.jp/pipermail/pgpool-hackers/attachments/20171129/5a703c2f/attachment-0003.png>]

