[pgpool-hackers: 2621] Re: New Feature with patch: Quorum and Consensus for backend failover

Tatsuo Ishii ishii at sraoss.co.jp
Wed Nov 29 02:07:51 JST 2017


> Hi Ishii-San
> 
> On Tue, Nov 28, 2017 at 8:09 AM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> Hi Usama,
>>
>> I have a question regarding the condition for "quorum exists".
>> I think "g_cluster.quorum_status >= 0" is the condition for "quorum
>> exists".
>> Am I correct?
>>
>> BTW, I observed the following in watchdog.c:
>>
>> When there are only two watchdog nodes participating and both are
>> alive, then g_cluster.quorum_status == 1, which means "quorum exists",
>> because the following condition is satisfied
>> (get_mimimum_nodes_required_for_quorum() returns
>> (g_cluster.remoteNodeCount - 1) / 2, which is 0):
>>
>>         if (g_cluster.clusterMasterInfo.standby_nodes_count >
>>             get_mimimum_nodes_required_for_quorum())
>>         {
>>                 g_cluster.quorum_status = 1;
>>         }
>>
>> In this case, if pool_config->failover_when_quorum_exists is on and
>> pool_config->failover_require_consensus is on, then failover will always
>> be performed, because:
>>
>>                 if (failoverObj->request_count <=
>>                     get_mimimum_nodes_required_for_quorum())
>>
>> is always false (get_mimimum_nodes_required_for_quorum() returns 0).
>>
>> So, it seems that for a two-watchdog cluster, regardless of the settings of
>> failover_when_quorum_exists and failover_require_consensus, Pgpool-II
>> behaves as before. Is my understanding correct?
>>
> 
> Yes, that is correct. This is because the watchdog considers the quorum
> to exist when 50% of the nodes are alive, so for a two-node cluster a
> single node is enough to complete the quorum.
> That's why we recommend a minimum of three nodes, and an odd number of them.
> 
> Do you have any reservations about this, or would you like the behaviour changed?

No, I just wanted to confirm the current behavior. Thanks for the explanation.
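
For the archive, here is the arithmetic behind the snippets quoted above (the
concrete values are just the two-watchdog-node case we discussed):

    /*
     * Two watchdog nodes in total, both alive:
     *   g_cluster.remoteNodeCount = 1
     *   get_mimimum_nodes_required_for_quorum() = (1 - 1) / 2 = 0
     *
     * Quorum check:    standby_nodes_count (1) > 0        -> quorum_status = 1
     * Consensus check: request_count (>= 1) <= 0 is false -> failover proceeds
     *                  on the very first request.
     */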

> Thanks
> Best Regards
> Muhammad Usama
> 
> 
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>>
>> > Hi Usama,
>> >
>> > While writing presentation material for Pgpool-II 3.7, I realized I am
>> > not sure I understand the quorum/consensus behavior.
>> >
>> >> *enable_multiple_failover_requests_from_node*
>> >> This parameter works in connection with *failover_require_consensus*
>> >> config. When enabled a single Pgpool-II node can vote for failover
>> >> multiple times.
>> >
>> > In what situation could a Pgpool-II node send multiple failover
>> > requests? My guess is the following scenario:
>> >
>> > 1) The Pgpool-II watchdog standby health check process detects the failure
>> >    of backend A and sends a failover request to the master Pgpool-II.
>> >
>> > 2) Since the vote does not satisfy the quorum/consensus requirement,
>> >    failover does not occur. Only backend_info->quarantine is set and
>> >    backend_info->backend_status is set to CON_DOWN.
>> >
>> > 3) The Pgpool-II watchdog standby health check process detects the failure
>> >    of backend A again, then sends a failover request to the master
>> >    Pgpool-II again. If enable_multiple_failover_requests_from_node is
>> >    set, failover will happen.
>> >
>> > But after thinking about it more, I realized that in step 3, since
>> > backend_status is already set to CON_DOWN, the health check will not be
>> > performed against backend A. So the watchdog standby will not send
>> > multiple votes.
>> >
>> > Apparently I am missing something here.
>> >
>> > Can you please tell me in what scenario a watchdog node sends
>> > multiple votes for failover?
>> >
>> > Best regards,
>> > --
>> > Tatsuo Ishii
>> > SRA OSS, Inc. Japan
>> > English: http://www.sraoss.co.jp/index_en.php
>> > Japanese:http://www.sraoss.co.jp
>> >
>> > From: Muhammad Usama <m.usama at gmail.com>
>> > Subject: New Feature with patch: Quorum and Consensus for backend failover
>> > Date: Tue, 22 Aug 2017 00:18:27 +0500
>> > Message-ID: <CAEJvTzUbz-d8dfsJdLt=XNYWdOMxKf06sp+p=uAbxyjvG=vS3A@mail.gmail.com>
>> >
>> >> Hi
>> >>
>> >> I was working on the new feature to make the backend node failover
>> >> quorum aware, and halfway through the implementation I also added a
>> >> majority-consensus feature for the same.
>> >>
>> >> So please find the first version of the patch for review; it makes the
>> >> backend node failover consider the watchdog cluster quorum status and
>> >> seek a majority consensus before performing the failover.
>> >>
>> >> *Changes in the Failover mechanism with watchdog.*
>> >> For this new feature I have modified Pgpool-II's existing failover
>> >> mechanism with watchdog.
>> >> Previously, as you know, when Pgpool-II needed to perform a node
>> >> operation (failover, failback, promote-node) with the watchdog, the
>> >> watchdog propagated the failover request to all the Pgpool-II nodes
>> >> in the watchdog cluster, and as soon as the request was received by a
>> >> node it initiated the local failover, and that failover was
>> >> synchronised on all nodes using the distributed locks.
>> >>
>> >> *Now only the master node performs the failover.*
>> >> The attached patch changes the mechanism of synchronised failover: now
>> >> only the Pgpool-II of the master watchdog node performs the failover, and
>> >> all other standby nodes sync the backend statuses after the master
>> >> Pgpool-II has finished the failover.
>> >>
>> >> *Overview of the new failover mechanism.*
>> >> -- If the failover request is received by a standby watchdog node (from
>> >> the local Pgpool-II), that request is forwarded to the master watchdog
>> >> and the Pgpool-II main process gets the FAILOVER_RES_WILL_BE_DONE return
>> >> code. Upon receiving FAILOVER_RES_WILL_BE_DONE from the watchdog for the
>> >> failover request, the requesting Pgpool-II moves on without doing
>> >> anything further for that particular failover command (a rough sketch of
>> >> this path follows the steps below).
>> >>
>> >> -- When the failover request from a standby node is received by the
>> >> master watchdog, after performing the validation and applying the
>> >> consensus rules, the failover is triggered on its local Pgpool-II.
>> >>
>> >> -- When the failover request is received by the master watchdog node from
>> >> its local Pgpool-II (on the IPC channel), the watchdog process informs
>> >> the requesting Pgpool-II process to proceed with the failover (provided
>> >> all failover rules are satisfied).
>> >>
>> >> -- After the failover is finished on the master Pgpool-II, the failover
>> >> function calls *wd_failover_end*(), which sends the backend-sync-required
>> >> message to all standby watchdogs.
>> >>
>> >> -- Upon receiving the sync-required message from the master watchdog node,
>> >> all Pgpool-II nodes sync the new statuses of each backend node from the
>> >> master watchdog.
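>> >>
>> >> A rough sketch of the standby-side path (conceptual only; the helper name
>> >> wd_send_failover_request() is made up for illustration, only
>> >> FAILOVER_RES_WILL_BE_DONE is an actual return code mentioned above):
>> >>
>> >>     /* In a standby Pgpool-II main process, simplified: */
>> >>     int res = wd_send_failover_request(node_id);   /* hypothetical helper */
>> >>
>> >>     if (res == FAILOVER_RES_WILL_BE_DONE)
>> >>     {
>> >>         /*
>> >>          * The master watchdog will validate the request, apply the
>> >>          * consensus rules, run the failover there and later broadcast
>> >>          * the sync-required message, so nothing more to do locally.
>> >>          */
>> >>         return;
>> >>     }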
>> >>
>> >> *No more failover locks*
>> >> Since with this new failover mechanism we no longer require any
>> >> synchronisation or guards against the execution of failover_commands by
>> >> multiple Pgpool-II nodes, the patch removes all the distributed locks
>> >> from the failover function. This makes the failover simpler and faster.
>> >>
>> >> *New kind of failover operation: NODE_QUARANTINE_REQUEST*
>> >> The patch adds a new kind of backend node operation, NODE_QUARANTINE,
>> >> which is effectively the same as NODE_DOWN, but with node quarantine the
>> >> failover_command is not triggered.
>> >> A NODE_DOWN_REQUEST is automatically converted to a
>> >> NODE_QUARANTINE_REQUEST when failover is requested on a backend node
>> >> but the watchdog cluster does not hold the quorum.
>> >> This means that in the absence of quorum the failed backend nodes are
>> >> quarantined, and when the quorum becomes available again Pgpool-II
>> >> performs the failback operation on all quarantined nodes.
>> >> And again, when the failback is performed on a quarantined backend node,
>> >> the failover function does not trigger the failback_command.
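>> >>
>> >> Conceptually the conversion looks like this (a simplified sketch, not the
>> >> exact patch code; wd_cluster_holds_quorum() is a made-up helper name):
>> >>
>> >>     if (kind == NODE_DOWN_REQUEST &&
>> >>         pool_config->failover_when_quorum_exists &&
>> >>         !wd_cluster_holds_quorum())         /* hypothetical helper */
>> >>     {
>> >>         /* quarantine instead of failing over: failover_command not run */
>> >>         kind = NODE_QUARANTINE_REQUEST;
>> >>     }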
>> >>
>> >> *Controlling the failover behaviour.*
>> >> The patch adds three new configuration parameters to configure the
>> >> failover behaviour from the user side (an example combination is shown
>> >> after their descriptions).
>> >>
>> >> *failover_when_quorum_exists*
>> >> When enabled, the failover command will only be executed when the
>> >> watchdog cluster holds the quorum. And when the quorum is absent and
>> >> failover_when_quorum_exists is enabled, the failed backend nodes get
>> >> quarantined until the quorum becomes available again.
>> >> Disabling it restores the old behaviour of failover commands.
>> >>
>> >>
>> >> *failover_require_consensus*
>> >> This new configuration parameter can be used to make sure we get a
>> >> majority vote before performing the failover on a node. When
>> >> *failover_require_consensus* is enabled, the failover is only performed
>> >> after receiving failover requests from a majority of Pgpool-II nodes.
>> >> For example, in a three-node cluster the failover will not be performed
>> >> until at least two nodes ask for the failover on the particular
>> >> backend node.
>> >>
>> >> It is also worthwhile to mention here that *failover_require_consensus*
>> >> only works when failover_when_quorum_exists is enabled.
>> >>
>> >>
>> >> *enable_multiple_failover_requests_from_node*
>> >> This parameter works in connection with the *failover_require_consensus*
>> >> config. When enabled, a single Pgpool-II node can vote for failover
>> >> multiple times.
>> >> For example, in a three-node cluster, if one Pgpool-II node sends the
>> >> failover request for a particular backend node twice, that is counted as
>> >> two votes in favour of the failover, and the failover will be performed
>> >> even if we do not get a vote from the other two nodes.
>> >>
>> >> When *enable_multiple_failover_requests_from_node* is disabled, only
>> >> the first vote from each Pgpool-II node is accepted and all other
>> >> subsequent votes are marked as duplicates and rejected.
>> >> So in that case we require a majority of votes from distinct nodes to
>> >> execute the failover.
>> >> Again, *enable_multiple_failover_requests_from_node* only becomes
>> >> effective when both *failover_when_quorum_exists* and
>> >> *failover_require_consensus* are enabled.
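>> >>
>> >> For illustration, a strict three-node setup could use the following
>> >> pgpool.conf settings (the values are just an example, not proposed
>> >> defaults):
>> >>
>> >>     failover_when_quorum_exists = on
>> >>     failover_require_consensus = on
>> >>     enable_multiple_failover_requests_from_node = off
>> >>
>> >> With this combination a backend is only detached once the quorum exists
>> >> and at least two distinct Pgpool-II nodes (the majority of three) have
>> >> requested the failover; otherwise the node is quarantined.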
>> >>
>> >>
>> >> *Controlling the failover: the coding perspective.*
>> >> Although the failover functions are made quorum and consensus aware,
>> >> there is still a way to bypass the quorum conditions and the requirement
>> >> of consensus.
>> >>
>> >> For this the patch uses the existing request_details flags in
>> >> POOL_REQUEST_NODE to control the behaviour of failover.
>> >>
>> >> Here are the newly added flag values.
>> >>
>> >> *REQ_DETAIL_WATCHDOG*:
>> >> Setting this flag while issuing the failover command keeps the failover
>> >> request from being sent to the watchdog. This flag may not be useful in
>> >> any place other than where it is already used.
>> >> Mostly this flag is used to keep a failover command that already
>> >> originated from the watchdog from being sent back to the watchdog;
>> >> otherwise we could end up in an infinite loop.
>> >>
>> >> *REQ_DETAIL_CONFIRMED*:
>> >> Setting this flag will bypass the *failover_require_consensus*
>> >> configuration and immediately perform the failover if quorum is present.
>> >> This flag can be used for failover requests originating from PCP
>> >> commands (see the small sketch after these flag descriptions).
>> >>
>> >> *REQ_DETAIL_UPDATE*:
>> >> This flag is used for the command with which we fail back the quarantined
>> >> nodes. Setting this flag will not trigger the failback_command.
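>> >>
>> >> A tiny usage sketch (conceptual; POOL_REQUEST_NODE, request_details and
>> >> the flag/request names are from this patch, while the exact field layout
>> >> is assumed):
>> >>
>> >>     POOL_REQUEST_NODE req;
>> >>
>> >>     memset(&req, 0, sizeof(req));
>> >>     req.kind = NODE_DOWN_REQUEST;
>> >>     /* e.g. a PCP-initiated detach: skip the consensus requirement */
>> >>     req.request_details |= REQ_DETAIL_CONFIRMED;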
>> >>
>> >> *Some conditional flags used:*
>> >> I was not sure about the configuration for each type of failover
>> >> operation. As we have three main failover operations, NODE_UP_REQUEST,
>> >> NODE_DOWN_REQUEST, and PROMOTE_NODE_REQUEST, I was wondering whether we
>> >> need to give users a configuration option to enable/disable quorum
>> >> checking and consensus for each individual failover operation type.
>> >> For example: is it a practical configuration where a user would want to
>> >> ensure quorum while performing a NODE_DOWN operation but does not want
>> >> it for NODE_UP?
>> >> So in this patch I use three compile-time defines to enable/disable this
>> >> for the individual failover operations, until we can decide on the best
>> >> solution (a gating sketch follows the list below).
>> >>
>> >> NODE_UP_REQUIRE_CONSENSUS: defining it will enable quorum checking
>> >> feature for NODE_UP_REQUESTs
>> >>
>> >> NODE_DOWN_REQUIRE_CONSENSUS: defining it will enable quorum checking
>> >> feature for NODE_DOWN_REQUESTs
>> >>
>> >> NODE_PROMOTE_REQUIRE_CONSENSUS: defining it will enable quorum checking
>> >> feature for PROMOTE_NODE_REQUESTs
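>> >>
>> >> A conceptual sketch of how these defines gate the check (not the patch
>> >> code; only the request kinds and define names are from this mail):
>> >>
>> >>     bool need_quorum_check = false;
>> >>
>> >>     switch (kind)
>> >>     {
>> >> #ifdef NODE_UP_REQUIRE_CONSENSUS
>> >>         case NODE_UP_REQUEST:      need_quorum_check = true; break;
>> >> #endif
>> >> #ifdef NODE_DOWN_REQUIRE_CONSENSUS
>> >>         case NODE_DOWN_REQUEST:    need_quorum_check = true; break;
>> >> #endif
>> >> #ifdef NODE_PROMOTE_REQUIRE_CONSENSUS
>> >>         case PROMOTE_NODE_REQUEST: need_quorum_check = true; break;
>> >> #endif
>> >>         default:                   break;
>> >>     }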
>> >>
>> >> *Some points for discussion:*
>> >>
>> >> *Do we really need to check the Req_info->switching flag before enqueuing
>> >> a failover request?*
>> >> While working on the patch I was wondering why we disallow enqueuing a
>> >> failover command when a failover is already in progress. For example, in
>> >> the *pcp_process_command*() function, if we see the *Req_info->switching*
>> >> flag set we bail out with an error instead of enqueuing the command. Is
>> >> this really necessary?
>> >>
>> >> *Do we need more granular control over each failover operation?*
>> >> As described in the section "Some conditional flags used", I would like
>> >> opinions on whether we need configuration parameters in pgpool.conf to
>> >> enable/disable quorum and consensus checking for individual failover
>> >> types.
>> >>
>> >> *Which failover requests should be marked as confirmed?*
>> >> As described in the REQ_DETAIL_CONFIRMED section above, we can mark a
>> >> failover request as not needing consensus; currently the requests from
>> >> the PCP commands are fired with this flag. But I was wondering whether
>> >> there may be more places where we need to use the flag.
>> >> For example, I currently use the same confirmed flag when failover is
>> >> triggered because of *replication_stop_on_mismatch*.
>> >>
>> >> I think we should consider this flag for each place where failover is
>> >> triggered, e.g. when the failover is triggered
>> >> because of a health_check failure,
>> >> because of a replication mismatch,
>> >> because of a backend_error,
>> >> etc.
>> >>
>> >> *Node quarantine behaviour.*
>> >> What do you think about the node quarantine behaviour used by this patch?
>> >> Can you think of any problems it could cause?
>> >>
>> >> *What should be the default value for each newly added config
>> >> parameter?*
>> >>
>> >>
>> >>
>> >> *TODOs*
>> >>
>> >> -- Updating the documentation is still to do. Will do that once every
>> >> aspect of the feature is finalised.
>> >> -- Some code warnings and cleanups are still not done.
>> >> -- I am still a little short on testing.
>> >> -- Regression test cases for the feature are still needed.
>> >>
>> >>
>> >> Thoughts and suggestions are most welcome.
>> >>
>> >> Thanks
>> >> Best regards
>> >> Muhammad Usama
>>

