[pgpool-hackers: 3318] Re: duplicate failover request over allow_multiple_failover_requests_from_node=off

Tatsuo Ishii ishii at sraoss.co.jp
Fri Apr 19 15:45:06 JST 2019


Hi Usama,

> Hi
> 
> I have drafted a patch to make the master watchdog node resign from its
> master responsibilities if it fails to get consensus for its primary
> backend node failover request. The patch is still a little short on
> testing, but I want to share the early version to get feedback on the
> behaviour.
> Also, with this implementation the master/coordinator node only resigns
> from being master when it fails to get consensus for a primary node
> failover; in the case of a failed consensus for a standby node failover,
> no action is taken by the watchdog master node. Do you think the master
> should resign in this case as well?

I don't think so, because queries can still be routed to the primary (or
to other standby servers if there are two or more standbys).
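
To make sure we are talking about the same behaviour, here is a minimal
sketch of the decision I have in mind (the type and function names below
are made up for illustration only; they are not taken from your patch):

    #include <stdbool.h>
    #include <stdio.h>

    /* illustrative stand-in for the backend node a failover request targets */
    typedef struct
    {
        int  node_id;
        bool is_primary;
    } BackendNode;

    /*
     * Called on the watchdog master/coordinator when its failover request
     * for "node" failed to get consensus from the other Pgpool-II nodes.
     */
    static void
    on_failover_consensus_failed(const BackendNode *node)
    {
        if (node->is_primary)
            printf("node %d is the primary: resign as coordinator so another "
                   "node can take over\n", node->node_id);
        else
            printf("node %d is a standby: keep the coordinator role, queries "
                   "can still reach the primary\n", node->node_id);
    }

    int
    main(void)
    {
        BackendNode primary = {0, true};
        BackendNode standby = {1, false};

        on_failover_consensus_failed(&primary);
        on_failover_consensus_failed(&standby);
        return 0;
    }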

> Thanks
> Best Regards
> Muhammad Usama
> 
> 
> 
> On Tue, Apr 16, 2019 at 3:16 PM Muhammad Usama <m.usama at gmail.com> wrote:
> 
>> Hi Haruka Takatsuka,
>>
>> On Tue, Apr 16, 2019 at 2:42 PM TAKATSUKA Haruka <harukat at sraoss.co.jp>
>> wrote:
>>
>>> Hello Usama, and Pgpool Hackers
>>>
>>> Thanks for your answer.
>>> I tried your patch adjusting it for V3.7.x.
>>>
>> Thanks for trying out the patch.
>>
>>
>>> In the scenario where the enabled health check detects the connection
>>> failure and its recovery, it works fine. But in the scenario where the
>>> health check is disabled and only frontend requests detect the failure,
>>> the quarantine status persists in Pgpool-II.
>>>
>>
>> Yes, for disabled health-check scenarios it is difficult to recover the
>> node automatically. But, again, it is not advisable to use the consensus
>> mechanism for failover with the health check disabled, because that would
>> lead to a situation where the watchdog never reaches consensus even for
>> genuine backend failures: the other pgpool nodes that are not serving
>> clients would never learn about the backend node failure; they would keep
>> sitting idle and would never vote for the failover.
>>
>> I believe that is also documented in the failover_require_consensus section
>> of the documentation.
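>>
>> For example, a combination along these lines (the values are only
>> illustrative) keeps every Pgpool-II node probing the backends, so all of
>> them can vote when a backend really goes down:
>>
>>     # consensus based failover over watchdog
>>     failover_when_quorum_exists = on
>>     failover_require_consensus = on
>>     allow_multiple_failover_requests_from_node = off
>>
>>     # keep the health check enabled on every Pgpool-II node, otherwise
>>     # idle nodes never notice the failure and never vote
>>     health_check_period = 10
>>     health_check_timeout = 20
>>     health_check_user = 'pgpool'
>>     health_check_max_retries = 3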
>>
>>
>>>
>>> I understand that this patch aims to recover from the quarantine status
>>> via the health check. I confirmed that it works well. I think it can be
>>> a help for our customers in certain cases.
>>>
>>
>>
>>> However, there is the problem Ishii-san pointed out: the node keeps
>>> emitting health check failure messages while the cause remains.
>>>
>> That's a valid observation, and I guess we can downgrade the log message
>> in that case and make it a DEBUG log.
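>>
>> Roughly something like this at the point where the health check failure
>> is reported (the condition and the message text here are only
>> placeholders, not the actual code):
>>
>>     /* while the node is quarantined, keep probing it but log quietly */
>>     ereport(node_is_quarantined ? DEBUG1 : LOG,
>>             (errmsg("health check failed on node %d", node_id)));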
>>
>>
>>
>>> A pgpool node that notices it cannot get consensus, or that it is in the
>>> minority, goes down soon; I prefer this simple behaviour to quarantining.
>>> Can anyone tell me the reason why this design wasn't adopted?
>>>
>> Taking the node down would be too aggressive a strategy, and that would
>> actually defeat the purpose.
>> The original idea of building consensus for failover was to guard against
>> temporary network glitches, because failover is a very expensive operation
>> and comes with its own complexities and the possibility of data loss.
>> Now consider the option of taking down the pgpool node when it is not able
>> to build consensus for a backend node failure because of some network
>> glitch. That would mean that as soon as the glitch occurs, the setup loses
>> one pgpool node. That is a disaster in itself, since the setup would then
>> have one less pgpool node, which is not only bad for the high-availability
>> requirements but might also cause the setup to lose its quorum altogether.
>>
>> So I guess the best way out here is what we discussed above: when the
>> master/coordinator node fails to build consensus, it should give up its
>> coordinator status and let the watchdog elect a new leader.
>>
>> Thanks
>> Best Regards
>> Muhammad Usama
>>
>>
>>> with best regards,
>>> Haruka Takatsuka
>>>
>>>
>>> On Mon, 15 Apr 2019 19:14:54 +0500
>>> Muhammad Usama <m.usama at gmail.com> wrote:
>>>
>>> > Thanks for the patch, but it effectively disables the node quarantine,
>>> > which doesn't seem the right way. The backend node that was quarantined
>>> > because of the absence of quorum and/or consensus is already unreachable
>>> > from the Pgpool-II node, and we don't want to select it as the
>>> > load-balance node (in case the node was a secondary) or, by not marking
>>> > it as quarantined, consider it available when it is not.
>>> >
>>> > In my opinion the right way to tackle the issue is to keep setting the
>>> > quarantine state as is done currently, but also to keep the health check
>>> > working on quarantined nodes, so that as soon as the connectivity to the
>>> > quarantined node resumes, it becomes part of the cluster again
>>> > automatically.
>>> >
>>> > Can you please try out the attached patch to see if the solution works
>>> > for the situation? The patch is generated against the current master
>>> > branch.
>>>
>>> _______________________________________________
>>> pgpool-hackers mailing list
>>> pgpool-hackers at pgpool.net
>>> http://www.pgpool.net/mailman/listinfo/pgpool-hackers
>>>
>>

