[pgpool-hackers: 3298] Re: duplicate failover request over allow_multiple_failover_requests_from_node=off

Tue Apr 16 11:55:01 JST 2019

Hi Usama,

> Hi TAKATSUKA Haruka,
> 
> Thanks for the patch, but your patch effectively disables the node
> quarantine, which doesn't seem the right way.
> The backend node that was quarantined because of the absence of quorum
> and/or consensus is already unreachable from the Pgpool-II node.
> We don't want to select it as the load-balance node (in case the node
> was a secondary), or to consider it available when it is not, which is
> what would happen if we did not mark it as quarantined.
> 
> In my opinion the right way to tackle the issue is to keep setting the
> quarantine state as is done currently, but also to keep the health check
> working on quarantined nodes, so that as soon as the connectivity to the
> quarantined node resumes, it becomes part of the cluster again
> automatically.

What if the connection failure between the primary PostgreSQL and one
of the Pgpool-II servers is permanent? Wouldn't health checking then
continue forever?

> Can you please try out the attached patch, to see if the solution works for
> the situation?
> The patch is generated against current master branch.
> 
> Thanks
> Best Regards
> Muhammad Usama
> 
> On Wed, Apr 10, 2019 at 2:04 PM TAKATSUKA Haruka <harukat at sraoss.co.jp>
> wrote:
> 
>> Hello, Pgpool developers
>>
>>
>> I found that the Pgpool-II watchdog is too strict about duplicate
>> failover requests with the allow_multiple_failover_requests_from_node=off
>> setting.
>>
>> For example, consider a watchdog cluster with 3 pgpool instances.
>> Their backends are PostgreSQL servers using streaming replication.
>>
>> When the communication between the master/coordinator pgpool and
>> the primary PostgreSQL node is down for a short period
>> (or pgpool makes a false-positive judgement for whatever reason),
>> the pgpool tries to fail over but cannot get consensus,
>> so it puts the primary node into quarantine status. That status cannot
>> be reset automatically. As a result, the service becomes unavailable.
>>
>> This case generates logs like the following:
>>
>> pid 1234: LOG:  new IPC connection received
>> pid 1234: LOG:  watchdog received the failover command from local
>> pgpool-II on IPC interface
>> pid 1234: LOG:  watchdog is processing the failover command
>> [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
>> pid 1234: LOG:  Duplicate failover request from "pg1:5432 Linux pg1" node
>> pid 1234: DETAIL:  request ignored
>> pid 1234: LOG:  failover requires the majority vote, waiting for consensus
>> pid 1234: DETAIL:  failover request noted
>> pid 4321: LOG:  degenerate backend request for 1 node(s) from pid [4321],
>> is changed to quarantine node request by watchdog
>> pid 4321: DETAIL:  watchdog is taking time to build consensus
>>
>> Note that this case doesn't involve any communication trouble among
>> the Pgpool watchdog nodes.
>> You can reproduce it by changing one PostgreSQL's pg_hba.conf to
>> reject the health check access from one pgpool node for a short period.
>>
>> The documentation doesn't say that duplicate failover requests make the
>> node quarantined immediately. I think the request should just be ignored.
>>
>> A patch file against the head of V3_7_STABLE is attached.
>> Pgpool with this patch still blocks a failover driven by a single pgpool's
>> repeated failover requests, but it can recover once the connection trouble
>> is gone.
>>
>> Does this change have any problem?
>>
>>
>> with best regards,
>> TAKATSUKA Haruka <harukat at sraoss.co.jp>
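
For readers following the thread, the behaviour under discussion can be sketched as a toy model. This is not Pgpool-II source code: the class `FailoverVote`, its fields, and the `ignore_duplicates` switch are hypothetical names invented for illustration. The sketch assumes a simple majority rule and models only what the report describes: with allow_multiple_failover_requests_from_node=off, a repeated request from the same pgpool node adds no vote, and a request that fails to reach consensus is turned into a quarantine. The `ignore_duplicates` switch stands in for the alternative suggested in the original report, where a duplicate request is simply dropped.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverVote:
    """Toy model of failover voting for one backend (hypothetical, not Pgpool-II code)."""

    cluster_size: int                 # number of pgpool watchdog nodes
    ignore_duplicates: bool = False   # True models the "just ignore it" proposal
    voters: set = field(default_factory=set)

    @property
    def votes(self) -> int:
        # Each watchdog node counts at most once, as with
        # allow_multiple_failover_requests_from_node=off.
        return len(self.voters)

    def request(self, watchdog_node: str) -> str:
        """Record a failover request and return the modelled outcome."""
        duplicate = watchdog_node in self.voters
        if duplicate and self.ignore_duplicates:
            # Proposed behaviour: drop the request, leave the backend state alone.
            return "ignored"
        self.voters.add(watchdog_node)
        # Assumed rule: a strict majority of watchdog nodes must agree.
        if self.votes > self.cluster_size // 2:
            return "failover"
        # Reported behaviour: no consensus, so the request becomes a quarantine.
        return "quarantine"


current = FailoverVote(cluster_size=3)
print(current.request("pgpool0"))   # first request, 1 of 3 votes
print(current.request("pgpool0"))   # duplicate adds no vote
print(current.request("pgpool1"))   # second distinct node reaches the majority
```

In this model a single pgpool node can never trigger a failover on its own in either mode; the two modes differ only in whether its repeated requests leave the backend quarantined.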