[pgpool-hackers: 3304] Re: duplicate failover request over allow_multiple_failover_requests_from_node=off

Tatsuo Ishii ishii at sraoss.co.jp
Tue Apr 16 17:27:21 JST 2019


>> Question is, why can't we automatically recover from detached state as
>> well as quarantine state?
>>
> 
> Well, ideally we should also automatically recover from the detached
> state, but the problem is that when a node is detached, specifically
> the primary node, the failover procedure promotes a standby to make
> it the new master, and follow_master adjusts the standby nodes to
> point to the new master. Now even when the old primary that was
> detached becomes reachable again, attaching it automatically would
> lead to a variety of problems, including split-brain. I think it is
> possible to implement a mechanism that verifies the detached
> PostgreSQL node's status when it becomes reachable again and, after
> taking the appropriate actions, attaches it back automatically, but
> currently we don't have anything like that in Pgpool. So instead we
> rely on user intervention to do the re-attach, using pcp_attach_node
> or the online recovery mechanism.
> 
> Now if we look at the quarantined nodes, they are just as good as
> alive nodes (only unreachable by pgpool at the moment). When a node
> was quarantined, Pgpool-II never executed any failover and/or
> follow_master commands and did not interfere with the PostgreSQL
> backend in any way that would alter its timeline or recovery state.
> So when the quarantined node becomes reachable again, it is safe to
> automatically connect it back to Pgpool-II.

Ok, that makes sense.
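
For reference, the manual re-attach you describe would typically be
done with pcp_attach_node once the old primary has been verified and
re-synchronized (for example via pcp_recovery_node). A minimal sketch;
host, port, PCP user, and node id are just examples:

    # re-attach backend node 0 through the PCP interface
    pcp_attach_node -h localhost -p 9898 -U pcp_admin -n 0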

>> >> BTW,
>> >>
>> >> > > When the communication between master/coordinator pgpool and
>> >> > > primary PostgreSQL node is down during a short period
>> >> >
>> >> > I wonder why you don't set appropriate health check retry parameters
>> >> > to avoid such a temporary communication failure in the first place.
>> >> > Brain surgery to ignore the error reports from Pgpool-II does not
>> >> > seem to be a sane choice.
>> >>
>> >> The original reporter didn't answer my question. I think it is likely
>> >> a problem of misconfiguration (they should use a longer health check
>> >> retry).
>> >>
>> >> In summary, I think that for a short-period communication failure,
>> >> just increasing the health check retry parameters is enough.
>> >> However, for a longer-period communication failure, the watchdog
>> >> node should decline the role.
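
For completeness, the retry tuning I mean here lives in pgpool.conf.
The values below are examples only; they should be sized to ride out
the longest communication outage you expect:

    health_check_period      = 10   # seconds between health checks
    health_check_timeout     = 20   # timeout for a single health check
    health_check_max_retries = 5    # retries before declaring the node down
    health_check_retry_delay = 5    # seconds between retries
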
>> >>
>> >
>> > I am sorry, I didn't totally get what you mean here.
>> > Do you mean that the pgpool-II node that has the primary node in
>> > quarantine state should resign from the master/coordinator role
>> > (if it was the master/coordinator) in that case?
>>
>> Yes, exactly. Note that if the PostgreSQL node is one of the
>> standbys, keeping the quarantine state is fine because users'
>> queries can still be processed.
>>
> 
> Yes, that makes total sense. I will make that change as a separate patch.

Thanks. However, this will change the existing behavior. Probably we
should make the change against the master branch only?
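
BTW, for readers of the archive: the consensus behavior discussed in
this thread is governed by the watchdog settings below, shown with the
values from the reported scenario:

    failover_when_quorum_exists                = on
    failover_require_consensus                 = on
    allow_multiple_failover_requests_from_node = off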

> Thanks
> Best Regards
> Muhammad Usama
> 
> 
>> > Thanks
>> > Best Regards
>> > Muhammad Usama
>> >
>> >
>> >> >> > Can you please try out the attached patch to see if the
>> >> >> > solution works for the situation?
>> >> >> > The patch is generated against the current master branch.
>> >> >> >
>> >> >> > Thanks
>> >> >> > Best Regards
>> >> >> > Muhammad Usama
>> >> >> >
>> >> >> > On Wed, Apr 10, 2019 at 2:04 PM TAKATSUKA Haruka
>> >> >> > <harukat at sraoss.co.jp> wrote:
>> >> >> >
>> >> >> >> Hello, Pgpool developers
>> >> >> >>
>> >> >> >>
>> >> >> >> I found that the Pgpool-II watchdog is too strict about
>> >> >> >> duplicate failover requests under the
>> >> >> >> allow_multiple_failover_requests_from_node = off setting.
>> >> >> >>
>> >> >> >> For example, take a watchdog cluster of 3 pgpool instances
>> >> >> >> whose backends are PostgreSQL servers using streaming
>> >> >> >> replication.
>> >> >> >>
>> >> >> >> When the communication between the master/coordinator pgpool
>> >> >> >> and the primary PostgreSQL node is down for a short period
>> >> >> >> (or pgpool makes a false-positive judgement for various
>> >> >> >> reasons), the pgpool tries to fail over but cannot get the
>> >> >> >> consensus, so it puts the primary node into quarantine status.
>> >> >> >> The quarantine cannot be reset automatically. As a result,
>> >> >> >> the service becomes unavailable.
>> >> >> >>
>> >> >> >> This case generates logs like the following:
>> >> >> >>
>> >> >> >> pid 1234: LOG:  new IPC connection received
>> >> >> >> pid 1234: LOG:  watchdog received the failover command from local pgpool-II on IPC interface
>> >> >> >> pid 1234: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
>> >> >> >> pid 1234: LOG:  Duplicate failover request from "pg1:5432 Linux pg1" node
>> >> >> >> pid 1234: DETAIL:  request ignored
>> >> >> >> pid 1234: LOG:  failover requires the majority vote, waiting for consensus
>> >> >> >> pid 1234: DETAIL:  failover request noted
>> >> >> >> pid 4321: LOG:  degenerate backend request for 1 node(s) from pid [4321], is changed to quarantine node request by watchdog
>> >> >> >> pid 4321: DETAIL:  watchdog is taking time to build consensus
>> >> >> >>
>> >> >> >> Note that this case doesn't involve any communication trouble
>> >> >> >> among the Pgpool watchdog nodes. You can reproduce it by
>> >> >> >> changing one PostgreSQL's pg_hba.conf to reject the health
>> >> >> >> check access from one pgpool node for a short period.
>> >> >> >>
>> >> >> >> The documentation doesn't say that duplicate failover requests
>> >> >> >> put the node into quarantine immediately. I think Pgpool should
>> >> >> >> just ignore such a request.
>> >> >> >>
>> >> >> >> A patch file against the head of V3_7_STABLE is attached.
>> >> >> >> Pgpool with this patch still blocks failover on a single
>> >> >> >> pgpool's repeated failover requests, but it can recover once
>> >> >> >> the connection trouble is gone.
>> >> >> >>
>> >> >> >> Does this change cause any problems?
>> >> >> >>
>> >> >> >>
>> >> >> >> with best regards,
>> >> >> >> TAKATSUKA Haruka <harukat at sraoss.co.jp>
>> >> >> >> _______________________________________________
>> >> >> >> pgpool-hackers mailing list
>> >> >> >> pgpool-hackers at pgpool.net
>> >> >> >> http://www.pgpool.net/mailman/listinfo/pgpool-hackers
>> >> >> >>
>> >> >>
>> >>
>>
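
P.S. To reproduce the reported scenario, it should be enough to
temporarily add a reject line near the top of pg_hba.conf on one
backend, matching only the address of one pgpool node (the address
below is an example), and reload PostgreSQL before and after:

    # reject all connections (including health checks) from one pgpool node
    host    all    all    192.168.1.11/32    reject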

