[pgpool-hackers: 3301] Re: duplicate failover request over allow_multiple_failover_requests_from_node=off

Muhammad Usama m.usama at gmail.com
Tue Apr 16 16:24:21 JST 2019


On Tue, Apr 16, 2019 at 12:14 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> > On Tue, Apr 16, 2019 at 7:55 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> >
> >> Hi Usama,
> >>
> >> > Hi TAKATSUKA Haruka,
> >> >
> >> > Thanks for the patch, but your patch effectively disables node
> >> > quarantine, which doesn't seem like the right way.
> >> > A backend node that was quarantined because of the absence of quorum
> >> > and/or consensus is already unreachable from the Pgpool-II node, and
> >> > we don't want to select it as the load-balance node (in case the node
> >> > was a secondary) or treat it as available when it is not by failing
> >> > to mark it as quarantined.
> >> >
> >> > In my opinion the right way to tackle the issue is to keep setting the
> >> > quarantine state as is done currently, but also keep the health check
> >> > working on quarantined nodes, so that as soon as the connectivity to
> >> > the quarantined node resumes, it becomes part of the cluster again
> >> > automatically.
> >>
> >> What if the connection failure between the primary PostgreSQL and one
> >> of the Pgpool-II servers is permanent? Doesn't health checking continue
> >> forever?
> >>
> >
> > Yes, but only for the quarantined PostgreSQL nodes, and I don't think
> > there is a problem with that. Conceptually, quarantined nodes are not
> > failed nodes (they are just unusable at that moment), and taking a node
> > out of the quarantine zone shouldn't require manual intervention. So I
> > think continuing the health checks on quarantined nodes is the correct
> > way.
> >
> > Do you see an issue with this approach?
>
> Yes. Think about the case where the PostgreSQL node is the primary. Users
> cannot issue write queries while the retrying continues. The network
> failure could persist for days, and the whole database cluster would be
> unusable during that period.
>

Yes, that's true, but not allowing the node to go into the quarantine state
still would not solve it, because the primary would be unavailable anyway
whether or not we set the quarantine state. So the whole idea of this patch
is to recover from the quarantine state automatically as soon as the
connectivity resumes.
Similarly, failing over that node is again not an option if the user wants
failover to happen only when the consensus exists; otherwise they should
just disable failover_require_consensus.
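
For reference, the two watchdog settings being discussed map to pgpool.conf
entries along these lines (a minimal sketch; the values simply illustrate
the configuration this thread assumes):

    # require a majority vote among the watchdog nodes before failing a
    # backend over; a request that cannot reach consensus puts the backend
    # into quarantine instead
    failover_require_consensus = on

    # count repeated failover requests from the same pgpool node only once
    # toward that consensus
    allow_multiple_failover_requests_from_node = off

Turning failover_require_consensus off lets a single pgpool node trigger the
failover on its own, which is the trade-off mentioned above.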


> BTW,
>
> > > When the communication between master/coordinator pgpool and
> > > primary PostgreSQL node is down during a short period
> >
> > I wonder why you don't set appropriate health check retry parameters
> > to avoid such a temporary communication failure in the first place.
> > Brain surgery to ignore the error reports from Pgpool-II does not seem
> > to be a sane choice.
>
> The original reporter didn't answer my question. I think it is likely a
> problem of misconfiguration (they should use a longer health check retry).
>
> In summary, I think that for a short-period communication failure just
> increasing the health check parameters is enough. However, for a
> long-period communication failure, the watchdog node should decline the
> role.
>
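
For reference, the health check retry tuning suggested in the quoted message
corresponds to pgpool.conf parameters along these lines (a sketch only; the
values are illustrative and should be sized to how long a network glitch
ought to be tolerated before a failover/quarantine request is raised):

    health_check_period      = 10   # seconds between health check rounds
    health_check_timeout     = 20   # seconds before a single check attempt is abandoned
    health_check_max_retries = 10   # failed attempts tolerated before reporting the node down
    health_check_retry_delay = 5    # seconds to wait between retries

With such values a backend has to stay unreachable for roughly
health_check_max_retries * (health_check_retry_delay + health_check_timeout)
seconds in the worst case, i.e. a few minutes with the numbers above, before
a degenerate (failover) request is issued.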

I am sorry, I didn't totally get what you mean here.
Do you mean that the Pgpool-II node that has the primary node in the
quarantine state should resign from being the master/coordinator
Pgpool-II node (if it was the master/coordinator) in that case?

Thanks
Best Regards
Muhammad Usama


> >> > Can you please try out the attached patch, to see if the solution
> >> > works for the situation?
> >> > The patch is generated against the current master branch.
> >> >
> >> > Thanks
> >> > Best Regards
> >> > Muhammad Usama
> >> >
> >> > On Wed, Apr 10, 2019 at 2:04 PM TAKATSUKA Haruka <harukat at sraoss.co.jp>
> >> > wrote:
> >> >
> >> >> Hello, Pgpool developers
> >> >>
> >> >>
> >> >> I found that the Pgpool-II watchdog is too strict about duplicate
> >> >> failover requests with the allow_multiple_failover_requests_from_node=off
> >> >> setting.
> >> >>
> >> >> For example, consider a watchdog cluster with 3 pgpool instances
> >> >> whose backends are PostgreSQL servers using streaming replication.
> >> >>
> >> >> When the communication between the master/coordinator pgpool and the
> >> >> primary PostgreSQL node is down for a short period
> >> >> (or pgpool makes a false-positive judgement for various reasons),
> >> >> the pgpool tries to fail over but cannot get the consensus,
> >> >> so it puts the primary node into quarantine status. This cannot
> >> >> be reset automatically. As a result, the service becomes unavailable.
> >> >>
> >> >> This case generates logs like the following:
> >> >>
> >> >> pid 1234: LOG:  new IPC connection received
> >> >> pid 1234: LOG:  watchdog received the failover command from local
> >> >> pgpool-II on IPC interface
> >> >> pid 1234: LOG:  watchdog is processing the failover command
> >> >> [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
> >> >> pid 1234: LOG:  Duplicate failover request from "pg1:5432 Linux pg1" node
> >> >> pid 1234: DETAIL:  request ignored
> >> >> pid 1234: LOG:  failover requires the majority vote, waiting for consensus
> >> >> pid 1234: DETAIL:  failover request noted
> >> >> pid 4321: LOG:  degenerate backend request for 1 node(s) from pid [4321],
> >> >> is changed to quarantine node request by watchdog
> >> >> pid 4321: DETAIL:  watchdog is taking time to build consensus
> >> >>
> >> >> Note that this case doesn't have any communication trouble among
> >> >> the Pgpool watchdog nodes.
> >> >> You can reproduce it by changing one PostgreSQL's pg_hba.conf to
> >> >> reject the health check access from one pgpool node for a short period.
> >> >>
> >> >> The documentation doesn't say that duplicate failover requests put the
> >> >> node into quarantine immediately. I think it should just ignore the
> >> >> request.
> >> >>
> >> >> A patch file for the head of V3_7_STABLE is attached.
> >> >> Pgpool with this patch still blocks failover triggered by a single
> >> >> pgpool's repeated failover requests, but it can recover when the
> >> >> connection trouble is gone.
> >> >>
> >> >> Does this change cause any problems?
> >> >>
> >> >>
> >> >> with best regards,
> >> >> TAKATSUKA Haruka <harukat at sraoss.co.jp>
> >> >> _______________________________________________
> >> >> pgpool-hackers mailing list
> >> >> pgpool-hackers at pgpool.net
> >> >> http://www.pgpool.net/mailman/listinfo/pgpool-hackers
> >> >>
> >>
>
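
For anyone trying to reproduce the report quoted above, the pg_hba.conf
change Haruka describes would look roughly like the following on one
PostgreSQL backend (the user name and the pgpool host address are
placeholders for whatever the actual setup uses, and the line has to appear
before the normal allow rules):

    # temporarily reject health check connections from one pgpool node only
    host  all  pgpool_hc_user  192.168.1.11/32  reject

After a configuration reload, that single pgpool node's health check fails
while the other watchdog nodes still see the backend as healthy, which is
exactly the no-consensus situation that ends in quarantine.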

