[pgpool-hackers: 3333] Re: [pgpool-committers: 5734] pgpool: Fix for duplicate failover request ...

Muhammad Usama m.usama at gmail.com
Tue May 28 18:57:53 JST 2019


On Mon, May 27, 2019 at 10:25 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> > Hello, Usama and pgpool-hackers
> >
> > Thank you very much for writing and committing patches to solve the issue
> > that I reported.
> >
> > What do you think about back-patching them into 3.7.x or older versions?
> > A Pgpool-II watchdog cluster can easily fall into a state where it
> > doesn't work without the patch.
> >
> > The quarantine feature hasn't been documented in detail up to now, so
> > I think the behavior change of the first patch (*1) has no problem.
> > My customer who has the trouble related to this mail thread uses the
> > 3.7.x version and strongly hopes for such a pgpool improvement.
> >
> > *1
> >
> https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commitdiff;h=3dd1cd3f1528
>
> We know that this patch breaks the existing regression test. So I
> don't think the behavior change would go unnoticed by users if the
> patch were back-ported.
>

I sort of agree with Takatsuka-San that continuing the health checks on
quarantined backend nodes (the first patch) is not the kind of change that
should cause problems if we decide to backport it to older branches. In
fact, it should be treated as a bug fix and back-patched to all affected
branches. Also, the test case failure after that commit was not caused by
the behaviour or functionality change; rather, it was due to a missing
facility in the health-check debugging/testing mechanism that the test
case required after the change.
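For reference, the health-check retry tuning mentioned in the quoted
discussion maps to these pgpool.conf parameters (the values here are only
illustrative):

```
# pgpool.conf -- tolerate short network glitches before acting on a failure
health_check_period      = 10   # seconds between health checks
health_check_timeout     = 20   # seconds before a check is considered failed
health_check_max_retries = 5    # retries before the failure is acted upon
health_check_retry_delay = 5    # seconds between retries
```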
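To illustrate the intent of the first patch, here is a minimal sketch
(hypothetical types and names, not the actual code in
src/main/health_check.c): the health check keeps probing a quarantined
node, and a successful probe lifts the quarantine automatically, so a
transient network glitch cannot leave the node quarantined forever.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified node states, loosely mirroring pgpool-II backend status. */
typedef enum { NODE_UP, NODE_QUARANTINE, NODE_DOWN } node_status;

typedef struct
{
	node_status status;
	int         failed_checks;
	int         max_retries;
} backend_node;

/*
 * One health-check round for a node; 'reachable' is the probe result.
 * The key point: a quarantined node keeps being probed, and a successful
 * probe lifts the quarantine instead of leaving the node stuck there.
 */
void
health_check_tick(backend_node *node, bool reachable)
{
	if (reachable)
	{
		node->failed_checks = 0;
		if (node->status == NODE_QUARANTINE)
			node->status = NODE_UP;		/* auto-recover from quarantine */
		return;
	}

	if (++node->failed_checks > node->max_retries &&
		node->status == NODE_UP)
		node->status = NODE_QUARANTINE;	/* e.g. failover consensus not reached */
}
```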

Thanks
Best regards
Muhammad Usama


>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
>
> > Thanks,
> > Haruka Takatsuka
> >
> >
> > On Tue, 21 May 2019 12:00:03 +0500
> > Muhammad Usama <m.usama at gmail.com> wrote:
> >
> >> Hi Ishii-San
> >>
> >> The discussion on the thread [pgpool-hackers: 3318] yielded two patches:
> >> one related to continuing the health check on quarantined nodes, and
> >> the other related to the de-escalation and resignation of the master
> >> watchdog when the primary backend node gets into the quarantine state
> >> on the master. This commit only takes care of the first part
> >> (continuing the health check); I still have to commit the second patch,
> >> which takes care of the resignation from master status. The fix for the
> >> regression failure of test 013.watchdog_failover_require_consensus will
> >> also follow with the second patch.
> >>
> >> I am sorry, I think I am missing something about the consensus reached
> >> in the discussion. I thought we agreed on the thread to commit both
> >> patches, but only to the master branch, since this was a change of the
> >> existing behaviour and we didn't want to back-port it to older branches.
> >>
> >> Please see the snippet from our discussion on the thread, from which I
> >> inferred that we were in agreement to commit the changes:
> >>
> >> --quote--
> >> ...
> >> > Now if we look at the quarantined nodes, they are just as good as
> >> > alive nodes (only unreachable by pgpool at the moment), because when
> >> > the node was quarantined, Pgpool-II never executed any failover
> >> > and/or follow_master commands and did not interfere with the
> >> > PostgreSQL backend in any way to alter its timeline or recovery
> >> > state. So when a quarantined node becomes reachable again, it is
> >> > safe to automatically connect it back to Pgpool-II.
> >>
> >> Ok, that makes sense.
> >>
> >> >> >> BTW,
> >> >> >>
> >> >> >> > > When the communication between the master/coordinator pgpool
> >> >> >> > > and the primary PostgreSQL node is down for a short period
> >> >> >> >
> >> >> >> > I wonder why you don't set appropriate health check retry
> >> >> >> > parameters to avoid such a temporary communication failure in
> >> >> >> > the first place. Brain surgery to ignore the error reports from
> >> >> >> > Pgpool-II does not seem to be a sane choice.
> >> >> >>
> >> >> >> The original reporter didn't answer my question. I think it is
> >> >> >> likely a problem of misconfiguration (they should use longer
> >> >> >> health check retries).
> >> >> >>
> >> >> >> In summary, I think for a short-period communication failure,
> >> >> >> just increasing the health check parameters is enough. However,
> >> >> >> for a longer-period communication failure, the watchdog node
> >> >> >> should decline the role.
> >> >> >>
> >> >> >
> >> >> > I am sorry, I didn't totally get what you mean here.
> >> >> > Do you mean that the pgpool-II node that has the primary node in
> >> >> > the quarantine state should resign as the master/coordinator
> >> >> > pgpool-II node (if it was the master/coordinator) in that case?
> >> >>
> >> >> Yes, exactly. Note that if the PostgreSQL node is one of the
> >> >> standbys, keeping the quarantine state is fine because users'
> >> >> queries can still be processed.
> >> >>
> >> >
> >> > Yes, that makes total sense. I will make that change as a separate patch.
> >>
> >> Thanks. However, this will change existing behavior. Probably we should
> >> make the change against the master branch only?
> >>
> >> --unquote--
> >>
> >>
> >>
> >> On Tue, May 21, 2019 at 4:32 AM Tatsuo Ishii <ishii at sraoss.co.jp>
> wrote:
> >>
> >> > Usama,
> >> >
> >> > Since this commit, the regression test/buildfarm has been failing:
> >> >
> >> > testing 013.watchdog_failover_require_consensus... failed.
> >> >
> >> > Also, this commit seems to go against the consensus made in the
> >> > discussion thread [pgpool-hackers: 3295].
> >> >
> >> > I thought we agreed on [pgpool-hackers: 3318] so that:
> >> >
> >> > ---------------------------------------------------------------------
> >> > Hi Usama,
> >> >
> >> > > Hi
> >> > >
> >> > > I have drafted a patch to make the master watchdog node resign from
> >> > > its master responsibilities if it fails to get the consensus for its
> >> > > primary backend node failover request. The patch is still a little
> >> > > short on testing, but I want to share the early version to get
> >> > > feedback on the behaviour.
> >> > > Also, with this implementation the master/coordinator node only
> >> > > resigns from being the master when it fails to get the consensus
> >> > > for the primary node failover; in case of a failed consensus for a
> >> > > standby node failover, no action is taken by the watchdog master
> >> > > node. Do you think the master should also resign in this case?
> >> >
> >> > I don't think so, because queries can still be routed to the primary
> >> > (or to other standby servers if there are two or more standbys).
> >> >
> >>
> >> My understanding from this part of the discussion was that we agreed to
> >> keep the master status of the watchdog node if one of the standby nodes
> >> on the pgpool watchdog master gets into quarantine, and only go for
> >> resignation if the primary gets quarantined.
> >>
> >> Have I misunderstood something?
> >>
> >> Thanks
> >> Best regards
> >> Muhammad Usama
> >>
> >>
> >>
> >>
> >> > ---------------------------------------------------------------------
> >> >
> >> > From: Muhammad Usama <m.usama at gmail.com>
> >> > Subject: [pgpool-committers: 5734] pgpool: Fix for [pgpool-hackers:
> 3295]
> >> > duplicate failover request ...
> >> > Date: Wed, 15 May 2019 21:40:01 +0000
> >> > Message-ID: <E1hR1d3-0005o4-1k at gothos.postgresql.org>
> >> >
> >> > > Fix for [pgpool-hackers: 3295] duplicate failover request ...
> >> > >
> >> > > Pgpool should keep the backend health check running on quarantined
> >> > > nodes so that when the connectivity resumes, they should
> >> > > automatically get removed from the quarantine. Otherwise the
> >> > > temporary network glitch could send the node into permanent
> >> > > quarantine state.
> >> > >
> >> > > Branch
> >> > > ------
> >> > > master
> >> > >
> >> > > Details
> >> > > -------
> >> > >
> >> >
> https://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=3dd1cd3f15287ee6bb8b09f0642f99db98e9776a
> >> > >
> >> > > Modified Files
> >> > > --------------
> >> > > src/main/health_check.c | 28 ++++++++++++++++++++++++----
> >> > > 1 file changed, 24 insertions(+), 4 deletions(-)
> >> > >
> >
> > _______________________________________________
> > pgpool-hackers mailing list
> > pgpool-hackers at pgpool.net
> > http://www.pgpool.net/mailman/listinfo/pgpool-hackers
>