[pgpool-hackers: 3326] Re: [pgpool-committers: 5734] pgpool: Fix for duplicate failover request ...

Muhammad Usama m.usama at gmail.com
Tue May 21 16:00:03 JST 2019


Hi Ishii-San

The discussion on the thread [pgpool-hackers: 3318] yielded two patches: one
for continuing the health check on quarantined nodes, and the other for
de-escalation, i.e. the master watchdog resigning when the primary backend
node gets into the quarantine state on the master.
This commit only takes care of the first part (continuing the health check);
I still have to commit the second patch, which handles the resignation from
the master status. The fix for the regression failure of test
013.watchdog_failover_require_consensus will also come with that second patch.
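
Just to illustrate the first part, the idea is roughly the following (a
sketch only, with made-up names; this is not the actual health_check.c code):

    /*
     * Sketch only -- the names are made up and this is not the actual
     * pgpool-II health_check.c code.  The point of the first patch: keep
     * probing a quarantined node (one that was never failed over), so that
     * a successful probe lifts the quarantine automatically.
     */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct BackendNode
    {
        int  id;
        bool down;          /* detached by an actual failover */
        bool quarantined;   /* unreachable, but no failover was executed */
    } BackendNode;

    /* hypothetical probe: true when the backend answers the health check */
    static bool probe_backend(BackendNode *node)
    {
        (void) node;
        return true;        /* stub for the sketch */
    }

    static void health_check_one_node(BackendNode *node)
    {
        /*
         * A node detached by a real failover is not probed any more, but a
         * quarantined node still is -- this is what the patch changes;
         * previously quarantined nodes were skipped here as well.
         */
        if (node->down && !node->quarantined)
            return;

        /* A successful probe removes the quarantine automatically. */
        if (probe_backend(node) && node->quarantined)
        {
            node->quarantined = false;
            printf("node %d recovered from quarantine\n", node->id);
        }
    }

    int main(void)
    {
        BackendNode node = {1, false, true};   /* quarantined, not failed over */
        health_check_one_node(&node);
        return 0;
    }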

I am sorry, I think I am missing something about the consensus reached in the
discussion. My understanding from the thread was that we agreed to commit
both patches, but only to the master branch, since this is a change of
existing behaviour and we did not want to back-port it to older branches.

Please see the snippet from our discussion on the thread, from which I infer
that we are in agreement to commit the changes:

--quote--
...
> Now if we look at the quarantine nodes, they are just as good as alive
> nodes (but unreachable by pgpool at the moment).
> When the node was quarantined, Pgpool-II never executed any failover
> and/or follow_master commands and did not interfere with the PostgreSQL
> backend in any way to alter its timeline or recovery states. So when a
> quarantined node becomes reachable again, it is safe to automatically
> connect it back to Pgpool-II.

Ok, that makes sense.

>> >> BTW,
>> >>
>> >> > > When the communication between master/coordinator pgpool and
>> >> > > primary PostgreSQL node is down during a short period
>> >> >
>> >> > I wonder why you don't set appropriate health check retry parameters
>> >> > to avoid such a temporary communication failure in the first place. A
>> >> > brain surgery to ignore the error reports from Pgpool-II does not seem
>> >> > to be a sane choice.
>> >>
>> >> The original reporter didn't answer my question. I think it is likely
>> >> a problem of misconfiguration (should use a longer health check retry).
>> >>
>> >> In summary, I think for a short-period communication failure, just
>> >> increasing the health check parameters is enough. However, for a
>> >> longer-period communication failure, the watchdog node should decline
>> >> the role.
>> >>
>> >
>> > I am sorry, I didn't totally get what you mean here.
>> > Do you mean that the pgpool-II node that has the primary node in the
>> > quarantine state should resign from being the master/coordinator
>> > pgpool-II node (if it was the master/coordinator) in that case?
>>
>> Yes, exactly. Note that if the PostgreSQL node is one of the standbys,
>> keeping the quarantine state is fine because user queries can still be
>> processed.
>>
>
> Yes, that makes total sense. I will make that change as a separate patch.

Thanks. However, this will change existing behavior. Probably we should
make the change against the master branch only?

--unquote--
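
For completeness, the retry settings Ishii-San mentions above are the usual
health check parameters in pgpool.conf; for a short communication failure
something like the following should be enough (the values here are only an
example):

    # pgpool.conf -- example values only
    health_check_period      = 10    # probe each backend every 10 seconds
    health_check_timeout     = 20    # give up a single probe after 20 seconds
    health_check_max_retries = 10    # retries before the node is reported as down
    health_check_retry_delay = 5     # wait 5 seconds between retries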



On Tue, May 21, 2019 at 4:32 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> Usama,
>
> Since this commit, the regression test/buildfarm are failing:
>
> testing 013.watchdog_failover_require_consensus... failed.
>
> Also, I think this commit is against the consensus reached in the
> discussion thread [pgpool-hackers: 3295].
>
> I thought we agreed on [pgpool-hackers: 3318] so that:
>
> ---------------------------------------------------------------------
> Hi Usama,
>
> > Hi
> >
> > I have drafted a patch to make the master watchdog node resign from
> > master responsibilities if it fails to get the consensus for its
> > primary backend node failover request. The patch is still a little
> > short on testing, but I want to share the early version to get the
> > feedback on the behaviour.
> > Also, with this implementation the master/coordinator node only resigns
> > from being a master when it fails to get the consensus for the primary
> > node failover, but in case of a failed consensus for a standby node
> > failover no action is taken by the watchdog master node. Do you think
> > the master should also resign in this case as well?
>
> I don't think so, because queries can still be routed to the primary (or
> other standby servers if there are two or more standbys).
>

My understanding from this part of the discussion was that we agreed to keep
the master status of the watchdog node if one of the standby nodes on the
pgpool watchdog master gets into quarantine, and only go for resignation if
the primary gets quarantined.

Have I misunderstood something?
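
For reference, the behaviour I have in mind for the second patch is roughly
this (again a sketch with made-up names, not the actual watchdog code):

    /* Sketch only -- made-up names, not the actual pgpool-II watchdog code. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { NODE_PRIMARY, NODE_STANDBY } BackendRole;

    /*
     * Should the master/coordinator watchdog node resign when one of its
     * backends ends up quarantined (i.e. the failover request did not get
     * the consensus)?
     */
    static bool should_resign_master(BackendRole quarantined_role)
    {
        /*
         * Primary quarantined: this pgpool node cannot route writes any
         * more, so it steps down and lets another watchdog node take over.
         * Standby quarantined: queries can still be served through the
         * primary (and the remaining standbys), so it keeps the master
         * status.
         */
        return quarantined_role == NODE_PRIMARY;
    }

    int main(void)
    {
        printf("primary quarantined -> resign: %d\n",
               (int) should_resign_master(NODE_PRIMARY));   /* 1 */
        printf("standby quarantined -> resign: %d\n",
               (int) should_resign_master(NODE_STANDBY));   /* 0 */
        return 0;
    }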

Thanks
Best regards
Muhammad Usama




> ---------------------------------------------------------------------
>
> From: Muhammad Usama <m.usama at gmail.com>
> Subject: [pgpool-committers: 5734] pgpool: Fix for [pgpool-hackers: 3295]
> duplicate failover request ...
> Date: Wed, 15 May 2019 21:40:01 +0000
> Message-ID: <E1hR1d3-0005o4-1k at gothos.postgresql.org>
>
> > Fix for [pgpool-hackers: 3295] duplicate failover request ...
> >
> > Pgpool should keep the backend health check running on quarantined nodes
> > so that when connectivity resumes, they automatically get removed from
> > the quarantine. Otherwise a temporary network glitch could send the node
> > into a permanent quarantine state.
> >
> > Branch
> > ------
> > master
> >
> > Details
> > -------
> >
> https://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=3dd1cd3f15287ee6bb8b09f0642f99db98e9776a
> >
> > Modified Files
> > --------------
> > src/main/health_check.c | 28 ++++++++++++++++++++++++----
> > 1 file changed, 24 insertions(+), 4 deletions(-)
> >
>

