[pgpool-hackers: 3331] Re: [pgpool-committers: 5734] pgpool: Fix for duplicate failover request ...

Tatsuo Ishii ishii at sraoss.co.jp
Mon May 27 14:25:53 JST 2019


> Hello, Usama and pgpool-hackers
> 
> Thank you very much for writing and committing patches to solve the issue
> that I reported.
> 
> What do you think about back-patching them into 3.7.x or older versions?
> Without the patch, a Pgpool-II watchdog cluster can easily fall into a
> state where it no longer works.
> 
> The quarantine feature hasn't been documented in detail so far, so I think
> the behavior change of the first patch (*1) is not a problem.
> My customer who ran into the trouble described in this mail thread uses
> the 3.7.x series and strongly hopes for such a pgpool improvement.
> 
> *1
> https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commitdiff;h=3dd1cd3f1528

We know that this patch breaks an existing regression test, so I am not
convinced that no user would notice the behavior change if the patch
were back-ported.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> Thanks,
> Haruka Takatsuka
> 
> 
> On Tue, 21 May 2019 12:00:03 +0500
> Muhammad Usama <m.usama at gmail.com> wrote:
> 
>> Hi Ishii-San
>> 
>> The discussion on the thread [pgpool-hackers: 3318] yielded two patches:
>> one related to continuing the health check on quarantined nodes, and the
>> other related to the de-escalation and resignation of the master watchdog
>> when the primary backend node gets into the quarantine state on the master.
>> This commit only takes care of the first part, i.e. continuing the health
>> check; I still have to commit the second patch, which takes care of the
>> resignation from master status. The fix for the regression failure of
>> test 013.watchdog_failover_require_consensus will also come with that
>> second patch.
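For context, the 013.watchdog_failover_require_consensus test exercises the
consensus-based failover feature that leads to the quarantine state when
consensus is not reached. A minimal pgpool.conf sketch of the settings
involved (the values shown are illustrative, not a recommendation):

failover_when_quorum_exists = on
                        # perform failover only while the watchdog
                        # cluster holds a quorum
failover_require_consensus = on
                        # require agreement from a majority of watchdog
                        # nodes before a failover is executed
allow_multiple_failover_requests_from_node = off
                        # a single node cannot cast several votes for
                        # the same failover request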
>> 
>> I am sorry, I think I am missing something about the consensus reached in
>> the discussion. My understanding was that we agreed on the thread to commit
>> both patches, but only to the master branch, since they change the existing
>> behaviour and we did not want to back-port them to older branches.
>> 
>> Please see the snippet from our discussion on the thread from which I infer
>> that we are in agreement to commit the changes:
>> 
>> --quote--
>> ...
>> > Now if we look at the quarantined nodes, they are just as good as alive
>> > nodes (merely unreachable by pgpool at the moment). When the node was
>> > quarantined, Pgpool-II never executed any failover and/or follow_master
>> > commands and did not interfere with the PostgreSQL backend in any way
>> > that would alter its timeline or recovery state. So when the quarantined
>> > node becomes reachable again, it is safe to automatically connect it
>> > back to Pgpool-II.
>> 
>> Ok, that makes sense.
>> 
>> >> >> BTW,
>> >> >>
>> >> >> > > When the communication between master/coordinator pgpool and
>> >> >> > > primary PostgreSQL node is down during a short period
>> >> >> >
>> >> >> > I wonder why you don't set appropriate health check retry parameters
>> >> >> > to avoid such a temporary communication failure in the first place.
>> >> >> > Brain surgery to ignore the error reports from Pgpool-II does not
>> >> >> > seem to be a sane choice.
>> >> >>
>> >> >> The original reporter didn't answer my question. I think it is likely
>> >> >> a problem of misconfiguration (a longer health check retry should be
>> >> >> used).
>> >> >>
>> >> >> In summary, I think that for a short communication failure, just
>> >> >> increasing the health check retry parameters is enough. However, for a
>> >> >> longer communication failure, the watchdog node should decline the
>> >> >> role.
>> >> >>
>> >> >
>> >> > I am sorry, I didn't totally get what you mean here.
>> >> > Do you mean that the Pgpool-II node that has the primary node in the
>> >> > quarantine state should resign from being the master/coordinator
>> >> > Pgpool-II node (if it was the master/coordinator) in that case?
>> >>
>> >> Yes, exactly. Note that if the PostgreSQL node is one of the standbys,
>> >> keeping the quarantine state is fine because users' queries can still be
>> >> processed.
>> >>
>> >
>> > Yes, that makes total sense. I will make that change as a separate patch.
>> 
>> Thanks. However, this will change existing behavior. Probably we should
>> make the change against the master branch only?
>> 
>> --un quote--
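For reference, the "health check retry parameters" mentioned in the quoted
discussion above live in pgpool.conf. A minimal sketch (the values are only
placeholders and need to be tuned per environment):

health_check_period = 10          # seconds between health check rounds
health_check_timeout = 20         # give up a single probe after this long
health_check_max_retries = 10     # retries before the backend is considered
                                  # failed and a failover is requested
health_check_retry_delay = 3      # seconds to wait between retries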
>> 
>> 
>> 
>> On Tue, May 21, 2019 at 4:32 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> 
>> > Usama,
>> >
>> > Since this commit, the regression test / buildfarm has been failing:
>> >
>> > testing 013.watchdog_failover_require_consensus... failed.
>> >
>> > Also, this commit seems to go against the consensus reached in the
>> > [pgpool-hackers: 3295] discussion thread.
>> >
>> > I thought we agreed in [pgpool-hackers: 3318] as follows:
>> >
>> > ---------------------------------------------------------------------
>> > Hi Usama,
>> >
>> > > Hi
>> > >
>> > > I have drafted a patch to make the master watchdog node resign from
>> > > master responsibilities if it fails to get the consensus for its
>> > > primary backend node failover request. The patch is still a little short
>> > > on testing, but I want to share the early version to get feedback on the
>> > > behaviour.
>> > > Also, with this implementation the master/coordinator node only resigns
>> > > from being the master when it fails to get the consensus for the primary
>> > > node failover; in case of a failed consensus for a standby node failover,
>> > > no action is taken by the watchdog master node. Do you think the master
>> > > should resign in this case as well?
>> >
>> > I don't think so, because queries can still be routed to the primary (or
>> > to other standby servers if there are two or more standbys).
>> >
>> 
>> My understanding from this part of the discussion was that we agreed to
>> keep the master status of the watchdog node if one of the standby nodes on
>> the pgpool watchdog master gets into quarantine, and only go for
>> resignation if the primary gets quarantined.
>> 
>> Have I misunderstood something?
>> 
>> Thanks
>> Best regards
>> Muhammad Usama
>> 
>> 
>> 
>> 
>> > ---------------------------------------------------------------------
>> >
>> > From: Muhammad Usama <m.usama at gmail.com>
>> > Subject: [pgpool-committers: 5734] pgpool: Fix for [pgpool-hackers: 3295]
>> > duplicate failover request ...
>> > Date: Wed, 15 May 2019 21:40:01 +0000
>> > Message-ID: <E1hR1d3-0005o4-1k at gothos.postgresql.org>
>> >
>> > > Fix for [pgpool-hackers: 3295] duplicate failover request ...
>> > >
>> > > Pgpool should keep the backend health check running on quarantined
>> > > nodes so that when connectivity resumes, they automatically get removed
>> > > from the quarantine. Otherwise a temporary network glitch could send
>> > > the node into a permanent quarantine state.
>> > >
>> > > Branch
>> > > ------
>> > > master
>> > >
>> > > Details
>> > > -------
>> > >
>> > > https://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=3dd1cd3f15287ee6bb8b09f0642f99db98e9776a
>> > >
>> > > Modified Files
>> > > --------------
>> > > src/main/health_check.c | 28 ++++++++++++++++++++++++----
>> > > 1 file changed, 24 insertions(+), 4 deletions(-)
>> > >
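The gist of that health_check.c change can be sketched roughly as below. The
snippet is only an illustration of the approach with made-up names
(backend_node, probe_backend, ...), not the actual pgpool-II code; the real
change is in src/main/health_check.c as per the commit above.

/* Rough illustration only -- not the real pgpool-II code.  All names here
 * are invented for the sketch. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { NODE_UP, NODE_DOWN, NODE_QUARANTINE } node_status;

typedef struct
{
	int         id;
	node_status status;
} backend_node;

/* Stand-in for the real health check probe: true when the backend answers. */
static bool
probe_backend(const backend_node * node)
{
	return node->id >= 0;		/* always succeeds in this sketch */
}

static void
health_check_round(backend_node * nodes, int n_nodes)
{
	for (int i = 0; i < n_nodes; i++)
	{
		backend_node *node = &nodes[i];

		/* Properly failed-over nodes stay skipped, but quarantined nodes
		 * keep being probed -- that is the gist of the fix.  Skipping them
		 * as well is what left a node quarantined forever after a temporary
		 * network glitch. */
		if (node->status == NODE_DOWN)
			continue;

		if (probe_backend(node) && node->status == NODE_QUARANTINE)
		{
			/* Connectivity is back: lift the quarantine. */
			node->status = NODE_UP;
			printf("node %d removed from quarantine\n", node->id);
		}
	}
}

int
main(void)
{
	backend_node nodes[] = {{0, NODE_UP}, {1, NODE_QUARANTINE}};

	health_check_round(nodes, 2);
	return 0;
}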
> 
> _______________________________________________
> pgpool-hackers mailing list
> pgpool-hackers at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-hackers

