<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, May 27, 2019 at 10:25 AM Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp">ishii@sraoss.co.jp</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">> Hello, Usama and pgpool-hackers<br>
> <br>
> Thank you very much for writing and committing patches to solve the issue<br>
> that I reported.<br>
> <br>
> What do you think about back-patching them into 3.7.x or older versions?<br>
> Pgpool-II watchdog cluster can easily fall into the status that doesn't<br>
> work without the patch.<br>
> <br>
> The quarantine feature hasn't been documented in detail up to now, so<br>
> I think the behavior change of the first patch (*1) has no problem.<br>
> My customer who has trouble related to this mail thread uses 3.7.x version<br>
> and strongly hopes such a pgpool improvement.<br>
> <br>
> *1<br>
> <a href="https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commitdiff;h=3dd1cd3f1528" rel="noreferrer" target="_blank">https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commitdiff;h=3dd1cd3f1528</a><br>
<br>
We know that this patch breaks existing regression test. So I don't<br>
think no user would notice the behavior change if the patch is<br>
back ported.<br></blockquote><div><br></div><div>I sort of agree with Takatsuka San, that the continuous health checking on the quarantined backed nodes (first patch)</div><div>is not the kind of change that should cause problems if we decide to backport it to older branches, In fact. it should be treated</div><div>as a bug and back-patched to all affected branches. Also, the reason for the test case failure happening after that commit</div><div>was not because of the behaviour or functionality change. rather it was due to the missing facility in health-check</div><div>debugging/testing mechanism which was required by that test case after the change. </div><div><br></div><div>Thanks</div><div>Best regards </div><div>Muhammad Usama</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
Best regards,<br>
--<br>
Tatsuo Ishii<br>
SRA OSS, Inc. Japan<br>
English: <a href="http://www.sraoss.co.jp/index_en.php" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_en.php</a><br>
Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.jp</a><br>
<br>
> Thanks,<br>
> Haruka Takatsuka<br>
> <br>
> <br>
> On Tue, 21 May 2019 12:00:03 +0500<br>
> Muhammad Usama <<a href="mailto:m.usama@gmail.com" target="_blank">m.usama@gmail.com</a>> wrote:<br>
> <br>
>> Hi Ishii-San<br>
>> <br>
>> The discussion on the thread [pgpool-hackers: 3318] yielded two patches,<br>
>> one was related<br>
>> to continuing the health check on the quarantined node and the other one<br>
>> was related to the<br>
>> de-escalation and resigning of the master watchdog if the primary backend<br>
>> node gets into<br>
>> quarantine state on the master.<br>
>> So this commit only takes care of the first part that is to continue health<br>
>> check and I still have<br>
>> to commit the second patch taking care of the resigning from master status<br>
>> part. The regression<br>
>> failure of test 013.watchdog_failover_require_consensus will also follow<br>
>> the second patch for this issue.<br>
>> <br>
>> I am sorry I think I am missing something on the part of consensus made in<br>
>> the discussion,<br>
>> I think we agreed on the thread to commit both the patch but only in the<br>
>> master branch since<br>
>> it was change of the existing behaviour and we don't wanted to back port it<br>
>> to older branches.<br>
>> <br>
>> Please see the snippet from our discussion on the thread from which I infer<br>
>> that we are in agreement<br>
>> to commit the changes<br>
>> <br>
>> --quote--<br>
>> ...<br>
>> > Now if we look at the quarantine nodes, they are just as good as alive<br>
>> > nodes (but unreachable by pgpool at the moment).<br>
>> > Because when the node was quarantined, Pgpool-II never executed any<br>
>> > failover and/or follow_master commands<br>
>> > and did not interfered with the PostgreSQL backend in any way to alter its<br>
>> > timeline or recovery states,<br>
>> > So when the quarantine node becomes reachable again it is safe to<br>
>> > automatically connect them back to the Pgpool-II<br>
>> <br>
>> Ok, that makes sense.<br>
>> <br>
>> >> >> BTW,<br>
>> >> >><br>
>> >> >> > > When the communication between master/coordinator pgpool and<br>
>> >> >> > > primary PostgreSQL node is down during a short period<br>
>> >> >> ><br>
>> >> >> > I wonder why you don't set appropriate health check retry parameters<br>
>> >> >> > to avoid such a temporary communication failure in the firs place. A<br>
>> >> >> > brain surgery to ignore the error reports from Pgpool-II does not<br>
>> seem<br>
>> >> >> > to be a sane choice.<br>
>> >> >><br>
>> >> >> The original reporter didn't answer my question. I think it is likely<br>
>> >> >> a problem of misconfiguraton (should use longer heath check retry).<br>
>> >> >><br>
>> >> >> In summary I think for shorter period communication failure just<br>
>> >> >> increasing health check parameters is enough. However for longer<br>
>> >> >> period communication failure, the watchdog node should decline the<br>
>> >> >> role.<br>
>> >> >><br>
>> >> ><br>
>> >> > I am sorry I didn't totally get it what you mean here.<br>
>> >> > Do you mean that the pgpool-II node that has the primary node in<br>
>> >> quarantine<br>
>> >> > state should resign from the master/coordinator<br>
>> >> > pgpool-II node (if it was a master/coordinator) in that case?<br>
>> >><br>
>> >> Yes, exactly. Note that if the PostgreSQL node is one of standbys,<br>
>> >> keeping the quarantine state is fine because users query could be<br>
>> >> processed.<br>
>> >><br>
>> ><br>
>> > Yes that makes total sense. I will make that change as separate patch.<br>
>> <br>
>> Thanks. However this will change existing behavior. Probably we should<br>
>> make the change against master branch only?<br>
>> <br>
>> --un quote--<br>
>> <br>
>> <br>
>> <br>
>> On Tue, May 21, 2019 at 4:32 AM Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp" target="_blank">ishii@sraoss.co.jp</a>> wrote:<br>
>> <br>
>> > Usama,<br>
>> ><br>
>> > Since this commit regression test/buildfarm are failing:<br>
>> ><br>
>> > testing 013.watchdog_failover_require_consensus... failed.<br>
>> ><br>
>> > Also I think this commit seems to be against the consensus made in the<br>
>> > discussion [pgpool-hackers: 3295] thread.<br>
>> ><br>
>> > I thought we agreed on [pgpool-hackers: 3318] so that:<br>
>> ><br>
>> > ---------------------------------------------------------------------<br>
>> > Hi Usama,<br>
>> ><br>
>> > > Hi<br>
>> > ><br>
>> > > I have drafted a patch to make the master watchdog node resigns from<br>
>> > master<br>
>> > > responsibilities if it fails to get the consensus for its<br>
>> > > primary backend node failover request. The patch is still little short on<br>
>> > > testing but I want to share the early version to get<br>
>> > > the feedback on behaviour.<br>
>> > > Also with this implementation the master/coordinator node only resigns<br>
>> > from<br>
>> > > being a master<br>
>> > > when it fails to get the consensus for the primary node failover, but in<br>
>> > > case of failed consensus for standby node failover<br>
>> > > no action is taken by the watchdog master node. Do you think master<br>
>> > should<br>
>> > > also resign in this case as well ?<br>
>> ><br>
>> > I don't think so because still queries can be routed to primary (or<br>
>> > other standby servers if there are two or more standbys).<br>
>> ><br>
>> <br>
>> My understand from this part of discussion was that, we agreed to keep the<br>
>> master status<br>
>> of the watchdog node if one of the standby node on the pgpool<br>
>> watchdog-master gets into<br>
>> quarantine and only go for resignation if the primary gets quarantine.<br>
>> <br>
>> Have I misunderstood something?<br>
>> <br>
>> Thanks<br>
>> Best regards<br>
>> Muhammad Usama<br>
>> <br>
>> <br>
>> <br>
>> <br>
>> > ---------------------------------------------------------------------<br>
>> ><br>
>> > From: Muhammad Usama <<a href="mailto:m.usama@gmail.com" target="_blank">m.usama@gmail.com</a>><br>
>> > Subject: [pgpool-committers: 5734] pgpool: Fix for [pgpool-hackers: 3295]<br>
>> > duplicate failover request ...<br>
>> > Date: Wed, 15 May 2019 21:40:01 +0000<br>
>> > Message-ID: <<a href="mailto:E1hR1d3-0005o4-1k@gothos.postgresql.org" target="_blank">E1hR1d3-0005o4-1k@gothos.postgresql.org</a>><br>
>> ><br>
>> > > Fix for [pgpool-hackers: 3295] duplicate failover request ...<br>
>> > ><br>
>> > > Pgpool should keep the backend health check running on quarantined nodes<br>
>> > so<br>
>> > > that when the connectivity resumes, they should automatically get removed<br>
>> > > from the quarantine. Otherwise the temporary network glitch could send<br>
>> > the node<br>
>> > > into permanent quarantine state.<br>
>> > ><br>
>> > > Branch<br>
>> > > ------<br>
>> > > master<br>
>> > ><br>
>> > > Details<br>
>> > > -------<br>
>> > ><br>
>> > <a href="https://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=3dd1cd3f15287ee6bb8b09f0642f99db98e9776a" rel="noreferrer" target="_blank">https://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=3dd1cd3f15287ee6bb8b09f0642f99db98e9776a</a><br>
>> > ><br>
>> > > Modified Files<br>
>> > > --------------<br>
>> > > src/main/health_check.c | 28 ++++++++++++++++++++++++----<br>
>> > > 1 file changed, 24 insertions(+), 4 deletions(-)<br>
>> > ><br>
> <br>
> _______________________________________________<br>
> pgpool-hackers mailing list<br>
> <a href="mailto:pgpool-hackers@pgpool.net" target="_blank">pgpool-hackers@pgpool.net</a><br>
> <a href="http://www.pgpool.net/mailman/listinfo/pgpool-hackers" rel="noreferrer" target="_blank">http://www.pgpool.net/mailman/listinfo/pgpool-hackers</a><br>
_______________________________________________<br>
pgpool-hackers mailing list<br>
<a href="mailto:pgpool-hackers@pgpool.net" target="_blank">pgpool-hackers@pgpool.net</a><br>
<a href="http://www.pgpool.net/mailman/listinfo/pgpool-hackers" rel="noreferrer" target="_blank">http://www.pgpool.net/mailman/listinfo/pgpool-hackers</a><br>
</blockquote></div></div>