<div dir="ltr"><div>Hi Ishii-San</div><div><br></div><div>The discussion on the thread [pgpool-hackers: 3318] yielded two patches: one related</div><div>to continuing the health check on the quarantined node, and the other related to the</div><div>de-escalation and resignation of the master watchdog if the primary backend node gets into</div><div>quarantine state on the master.</div><div>This commit takes care of only the first part, which is to continue the health check; I still have</div><div>to commit the second patch, which takes care of resigning from master status. The regression</div><div>failure of test 013.watchdog_failover_require_consensus will also be addressed by the second patch.</div><div><br></div><div>I am sorry, I think I am missing something about the consensus reached in the discussion.</div><div>I think we agreed on the thread to commit both patches, but only to the master branch, since</div><div>it was a change to the existing behaviour and we didn't want to backport it to older branches.</div><div><br></div><div>Please see the snippet from our discussion on the thread, from which I infer that we are in agreement</div><div>to commit the changes:</div><div><br></div><div>--quote-- </div><div>...</div><div><div class="gmail-HOEnZb gmail-adM"><div class="gmail-im" style="color:rgb(80,0,80)">> Now if we look at the quarantine nodes, they are just as good as alive<br>> nodes (but unreachable by pgpool at the moment).<br>> Because when the node was quarantined, Pgpool-II never executed any<br>> failover and/or follow_master commands<br>> and did not interfered with the PostgreSQL backend in any way to alter its<br>> timeline or recovery states,<br>> So when the quarantine node becomes reachable again it is safe to<br>> automatically connect them back to the Pgpool-II<br><br></div></div>Ok, that makes sense.<span class="gmail-im" style="color:rgb(80,0,80)"><br><br>>> >> BTW,<br>>> >><br>>> >> > > When the communication between 
master/coordinator pgpool and<br>>> >> > > primary PostgreSQL node is down during a short period<br>>> >> ><br>>> >> > I wonder why you don't set appropriate health check retry parameters<br>>> >> > to avoid such a temporary communication failure in the firs place. A<br>>> >> > brain surgery to ignore the error reports from Pgpool-II does not seem<br>>> >> > to be a sane choice.<br>>> >><br>>> >> The original reporter didn't answer my question. I think it is likely<br>>> >> a problem of misconfiguraton (should use longer heath check retry).<br>>> >><br>>> >> In summary I think for shorter period communication failure just<br>>> >> increasing health check parameters is enough. However for longer<br>>> >> period communication failure, the watchdog node should decline the<br>>> >> role.<br>>> >><br>>> ><br>>> > I am sorry I didn't totally get it what you mean here.<br>>> > Do you mean that the pgpool-II node that has the primary node in<br>>> quarantine<br>>> > state should resign from the master/coordinator<br>>> > pgpool-II node (if it was a master/coordinator) in that case?<br>>><br>>> Yes, exactly. Note that if the PostgreSQL node is one of standbys,<br>>> keeping the quarantine state is fine because users query could be<br>>> processed.<br>>><br>> <br>> Yes that makes total sense. I will make that change as separate patch.<br><br></span>Thanks. However this will change existing behavior. Probably we should<br>make the change against master branch only?<br></div><div><br></div><div>--un quote--</div><div><br></div><div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, May 21, 2019 at 4:32 AM Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp">ishii@sraoss.co.jp</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Usama,<br>
<br>
Since this commit regression test/buildfarm are failing:<br>
<br>
testing 013.watchdog_failover_require_consensus... failed.<br>
<br>
Also I think this commit seems to be against the consensus made in the<br>
discussion [pgpool-hackers: 3295] thread.<br>
<br>
I thought we agreed on [pgpool-hackers: 3318] so that:<br>
<br>
---------------------------------------------------------------------<br>
Hi Usama,<br>
<br>
> Hi<br>
> <br>
> I have drafted a patch to make the master watchdog node resigns from master<br>
> responsibilities if it fails to get the consensus for its<br>
> primary backend node failover request. The patch is still little short on<br>
> testing but I want to share the early version to get<br>
> the feedback on behaviour.<br>
> Also with this implementation the master/coordinator node only resigns from<br>
> being a master<br>
> when it fails to get the consensus for the primary node failover, but in<br>
> case of failed consensus for standby node failover<br>
> no action is taken by the watchdog master node. Do you think master should<br>
> also resign in this case as well ?<br>
<br>
I don't think so because still queries can be routed to primary (or<br>
other standby servers if there are two or more standbys).<br></blockquote><div><br></div><div>My understanding from this part of the discussion was that we agreed to keep the master status</div><div>of the watchdog node if one of the standby nodes on the pgpool watchdog master gets into </div><div>quarantine, and to go for resignation only if the primary gets quarantined. </div><div><br></div><div>Have I misunderstood something?</div><div><br></div><div>Thanks</div><div>Best regards</div><div>Muhammad Usama</div><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
---------------------------------------------------------------------<br>
<br>
From: Muhammad Usama <<a href="mailto:m.usama@gmail.com" target="_blank">m.usama@gmail.com</a>><br>
Subject: [pgpool-committers: 5734] pgpool: Fix for [pgpool-hackers: 3295] duplicate failover request ...<br>
Date: Wed, 15 May 2019 21:40:01 +0000<br>
Message-ID: <<a href="mailto:E1hR1d3-0005o4-1k@gothos.postgresql.org" target="_blank">E1hR1d3-0005o4-1k@gothos.postgresql.org</a>><br>
<br>
> Fix for [pgpool-hackers: 3295] duplicate failover request ...<br>
> <br>
> Pgpool should keep the backend health check running on quarantined nodes so<br>
> that when the connectivity resumes, they should automatically get removed<br>
> from the quarantine. Otherwise the temporary network glitch could send the node<br>
> into permanent quarantine state.<br>
> <br>
> Branch<br>
> ------<br>
> master<br>
> <br>
> Details<br>
> -------<br>
> <a href="https://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=3dd1cd3f15287ee6bb8b09f0642f99db98e9776a" rel="noreferrer" target="_blank">https://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=3dd1cd3f15287ee6bb8b09f0642f99db98e9776a</a><br>
> <br>
> Modified Files<br>
> --------------<br>
> src/main/health_check.c | 28 ++++++++++++++++++++++++----<br>
> 1 file changed, 24 insertions(+), 4 deletions(-)<br>
> <br>
</blockquote></div></div>