<div dir="ltr"><div>Hi Ishii-San,</div><div><br></div><div>The discussion on the thread [pgpool-hackers: 3318] yielded two patches: one continues</div><div>the health check on quarantined nodes, and the other makes the master watchdog de-escalate</div><div>and resign if the primary backend node goes into the quarantine state on the master.</div><div>This commit only takes care of the first part, continuing the health check. I still have</div><div>to commit the second patch, which takes care of resigning from the master status. The fix for the regression</div><div>failure of test 013.watchdog_failover_require_consensus will also come with that second patch.</div><div><br></div><div>I am sorry, I think I am missing something about the consensus reached in the discussion.</div><div>I thought we agreed on the thread to commit both patches, but only to the master branch, since</div><div>they change the existing behaviour and we did not want to backport them to older branches.</div><div><br></div><div>Please see the snippet below from our discussion on the thread, from which I inferred that we were in agreement</div><div>to commit the changes.</div><div><br></div><div>--quote-- </div><div>...</div><div><div class="gmail-HOEnZb gmail-adM"><div class="gmail-im" style="color:rgb(80,0,80)">&gt; Now if we look at the quarantine nodes, they are just as good as alive<br>&gt; nodes (but unreachable by pgpool at the moment).<br>&gt; Because when the node was quarantined, Pgpool-II never executed any<br>&gt; failover and/or follow_master commands<br>&gt; and did not interfered with the PostgreSQL backend in any way to alter its<br>&gt; timeline or recovery states,<br>&gt; So when the quarantine node becomes reachable again it is safe to<br>&gt; automatically connect them back to the Pgpool-II<br><br></div></div>Ok, that makes sense.<span class="gmail-im" style="color:rgb(80,0,80)"><br><br>&gt;&gt; &gt;&gt; BTW,<br>&gt;&gt; 
&gt;&gt;<br>&gt;&gt; &gt;&gt; &gt; &gt; When the communication between master/coordinator pgpool and<br>&gt;&gt; &gt;&gt; &gt; &gt; primary PostgreSQL node is down during a short period<br>&gt;&gt; &gt;&gt; &gt;<br>&gt;&gt; &gt;&gt; &gt; I wonder why you don&#39;t set appropriate health check retry parameters<br>&gt;&gt; &gt;&gt; &gt; to avoid such a temporary communication failure in the firs place. A<br>&gt;&gt; &gt;&gt; &gt; brain surgery to ignore the error reports from Pgpool-II does not seem<br>&gt;&gt; &gt;&gt; &gt; to be a sane choice.<br>&gt;&gt; &gt;&gt;<br>&gt;&gt; &gt;&gt; The original reporter didn&#39;t answer my question. I think it is likely<br>&gt;&gt; &gt;&gt; a problem of misconfiguraton (should use longer heath check retry).<br>&gt;&gt; &gt;&gt;<br>&gt;&gt; &gt;&gt; In summary I think for shorter period communication failure just<br>&gt;&gt; &gt;&gt; increasing health check parameters is enough. However for longer<br>&gt;&gt; &gt;&gt; period communication failure, the watchdog node should decline the<br>&gt;&gt; &gt;&gt; role.<br>&gt;&gt; &gt;&gt;<br>&gt;&gt; &gt;<br>&gt;&gt; &gt; I am sorry I didn&#39;t totally get it what you mean here.<br>&gt;&gt; &gt; Do you mean that the pgpool-II node that has the primary node in<br>&gt;&gt; quarantine<br>&gt;&gt; &gt; state should resign from the master/coordinator<br>&gt;&gt; &gt; pgpool-II node (if it was a master/coordinator) in that case?<br>&gt;&gt;<br>&gt;&gt; Yes, exactly. Note that if the PostgreSQL node is one of standbys,<br>&gt;&gt; keeping the quarantine state is fine because users query could be<br>&gt;&gt; processed.<br>&gt;&gt;<br>&gt; <br>&gt; Yes that makes total sense. I will make that change as separate patch.<br><br></span>Thanks. However this will change existing behavior. 
Probably we should<br>make the change against master branch only?<br></div><div><br></div><div>--un quote--</div><div><br></div><div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, May 21, 2019 at 4:32 AM Tatsuo Ishii &lt;<a href="mailto:ishii@sraoss.co.jp">ishii@sraoss.co.jp</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Usama,<br>

<br>

Since this commit regression test/buildfarm are failing:<br>

<br>

testing 013.watchdog_failover_require_consensus... failed.<br>

<br>

Also I think this commit seems to be against the consensus made in the<br>

discussion [pgpool-hackers: 3295] thread.<br>

<br>

I thought we agreed on [pgpool-hackers: 3318] so that:<br>

<br>

---------------------------------------------------------------------<br>

Hi Usama,<br>

<br>

&gt; Hi<br>

&gt; <br>

&gt; I have drafted a patch to make the master watchdog node resigns from master<br>

&gt; responsibilities if it fails to get the consensus for its<br>

&gt; primary backend node failover request. The patch is still little short on<br>

&gt; testing but I want to share the early version to get<br>

&gt; the feedback on behaviour.<br>

&gt; Also with this implementation the master/coordinator node only resigns from<br>

&gt; being a master<br>

&gt; when it fails to get the consensus for the primary node failover, but in<br>

&gt; case of failed consensus for standby node failover<br>

&gt; no action is taken by the watchdog master node. Do you think master should<br>

&gt; also resign in this case as well ?<br>

<br>

I don&#39;t think so because still queries can be routed to primary (or<br>

other standby servers if there are two or more standbys).<br></blockquote><div><br></div><div>My understanding from this part of the discussion was that we agreed to keep the master status</div><div>of the watchdog node if one of the standby nodes on the pgpool watchdog master goes into</div><div>quarantine, and to resign only if the primary gets quarantined.</div><div><br></div><div>Have I misunderstood something?</div><div><br></div><div>Thanks</div><div>Best regards</div><div>Muhammad Usama</div><div><br></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

---------------------------------------------------------------------<br>

<br>

From: Muhammad Usama &lt;<a href="mailto:m.usama@gmail.com" target="_blank">m.usama@gmail.com</a>&gt;<br>

Subject: [pgpool-committers: 5734] pgpool: Fix for [pgpool-hackers: 3295] duplicate failover request ...<br>

Date: Wed, 15 May 2019 21:40:01 +0000<br>

Message-ID: &lt;<a href="mailto:E1hR1d3-0005o4-1k@gothos.postgresql.org" target="_blank">E1hR1d3-0005o4-1k@gothos.postgresql.org</a>&gt;<br>

<br>

&gt; Fix for [pgpool-hackers: 3295] duplicate failover request ...<br>

&gt; <br>

&gt; Pgpool should keep the backend health check running on quarantined nodes so<br>

&gt; that when the connectivity resumes, they should automatically get removed<br>

&gt; from the quarantine. Otherwise the temporary network glitch could send the node<br>

&gt; into permanent quarantine state.<br>

&gt; <br>

&gt; Branch<br>

&gt; ------<br>

&gt; master<br>

&gt; <br>

&gt; Details<br>

&gt; -------<br>

&gt; <a href="https://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=3dd1cd3f15287ee6bb8b09f0642f99db98e9776a" rel="noreferrer" target="_blank">https://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=3dd1cd3f15287ee6bb8b09f0642f99db98e9776a</a><br>

&gt; <br>

&gt; Modified Files<br>

&gt; --------------<br>

&gt; src/main/health_check.c | 28 ++++++++++++++++++++++++----<br>

&gt; 1 file changed, 24 insertions(+), 4 deletions(-)<br>

&gt; <br>

</blockquote></div></div>