<div dir="ltr"><div>Hi Haruka Takatsuka,</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Apr 16, 2019 at 2:42 PM TAKATSUKA Haruka &lt;<a href="mailto:harukat@sraoss.co.jp">harukat@sraoss.co.jp</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hello Usama, and Pgpool Hackers<br>

<br>

Thanks for your answer.<br>

I tried your patch adjusting it for V3.7.x.<br>

<br></blockquote><div>Thanks for trying out the patch. </div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

In the scenario where the health check is enabled and detects the connection failure<br>

and its recovery, it works fine. But in the scenario where the health check<br>

is disabled and frontend requests detect the failure, the quarantine status persists<br>

in the pgpool.<br></blockquote><div><br></div><div>Yes, with the health check disabled it is difficult to recover the node automatically. But again,</div><div>it is not advisable to use the consensus mechanism for failover with the health check disabled, because</div><div>that leads to a situation where the watchdog never reaches consensus, even in</div><div>the case of genuine backend failures: the other pgpool nodes that are not serving clients</div><div>never learn about the backend node failure, keep sitting idle, and never vote</div><div>for the backend failure.</div><div><br></div><div>I believe that is also documented in the <span style="color:rgb(0,0,0);font-family:monospace;font-size:medium">failover_require_consensus </span>section of the documentation.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

I understand that this patch aims to recover from the quarantine status<br>

via the health check. I confirmed that it works well. I think it can help<br>

our customer in certain cases.<br></blockquote><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>

However, there is a problem Ishii-san pointed out: it keeps emitting<br>

health check failure messages for as long as the cause remains.<br>

<br></blockquote><div>That's a valid observation. I guess we can downgrade the log message in that case to DEBUG level.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

A pgpool node that notices that it cannot get consensus or that it is in a minority<br>

would go down promptly; I prefer this simple behavior to quarantining.<br>

Can anyone tell me the reason why this design wasn&#39;t adopted?<br>

<br></blockquote><div>Taking the node down would be too aggressive a strategy and would actually defeat the purpose.</div><div>The original idea of building consensus for failover was to guard against temporary network</div><div>glitches, because failover is a very expensive operation that comes with its own complexities and the possibility</div><div>of data loss.</div><div>Now consider the option of taking down the pgpool node when it is not able to build consensus for a backend node</div><div>failure because of some network glitch. As soon as the glitch occurs, the setup loses one</div><div>pgpool node. That is a disaster in itself: the setup now has one fewer pgpool node,</div><div>which is not only bad for high availability but might also cause the setup to lose its quorum</div><div>altogether.</div><div><br></div><div>So I guess the best way out here is what we discussed above: when the master/coordinator node fails to build</div><div>consensus, it should give up its coordinator status and let the watchdog elect a new leader.</div><div><br></div><div>Thanks</div><div>Best Regards</div><div>Muhammad Usama</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

with best regards,<br>

Haruka Takatsuka<br>

<br>

<br>

On Mon, 15 Apr 2019 19:14:54 +0500<br>

Muhammad Usama &lt;<a href="mailto:m.usama@gmail.com" target="_blank">m.usama@gmail.com</a>&gt; wrote:<br>

<br>

&gt; Thanks for the patch, but your patch effectively disables the node<br>

&gt; quarantine, which doesn&#39;t seem like the right way.<br>

&gt; The backend node that was quarantined because of the absence of quorum<br>

&gt; and/or consensus is already unreachable<br>

&gt; from the Pgpool-II node, and we don&#39;t want to select it as the load-balance<br>

&gt; node (in case the node was a secondary) or consider it<br>

&gt; available when it is not, which is what happens if it is not marked as quarantined.<br>

&gt; <br>

&gt; In my opinion, the right way to tackle the issue is to keep setting the<br>

&gt; quarantine state as is done currently, but<br>

&gt; also keep the health check running on quarantined nodes, so that as soon as<br>

&gt; the connectivity to the<br>

&gt; quarantined node resumes, it becomes part of the cluster again automatically.<br>

&gt; <br>

&gt; Can you please try out the attached patch, to see if the solution works for<br>

&gt; the situation?<br>

&gt; The patch is generated against current master branch.<br>

<br>

_______________________________________________<br>

pgpool-hackers mailing list<br>

<a href="mailto:pgpool-hackers@pgpool.net" target="_blank">pgpool-hackers@pgpool.net</a><br>

<a href="http://www.pgpool.net/mailman/listinfo/pgpool-hackers" rel="noreferrer" target="_blank">http://www.pgpool.net/mailman/listinfo/pgpool-hackers</a><br>

</blockquote></div></div>