<p dir="ltr">Hi,</p>

<p dir="ltr">&gt;&gt; Let&#39;s say<br>

&gt;&gt;<br>

&gt;&gt;    - health_check_period = 10<br>

&gt;&gt;    - health_check_max_retries = 11<br>

&gt;&gt;    - health_check_timeout =10<br>

&gt;&gt;    - health_check_retry_delay = 1<br>

&gt;&gt;<br>

&gt;&gt; If at time 0 master goes down and health_check is started, after 11 tries<br>

&gt;&gt; that takes 10+1 seconds each, failover is triggered at time 121</p>

<p dir="ltr">&gt;That&#39;s the worst case. Most of error checking will return far before<br>

&gt;timeout, so usually the failover trigger time would be time 11 * 1 = 11.</p>

<p dir="ltr">The problem here is that I need failover not to happen before 120s, but obviously I would not like it to happen much later either. The best option for me would be to fail over after exactly 120s.</p>

<p dir="ltr">There are also two timeouts to consider: connect_timeout (10s) and health_check_timeout (let&#39;s say 10 seconds).</p>

<p dir="ltr">I ran a small test using psql in two different cases:<br>

  * trying to connect to a node that is up but postgresql service is down<br>

  * trying to connect to a node that is down</p>

<p dir="ltr">Looking at strace, the behaviour is quite different.<br>

In case 1, an error is returned almost instantly by connect() and poll().<br>

In case 2, connect() instead hits a timeout after about 30s.</p>
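<p dir="ltr">For anyone who wants to reproduce the two cases without strace, here is a minimal Python sketch. It is not what psql or pgpool do internally: the function name classify_connect and the 3 second default are my own choices, and it only times a plain TCP connect.</p>

```python
import socket
import time


def classify_connect(host, port, timeout=3.0):
    """Try a TCP connect; return (outcome, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except ConnectionRefusedError:
        # Case 1: node is up, nothing listens on the port (fails fast).
        outcome = "refused"
    except socket.timeout:
        # Case 2: node is down or unreachable (we wait for the timeout).
        outcome = "timed out"
    except OSError as exc:
        # Anything else, e.g. "no route to host".
        outcome = "error: {}".format(exc)
    return outcome, time.monotonic() - start
```

<p dir="ltr">Calling classify_connect with the backend address and port 5432 should come back as &quot;refused&quot; almost immediately in case 1, and as &quot;timed out&quot; after roughly the given timeout in case 2.</p>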

<p dir="ltr">I think that when postgresql is down for an upgrade but the node is up, pgpool&#39;s behaviour may be similar to case 1, so the health checks probably fail quickly.</p>

<p dir="ltr">On the other hand, if the node is down or there are network issues, pgpool will probably end up hitting connect_timeout or health_check_timeout.</p>

<p dir="ltr">Since my nodes are located in the same place and the network between them is reliable, I should probably reduce both timeouts to a few seconds and wait a little longer between tries.</p>

<p dir="ltr">If I choose:<br>

  * health_check_timeout and connect_timeout = 3<br>

  * health_check_retry_delay = 8<br>

  * health_check_max_retries = 15</p>

<p dir="ltr">I should probably get failover not before 15 * 8 = 120s if postgresql is down (case 1, quick failures) and not after (8 + 3) * 15 = 165s if the node is down or unreachable (case 2, connection timeout).</p>
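<p dir="ltr">To make the arithmetic easy to repeat with other values, here is a small Python sketch. It only mirrors my estimate above (quick failures cost just the retry delay, slow failures cost delay plus timeout); the helper name failover_window is made up, and it does not model exactly how pgpool counts the first attempt versus the retries.</p>

```python
def failover_window(max_retries, retry_delay, timeout):
    """Return (earliest, latest) seconds before failover is triggered.

    earliest: every check fails immediately, so only the delays add up.
    latest:   every check runs into the timeout before the delay.
    """
    earliest = max_retries * retry_delay
    latest = max_retries * (retry_delay + timeout)
    return earliest, latest


# Proposed settings: retries = 15, retry delay = 8s, timeout = 3s
print(failover_window(15, 8, 3))    # (120, 165)

# Earlier example: retries = 11, retry delay = 1s, timeout = 10s
print(failover_window(11, 1, 10))   # (11, 121)
```

<p dir="ltr">The second call reproduces the 11s and 121s figures discussed in the quoted text above.</p>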

<p dir="ltr">I think that 3 seconds may be enough for a health check even under heavy load. Moreover, those checks are allowed to fail occasionally without triggering failover.</p>

<p dir="ltr">Could you please comment on my parameter choice?</p>

<p dir="ltr">Thank you and best regards,</p>

<p dir="ltr">Gabriele Monfardini <br></p>

<p dir="ltr">-----<br>

Gabriele Monfardini<br>

LdP Progetti GIS<br>

tel:<a href="tel:0577.531049"> 0577.531049</a><br>

email:<a href="mailto:monfardini@ldpgis.it"> monfardini@ldpgis.it</a></p>

<p dir="ltr">On Fri, May 20, 2016 at 4:30 PM, Tatsuo Ishii &lt;<a href="mailto:ishii@postgresql.org">ishii@postgresql.org</a>&gt; wrote:<br>

</p>


<p dir="ltr">&gt; &gt; Hi all,<br>

&gt; &gt;<br>

&gt; &gt; I have a setup with two pgpools in HA and two backends in streaming<br>

&gt; &gt; replication.<br>

&gt; &gt; The problem is that, due to unattended upgrade, master has been restarted<br>

&gt; &gt; and master pgpool has correctly started failover.<br>

&gt; &gt;<br>

&gt; &gt; We would like to prevent this, playing with health_check parameters, in<br>

&gt; &gt; order for pgpool to cope with short master outage without performing<br>

&gt; &gt; failover.<br>

&gt; &gt;<br>

&gt; &gt; I&#39;ve found an old blog post of Tatsuo Ishii,<br>

&gt; &gt;<a href="http://pgsqlpgpool.blogspot.it/2013/09/health-check-parameters.html"> http://pgsqlpgpool.blogspot.it/2013/09/health-check-parameters.html</a>, in<br>

&gt; &gt; which the following statement is made:<br>

&gt; &gt;<br>

&gt; &gt; Please note that &quot;health_check_max_retries *<br>

&gt; &gt;&gt; (health_check_timeout+health_check_retry_delay)&quot; should be smaller than<br>

&gt; &gt;&gt; health_check_period.<br>

&gt;<br>

&gt; Yeah, this is not a best advice.<br>

&gt;<br>

&gt; &gt; Looking at the code however it seems to me that things are a little<br>

&gt; &gt; different (probably I&#39;m wrong).<br>

&gt; &gt;<br>

&gt; &gt;    1. in main loop health check for backends is performed<br>

&gt; &gt;    (do_health_check), starting from 0 to number_of_backends<br>

&gt; &gt;    2. suppose that i-th backend health check fails because of timeout. The<br>

&gt; &gt;    process is interrupted by the timer.<br>

&gt; &gt;    3. if (current_try &lt;= health_check_max_retries) =&gt;<br>

&gt; &gt;    sleep health_check_retry_delay<br>

&gt; &gt;    4. we&#39;re back in main loop, the health check restart from i, the backend<br>

&gt; &gt;    for which health_check failed<br>

&gt; &gt;    5. suppose that health_check fails again and again<br>

&gt; &gt;    6. when (current_try &gt; health_check_max_retries) =&gt; set backend down<br>

&gt; &gt;    7. we&#39;re back in main loop, the health check restart from i, the backend<br>

&gt; &gt;    for which health_check failed, but now its state is DOWN so we continue to<br>

&gt; &gt;    next backend<br>

&gt; &gt;    8. in main loop when do_health_check exits, all backend are down or all<br>

&gt; &gt;    backend currently not down are healthy<br>

&gt; &gt;    9. then we sleep health_check_period in main loop before starting again<br>

&gt; &gt;    the check from the beginning.<br>

&gt; &gt;<br>

&gt; &gt;<br>

&gt; &gt; If I understand it correctly, health_check_period is slept unconditionally<br>

&gt; &gt; at the end of the check so it is not needed to set it as high as per the<br>

&gt; &gt; formula in the blog.<br>

&gt;<br>

&gt; Correct.<br>

&gt;<br>

&gt; &gt; Moreover if there are many backends and many failures last backend may be<br>

&gt; &gt; checked again after a long time, in the worst case after about<br>

&gt; &gt;<br>

&gt; &gt; (number_of_backends-1) * health_check_max_retries *<br>

&gt; &gt; (health_check_timeout+health_check_retry_delay) + health_check_period<br>

&gt;<br>

&gt; Again, correct. To enhance this, we need to create separate health<br>

&gt; check process, and each process performs health check for each<br>

&gt; PostgreSQL concurrently.<br>

&gt;<br>

&gt; &gt; Suppose that I choose that is acceptable that master may goes down for at<br>

&gt; &gt; max 120 seconds before failover.<br>

&gt; &gt;<br>

&gt; &gt; Since I have only two backends, I should probably set<br>

&gt; &gt;<br>

&gt; &gt; health_check_max_retries * (health_check_timeout+health_check_retry_delay)<br>

&gt; &gt; + health_check_period<br>

&gt; &gt;<br>

&gt; &gt; to about 120s.<br>

&gt; &gt;<br>

&gt; &gt; Let&#39;s say<br>

&gt; &gt;<br>

&gt; &gt;    - health_check_period = 10<br>

&gt; &gt;    - health_check_max_retries = 11<br>

&gt; &gt;    - health_check_timeout =10<br>

&gt; &gt;    - health_check_retry_delay = 1<br>

&gt; &gt;<br>

&gt; &gt; If at time 0 master goes down and health_check is started, after 11 tries<br>

&gt; &gt; that takes 10+1 seconds each, failover is triggered at time 121<br>

&gt;<br>

&gt; That&#39;s the worst case. Most of error checking will return far before<br>

&gt; timeout, so usually the failover trigger time would be time 11 * 1 = 11.<br>

&gt;<br>

&gt; &gt; In case all health checks returns OK in negligible time, that should<br>

&gt; &gt; happens almost always, health_check_period assures that no checks are done<br>

&gt; &gt; for next 10 seconds.<br>

&gt;<br>

&gt; Right.<br>

&gt;<br>

&gt; &gt; Can you please confirm my findings or correct me?<br>

&gt;<br>

&gt; Thank you for your analysis!<br>

&gt;<br>

&gt; Best regards,<br>

&gt; --<br>

&gt; Tatsuo Ishii<br>

&gt; SRA OSS, Inc. Japan<br>

&gt; English:<a href="http://www.sraoss.co.jp/index_en.php"> http://www.sraoss.co.jp/index_en.php</a><br>

&gt; Japanese:<a href="http://www.sraoss.co.jp">http://www.sraoss.co.jp</a><br>

&gt;<br>

&gt; &gt; Best regards,<br>

&gt; &gt;<br>

&gt; &gt; Gabriele Monfardini<br>

&gt; &gt;<br>

&gt; &gt; -----<br>

&gt; &gt; Gabriele Monfardini<br>

&gt; &gt; LdP Progetti GIS<br>

&gt; &gt; tel:<a href="tel:0577.531049"> 0577.531049</a><br>

&gt; &gt; email:<a href="mailto:monfardini@ldpgis.it"> monfardini@ldpgis.it</a><br>

&gt; _______________________________________________<br>

&gt; pgpool-general mailing list<br>

<a href="mailto:pgpool-general@pgpool.net">&gt; pgpool-general@pgpool.net</a><br>

<a href="http://www.pgpool.net/mailman/listinfo/pgpool-general">&gt; http://www.pgpool.net/mailman/listinfo/pgpool-general</a><br>

</p>