[pgpool-general: 4699] Re: Info on health_check parameters
Tatsuo Ishii
ishii at postgresql.org
Sat May 21 05:56:20 JST 2016
> Hi,
>
>>> Let's say
>>>
>>> - health_check_period = 10
>>> - health_check_max_retries = 11
>>> - health_check_timeout =10
>>> - health_check_retry_delay = 1
>>>
>>> If at time 0 the master goes down and health checking starts, then after 11
>>> tries that take 10+1 seconds each, failover is triggered at time 121
>
>> That's the worst case. Most error checks will return far before the
>> timeout, so usually failover would be triggered at about time 11 * 1 = 11.
>
> the problem here is that I need failover not to happen before 120s, but
> obviously I also don't want it to happen much later than that. The best
> option for me would be failover after exactly 120s.
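The arithmetic above can be sketched quickly (a toy model of the timing bounds, counting 11 tries for max_retries = 11 as the thread does; pgpool's actual scheduling may differ in the details):

```python
# Toy model of the failover timing bounds discussed above.
def failover_bounds(max_retries, timeout, retry_delay):
    worst = max_retries * (timeout + retry_delay)  # every check hits the timeout
    best = max_retries * retry_delay               # every check fails instantly
    return best, worst

best, worst = failover_bounds(max_retries=11, timeout=10, retry_delay=1)
print(best, worst)  # 11 121
```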
>
> There are also two timeouts to be considered: connection_timeout (10s)
> and health_check_timeout (let's say 10 seconds).
>
> I've made a small test using psql and two different cases:
> * trying to connect to a node that is up but postgresql service is down
> * trying to connect to a node that is down
>
> Having a look at strace, the behaviour is quite different.
> In case 1, an error is returned almost instantly by connect() and poll().
> In case 2, connect() instead hits a timeout after about 30s.
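The difference can be reproduced without psql (a sketch using a plain TCP socket on localhost; the unreachable-host case is only described in the comment, since it depends on the network):

```python
import socket
import time

# Case 1: the host is reachable but nothing listens on the port, like a
# node that is up with the postgresql service stopped. connect() fails
# almost instantly with "connection refused". (Case 2, an unreachable
# host, would instead block until the TCP handshake gives up, roughly 30s
# with default Linux settings; not reproduced here.)
probe = socket.socket()
probe.bind(("127.0.0.1", 0))   # let the OS pick a free port...
port = probe.getsockname()[1]
probe.close()                  # ...then close it so nothing is listening

refused = False
start = time.monotonic()
try:
    socket.create_connection(("127.0.0.1", port), timeout=10)
except ConnectionRefusedError:
    refused = True
elapsed = time.monotonic() - start
print(f"refused={refused} after {elapsed:.3f}s")
```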
>
> I think that when postgresql is down for an upgrade but the node is up,
> pgpool's behaviour is likely similar to case 1, so the health checks
> probably fail quickly.
>
> On the other hand, if the node is down or there are network issues, pgpool
> will probably end up hitting connect_timeout or health_check_timeout.
>
> Since my nodes are located in the same place and the network between them is
> reliable, I should probably reduce both timeouts to a few seconds and wait a
> little longer between tries.
>
> If I choose:
> * health_check_timeout and connection timeout = 3
> * health_check_retry_delay = 8
> * health_check_max_retries = 15
>
> I should probably obtain failover not before 15 * 8 = 120s if postgresql is
> down (case 1, quick failures) and not after (8 + 3) * 15 = 165s if the node
> is down or unreachable (case 2, connection timeout).
>
> I think that 3 seconds should be enough for a health check even under heavy
> load. Moreover, those checks are allowed to fail occasionally without
> triggering failover.
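Expressed as a pgpool.conf fragment, the choice would look something like this (a sketch only; parameter names and units should be checked against your pgpool-II version's documentation, and note that connect_timeout in particular is specified in milliseconds):

```
health_check_period      = 10   # seconds between rounds of health checks
health_check_timeout     = 3    # seconds before a single check is abandoned
health_check_max_retries = 15   # retries before the backend is set down
health_check_retry_delay = 8    # seconds to wait between retries
connect_timeout          = 3000 # milliseconds, i.e. the 3s discussed above
```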
>
> Could you please comment on my parameter choice?
Thank you for the study. Your choice of parameters looks good.
Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
> Thank you and best regards,
>
> Gabriele Monfardini
>
> -----
> Gabriele Monfardini
> LdP Progetti GIS
> tel: 0577.531049
> email: monfardini at ldpgis.it
>
> On Fri, May 20, 2016 at 4:30 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>> >
>> > Hi all,
>> >
>> > I have a setup with two pgpools in HA and two backends in streaming
>> > replication.
>> > The problem is that, due to an unattended upgrade, the master was
>> > restarted and the master pgpool correctly started failover.
>> >
>> > We would like to prevent this by tuning the health_check parameters, so
>> > that pgpool can cope with a short master outage without performing
>> > failover.
>> >
>> > I've found an old blog post by Tatsuo Ishii,
>> > http://pgsqlpgpool.blogspot.it/2013/09/health-check-parameters.html, in
>> > which the following statement is made:
>> >
>> >> Please note that "health_check_max_retries *
>> >> (health_check_timeout+health_check_retry_delay)" should be smaller than
>> >> health_check_period.
>>
>> Yeah, this is not the best advice.
>>
>> > Looking at the code, however, it seems to me that things are a little
>> > different (probably I'm wrong).
>> >
>> > 1. in the main loop, a health check is performed for each backend
>> > (do_health_check), from backend 0 to number_of_backends
>> > 2. suppose the i-th backend's health check fails because of a timeout.
>> > The process is interrupted by the timer.
>> > 3. if (current_try <= health_check_max_retries) =>
>> > sleep health_check_retry_delay
>> > 4. we're back in the main loop; the health check restarts from i, the
>> > backend whose health check failed
>> > 5. suppose the health check fails again and again
>> > 6. when (current_try > health_check_max_retries) => set the backend down
>> > 7. we're back in the main loop; the health check restarts from i, the
>> > backend whose health check failed, but now its state is DOWN, so we
>> > continue to the next backend
>> > 8. when do_health_check exits in the main loop, all backends are down
>> > or all backends not currently down are healthy
>> > 9. then we sleep health_check_period in the main loop before starting
>> > the checks again from the beginning.
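The steps above can be modelled in a few lines (a toy sketch with a hypothetical check_backend callback; pgpool's real code is in C and differs in detail):

```python
import time

UP, DOWN = "up", "down"

def run_health_checks(backends, check_backend, max_retries, retry_delay):
    """One pass over all backends; a backend is marked DOWN after its
    check has failed more than max_retries times in a row."""
    for name in backends:
        if backends[name] == DOWN:
            continue                      # step 7: skip backends already down
        tries = 0
        while True:
            if check_backend(name):       # probe backend i
                break                     # healthy, move to the next backend
            tries += 1
            if tries > max_retries:       # step 6: give up, set it down
                backends[name] = DOWN
                break
            time.sleep(retry_delay)       # step 3: wait before retrying

backends = {"node0": UP, "node1": UP}
run_health_checks(backends, lambda n: n != "node0",  # node0 always fails
                  max_retries=2, retry_delay=0)
print(backends)  # {'node0': 'down', 'node1': 'up'}
```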
>> >
>> >
>> > If I understand it correctly, health_check_period is slept
>> > unconditionally at the end of the check, so it does not need to be set as
>> > high as the formula in the blog suggests.
>>
>> Correct.
>>
>> > Moreover, if there are many backends and many failures, the last backend
>> > may be checked again only after a long time, in the worst case after about
>> >
>> > (number_of_backends-1) * health_check_max_retries *
>> > (health_check_timeout+health_check_retry_delay) + health_check_period
>>
>> Again, correct. To improve this, we would need to create a separate health
>> check process per backend, with each process checking its PostgreSQL
>> server concurrently.
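Plugging numbers into the quoted formula (a sketch using the earlier example values; the three-backend count is a hypothetical illustration, not from the thread):

```python
# Worst-case delay before the last backend is re-checked, per the quoted
# formula (sequential checks, every check hitting the timeout).
def worst_case_last_backend(n_backends, max_retries, timeout, retry_delay, period):
    return (n_backends - 1) * max_retries * (timeout + retry_delay) + period

print(worst_case_last_backend(n_backends=3, max_retries=11,
                              timeout=10, retry_delay=1, period=10))  # 252
```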
>>
>> > Suppose I decide it is acceptable for the master to be down for at most
>> > 120 seconds before failover.
>> >
>> > Since I have only two backends, I should probably set
>> >
>> > health_check_max_retries * (health_check_timeout+health_check_retry_delay)
>> > + health_check_period
>> >
>> > to about 120s.
>> >
>> > Let's say
>> >
>> > - health_check_period = 10
>> > - health_check_max_retries = 11
>> > - health_check_timeout = 10
>> > - health_check_retry_delay = 1
>> >
>> > If at time 0 the master goes down and health checking starts, then after
>> > 11 tries that take 10+1 seconds each, failover is triggered at time 121
>>
>> That's the worst case. Most error checks will return far before the
>> timeout, so usually failover would be triggered at about time 11 * 1 = 11.
>>
>> > In case all health checks return OK in negligible time, which should
>> > happen almost always, health_check_period ensures that no checks are done
>> > for the next 10 seconds.
>>
>> Right.
>>
>> > Can you please confirm my findings or correct me?
>>
>> Thank you for your analysis!
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>>
>> > Best regards,
>> >
>> > Gabriele Monfardini
>> >
>> > -----
>> > Gabriele Monfardini
>> > LdP Progetti GIS
>> > tel: 0577.531049
>> > email: monfardini at ldpgis.it
>> _______________________________________________
>> pgpool-general mailing list
>> pgpool-general at pgpool.net
>> http://www.pgpool.net/mailman/listinfo/pgpool-general