[pgpool-hackers: 3500] Re: Proposal: health check statistics

Sun Jan 26 20:19:51 JST 2020

Pushed with bug fixes and enhancements, along with SGML documentation
for the new "show pool_health_check_stats" command.

> Ok, here is the first cut of this work (see attached patches). I
> implemented the "show pool_health_check_stats;" command to show
> health check statistics stored in shared memory. Below is a sample
> session. There are two backend nodes, 0 and 1. Node 1 was shut down
> and a failover happened. It then failed back automatically because
> auto_failback = on. Facts you can see from the output include:
> 
> - node 0's last_failed_health_check is empty because no health check
>   has failed on node 0.
> 
> - node 0's last_skip_health_check is also empty. Health checks are
>   skipped only for a downed node, and node 0 has never been down.
> 
> - on node 1 fail_count = 1 as failover happened once.
> 
> - on node 1 skip_count = 7, which means health check skipping happened
>   7 times until node 1 failed back.
> 
> - on node 1 retry_count is 4, which means the health check was retried
>   4 times before it decided node 1 had failed.
> 
> - duration of each health check is observed on node 0 and 1. max
>   duration on node 0 is 657 ms while min duration is 557 ms.
> 
> - on node 1, the last health check was performed at 20:38:22, but it
>   was actually skipped because last_skip_health_check recorded the
>   same time. The health check triggered the failover at 20:37:12. In
>   fact, the log contains the following line at the same time.
> 
>   2020-01-25 20:37:12: pid 29238: LOG:  health check failed on node 1 (timeout:0)
> 
> test=# show pool_health_check_stats;
> -[ RECORD 1 ]----------------+--------------------
> node_id                      | 0
> hostname                     | /tmp
> port                         | 11002
> status                       | up
> last_status_change           | 2020-01-25 20:36:03
> total_count                  | 15
> success_count                | 15
> fail_count                   | 0
> skip_count                   | 0
> retry_count                  | 0
> average_retry_count          | 0.000000
> max_retry_count              | 0
> max_duration                 | 657
> min_duration                 | 557
> average_duration             | 606.800000
> last_health_check            | 2020-01-25 20:38:19
> last_successful_health_check | 2020-01-25 20:38:19
> last_skip_health_check       | 
> last_failed_health_check     | 
> -[ RECORD 2 ]----------------+--------------------
> node_id                      | 1
> hostname                     | /tmp
> port                         | 11003
> status                       | waiting
> last_status_change           | 2020-01-25 20:38:22
> total_count                  | 15
> success_count                | 7
> fail_count                   | 1
> skip_count                   | 7
> retry_count                  | 4
> average_retry_count          | 0.266667
> max_retry_count              | 4
> max_duration                 | 623
> min_duration                 | 557
> average_duration             | 593.000000
> last_health_check            | 2020-01-25 20:38:22
> last_successful_health_check | 2020-01-25 20:36:59
> last_skip_health_check       | 2020-01-25 20:38:22
> last_failed_health_check     | 2020-01-25 20:37:12
> 
> BTW, I am not sure if the following should be included in this work:
> 
>> - cause of the status change (failover, failback etc.)
>> - last 10 status change timestamps and the status at each time ("10" should be configurable)
> 
> They are best handled by the failover process, so I would like to
> focus on health check statistics.
> 
> Next work will be:
> 
> - More tests.
> - Implement PCP command.
> - Implement pgpool_adm function.
> - Write documents.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
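
For readers who want to sanity-check the counters in the sample session
quoted above, here is a minimal Python sketch. It parses psql expanded
output (with the "> " quote prefix removed) and tests two relationships
that the sample numbers appear to satisfy: total_count = success_count +
fail_count + skip_count, and average_retry_count = retry_count /
total_count. Both relationships are inferred from the sample values and
are assumptions, not something stated in the patch. The helper names
parse_expanded and check_counters are hypothetical and not part of
Pgpool-II.

```python
import re

# Subset of the sample session quoted above (quote prefix removed).
SAMPLE = """\
-[ RECORD 1 ]----------------+--------------------
node_id                      | 0
total_count                  | 15
success_count                | 15
fail_count                   | 0
skip_count                   | 0
retry_count                  | 0
average_retry_count          | 0.000000
-[ RECORD 2 ]----------------+--------------------
node_id                      | 1
total_count                  | 15
success_count                | 7
fail_count                   | 1
skip_count                   | 7
retry_count                  | 4
average_retry_count          | 0.266667
"""


def parse_expanded(text):
    """Turn psql expanded-display output into a list of dicts (one per record)."""
    records = []
    for line in text.splitlines():
        if re.match(r"-\[ RECORD \d+ \]", line):
            records.append({})          # a record header starts a new dict
        elif "|" in line and records:
            key, _, value = line.partition("|")
            records[-1][key.strip()] = value.strip()
    return records


def check_counters(rec):
    """Check the two ASSUMED relationships between the counters of one node."""
    total = int(rec["total_count"])
    parts = (int(rec["success_count"])
             + int(rec["fail_count"])
             + int(rec["skip_count"]))
    # Assumption: the average is taken over all health checks, skipped or not.
    avg_retry = int(rec["retry_count"]) / total if total else 0.0
    reported = float(rec["average_retry_count"])
    return {
        "node_id": int(rec["node_id"]),
        "counts_add_up": total == parts,
        "avg_retry_matches": abs(avg_retry - reported) < 1e-6,
    }


results = [check_counters(r) for r in parse_expanded(SAMPLE)]
for r in results:
    print(r)
```

If either check reports False on real output, that more likely means
the assumed relationship is wrong than that the statistics are broken.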