[pgpool-hackers: 3500] Re: Proposal: health check statistics
Tatsuo Ishii
ishii at sraoss.co.jp
Sun Jan 26 20:19:51 JST 2020
Pushed with bug fixes and enhancements, along with an SGML document for
the new "show pool_health_check_stats" command.
> Ok, here is the first cut of this work (see attached patches). I
> implemented the "show pool_health_check_stats;" command to show health
> check statistics stored in shared memory. Below is a sample session.
> There are two backend nodes, 0 and 1. Node 1 was shut down and a
> failover happened. Then it automatically failed back because
> auto_failback = on. Facts you can see from the session include:
>
> - node 0's last_failed_health_check is empty because no health check
> has failed on node 0.
>
> - node 0's last_skip_health_check is also empty, because health
> checks are skipped only for a downed node, and node 0 was never
> down.
>
> - on node 1 fail_count = 1 as failover happened once.
>
> - on node 1 skip_count = 7, which means health check skipping happened
> 7 times until node 1 failed back.
>
> - on node 1 retry_count is 4, which means the health check was retried
> 4 times before it decided that node 1 had failed.
>
> - the duration of each health check is recorded on both nodes. On
> node 0, max_duration is 657 ms while min_duration is 557 ms.
>
> - on node 1, the last health check was performed at 20:38:22, but it
> was actually skipped, because last_skip_health_check records the
> same time. The health check that triggered the failover failed at
> 20:37:12. Indeed, the log contains the following line with the same
> timestamp.
>
> 2020-01-25 20:37:12: pid 29238: LOG: health check failed on node 1 (timeout:0)
>
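The reasoning in the last item can be sketched in code. The following
Python fragment is only an illustration and is not part of the patch: the
helper name last_check_outcome and the dictionary layout are assumptions,
and the timestamps are copied from the sample session in this message. It
labels the most recent health check by finding which of the three last_*
timestamps equals last_health_check, then compares the log timestamp with
last_failed_health_check.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def last_check_outcome(rec):
    """Return 'success', 'skip', 'failed' or 'unknown' for the latest check.

    Hypothetical helper: rec maps column names of
    "show pool_health_check_stats" to their string values.
    Timestamps have one-second resolution, so two events within the
    same second cannot be told apart by this comparison.
    """
    last = rec.get("last_health_check", "")
    if not last:
        return "unknown"
    for label, col in (("success", "last_successful_health_check"),
                       ("skip", "last_skip_health_check"),
                       ("failed", "last_failed_health_check")):
        if rec.get(col, "") == last:
            return label
    return "unknown"

# Values copied from the sample session (only the timestamp columns).
node0 = {
    "last_health_check": "2020-01-25 20:38:19",
    "last_successful_health_check": "2020-01-25 20:38:19",
    "last_skip_health_check": "",
    "last_failed_health_check": "",
}
node1 = {
    "last_health_check": "2020-01-25 20:38:22",
    "last_successful_health_check": "2020-01-25 20:36:59",
    "last_skip_health_check": "2020-01-25 20:38:22",
    "last_failed_health_check": "2020-01-25 20:37:12",
}

print(last_check_outcome(node0))  # success
print(last_check_outcome(node1))  # skip

# The first 19 characters of the log line hold its timestamp.
log_line = ("2020-01-25 20:37:12: pid 29238: LOG: "
            "health check failed on node 1 (timeout:0)")
log_time = datetime.strptime(log_line[:19], FMT)
failed_time = datetime.strptime(node1["last_failed_health_check"], FMT)
print(log_time == failed_time)  # True
```
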
> test=# show pool_health_check_stats;
> -[ RECORD 1 ]----------------+--------------------
> node_id                      | 0
> hostname                     | /tmp
> port                         | 11002
> status                       | up
> last_status_change           | 2020-01-25 20:36:03
> total_count                  | 15
> success_count                | 15
> fail_count                   | 0
> skip_count                   | 0
> retry_count                  | 0
> average_retry_count          | 0.000000
> max_retry_count              | 0
> max_duration                 | 657
> min_duration                 | 557
> average_duration             | 606.800000
> last_health_check            | 2020-01-25 20:38:19
> last_successful_health_check | 2020-01-25 20:38:19
> last_skip_health_check       |
> last_failed_health_check     |
> -[ RECORD 2 ]----------------+--------------------
> node_id                      | 1
> hostname                     | /tmp
> port                         | 11003
> status                       | waiting
> last_status_change           | 2020-01-25 20:38:22
> total_count                  | 15
> success_count                | 7
> fail_count                   | 1
> skip_count                   | 7
> retry_count                  | 4
> average_retry_count          | 0.266667
> max_retry_count              | 4
> max_duration                 | 623
> min_duration                 | 557
> average_duration             | 593.000000
> last_health_check            | 2020-01-25 20:38:22
> last_successful_health_check | 2020-01-25 20:36:59
> last_skip_health_check       | 2020-01-25 20:38:22
> last_failed_health_check     | 2020-01-25 20:37:12
>
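The counters in the sample output appear to satisfy two simple relations:
total_count = success_count + fail_count + skip_count, and
average_retry_count = retry_count / total_count. Both are inferred from
this one sample only, not from the pgpool-II source or documentation, so
treat them as assumptions. A minimal Python sketch that parses the
expanded output and checks them follows. The function names are
hypothetical, and SAMPLE is abbreviated to the counter columns of the two
records above.

```python
# Counter columns copied from the two records of the sample session.
SAMPLE = """\
-[ RECORD 1 ]-------+---------
node_id             | 0
total_count         | 15
success_count       | 15
fail_count          | 0
skip_count          | 0
retry_count         | 0
average_retry_count | 0.000000
-[ RECORD 2 ]-------+---------
node_id             | 1
total_count         | 15
success_count       | 7
fail_count          | 1
skip_count          | 7
retry_count         | 4
average_retry_count | 0.266667
"""

def parse_expanded(text):
    """Parse psql expanded output into a list of {column: value} dicts."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("-[ RECORD"):
            records.append({})          # a new record starts here
        elif "|" in line and records:
            name, _, value = line.partition("|")
            records[-1][name.strip()] = value.strip()
    return records

def counts_add_up(rec):
    """Assumed relation: total = success + fail + skip."""
    return int(rec["total_count"]) == (int(rec["success_count"])
                                       + int(rec["fail_count"])
                                       + int(rec["skip_count"]))

def average_retry_matches(rec, tol=1e-6):
    """Assumed relation: average_retry_count = retry_count / total_count."""
    total = int(rec["total_count"])
    if total == 0:
        return True                     # nothing to compare yet
    expected = int(rec["retry_count"]) / total
    return abs(float(rec["average_retry_count"]) - expected) <= tol

for rec in parse_expanded(SAMPLE):
    print(rec["node_id"], counts_add_up(rec), average_retry_matches(rec))
```

For the sample values both checks hold on both nodes: 15 = 15 + 0 + 0 and
0 / 15 = 0 on node 0, 15 = 7 + 1 + 7 and 4 / 15 = 0.266667 on node 1.
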
> BTW, I am not sure whether the following should be included in this work:
>
>> - cause of the status change (failover, failback etc.)
>> - last 10 status change timestamps and the status at each time ("10" should be configurable)
>
> Because they are best handled by the failover process. I would like to
> focus on health check statistics.
>
> Next work will be:
>
> - More tests.
> - Implement PCP command.
> - Implement pgpool_adm function.
> - Write documents.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp