[pgpool-hackers: 3499] Re: Proposal: health check statistics
Tatsuo Ishii
ishii at sraoss.co.jp
Sat Jan 25 21:22:57 JST 2020
OK, here is the first cut of this work (see attached patches). I
implemented a "show pool_health_check_stats;" command that shows the
health check statistics stored in shared memory. Below is a sample
session. There are two backend nodes, 0 and 1. Node 1 was shut down and
a failover happened. It then failed back automatically because
auto_failback = on. Facts you can see from the output include:
- node 0's last_failed_health_check is empty because no health check
has failed on node 0.
- node 0's last_skip_health_check is also empty. Health checks are
skipped only against a downed node, and node 0 has never been
down.
- on node 1 fail_count = 1 as failover happened once.
- on node 1 skip_count = 7, which means health check skipping happened
7 times until node 1 failed back.
- on node 1 retry_count is 4, which means health check retried 4 times
until it decided node 1 failed.
- health check durations are recorded for both node 0 and node 1. On
node 0 the max duration is 657 ms and the min duration is 557 ms.
- on node 1, the last health check ran at 20:38:22, but it was
actually skipped, because last_skip_health_check records the same
time. The health check triggered the failover at 20:37:12. Indeed,
the log has the following line at that time:
2020-01-25 20:37:12: pid 29238: LOG: health check failed on node 1 (timeout:0)
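To make the counter semantics in the list above concrete, here is a minimal Python sketch of how such per-node counters could be accumulated. It is not the code in the attached patch. The class name, the method names, the `total_duration` helper field and the rule that skipped checks do not contribute a duration are all assumptions made for illustration. The replay at the bottom uses the node 1 counts from this mail, with a placeholder duration of 600 ms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthCheckStats:
    """Illustrative per-node counters (hypothetical, not the pgpool-II struct)."""
    total_count: int = 0
    success_count: int = 0
    fail_count: int = 0
    skip_count: int = 0
    retry_count: int = 0
    max_retry_count: int = 0
    max_duration: int = 0
    min_duration: Optional[int] = None
    total_duration: int = 0  # assumed helper sum, used only for the average

    def record_skip(self) -> None:
        # Assumption: a skipped check (node is down) still counts in total_count.
        self.total_count += 1
        self.skip_count += 1

    def record_check(self, ok: bool, retries: int, duration_ms: int) -> None:
        # One executed health check that ended in success or failure.
        self.total_count += 1
        if ok:
            self.success_count += 1
        else:
            self.fail_count += 1
        self.retry_count += retries
        self.max_retry_count = max(self.max_retry_count, retries)
        self.total_duration += duration_ms
        self.max_duration = max(self.max_duration, duration_ms)
        if self.min_duration is None or duration_ms < self.min_duration:
            self.min_duration = duration_ms

    @property
    def average_retry_count(self) -> float:
        return self.retry_count / self.total_count if self.total_count else 0.0

    @property
    def average_duration(self) -> float:
        # Assumption: only executed (non-skipped) checks have a duration.
        executed = self.success_count + self.fail_count
        return self.total_duration / executed if executed else 0.0


# Replay of node 1: 7 successes, 1 failure after 4 retries, then 7 skips.
# The 600 ms duration is a placeholder, not a measured value.
node1 = HealthCheckStats()
for _ in range(7):
    node1.record_check(ok=True, retries=0, duration_ms=600)
node1.record_check(ok=False, retries=4, duration_ms=600)
for _ in range(7):
    node1.record_skip()
print(node1.total_count, node1.max_retry_count)
```

Under these assumptions the replay ends with total_count = 15 and an average_retry_count of 4/15, in line with the figures reported for node 1.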
test=# show pool_health_check_stats;
-[ RECORD 1 ]----------------+--------------------
node_id                      | 0
hostname                     | /tmp
port                         | 11002
status                       | up
last_status_change           | 2020-01-25 20:36:03
total_count                  | 15
success_count                | 15
fail_count                   | 0
skip_count                   | 0
retry_count                  | 0
average_retry_count          | 0.000000
max_retry_count              | 0
max_duration                 | 657
min_duration                 | 557
average_duration             | 606.800000
last_health_check            | 2020-01-25 20:38:19
last_successful_health_check | 2020-01-25 20:38:19
last_skip_health_check       |
last_failed_health_check     |
-[ RECORD 2 ]----------------+--------------------
node_id                      | 1
hostname                     | /tmp
port                         | 11003
status                       | waiting
last_status_change           | 2020-01-25 20:38:22
total_count                  | 15
success_count                | 7
fail_count                   | 1
skip_count                   | 7
retry_count                  | 4
average_retry_count          | 0.266667
max_retry_count              | 4
max_duration                 | 623
min_duration                 | 557
average_duration             | 593.000000
last_health_check            | 2020-01-25 20:38:19
last_successful_health_check | 2020-01-25 20:36:59
last_skip_health_check       | 2020-01-25 20:38:22
last_failed_health_check     | 2020-01-25 20:37:12
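In the sample output above, total_count equals success_count + fail_count + skip_count on both nodes, and average_retry_count equals retry_count / total_count. These relations are read off this one sample and are not a documented guarantee. A small Python sketch for checking them against pasted output could look like the following. The function names `parse_expanded` and `counts_are_consistent` are made up for illustration, and `SAMPLE` is an excerpt of the session above with only the counter columns.

```python
from typing import Dict, List

# Excerpt of the session above (counter columns only).
SAMPLE = """\
-[ RECORD 1 ]----------------+--------------------
node_id | 0
total_count | 15
success_count | 15
fail_count | 0
skip_count | 0
retry_count | 0
average_retry_count | 0.000000
-[ RECORD 2 ]----------------+--------------------
node_id | 1
total_count | 15
success_count | 7
fail_count | 1
skip_count | 7
retry_count | 4
average_retry_count | 0.266667
"""


def parse_expanded(text: str) -> List[Dict[str, str]]:
    """Parse psql expanded-display output into one dict per record."""
    records: List[Dict[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("-[ RECORD"):
            # A record header starts a new row.
            records.append({})
        elif "|" in line and records:
            key, _, value = line.partition("|")
            records[-1][key.strip()] = value.strip()
    return records


def counts_are_consistent(rec: Dict[str, str]) -> bool:
    """Check the two relations observed in the sample output."""
    total = int(rec["total_count"])
    parts = (int(rec["success_count"]) + int(rec["fail_count"])
             + int(rec["skip_count"]))
    if total != parts:
        return False
    expected_avg = int(rec["retry_count"]) / total if total else 0.0
    # The command prints six decimals, so allow a small rounding error.
    return abs(expected_avg - float(rec["average_retry_count"])) < 1e-6


for rec in parse_expanded(SAMPLE):
    print(rec["node_id"], counts_are_consistent(rec))
```

Both records in the excerpt pass the check. A mismatch on real output would only mean that the assumed relations do not hold there, not necessarily that the statistics are wrong.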
BTW, I am not sure whether the following should be included in this work:
> - cause of the status change (failover, failback etc.)
> - last 10 status change timestamp and its status at the time ("10" should be configurable)
I think they are best handled by the failover process, so I would like
to focus on health check statistics here.
Next work will be:
- More tests.
- Implement PCP command.
- Implement pgpool_adm function.
- Write documents.
Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: health_check_stats.diff
Type: text/x-patch
Size: 22171 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-hackers/attachments/20200125/5a479229/attachment.bin>