[pgpool-hackers: 3499] Re: Proposal: health check statistics

Sat Jan 25 21:22:57 JST 2020

Ok, here is the first cut of this work (see attached patches). I
implemented a "show pool_health_check_stats;" command that shows the
health check statistics stored in shared memory. Below is a sample
session. There are two backend nodes, 0 and 1. Node 1 was shut down
and a failover happened. Then it automatically failed back because
auto_failback = on. Facts you can see from it include:

- node 0's last_failed_health_check is empty because no health check
  has failed on node 0.

- node 0's last_skip_health_check is also empty. Health checks are
  skipped only for a node that is down, which never happened to node
  0.

- on node 1 fail_count = 1 as failover happened once.

- on node 1 skip_count = 7, which means health checks were skipped
  7 times until node 1 failed back.

- on node 1 retry_count is 4, which means the health check was retried
  4 times before it decided that node 1 had failed.

- the duration of each health check is recorded for nodes 0 and 1.
  The max duration on node 0 is 657 ms while the min duration is
  557 ms.

- on node 1, the last health check was performed at 20:38:22, but it
  was actually skipped, because last_skip_health_check records the
  same time. The health check triggered the failover at 20:37:12 (see
  last_failed_health_check). In fact, the log has the following line
  at the same time:

  2020-01-25 20:37:12: pid 29238: LOG:  health check failed on node 1 (timeout:0)
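
The reasoning in the last item (compare last_health_check with the
other three timestamps) can be written down as a small sketch. This is
only an illustration in Python, not part of the patch:
last_check_outcome() is a hypothetical helper, and the dictionaries
just hold values copied from the sample session. Since the timestamps
have one-second resolution, two different checks in the same second
could not be told apart this way.

```python
def last_check_outcome(rec):
    """Guess what the most recent health check did for one node.

    rec maps column names of "show pool_health_check_stats" to strings.
    The outcome is whichever last_* timestamp equals last_health_check.
    """
    last = rec.get("last_health_check", "")
    if not last:
        return "none"  # no health check recorded yet
    for outcome, field in (
        ("success", "last_successful_health_check"),
        ("skipped", "last_skip_health_check"),
        ("failed", "last_failed_health_check"),
    ):
        if rec.get(field, "") == last:
            return outcome
    return "unknown"


# Values copied from the sample session (empty columns are "").
node0 = {
    "last_health_check": "2020-01-25 20:38:19",
    "last_successful_health_check": "2020-01-25 20:38:19",
    "last_skip_health_check": "",
    "last_failed_health_check": "",
}
node1 = {
    "last_health_check": "2020-01-25 20:38:22",
    "last_successful_health_check": "2020-01-25 20:36:59",
    "last_skip_health_check": "2020-01-25 20:38:22",
    "last_failed_health_check": "2020-01-25 20:37:12",
}

print(last_check_outcome(node0))  # success
print(last_check_outcome(node1))  # skipped
```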

test=# show pool_health_check_stats;
-[ RECORD 1 ]----------------+--------------------
node_id                      | 0
hostname                     | /tmp
port                         | 11002
status                       | up
last_status_change           | 2020-01-25 20:36:03
total_count                  | 15
success_count                | 15
fail_count                   | 0
skip_count                   | 0
retry_count                  | 0
average_retry_count          | 0.000000
max_retry_count              | 0
max_duration                 | 657
min_duration                 | 557
average_duration             | 606.800000
last_health_check            | 2020-01-25 20:38:19
last_successful_health_check | 2020-01-25 20:38:19
last_skip_health_check       | 
last_failed_health_check     | 
-[ RECORD 2 ]----------------+--------------------
node_id                      | 1
hostname                     | /tmp
port                         | 11003
status                       | waiting
last_status_change           | 2020-01-25 20:38:22
total_count                  | 15
success_count                | 7
fail_count                   | 1
skip_count                   | 7
retry_count                  | 4
average_retry_count          | 0.266667
max_retry_count              | 4
max_duration                 | 623
min_duration                 | 557
average_duration             | 593.000000
last_health_check            | 2020-01-25 20:38:22
last_successful_health_check | 2020-01-25 20:36:59
last_skip_health_check       | 2020-01-25 20:38:22
last_failed_health_check     | 2020-01-25 20:37:12
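
As a cross-check of the output above: in both records total_count
equals success_count + fail_count + skip_count, and
average_retry_count equals retry_count / total_count (4 / 15 =
0.266667 on node 1). These relations are read off this one sample, so
treat them as an assumption rather than documented semantics. A
minimal Python sketch that parses psql's expanded output and checks
them; parse_expanded() and check_counters() are hypothetical helper
names, and SAMPLE contains only the columns needed:

```python
SAMPLE = """\
-[ RECORD 1 ]----------------+--------------------
node_id                      | 0
total_count                  | 15
success_count                | 15
fail_count                   | 0
skip_count                   | 0
retry_count                  | 0
average_retry_count          | 0.000000
-[ RECORD 2 ]----------------+--------------------
node_id                      | 1
total_count                  | 15
success_count                | 7
fail_count                   | 1
skip_count                   | 7
retry_count                  | 4
average_retry_count          | 0.266667
"""


def parse_expanded(text):
    """Turn psql expanded ("\\x") output into a list of dicts."""
    records = []
    for line in text.splitlines():
        if line.startswith("-[ RECORD"):
            records.append({})  # a header line starts a new record
        elif "|" in line and records:
            key, _, value = line.partition("|")
            records[-1][key.strip()] = value.strip()
    return records


def check_counters(rec):
    """True if the counters satisfy the relations seen in the sample."""
    total = int(rec["total_count"])
    parts = (int(rec["success_count"]) + int(rec["fail_count"])
             + int(rec["skip_count"]))
    avg_retry = int(rec["retry_count"]) / total if total else 0.0
    reported = float(rec["average_retry_count"])
    # The reported average is printed with 6 decimals, so allow rounding.
    return parts == total and abs(avg_retry - reported) < 1e-6


for rec in parse_expanded(SAMPLE):
    print(rec["node_id"], check_counters(rec))
# prints:
# 0 True
# 1 True
```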

BTW, I am not sure whether the following should be included in this work:

> - cause of the status change (failover, failback etc.)
> - last 10 status change timestamp and it's status at the time ("10" should be configurable)

I think they are best handled by the failover process, so I would
like to focus on health check statistics here.

Next work will be:

- More tests.
- Implement PCP command.
- Implement pgpool_adm function.
- Write documents.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: health_check_stats.diff
Type: text/x-patch
Size: 22171 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-hackers/attachments/20200125/5a479229/attachment.bin>