[pgpool-hackers: 3519] Re: Proposal: health check statistics

Tatsuo Ishii ishii at sraoss.co.jp
Tue Feb 25 17:28:13 JST 2020


I have added a dedicated PCP command and a pgpool_adm extension function.
See the manuals for more details.

http://tatsuo-ishii.github.io/pgpool-II/current/pcp-health-check-stats.html
http://tatsuo-ishii.github.io/pgpool-II/current/pgpool-adm-pcp-health-check-stats.html

> Pushed with bug fixes and enhancements, along with an SGML document for
> the new "show pool_health_check_stats" command.
> 
>> Ok, here is the first cut of this work (see attached patches). I
>> implemented a "show pool_health_check_stats;" command to show the health
>> check statistics stored in shared memory. Below is a sample session.
>> There are two backend nodes, 0 and 1. Node 1 was shut down and a
>> failover happened. Then it automatically failed back because
>> auto_failback = on. Facts you can see from it include:
>> 
>> - node 0's last_failed_health_check is empty because no health check
>>   has failed on node 0.
>> 
>> - node 0's last_skip_health_check is also empty. Health checks are
>>   skipped only for a node that is down, and node 0 has never been
>>   down.
>> 
>> - on node 1 fail_count = 1 as failover happened once.
>> 
>> - on node 1 skip_count = 7, which means health check skipping happened
>>   7 times until node 1 failed back.
>> 
>> - on node 1 retry_count is 4, which means the health check was retried
>>   4 times before it decided node 1 had failed.
>> 
>> - duration of each health check is observed on node 0 and 1. Max
>>   duration on node 0 is 657 ms while min duration is 557 ms.
>> 
>> - on node 1, the health check was last performed at 20:38:22, but it
>>   was actually skipped because last_skip_health_check recorded the
>>   same time. The health check triggered the failover at 20:37:12. In
>>   fact, the log contains the following line with the same timestamp.
>> 
>>   2020-01-25 20:37:12: pid 29238: LOG:  health check failed on node 1 (timeout:0)
>> 
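The timeline described in the last point can be reconstructed from the timestamps alone. The following is a minimal sketch, not part of the patch: the timestamp strings are copied from the node 1 record in the sample session, and the helper name seconds_between is made up for illustration.

```python
from datetime import datetime

# Timestamp layout used by the statistics columns in the sample session.
FMT = "%Y-%m-%d %H:%M:%S"

def seconds_between(earlier, later):
    """Hypothetical helper: elapsed seconds between two timestamp strings."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return delta.total_seconds()

# Values copied from the node 1 record.
last_health_check = "2020-01-25 20:38:22"
last_successful_health_check = "2020-01-25 20:36:59"
last_skip_health_check = "2020-01-25 20:38:22"
last_failed_health_check = "2020-01-25 20:37:12"

# The most recent check was a skip if both columns carry the same time.
last_check_was_skipped = last_health_check == last_skip_health_check

# Time from the last good check to the failure that triggered failover.
gap_to_failure = seconds_between(last_successful_health_check,
                                 last_failed_health_check)

print(last_check_was_skipped)  # True
print(gap_to_failure)          # 13.0
```
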
>> test=# show pool_health_check_stats;
>> -[ RECORD 1 ]----------------+--------------------
>> node_id                      | 0
>> hostname                     | /tmp
>> port                         | 11002
>> status                       | up
>> last_status_change           | 2020-01-25 20:36:03
>> total_count                  | 15
>> success_count                | 15
>> fail_count                   | 0
>> skip_count                   | 0
>> retry_count                  | 0
>> average_retry_count          | 0.000000
>> max_retry_count              | 0
>> max_duration                 | 657
>> min_duration                 | 557
>> average_duration             | 606.800000
>> last_health_check            | 2020-01-25 20:38:19
>> last_successful_health_check | 2020-01-25 20:38:19
>> last_skip_health_check       | 
>> last_failed_health_check     | 
>> -[ RECORD 2 ]----------------+--------------------
>> node_id                      | 1
>> hostname                     | /tmp
>> port                         | 11003
>> status                       | waiting
>> last_status_change           | 2020-01-25 20:38:22
>> total_count                  | 15
>> success_count                | 7
>> fail_count                   | 1
>> skip_count                   | 7
>> retry_count                  | 4
>> average_retry_count          | 0.266667
>> max_retry_count              | 4
>> max_duration                 | 623
>> min_duration                 | 557
>> average_duration             | 593.000000
>> last_health_check            | 2020-01-25 20:38:22
>> last_successful_health_check | 2020-01-25 20:36:59
>> last_skip_health_check       | 2020-01-25 20:38:22
>> last_failed_health_check     | 2020-01-25 20:37:12
>> 
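The counters in the sample above appear to be related: on both nodes total_count equals success_count + fail_count + skip_count, and average_retry_count matches retry_count / total_count (4 / 15 = 0.266667 on node 1). These relationships are inferred from this one sample, not taken from the patch. A minimal sketch that parses the expanded psql output and checks them, with hypothetical helper names parse_expanded and check_counters:

```python
# Subset of the columns from the sample session, copied verbatim.
SAMPLE = """\
-[ RECORD 1 ]----------------+--------------------
node_id                      | 0
total_count                  | 15
success_count                | 15
fail_count                   | 0
skip_count                   | 0
retry_count                  | 0
average_retry_count          | 0.000000
-[ RECORD 2 ]----------------+--------------------
node_id                      | 1
total_count                  | 15
success_count                | 7
fail_count                   | 1
skip_count                   | 7
retry_count                  | 4
average_retry_count          | 0.266667
"""

def parse_expanded(text):
    """Split expanded psql output into a list of {column: value} dicts."""
    records = []
    for line in text.splitlines():
        if line.startswith("-[ RECORD"):
            records.append({})          # a new record begins
        elif "|" in line and records:
            name, _, value = line.partition("|")
            records[-1][name.strip()] = value.strip()
    return records

def check_counters(rec):
    """True if the record satisfies the relationships assumed above."""
    total = int(rec["total_count"])
    parts = sum(int(rec[k]) for k in
                ("success_count", "fail_count", "skip_count"))
    avg_retry = int(rec["retry_count"]) / total if total else 0.0
    reported = float(rec["average_retry_count"])
    # The reported average is rounded to six decimal places.
    return parts == total and abs(avg_retry - reported) < 1e-6

records = parse_expanded(SAMPLE)
for rec in records:
    print(rec["node_id"], check_counters(rec))
```

A check like this is only a sanity test on captured output; the real definitions of the counters are in the patch and the manual pages.
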
>> BTW, I am not sure whether the following should be included in this work:
>> 
>>> - cause of the status change (failover, failback etc.)
>>> - last 10 status change timestamps and the status at each time ("10" should be configurable)
>> 
>> I think they are best handled by the failover process. I would like to
>> focus on health check statistics.
>> 
>> Next work will be:
>> 
>> - More tests.
>> - Implement PCP command.
>> - Implement pgpool_adm function.
>> - Write documents.
>> 
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp