[pgpool-hackers: 3501] Re: Proposal: health check statistics

Wed Jan 29 20:32:10 JST 2020

Hi!

Could you also add to these stats how 'full' the pool is? I'd like to know how
many more connections the pool can accept. I'm running pgpool in Docker with a
small configuration that lets me scale out, but when a container receives more
connections than it can handle, the connection is aborted. This
information would help me scale my container replicas up and down and
keep my high-availability setup working properly.

Thanks!

On Sun, Jan 26, 2020 at 08:19, Tatsuo Ishii <ishii at sraoss.co.jp>
wrote:

> Pushed with bug fixes and enhancements along with an SGML document for
> new "show pool_health_check_stats".
>
> > Ok, here is the first cut of this work (see attached patches). I
> > implemented "show pool_health_check_stats;" command to show health
> > check statistics stored in shared memory. Below is a sample session.
> > There are two backend nodes, 0 and 1. Node 1 was shut down and failover
> > happened. Then it automatically failed back because auto_failback =
> > on. Facts you can see from it include:
> >
> > - node 0's last_failed_health_check is empty because no health check
> >   has failed on node 0.
> >
> > - node 0's last_skip_health_check is also empty, because health checks
> >   are skipped only for a downed node, and node 0 never went
> >   down.
> >
> > - on node 1 fail_count = 1 as failover happened once.
> >
> > - on node 1 skip_count = 4, which means health check skipping happened
> >   4 times until node 1 failed back.
> >
> > - on node 1 retry_count is 4, which means health check retried 4 times
> >   until it decided node 1 failed.
> >
> > - duration of each health check is observed on node 0 and 1. max
> >   duration on node 0 is 657 ms while min duration is 557 ms.
> >
> > - on node 1, the last health check was performed at 20:38:22, but it
> >   was actually skipped, because last_skip_health_check records the same
> >   time. The health check triggered the failover at 20:37:12. In fact,
> >   the log contains the following line at the same time.
> >
> >   2020-01-25 20:37:12: pid 29238: LOG:  health check failed on node 1 (timeout:0)
> >
> > test=# show pool_health_check_stats;
> > -[ RECORD 1 ]----------------+--------------------
> > node_id                      | 0
> > hostname                     | /tmp
> > port                         | 11002
> > status                       | up
> > last_status_change           | 2020-01-25 20:36:03
> > total_count                  | 15
> > success_count                | 15
> > fail_count                   | 0
> > skip_count                   | 0
> > retry_count                  | 0
> > average_retry_count          | 0.000000
> > max_retry_count              | 0
> > max_duration                 | 657
> > min_duration                 | 557
> > average_duration             | 606.800000
> > last_health_check            | 2020-01-25 20:38:19
> > last_successful_health_check | 2020-01-25 20:38:19
> > last_skip_health_check       |
> > last_failed_health_check     |
> > -[ RECORD 2 ]----------------+--------------------
> > node_id                      | 1
> > hostname                     | /tmp
> > port                         | 11003
> > status                       | waiting
> > last_status_change           | 2020-01-25 20:38:22
> > total_count                  | 15
> > success_count                | 7
> > fail_count                   | 1
> > skip_count                   | 7
> > retry_count                  | 4
> > average_retry_count          | 0.266667
> > max_retry_count              | 4
> > max_duration                 | 623
> > min_duration                 | 557
> > average_duration             | 593.000000
> > last_health_check            | 2020-01-25 20:38:22
> > last_successful_health_check | 2020-01-25 20:36:59
> > last_skip_health_check       | 2020-01-25 20:38:22
> > last_failed_health_check     | 2020-01-25 20:37:12
> >
> > BTW, I am not sure whether the following should be included in this work:
> >
> >> - cause of the status change (failover, failback etc.)
> >> - last 10 status change timestamps and the status at each time ("10" should be configurable)
> >
> > I think they are best handled by the failover process, so I would
> > like to focus on health check statistics.
> >
> > Next work will be:
> >
> > - More tests.
> > - Implement PCP command.
> > - Implement pgpool_adm function.
> > - Write documents.
> >
> > Best regards,
> > --
> > Tatsuo Ishii
> > SRA OSS, Inc. Japan
> > English: http://www.sraoss.co.jp/index_en.php
> > Japanese:http://www.sraoss.co.jp
> _______________________________________________
> pgpool-hackers mailing list
> pgpool-hackers at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-hackers
>

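For anyone who wants to consume these numbers from a script, here is a minimal
sketch that parses the expanded-display output quoted above and checks two
relations that appear to hold in the sample: total_count equals success_count +
fail_count + skip_count, and average_retry_count equals retry_count /
total_count. Both relations are inferred from the sample figures only and are
assumptions, not documented semantics. The names parse_expanded,
counts_add_up and average_retry are hypothetical helpers, and SAMPLE is a
trimmed copy of the quoted session.

```python
from typing import Dict, List

# Trimmed copy of the quoted "show pool_health_check_stats" session.
SAMPLE = """\
-[ RECORD 1 ]----------------+--------------------
node_id                      | 0
total_count                  | 15
success_count                | 15
fail_count                   | 0
skip_count                   | 0
retry_count                  | 0
average_retry_count          | 0.000000
last_skip_health_check       |
-[ RECORD 2 ]----------------+--------------------
node_id                      | 1
total_count                  | 15
success_count                | 7
fail_count                   | 1
skip_count                   | 7
retry_count                  | 4
average_retry_count          | 0.266667
last_skip_health_check       | 2020-01-25 20:38:22
"""


def parse_expanded(text: str) -> List[Dict[str, str]]:
    """Parse psql expanded-display output into one dict per record."""
    records: List[Dict[str, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        # Drop mail quote markers such as "> > " if the text was pasted from a reply.
        while line.startswith(">"):
            line = line[1:].lstrip()
        if line.startswith("-[ RECORD"):
            records.append({})
        elif "|" in line and records:
            key, _, value = line.partition("|")
            records[-1][key.strip()] = value.strip()
    return records


def counts_add_up(rec: Dict[str, str]) -> bool:
    """Assumed relation: total = success + fail + skip."""
    parts = sum(int(rec[k]) for k in ("success_count", "fail_count", "skip_count"))
    return int(rec["total_count"]) == parts


def average_retry(rec: Dict[str, str]) -> float:
    """Assumed relation: average_retry_count = retry_count / total_count."""
    total = int(rec["total_count"])
    return int(rec["retry_count"]) / total if total else 0.0


if __name__ == "__main__":
    for rec in parse_expanded(SAMPLE):
        print(rec["node_id"], counts_add_up(rec), f"{average_retry(rec):.6f}")
```

Empty columns such as last_skip_health_check on node 0 come back as empty
strings, so a caller can tell "never happened" apart from a timestamp.
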
-- 
Regards!

=============================
José Roberto Emerich Junior
=============================