[pgpool-hackers: 3496] Re: Proposal: health check statistics

Thu Jan 23 12:10:59 JST 2020

>>>> Currently Pgpool-II's health check process logs various information,
>>>> including backend connection problems, retries to recover from them,
>>>> and so on. This information is very important for users because it
>>>> reports health problems of PostgreSQL. For example, observing an
>>>> increase in the retry count may suggest that the network connection
>>>> between Pgpool-II and PostgreSQL is having trouble, so that users can
>>>> replace the switch before an actual failure occurs. The problem is
>>>> that it is annoying to look for such information in log files
>>>> afterward, since it may already have disappeared or may never have
>>>> been logged because of other problems (such as a full disk).
>>>> 
>>>> I would like to propose a new feature:
>>>> 
>>>> - Accumulate health check statistics in shared memory so that users
>>>>   can later look into the stats using PCP commands.
>>>> 
>>>> - Such statistics include:
>>>>   - failure count per backend node
>>>>   - retry count per backend node
>>>>   - success count after retries
>>> 
>>> I think we should add statistics about:
>>> - success count per backend node
>>> 
>>> If pgpool's statistics have this, we can know the percentage of failures.
>> 
>> That's definitely a good thing for users. Thank you for your suggestion.
> 
> So, here is the revised proposal for health check statistics.
> (all per node data).
> 
> - total count
> - total success count
> - total failure count
> - total retry count
> - average retry count
> - maximum retry count
> - average response time
> - maximum response time
> - the latest health check timestamp
> - the latest retry timestamp
> - the latest status change timestamp
> - cause of the status change (failover, failback etc.)
> - current status (up, down...)
> - last 10 status change timestamps and the status at each time ("10" should be configurable)

I have started to implement this feature. Currently I have stats like this:

/* statistics per node */
typedef struct {
	uint64	total_count;	/* total count of health check */
	uint64	success_count;	/* total count of successful health check */
	uint64	skip_count;		/* total count of skipped health check */
	uint64	retry_count;	/* total count of health check retries */
	uint32	max_retry_count;	/* max retry count in a health check session */
	int		max_health_check_duration;	/* max duration spent for a health check session in milliseconds */
	int		min_health_check_duration;	/* minimum duration spent for a health check session in milliseconds */
	time_t	last_health_check;	/* last health check timestamp */
	time_t	last_successful_health_check;	/* last successful health check timestamp */
	time_t	last_skip_health_check;			/* last skipped health check timestamp */
	time_t	last_failed_health_check;		/* last failed health check timestamp */
} POOL_HEALTH_CHECK_STATISTICS;
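
To make the meaning of each field concrete, here is a minimal sketch of how one health check session might update the struct. It is not the actual implementation: HC_RESULT, hc_stats_init() and hc_stats_record() are hypothetical names, the uint64/uint32 typedefs are stand-ins for the ones in Pgpool-II's headers, and it assumes that a skipped check contributes no retries and no duration.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Stand-ins for the uint64/uint32 typedefs that Pgpool-II's headers provide */
typedef uint64_t uint64;
typedef uint32_t uint32;

/* Same fields as the struct above */
typedef struct {
	uint64	total_count;
	uint64	success_count;
	uint64	skip_count;
	uint64	retry_count;
	uint32	max_retry_count;
	int		max_health_check_duration;
	int		min_health_check_duration;
	time_t	last_health_check;
	time_t	last_successful_health_check;
	time_t	last_skip_health_check;
	time_t	last_failed_health_check;
} POOL_HEALTH_CHECK_STATISTICS;

/* Hypothetical outcome of one health check session */
typedef enum { HC_SUCCESS, HC_SKIPPED, HC_FAILED } HC_RESULT;

/* Hypothetical: zero everything, then prime the minimum so the first sample wins */
static void
hc_stats_init(POOL_HEALTH_CHECK_STATISTICS *s)
{
	memset(s, 0, sizeof(*s));
	s->min_health_check_duration = INT_MAX;
}

/* Hypothetical: fold the result of one health check session into the stats */
static void
hc_stats_record(POOL_HEALTH_CHECK_STATISTICS *s, HC_RESULT result,
				uint32 retries, int duration_ms, time_t now)
{
	assert(s != NULL && duration_ms >= 0);

	s->total_count++;
	s->last_health_check = now;

	if (result == HC_SKIPPED)
	{
		/* assumption: a skipped check has no retries and no duration */
		s->skip_count++;
		s->last_skip_health_check = now;
		return;
	}

	s->retry_count += retries;
	if (retries > s->max_retry_count)
		s->max_retry_count = retries;
	if (duration_ms > s->max_health_check_duration)
		s->max_health_check_duration = duration_ms;
	if (duration_ms < s->min_health_check_duration)
		s->min_health_check_duration = duration_ms;

	if (result == HC_SUCCESS)
	{
		s->success_count++;
		s->last_successful_health_check = now;
	}
	else
		s->last_failed_health_check = now;
}
```

In this sketch failures are not counted separately; they only leave a trace in last_failed_health_check and in the difference between the other counters.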

> - total failure count
This is not necessary because failure count = total_count -
success_count - skip_count. (I have added skip_count.)
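
As a small illustration of that subtraction, a display command could derive the failure count like this. hc_failure_count() is a hypothetical helper name, not something that exists in the code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: a failure is any check that was neither a success nor a skip */
static uint64_t
hc_failure_count(uint64_t total_count, uint64_t success_count, uint64_t skip_count)
{
	/* the three counters are updated together, so this should always hold */
	assert(total_count >= success_count + skip_count);
	return total_count - success_count - skip_count;
}
```

For example, total_count 15, success_count 5 and skip_count 9 give a failure count of 1.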

> - average retry count
This can be calculated from other data, so I removed it.

> - average response time
Ditto.
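
For instance, the average retry count could be derived at display time roughly as follows. This is only a sketch: hc_average_retry_count() is a hypothetical name, and dividing by the number of checks that were actually performed (total_count - skip_count) is an assumption about what "average" should mean here. An average duration would additionally need a running total of durations, which is not among the fields shown, so it is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: retries per health check that was actually performed */
static double
hc_average_retry_count(uint64_t total_count, uint64_t skip_count, uint64_t retry_count)
{
	uint64_t	performed;

	assert(total_count >= skip_count);
	performed = total_count - skip_count;	/* assumption: skipped checks are excluded */

	/* avoid dividing by zero before the first real check */
	return performed == 0 ? 0.0 : (double) retry_count / (double) performed;
}
```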

> - maximum response time
I changed "response time" to "duration" because it is closer to what is actually measured.

Also I added min_health_check_duration, which could be useful.

> - cause of the status change (failover, failback etc.)
This is better collected by the failover process.

> - current status (up, down...)
Already in shared memory.

> - last 10 status change timestamps and the status at each time ("10" should be configurable)
This is better collected by the failover process.

Here is a sample log from the health check (storing into shared memory
is not implemented yet).

2020-01-23 11:29:18: pid 3086: LOG:  health check stats: total_count: 15 success_count: 15 skip_count: 0 retry_count: 0, max_retry_count: 0 max_duration: 994 min_duration: 1
2020-01-23 11:29:18: pid 3086: LOG:  health check stats: last_health_check: Thu Jan 23 11:29:18 2020
	 last_successful_health_check: Thu Jan 23 11:29:18 2020
	 last_skip_health_check: Thu Jan  1 09:00:00 1970
	 last_failed_health_check: Thu Jan  1 09:00:00 1970

2020-01-23 11:29:20: pid 3087: LOG:  health check stats: total_count: 15 success_count: 5 skip_count: 9 retry_count: 4, max_retry_count: 4 max_duration: 968 min_duration: 913
2020-01-23 11:29:20: pid 3087: LOG:  health check stats: last_health_check: Thu Jan 23 11:29:20 2020
	 last_successful_health_check: Thu Jan 23 11:29:20 2020
	 last_skip_health_check: Thu Jan 23 11:29:10 2020
	 last_failed_health_check: Thu Jan 23 11:27:40 2020

2020-01-23 11:29:28: pid 3086: LOG:  health check stats: total_count: 16 success_count: 16 skip_count: 0 retry_count: 0, max_retry_count: 0 max_duration: 994 min_duration: 1
2020-01-23 11:29:28: pid 3086: LOG:  health check stats: last_health_check: Thu Jan 23 11:29:28 2020
	 last_successful_health_check: Thu Jan 23 11:29:28 2020
	 last_skip_health_check: Thu Jan  1 09:00:00 1970
	 last_failed_health_check: Thu Jan  1 09:00:00 1970

2020-01-23 11:29:30: pid 3087: LOG:  health check stats: total_count: 16 success_count: 6 skip_count: 9 retry_count: 4, max_retry_count: 4 max_duration: 975 min_duration: 913
2020-01-23 11:29:30: pid 3087: LOG:  health check stats: last_health_check: Thu Jan 23 11:29:30 2020
	 last_successful_health_check: Thu Jan 23 11:29:30 2020
	 last_skip_health_check: Thu Jan 23 11:29:10 2020
	 last_failed_health_check: Thu Jan 23 11:27:40 2020
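
The "Thu Jan  1 09:00:00 1970" values above are simply a time_t of 0 (an event that has not happened yet) rendered in JST. As a sketch of one way a display command might handle that case, the hypothetical helper below prints "never" for 0 and otherwise uses the same layout as the log. The name hc_format_timestamp() and the "never" wording are assumptions, not part of the current code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: format a stats timestamp, treating 0 as "not happened yet" */
static const char *
hc_format_timestamp(time_t t, char *buf, size_t len)
{
	struct tm  *tm;

	assert(buf != NULL && len > 0);

	if (t == (time_t) 0)
	{
		snprintf(buf, len, "never");
		return buf;
	}

	/* local time, same layout as the log lines above */
	tm = localtime(&t);
	if (tm == NULL || strftime(buf, len, "%a %b %e %H:%M:%S %Y", tm) == 0)
		snprintf(buf, len, "(invalid)");
	return buf;
}
```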

Next I am going to store these statistics in shared memory, and then
implement show/PCP commands to display them.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp