[pgpool-hackers: 193] Re: pgpool health check failsafe mechanism

Mon Apr 8 17:51:19 JST 2013

Hi Tatsuo Ishii,

Looking at the source code, it seems that the health check mechanism depends
on the failover options (fail_over_on_backend_error + backend_flag) in
non-parallel mode, and it goes into a never-ending loop if failover is
disabled (as I mentioned in Issue #3 of my first email), i.e.

pgpool2/main.c

> /* do we need health checking for PostgreSQL? */
> if (pool_config->health_check_period > 0)
> {
> ...
> ...
> if (POOL_DISALLOW_TO_FAILOVER(BACKEND_INFO(sts).flag))
> {
>      pool_log("health_check: %d failover is canceld because failover is disallowed", sts);
> }
> else if (retrycnt <= pool_config->health_check_max_retries)
> ...
> ...
> }

It seems failover depends not only on the fail_over_on_backend_error
configuration option but also on backend_flag. If
fail_over_on_backend_error is "on" but backend_flag is
"DISALLOW_TO_FAILOVER", failover is not triggered for the related slave
node. On the other hand, if a child process finds a connection error for
any related node, it aborts. As you suggested earlier, it seems the only
appropriate thing to do is to fail over and restart all child processes
when a connection error to any related node is found.

In the example I mentioned earlier (Issue #3 in my first email), there is a
dead end: pgpool goes into an endless loop and becomes unresponsive to new
connections if we use the following configuration settings, i.e.

pgpool.conf

> fail_over_on_backend_error  = on
> backend_flag0 = 'DISALLOW_TO_FAILOVER'
> backend_flag1 = 'DISALLOW_TO_FAILOVER'
> health_check_period = 5
> health_check_timeout = 1
> health_check_retry_delay = 10
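
To make the dead end concrete, here is a minimal C model of the branch quoted
from main.c above. It is only a sketch of the behaviour described in this
message, not pgpool's actual code: node_model, health_check_rounds() and the
"detached" field are names invented for illustration, and a down node is
modelled simply as one whose check fails on every round.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one backend node (not pgpool's struct). */
typedef struct {
    bool disallow_failover; /* models backend_flag = 'DISALLOW_TO_FAILOVER' */
    bool down;              /* true: every health check on this node fails  */
    bool detached;          /* true once the model has degenerated the node */
} node_model;

/*
 * Run up to max_rounds health-check rounds against one node and return how
 * many rounds ended in a failed check. Mirrors the quoted branch: a
 * disallowed node is only logged, otherwise retries are counted and the
 * node is detached once max_retries is exceeded.
 */
int health_check_rounds(node_model *node, int max_retries, int max_rounds)
{
    int failed = 0;
    int retrycnt = 0;

    assert(node != NULL && max_retries >= 0 && max_rounds >= 0);

    for (int round = 0; round < max_rounds && !node->detached; round++) {
        if (!node->down)
            continue;               /* healthy: nothing to do this round */
        failed++;
        if (node->disallow_failover)
            continue;               /* "failover is canceld": node stays attached */
        if (retrycnt <= max_retries)
            retrycnt++;             /* retry on the next round */
        else
            node->detached = true;  /* failover: stop checking this node */
    }
    return failed;
}
```

In this model, a node with failover allowed is detached once the retries are
used up, so the loop ends. With DISALLOW_TO_FAILOVER the failing node is
never detached and every round fails, which matches the ever-growing retry
counter in the pgpool.log from my first email.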

On each new connection,
new_connection()->notice_backend_error()->degenerate_backend_set()
gives the following warning, i.e.

> if (POOL_DISALLOW_TO_FAILOVER(BACKEND_INFO(node_id_set[i]).flag))
> {
>      pool_log("degenerate_backend_set: %d failover request from pid %d is canceld because failover is disallowed", node_id_set[i], getpid());
>      continue;
> }
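
The effect of that check can be sketched in isolation. The following C
fragment is an illustrative model, not the pgpool source:
count_accepted_requests() and the plain bool array standing in for the
backend flags are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Model of the loop quoted above: walk the requested node ids and count
 * those whose failover request would go ahead. disallow[id] stands in for
 * POOL_DISALLOW_TO_FAILOVER() on node id (hypothetical representation).
 */
int count_accepted_requests(const int *node_id_set, int count,
                            const bool *disallow, int n_nodes)
{
    int accepted = 0;

    assert(node_id_set != NULL && disallow != NULL);

    for (int i = 0; i < count; i++) {
        int id = node_id_set[i];
        if (id < 0 || id >= n_nodes)
            continue;   /* ignore ids outside the configured nodes */
        if (disallow[id])
            continue;   /* request cancelled: failover is disallowed */
        accepted++;
    }
    return accepted;
}
```

With both backend_flag0 and backend_flag1 set to DISALLOW_TO_FAILOVER, a
request for node 1 yields zero accepted requests in this model, so the failed
node stays in the pool and the next connection attempt hits the same error.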

As mentioned in the fail_over_on_backend_error documentation, failover can
happen even when fail_over_on_backend_error = off, namely when pgpool detects
an administrative shutdown of the postmaster, i.e.

http://www.pgpool.net/docs/latest/pgpool-en.html

> fail_over_on_backend_error V2.3 -
> If true, and an error occurs when reading/writing to the backend
> communication, pgpool-II will trigger the fail over procedure. If set to
> false, pgpool will report an error and disconnect the session. If you set
> this parameter to off, it is recommended that you turn on health checking.
> Please note that even if this parameter is set to off, however, pgpool will
> also do the fail over when pgpool detects the administrative shutdown of
> postmaster.
> You need to reload pgpool.conf if you change this value.
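
My reading of that paragraph, written out as a small C decision function.
This is only a sketch of the documented behaviour; the enum and function
names are made up for the example, and it deliberately leaves out
backend_flag, which the documentation paragraph does not mention.

```c
#include <stdbool.h>

/* Possible reactions to a backend communication problem (hypothetical names). */
typedef enum {
    ACTION_FAILOVER,            /* trigger the fail over procedure */
    ACTION_DISCONNECT_SESSION   /* report an error and drop only this session */
} backend_error_action;

/*
 * fail_over_on_backend_error: value of the pgpool.conf parameter.
 * admin_shutdown: true when an administrative shutdown of postmaster
 * was detected.
 */
backend_error_action on_backend_error(bool fail_over_on_backend_error,
                                      bool admin_shutdown)
{
    if (admin_shutdown)
        return ACTION_FAILOVER;     /* happens even when the parameter is off */
    if (fail_over_on_backend_error)
        return ACTION_FAILOVER;
    return ACTION_DISCONNECT_SESSION;
}
```

So on_backend_error(false, true) still yields ACTION_FAILOVER, which is the
case the documentation calls out explicitly.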

If failover/degenerate is the only option for handling the situation where a
slave node is unresponsive/crashed etc., can't the code be allowed to fail
over on a connection error (even when failover is disabled)? Thanks.

Best Regards,
Asif Naeem

On Wed, Apr 3, 2013 at 11:43 AM, Asif Naeem <anaeem.it at gmail.com> wrote:

> Hi,
>
> We are facing an issue with the pgpool health check failsafe mechanism in
> a production environment. I previously posted this issue at
> http://www.pgpool.net/mantisbt/view.php?id=50. I have observed 2 issues
> with pgpool-II version 3.2.3 (built with the latest source code), i.e.
>
> Versions used, i.e.
>
>> pgpool-II version 3.2.3
>> postgresql 9.2.3 (Master + Slave)
>
>
> 1. In a master/slave configuration, if health check and failover are
> enabled, i.e.
>
> pgpool.conf
>
>> backend_flag0 = 'ALLOW_TO_FAILOVER'
>> backend_flag1 = 'ALLOW_TO_FAILOVER'
>> health_check_period = 5
>> health_check_timeout = 1
>> health_check_max_retries = 2
>> health_check_retry_delay = 10
>> load_balance_mode = off
>
>
> On Linux64, the master server is running fine without load balancing. If a
> network interruption or some other failure then occurs (I mimic the
> situation by forcefully shutting down the dbserver in immediate mode etc.),
> pgpool is not able to connect to the slave server. After that, the first
> connection attempt to pgpool returns without any error/warning message, and
> pgpool does a failover and kills all child processes. Does it make sense
> that, when there is no load balancing and the master dbserver is serving
> queries well, the disconnection of the slave server triggers a failover?
>
> pgpool.log
>
>> ....
>> 2013-04-02 17:24:36 DEBUG: pid 65431: I am 65431 accept fd 6
>> 2013-04-02 17:24:36 DEBUG: pid 65431: read_startup_packet:
>> application_name: psql
>> 2013-04-02 17:24:36 DEBUG: pid 65431: Protocol Major: 3 Minor: 0
>> database: postgres user: asif
>> 2013-04-02 17:24:36 DEBUG: pid 65431: new_connection: connecting 0 backend
>> 2013-04-02 17:24:36 DEBUG: pid 65431: new_connection: connecting 1 backend
>> 2013-04-02 17:24:36 ERROR: pid 65431: connect_inet_domain_socket:
>> getsockopt() detected error: Connection refused
>> 2013-04-02 17:24:36 ERROR: pid 65431: connection to localhost(7445) failed
>> 2013-04-02 17:24:36 ERROR: pid 65431: new_connection: create_cp() failed
>> 2013-04-02 17:24:36 LOG:   pid 65431: degenerate_backend_set: 1 fail over
>> request from pid 65431
>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler called
>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: starting to
>> select new master node
>> 2013-04-02 17:24:36 LOG:   pid 65417: starting degeneration. shutdown
>> host localhost(7445)
>> 2013-04-02 17:24:36 LOG:   pid 65417: Restart all children
>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65418
>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65419
>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65420
>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65421
>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65422
>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65423
>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65424
>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65425
>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65426
>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65427
>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65428
>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65429
>> ...
>> ...
>
>
> 2. With the same configuration as before, if I disable failover, i.e.
>
> pgpool.conf
>
>> backend_flag0 = 'DISALLOW_TO_FAILOVER'
>> backend_flag1 = 'DISALLOW_TO_FAILOVER'
>> health_check_period = 5
>> health_check_timeout = 1
>> health_check_max_retries = 2
>> health_check_retry_delay = 10
>> load_balance_mode = off
>
>
> On Linux64, the master server is running fine with no load balancing and
> no failover. If the slave server then suddenly appears to be disconnected
> because of a network interruption or any other reason (I mimic it by
> forcefully shutting down the dbserver in immediate mode etc.), no
> connection attempt to pgpool succeeds until the health check completes, and
> the master database server log shows the following messages, i.e.
>
> dbserver.log
>   ...
>   ...
>   LOG: incomplete startup packet
>   LOG: incomplete startup packet
>   LOG: incomplete startup packet
>   LOG: incomplete startup packet
>   LOG: incomplete startup packet
>   ...
>
> 3. While testing this scenario on my MacOSX machine (gcc), it seems that
> the health check never completes and runs endlessly with the same pgpool
> configuration settings as in issue #2 above, and it completely prevents me
> from connecting to pgpool any more, i.e.
>
> pgpool.log
>
>> ...
>> ...
>> 2013-04-03 11:29:29 DEBUG: pid 44263: retrying *679* th health checking
>> 2013-04-03 11:29:29 DEBUG: pid 44263: health_check: 0 th DB node status: 2
>> 2013-04-03 11:29:29 DEBUG: pid 44263: pool_ssl: SSL requested but SSL
>> support is not available
>> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: auth kind: 0
>> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: backend key data received
>> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: transaction state: I
>> 2013-04-03 11:29:29 DEBUG: pid 44263: health_check: 1 th DB node status: 2
>> 2013-04-03 11:29:29 ERROR: pid 44263: connect_inet_domain_socket:
>> getsockopt() detected error: Connection refused
>> 2013-04-03 11:29:29 ERROR: pid 44263: make_persistent_db_connection:
>> connection to localhost(7445) failed
>> 2013-04-03 11:29:29 ERROR: pid 44263: health check failed. 1 th host
>> localhost at port 7445 is down
>> 2013-04-03 11:29:29 LOG:   pid 44263: health_check: 1 failover is canceld
>> because failover is disallowed
>> 2013-04-03 11:29:34 DEBUG: pid 44263: retrying *680* th health checking
>> 2013-04-03 11:29:34 DEBUG: pid 44263: health_check: 0 th DB node status: 2
>> 2013-04-03 11:29:34 DEBUG: pid 44263: pool_ssl: SSL requested but SSL
>> support is not available
>> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: auth kind: 0
>> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: backend key data received
>> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: transaction state: I
>> 2013-04-03 11:29:34 DEBUG: pid 44263: health_check: 1 th DB node status: 2
>> 2013-04-03 11:29:34 ERROR: pid 44263: connect_inet_domain_socket:
>> getsockopt() detected error: Connection refused
>> 2013-04-03 11:29:34 ERROR: pid 44263: make_persistent_db_connection:
>> connection to localhost(7445) failed
>> 2013-04-03 11:29:34 ERROR: pid 44263: health check failed. 1 th host
>> localhost at port 7445 is down
>> 2013-04-03 11:29:34 LOG:   pid 44263: health_check: 1 failover is canceld
>> because failover is disallowed
>> ...
>> ...
>
>
> I will try it on Linux64 machine too. Thanks.
>
> Best Regards,
> Asif Naeem
>
>