[pgpool-hackers: 194] Re: pgpool health check failsafe mechanism

Tue Apr 9 09:05:14 JST 2013

Well, "will go in never ending loop" is not quite an accurate
statement.  What happens here is that pgpool tries to fail over every
health_check_period, and the failover is canceled because the
DISALLOW_TO_FAILOVER flag is set. This particular setup has at least
two use cases:

- PostgreSQL is protected by heartbeat/pacemaker or any other HA (High
  Availability) software. When a PostgreSQL server fails, the HA
  software is responsible for having the standby PostgreSQL take over
  the node. Once PostgreSQL comes up, pgpool will start to accept
  connections from clients.

- Admin wants to upgrade PostgreSQL immediately because of security
  issues with it (like the recent PostgreSQL case). The admin stops
  the PostgreSQL servers one by one and upgrades them. While a
  PostgreSQL server is stopped, pgpool refuses to accept connections
  from clients, so database consistency among the database nodes is
  safely kept. This minimizes the downtime.
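
To make the behavior described above concrete, here is a minimal C
sketch of the decision taken on each health check pass. It is not the
pgpool source: the enum names and health_check_action() are
hypothetical, and the retry/failover branches are my assumption about
what the elided part of the quoted main.c code does.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for pgpool's per-backend flag; not the real definitions. */
enum backend_flag { ALLOW_TO_FAILOVER, DISALLOW_TO_FAILOVER };

/* What a single health check pass decides for one node (illustrative only). */
enum health_action {
    ACTION_NONE,      /* node answered, nothing to do */
    ACTION_CANCELED,  /* node is down but failover is disallowed */
    ACTION_RETRY,     /* node is down, retry budget not yet used up (assumed) */
    ACTION_FAILOVER   /* node is down, retries exhausted (assumed) */
};

enum health_action health_check_action(bool node_is_up,
                                       enum backend_flag flag,
                                       int retrycnt,
                                       int max_retries)
{
    if (node_is_up)
        return ACTION_NONE;

    /* Mirrors the quoted check: the flag is tested before the retry count. */
    if (flag == DISALLOW_TO_FAILOVER)
        return ACTION_CANCELED;

    if (retrycnt <= max_retries)
        return ACTION_RETRY;

    return ACTION_FAILOVER;
}
```

With DISALLOW_TO_FAILOVER the function returns ACTION_CANCELED for a
down node on every pass, whatever the retry count is. That is the
repeated "failover is canceld" log line, not a hang inside one check.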

In summary, I see no point in changing the current behavior of pgpool.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Hi Tatsuo Ishii,
> 
> By looking at the source code, it seems that the health check mechanism
> depends on the failover options (fail_over_on_backend_error + backend_flag)
> in non-parallel mode and will go in never ending loop if failover is
> disabled (as I mentioned earlier in Issue#3 in the first email), i.e.
> 
> pgpool2/main.c
> 
>> /* do we need health checking for PostgreSQL? */
>> if (pool_config->health_check_period > 0)
>> {
>> ...
>> ...
>> if (POOL_DISALLOW_TO_FAILOVER(BACKEND_INFO(sts).flag))
>> {
>>      pool_log("health_check: %d failover is canceld because failover is disallowed", sts);
>> }
>> else if (retrycnt <= pool_config->health_check_max_retries)
>> ...
>> ...
>> }
> 
> 
> It seems that failover depends not only on the fail_over_on_backend_error
> configuration option but also on backend_flag. If
> fail_over_on_backend_error is "on" but backend_flag is
> "DISALLOW_TO_FAILOVER", it will not trigger failover for the related slave
> node. On the other hand, if a child process finds a connection error for
> any related node, it aborts. As you suggested earlier, it seems the only
> appropriate thing to do when a connection error to any related node is
> found is to fail over and restart all child processes.
> 
> In the example I mentioned earlier (Issue#3 in the first email), there is a
> dead end: pgpool goes into an endless loop and becomes unresponsive to new
> connections if we use the following configuration settings, i.e.
> 
> pgpool.conf
> 
>> fail_over_on_backend_error  = on
>> backend_flag0 = 'DISALLOW_TO_FAILOVER'
>> backend_flag1 = 'DISALLOW_TO_FAILOVER'
>> health_check_period = 5
>> health_check_timeout = 1
>> health_check_retry_delay = 10
> 
> 
> On each new connection,
> new_connection()->notice_backend_error()->degenerate_backend_set()
> gives the following warning, i.e.
> 
>> if (POOL_DISALLOW_TO_FAILOVER(BACKEND_INFO(node_id_set[i]).flag))
>> {
>>      pool_log("degenerate_backend_set: %d failover request from pid %d is canceld because failover is disallowed", node_id_set[i], getpid());
>>      continue;
>> }
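
The effect of the quoted check on a failover request can be sketched in
isolation as below. This is only an illustration under assumptions: the
enum, the function name accept_failover_requests() and the plain arrays
are hypothetical and stand in for pgpool's own structures.

```c
#include <stddef.h>

/* Hypothetical per-node flag; not pgpool's real type. */
enum node_flag { NODE_ALLOW_TO_FAILOVER, NODE_DISALLOW_TO_FAILOVER };

/*
 * Copy into accepted[] the requested node ids whose flag allows failover.
 * flags[] is indexed by node id. Returns the number of accepted ids;
 * 0 means every request was canceled and no failover would start.
 */
size_t accept_failover_requests(const int *requested, size_t n,
                                const enum node_flag *flags,
                                int *accepted)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (flags[requested[i]] == NODE_DISALLOW_TO_FAILOVER)
            continue;   /* request for this node is canceled */
        accepted[kept++] = requested[i];
    }
    return kept;
}
```

With both nodes set to DISALLOW_TO_FAILOVER, as in the pgpool.conf
above, a request for node 1 yields zero accepted ids, so the node is
never degenerated and the next connection attempt hits the same error.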
> 
> 
> As mentioned in the fail_over_on_backend_error documentation, failover can
> happen even when fail_over_on_backend_error=off, if pgpool detects an
> administrative shutdown of the postmaster, i.e.
> 
> http://www.pgpool.net/docs/latest/pgpool-en.html
> 
>> fail_over_on_backend_error V2.3 -
>> If true, and an error occurs when reading/writing to the backend
>> communication, pgpool-II will trigger the fail over procedure. If set to
>> false, pgpool will report an error and disconnect the session. If you set
>> this parameter to off, it is recommended that you turn on health checking.
>> Please note that even if this parameter is set to off, however, pgpool will
>> also do the fail over when pgpool detects the administrative shutdown of
>> postmaster.
>> You need to reload pgpool.conf if you change this value.
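
The rule in the quoted documentation paragraph can be written as a
small decision function. This is a sketch of the documented behavior
only, with hypothetical names. It deliberately ignores backend_flag,
which, as discussed in this thread, can still cancel the resulting
failover request.

```c
#include <stdbool.h>

/* Illustrative outcome of a read/write error on backend communication. */
enum error_reaction {
    REACT_FAILOVER,            /* trigger the fail over procedure */
    REACT_DISCONNECT_SESSION   /* report an error and disconnect the session */
};

enum error_reaction react_to_backend_error(bool fail_over_on_backend_error,
                                           bool admin_shutdown_detected)
{
    /* Per the docs, an administrative shutdown fails over even when off. */
    if (admin_shutdown_detected)
        return REACT_FAILOVER;

    return fail_over_on_backend_error ? REACT_FAILOVER
                                      : REACT_DISCONNECT_SESSION;
}
```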
> 
> 
> If failover/degeneration is the only option for handling the situation
> where a slave node is unresponsive, crashed, etc., can't the code be
> allowed to fail over on a connection error (even when failover is
> disabled)? Thanks.
> 
> Best Regards,
> Asif Naeem
> 
> On Wed, Apr 3, 2013 at 11:43 AM, Asif Naeem <anaeem.it at gmail.com> wrote:
> 
>> Hi,
>>
>> We are facing an issue with the pgpool health check failsafe mechanism in
>> a production environment. I have previously posted this issue at
>> http://www.pgpool.net/mantisbt/view.php?id=50. I have observed the
>> following issues with pgpool-II version 3.2.3 (built with the latest
>> source code), i.e.
>>
>> Versions used, i.e.
>>
>>> pgpool-II version 3.2.3
>>> postgresql 9.2.3 (Master + Slave)
>>
>>
>> 1. In a master/slave configuration, if health check and failover are
>> enabled, i.e.
>>
>> pgpool.conf
>>
>>> backend_flag0 = 'ALLOW_TO_FAILOVER'
>>> backend_flag1 = 'ALLOW_TO_FAILOVER'
>>> health_check_period = 5
>>> health_check_timeout = 1
>>> health_check_max_retries = 2
>>> health_check_retry_delay = 10
>>> load_balance_mode = off
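
For a feel of the timing these settings imply, here is a rough
estimate of how long the health check alone would need before giving
up on a dead node. It assumes that every attempt may wait up to
health_check_timeout and that health_check_retry_delay is slept
between attempts; the function name is hypothetical and this is not
taken from the pgpool source.

```c
/*
 * Rough upper bound, in seconds, on the time the health check spends
 * on a dead node before its retries are exhausted (assumed model:
 * max_retries + 1 attempts, each up to `timeout` seconds, with
 * `retry_delay` seconds slept between consecutive attempts).
 */
int health_check_give_up_seconds(int timeout, int max_retries, int retry_delay)
{
    int attempts = max_retries + 1;

    return attempts * timeout + max_retries * retry_delay;
}
```

With the values above (timeout 1, max retries 2, retry delay 10) this
gives 3 * 1 + 2 * 10 = 23 seconds, not counting the wait of up to
health_check_period seconds before the first failed check starts.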
>>
>>
>> On Linux64, the master server is running fine without load balancing.
>> When a network interruption or some other failure suddenly occurs (I
>> mimicked the situation by forcefully shutting down the dbserver in
>> immediate mode, etc.), pgpool is not able to connect to the slave server.
>> After that, the first connection attempt to pgpool returns without any
>> error/warning message, and pgpool does a failover and kills all child
>> processes. Does it make sense that, when there is no load balancing and
>> the master dbserver is serving queries well, disconnection of the slave
>> server triggers a failover?
>>
>> pgpool.log
>>
>>> ....
>>> 2013-04-02 17:24:36 DEBUG: pid 65431: I am 65431 accept fd 6
>>> 2013-04-02 17:24:36 DEBUG: pid 65431: read_startup_packet:
>>> application_name: psql
>>> 2013-04-02 17:24:36 DEBUG: pid 65431: Protocol Major: 3 Minor: 0
>>> database: postgres user: asif
>>> 2013-04-02 17:24:36 DEBUG: pid 65431: new_connection: connecting 0 backend
>>> 2013-04-02 17:24:36 DEBUG: pid 65431: new_connection: connecting 1 backend
>>> 2013-04-02 17:24:36 ERROR: pid 65431: connect_inet_domain_socket:
>>> getsockopt() detected error: Connection refused
>>> 2013-04-02 17:24:36 ERROR: pid 65431: connection to localhost(7445) failed
>>> 2013-04-02 17:24:36 ERROR: pid 65431: new_connection: create_cp() failed
>>> 2013-04-02 17:24:36 LOG:   pid 65431: degenerate_backend_set: 1 fail over
>>> request from pid 65431
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler called
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: starting to
>>> select new master node
>>> 2013-04-02 17:24:36 LOG:   pid 65417: starting degeneration. shutdown
>>> host localhost(7445)
>>> 2013-04-02 17:24:36 LOG:   pid 65417: Restart all children
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65418
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65419
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65420
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65421
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65422
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65423
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65424
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65425
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65426
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65427
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65428
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65429
>>> ...
>>> ...
>>
>>
>> 2. With the same configuration as above, if I disable failover, i.e.
>>
>> pgpool.conf
>>
>>> backend_flag0 = 'DISALLOW_TO_FAILOVER'
>>> backend_flag1 = 'DISALLOW_TO_FAILOVER'
>>> health_check_period = 5
>>> health_check_timeout = 1
>>> health_check_max_retries = 2
>>> health_check_retry_delay = 10
>>> load_balance_mode = off
>>
>>
>> On Linux64, the master server is running fine, with no load balancing
>> and no failover. When the slave server suddenly appears to be
>> disconnected because of a network interruption or some other reason (I
>> mimicked it by forcefully shutting down the dbserver in immediate mode,
>> etc.), no connection attempt to pgpool succeeds until the health check
>> completes, and the master database server log shows the following
>> messages, i.e.
>>
>> dbserver.log
>>   ...
>>   ...
>>   LOG: incomplete startup packet
>>   LOG: incomplete startup packet
>>   LOG: incomplete startup packet
>>   LOG: incomplete startup packet
>>   LOG: incomplete startup packet
>>   ...
>>
>> 3. While testing this scenario on my MacOSX machine (gcc), it seems that
>> the health check never completes and runs endlessly with the same pgpool
>> configuration settings as in issue #2 above, and it completely prevents
>> me from connecting to pgpool any more, i.e.
>>
>> pgpool.log
>>
>>> ...
>>> ...
>>> 2013-04-03 11:29:29 DEBUG: pid 44263: retrying *679* th health checking
>>> 2013-04-03 11:29:29 DEBUG: pid 44263: health_check: 0 th DB node status: 2
>>> 2013-04-03 11:29:29 DEBUG: pid 44263: pool_ssl: SSL requested but SSL
>>> support is not available
>>> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: auth kind: 0
>>> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: backend key data received
>>> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: transaction state: I
>>> 2013-04-03 11:29:29 DEBUG: pid 44263: health_check: 1 th DB node status: 2
>>> 2013-04-03 11:29:29 ERROR: pid 44263: connect_inet_domain_socket:
>>> getsockopt() detected error: Connection refused
>>> 2013-04-03 11:29:29 ERROR: pid 44263: make_persistent_db_connection:
>>> connection to localhost(7445) failed
>>> 2013-04-03 11:29:29 ERROR: pid 44263: health check failed. 1 th host
>>> localhost at port 7445 is down
>>> 2013-04-03 11:29:29 LOG:   pid 44263: health_check: 1 failover is canceld
>>> because failover is disallowed
>>> 2013-04-03 11:29:34 DEBUG: pid 44263: retrying *680* th health checking
>>> 2013-04-03 11:29:34 DEBUG: pid 44263: health_check: 0 th DB node status: 2
>>> 2013-04-03 11:29:34 DEBUG: pid 44263: pool_ssl: SSL requested but SSL
>>> support is not available
>>> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: auth kind: 0
>>> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: backend key data received
>>> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: transaction state: I
>>> 2013-04-03 11:29:34 DEBUG: pid 44263: health_check: 1 th DB node status: 2
>>> 2013-04-03 11:29:34 ERROR: pid 44263: connect_inet_domain_socket:
>>> getsockopt() detected error: Connection refused
>>> 2013-04-03 11:29:34 ERROR: pid 44263: make_persistent_db_connection:
>>> connection to localhost(7445) failed
>>> 2013-04-03 11:29:34 ERROR: pid 44263: health check failed. 1 th host
>>> localhost at port 7445 is down
>>> 2013-04-03 11:29:34 LOG:   pid 44263: health_check: 1 failover is canceld
>>> because failover is disallowed
>>> ...
>>> ...
>>
>>
>> I will try it on a Linux64 machine too. Thanks.
>>
>> Best Regards,
>> Asif Naeem
>>
>>