[pgpool-hackers: 197] Re: pgpool health check failsafe mechanism

Tatsuo Ishii ishii at postgresql.org
Thu Apr 11 08:33:56 JST 2013


> Thank you Tatsuo. I would say it "will go in a never-ending loop" if any
> of the slaves stops responding (until it is alive again), as was observed
> earlier, i.e.
> 
> pgpool.log
> 
>> ....
>> 2013-04-04 12:34:41 DEBUG: pid 44263: retrying *10867* th health checking
>> 2013-04-04 12:34:41 DEBUG: pid 44263: health_check: 0 th DB node status: 2
>> 2013-04-04 12:34:41 DEBUG: pid 44263: pool_ssl: SSL requested but SSL support is not available
>> 2013-04-04 12:34:41 DEBUG: pid 44263: s_do_auth: auth kind: 0
>> 2013-04-04 12:34:41 DEBUG: pid 44263: s_do_auth: backend key data received
>> 2013-04-04 12:34:41 DEBUG: pid 44263: s_do_auth: transaction state: I
>> 2013-04-04 12:34:41 DEBUG: pid 44263: health_check: 1 th DB node status: 2
>> 2013-04-04 12:34:41 ERROR: pid 44263: connect_inet_domain_socket: getsockopt() detected error: Connection refused
>> 2013-04-04 12:34:41 ERROR: pid 44263: make_persistent_db_connection: connection to localhost(7445) failed
>> 2013-04-04 12:34:41 ERROR: pid 44263: health check failed. 1 th host localhost at port 7445 is down
>> 2013-04-04 12:34:41 LOG:   pid 44263: health_check: 1 failover is canceld because failover is disallowed
>> ....
>> ....
> 
> 
> AFAIU from discussing it with you, it is a feature, not a bug. In the
> presented scenario, if any of the slaves goes down or is missing (maybe
> because of a network issue), pgpool will not respond to any new connection
> (with no warning or message) until the slave becomes available again. Do
> you agree? Thanks.

It's a feature if you disable failover.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Best Regards,
> Asif Naeem
> 
> On Tue, Apr 9, 2013 at 5:05 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> 
>> Well, "will go in never ending loop" is a little bit incorrect
>> statement.  What happens here is, pgpool tries to fail over every
>> health_check_period and it is canceled because DISALLOW_TO_FAILOVER
>> flag was set. This particular set up has at least two use cases:
>>
>> - PostgreSQL is protected by heartbeat/pacemaker or any other HA (High
>>   Availability) software. When a PostgreSQL server fails, the HA software
>>   is responsible for failing the node over to the standby PostgreSQL.
>>   Once PostgreSQL comes back up, pgpool will start to accept connections
>>   from clients again (a minimal config sketch for this case follows after
>>   this list).
>>
>> - The admin wants to upgrade PostgreSQL immediately because of security
>>   issues with it (like the recent PostgreSQL security releases). He stops
>>   the PostgreSQL servers one by one and upgrades them. While the admin
>>   stops PostgreSQL, pgpool refuses to accept connections from clients, so
>>   database consistency among the database nodes is safely kept. This
>>   minimizes the downtime.
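>>
>> For illustration, a minimal pgpool.conf sketch for the heartbeat/pacemaker
>> case might look like the following. The host names and ports are
>> placeholders, and the health check values are simply the ones used later
>> in this thread:
>>
>>   # HA software (heartbeat/pacemaker), not pgpool, fails the backends over
>>   backend_hostname0 = 'node0.example.com'
>>   backend_port0 = 5432
>>   backend_weight0 = 1
>>   backend_flag0 = 'DISALLOW_TO_FAILOVER'
>>   backend_hostname1 = 'node1.example.com'
>>   backend_port1 = 5432
>>   backend_weight1 = 1
>>   backend_flag1 = 'DISALLOW_TO_FAILOVER'
>>   # keep health checking on so pgpool notices when the backend is back
>>   health_check_period = 5
>>   health_check_timeout = 1
>>   health_check_max_retries = 2
>>   health_check_retry_delay = 10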
>>
>> In summary, I see no point in changing the current behavior of pgpool.
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>>
>> > Hi Tatsuo Ishii,
>> >
>> > Looking at the source code, it seems that the health check mechanism
>> > depends on the failover options (fail_over_on_backend_error +
>> > backend_flag) for non-parallel mode and will go into a never-ending loop
>> > if failover is disabled (as I mentioned earlier in Issue #3 of my first
>> > email), i.e.
>> >
>> > pgpool2/main.c
>> >
>> >> /* do we need health checking for PostgreSQL? */
>> >> if (pool_config->health_check_period > 0)
>> >> {
>> >> ...
>> >> ...
>> >> if (POOL_DISALLOW_TO_FAILOVER(BACKEND_INFO(sts).flag))
>> >> {
>> >>      pool_log("health_check: %d failover is canceld because failover is
>> >> disallowed", sts);
>> >> }
>> >> else if (retrycnt <= pool_config->health_check_max_retries)
>> >> ...
>> >> ...
>> >> }
>> >
>> >
>> > It seems failover depends not only on the fail_over_on_backend_error
>> > configuration option but on backend_flag as well. If
>> > fail_over_on_backend_error is "on" but backend_flag is
>> > "DISALLOW_TO_FAILOVER", it will not trigger failover for the related
>> > slave node. On the other hand, if a child process finds an error in the
>> > connection to any related node, it aborts. As you suggested earlier, it
>> > seems the only appropriate thing to do when a connection error to any
>> > related node is found is to fail over and restart all child processes.
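>> >
>> > To summarize my understanding, here is a simplified sketch of that
>> > decision. The macro and function names come from the pgpool-II source
>> > quoted in this thread, but this is not the actual call site;
>> > handle_backend_error() is a hypothetical helper just for illustration,
>> > and pool_config->fail_over_on_backend_error is my assumption about the
>> > internal field name:
>> >
>> >> /* Simplified sketch, not the real pgpool-II code path. */
>> >> static void handle_backend_error(int node_id)
>> >> {
>> >>     if (POOL_DISALLOW_TO_FAILOVER(BACKEND_INFO(node_id).flag))
>> >>     {
>> >>         /* backend_flagN = 'DISALLOW_TO_FAILOVER': the request is canceled */
>> >>         pool_log("failover request for node %d is canceled", node_id);
>> >>     }
>> >>     else if (pool_config->fail_over_on_backend_error)
>> >>     {
>> >>         /* both settings allow it: degenerate the node; children get restarted */
>> >>         degenerate_backend_set(&node_id, 1);
>> >>     }
>> >>     else
>> >>     {
>> >>         /* fail_over_on_backend_error = off: report the error and disconnect
>> >>            the session (unless an administrative shutdown of the postmaster
>> >>            is detected, per the documentation quoted below) */
>> >>     }
>> >> }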
>> >
>> > In the example (Issue #3 in my first email) I mentioned earlier, there
>> > is a dead end: pgpool goes into an endless loop and becomes unresponsive
>> > to new connections if we use the following configuration settings, i.e.
>> >
>> > pgpool.conf
>> >
>> >> fail_over_on_backend_error  = on
>> >> backend_flag0 = 'DISALLOW_TO_FAILOVER'
>> >> backend_flag1 = 'DISALLOW_TO_FAILOVER'
>> >> health_check_period = 5
>> >> health_check_timeout = 1
>> >> health_check_retry_delay = 10
>> >
>> >
>> > On each new connection,
>> > new_connection()->notice_backend_error()->degenerate_backend_set()
>> > gives the following warning, i.e.
>> >
>> >> if (POOL_DISALLOW_TO_FAILOVER(BACKEND_INFO(node_id_set[i]).flag))
>> >> {
>> >>      pool_log("degenerate_backend_set: %d failover request from pid %d is canceld because failover is disallowed", node_id_set[i], getpid());
>> >>      continue;
>> >> }
>> >
>> >
>> > As mentioned in the fail_over_on_backend_error documentation, failover
>> > can happen even when fail_over_on_backend_error = off, when pgpool
>> > detects an administrative shutdown of the postmaster, i.e.
>> >
>> > http://www.pgpool.net/docs/latest/pgpool-en.html
>> >
>> >> fail_over_on_backend_error V2.3 -
>> >> If true, and an error occurs when reading/writing to the backend
>> >> communication, pgpool-II will trigger the fail over procedure. If set to
>> >> false, pgpool will report an error and disconnect the session. If you
>> >> set this parameter to off, it is recommended that you turn on health
>> >> checking. Please note that even if this parameter is set to off,
>> >> however, pgpool will also do the fail over when pgpool detects the
>> >> administrative shutdown of postmaster.
>> >> You need to reload pgpool.conf if you change this value.
>> >
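>> > For contrast, the combination the documentation recommends (failover on
>> > backend errors off, but health checking on) would be something like the
>> > following sketch; the values are only illustrative:
>> >
>> >> fail_over_on_backend_error = off
>> >> backend_flag0 = 'ALLOW_TO_FAILOVER'
>> >> backend_flag1 = 'ALLOW_TO_FAILOVER'
>> >> health_check_period = 5
>> >> health_check_timeout = 1
>> >> health_check_max_retries = 2
>> >> health_check_retry_delay = 10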
>> >
>> > If failover/degeneration is the only option to handle the situation
>> > where a slave node is non-responsive/crashed etc., can't the code be
>> > allowed to do a failover on a connection error (even when it is
>> > disabled)? Thanks.
>> >
>> > Best Regards,
>> > Asif Naeem
>> >
>> > On Wed, Apr 3, 2013 at 11:43 AM, Asif Naeem <anaeem.it at gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> We are facing an issue with the pgpool health check failsafe mechanism
>> >> in a production environment. I have previously posted this issue at
>> >> http://www.pgpool.net/mantisbt/view.php?id=50. I have observed the
>> >> following issues with pgpool-II version 3.2.3 (built with the latest
>> >> source code), i.e.
>> >>
>> >> Versions used, i.e.
>> >>
>> >>> pgpool-II version 3.2.3
>> >>> postgresql 9.2.3 (Master + Slave)
>> >>
>> >>
>> >> 1. In a master/slave configuration, if health check and failover are
>> >> enabled, i.e.
>> >>
>> >> pgpool.conf
>> >>
>> >>> backend_flag0 = 'ALLOW_TO_FAILOVER'
>> >>> backend_flag1 = 'ALLOW_TO_FAILOVER'
>> >>> health_check_period = 5
>> >>> health_check_timeout = 1
>> >>> health_check_max_retries = 2
>> >>> health_check_retry_delay = 10
>> >>> load_balance_mode = off
>> >>
>> >>
>> >> On Linux64, the master server is running fine and there is no load
>> >> balancing, and suddenly a network interruption or some other cause (I
>> >> mimic the situation by forcefully shutting down the DB server in
>> >> immediate mode, e.g. pg_ctl stop -m immediate) makes pgpool unable to
>> >> connect to the slave server. After that, the first connection attempt
>> >> to pgpool returns without an error/warning message, and pgpool does a
>> >> failover and kills all child processes. Does it make sense that, when
>> >> there is no load balancing and the master DB server is serving queries
>> >> well, a disconnection of the slave server triggers a failover?
>> >>
>> >> pgpool.log
>> >>
>> >>> ....
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65431: I am 65431 accept fd 6
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65431: read_startup_packet: application_name: psql
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65431: Protocol Major: 3 Minor: 0 database: postgres user: asif
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65431: new_connection: connecting 0 backend
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65431: new_connection: connecting 1 backend
>> >>> 2013-04-02 17:24:36 ERROR: pid 65431: connect_inet_domain_socket: getsockopt() detected error: Connection refused
>> >>> 2013-04-02 17:24:36 ERROR: pid 65431: connection to localhost(7445) failed
>> >>> 2013-04-02 17:24:36 ERROR: pid 65431: new_connection: create_cp() failed
>> >>> 2013-04-02 17:24:36 LOG:   pid 65431: degenerate_backend_set: 1 fail over request from pid 65431
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler called
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: starting to select new master node
>> >>> 2013-04-02 17:24:36 LOG:   pid 65417: starting degeneration. shutdown host localhost(7445)
>> >>> 2013-04-02 17:24:36 LOG:   pid 65417: Restart all children
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65418
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65419
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65420
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65421
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65422
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65423
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65424
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65425
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65426
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65427
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65428
>> >>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65429
>> >>> ...
>> >>> ...
>> >>
>> >>
>> >> 2. With the same configuration as above, if I disable failover, i.e.
>> >>
>> >> pgpool.conf
>> >>
>> >>> backend_flag0 = 'DISALLOW_TO_FAILOVER'
>> >>> backend_flag1 = 'DISALLOW_TO_FAILOVER'
>> >>> health_check_period = 5
>> >>> health_check_timeout = 1
>> >>> health_check_max_retries = 2
>> >>> health_check_retry_delay = 10
>> >>> load_balance_mode = off
>> >>
>> >>
>> >> On Linux64, the master server is running fine, there is no load
>> >> balancing and no failover, and suddenly the slave server appears to be
>> >> disconnected because of a network interruption or some other reason (I
>> >> mimic it by forcefully shutting down the DB server in immediate mode).
>> >> After that, no connection attempt to pgpool succeeds until the health
>> >> check completes, and the master database server log shows the following
>> >> messages, i.e.
>> >>
>> >> dbserver.log
>> >>   ...
>> >>   ...
>> >>   LOG: incomplete startup packet
>> >>   LOG: incomplete startup packet
>> >>   LOG: incomplete startup packet
>> >>   LOG: incomplete startup packet
>> >>   LOG: incomplete startup packet
>> >>   ...
>> >>
>> >> 3. While testing this scenario on my MacOSX machine (gcc), with the
>> >> same pgpool configuration settings as in issue #2 above, it seems that
>> >> the health check never completes and goes on endlessly, and it
>> >> completely prevents me from connecting to pgpool any more, i.e.
>> >>
>> >> pgpool.log
>> >>
>> >>> ...
>> >>> ...
>> >>> 2013-04-03 11:29:29 DEBUG: pid 44263: retrying *679* th health checking
>> >>> 2013-04-03 11:29:29 DEBUG: pid 44263: health_check: 0 th DB node status: 2
>> >>> 2013-04-03 11:29:29 DEBUG: pid 44263: pool_ssl: SSL requested but SSL support is not available
>> >>> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: auth kind: 0
>> >>> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: backend key data received
>> >>> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: transaction state: I
>> >>> 2013-04-03 11:29:29 DEBUG: pid 44263: health_check: 1 th DB node status: 2
>> >>> 2013-04-03 11:29:29 ERROR: pid 44263: connect_inet_domain_socket: getsockopt() detected error: Connection refused
>> >>> 2013-04-03 11:29:29 ERROR: pid 44263: make_persistent_db_connection: connection to localhost(7445) failed
>> >>> 2013-04-03 11:29:29 ERROR: pid 44263: health check failed. 1 th host localhost at port 7445 is down
>> >>> 2013-04-03 11:29:29 LOG:   pid 44263: health_check: 1 failover is canceld because failover is disallowed
>> >>> 2013-04-03 11:29:34 DEBUG: pid 44263: retrying *680* th health checking
>> >>> 2013-04-03 11:29:34 DEBUG: pid 44263: health_check: 0 th DB node status: 2
>> >>> 2013-04-03 11:29:34 DEBUG: pid 44263: pool_ssl: SSL requested but SSL support is not available
>> >>> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: auth kind: 0
>> >>> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: backend key data received
>> >>> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: transaction state: I
>> >>> 2013-04-03 11:29:34 DEBUG: pid 44263: health_check: 1 th DB node status: 2
>> >>> 2013-04-03 11:29:34 ERROR: pid 44263: connect_inet_domain_socket: getsockopt() detected error: Connection refused
>> >>> 2013-04-03 11:29:34 ERROR: pid 44263: make_persistent_db_connection: connection to localhost(7445) failed
>> >>> 2013-04-03 11:29:34 ERROR: pid 44263: health check failed. 1 th host localhost at port 7445 is down
>> >>> 2013-04-03 11:29:34 LOG:   pid 44263: health_check: 1 failover is canceld because failover is disallowed
>> >>> ...
>> >>> ...
>> >>
>> >>
>> >> I will try it on a Linux64 machine too. Thanks.
>> >>
>> >> Best Regards,
>> >> Asif Naeem
>> >>
>> >>
>>

