Thank you, Tatsuo. I would still say it "will go in a never-ending loop" if any slave stops responding (until it is alive again), as has been observed earlier, i.e.<div><br></div><div>
<span class="Apple-style-span" style="color:rgb(34,34,34);font-size:13px;font-family:Arial">pgpool.log</span></div><blockquote class="gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
....<br>2013-04-04 12:34:41 DEBUG: pid 44263: retrying <b>10867</b> th health checking<br>2013-04-04 12:34:41 DEBUG: pid 44263: health_check: 0 th DB node status: 2<br>2013-04-04 12:34:41 DEBUG: pid 44263: pool_ssl: SSL requested but SSL support is not available<br>
2013-04-04 12:34:41 DEBUG: pid 44263: s_do_auth: auth kind: 0<br>2013-04-04 12:34:41 DEBUG: pid 44263: s_do_auth: backend key data received<br>2013-04-04 12:34:41 DEBUG: pid 44263: s_do_auth: transaction state: I<br>2013-04-04 12:34:41 DEBUG: pid 44263: health_check: 1 th DB node status: 2<br>
2013-04-04 12:34:41 ERROR: pid 44263: connect_inet_domain_socket: getsockopt() detected error: Connection refused<br>2013-04-04 12:34:41 ERROR: pid 44263: make_persistent_db_connection: connection to localhost(7445) failed<br>
2013-04-04 12:34:41 ERROR: pid 44263: health check failed. 1 th host localhost at port 7445 is down<br>2013-04-04 12:34:41 LOG: pid 44263: health_check: 1 failover is canceld because failover is disallowed<br>....<br>....</blockquote>
<div><br></div><div>AFAIU from our discussion, this is a feature, not a bug. In the presented scenario, if any slave goes down or goes missing (maybe because of a network issue), pgpool will be unresponsive to any new connection (with no warning or message) until the slave becomes available again. Do you agree? Thanks.</div>
<div><br><div><div>Best Regards,</div><div>Asif Naeem<br><br><div class="gmail_quote">On Tue, Apr 9, 2013 at 5:05 AM, Tatsuo Ishii <span dir="ltr"><<a href="mailto:ishii@postgresql.org" target="_blank">ishii@postgresql.org</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Well, "will go in never ending loop" is a slightly incorrect<br>
statement. What happens here is that pgpool tries to fail over every<br>
health_check_period, and the failover is canceled because the DISALLOW_TO_FAILOVER<br>
flag is set. This particular setup has at least two use cases:<br>
<br>
- PostgreSQL is protected by heartbeat/pacemaker or other HA (High<br>
Availability) software. When a PostgreSQL server fails, that software is<br>
responsible for failing the node over to the standby PostgreSQL. Once<br>
the PostgreSQL comes up, pgpool will start to accept connections<br>
from clients.<br>
<br>
- The admin wants to upgrade PostgreSQL immediately because of security<br>
issues with it (like the recent PostgreSQL ones). He stops the PostgreSQL<br>
servers one by one and upgrades them. While the admin stops PostgreSQL,<br>
pgpool refuses to accept connections from clients, and database consistency<br>
among the database nodes is safely kept. This minimizes the<br>
downtime.<br>
<br>
In summary, I see no point in changing the current behavior of pgpool.<br>
--<br>
Tatsuo Ishii<br>
SRA OSS, Inc. Japan<br>
English: <a href="http://www.sraoss.co.jp/index_en.php" target="_blank">http://www.sraoss.co.jp/index_en.php</a><br>
Japanese: <a href="http://www.sraoss.co.jp" target="_blank">http://www.sraoss.co.jp</a><br>
<div><div class="h5"><br>
> Hi Tatsuo Ishii,<br>
><br>
> By looking at the source code, it seems that the health check mechanism<br>
> depends on the failover options (fail_over_on_backend_error + backend_flag)<br>
> in non-parallel mode and goes into a never-ending loop if failover is<br>
> disabled (as I mentioned earlier in Issue#3 in my first email), i.e.<br>
><br>
> pgpool2/main.c<br>
><br>
>> /* do we need health checking for PostgreSQL? */<br>
>> if (pool_config->health_check_period > 0)<br>
>> {<br>
>> ...<br>
>> ...<br>
>>     if (POOL_DISALLOW_TO_FAILOVER(BACKEND_INFO(sts).flag))<br>
>>     {<br>
>>         pool_log("health_check: %d failover is canceld because failover is disallowed", sts);<br>
>>     }<br>
>>     else if (retrycnt <= pool_config->health_check_max_retries)<br>
>> ...<br>
>> ...<br>
>> }<br>
><br>
><br>
> It seems failover depends not only on the fail_over_on_backend_error<br>
> configuration option but on backend_flag as well. If<br>
> fail_over_on_backend_error is "on" but backend_flag is<br>
> "DISALLOW_TO_FAILOVER", failover will not be triggered for the related<br>
> slave node. On the other hand, if a child process finds a connection error<br>
> for any related node, it aborts. As you suggested earlier, it seems the<br>
> only appropriate thing to do when a connection error to any related node<br>
> is found is to fail over and restart all child processes.<br>
><br>
> In the example (Issue#3 in my first email) I mentioned earlier, there is a<br>
> dead end: pgpool goes into an endless loop and becomes unresponsive to new<br>
> connections with the following configuration settings, i.e.<br>
><br>
> pgpool.conf<br>
><br>
>> fail_over_on_backend_error = on<br>
>> backend_flag0 = 'DISALLOW_TO_FAILOVER'<br>
>> backend_flag1 = 'DISALLOW_TO_FAILOVER'<br>
>> health_check_period = 5<br>
>> health_check_timeout = 1<br>
>> health_check_retry_delay = 10<br>
><br>
><br>
> On each new connection,<br>
> new_connection()->notice_backend_error()->degenerate_backend_set()<br>
> gives the following warning, i.e.<br>
><br>
>> if (POOL_DISALLOW_TO_FAILOVER(BACKEND_INFO(node_id_set[i]).flag))<br>
>> {<br>
>>     pool_log("degenerate_backend_set: %d failover request from pid %d is canceld because failover is disallowed", node_id_set[i], getpid());<br>
>>     continue;<br>
>> }<br>
><br>
><br>
> As mentioned in the fail_over_on_backend_error documentation, failover can<br>
> happen even when fail_over_on_backend_error=off, when pgpool detects an<br>
> administrative shutdown of the postmaster, i.e.<br>
><br>
> <a href="http://www.pgpool.net/docs/latest/pgpool-en.html" target="_blank">http://www.pgpool.net/docs/latest/pgpool-en.html</a><br>
><br>
>> fail_over_on_backend_error V2.3 -<br>
>> If true, and an error occurs when reading/writing to the backend<br>
>> communication, pgpool-II will trigger the fail over procedure. If set to<br>
>> false, pgpool will report an error and disconnect the session. If you set<br>
>> this parameter to off, it is recommended that you turn on health checking.<br>
>> Please note that even if this parameter is set to off, however, pgpool will<br>
>> also do the fail over when pgpool detects the administrative shutdown of<br>
>> postmaster.<br>
>> You need to reload pgpool.conf if you change this value.<br>
><br>
><br>
> If failover/degeneration is the only way to handle the situation where a<br>
> slave node is unresponsive/crashed, etc., can't the code allow failover on<br>
> a connection error (even when it is disabled)? Thanks.<br>
><br>
> Best Regards,<br>
> Asif Naeem<br>
><br>
> On Wed, Apr 3, 2013 at 11:43 AM, Asif Naeem <<a href="mailto:anaeem.it@gmail.com">anaeem.it@gmail.com</a>> wrote:<br>
><br>
>> Hi,<br>
>><br>
>> We are facing an issue with the pgpool health check failsafe mechanism in a<br>
>> production environment. I have previously posted this issue at<br>
>> <a href="http://www.pgpool.net/mantisbt/view.php?id=50" target="_blank">http://www.pgpool.net/mantisbt/view.php?id=50</a>. I have observed 2 issues<br>
>> with pgpool-II version 3.2.3 (built with the latest source code), i.e.<br>
>><br>
>> Used versions i.e.<br>
>><br>
>>> pgpool-II version 3.2.3<br>
>>> postgresql 9.2.3 (Master + Slave)<br>
>><br>
>><br>
>> 1. In master-slave configuration, if health check and failover are enabled,<br>
>> i.e.<br>
>><br>
>> pgpool.conf<br>
>><br>
>>> backend_flag0 = 'ALLOW_TO_FAILOVER'<br>
>>> backend_flag1 = 'ALLOW_TO_FAILOVER'<br>
>>> health_check_period = 5<br>
>>> health_check_timeout = 1<br>
>>> health_check_max_retries = 2<br>
>>> health_check_retry_delay = 10<br>
>>> load_balance_mode = off<br>
>><br>
>> On Linux64, when the master server is running fine without load balancing,<br>
>> and a network interruption or some other failure suddenly occurs (I<br>
>> mimic the situation by forcefully shutting down the db server via immediate<br>
>> mode, etc.) so that pgpool cannot connect to the slave server, the first<br>
>> connection attempt to pgpool afterwards returns without an error/warning<br>
>> message, and pgpool fails over and kills all child processes. Does it make<br>
>> sense that, when there is no load balancing and the master db server is<br>
>> serving queries well, disconnection of the slave server triggers failover?<br>
>><br>
>> pgpool.log<br>
>><br>
>>> ....<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65431: I am 65431 accept fd 6<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65431: read_startup_packet:<br>
>>> application_name: psql<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65431: Protocol Major: 3 Minor: 0<br>
>>> database: postgres user: asif<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65431: new_connection: connecting 0 backend<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65431: new_connection: connecting 1 backend<br>
>>> 2013-04-02 17:24:36 ERROR: pid 65431: connect_inet_domain_socket:<br>
>>> getsockopt() detected error: Connection refused<br>
>>> 2013-04-02 17:24:36 ERROR: pid 65431: connection to localhost(7445) failed<br>
>>> 2013-04-02 17:24:36 ERROR: pid 65431: new_connection: create_cp() failed<br>
>>> 2013-04-02 17:24:36 LOG: pid 65431: degenerate_backend_set: 1 fail over<br>
>>> request from pid 65431<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler called<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: starting to<br>
>>> select new master node<br>
>>> 2013-04-02 17:24:36 LOG: pid 65417: starting degeneration. shutdown<br>
>>> host localhost(7445)<br>
>>> 2013-04-02 17:24:36 LOG: pid 65417: Restart all children<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65418<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65419<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65420<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65421<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65422<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65423<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65424<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65425<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65426<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65427<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65428<br>
>>> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65429<br>
>>> ...<br>
>>> ...<br>
>><br>
>><br>
>> 2. In the same configuration as above, if I disable failover, i.e.<br>
>><br>
>> pgpool.conf<br>
>><br>
>>> backend_flag0 = 'DISALLOW_TO_FAILOVER'<br>
>>> backend_flag1 = 'DISALLOW_TO_FAILOVER'<br>
>>> health_check_period = 5<br>
>>> health_check_timeout = 1<br>
>>> health_check_max_retries = 2<br>
>>> health_check_retry_delay = 10<br>
>>> load_balance_mode = off<br>
>><br>
>><br>
>> On Linux64, when the master server is running fine with no load balancing<br>
>> and no failover, and the slave server suddenly appears to be disconnected<br>
>> because of a network interruption or some other reason (I mimic it by<br>
>> forcefully shutting down the db server via immediate mode, etc.), after<br>
>> that no connection attempt to pgpool succeeds until the health check<br>
>> completes, and the master database server log shows the following messages, i.e.<br>
>><br>
>> dbserver.log<br>
>> ...<br>
>> ...<br>
>> LOG: incomplete startup packet<br>
>> LOG: incomplete startup packet<br>
>> LOG: incomplete startup packet<br>
>> LOG: incomplete startup packet<br>
>> LOG: incomplete startup packet<br>
>> ...<br>
>><br>
>> 3. While testing this scenario on my Mac OS X machine (gcc), the health<br>
>> check never seems to complete and runs endlessly with the same pgpool<br>
>> configuration settings as in issue #2 above, and it completely prevents me<br>
>> from connecting to pgpool any more, i.e.<br>
>><br>
>> pgpool.log<br>
>><br>
>>> ...<br>
>>> ...<br>
</div></div>>>> 2013-04-03 11:29:29 DEBUG: pid 44263: retrying *679* th health checking<br>
<div class="im">>>> 2013-04-03 11:29:29 DEBUG: pid 44263: health_check: 0 th DB node status: 2<br>
>>> 2013-04-03 11:29:29 DEBUG: pid 44263: pool_ssl: SSL requested but SSL<br>
>>> support is not available<br>
>>> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: auth kind: 0<br>
>>> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: backend key data received<br>
>>> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: transaction state: I<br>
>>> 2013-04-03 11:29:29 DEBUG: pid 44263: health_check: 1 th DB node status: 2<br>
>>> 2013-04-03 11:29:29 ERROR: pid 44263: connect_inet_domain_socket:<br>
>>> getsockopt() detected error: Connection refused<br>
>>> 2013-04-03 11:29:29 ERROR: pid 44263: make_persistent_db_connection:<br>
>>> connection to localhost(7445) failed<br>
>>> 2013-04-03 11:29:29 ERROR: pid 44263: health check failed. 1 th host<br>
>>> localhost at port 7445 is down<br>
>>> 2013-04-03 11:29:29 LOG: pid 44263: health_check: 1 failover is canceld<br>
>>> because failover is disallowed<br>
</div>>>> 2013-04-03 11:29:34 DEBUG: pid 44263: retrying *680* th health checking<br>
<div class="HOEnZb"><div class="h5">>>> 2013-04-03 11:29:34 DEBUG: pid 44263: health_check: 0 th DB node status: 2<br>
>>> 2013-04-03 11:29:34 DEBUG: pid 44263: pool_ssl: SSL requested but SSL<br>
>>> support is not available<br>
>>> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: auth kind: 0<br>
>>> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: backend key data received<br>
>>> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: transaction state: I<br>
>>> 2013-04-03 11:29:34 DEBUG: pid 44263: health_check: 1 th DB node status: 2<br>
>>> 2013-04-03 11:29:34 ERROR: pid 44263: connect_inet_domain_socket:<br>
>>> getsockopt() detected error: Connection refused<br>
>>> 2013-04-03 11:29:34 ERROR: pid 44263: make_persistent_db_connection:<br>
>>> connection to localhost(7445) failed<br>
>>> 2013-04-03 11:29:34 ERROR: pid 44263: health check failed. 1 th host<br>
>>> localhost at port 7445 is down<br>
>>> 2013-04-03 11:29:34 LOG: pid 44263: health_check: 1 failover is canceld<br>
>>> because failover is disallowed<br>
>>> ...<br>
>>> ...<br>
>><br>
>><br>
>> I will try it on a Linux64 machine too. Thanks.<br>
>><br>
>> Best Regards,<br>
>> Asif Naeem<br>
>><br>
>><br>
</div></div></blockquote></div><br></div></div></div>