[pgpool-hackers: 899] Re: Making Failover more robust.

Tatsuo Ishii ishii at postgresql.org
Fri May 8 10:07:22 JST 2015


Hi Usama,

I have briefly looked into your patch, and I have a few questions about it.
Does your patch deal with the case when health checking is *not*
involved?  For example, what if a new client connects to pgpool-II and
pgpool-II tries to connect to PostgreSQL, but PostgreSQL refuses the
request because max_connections has been reached? It seems your patch
does not handle that case.
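The distinction at issue can be sketched as follows (illustrative only, not pgpool-II's actual code): when a backend refuses a forwarded client connection, the SQLSTATE tells us whether the node is really down or merely busy.

```python
# Illustrative sketch, not pgpool-II's actual code: deciding whether a
# connection-time error should trigger failover. SQLSTATE 53300 is
# ERRCODE_TOO_MANY_CONNECTIONS, seen when the backend's max_connections
# is exhausted; 57P03 is ERRCODE_CANNOT_CONNECT_NOW (e.g. during startup).

# Errors that mean "backend alive, but temporarily refusing connections".
TRANSIENT_SQLSTATES = {"53300", "57P03"}

def should_trigger_failover(sqlstate):
    """Trigger failover only for errors suggesting the node is really down.

    sqlstate is None when no server response was received at all
    (network failure, crashed postmaster) -- that case must still
    trigger failover.
    """
    if sqlstate is None:
        return True
    return sqlstate not in TRANSIENT_SQLSTATES
```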

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> Hi Tatsuo
> 
> Please find the attached POC patch for making the health checking more
> robust.
> Instead of explicitly handling the max_connections-reached error, the
> patch introduces a new configuration parameter
> *health_check_ignore_errorcodes*. This parameter can be assigned a
> comma-separated list of PostgreSQL error codes that will not be treated
> as health check failures. For example, to ignore
> ERRCODE_TOO_MANY_CONNECTIONS and ERRCODE_CANNOT_CONNECT_NOW errors,
> assign the respective error codes '53300, 57P03' to the parameter.
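A minimal sketch of how such a parameter could be parsed and consulted (the parameter name comes from the POC patch; the parsing details here are assumptions, and note that ERRCODE_CANNOT_CONNECT_NOW is actually SQLSTATE 57P03, while 53100 is disk_full):

```python
# Hypothetical sketch of the proposed health_check_ignore_errorcodes
# handling; not the POC patch's actual code.

def parse_ignore_errorcodes(value):
    """Split a comma-separated SQLSTATE list, tolerating whitespace."""
    return {code.strip() for code in value.split(",") if code.strip()}

def is_ignorable_health_check_error(sqlstate, ignore_codes):
    """True if this failure should not count as a health check failure."""
    return sqlstate in ignore_codes
```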
> 
> The health checking behaves as per the discussion above: when an error
> whose code is listed in health_check_ignore_errorcodes occurs,
> pgpool-II continues to work without initiating a failover, and the node
> status remains untouched.
> 
> Please note that this is just a work-in-progress patch to validate
> the concept.
> 
> Thanks
> Best regards
> Muhammad Usama
> 
> 
> On Sun, Apr 26, 2015 at 6:43 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> 
>> Usama,
>>
>> I understand that failover caused by max_connections error is a pain.
>>
>> Here is another idea to handle the max_connections error and prevent
>> failover. When pgpool-II fails to create a connection to a backend, it
>> examines the error code, and if the failure was due to the
>> max_connections limit, it just returns a "Sorry, max connections..."
>> error message to the client and *does not* trigger failover. In the
>> meantime, the health check does not trigger failover either if the
>> error was the max_connections error, and just logs the incident.
>>
>> The shortcoming of this method is that the log file will be flooded
>> with the error message if the max_connections setting is too low.
>> However, this is not pgpool-II's problem.
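The alternative described above could be sketched like this (function name, context labels, and return values are invented for illustration):

```python
# Illustrative sketch: on the max_connections error, relay an error to
# the client without failover; in the health check path, only log it.

def handle_max_connections(sqlstate, context):
    """Decide the action for a backend connection refusal."""
    if sqlstate == "53300":  # ERRCODE_TOO_MANY_CONNECTIONS
        if context == "client_session":
            return "return 'Sorry, max connections...' to client; no failover"
        if context == "health_check":
            return "log incident; no failover"
    return "trigger failover"
```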
>>
>> What do you think?
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>>
>> >> Hi
>> >>
>> >> Please see my response inline.
>> >>
>> >> On Mon, Apr 20, 2015 at 6:34 AM, Yugo Nagata <nagata at sraoss.co.jp>
>> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> It seems like good idea, but I have some questions.
>> >>
>> >>
>> >>> I'm not sure how this differs from using health_check_max_retries.
>> >>>
>> >>
>> >> health_check_max_retries only waits a specific amount of time for
>> >> the node to come back online and does not care about what caused
>> >> the node to become unavailable. Also, setting larger values for
>> >> this parameter to cover transient errors delays failover in cases
>> >> of actual node failures.
>> >
>> > I have thought about your idea a little bit more. Suppose we have a
>> > primary PostgreSQL node and two standby PostgreSQL nodes. If the
>> > primary returns the max connections error, then the remaining two
>> > standby nodes will return the same error sooner or later, because
>> > pgpool-II tries to connect to all DB nodes and it's rare for
>> > standbys to have a different max_connections than the primary. This
>> > will result in an "all backends down" error, which is the same as
>> > the current situation.
>> >
>> >>> How should NODE_TEMP_DOWN be defined? Should it include
>> >>> only the max_connections error, or also other cases, like health
>> >>> check errors within health_check_max_retries?
>> >>>
>> >>
>> >> I am thinking of NODE_TEMP_DOWN for only temporary kinds of errors,
>> >> where the PostgreSQL node is reachable but the connection is
>> >> explicitly closed by the PG server. At the moment I can only think
>> >> of the max_connections-reached error, but I am sure there are other
>> >> cases.
>> >>
>> >>
>> >>> How are NODE_TEMP_DOWN nodes treated by child processes?
>> >>> While the status is NODE_TEMP_DOWN, are children allowed to
>> >>> send queries to these nodes?
>> >>>
>> >>
>> >> I think it should be treated similarly to the NODE_DOWN status by
>> >> child processes.
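The treatment just suggested could be sketched as a status enum: children route queries only to fully up nodes, while the health checker keeps probing TEMP_DOWN nodes (names are invented for illustration; not pgpool-II's actual code):

```python
# Illustrative sketch: NODE_TEMP_DOWN is treated like NODE_DOWN for
# query routing, but the health checker keeps probing it for recovery.
from enum import Enum

class NodeStatus(Enum):
    NODE_UP = 1
    NODE_DOWN = 2
    NODE_TEMP_DOWN = 3  # proposed: reachable, but refusing connections

def usable_for_queries(status):
    """Child processes send queries only to fully up nodes."""
    return status is NodeStatus.NODE_UP

def keep_probing(status):
    """The health checker keeps probing TEMP_DOWN nodes for recovery."""
    return status in (NodeStatus.NODE_UP, NodeStatus.NODE_TEMP_DOWN)
```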
>> >
>> > Suppose one of the standby nodes comes back from the NODE_TEMP_DOWN
>> > state first. Since there's no writable DB node, DML queries will
>> > fail. This is not good from the user's point of view. Maybe we could
>> > allow standby nodes to come back from the state only if the primary
>> > is online. However, this is too complex and I am not sure it's worth
>> > the trouble.
>> >
>> > In summary, the "NODE_TEMP_DOWN" idea itself is great, but I am not
>> > sure the max_connections problem is best handled by that state.
>> > Probably the NODE_TEMP_DOWN state should be applied to errors that
>> > are temporary *and* do not happen equally on all DB nodes (unlike
>> > the max_connections problem). Certain kinds of network errors might
>> > be candidates, but I'm not sure we can reliably detect that kind of
>> > state from the error code the OS returns.
>> >
>> >>> How long does the NODE_TEMP_DOWN state last? Forever, until
>> >>> the health check succeeds again? Or should this be controlled
>> >>> by another parameter?
>> >>>
>> >>
>> >> This one needs to be thought out a little more. Some of the
>> >> options are: it always remains NODE_TEMP_DOWN until the node comes
>> >> back or dies permanently, or we control it with a new configuration
>> >> parameter, which could put a time limit on this status before
>> >> failing the node.
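The second option could be sketched as a simple escalation check (the parameter name temp_down_limit is invented for illustration; a limit of 0 is assumed to mean "wait indefinitely"):

```python
# Illustrative sketch: a time limit on the NODE_TEMP_DOWN status.
# Times are in seconds; the caller would pass the current time.

def next_status(entered_temp_down_at, temp_down_limit, now):
    """Escalate NODE_TEMP_DOWN to NODE_DOWN once the limit expires."""
    if temp_down_limit and now - entered_temp_down_at > temp_down_limit:
        return "NODE_DOWN"       # give up: fail the node permanently
    return "NODE_TEMP_DOWN"      # keep waiting for the node to come back
```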
>> >>
>> >>
>> >> Thanks
>> >> Kind regards
>> >> Muhammad Usama
>> >>
>> >>
>> >>> On Fri, 17 Apr 2015 20:07:10 +0500
>> >>> Muhammad Usama <m.usama at gmail.com> wrote:
>> >>>
>> >>> > Hi
>> >>> >
>> >>> > Currently pgpool-II does not discriminate between the types and
>> >>> > nature of backend failures, especially when performing the
>> >>> > backend health check, and it triggers node failover as soon as
>> >>> > the health check fails to connect to the backend PostgreSQL
>> >>> > server (of course, after the retries are exhausted). This is a
>> >>> > big problem in the case of transient failures: for example, if
>> >>> > max_connections is reached on the backend node and the health
>> >>> > check connection gets denied, it will still be considered a
>> >>> > backend node failure by pgpool-II, which will go on to trigger a
>> >>> > failover, despite the fact that the node is actually working
>> >>> > fine and pgpool-II child processes are successfully connected to
>> >>> > it.
>> >>> >
>> >>> > So I think the pgpool-II health check should consider the cause
>> >>> > and type of the error that happened on the backend and,
>> >>> > depending on the type of error, either register the failover
>> >>> > request, ignore the error, or just change the backend node
>> >>> > status. We could introduce a new node status to identify these
>> >>> > types of situations (e.g. NODE_TEMP_DOWN) and have a new
>> >>> > configuration parameter to control the behavior of this state.
>> >>> > Instead of straight away initiating a failover on a node, the
>> >>> > health check would keep probing a node with this new
>> >>> > NODE_TEMP_DOWN status and automatically make the node available
>> >>> > again when a health check succeeds on it.
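The proposed state transitions could be sketched like this (the set of ignorable error codes is an assumption; 53300 and 57P03 are the max_connections and cannot-connect-now SQLSTATEs):

```python
# Illustrative sketch of the proposal: an ignorable health check failure
# moves a node to NODE_TEMP_DOWN instead of failing it over; a later
# successful health check restores it automatically.

IGNORABLE_SQLSTATES = {"53300", "57P03"}

def on_health_check(ok, sqlstate=None):
    """Return the node's next status after one health check round
    (i.e. after retries are exhausted)."""
    if ok:
        return "NODE_UP"         # success restores the node automatically
    if sqlstate in IGNORABLE_SQLSTATES:
        return "NODE_TEMP_DOWN"  # transient refusal: keep probing
    return "NODE_DOWN"           # real failure: trigger failover
```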
>> >>> >
>> >>> > Thoughts, suggestions and design ideas are most welcome
>> >>> >
>> >>> > Thanks
>> >>> > Best regards!
>> >>> > Muhammad Usama
>> >>>
>> >>>
>> >>> --
>> >>> Yugo Nagata <nagata at sraoss.co.jp>
>> >>>
>> > _______________________________________________
>> > pgpool-hackers mailing list
>> > pgpool-hackers at pgpool.net
>> > http://www.pgpool.net/mailman/listinfo/pgpool-hackers
>>

