[pgpool-hackers: 1493] Re: Proposal: minimize process restart when fail over occurs

Thu Apr 7 10:41:24 JST 2016

I have moved forward a little bit with this. At this point I have just
created the necessary infrastructure to work toward the goal. See
[pgpool-committers: 3127] for more details.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> So this is a proposal for pgpool-II 3.6.
> 
> I already did some discussion on this:
> 
> From: Tatsuo Ishii <ishii at postgresql.org>
> Subject: [pgpool-hackers: 1413] Item #11, torward pgpool-II 3.6
> Date: Fri, 19 Feb 2016 12:03:12 +0900 (JST)
> Message-ID: <20160219.120312.816223524770393776.t-ishii at sraoss.co.jp>
> 
> Here is a more or less formal proposal which is replacing it.
> 
> Goal:
> 
> Currently pgpool-II kills all child processes when fail over (or
> switch over by pcp_detach_node) occurs. Of course this disconnects all
> existing client connections, because the child process each client is
> connected to is gone. This proposal seeks a way to minimize such
> session disconnections.
> 
> o Precondition:
> 
> I assume this proposal is for streaming replication mode only. Maybe
> we could extend it to other modes in the future. I also assume the
> broken server is not the primary.
> 
> o Consideration:
> 
> Why do we need to kill the child processes? Basically the problem is
> the retry in the TCP/IP stack when the connection goes wrong, for
> example, when the network cable is pulled out. In this case the only
> way to stop the retry is to restart the process.
> 
> There are several cases where we could avoid the restart:
> 
> 1) Knowing that we are not dealing with a fail over caused by a
> cabling problem. There are at least two cases where we know the
> problem is not cabling:
> 
>  a) the fail over is triggered by pcp_detach_node.
> 
>  b) the fail over is triggered by postmaster shutdown.
> 
> For other cases we need to find a way to know whether the problem is
> cabling or not. Currently we use a timeout to detect that situation.
> So if we could know whether the timeout occurred, then we could know
> whether the problem is cabling or not.
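
The reasoning in 1) can be sketched as a small classification function.
This is only an illustration, written in Python for brevity (pgpool-II
itself is written in C). The names FailoverCause and
may_be_cabling_problem are hypothetical, not existing pgpool-II symbols,
and how the timeout information reaches this point is left open.

```python
from enum import Enum, auto


class FailoverCause(Enum):
    """Hypothetical origins of a fail over request (not pgpool-II names)."""
    PCP_DETACH_NODE = auto()      # case a): triggered by pcp_detach_node
    POSTMASTER_SHUTDOWN = auto()  # case b): triggered by postmaster shutdown
    OTHER = auto()                # anything else


def may_be_cabling_problem(cause: FailoverCause, timed_out: bool) -> bool:
    """Return True if a cabling problem cannot be ruled out."""
    if cause in (FailoverCause.PCP_DETACH_NODE,
                 FailoverCause.POSTMASTER_SHUTDOWN):
        # Known causes: the network path itself is not the suspect.
        return False
    # For all other causes, the timeout is the only indicator we have.
    return timed_out
```

Only when this returns False does it make sense to go on to step 2.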
> 
> 2) Once we succeed in #1, the next thing we need to do is determine
> whether the session in question is using the broken server. This is
> fairly easy because we already have the info in shared memory. If the
> session uses the broken server, then we need to restart the process;
> there is no way around it. Otherwise we just close the connection to
> the broken backend (if any).
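
The per-child decision in 2) could look roughly like the following.
Again this is a Python sketch under assumptions: ChildAction and
decide_child_action are hypothetical names, and the three booleans stand
in for whatever is actually derived from step 1 and from shared memory.

```python
from enum import Enum, auto


class ChildAction(Enum):
    """Hypothetical outcomes for one child process after a fail over."""
    RESTART = auto()                  # restart the child; the session is lost
    CLOSE_BROKEN_CONNECTION = auto()  # keep the child, drop the broken backend
    KEEP_AS_IS = auto()               # nothing to clean up


def decide_child_action(cabling_suspected: bool,
                        session_uses_broken_node: bool,
                        has_conn_to_broken_node: bool) -> ChildAction:
    """Decide what to do with a single child process."""
    if cabling_suspected:
        # Step 1 could not rule out cabling: keep the current behavior.
        return ChildAction.RESTART
    if session_uses_broken_node:
        # The session depends on the broken server, so it cannot survive.
        return ChildAction.RESTART
    if has_conn_to_broken_node:
        # The session survives; only the broken backend connection goes.
        return ChildAction.CLOSE_BROKEN_CONNECTION
    return ChildAction.KEEP_AS_IS
```

For example, decide_child_action(False, False, True) yields
ChildAction.CLOSE_BROKEN_CONNECTION, which is the case where the client
session stays alive.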
> 
> o Things we need to do:
> 
> - Invent a way to know if the fail over request is created by
>   pcp_detach_node. Probably we add a new flag to the fail over request
>   packet to indicate whether the origin of the request is
>   pcp_detach_node or not.
> 
> - The same technique can be used for the case where the admin shuts
>   down PostgreSQL.
> 
> - Create an API to deal with connections using the broken server.
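
As an illustration of the flag idea in the first two items, the request
could carry a small set of flag bits. This is a Python sketch, not the
real request packet: RequestFlag, FailoverRequest, and the bit values
are all made up for the example.

```python
from dataclasses import dataclass
from enum import IntFlag


class RequestFlag(IntFlag):
    """Hypothetical flag bits carried in a fail over request."""
    NONE = 0
    FROM_PCP_DETACH_NODE = 1  # request originated from pcp_detach_node
    FROM_ADMIN_SHUTDOWN = 2   # request caused by an admin PostgreSQL shutdown


@dataclass(frozen=True)
class FailoverRequest:
    """Hypothetical fail over request: target node plus origin flags."""
    node_id: int
    flags: RequestFlag = RequestFlag.NONE

    def origin_is_known(self) -> bool:
        """True if a flag says the cause is known and is not cabling."""
        known = (RequestFlag.FROM_PCP_DETACH_NODE
                 | RequestFlag.FROM_ADMIN_SHUTDOWN)
        return bool(self.flags & known)
```

A request built as FailoverRequest(1, RequestFlag.FROM_PCP_DETACH_NODE)
reports origin_is_known() as true, while a request with no flags set
falls back to the timeout-based check.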
> 
> o What are the benefits once the above proposal is implemented?
> 
> - If the conditions below are met, the user session can survive fail over.
> 
>  - Operated in streaming replication mode
> 
>  - The failed server is not primary
> 
>  - The session does not connect to the broken standby server
> 
> Comments, opinions?
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp