[pgpool-hackers: 3278] Re: Deal with recovery failure by an abnormally exiting child process

Tue Mar 26 21:22:18 JST 2019

On Mon, 25 Feb 2019 18:18:09 +0900
Yugo Nagata <nagata at sraoss.co.jp> wrote:

> On Tue, 12 Feb 2019 14:01:05 +0900
> Yugo Nagata <nagata at sraoss.co.jp> wrote:
> 
> > On Tue, 08 Jan 2019 11:16:19 +0900 (JST)
> > Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> > 
> > > >> In bug 431, it was reported that recovery second stage fails if there
> > > >> was an abnormally exiting child process (typically caused by SIGKILL
> > > >> or segfault). This is because the global connection counter
> > > >> (Req_info->conn_counter) is not decremented when the child process
> > > >> exits abnormally. In general we can do nothing about an abnormally exiting
> > > >> process, and we recommend restarting the whole Pgpool-II in this case.
> > > >> 
> > > >> However, I found a tricky solution for a particular situation: if
> > > >> client_idle_limit_in_recovery is properly set (i.e.
> > > >> client_idle_limit_in_recovery >= recovery_timeout).
> > > 
> > > Sorry, this should have been: 0 < client_idle_limit_in_recovery <= recovery_timeout || client_idle_limit_in_recovery == -1
> > > 
> > > >> The logic is shown in the patch:
> > > >> 
> > > >> 	/*
> > > >> 	 * recovery_timeout was expired. Before returning with failure status,
> > > >> 	 * let's check if this is caused by the malformed conn_counter. If a child
> > > >> 	 * process abnormally exits (killed by SIGKILL or SEGFAULT, for example),
> > > >> 	 * then conn_counter is not decremented at process exit, thus it will
> > > >> 	 * never be returning to 0. This could be detected by checking if
> > > >> 	 * client_idle_limit_in_recovery is enabled and less value than
> > > >> 	 * recovery_timeout because all clients must be kicked out by the time
> > > >> 	 * when client_idle_limit_in_recovery is expired. If so, we should reset
> > > >> 	 * conn_counter to 0 also.
> > > >> 
> > > >> Should we employ this? Is it too tricky? Comments are welcome.
> > 
> > I think it is a good assumption that if client_idle_limit_in_recovery has expired
> > here, then all clients' conn_counter can be reset to 0.
> 
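To make the corrected condition concrete, here is a minimal sketch in C of the
check discussed above. It is not the actual patch: RecoveryConfig,
all_clients_must_be_gone() and reset_stale_conn_counter() are hypothetical names
used only for illustration, and a plain int stands in for Req_info->conn_counter.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two pgpool.conf settings (in seconds). */
typedef struct {
    int recovery_timeout;
    int client_idle_limit_in_recovery;
} RecoveryConfig;

/*
 * True when every client must already have been disconnected by the time
 * recovery_timeout expires:
 *   -1                       clients are disconnected immediately
 *   0 < limit <= timeout     idle clients are kicked out before the timeout
 * A value of 0 disables the limit, so nothing can be concluded.
 */
static bool
all_clients_must_be_gone(const RecoveryConfig *cfg)
{
    int limit = cfg->client_idle_limit_in_recovery;

    return limit == -1 || (limit > 0 && limit <= cfg->recovery_timeout);
}

/*
 * Meant to be called after recovery_timeout has expired while the counter
 * is still positive. Returns true if the counter was judged stale and was
 * reset to 0, false if it was left untouched.
 */
static bool
reset_stale_conn_counter(const RecoveryConfig *cfg, int *conn_counter)
{
    if (*conn_counter > 0 && all_clients_must_be_gone(cfg))
    {
        *conn_counter = 0;
        return true;
    }
    return false;
}
```

With client_idle_limit_in_recovery = -1 and a leftover counter of 2, the second
function resets the counter and returns true. With the limit set to 0 (disabled)
it leaves the counter alone, so the recovery would still fail with a timeout.
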
> The customer has tested the fixed version, 3.6.15, but they hit the same problem
> as bug 431 after a child was terminated by a segfault, even though
> client_idle_limit_in_recovery = -1.
> 
> I found this is due to watchdog. When watchdog is enabled, wd_start_recovery() is
> called just after "2nd stage" starts. In wd_start_recovery(), the recovery request is
> sent to the other pgpool, which then waits for all of its children to exit.
> However, if some child process has exited abnormally in the other pgpool, it never
> returns a response because Req_info->conn_counter never reaches zero. Therefore, the
> original pgpool waits for the response until the timeout is detected, and
> the online recovery eventually fails.
> 
> I don't have a concrete patch to fix this for now, but a fix similar to the one in 3.6.15
> will be needed in process_wd_command_timer_event(), where Req_info->conn_counter is
> checked periodically while processing recovery commands received by watchdog.

Any progress on this problem?

I wonder whether Usama might be able to handle this, because the watchdog
infrastructure was designed by him...

I made a ticket on mantis:
https://www.pgpool.net/mantisbt/view.php?id=483
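
A fix along those lines on the watchdog side could be sketched roughly as below.
This is only an illustration under assumptions, not existing Pgpool-II code:
WdRecoveryCheck, wd_recovery_timer_tick() and stale_deadline_sec are hypothetical
names, and a plain int stands in for Req_info->conn_counter. Here
stale_deadline_sec plays the role that recovery_timeout plays in the 3.6.15 check.

```c
#include <stdbool.h>

/* Possible outcomes of one periodic check of a pending recovery command. */
typedef enum {
    WD_RECOVERY_WAIT,    /* keep waiting for clients to disconnect */
    WD_RECOVERY_OK,      /* reply success to the requesting pgpool */
    WD_RECOVERY_FAILED   /* give up and reply failure */
} WdRecoveryDecision;

/* Hypothetical inputs of the check (all values in seconds). */
typedef struct {
    int client_idle_limit_in_recovery;
    int stale_deadline_sec;  /* assumed: after this, a positive counter is suspect */
} WdRecoveryCheck;

/* True if all clients must have been disconnected before the deadline. */
static bool
clients_forced_out(const WdRecoveryCheck *chk)
{
    int limit = chk->client_idle_limit_in_recovery;

    return limit == -1 || (limit > 0 && limit <= chk->stale_deadline_sec);
}

/* One timer tick: decide what to do with the pending recovery command. */
static WdRecoveryDecision
wd_recovery_timer_tick(const WdRecoveryCheck *chk, int elapsed_sec,
                       int *conn_counter)
{
    if (*conn_counter == 0)
        return WD_RECOVERY_OK;

    if (elapsed_sec < chk->stale_deadline_sec)
        return WD_RECOVERY_WAIT;

    /* Deadline passed with a positive counter: stale only if clients were forced out. */
    if (clients_forced_out(chk))
    {
        *conn_counter = 0;
        return WD_RECOVERY_OK;
    }
    return WD_RECOVERY_FAILED;
}
```

The deadline is kept separate from recovery_timeout because the other pgpool has
to answer before the original pgpool gives up waiting. How that deadline should
be chosen is left open here.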

> 
> Regards,
> -- 
> Yugo Nagata <nagata at sraoss.co.jp>

-- 
Yugo Nagata <nagata at sraoss.co.jp>