[pgpool-hackers: 3251] Re: Deal with recovery failure by an abnormally exiting child process

Yugo Nagata nagata at sraoss.co.jp
Mon Feb 25 18:18:09 JST 2019


On Tue, 12 Feb 2019 14:01:05 +0900
Yugo Nagata <nagata at sraoss.co.jp> wrote:

> On Tue, 08 Jan 2019 11:16:19 +0900 (JST)
> Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
> > >> In bug 431, it was reported that recovery second stage fails if there
> > >> was an abnormally exiting child process (typically caused by SIGKILL
> > >> or segfault). This is because the global connection counter
> > >> (Req_info->conn_counter) is not decremented when the child process
> > >> exits abnormally. In general we can do nothing about an abnormally
> > >> exiting process, and we recommend restarting the whole Pgpool-II in
> > >> this case.
> > >> 
> > >> However, I found a tricky solution for a particular situation: if
> > >> client_idle_limit_in_recovery is properly set (i.e.
> > >> client_idle_limit_in_recovery >= recovery_timeout).
> > 
> > Sorry, this should have been: 0 < client_idle_limit_in_recovery <= recovery_timeout || client_idle_limit_in_recovery == -1
> > 
> > >> The logic is shown in the patch:
> > >> 
> > >> 	/*
> > >> 	 * recovery_timeout has expired. Before returning with failure status,
> > >> 	 * let's check if this is caused by a stale conn_counter. If a child
> > >> 	 * process exits abnormally (killed by SIGKILL or a segfault, for
> > >> 	 * example), then conn_counter is not decremented at process exit, thus
> > >> 	 * it will never return to 0. This can be detected by checking if
> > >> 	 * client_idle_limit_in_recovery is enabled and less than
> > >> 	 * recovery_timeout, because all clients must have been kicked out by
> > >> 	 * the time client_idle_limit_in_recovery expires. If so, we should
> > >> 	 * reset conn_counter to 0 as well.
> > >> 
> > >> Should we employ this? Is it too tricky? Comments are welcome.
> 
> I think it is a good assumption that if client_idle_limit_in_recovery has
> expired here, then conn_counter can be reset to 0 for all clients.
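
For reference, the corrected condition quoted above can be written as a small
predicate. This is only a minimal sketch: counter_reset_is_safe() is a
hypothetical name and is not taken from the Pgpool-II source. It assumes both
settings are given in seconds, that 0 disables client_idle_limit_in_recovery,
and that -1 disconnects clients as soon as the 2nd stage starts.

```c
#include <stdbool.h>

/*
 * Hypothetical helper (not part of Pgpool-II): returns true when every client
 * is guaranteed to have been disconnected before recovery_timeout expires, so
 * a non-zero connection counter at that point can only be stale.
 */
static bool
counter_reset_is_safe(int client_idle_limit_in_recovery, int recovery_timeout)
{
	/* -1: clients are disconnected as soon as the 2nd stage starts */
	if (client_idle_limit_in_recovery == -1)
		return true;

	/* 0 disables the limit; a positive limit must not exceed the timeout */
	return client_idle_limit_in_recovery > 0 &&
		client_idle_limit_in_recovery <= recovery_timeout;
}
```

With recovery_timeout = 90, limits of -1, 30 and 90 allow the reset, while 0
(disabled) and 120 (longer than the timeout) do not.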

The customer tested the fixed version, 3.6.15, but they still hit the problem
described in bug 431 after a child process was terminated by a segfault, even
though client_idle_limit_in_recovery = -1.

I found that this is due to watchdog. When watchdog is enabled,
wd_start_recovery() is called just after the "2nd stage" starts. In
wd_start_recovery(), the recovery request is sent to the other pgpool, and the
other pgpool waits for all of its children to exit. However, if some child
process has exited abnormally in the other pgpool, it never returns a response
because Req_info->conn_counter can never become zero. Therefore, the original
pgpool waits for the response until the timeout is detected, and the online
recovery eventually fails.

I don't have a concrete patch to fix this for now, but a fix similar to the one
in 3.6.15 will be needed in process_wd_command_timer_event(), where
Req_info->conn_counter is checked periodically while processing recovery
commands received by watchdog.
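
To make the idea concrete, here is a rough sketch of what one periodic check
might look like. It is not a patch and does not use the real Pgpool-II
structures: SharedState, RecoveryConfig, WdRecoveryStatus and
check_recovery_command() are all hypothetical names, and the "deadline"
argument stands in for whatever timeout the watchdog command actually uses.
How that deadline should relate to the requesting node's own timeout is left
open here.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the shared state; models Req_info->conn_counter. */
typedef struct
{
	int		conn_counter;
} SharedState;

/* Hypothetical stand-in for the two relevant settings (seconds). */
typedef struct
{
	int		client_idle_limit_in_recovery;
	int		recovery_timeout;
} RecoveryConfig;

typedef enum
{
	WD_RECOVERY_WAITING,	/* keep polling */
	WD_RECOVERY_READY,		/* reply success to the requesting node */
	WD_RECOVERY_FAILED		/* reply failure to the requesting node */
} WdRecoveryStatus;

/*
 * One periodic check. 'elapsed' is the number of seconds since the recovery
 * command was received; 'deadline' is the assumed limit for answering it.
 */
static WdRecoveryStatus
check_recovery_command(SharedState *state, const RecoveryConfig *cfg,
					   int elapsed, int deadline)
{
	bool	clients_must_be_gone;

	/* normal case: all clients have disconnected */
	if (state->conn_counter <= 0)
		return WD_RECOVERY_READY;

	if (elapsed < deadline)
		return WD_RECOVERY_WAITING;

	/*
	 * Deadline reached with the counter still non-zero. If the configuration
	 * guarantees that all clients were kicked out, the counter is stale
	 * (left behind by an abnormally exited child), so reset it.
	 */
	clients_must_be_gone =
		cfg->client_idle_limit_in_recovery == -1 ||
		(cfg->client_idle_limit_in_recovery > 0 &&
		 cfg->client_idle_limit_in_recovery <= cfg->recovery_timeout);

	if (clients_must_be_gone)
	{
		state->conn_counter = 0;
		return WD_RECOVERY_READY;
	}
	return WD_RECOVERY_FAILED;
}
```

With client_idle_limit_in_recovery = -1 and a stale counter of 2, the check
keeps waiting before the deadline and, once the deadline is reached, resets the
counter and reports ready instead of never responding.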


Regards,
-- 
Yugo Nagata <nagata at sraoss.co.jp>

