[pgpool-hackers: 3278] Re: Deal with recovery failure by an abnormally exiting child process
Yugo Nagata
nagata at sraoss.co.jp
Tue Mar 26 21:22:18 JST 2019
On Mon, 25 Feb 2019 18:18:09 +0900
Yugo Nagata <nagata at sraoss.co.jp> wrote:
> On Tue, 12 Feb 2019 14:01:05 +0900
> Yugo Nagata <nagata at sraoss.co.jp> wrote:
>
> > On Tue, 08 Jan 2019 11:16:19 +0900 (JST)
> > Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> >
> > > >> In bug 431, it was reported that recovery second stage fails if there
> > > >> was an abnormally exiting child process (typically caused by SIGKILL
> > > >> or segfault). This is because the global connection counter
> > > >> (Req_info->conn_counter) is not decremented when a child process exits
> > > >> abnormally. In general there is nothing we can do about an abnormally
> > > >> exiting process, and we recommend restarting the whole Pgpool-II in this case.
> > > >>
> > > >> However I find a tricky solution for a particular situation: if
> > > >> client_idle_limit_in_recovery is properly set (i.e.
> > > >> client_idle_limit_in_recovery >= recovery_timeout).
> > >
> > > > Sorry, this should have been: 0 < client_idle_limit_in_recovery <= recovery_timeout || client_idle_limit_in_recovery == -1
> > >
> > > >> The logic is shown in the patch:
> > > >>
> > > >> /*
> > > >>  * recovery_timeout has expired. Before returning with failure status,
> > > >>  * let's check whether this is caused by a malformed conn_counter. If a child
> > > >>  * process exits abnormally (killed by SIGKILL or a segfault, for example),
> > > >>  * then conn_counter is not decremented at process exit, so it will
> > > >>  * never return to 0. This can be detected by checking whether
> > > >>  * client_idle_limit_in_recovery is enabled and less than
> > > >>  * recovery_timeout, because all clients must have been kicked out by the
> > > >>  * time client_idle_limit_in_recovery expires. If so, we should reset
> > > >>  * conn_counter to 0 as well.
> > > >>  */
> > > >>
> > > >> Should we employ this? Is it too tricky? Comments are welcome.
> >
> > I think it is a good assumption that if client_idle_limit_in_recovery has expired
> > here, then conn_counter can be reset to 0.
>
> The customer has tested the fixed version, 3.6.15, but they still hit the same
> problem as bug 431 after a child was terminated by a segfault, even though
> client_idle_limit_in_recovery = -1.
>
> I found this is due to the watchdog. When the watchdog is enabled, wd_start_recovery()
> is called just after the "2nd stage" starts. In wd_start_recovery(), the recovery request
> is sent to the other pgpool, and the other pgpool waits for all of its children to exit.
> However, if some child process has exited abnormally in the other pgpool, it never
> returns a response because Req_info->conn_counter cannot reach zero. Therefore, the
> original pgpool waits for the response until the timeout is detected, and
> the online recovery eventually fails.
>
> I don't have a concrete patch to fix this for now, but a fix similar to the one in 3.6.15
> will be needed in process_wd_command_timer_event(), where Req_info->conn_counter is
> checked periodically for processing recovery commands received by watchdog.
Any progress on this problem?
I wonder whether Usama might be able to handle this, because the watchdog
infrastructure was designed by him...
I created a ticket on Mantis:
https://www.pgpool.net/mantisbt/view.php?id=483
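
For reference, the check discussed above could be sketched roughly as follows.
This is only an illustration and not the actual pgpool code: RecoveryConfig,
clients_must_be_gone() and reset_stale_conn_counter() are hypothetical names,
and conn_counter is passed as a plain int pointer instead of going through
Req_info. The condition is the corrected one quoted above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two pgpool.conf parameters named in this thread. */
typedef struct {
    int client_idle_limit_in_recovery; /* seconds, or -1 */
    int recovery_timeout;              /* seconds */
} RecoveryConfig;

/*
 * Corrected condition from the thread:
 *   0 < client_idle_limit_in_recovery <= recovery_timeout
 *   || client_idle_limit_in_recovery == -1
 * When it holds, no client should still be connected once the wait has
 * run past recovery_timeout.
 */
bool clients_must_be_gone(const RecoveryConfig *cfg)
{
    int limit = cfg->client_idle_limit_in_recovery;

    return limit == -1 || (limit > 0 && limit <= cfg->recovery_timeout);
}

/*
 * Hypothetical helper, meant to be called only after the wait for
 * conn_counter == 0 has timed out. If the configuration guarantees that
 * all clients are gone, a non-zero counter can only be a leftover from an
 * abnormally exited child, so reset it. Returns true if it was reset.
 */
bool reset_stale_conn_counter(const RecoveryConfig *cfg, int *conn_counter)
{
    if (*conn_counter > 0 && clients_must_be_gone(cfg)) {
        *conn_counter = 0;
        return true;
    }
    return false;
}
```

Something along these lines would presumably be called from
process_wd_command_timer_event() once the wait has run past the timeout, but
where exactly it belongs is for the watchdog code to decide.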
>
> Regards,
> --
> Yugo Nagata <nagata at sraoss.co.jp>
--
Yugo Nagata <nagata at sraoss.co.jp>