[pgpool-hackers: 131] Found bug with watchdog resulting in pgpool segmentation fault

Fri Sep 14 01:54:42 JST 2012

Hi,

I found a bug today while looking at the watchdog functionality. Here
is how to reproduce the issue:

1) Enable and configure watchdog using the default Unix socket
2) Run pgpool in non-daemon mode (pgpool -n -f pgpool.conf)
3) In another terminal (sorry): killall -9 pgpool

At this stage pgpool stops abruptly and the two Unix sockets are not
removed:

    /tmp/.s.PGSQL.9898
    /tmp/.s.PGSQL.9999

This is expected so far.

Bring down your delegate_IP interface: ifconfig eth0:0 down

Then, when you start pgpool again, it will die complaining about
"Address already in use":

    [root@centos1 pgpool-II-3.2.0]# /usr/local/pgpool/bin/pgpool -n -f /usr/local/pgpool/etc/pgpool.conf
    pid file found but it seems bogus. Trying to start pgpool anyway...
    all commands have sticky bit
    2012-09-13 18:31:33 LOG:   pid 3154: watchdog might call network commands which using sticky bit.
    2012-09-13 18:31:33 ERROR: pid 3154: bind(/tmp/.s.PGSQL.9999) failed. reason: Address already in use
    2012-09-13 18:31:33 ERROR: pid 3154: unlink() failed: No such file or directory
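
The "Address already in use" error can be reproduced outside pgpool.
The following is a minimal Python sketch, not pgpool's C code: it binds
a Unix socket, closes it without calling unlink() (which is what a
kill -9 leaves behind), and then tries to bind to the same path again.
The function names and the temporary path are made up for the
illustration.

```python
import errno
import os
import socket
import tempfile


def bind_unix(path):
    """Bind a listening AF_UNIX stream socket to path and return it."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock


def demo_stale_socket():
    """Return the errno raised when rebinding to a leftover socket file."""
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "demo.sock")  # hypothetical path, not pgpool's

    first = bind_unix(path)
    first.close()  # closed without unlink(), as after kill -9

    if not os.path.exists(path):
        raise RuntimeError("expected the socket file to remain on disk")
    try:
        bind_unix(path).close()
    except OSError as exc:
        return exc.errno  # bind() refuses a path that already exists
    finally:
        os.unlink(path)
        os.rmdir(workdir)
    return None
```

Calling demo_stale_socket() should return errno.EADDRINUSE even though
no process is listening on the path any more: bind() only looks at
whether the file exists.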

You will now have only /tmp/.s.PGSQL.9898 left: the other socket has
been removed, but this one has not, because its path was not yet set
when pgpool died (see the last log line).
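
The ordering problem can be modelled with a small Python sketch. This is
an assumption about the mechanism based on the log above, not pgpool's
actual code: the exit-time cleanup can only unlink paths that were
recorded before the failure. All names below (attempt_start,
record_early, leave_stale_socket) are hypothetical.

```python
import os
import socket


def leave_stale_socket(path):
    """Create a socket file and close it without unlink(), as kill -9 would."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)
    sock.close()


def attempt_start(main_path, pcp_path, record_early):
    """Model one start-up attempt followed by the exit-time cleanup.

    Returns the socket paths that still exist once the process has exited.
    """
    recorded = []
    if record_early:
        # Remember both paths before any bind() is attempted.
        recorded = [main_path, pcp_path]
    opened = []
    try:
        for path in (main_path, pcp_path):
            if not record_early:
                recorded.append(path)  # path only known once its bind() is reached
            sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
            opened.append(sock)
            sock.bind(path)  # raises OSError (EADDRINUSE) on a stale file
    except OSError:
        pass  # start-up failed; fall through to the cleanup
    finally:
        for sock in opened:
            sock.close()
        for path in recorded:
            try:
                os.unlink(path)
            except FileNotFoundError:
                pass
    return [p for p in (main_path, pcp_path) if os.path.exists(p)]
```

With both files stale and record_early=False, the first bind() fails,
only the main path has been recorded, and the pcp socket file survives
the cleanup. With record_early=True both files are removed, so the next
start can bind both sockets.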

Then start pgpool again:

    [root@centos1 pgpool-II-3.2.0]# /usr/local/pgpool/bin/pgpool -n -f /usr/local/pgpool/etc/pgpool.conf
    all commands have sticky bit
    2012-09-13 18:36:20 LOG:   pid 3173: watchdog might call network commands which using sticky bit.
    2012-09-13 18:36:22 LOG:   pid 3173: wd_create_send_socket: connect() reports failure (Connection refused). You can safely ignore this while starting up.
    2012-09-13 18:36:25 LOG:   pid 3173: wd_escalation: escalated to master pgpool
    2012-09-13 18:36:27 LOG:   pid 3173: wd_create_send_socket: connect() reports failure (Connection refused). You can safely ignore this while starting up.
    2012-09-13 18:36:27 LOG:   pid 3173: wd_escalation:  escaleted to delegate_IP holder
    2012-09-13 18:36:27 LOG:   pid 3173: wd_init: start watchdog
    2012-09-13 18:36:27 LOG:   pid 3173: pgpool-II successfully started. version 3.2.0 (namameboshi)
    2012-09-13 18:36:27 ERROR: pid 3173: bind(/tmp/.s.PGSQL.9898) failed. reason: Address already in use

Now press CTRL+C and wait until the segfault appears. It can take a
minute, and you will see a lot of the following messages:

    2012-09-13 18:37:12 LOG:   pid 3173: received fast shutdown request
    2012-09-13 18:37:12 LOG:   pid 3173: watchdog_pid: 3181
    2012-09-13 18:37:12 LOG:   pid 3173: received fast shutdown request
    2012-09-13 18:37:12 LOG:   pid 3173: watchdog_pid: 3181
    2012-09-13 18:37:12 LOG:   pid 3173: received fast shutdown request
    2012-09-13 18:37:12 LOG:   pid 3173: watchdog_pid: 3181
    2012-09-13 18:37:12 LOG:   pid 3173: received fast shutdown request
    2012-09-13 18:37:12 LOG:   pid 3173: watchdog_pid: 3181
    segmentation fault

At this point you again have the stale sockets:

    /tmp/.s.PGSQL.9898
    /tmp/.s.PGSQL.9999

I've attached a patch that solves the problem by setting the path of
the pcp socket at the same time as the path of the main pgpool Unix
socket, so that the socket file is removed when pgpool is restarted. I
think there is probably a better fix, because the issue can still
appear if pgpool fails to remove that socket (a privilege change on the
socket, for example), and the segfault will then appear again.
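
One possible direction for a more robust fix, sketched in Python as an
illustration only (this is not the attached patch and not pgpool code):
probe an existing socket file with connect() before binding, remove it
only when nobody is listening, and let any unlink() or bind() failure
propagate so that start-up aborts cleanly instead of continuing. The
function names are hypothetical.

```python
import os
import socket
import stat


def socket_is_stale(path):
    """Return True if path is a socket file that no process is listening on."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return False
    if not stat.S_ISSOCK(mode):
        return False  # never treat a regular file as a stale socket
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        probe.connect(path)
    except ConnectionRefusedError:
        return True   # file exists, nobody is listening
    except OSError:
        return False  # e.g. permission denied: do not guess
    finally:
        probe.close()
    return False      # a live server accepted the connection


def bind_replacing_stale(path):
    """Bind to path, removing a stale socket file first; raise on any failure."""
    if socket_is_stale(path):
        os.unlink(path)  # may raise PermissionError: the caller should abort
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
        sock.listen(1)
    except OSError:
        sock.close()
        raise
    return sock
```

The point of the design is that a socket file owned by a running
instance is never removed, and a file that cannot be removed produces
an immediate error rather than a half-started process.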

-- 
Gilles Darold
Database administrator
http://dalibo.com - http://dalibo.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch_segfault_watchdog.diff
Type: text/x-patch
Size: 1205 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-hackers/attachments/20120913/964f1255/attachment.bin>