[pgpool-general: 1330] WatchDog and pgool sudden stop working

Tomas Halgas tomas at halgas.sk
Fri Jan 18 22:47:23 JST 2013


Hello,

I have repeated issue on my pgpool servers (two servers monitoring each other via pgpool watchdog).

About each 2nd or 3rd day one (or both) pgpools stop working, most of times the log starts with:
 read_startup_packet: incorrect packet length (0)

and then follow many of those messages on both servers:
 wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working


Today it happen only on server which was pgpool master and correctly switched to the other pgpool server.
This is log from the master:
 2013-01-18 08:18:20 ERROR: pid 3160: read_startup_packet: incorrect packet length (0)
 2013-01-18 08:18:23 LOG:   pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
 2013-01-18 08:18:23 LOG:   pid 3152: wd_IP_down: ifconfig down succeeded
 2013-01-18 08:18:32 LOG:   pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
 2013-01-18 08:18:39 LOG:   pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
 2013-01-18 08:18:46 LOG:   pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working


There was no postgresql/network outage, nor I was able to find any corellation to anything on the systems
which could trigger this.

Usualy this happen on both servers in same time so I have quickly to solve the situation by pkill -9 pgpool - which then gets
restarted by DJBs svscan (pkill is not able to actually stop the pgpool when it's in this state).
But today as the other pgpool was working I was doing a bit more troublehooting on the former pgpool master and have found that
it was accepting connections from db clients, but the connection never finished (it got hanged).
Connection to PCP was working all the time OK (attaching pcp_pool_status to this email).

I have executed strace in pid 3090 to see what's happening but i'm not really able to read the output properly so I would appreciate if someone
can have a look into attached strace output if she/he can identify what could be the reason for the hang.
Intereseting enough that after about 40-45 minutes the pgpool on the former pgpool master started to work again properly.
There was nothing in logfile, just the log entries "wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working" stopped to apear.
This event should be included inside the strace output too as the strace was running during that time.

One thing I have also noticed shorty AFTER the pgpool on former master started working again by itself is that
all processes on former pgpool master were started on Jan 15 , but one pgpool process had other started time shown via the ps axuwf
and by coincidence it is the time when the pgpool started working again:

pgsql    23021  0.0  0.0 110308  1876 ?        S    09:02   0:00  |   |               \_ pgpool: wait for connection request

Maybe it is normal, I just noticed the time of the process is other than other pgpool processes, here is the whole snapshot of ps axuwf related to pgpool:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1975  0.0  0.0   4164   336 ?        S     2012   0:00  |   \_ supervise pgpool
root      3065  0.0  0.0  17740  1424 ?        S    Jan15   0:00  |   |   \_ /bin/sh ./run
pgsql     3078  0.0  0.0  20296  1788 ?        S    Jan15   0:00  |   |       \_ -su -c /usr/local/bin/pgpool -f /usr/local/etc/pgpool.conf -n >> /var/log/pgpool.log
pgsql     3090  0.0  0.0 110184  4640 ?        S    Jan15   0:12  |   |           \_ /usr/local/bin/pgpool -f /usr/local/etc/pgpool.conf -n
pgsql     3151  0.0  0.0 110184   740 ?        S    Jan15   0:00  |   |               \_ pgpool: watchdog
pgsql     3152  0.0  0.0 184048  1592 ?        Sl   Jan15   0:43  |   |               \_ pgpool: lifecheck
pgsql    24999  0.0  0.0   4392   352 ?        S    09:08   0:00  |   |               |   \_ ping -q -c3 172.31.2.1
pgsql     3153  0.0  0.1 119348 11176 ?        S    Jan15   1:44  |   |               \_ pgpool: az az 172.31.2.159(50064) idle
pgsql     3154  0.0  0.1 119484 11280 ?        S    Jan15   1:55  |   |               \_ pgpool: az az 172.31.2.159(50068) idle
pgsql     3155  0.0  0.1 119340 11088 ?        S    Jan15   1:47  |   |               \_ pgpool: az az 172.31.2.159(49960) idle
pgsql     3156  0.0  0.1 119008 10744 ?        S    Jan15   1:36  |   |               \_ pgpool: az az 172.31.2.159(50045) idle
pgsql     3157  0.0  0.1 119148 10856 ?        S    Jan15   1:42  |   |               \_ pgpool: az az 172.31.2.159(50075) idle
pgsql     3158  0.0  0.1 119208 10964 ?        S    Jan15   1:44  |   |               \_ pgpool: az az 172.31.2.159(50072) idle
pgsql     3159  0.0  0.1 118836 10608 ?        S    Jan15   1:37  |   |               \_ pgpool: az az 172.31.2.159(50006) idle
pgsql     3160  0.0  0.1 118360 10112 ?        S    Jan15   1:25  |   |               \_ pgpool: az az 172.31.2.158(57783) idle
pgsql     3161  0.0  0.1 119352 11128 ?        S    Jan15   1:50  |   |               \_ pgpool: az az 172.31.2.159(50007) idle
pgsql     3162  0.0  0.1 118772 10508 ?        S    Jan15   1:34  |   |               \_ pgpool: az az 172.31.2.159(50046) idle
pgsql     3163  0.0  0.1 119620 11388 ?        S    Jan15   1:51  |   |               \_ pgpool: az az 172.31.2.159(50062) idle
pgsql     3164  0.0  0.1 120384 12092 ?        S    Jan15   1:52  |   |               \_ pgpool: az az 172.31.2.159(50047) idle
pgsql     3165  0.0  0.1 119384 11220 ?        S    Jan15   1:51  |   |               \_ pgpool: az az 172.31.2.159(50010) idle
pgsql     3166  0.0  0.1 120084 11824 ?        S    Jan15   2:08  |   |               \_ pgpool: az az 172.31.2.159(50042) idle
pgsql     3167  0.0  0.1 120444 12116 ?        S    Jan15   1:52  |   |               \_ pgpool: az az 172.31.2.159(50074) idle
pgsql     3168  0.0  0.1 119732 11492 ?        S    Jan15   1:53  |   |               \_ pgpool: az az 172.31.2.159(50070) idle
pgsql     3170  0.0  0.1 119868 11680 ?        S    Jan15   1:58  |   |               \_ pgpool: az az 172.31.2.159(50067) idle
pgsql     3171  0.0  0.1 120224 11980 ?        S    Jan15   2:04  |   |               \_ pgpool: az az 172.31.2.159(49972) idle
pgsql     3172  0.0  0.1 119880 11676 ?        S    Jan15   1:57  |   |               \_ pgpool: az az 172.31.2.159(50048) idle
pgsql     3173  0.0  0.1 119440 11136 ?        S    Jan15   1:51  |   |               \_ pgpool: az az 172.31.2.159(50008) idle
pgsql     3174  0.0  0.1 119296 11108 ?        S    Jan15   1:52  |   |               \_ pgpool: az az 172.31.2.159(50011) idle
pgsql     3175  0.0  0.1 118952 10700 ?        S    Jan15   1:38  |   |               \_ pgpool: az az 172.31.2.159(50049) idle
pgsql     3176  0.0  0.1 119200 11008 ?        S    Jan15   1:47  |   |               \_ pgpool: az az 172.31.2.159(50066) idle
pgsql     3178  0.0  0.1 119616 11376 ?        S    Jan15   1:55  |   |               \_ pgpool: az az 172.31.2.159(49973) idle
pgsql     3179  0.0  0.1 118920 10712 ?        S    Jan15   1:40  |   |               \_ pgpool: az az 172.31.2.159(50043) idle
pgsql     3181  0.0  0.1 120188 11872 ?        S    Jan15   1:48  |   |               \_ pgpool: az az 172.31.2.158(57784) idle
pgsql     3183  0.0  0.1 119688 11460 ?        S    Jan15   1:57  |   |               \_ pgpool: az az 172.31.2.159(50076) idle
pgsql     3185  0.0  0.1 119444 11160 ?        S    Jan15   1:47  |   |               \_ pgpool: az az 172.31.2.159(49959) idle
pgsql     3186  0.0  0.1 119428 11216 ?        S    Jan15   1:54  |   |               \_ pgpool: az az 172.31.2.159(50069) idle
pgsql     3187  0.0  0.1 119888 11708 ?        S    Jan15   1:57  |   |               \_ pgpool: az az 172.31.2.159(50044) idle
pgsql     3188  0.0  0.1 119344 11092 ?        S    Jan15   1:44  |   |               \_ pgpool: az az 172.31.2.159(50009) idle
pgsql     3189  0.0  0.0 110772  1764 ?        S    Jan15   0:00  |   |               \_ pgpool: PCP: wait for connection request
pgsql     3190  0.0  0.0 110296  1320 ?        S    Jan15   0:25  |   |               \_ pgpool: worker process
pgsql    23021  0.0  0.0 110308  1876 ?        S    09:02   0:00  |   |               \_ pgpool: wait for connection request


My setup:
172.31.2.158    - pgpool1 (former master - all above logs were done within this server)
172.31.2.159    - pgpool2 (former slave)
172.31.2.160    - virtual IP managed by pgpool where db clients connects to (originally assigned to pgpool1 server)

And all above hosts i'm having included in /etc/hosts to avoid any DNS influence..

I have later today enabled debug_level = 1 in pgpool.conf and limited logfile to 2GB (which limits the log only for abou 10 minutes)

Also have included my pgpool.conf files which I believe are setup correctly, but one never knows...

Many thanks for any help, although I like pgpool very much it's not stable in my environment which is quite frustrating as I opted for pgpool to improve
service availability but till now it's rather exactly opposite.

Tomas
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 2013-01-18.zip
Type: application/x-zip-compressed
Size: 21629 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20130118/f0327891/attachment-0001.bin>


More information about the pgpool-general mailing list