[pgpool-general: 1336] Re: WatchDog and pgpool suddenly stop working
Tatsuo Ishii
ishii at postgresql.org
Wed Jan 23 11:00:23 JST 2013
I suspect that your num_init_children is too small. In that case it
is possible that the watchdog life check cannot connect to pgpool,
times out, and goes into error status. You really should make sure the
number of incoming connections from pgpool clients does not exceed
num_init_children - 1 (the 1 leaves room for the watchdog life check).
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
> Hello Tatsuo,
>
> today my active pgpool stopped working again, and this time I have a full debug log of the situation. I hope it will help to find the root cause of the spontaneous watchdog failures.
>
> wwwpri is the server running the master pgpool, and the log file below is from that same wwwpri server:
>
>
>
> 2013-01-22 16:45:31 DEBUG: pid 26847: starting health checking
> 2013-01-22 16:45:31 DEBUG: pid 26847: health_check: 0 th DB node status: 2
> 2013-01-22 16:45:31 DEBUG: pid 26847: pool_ssl: SSL requested but SSL support is not available
> 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: auth kind: 0
> 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: backend key data received
> 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: transaction state: I
> 2013-01-22 16:45:31 DEBUG: pid 26847: health_check: 1 th DB node status: 2
> 2013-01-22 16:45:31 DEBUG: pid 26847: pool_ssl: SSL requested but SSL support is not available
> 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: auth kind: 0
> 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: backend key data received
> 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: transaction state: I
> 2013-01-22 16:45:32 DEBUG: pid 26890: pool_ssl: SSL requested but SSL support is not available
> 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: auth kind: 0
> 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: backend key data received
> 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: transaction state: I
> 2013-01-22 16:45:32 DEBUG: pid 26890: pool_ssl: SSL requested but SSL support is not available
> 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: auth kind: 0
> 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: backend key data received
> 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: transaction state: I
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: extended:0 query:SELECT pg_current_xlog_location()
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: T
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: row description received
> 2013-01-22 16:45:32 DEBUG: pid 26890: num_fileds: 1
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: D
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: data row received
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: C
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: Command complete received
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: Z
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: Ready for query
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: extended:0 query:SELECT pg_last_xlog_replay_location()
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: T
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: row description received
> 2013-01-22 16:45:32 DEBUG: pid 26890: num_fileds: 1
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: D
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: data row received
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: C
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: Command complete received
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: Z
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: Ready for query
> 2013-01-22 16:45:33 DEBUG: pid 26855: wd_lifecheck: checking pgpool 0 (wwwpri:54320)
> 2013-01-22 16:45:33 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:1
> 2013-01-22 16:45:33 LOG: pid 26855: wd_lifecheck: lifecheck failed 5 times. pgpool seems not to be working
> 2013-01-22 16:45:33 LOG: pid 26855: wd_IP_down: ifconfig down succeeded
> 2013-01-22 16:45:33 DEBUG: pid 26855: wd_lifecheck: checking pgpool 1 (wwwsec:54320)
> 2013-01-22 16:45:33 DEBUG: pid 26855: wd_lifecheck: OK, status: 2
> 2013-01-22 16:45:40 DEBUG: pid 26855: get_result: ping data: PING 172.31.2.1 (172.31.2.1) 56(84) bytes of data.
>
> --- 172.31.2.1 ping statistics ---
> 3 packets transmitted, 3 received, 0% packet loss, time 2000ms
> rtt min/avg/max/mdev = 0.342/0.393/0.485/0.068 ms
>
> 2013-01-22 16:45:42 DEBUG: pid 26890: pool_ssl: SSL requested but SSL support is not available
> 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: auth kind: 0
> 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: backend key data received
> 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: transaction state: I
> 2013-01-22 16:45:42 DEBUG: pid 26890: pool_ssl: SSL requested but SSL support is not available
> 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: auth kind: 0
> 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: backend key data received
> 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: transaction state: I
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: extended:0 query:SELECT pg_current_xlog_location()
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: T
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: row description received
> 2013-01-22 16:45:42 DEBUG: pid 26890: num_fileds: 1
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: D
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: data row received
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: C
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: Command complete received
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: Z
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: Ready for query
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: extended:0 query:SELECT pg_last_xlog_replay_location()
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: T
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: row description received
> 2013-01-22 16:45:42 DEBUG: pid 26890: num_fileds: 1
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: D
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: data row received
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: C
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: Command complete received
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: Z
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: Ready for query
> 2013-01-22 16:45:43 DEBUG: pid 26855: wd_lifecheck: checking pgpool 0 (wwwpri:54320)
> 2013-01-22 16:45:43 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:45:43 LOG: pid 26855: wd_lifecheck: lifecheck failed 5 times. pgpool seems not to be working
> 2013-01-22 16:45:43 DEBUG: pid 26855: wd_lifecheck: checking pgpool 1 (wwwsec:54320)
> 2013-01-22 16:45:43 DEBUG: pid 26855: wd_lifecheck: OK, status: 2
> 2013-01-22 16:45:50 DEBUG: pid 26855: get_result: ping data: PING 172.31.2.1 (172.31.2.1) 56(84) bytes of data.
>
> --- 172.31.2.1 ping statistics ---
> 3 packets transmitted, 3 received, 0% packet loss, time 1999ms
> rtt min/avg/max/mdev = 0.300/0.335/0.382/0.037 ms
>
>
> Here is the log file grepped only for "wd_lifecheck: NG, status: ":
>
> 2013-01-22 16:44:53 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:5
> 2013-01-22 16:45:03 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:4
> 2013-01-22 16:45:13 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:3
> 2013-01-22 16:45:23 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:2
> 2013-01-22 16:45:33 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:1
> 2013-01-22 16:45:43 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:45:53 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:46:03 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:46:13 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:46:23 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:46:33 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:46:43 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:46:53 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:47:03 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:47:13 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:47:23 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:47:33 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:47:43 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
>
>
>
> Any help is highly appreciated.
> Thank you,
>
> Tomas
>
>
>
>
>
> On 22, Jan, 2013, at 7:42 , Tomas Halgas <tomas at halgas.sk> wrote:
>
>> Hi Tatsuo,
>>
>> first of all, thank you for your reply.
>> The symptom is that once "wd_life_check failed" occurs, pgpool switches the virtual IP to the second pgpool node.
>>
>> The original pgpool node then still accepts new connections from clients at the TCP level (on its physical IP), but they time out (not even providing a prompt to the psql client).
>>
>> I'm not able to test already established sessions, as those are established by a PHP web application. Should I keep one open connection from a psql client just for this case, to test the next time it happens?
>> I'm now restarting both pgpool daemons every night as a workaround, as this regular restart seems to avoid the hang that otherwise occurs after a 2-3 day period.
>>
>> Regards,
>>
>> Tomas
>>
>>
>> On 21, Jan, 2013, at 23:51 , Tatsuo Ishii <ishii at postgresql.org> wrote:
>>
>>> What do you mean by "pgpools stop working"?
>>>
>>> I see many clients already connect to pgpool. Can you issue a query
>>> through one of the sessions?
>>> --
>>> Tatsuo Ishii
>>> SRA OSS, Inc. Japan
>>> English: http://www.sraoss.co.jp/index_en.php
>>> Japanese: http://www.sraoss.co.jp
>>>
>>>> I have a recurring issue on my pgpool servers (two servers monitoring each other via the pgpool watchdog).
>>>>
>>>> Roughly every second or third day, one (or both) of the pgpools stops working; most of the time the log starts with:
>>>> read_startup_packet: incorrect packet length (0)
>>>>
>>>> and then many of these messages follow on both servers:
>>>> wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
>>>>
>>>>
>>>> Today it happened only on the server which was the pgpool master, and it correctly switched over to the other pgpool server.
>>>> This is the log from the master:
>>>> 2013-01-18 08:18:20 ERROR: pid 3160: read_startup_packet: incorrect packet length (0)
>>>> 2013-01-18 08:18:23 LOG: pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
>>>> 2013-01-18 08:18:23 LOG: pid 3152: wd_IP_down: ifconfig down succeeded
>>>> 2013-01-18 08:18:32 LOG: pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
>>>> 2013-01-18 08:18:39 LOG: pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
>>>> 2013-01-18 08:18:46 LOG: pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
>>>>
>>>>
>>>> There was no postgresql/network outage, nor was I able to find any correlation with anything on the systems
>>>> that could have triggered this.
>>>>
>>>> Usually this happens on both servers at the same time, so I quickly have to resolve the situation with pkill -9 pgpool - it then gets
>>>> restarted by DJB's svscan (a plain pkill is not able to actually stop pgpool when it's in this state).
>>>> But today, as the other pgpool was working, I did a bit more troubleshooting on the former pgpool master and found that
>>>> it was accepting connections from db clients, but the connections never completed (they hung).
>>>> The connection to PCP was working fine the whole time (attaching pcp_pool_status to this email).
>>>>
>>>> I ran strace on pid 3090 to see what was happening, but I'm not really able to read the output properly, so I would appreciate it if someone
>>>> could have a look at the attached strace output and identify what might be the reason for the hang.
>>>> Interestingly enough, after about 40-45 minutes the pgpool on the former master started working properly again.
>>>> There was nothing in the logfile; the "wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working" entries just stopped appearing.
>>>> This event should be captured in the strace output too, as strace was running during that time.
>>>>
>>>> One thing I also noticed shortly AFTER the pgpool on the former master started working again by itself is that
>>>> all pgpool processes on the former master were started on Jan 15, but one pgpool process showed a different start time in ps axuwf -
>>>> and by coincidence it is the time when pgpool started working again:
>>>>
>>>> pgsql 23021 0.0 0.0 110308 1876 ? S 09:02 0:00 | | \_ pgpool: wait for connection request
>>>>
>>>> Maybe it is normal; I just noticed that this process's start time differs from the other pgpool processes. Here is the whole ps axuwf snapshot related to pgpool:
>>>>
>>>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
>>>> root 1975 0.0 0.0 4164 336 ? S 2012 0:00 | \_ supervise pgpool
>>>> root 3065 0.0 0.0 17740 1424 ? S Jan15 0:00 | | \_ /bin/sh ./run
>>>> pgsql 3078 0.0 0.0 20296 1788 ? S Jan15 0:00 | | \_ -su -c /usr/local/bin/pgpool -f /usr/local/etc/pgpool.conf -n >> /var/log/pgpool.log
>>>> pgsql 3090 0.0 0.0 110184 4640 ? S Jan15 0:12 | | \_ /usr/local/bin/pgpool -f /usr/local/etc/pgpool.conf -n
>>>> pgsql 3151 0.0 0.0 110184 740 ? S Jan15 0:00 | | \_ pgpool: watchdog
>>>> pgsql 3152 0.0 0.0 184048 1592 ? Sl Jan15 0:43 | | \_ pgpool: lifecheck
>>>> pgsql 24999 0.0 0.0 4392 352 ? S 09:08 0:00 | | | \_ ping -q -c3 172.31.2.1
>>>> pgsql 3153 0.0 0.1 119348 11176 ? S Jan15 1:44 | | \_ pgpool: az az 172.31.2.159(50064) idle
>>>> pgsql 3154 0.0 0.1 119484 11280 ? S Jan15 1:55 | | \_ pgpool: az az 172.31.2.159(50068) idle
>>>> pgsql 3155 0.0 0.1 119340 11088 ? S Jan15 1:47 | | \_ pgpool: az az 172.31.2.159(49960) idle
>>>> pgsql 3156 0.0 0.1 119008 10744 ? S Jan15 1:36 | | \_ pgpool: az az 172.31.2.159(50045) idle
>>>> pgsql 3157 0.0 0.1 119148 10856 ? S Jan15 1:42 | | \_ pgpool: az az 172.31.2.159(50075) idle
>>>> pgsql 3158 0.0 0.1 119208 10964 ? S Jan15 1:44 | | \_ pgpool: az az 172.31.2.159(50072) idle
>>>> pgsql 3159 0.0 0.1 118836 10608 ? S Jan15 1:37 | | \_ pgpool: az az 172.31.2.159(50006) idle
>>>> pgsql 3160 0.0 0.1 118360 10112 ? S Jan15 1:25 | | \_ pgpool: az az 172.31.2.158(57783) idle
>>>> pgsql 3161 0.0 0.1 119352 11128 ? S Jan15 1:50 | | \_ pgpool: az az 172.31.2.159(50007) idle
>>>> pgsql 3162 0.0 0.1 118772 10508 ? S Jan15 1:34 | | \_ pgpool: az az 172.31.2.159(50046) idle
>>>> pgsql 3163 0.0 0.1 119620 11388 ? S Jan15 1:51 | | \_ pgpool: az az 172.31.2.159(50062) idle
>>>> pgsql 3164 0.0 0.1 120384 12092 ? S Jan15 1:52 | | \_ pgpool: az az 172.31.2.159(50047) idle
>>>> pgsql 3165 0.0 0.1 119384 11220 ? S Jan15 1:51 | | \_ pgpool: az az 172.31.2.159(50010) idle
>>>> pgsql 3166 0.0 0.1 120084 11824 ? S Jan15 2:08 | | \_ pgpool: az az 172.31.2.159(50042) idle
>>>> pgsql 3167 0.0 0.1 120444 12116 ? S Jan15 1:52 | | \_ pgpool: az az 172.31.2.159(50074) idle
>>>> pgsql 3168 0.0 0.1 119732 11492 ? S Jan15 1:53 | | \_ pgpool: az az 172.31.2.159(50070) idle
>>>> pgsql 3170 0.0 0.1 119868 11680 ? S Jan15 1:58 | | \_ pgpool: az az 172.31.2.159(50067) idle
>>>> pgsql 3171 0.0 0.1 120224 11980 ? S Jan15 2:04 | | \_ pgpool: az az 172.31.2.159(49972) idle
>>>> pgsql 3172 0.0 0.1 119880 11676 ? S Jan15 1:57 | | \_ pgpool: az az 172.31.2.159(50048) idle
>>>> pgsql 3173 0.0 0.1 119440 11136 ? S Jan15 1:51 | | \_ pgpool: az az 172.31.2.159(50008) idle
>>>> pgsql 3174 0.0 0.1 119296 11108 ? S Jan15 1:52 | | \_ pgpool: az az 172.31.2.159(50011) idle
>>>> pgsql 3175 0.0 0.1 118952 10700 ? S Jan15 1:38 | | \_ pgpool: az az 172.31.2.159(50049) idle
>>>> pgsql 3176 0.0 0.1 119200 11008 ? S Jan15 1:47 | | \_ pgpool: az az 172.31.2.159(50066) idle
>>>> pgsql 3178 0.0 0.1 119616 11376 ? S Jan15 1:55 | | \_ pgpool: az az 172.31.2.159(49973) idle
>>>> pgsql 3179 0.0 0.1 118920 10712 ? S Jan15 1:40 | | \_ pgpool: az az 172.31.2.159(50043) idle
>>>> pgsql 3181 0.0 0.1 120188 11872 ? S Jan15 1:48 | | \_ pgpool: az az 172.31.2.158(57784) idle
>>>> pgsql 3183 0.0 0.1 119688 11460 ? S Jan15 1:57 | | \_ pgpool: az az 172.31.2.159(50076) idle
>>>> pgsql 3185 0.0 0.1 119444 11160 ? S Jan15 1:47 | | \_ pgpool: az az 172.31.2.159(49959) idle
>>>> pgsql 3186 0.0 0.1 119428 11216 ? S Jan15 1:54 | | \_ pgpool: az az 172.31.2.159(50069) idle
>>>> pgsql 3187 0.0 0.1 119888 11708 ? S Jan15 1:57 | | \_ pgpool: az az 172.31.2.159(50044) idle
>>>> pgsql 3188 0.0 0.1 119344 11092 ? S Jan15 1:44 | | \_ pgpool: az az 172.31.2.159(50009) idle
>>>> pgsql 3189 0.0 0.0 110772 1764 ? S Jan15 0:00 | | \_ pgpool: PCP: wait for connection request
>>>> pgsql 3190 0.0 0.0 110296 1320 ? S Jan15 0:25 | | \_ pgpool: worker process
>>>> pgsql 23021 0.0 0.0 110308 1876 ? S 09:02 0:00 | | \_ pgpool: wait for connection request
>>>>
>>>>
>>>> My setup:
>>>> 172.31.2.158 - pgpool1 (former master - all above logs were done within this server)
>>>> 172.31.2.159 - pgpool2 (former slave)
>>>> 172.31.2.160 - virtual IP managed by pgpool, which db clients connect to (originally assigned to the pgpool1 server)
>>>>
>>>> All of the above hosts are included in /etc/hosts to avoid any DNS influence.
>>>>
>>>> Later today I enabled debug_level = 1 in pgpool.conf and limited the logfile to 2GB (which limits the log to only about 10 minutes).
>>>>
>>>> I have also included my pgpool.conf files, which I believe are set up correctly, but one never knows...
>>>>
>>>> Many thanks for any help. Although I like pgpool very much, it's not stable in my environment, which is quite frustrating, as I opted for pgpool to improve
>>>> service availability, but so far it has been rather the opposite.
>>>>
>>>> Tomas
>>>
>>
>