[pgpool-general: 1046] watchdog enabled delegate_IP on multiple nodes simultaneously

Lonni J Friedman netllama at gmail.com
Thu Sep 27 01:05:09 JST 2012


I'm running pgpool-II 3.2.0 on two Linux servers, with use_watchdog=on.
Two days ago, I noticed that the watchdog had enabled the delegate_IP on
both servers simultaneously (and it remains in that state as of today).
This seems like the wrong behavior: I was under the impression that the
delegate_IP should be up on only one server at any time.  I suppose this
might be harmless, as long as both servers are otherwise working ok,
but having the IP up on both at once seems to defeat the point of
enabling it on one server only when the other server is down.

I've verified that both servers are responding by pinging the
delegate_IP (from a separate system), and checking the HWaddress from
the 'arp' command.  If I then purge the arp cache and ping again, it
will eventually show the other server's HWaddress associated with the
delegate_IP.
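
For anyone wanting to repeat this check, here is a minimal Python sketch
of the idea: take several snapshots of the arp table and collect the
hardware addresses seen for the delegate_IP.  The snapshot text is a
made-up illustration in the usual Linux `arp -n` layout (not output
captured from my systems), and the helper names are hypothetical.

```python
import re

# Hypothetical `arp -n` snapshots taken before and after purging the cache.
# The MACs are the eth0 HWaddr values of the two servers in this report.
SAMPLE_BEFORE = """\
Address                  HWtype  HWaddress           Flags Mask            Iface
10.31.97.78              ether   52:54:00:fc:5a:dd   C                     eth0
"""
SAMPLE_AFTER = """\
Address                  HWtype  HWaddress           Flags Mask            Iface
10.31.97.78              ether   00:16:3e:87:6f:43   C                     eth0
"""

MAC_RE = re.compile(r"\b([0-9a-f]{2}(?::[0-9a-f]{2}){5})\b", re.IGNORECASE)


def mac_for_ip(arp_output, ip):
    """Return the lower-cased MAC listed for `ip`, or None if it is absent."""
    for line in arp_output.splitlines():
        fields = line.split()
        if fields and fields[0] == ip:
            match = MAC_RE.search(line)
            if match:
                return match.group(1).lower()
    return None


def distinct_macs(snapshots, ip):
    """Collect every MAC seen for `ip` across several arp snapshots."""
    found = (mac_for_ip(text, ip) for text in snapshots)
    return {mac for mac in found if mac}


if __name__ == "__main__":
    seen = distinct_macs([SAMPLE_BEFORE, SAMPLE_AFTER], "10.31.97.78")
    # More than one MAC means more than one host answers for the address.
    print(sorted(seen))
```

More than one address in the resulting set means that more than one
host is answering ARP for the delegate_IP.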

In the pgpool logs on both servers, I do see the following around the
time that this happened:
wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working

My primary concern is why both servers have the delegate_IP up
simultaneously when there was clearly some sort of problem that should
have caused it to be brought down on at least 1 of them.

Both servers can communicate with each other (they can ping each
other, and I can invoke psql to connect to localhost pgpool from both
servers).  Here are all the uncommented settings in the WATCHDOG
section of pgpool.conf (with wd_hostname differing for each server):
#########
use_watchdog = on
                                    # Activates watchdog
trusted_servers = 'cuda-fs1,cuda-vm0,cuda-fs2'
                                    # trusted server list which are used
                                    # to confirm network connection
                                    # (hostA,hostB,hostC,...)
delegate_IP = '10.31.97.78'
                                    # delegate IP address
wd_hostname = '10.31.99.166'
                                    # Host name or IP address of this watchdog
wd_port = 9000
                                    # port number for watchdog service
wd_interval = 10
                                    # lifecheck interval (sec) > 0
ping_path = '/bin'
                                    # ping command path
ifconfig_path = '/sbin'
                                    # ifconfig command path
if_up_cmd = 'ifconfig eth0:0 inet $_IP_$ netmask 255.255.252.0'
                                    # startup delegate IP command
if_down_cmd = 'ifconfig eth0:0 down'
                                    # shutdown delegate IP command
arping_path = '/usr/sbin'           # arping command path
arping_cmd = 'arping -U $_IP_$ -w 1'
                                    # arping command
wd_life_point = 3
                                    # lifecheck retry times
wd_lifecheck_query = 'SELECT 1'
                                    # lifecheck query to pgpool from watchdog
other_pgpool_hostname0 = '10.31.99.165'
                                    # Host name or IP address to connect to for other pgpool 0
other_pgpool_port0 = 9999
                                    # Port number for other pgpool 0
other_wd_port0 = 9000
#########
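
As a sanity check on the addressing, the following Python sketch (an
illustration only, using the standard ipaddress module) confirms that
the delegate_IP and both watchdog hosts fall in the same network under
the netmask given in if_up_cmd.  The variable and function names are
mine, not pgpool's.

```python
import ipaddress

# Values copied from the configuration and addresses in this report.
delegate_ip = "10.31.97.78"
netmask = "255.255.252.0"  # from if_up_cmd
wd_hostnames = ["10.31.99.165", "10.31.99.166"]


def network_of(ip, mask):
    """Return the IPv4 network that `ip` belongs to under `mask`."""
    return ipaddress.ip_interface(f"{ip}/{mask}").network


delegate_net = network_of(delegate_ip, netmask)
same_subnet = all(network_of(host, netmask) == delegate_net for host in wd_hostnames)

print(delegate_net, delegate_net.broadcast_address, same_subnet)
```

So the netmask itself looks consistent; it does not explain why both
nodes hold the address.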

Here's ifconfig output from each server.  First 10.31.99.165:
#########
eth0      Link encap:Ethernet  HWaddr 52:54:00:FC:5A:DD
          inet addr:10.31.99.165  Bcast:10.31.99.255  Mask:255.255.252.0
          inet6 addr: fe80::5054:ff:fefc:5add/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4195383952 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4618638427 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5682945530436 (5.1 TiB)  TX bytes:7468635177326 (6.7 TiB)

eth0:0    Link encap:Ethernet  HWaddr 52:54:00:FC:5A:DD
          inet addr:10.31.97.78  Bcast:10.31.99.255  Mask:255.255.252.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
#########

And 10.31.99.166:
#########
eth0      Link encap:Ethernet  HWaddr 00:16:3E:87:6F:43
          inet addr:10.31.99.166  Bcast:10.31.99.255  Mask:255.255.252.0
          inet6 addr: fe80::216:3eff:fe87:6f43/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4618130586 errors:0 dropped:0 overruns:0 frame:0
          TX packets:200152071 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5702786076291 (5.1 TiB)  TX bytes:15085410736 (14.0 GiB)

eth0:0    Link encap:Ethernet  HWaddr 00:16:3E:87:6F:43
          inet addr:10.31.97.78  Bcast:10.31.99.255  Mask:255.255.252.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
#########
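
The comparison of the two outputs can also be done mechanically.  This
is a rough Python sketch, assuming the old net-tools ifconfig format
shown above (the "inet addr:" prefix); the dictionary holds trimmed
copies of the output from each server and the function name is
hypothetical.

```python
import re

# Trimmed copies of the ifconfig output above, keyed by each server's address.
IFCONFIG = {
    "10.31.99.165": """\
eth0      Link encap:Ethernet  HWaddr 52:54:00:FC:5A:DD
          inet addr:10.31.99.165  Bcast:10.31.99.255  Mask:255.255.252.0
eth0:0    Link encap:Ethernet  HWaddr 52:54:00:FC:5A:DD
          inet addr:10.31.97.78  Bcast:10.31.99.255  Mask:255.255.252.0
""",
    "10.31.99.166": """\
eth0      Link encap:Ethernet  HWaddr 00:16:3E:87:6F:43
          inet addr:10.31.99.166  Bcast:10.31.99.255  Mask:255.255.252.0
eth0:0    Link encap:Ethernet  HWaddr 00:16:3E:87:6F:43
          inet addr:10.31.97.78  Bcast:10.31.99.255  Mask:255.255.252.0
""",
}

INET_RE = re.compile(r"inet addr:(\d+\.\d+\.\d+\.\d+)")


def holders(outputs, ip):
    """Return the sorted list of servers whose ifconfig output lists `ip`."""
    return sorted(
        server for server, text in outputs.items() if ip in INET_RE.findall(text)
    )


if __name__ == "__main__":
    # Two entries here means the delegate_IP is configured on both servers.
    print(holders(IFCONFIG, "10.31.97.78"))
```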

Here's the content of the pgpool log from 10.31.99.165:
#########
2012-09-24 10:55:34 ERROR: pid 28064: new_connection: create_cp() failed
2012-09-24 10:55:34 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:55:34 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:55:45 ERROR: pid 28088: connect_inet_domain_socket: connect() failed: Connection refused
2012-09-24 10:55:45 ERROR: pid 28088: connection to cuda-db0(5432) failed
2012-09-24 10:55:45 ERROR: pid 28088: new_connection: create_cp() failed
2012-09-24 10:55:46 ERROR: pid 28086: connect_inet_domain_socket: connect() failed: Connection refused
2012-09-24 10:55:46 ERROR: pid 28086: connection to cuda-db0(5432) failed
2012-09-24 10:55:46 ERROR: pid 28086: new_connection: create_cp() failed
2012-09-24 10:55:46 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:55:46 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:55:57 ERROR: pid 28099: connect_inet_domain_socket: connect() failed: Connection refused
2012-09-24 10:55:57 ERROR: pid 28099: connection to cuda-db0(5432) failed
2012-09-24 10:55:57 ERROR: pid 28099: new_connection: create_cp() failed
2012-09-24 10:55:58 ERROR: pid 27674: connect_inet_domain_socket: connect() failed: Connection refused
2012-09-24 10:55:58 ERROR: pid 27674: connection to cuda-db0(5432) failed
2012-09-24 10:55:58 ERROR: pid 27674: new_connection: create_cp() failed
2012-09-24 10:55:58 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:55:58 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:56:09 ERROR: pid 28108: connect_inet_domain_socket: connect() failed: Connection refused
2012-09-24 10:56:09 ERROR: pid 28108: connection to cuda-db0(5432) failed
2012-09-24 10:56:09 ERROR: pid 28108: new_connection: create_cp() failed
2012-09-24 10:56:10 ERROR: pid 28106: connect_inet_domain_socket: connect() failed: Connection refused
2012-09-24 10:56:10 ERROR: pid 28106: connection to cuda-db0(5432) failed
2012-09-24 10:56:10 ERROR: pid 28106: new_connection: create_cp() failed
2012-09-24 10:56:10 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:56:10 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:56:22 LOG:   pid 27612: wd_escalation: escalated to master pgpool
2012-09-24 10:56:24 LOG:   pid 27612: wd_escalation:  escaleted to delegate_IP holder
2012-09-24 10:57:04 LOG:   pid 27813: send_failback_request: fail back 0 th node request from pid 27813
2012-09-24 10:57:04 ERROR: pid 27596: failover_handler: invalid node_id 0 status:2 MAX_NUM_BACKENDS: 128
2012-09-24 10:57:06 LOG:   pid 27813: send_failback_request: fail back 1 th node request from pid 27813
2012-09-24 10:57:06 ERROR: pid 27596: failover_handler: invalid node_id 1 status:2 MAX_NUM_BACKENDS: 128
2012-09-24 10:57:09 LOG:   pid 27813: send_failback_request: fail back 2 th node request from pid 27813
2012-09-24 10:57:09 ERROR: pid 27596: failover_handler: invalid node_id 2 status:2 MAX_NUM_BACKENDS: 128
#########

and 10.31.99.166:
#########
2012-09-24 10:55:22 ERROR: pid 7192: new_connection: create_cp() failed
2012-09-24 10:55:22 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:55:22 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:55:22 ERROR: pid 7191: connect_inet_domain_socket: connect() failed: Connection refused
2012-09-24 10:55:22 ERROR: pid 7191: connection to cuda-db0(5432) failed
2012-09-24 10:55:22 ERROR: pid 7191: new_connection: create_cp() failed
2012-09-24 10:55:34 ERROR: pid 7202: connect_inet_domain_socket: connect() failed: Connection refused
2012-09-24 10:55:34 ERROR: pid 7202: connection to cuda-db0(5432) failed
2012-09-24 10:55:34 ERROR: pid 7202: new_connection: create_cp() failed
2012-09-24 10:55:34 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:55:34 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:55:34 ERROR: pid 7209: connect_inet_domain_socket: connect() failed: Connection refused
2012-09-24 10:55:34 ERROR: pid 7209: connection to cuda-db0(5432) failed
2012-09-24 10:55:34 ERROR: pid 7209: new_connection: create_cp() failed
2012-09-24 10:55:46 ERROR: pid 7213: connect_inet_domain_socket: connect() failed: Connection refused
2012-09-24 10:55:46 ERROR: pid 7213: connection to cuda-db0(5432) failed
2012-09-24 10:55:46 ERROR: pid 7213: new_connection: create_cp() failed
2012-09-24 10:55:46 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:55:46 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:55:46 ERROR: pid 7221: connect_inet_domain_socket: connect() failed: Connection refused
2012-09-24 10:55:46 ERROR: pid 7221: connection to cuda-db0(5432) failed
2012-09-24 10:55:46 ERROR: pid 7221: new_connection: create_cp() failed
2012-09-24 10:55:58 ERROR: pid 7223: connect_inet_domain_socket: connect() failed: Connection refused
2012-09-24 10:55:58 ERROR: pid 7223: connection to cuda-db0(5432) failed
2012-09-24 10:55:58 ERROR: pid 7223: new_connection: create_cp() failed
2012-09-24 10:55:58 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:55:58 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:55:58 ERROR: pid 7231: connect_inet_domain_socket: connect() failed: Connection refused
2012-09-24 10:55:58 ERROR: pid 7231: connection to cuda-db0(5432) failed
2012-09-24 10:55:58 ERROR: pid 7231: new_connection: create_cp() failed
2012-09-24 10:56:10 ERROR: pid 7233: connect_inet_domain_socket: connect() failed: Connection refused
2012-09-24 10:56:10 ERROR: pid 7233: connection to cuda-db0(5432) failed
2012-09-24 10:56:10 ERROR: pid 7233: new_connection: create_cp() failed
2012-09-24 10:56:10 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:56:10 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
2012-09-24 10:56:10 ERROR: pid 7243: connect_inet_domain_socket: connect() failed: Connection refused
2012-09-24 10:56:10 ERROR: pid 7243: connection to cuda-db0(5432) failed
2012-09-24 10:56:10 ERROR: pid 7243: new_connection: create_cp() failed
2012-09-24 10:56:22 LOG:   pid 6724: wd_escalation: escalated to master pgpool
2012-09-24 10:56:24 LOG:   pid 6724: wd_escalation:  escaleted to delegate_IP holder
#########
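
What stands out in the two logs is that both nodes report escalating to
delegate_IP holder at the same second.  A small Python sketch of how
that overlap could be picked out automatically follows; the 5-second
window, the function names, and the idea of matching on the
"delegate_IP holder" text are my own assumptions, not anything pgpool
provides.

```python
from datetime import datetime
from itertools import combinations

# The wd_escalation entries from the logs above, one entry per line.
LOGS = {
    "10.31.99.165": [
        "2012-09-24 10:56:22 LOG:   pid 27612: wd_escalation: escalated to master pgpool",
        "2012-09-24 10:56:24 LOG:   pid 27612: wd_escalation:  escaleted to delegate_IP holder",
    ],
    "10.31.99.166": [
        "2012-09-24 10:56:22 LOG:   pid 6724: wd_escalation: escalated to master pgpool",
        "2012-09-24 10:56:24 LOG:   pid 6724: wd_escalation:  escaleted to delegate_IP holder",
    ],
}


def escalation_times(lines):
    """Timestamps of the entries where a node took over the delegate_IP."""
    times = []
    for line in lines:
        if "wd_escalation" in line and "delegate_IP holder" in line:
            # The first 19 characters hold "YYYY-MM-DD HH:MM:SS".
            times.append(datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S"))
    return times


def conflicting_pairs(logs, window_seconds=5):
    """Pairs of servers that took the delegate_IP within `window_seconds`."""
    stamps = {server: escalation_times(lines) for server, lines in logs.items()}
    pairs = []
    for a, b in combinations(sorted(stamps), 2):
        if any(
            abs((ta - tb).total_seconds()) <= window_seconds
            for ta in stamps[a]
            for tb in stamps[b]
        ):
            pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    print(conflicting_pairs(LOGS))
```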

I'm not sure what information is needed to debug what went wrong.
Let me know if anything else is needed, and I'll do my best to
provide it.  Thanks.

