[pgpool-general: 4796] Failover happening on already failed node
Jeremiah Penery
jpenery at redskytech.com
Wed Jul 20 04:19:20 JST 2016
The setup is two nodes, each running pgpool-II (3.4.7) and PostgreSQL (9.3) in streaming replication mode. Node 0 is the master/primary. Pgpool is using the watchdog.
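For context, the relevant parts of my pgpool.conf look roughly like this (paraphrased and trimmed, so values are illustrative rather than a verbatim copy):

```
backend_hostname0 = '172.20.20.220'   # node 0, primary (address illustrative)
backend_port0 = 5433
backend_hostname1 = '172.20.20.230'   # node 1, standby
backend_port1 = 5433
master_slave_mode = on
master_slave_sub_mode = 'stream'
use_watchdog = on
health_check_period = 5
health_check_max_retries = 2          # matches the two retries in the log below
failover_command = '/etc/pgpool-II/failover.sh %d %h %p %D %m %M %H %P'
```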
1) Perform a hard shutdown of node 1. Pgpool_0 fails over postgres_1, but nothing really happens because it isn't the primary node.
2) Restart node 1. Start pgpool_1.
3) Pgpool_1 tries to connect to the local postgres instance and eventually decides it's unreachable (because postgres isn't running). Now pgpool_1 performs another failover of postgres_1.
I don't see this happen every time, and I'm not sure why it happens at all. When pgpool_1 starts, it should rejoin the watchdog cluster, learn that postgres_1 is already considered down, and not trigger a failover. According to pgpool_0's logs, failing over postgres_1 again is invalid because it is already in state 3 (down).
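To illustrate what I mean by "state 3": as I understand it (the names and constants here are my own illustration, not pgpool source code), the check pgpool_0 applies when it rejects the request is essentially:

```python
# Illustrative sketch of the validity check implied by pgpool_0's
# "status: [3] is not valid for failover" log line. Constant names
# are my own; 3 is the value the log reports for a down node.
CON_UP = 2    # backend is up and in use
CON_DOWN = 3  # backend already degenerated (down)

def degenerate_request_is_valid(node_status):
    """A degenerate (failover) request only makes sense for a node
    that is not already down."""
    return node_status != CON_DOWN

# pgpool_0 sees postgres_1 in state 3, so it refuses a second failover:
print(degenerate_request_is_valid(CON_DOWN))  # False
print(degenerate_request_is_valid(CON_UP))    # True
```

My question is why pgpool_1, after rejoining the watchdog cluster, doesn't apply the same check and instead fires failover.sh again.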
Below is a portion of the logs from both pgpools after step 2. I've trimmed some of the lines where pgpool_1 repeatedly tries to connect to the unavailable postgres.
Is there something obviously wrong? I can provide any additional information needed.
Thanks,
Jeremiah
---- Pgpool_0 ----
2016-07-18 15:32:39: pid 3148: DETAIL: receive add request from 172.20.20.230:5432 and accept it
2016-07-18 15:32:45: pid 3148: LOG: sending watchdog response
2016-07-18 15:32:45: pid 3148: DETAIL: WD_STAND_FOR_LOCK_HOLDER received it
2016-07-18 15:32:45: pid 3148: LOG: invalid degenerate backend request, node id : 1 status: [3] is not valid for failover
---- Pgpool_1 ----
2016-07-18 15:32:38: pid 1811: LOG: Setting up socket for ::1:5432
2016-07-18 15:32:38: pid 1811: LOG: Setting up socket for 127.0.0.1:5432
2016-07-18 15:32:38: pid 1811: LOG: watchdog started
2016-07-18 15:32:39: pid 1811: LOG: pgpool-II successfully started. version 3.4.7 (tataraboshi)
2016-07-18 15:32:39: pid 1811: LOG: find_primary_node: checking backend no 0
2016-07-18 15:32:39: pid 2020: LOG: failed to connect to PostgreSQL server on "172.20.20.230:5433", getsockopt() detected error "Connection refused"
2016-07-18 15:32:39: pid 2020: ERROR: failed to make persistent db connection
2016-07-18 15:32:39: pid 2020: DETAIL: connection to host:"172.20.20.230:5433" failed
2016-07-18 15:32:39: pid 1811: LOG: find_primary_node: primary node id is 0
<connection refused some more>
2016-07-18 15:32:40: pid 1816: LOG: set SO_REUSEPORT option to the socket
2016-07-18 15:32:40: pid 1816: LOG: creating socket for sending heartbeat
2016-07-18 15:32:40: pid 1816: DETAIL: set SO_REUSEPORT
2016-07-18 15:32:40: pid 1815: LOG: set SO_REUSEPORT option to the socket
2016-07-18 15:32:40: pid 1815: LOG: creating watchdog heartbeat receive socket.
2016-07-18 15:32:40: pid 1815: DETAIL: set SO_REUSEPORT
2016-07-18 15:32:42: pid 1811: LOG: health checking retry count 1
<connection refused>
2016-07-18 15:32:45: pid 1811: LOG: health checking retry count 2
<connection refused>
2016-07-18 15:32:45: pid 1811: LOG: setting backend node 1 status to NODE DOWN
2016-07-18 15:32:45: pid 1811: LOG: watchdog notifying to start interlocking
2016-07-18 15:32:45: pid 1811: LOG: watchdog became a new lock holder
2016-07-18 15:32:49: pid 2020: ERROR: Failed to check replication time lag
2016-07-18 15:32:49: pid 2020: DETAIL: No persistent db connection for the node 1
2016-07-18 15:32:49: pid 2020: HINT: check sr_check_user and sr_check_password
2016-07-18 15:32:49: pid 2020: CONTEXT: while checking replication time lag
<connection refused>
2016-07-18 15:32:56: pid 1811: WARNING: watchdog start interlocking, timed out
2016-07-18 15:32:56: pid 1811: LOG: starting degeneration. shutdown host 172.20.20.230(5433)
2016-07-18 15:32:56: pid 1811: LOG: Restart all children
2016-07-18 15:32:56: pid 1811: LOG: execute command: /etc/pgpool-II/failover.sh ...
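For completeness, failover.sh only does real work when the failed node was the primary. A simplified dry-run sketch of its decision logic (argument names follow the %d/%h/%M/%H placeholders from failover_command; the real script promotes over ssh instead of echoing):

```shell
# decide_action: what the failover script would do, given
# (failed_node_id, failed_host, old_primary_id, new_master_host).
decide_action() {
    failed_node_id="$1"; failed_host="$2"
    old_primary_id="$3"; new_master_host="$4"
    if [ "$failed_node_id" != "$old_primary_id" ]; then
        # Standby failed: nothing to promote, as seen in step 1.
        echo "standby $failed_node_id down; nothing to promote"
    else
        # Real script would create the trigger file on the new master,
        # e.g.: ssh postgres@"$new_master_host" touch .../trigger
        echo "promote $new_master_host"
    fi
}

# Step 1 scenario: node 1 (standby) failed while node 0 is primary.
decide_action 1 172.20.20.230 0 172.20.20.220
```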