[pgpool-general: 4796] Failover happening on already failed node
Jeremiah Penery
jpenery at redskytech.com
Wed Jul 20 04:19:20 JST 2016
The setup is two nodes, each running pgpool-II (3.4.7) and PostgreSQL (9.3) in streaming replication mode. Node 0 is the master/primary. Pgpool is using the watchdog.
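For context, the relevant parts of my pgpool.conf look roughly like this (paraphrased and trimmed, so values are illustrative rather than a verbatim copy):

```
backend_hostname0 = '172.20.20.220'   # node 0, primary (address illustrative)
backend_port0 = 5433
backend_hostname1 = '172.20.20.230'   # node 1, standby
backend_port1 = 5433
master_slave_mode = on
master_slave_sub_mode = 'stream'
use_watchdog = on
health_check_period = 5
health_check_max_retries = 2          # matches the two retries in the log below
failover_command = '/etc/pgpool-II/failover.sh %d %h %p %D %m %M %H %P'
```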
1) Perform a hard shutdown of node 1. Pgpool_0 fails over postgres_1, but nothing really happens because it isn't the primary node.
2) Restart node 1. Start pgpool_1.
3) Pgpool_1 tries to connect to the local postgres instance and eventually decides it's unreachable (because postgres isn't running). Now pgpool_1 performs another failover of postgres_1.
I don't see this happen every time, and I'm not sure why it happens at all. When pgpool_1 starts, it should rejoin the watchdog cluster, learn that postgres_1 is already considered down, and not trigger a failover. According to pgpool_0's logs, failing over postgres_1 again is invalid because it is already in state 3 (down).
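To illustrate what I mean by "state 3": as I understand it (the names and constants here are my own illustration, not pgpool source code), the check pgpool_0 applies when it rejects the request is essentially:

```python
# Illustrative sketch of the validity check implied by pgpool_0's
# "status: [3] is not valid for failover" log line. Constant names
# are my own; 3 is the value the log reports for a down node.
CON_UP = 2    # backend is up and in use
CON_DOWN = 3  # backend already degenerated (down)

def degenerate_request_is_valid(node_status):
    """A degenerate (failover) request only makes sense for a node
    that is not already down."""
    return node_status != CON_DOWN

# pgpool_0 sees postgres_1 in state 3, so it refuses a second failover:
print(degenerate_request_is_valid(CON_DOWN))  # False
print(degenerate_request_is_valid(CON_UP))    # True
```

My question is why pgpool_1, after rejoining the watchdog cluster, doesn't apply the same check and instead fires failover.sh again.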
Below is a portion of the logs from both pgpools after step 2. I've trimmed some of the lines where pgpool_1 repeatedly tries to connect to the unavailable postgres.
Is there something obviously wrong? I can provide any additional information needed.
Thanks,
Jeremiah
---- Pgpool_0 ----
2016-07-18 15:32:39: pid 3148: DETAIL: receive add request from 172.20.20.230:5432 and accept it
2016-07-18 15:32:45: pid 3148: LOG: sending watchdog response
2016-07-18 15:32:45: pid 3148: DETAIL: WD_STAND_FOR_LOCK_HOLDER received it
2016-07-18 15:32:45: pid 3148: LOG: invalid degenerate backend request, node id : 1 status: [3] is not valid for failover
---- Pgpool_1 ----
2016-07-18 15:32:38: pid 1811: LOG: Setting up socket for ::1:5432
2016-07-18 15:32:38: pid 1811: LOG: Setting up socket for 127.0.0.1:5432
2016-07-18 15:32:38: pid 1811: LOG: watchdog started
2016-07-18 15:32:39: pid 1811: LOG: pgpool-II successfully started. version 3.4.7 (tataraboshi)
2016-07-18 15:32:39: pid 1811: LOG: find_primary_node: checking backend no 0
2016-07-18 15:32:39: pid 2020: LOG: failed to connect to PostgreSQL server on "172.20.20.230:5433", getsockopt() detected error "Connection refused"
2016-07-18 15:32:39: pid 2020: ERROR: failed to make persistent db connection
2016-07-18 15:32:39: pid 2020: DETAIL: connection to host:"172.20.20.230:5433" failed
2016-07-18 15:32:39: pid 1811: LOG: find_primary_node: primary node id is 0
<connection refused some more>
2016-07-18 15:32:40: pid 1816: LOG: set SO_REUSEPORT option to the socket
2016-07-18 15:32:40: pid 1816: LOG: creating socket for sending heartbeat
2016-07-18 15:32:40: pid 1816: DETAIL: set SO_REUSEPORT
2016-07-18 15:32:40: pid 1815: LOG: set SO_REUSEPORT option to the socket
2016-07-18 15:32:40: pid 1815: LOG: creating watchdog heartbeat receive socket.
2016-07-18 15:32:40: pid 1815: DETAIL: set SO_REUSEPORT
2016-07-18 15:32:42: pid 1811: LOG: health checking retry count 1
<connection refused>
2016-07-18 15:32:45: pid 1811: LOG: health checking retry count 2
<connection refused>
2016-07-18 15:32:45: pid 1811: LOG: setting backend node 1 status to NODE DOWN
2016-07-18 15:32:45: pid 1811: LOG: watchdog notifying to start interlocking
2016-07-18 15:32:45: pid 1811: LOG: watchdog became a new lock holder
2016-07-18 15:32:49: pid 2020: ERROR: Failed to check replication time lag
2016-07-18 15:32:49: pid 2020: DETAIL: No persistent db connection for the node 1
2016-07-18 15:32:49: pid 2020: HINT: check sr_check_user and sr_check_password
2016-07-18 15:32:49: pid 2020: CONTEXT: while checking replication time lag
<connection refused>
2016-07-18 15:32:56: pid 1811: WARNING: watchdog start interlocking, timed out
2016-07-18 15:32:56: pid 1811: LOG: starting degeneration. shutdown host 172.20.20.230(5433)
2016-07-18 15:32:56: pid 1811: LOG: Restart all children
2016-07-18 15:32:56: pid 1811: LOG: execute command: /etc/pgpool-II/failover.sh ...
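For completeness, failover.sh only does real work when the failed node was the primary. A simplified dry-run sketch of its decision logic (argument names follow the %d/%h/%M/%H placeholders from failover_command; the real script promotes over ssh instead of echoing):

```shell
# decide_action: what the failover script would do, given
# (failed_node_id, failed_host, old_primary_id, new_master_host).
decide_action() {
    failed_node_id="$1"; failed_host="$2"
    old_primary_id="$3"; new_master_host="$4"
    if [ "$failed_node_id" != "$old_primary_id" ]; then
        # Standby failed: nothing to promote, as seen in step 1.
        echo "standby $failed_node_id down; nothing to promote"
    else
        # Real script would create the trigger file on the new master,
        # e.g.: ssh postgres@"$new_master_host" touch .../trigger
        echo "promote $new_master_host"
    fi
}

# Step 1 scenario: node 1 (standby) failed while node 0 is primary.
decide_action 1 172.20.20.230 0 172.20.20.220
```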