[pgpool-hackers: 1328] Re: [pgpool-committers: 2927] pgpool: Fixing watchdog handling of local network failure events

Tue Jan 12 18:12:07 JST 2016

On Tue, Jan 12, 2016 at 11:53 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:

> Usama,
>
> After the commit, I got regression failure with 004.watchdog on my
> Ubuntu laptop. log/004.watchdog and
> tests/004.watchdog/testdir/{master,standby}/log/pgpool.log are
> attached.
>

HI Ishii-San

Thanks for sharing the test case log. I am looking into the possible cause
of the failure. But it is very strange as the standby-pgpool.log suggests
that pgpool-II watchdog test case is working as expected but only the log
message "successfully joined the watchdog cluster as standby node" is
missing from standby log that is used by testing script to detect when
watchdog node joins the cluster as a standby, Although from the behavior
and the log it seems like the code around this log message was successfully
executed but somehow the log file is missing the concerned log message.
Do you get this failure all the time?

Best regards
Muhammad Usama

>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
>
> From: Muhammad Usama <m.usama at gmail.com>
> Subject: [pgpool-committers: 2927] pgpool: Fixing watchdog handling of
> local network failure events
> Date: Mon, 11 Jan 2016 14:28:28 +0000
> Message-ID: <E1aIdSO-0001pH-SL at gothos.postgresql.org>
>
> > Fixing watchdog handling of local network failure events
> >
> > WD_EVENT_NW_IP_IS_REMOVED and WD_EVENT_LOCAL_NODE_LOST must be handled in
> > the main state machine function to make sure all watchdog states should
> act
> > properly on these events.
> >
> > Branch
> > ------
> > master
> >
> > Details
> > -------
> >
> http://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=6ac8844288d21dc598d90abef3324c7cd9f9ac53
> >
> > Modified Files
> > --------------
> > src/watchdog/watchdog.c | 63
> ++++++++++++++++++++++++++++++++++++++-----------
> > 1 file changed, 49 insertions(+), 14 deletions(-)
> >
> > _______________________________________________
> > pgpool-committers mailing list
> > pgpool-committers at pgpool.net
> > http://www.pgpool.net/mailman/listinfo/pgpool-committers
>
> creating master pgpool...Satrting set up in raw mode
> creating startall and shutdownall
> creating failover script
> creating database cluster
> /home/t-ishii/work/pgpool-II/current/pgpool2/src/test/regression/tests/004.watchdog/master/data0...done.
> update postgreql.conf
> creating pgpool_remote_start
> creating basebackup.sh
> start all
> waiting for pgpool-II coming up...done.
> create extensions
> Pager usage is off.
>  node_id | hostname | port  | status | lb_weight |  role  | select_cnt
> ---------+----------+-------+--------+-----------+--------+------------
>  0       | /tmp     | 11002 | 2      | 1.000000  | master | 0
> (1 row)
>
> shutdown all
>
> pgpool-II setting for raw mode is done.
> To start the whole system, use
> /home/t-ishii/work/pgpool-II/current/pgpool2/src/test/regression/tests/004.watchdog/master/startall.
> To shutdown the whole system, use
> /home/t-ishii/work/pgpool-II/current/pgpool2/src/test/regression/tests/004.watchdog/master/shutdownall.
> pcp command user name is "t-ishii", password is "t-ishii".
> Each PostgreSQL, pgpool-II and pcp port is as follows:
> #1 port is 11002
> pgpool port is 11000
> pcp port is 11001
> The info above is in README.port.
> master setup done.
> waiting for server to start....10019 2016-01-12 15:28:35 JST LOG:
> redirecting log output to logging collector process
> 10019 2016-01-12 15:28:35 JST HINT:  Future log output will appear in
> directory "pg_log".
>  done
> server started
> creating standby pgpool...Waiting for the pgpool master...
> Master brought up successfully.
> Waiting for the standby to join cluster...
> [check] 1 times
> [check] 2 times
> [check] 3 times
> [check] 4 times
> [check] 5 times
> [check] 6 times
> [check] 7 times
> [check] 8 times
> [check] 9 times
> [check] 10 times
> 2016-01-12 15:29:11: pid 10190: LOG:  stop request sent to pgpool. waiting
> for termination...
> .done.
> Checking if the Standby pgpool-II detected the master shutdown...
> Master shutdown detected.
> Checking if the Standby pgpool-II takes over the master responsibility...
> [check] 1 times
> [check] 2 times
> [check] 3 times
> [check] 4 times
> [check] 5 times
> Standby successfully became the new master.
> 2016-01-12 15:29:22: pid 10214: LOG:  stop request sent to pgpool. waiting
> for termination...
> .done.
> 2016-01-12 15:29:23: pid 10218: FATAL:  could not read pid file
> waiting for server to shut down.... done
> server stopped
> 3 out of 4 successfull
>
> 2016-01-12 15:28:36: pid 10028: LOG:  Backend status file
> /home/t-ishii/work/pgpool-II/current/pgpool2/src/test/regression/tests/004.watchdog/master/log/pgpool_status
> discarded
> 2016-01-12 15:28:36: pid 10028: LOG:  waiting for watchdog to initialize
> 2016-01-12 15:28:36: pid 10032: LOG:  setting the local watchdog node name
> to "Linux_tishii-CF-SX3HE4BP_11000"
> 2016-01-12 15:28:36: pid 10032: LOG:  watchdog cluster configured with 1
> remote nodes
> 2016-01-12 15:28:36: pid 10032: LOG:  watchdog remote node:0 on
> localhost:21104
> 2016-01-12 15:28:36: pid 10032: LOG:  IPC socket path:
> "/tmp/.s.PGPOOLWD_CMD.21004"
> 2016-01-12 15:28:40: pid 10032: LOG:  watchdog node state changed from
> [LOADING] to [JOINING]
> 2016-01-12 15:28:45: pid 10032: LOG:  watchdog node state changed from
> [JOINING] to [INITIALIZING]
> 2016-01-12 15:28:46: pid 10032: LOG:  I am the only alive node in the
> watchdog cluster
> 2016-01-12 15:28:46: pid 10032: HINT:  skiping stand for coordinator state
> 2016-01-12 15:28:46: pid 10032: LOG:  watchdog node state changed from
> [INITIALIZING] to [MASTER]
> 2016-01-12 15:28:46: pid 10032: LOG:  I am announcing my self as
> master/coordinator watchdog node
> 2016-01-12 15:28:51: pid 10032: LOG:  I am the cluster leader node
> 2016-01-12 15:28:51: pid 10032: DETAIL:  our declare coordinator message
> is accepted by all nodes
> 2016-01-12 15:28:51: pid 10032: LOG:  I am the cluster leader node.
> Starting escalation process
> 2016-01-12 15:28:51: pid 10028: LOG:  watchdog process is initialized
> 2016-01-12 15:28:51: pid 10032: LOG:  escalation process started with
> PID:10076
> 2016-01-12 15:28:51: pid 10032: LOG:  new IPC connection is received
> 2016-01-12 15:28:51: pid 10028: LOG:  Setting up socket for
> 127.0.0.1:11000
> 2016-01-12 15:28:51: pid 10077: LOG:  2 watchdog nodes are configured for
> lifecheck
> 2016-01-12 15:28:51: pid 10077: LOG:  watchdog nodes ID:0
> Name:"Linux_tishii-CF-SX3HE4BP_11000"
> 2016-01-12 15:28:51: pid 10077: DETAIL:  Host:"localhost" WD Port:21004
> pgpool-II port:11000
> 2016-01-12 15:28:51: pid 10077: LOG:  watchdog nodes ID:1 Name:""
> 2016-01-12 15:28:51: pid 10077: DETAIL:  Host:"localhost" WD Port:21104
> pgpool-II port:11100
> 2016-01-12 15:28:51: pid 10076: LOG:  watchdog: escalation started
> 2016-01-12 15:28:51: pid 10032: LOG:  watchdog escalation process with
> pid: 10076 exit with SUCCESS.
> 2016-01-12 15:28:51: pid 10032: LOG:  watchdog escalation process with
> pid: 10076 exits with status 0
> 2016-01-12 15:28:51: pid 10028: LOG:  pgpool-II successfully started.
> version 3.5beta1 (ekieboshi)
> 2016-01-12 15:28:51: pid 10109: LOG:  DB node id: 0 backend pid: 10116
> statement:  DISCARD ALL
> 2016-01-12 15:28:51: pid 10032: LOG:  new watchdog node connection is
> received from "127.0.0.1:17819"
> 2016-01-12 15:28:52: pid 10081: LOG:  set SO_REUSEPORT option to the socket
> 2016-01-12 15:28:52: pid 10081: LOG:  creating socket for sending heartbeat
> 2016-01-12 15:28:52: pid 10081: DETAIL:  set SO_REUSEPORT
> 2016-01-12 15:28:52: pid 10079: LOG:  set SO_REUSEPORT option to the socket
> 2016-01-12 15:28:52: pid 10079: LOG:  creating watchdog heartbeat receive
> socket.
> 2016-01-12 15:28:52: pid 10079: DETAIL:  set SO_REUSEPORT
> 2016-01-12 15:28:58: pid 10032: LOG:  new outbond connection to
> localhost:21104
> 2016-01-12 15:29:01: pid 10077: LOG:  watchdog: lifecheck started
> 2016-01-12 15:29:11: pid 10028: LOG:  received fast shutdown request
> 2016-01-12 15:29:11: pid 10028: LOG:  shutdown request. closing listen
> socket
> 2016-01-12 15:29:11: pid 10078: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10085: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10109: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10080: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10082: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10083: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10084: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10086: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10087: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10088: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10089: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10090: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10091: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10092: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10093: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10094: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10095: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10111: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10102: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10103: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10101: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10104: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10100: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10032: LOG:  Watchdog is shutting down
> 2016-01-12 15:29:11: pid 10105: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10099: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10106: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10191: LOG:  watchdog: de-escalation started
> 2016-01-12 15:29:11: pid 10098: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10110: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10097: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10107: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10108: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:11: pid 10096: LOG:  child process received shutdown
> request signal 2
>
> 2016-01-12 15:28:51: pid 10119: LOG:  Backend status file
> /home/t-ishii/work/pgpool-II/current/pgpool2/src/test/regression/tests/004.watchdog/master/log/pgpool_status
> discarded
> 2016-01-12 15:28:51: pid 10119: LOG:  waiting for watchdog to initialize
> 2016-01-12 15:28:51: pid 10123: LOG:  setting the local watchdog node name
> to "Linux_tishii-CF-SX3HE4BP_11100"
> 2016-01-12 15:28:51: pid 10123: LOG:  watchdog cluster configured with 1
> remote nodes
> 2016-01-12 15:28:51: pid 10123: LOG:  watchdog remote node:0 on
> localhost:21004
> 2016-01-12 15:28:51: pid 10123: LOG:  IPC socket path:
> "/tmp/.s.PGPOOLWD_CMD.21104"
> 2016-01-12 15:28:51: pid 10123: LOG:  new outbond connection to
> localhost:21004
> 2016-01-12 15:28:51: pid 10123: LOG:  watchdog node state changed from
> [LOADING] to [INITIALIZING]
> 2016-01-12 15:28:52: pid 10123: LOG:  watchdog node state changed from
> [INITIALIZING] to [STANDBY]
> 2016-01-12 15:28:52: pid 10119: LOG:  watchdog process is initialized
> 2016-01-12 15:28:52: pid 10123: LOG:  new IPC connection is received
> 2016-01-12 15:28:52: pid 10119: LOG:  Setting up socket for
> 127.0.0.1:11100
> 2016-01-12 15:28:52: pid 10129: LOG:  2 watchdog nodes are configured for
> lifecheck
> 2016-01-12 15:28:52: pid 10129: LOG:  watchdog nodes ID:0
> Name:"Linux_tishii-CF-SX3HE4BP_11100"
> 2016-01-12 15:28:52: pid 10129: DETAIL:  Host:"localhost" WD Port:21104
> pgpool-II port:11100
> 2016-01-12 15:28:52: pid 10129: LOG:  watchdog nodes ID:1
> Name:"Linux_tishii-CF-SX3HE4BP_11000"
> 2016-01-12 15:28:52: pid 10129: DETAIL:  Host:"localhost" WD Port:21004
> pgpool-II port:11000
> 2016-01-12 15:28:52: pid 10119: LOG:  pgpool-II successfully started.
> version 3.5beta1 (ekieboshi)
> 2016-01-12 15:28:53: pid 10134: LOG:  set SO_REUSEPORT option to the socket
> 2016-01-12 15:28:53: pid 10134: LOG:  creating socket for sending heartbeat
> 2016-01-12 15:28:53: pid 10134: DETAIL:  set SO_REUSEPORT
> 2016-01-12 15:28:53: pid 10132: LOG:  set SO_REUSEPORT option to the socket
> 2016-01-12 15:28:53: pid 10132: LOG:  creating watchdog heartbeat receive
> socket.
> 2016-01-12 15:28:53: pid 10132: DETAIL:  set SO_REUSEPORT
> 2016-01-12 15:28:58: pid 10123: LOG:  new watchdog node connection is
> received from "127.0.0.1:28554"
> 2016-01-12 15:29:02: pid 10129: LOG:  watchdog: lifecheck started
> 2016-01-12 15:29:11: pid 10123: LOG:  remote node
> "Linux_tishii-CF-SX3HE4BP_11000" is shutting down
> 2016-01-12 15:29:11: pid 10123: LOG:  watchdog cluster has lost the
> coordinator node
> 2016-01-12 15:29:11: pid 10123: LOG:  watchdog node state changed from
> [STANDBY] to [JOINING]
> 2016-01-12 15:29:11: pid 10123: LOG:  read from socket failed with error
> :"Connection reset by peer"
> 2016-01-12 15:29:11: pid 10123: LOG:  read from socket failed, remote end
> closed the connection
> 2016-01-12 15:29:16: pid 10123: LOG:  watchdog node state changed from
> [JOINING] to [INITIALIZING]
> 2016-01-12 15:29:17: pid 10123: LOG:  I am the only alive node in the
> watchdog cluster
> 2016-01-12 15:29:17: pid 10123: HINT:  skiping stand for coordinator state
> 2016-01-12 15:29:17: pid 10123: LOG:  watchdog node state changed from
> [INITIALIZING] to [MASTER]
> 2016-01-12 15:29:17: pid 10123: LOG:  I am announcing my self as
> master/coordinator watchdog node
> 2016-01-12 15:29:22: pid 10123: LOG:  I am the cluster leader node
> 2016-01-12 15:29:22: pid 10123: DETAIL:  our declare coordinator message
> is accepted by all nodes
> 2016-01-12 15:29:22: pid 10123: LOG:  I am the cluster leader node.
> Starting escalation process
> 2016-01-12 15:29:22: pid 10123: LOG:  escalation process started with
> PID:10212
> 2016-01-12 15:29:22: pid 10212: LOG:  watchdog: escalation started
> 2016-01-12 15:29:22: pid 10123: LOG:  watchdog escalation process with
> pid: 10212 exit with SUCCESS.
> 2016-01-12 15:29:22: pid 10123: LOG:  watchdog escalation process with
> pid: 10212 exits with status 0
> 2016-01-12 15:29:22: pid 10119: LOG:  received fast shutdown request
> 2016-01-12 15:29:22: pid 10119: LOG:  shutdown request. closing listen
> socket
> 2016-01-12 15:29:22: pid 10130: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10141: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10144: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10133: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10142: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10136: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10135: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10137: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10147: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10143: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10138: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10149: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10150: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10145: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10139: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10153: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10151: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10140: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10162: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10160: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10163: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10148: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10161: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10152: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10159: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10154: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10158: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10155: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10123: LOG:  Watchdog is shutting down
> 2016-01-12 15:29:22: pid 10157: LOG:  child process received shutdown
> request signal 2
> 2016-01-12 15:29:22: pid 10215: LOG:  watchdog: de-escalation started
> 2016-01-12 15:29:22: pid 10156: LOG:  child process received shutdown
> request signal 2
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.sraoss.jp/pipermail/pgpool-hackers/attachments/20160112/d28c346a/attachment-0001.html>