<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jan 12, 2016 at 11:53 AM, Tatsuo Ishii <span dir="ltr"><<a href="mailto:ishii@postgresql.org" target="_blank">ishii@postgresql.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Usama,<br>
<br>
After the commit, I got regression failure with 004.watchdog on my<br>
Ubuntu laptop. log/004.watchdog and<br>
tests/004.watchdog/testdir/{master,standby}/log/pgpool.log are<br>
attached.<br></blockquote><div><br></div>HI Ishii-San<div><br></div><div>Thanks for sharing the test case log. I am looking into the possible cause of the failure. But it is very strange as the standby-pgpool.log suggests that pgpool-II watchdog test case is working as expected but only the log message "<font face="monospace, monospace">successfully joined the watchdog cluster as standby node</font>" is missing from standby log that is used by testing script to detect when watchdog node joins the cluster as a standby, Although from the behavior and the log it seems like the code around this log message was successfully executed but somehow the log file is missing the concerned log message.</div><div>Do you get this failure all the time?</div><div><br></div><div><br></div><div>Best regards</div><div>Muhammad Usama</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<br>
Best regards,<br>
--<br>
Tatsuo Ishii<br>
SRA OSS, Inc. Japan<br>
English: <a href="http://www.sraoss.co.jp/index_en.php" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_en.php</a><br>
Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.jp</a><br>
<br>
From: Muhammad Usama <<a href="mailto:m.usama@gmail.com">m.usama@gmail.com</a>><br>
Subject: [pgpool-committers: 2927] pgpool: Fixing watchdog handling of local network failure events<br>
Date: Mon, 11 Jan 2016 14:28:28 +0000<br>
Message-ID: <<a href="mailto:E1aIdSO-0001pH-SL@gothos.postgresql.org">E1aIdSO-0001pH-SL@gothos.postgresql.org</a>><br>
<div class=""><div class="h5"><br>
> Fixing watchdog handling of local network failure events<br>
><br>
> WD_EVENT_NW_IP_IS_REMOVED and WD_EVENT_LOCAL_NODE_LOST must be handled in<br>
> the main state machine function to make sure all watchdog states should act<br>
> properly on these events.<br>
><br>
> Branch<br>
> ------<br>
> master<br>
><br>
> Details<br>
> -------<br>
> <a href="http://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=6ac8844288d21dc598d90abef3324c7cd9f9ac53" rel="noreferrer" target="_blank">http://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=6ac8844288d21dc598d90abef3324c7cd9f9ac53</a><br>
><br>
> Modified Files<br>
> --------------<br>
> src/watchdog/watchdog.c | 63 ++++++++++++++++++++++++++++++++++++++-----------<br>
> 1 file changed, 49 insertions(+), 14 deletions(-)<br>
><br>
> _______________________________________________<br>
> pgpool-committers mailing list<br>
> <a href="mailto:pgpool-committers@pgpool.net">pgpool-committers@pgpool.net</a><br>
> <a href="http://www.pgpool.net/mailman/listinfo/pgpool-committers" rel="noreferrer" target="_blank">http://www.pgpool.net/mailman/listinfo/pgpool-committers</a><br>
</div></div><br>creating master pgpool...Satrting set up in raw mode<br>
creating startall and shutdownall<br>
creating failover script<br>
creating database cluster /home/t-ishii/work/pgpool-II/current/pgpool2/src/test/regression/tests/004.watchdog/master/data0...done.<br>
update postgreql.conf<br>
creating pgpool_remote_start<br>
creating basebackup.sh<br>
start all<br>
waiting for pgpool-II coming up...done.<br>
create extensions<br>
Pager usage is off.<br>
node_id | hostname | port | status | lb_weight | role | select_cnt<br>
---------+----------+-------+--------+-----------+--------+------------<br>
0 | /tmp | 11002 | 2 | 1.000000 | master | 0<br>
(1 row)<br>
<br>
shutdown all<br>
<br>
pgpool-II setting for raw mode is done.<br>
To start the whole system, use /home/t-ishii/work/pgpool-II/current/pgpool2/src/test/regression/tests/004.watchdog/master/startall.<br>
To shutdown the whole system, use /home/t-ishii/work/pgpool-II/current/pgpool2/src/test/regression/tests/004.watchdog/master/shutdownall.<br>
pcp command user name is "t-ishii", password is "t-ishii".<br>
Each PostgreSQL, pgpool-II and pcp port is as follows:<br>
#1 port is 11002<br>
pgpool port is 11000<br>
pcp port is 11001<br>
The info above is in README.port.<br>
master setup done.<br>
waiting for server to start....10019 2016-01-12 15:28:35 JST LOG: redirecting log output to logging collector process<br>
10019 2016-01-12 15:28:35 JST HINT: Future log output will appear in directory "pg_log".<br>
done<br>
server started<br>
creating standby pgpool...Waiting for the pgpool master...<br>
Master brought up successfully.<br>
Waiting for the standby to join cluster...<br>
[check] 1 times<br>
[check] 2 times<br>
[check] 3 times<br>
[check] 4 times<br>
[check] 5 times<br>
[check] 6 times<br>
[check] 7 times<br>
[check] 8 times<br>
[check] 9 times<br>
[check] 10 times<br>
2016-01-12 15:29:11: pid 10190: LOG: stop request sent to pgpool. waiting for termination...<br>
.done.<br>
Checking if the Standby pgpool-II detected the master shutdown...<br>
Master shutdown detected.<br>
Checking if the Standby pgpool-II takes over the master responsibility...<br>
[check] 1 times<br>
[check] 2 times<br>
[check] 3 times<br>
[check] 4 times<br>
[check] 5 times<br>
Standby successfully became the new master.<br>
2016-01-12 15:29:22: pid 10214: LOG: stop request sent to pgpool. waiting for termination...<br>
.done.<br>
2016-01-12 15:29:23: pid 10218: FATAL: could not read pid file<br>
waiting for server to shut down.... done<br>
server stopped<br>
3 out of 4 successfull<br>
<br>2016-01-12 15:28:36: pid 10028: LOG: Backend status file /home/t-ishii/work/pgpool-II/current/pgpool2/src/test/regression/tests/004.watchdog/master/log/pgpool_status discarded<br>
2016-01-12 15:28:36: pid 10028: LOG: waiting for watchdog to initialize<br>
2016-01-12 15:28:36: pid 10032: LOG: setting the local watchdog node name to "Linux_tishii-CF-SX3HE4BP_11000"<br>
2016-01-12 15:28:36: pid 10032: LOG: watchdog cluster configured with 1 remote nodes<br>
2016-01-12 15:28:36: pid 10032: LOG: watchdog remote node:0 on localhost:21104<br>
2016-01-12 15:28:36: pid 10032: LOG: IPC socket path: "/tmp/.s.PGPOOLWD_CMD.21004"<br>
2016-01-12 15:28:40: pid 10032: LOG: watchdog node state changed from [LOADING] to [JOINING]<br>
2016-01-12 15:28:45: pid 10032: LOG: watchdog node state changed from [JOINING] to [INITIALIZING]<br>
2016-01-12 15:28:46: pid 10032: LOG: I am the only alive node in the watchdog cluster<br>
2016-01-12 15:28:46: pid 10032: HINT: skiping stand for coordinator state<br>
2016-01-12 15:28:46: pid 10032: LOG: watchdog node state changed from [INITIALIZING] to [MASTER]<br>
2016-01-12 15:28:46: pid 10032: LOG: I am announcing my self as master/coordinator watchdog node<br>
2016-01-12 15:28:51: pid 10032: LOG: I am the cluster leader node<br>
2016-01-12 15:28:51: pid 10032: DETAIL: our declare coordinator message is accepted by all nodes<br>
2016-01-12 15:28:51: pid 10032: LOG: I am the cluster leader node. Starting escalation process<br>
2016-01-12 15:28:51: pid 10028: LOG: watchdog process is initialized<br>
2016-01-12 15:28:51: pid 10032: LOG: escalation process started with PID:10076<br>
2016-01-12 15:28:51: pid 10032: LOG: new IPC connection is received<br>
2016-01-12 15:28:51: pid 10028: LOG: Setting up socket for <a href="http://127.0.0.1:11000" rel="noreferrer" target="_blank">127.0.0.1:11000</a><br>
2016-01-12 15:28:51: pid 10077: LOG: 2 watchdog nodes are configured for lifecheck<br>
2016-01-12 15:28:51: pid 10077: LOG: watchdog nodes ID:0 Name:"Linux_tishii-CF-SX3HE4BP_11000"<br>
2016-01-12 15:28:51: pid 10077: DETAIL: Host:"localhost" WD Port:21004 pgpool-II port:11000<br>
2016-01-12 15:28:51: pid 10077: LOG: watchdog nodes ID:1 Name:""<br>
2016-01-12 15:28:51: pid 10077: DETAIL: Host:"localhost" WD Port:21104 pgpool-II port:11100<br>
2016-01-12 15:28:51: pid 10076: LOG: watchdog: escalation started<br>
2016-01-12 15:28:51: pid 10032: LOG: watchdog escalation process with pid: 10076 exit with SUCCESS.<br>
2016-01-12 15:28:51: pid 10032: LOG: watchdog escalation process with pid: 10076 exits with status 0<br>
2016-01-12 15:28:51: pid 10028: LOG: pgpool-II successfully started. version 3.5beta1 (ekieboshi)<br>
2016-01-12 15:28:51: pid 10109: LOG: DB node id: 0 backend pid: 10116 statement: DISCARD ALL<br>
2016-01-12 15:28:51: pid 10032: LOG: new watchdog node connection is received from "<a href="http://127.0.0.1:17819" rel="noreferrer" target="_blank">127.0.0.1:17819</a>"<br>
2016-01-12 15:28:52: pid 10081: LOG: set SO_REUSEPORT option to the socket<br>
2016-01-12 15:28:52: pid 10081: LOG: creating socket for sending heartbeat<br>
2016-01-12 15:28:52: pid 10081: DETAIL: set SO_REUSEPORT<br>
2016-01-12 15:28:52: pid 10079: LOG: set SO_REUSEPORT option to the socket<br>
2016-01-12 15:28:52: pid 10079: LOG: creating watchdog heartbeat receive socket.<br>
2016-01-12 15:28:52: pid 10079: DETAIL: set SO_REUSEPORT<br>
2016-01-12 15:28:58: pid 10032: LOG: new outbond connection to localhost:21104<br>
2016-01-12 15:29:01: pid 10077: LOG: watchdog: lifecheck started<br>
2016-01-12 15:29:11: pid 10028: LOG: received fast shutdown request<br>
2016-01-12 15:29:11: pid 10028: LOG: shutdown request. closing listen socket<br>
2016-01-12 15:29:11: pid 10078: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10085: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10109: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10080: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10082: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10083: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10084: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10086: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10087: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10088: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10089: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10090: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10091: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10092: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10093: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10094: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10095: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10111: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10102: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10103: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10101: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10104: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10100: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10032: LOG: Watchdog is shutting down<br>
2016-01-12 15:29:11: pid 10105: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10099: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10106: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10191: LOG: watchdog: de-escalation started<br>
2016-01-12 15:29:11: pid 10098: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10110: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10097: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10107: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10108: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:11: pid 10096: LOG: child process received shutdown request signal 2<br>
<br>2016-01-12 15:28:51: pid 10119: LOG: Backend status file /home/t-ishii/work/pgpool-II/current/pgpool2/src/test/regression/tests/004.watchdog/master/log/pgpool_status discarded<br>
2016-01-12 15:28:51: pid 10119: LOG: waiting for watchdog to initialize<br>
2016-01-12 15:28:51: pid 10123: LOG: setting the local watchdog node name to "Linux_tishii-CF-SX3HE4BP_11100"<br>
2016-01-12 15:28:51: pid 10123: LOG: watchdog cluster configured with 1 remote nodes<br>
2016-01-12 15:28:51: pid 10123: LOG: watchdog remote node:0 on localhost:21004<br>
2016-01-12 15:28:51: pid 10123: LOG: IPC socket path: "/tmp/.s.PGPOOLWD_CMD.21104"<br>
2016-01-12 15:28:51: pid 10123: LOG: new outbond connection to localhost:21004<br>
2016-01-12 15:28:51: pid 10123: LOG: watchdog node state changed from [LOADING] to [INITIALIZING]<br>
2016-01-12 15:28:52: pid 10123: LOG: watchdog node state changed from [INITIALIZING] to [STANDBY]<br>
2016-01-12 15:28:52: pid 10119: LOG: watchdog process is initialized<br>
2016-01-12 15:28:52: pid 10123: LOG: new IPC connection is received<br>
2016-01-12 15:28:52: pid 10119: LOG: Setting up socket for <a href="http://127.0.0.1:11100" rel="noreferrer" target="_blank">127.0.0.1:11100</a><br>
2016-01-12 15:28:52: pid 10129: LOG: 2 watchdog nodes are configured for lifecheck<br>
2016-01-12 15:28:52: pid 10129: LOG: watchdog nodes ID:0 Name:"Linux_tishii-CF-SX3HE4BP_11100"<br>
2016-01-12 15:28:52: pid 10129: DETAIL: Host:"localhost" WD Port:21104 pgpool-II port:11100<br>
2016-01-12 15:28:52: pid 10129: LOG: watchdog nodes ID:1 Name:"Linux_tishii-CF-SX3HE4BP_11000"<br>
2016-01-12 15:28:52: pid 10129: DETAIL: Host:"localhost" WD Port:21004 pgpool-II port:11000<br>
2016-01-12 15:28:52: pid 10119: LOG: pgpool-II successfully started. version 3.5beta1 (ekieboshi)<br>
2016-01-12 15:28:53: pid 10134: LOG: set SO_REUSEPORT option to the socket<br>
2016-01-12 15:28:53: pid 10134: LOG: creating socket for sending heartbeat<br>
2016-01-12 15:28:53: pid 10134: DETAIL: set SO_REUSEPORT<br>
2016-01-12 15:28:53: pid 10132: LOG: set SO_REUSEPORT option to the socket<br>
2016-01-12 15:28:53: pid 10132: LOG: creating watchdog heartbeat receive socket.<br>
2016-01-12 15:28:53: pid 10132: DETAIL: set SO_REUSEPORT<br>
2016-01-12 15:28:58: pid 10123: LOG: new watchdog node connection is received from "<a href="http://127.0.0.1:28554" rel="noreferrer" target="_blank">127.0.0.1:28554</a>"<br>
2016-01-12 15:29:02: pid 10129: LOG: watchdog: lifecheck started<br>
2016-01-12 15:29:11: pid 10123: LOG: remote node "Linux_tishii-CF-SX3HE4BP_11000" is shutting down<br>
2016-01-12 15:29:11: pid 10123: LOG: watchdog cluster has lost the coordinator node<br>
2016-01-12 15:29:11: pid 10123: LOG: watchdog node state changed from [STANDBY] to [JOINING]<br>
2016-01-12 15:29:11: pid 10123: LOG: read from socket failed with error :"Connection reset by peer"<br>
2016-01-12 15:29:11: pid 10123: LOG: read from socket failed, remote end closed the connection<br>
2016-01-12 15:29:16: pid 10123: LOG: watchdog node state changed from [JOINING] to [INITIALIZING]<br>
2016-01-12 15:29:17: pid 10123: LOG: I am the only alive node in the watchdog cluster<br>
2016-01-12 15:29:17: pid 10123: HINT: skiping stand for coordinator state<br>
2016-01-12 15:29:17: pid 10123: LOG: watchdog node state changed from [INITIALIZING] to [MASTER]<br>
2016-01-12 15:29:17: pid 10123: LOG: I am announcing my self as master/coordinator watchdog node<br>
2016-01-12 15:29:22: pid 10123: LOG: I am the cluster leader node<br>
2016-01-12 15:29:22: pid 10123: DETAIL: our declare coordinator message is accepted by all nodes<br>
2016-01-12 15:29:22: pid 10123: LOG: I am the cluster leader node. Starting escalation process<br>
2016-01-12 15:29:22: pid 10123: LOG: escalation process started with PID:10212<br>
2016-01-12 15:29:22: pid 10212: LOG: watchdog: escalation started<br>
2016-01-12 15:29:22: pid 10123: LOG: watchdog escalation process with pid: 10212 exit with SUCCESS.<br>
2016-01-12 15:29:22: pid 10123: LOG: watchdog escalation process with pid: 10212 exits with status 0<br>
2016-01-12 15:29:22: pid 10119: LOG: received fast shutdown request<br>
2016-01-12 15:29:22: pid 10119: LOG: shutdown request. closing listen socket<br>
2016-01-12 15:29:22: pid 10130: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10141: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10144: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10133: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10142: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10136: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10135: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10137: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10147: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10143: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10138: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10149: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10150: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10145: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10139: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10153: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10151: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10140: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10162: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10160: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10163: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10148: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10161: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10152: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10159: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10154: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10158: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10155: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10123: LOG: Watchdog is shutting down<br>
2016-01-12 15:29:22: pid 10157: LOG: child process received shutdown request signal 2<br>
2016-01-12 15:29:22: pid 10215: LOG: watchdog: de-escalation started<br>
2016-01-12 15:29:22: pid 10156: LOG: child process received shutdown request signal 2<br>
<br></blockquote></div><br></div></div>