[pgpool-hackers: 1330] Re: [pgpool-committers: 2927] pgpool: Fixing watchdog handling of local network failure events

Tatsuo Ishii ishii at postgresql.org
Wed Jan 13 10:53:56 JST 2016


> On Tue, Jan 12, 2016 at 11:53 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> 
>> Usama,
>>
>> After the commit, I got regression failure with 004.watchdog on my
>> Ubuntu laptop. log/004.watchdog and
>> tests/004.watchdog/testdir/{master,standby}/log/pgpool.log are
>> attached.
>>
> 
> HI Ishii-San
> 
> Thanks for sharing the test case log. I am looking into the possible cause
> of the failure. But it is very strange as the standby-pgpool.log suggests
> that pgpool-II watchdog test case is working as expected but only the log
> message "successfully joined the watchdog cluster as standby node" is
> missing from standby log that is used by testing script to detect when
> watchdog node joins the cluster as a standby, Although from the behavior
> and the log it seems like the code around this log message was successfully
> executed but somehow the log file is missing the concerned log message.
> Do you get this failure all the time?

I have tested for several times the regression test and I failed to
reproduce the failure. Maybe timing dependent problem? Anyway let's
see if it happens again...

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> Best regards
> Muhammad Usama
> 
> 
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>>
>> From: Muhammad Usama <m.usama at gmail.com>
>> Subject: [pgpool-committers: 2927] pgpool: Fixing watchdog handling of
>> local network failure events
>> Date: Mon, 11 Jan 2016 14:28:28 +0000
>> Message-ID: <E1aIdSO-0001pH-SL at gothos.postgresql.org>
>>
>> > Fixing watchdog handling of local network failure events
>> >
>> > WD_EVENT_NW_IP_IS_REMOVED and WD_EVENT_LOCAL_NODE_LOST must be handled in
>> > the main state machine function to make sure all watchdog states should
>> act
>> > properly on these events.
>> >
>> > Branch
>> > ------
>> > master
>> >
>> > Details
>> > -------
>> >
>> http://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=6ac8844288d21dc598d90abef3324c7cd9f9ac53
>> >
>> > Modified Files
>> > --------------
>> > src/watchdog/watchdog.c | 63
>> ++++++++++++++++++++++++++++++++++++++-----------
>> > 1 file changed, 49 insertions(+), 14 deletions(-)
>> >
>> > _______________________________________________
>> > pgpool-committers mailing list
>> > pgpool-committers at pgpool.net
>> > http://www.pgpool.net/mailman/listinfo/pgpool-committers
>>
>> creating master pgpool...Satrting set up in raw mode
>> creating startall and shutdownall
>> creating failover script
>> creating database cluster
>> /home/t-ishii/work/pgpool-II/current/pgpool2/src/test/regression/tests/004.watchdog/master/data0...done.
>> update postgreql.conf
>> creating pgpool_remote_start
>> creating basebackup.sh
>> start all
>> waiting for pgpool-II coming up...done.
>> create extensions
>> Pager usage is off.
>>  node_id | hostname | port  | status | lb_weight |  role  | select_cnt
>> ---------+----------+-------+--------+-----------+--------+------------
>>  0       | /tmp     | 11002 | 2      | 1.000000  | master | 0
>> (1 row)
>>
>> shutdown all
>>
>> pgpool-II setting for raw mode is done.
>> To start the whole system, use
>> /home/t-ishii/work/pgpool-II/current/pgpool2/src/test/regression/tests/004.watchdog/master/startall.
>> To shutdown the whole system, use
>> /home/t-ishii/work/pgpool-II/current/pgpool2/src/test/regression/tests/004.watchdog/master/shutdownall.
>> pcp command user name is "t-ishii", password is "t-ishii".
>> Each PostgreSQL, pgpool-II and pcp port is as follows:
>> #1 port is 11002
>> pgpool port is 11000
>> pcp port is 11001
>> The info above is in README.port.
>> master setup done.
>> waiting for server to start....10019 2016-01-12 15:28:35 JST LOG:
>> redirecting log output to logging collector process
>> 10019 2016-01-12 15:28:35 JST HINT:  Future log output will appear in
>> directory "pg_log".
>>  done
>> server started
>> creating standby pgpool...Waiting for the pgpool master...
>> Master brought up successfully.
>> Waiting for the standby to join cluster...
>> [check] 1 times
>> [check] 2 times
>> [check] 3 times
>> [check] 4 times
>> [check] 5 times
>> [check] 6 times
>> [check] 7 times
>> [check] 8 times
>> [check] 9 times
>> [check] 10 times
>> 2016-01-12 15:29:11: pid 10190: LOG:  stop request sent to pgpool. waiting
>> for termination...
>> .done.
>> Checking if the Standby pgpool-II detected the master shutdown...
>> Master shutdown detected.
>> Checking if the Standby pgpool-II takes over the master responsibility...
>> [check] 1 times
>> [check] 2 times
>> [check] 3 times
>> [check] 4 times
>> [check] 5 times
>> Standby successfully became the new master.
>> 2016-01-12 15:29:22: pid 10214: LOG:  stop request sent to pgpool. waiting
>> for termination...
>> .done.
>> 2016-01-12 15:29:23: pid 10218: FATAL:  could not read pid file
>> waiting for server to shut down.... done
>> server stopped
>> 3 out of 4 successfull
>>
>> 2016-01-12 15:28:36: pid 10028: LOG:  Backend status file
>> /home/t-ishii/work/pgpool-II/current/pgpool2/src/test/regression/tests/004.watchdog/master/log/pgpool_status
>> discarded
>> 2016-01-12 15:28:36: pid 10028: LOG:  waiting for watchdog to initialize
>> 2016-01-12 15:28:36: pid 10032: LOG:  setting the local watchdog node name
>> to "Linux_tishii-CF-SX3HE4BP_11000"
>> 2016-01-12 15:28:36: pid 10032: LOG:  watchdog cluster configured with 1
>> remote nodes
>> 2016-01-12 15:28:36: pid 10032: LOG:  watchdog remote node:0 on
>> localhost:21104
>> 2016-01-12 15:28:36: pid 10032: LOG:  IPC socket path:
>> "/tmp/.s.PGPOOLWD_CMD.21004"
>> 2016-01-12 15:28:40: pid 10032: LOG:  watchdog node state changed from
>> [LOADING] to [JOINING]
>> 2016-01-12 15:28:45: pid 10032: LOG:  watchdog node state changed from
>> [JOINING] to [INITIALIZING]
>> 2016-01-12 15:28:46: pid 10032: LOG:  I am the only alive node in the
>> watchdog cluster
>> 2016-01-12 15:28:46: pid 10032: HINT:  skiping stand for coordinator state
>> 2016-01-12 15:28:46: pid 10032: LOG:  watchdog node state changed from
>> [INITIALIZING] to [MASTER]
>> 2016-01-12 15:28:46: pid 10032: LOG:  I am announcing my self as
>> master/coordinator watchdog node
>> 2016-01-12 15:28:51: pid 10032: LOG:  I am the cluster leader node
>> 2016-01-12 15:28:51: pid 10032: DETAIL:  our declare coordinator message
>> is accepted by all nodes
>> 2016-01-12 15:28:51: pid 10032: LOG:  I am the cluster leader node.
>> Starting escalation process
>> 2016-01-12 15:28:51: pid 10028: LOG:  watchdog process is initialized
>> 2016-01-12 15:28:51: pid 10032: LOG:  escalation process started with
>> PID:10076
>> 2016-01-12 15:28:51: pid 10032: LOG:  new IPC connection is received
>> 2016-01-12 15:28:51: pid 10028: LOG:  Setting up socket for
>> 127.0.0.1:11000
>> 2016-01-12 15:28:51: pid 10077: LOG:  2 watchdog nodes are configured for
>> lifecheck
>> 2016-01-12 15:28:51: pid 10077: LOG:  watchdog nodes ID:0
>> Name:"Linux_tishii-CF-SX3HE4BP_11000"
>> 2016-01-12 15:28:51: pid 10077: DETAIL:  Host:"localhost" WD Port:21004
>> pgpool-II port:11000
>> 2016-01-12 15:28:51: pid 10077: LOG:  watchdog nodes ID:1 Name:""
>> 2016-01-12 15:28:51: pid 10077: DETAIL:  Host:"localhost" WD Port:21104
>> pgpool-II port:11100
>> 2016-01-12 15:28:51: pid 10076: LOG:  watchdog: escalation started
>> 2016-01-12 15:28:51: pid 10032: LOG:  watchdog escalation process with
>> pid: 10076 exit with SUCCESS.
>> 2016-01-12 15:28:51: pid 10032: LOG:  watchdog escalation process with
>> pid: 10076 exits with status 0
>> 2016-01-12 15:28:51: pid 10028: LOG:  pgpool-II successfully started.
>> version 3.5beta1 (ekieboshi)
>> 2016-01-12 15:28:51: pid 10109: LOG:  DB node id: 0 backend pid: 10116
>> statement:  DISCARD ALL
>> 2016-01-12 15:28:51: pid 10032: LOG:  new watchdog node connection is
>> received from "127.0.0.1:17819"
>> 2016-01-12 15:28:52: pid 10081: LOG:  set SO_REUSEPORT option to the socket
>> 2016-01-12 15:28:52: pid 10081: LOG:  creating socket for sending heartbeat
>> 2016-01-12 15:28:52: pid 10081: DETAIL:  set SO_REUSEPORT
>> 2016-01-12 15:28:52: pid 10079: LOG:  set SO_REUSEPORT option to the socket
>> 2016-01-12 15:28:52: pid 10079: LOG:  creating watchdog heartbeat receive
>> socket.
>> 2016-01-12 15:28:52: pid 10079: DETAIL:  set SO_REUSEPORT
>> 2016-01-12 15:28:58: pid 10032: LOG:  new outbond connection to
>> localhost:21104
>> 2016-01-12 15:29:01: pid 10077: LOG:  watchdog: lifecheck started
>> 2016-01-12 15:29:11: pid 10028: LOG:  received fast shutdown request
>> 2016-01-12 15:29:11: pid 10028: LOG:  shutdown request. closing listen
>> socket
>> 2016-01-12 15:29:11: pid 10078: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10085: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10109: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10080: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10082: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10083: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10084: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10086: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10087: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10088: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10089: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10090: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10091: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10092: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10093: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10094: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10095: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10111: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10102: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10103: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10101: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10104: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10100: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10032: LOG:  Watchdog is shutting down
>> 2016-01-12 15:29:11: pid 10105: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10099: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10106: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10191: LOG:  watchdog: de-escalation started
>> 2016-01-12 15:29:11: pid 10098: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10110: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10097: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10107: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10108: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:11: pid 10096: LOG:  child process received shutdown
>> request signal 2
>>
>> 2016-01-12 15:28:51: pid 10119: LOG:  Backend status file
>> /home/t-ishii/work/pgpool-II/current/pgpool2/src/test/regression/tests/004.watchdog/master/log/pgpool_status
>> discarded
>> 2016-01-12 15:28:51: pid 10119: LOG:  waiting for watchdog to initialize
>> 2016-01-12 15:28:51: pid 10123: LOG:  setting the local watchdog node name
>> to "Linux_tishii-CF-SX3HE4BP_11100"
>> 2016-01-12 15:28:51: pid 10123: LOG:  watchdog cluster configured with 1
>> remote nodes
>> 2016-01-12 15:28:51: pid 10123: LOG:  watchdog remote node:0 on
>> localhost:21004
>> 2016-01-12 15:28:51: pid 10123: LOG:  IPC socket path:
>> "/tmp/.s.PGPOOLWD_CMD.21104"
>> 2016-01-12 15:28:51: pid 10123: LOG:  new outbond connection to
>> localhost:21004
>> 2016-01-12 15:28:51: pid 10123: LOG:  watchdog node state changed from
>> [LOADING] to [INITIALIZING]
>> 2016-01-12 15:28:52: pid 10123: LOG:  watchdog node state changed from
>> [INITIALIZING] to [STANDBY]
>> 2016-01-12 15:28:52: pid 10119: LOG:  watchdog process is initialized
>> 2016-01-12 15:28:52: pid 10123: LOG:  new IPC connection is received
>> 2016-01-12 15:28:52: pid 10119: LOG:  Setting up socket for
>> 127.0.0.1:11100
>> 2016-01-12 15:28:52: pid 10129: LOG:  2 watchdog nodes are configured for
>> lifecheck
>> 2016-01-12 15:28:52: pid 10129: LOG:  watchdog nodes ID:0
>> Name:"Linux_tishii-CF-SX3HE4BP_11100"
>> 2016-01-12 15:28:52: pid 10129: DETAIL:  Host:"localhost" WD Port:21104
>> pgpool-II port:11100
>> 2016-01-12 15:28:52: pid 10129: LOG:  watchdog nodes ID:1
>> Name:"Linux_tishii-CF-SX3HE4BP_11000"
>> 2016-01-12 15:28:52: pid 10129: DETAIL:  Host:"localhost" WD Port:21004
>> pgpool-II port:11000
>> 2016-01-12 15:28:52: pid 10119: LOG:  pgpool-II successfully started.
>> version 3.5beta1 (ekieboshi)
>> 2016-01-12 15:28:53: pid 10134: LOG:  set SO_REUSEPORT option to the socket
>> 2016-01-12 15:28:53: pid 10134: LOG:  creating socket for sending heartbeat
>> 2016-01-12 15:28:53: pid 10134: DETAIL:  set SO_REUSEPORT
>> 2016-01-12 15:28:53: pid 10132: LOG:  set SO_REUSEPORT option to the socket
>> 2016-01-12 15:28:53: pid 10132: LOG:  creating watchdog heartbeat receive
>> socket.
>> 2016-01-12 15:28:53: pid 10132: DETAIL:  set SO_REUSEPORT
>> 2016-01-12 15:28:58: pid 10123: LOG:  new watchdog node connection is
>> received from "127.0.0.1:28554"
>> 2016-01-12 15:29:02: pid 10129: LOG:  watchdog: lifecheck started
>> 2016-01-12 15:29:11: pid 10123: LOG:  remote node
>> "Linux_tishii-CF-SX3HE4BP_11000" is shutting down
>> 2016-01-12 15:29:11: pid 10123: LOG:  watchdog cluster has lost the
>> coordinator node
>> 2016-01-12 15:29:11: pid 10123: LOG:  watchdog node state changed from
>> [STANDBY] to [JOINING]
>> 2016-01-12 15:29:11: pid 10123: LOG:  read from socket failed with error
>> :"Connection reset by peer"
>> 2016-01-12 15:29:11: pid 10123: LOG:  read from socket failed, remote end
>> closed the connection
>> 2016-01-12 15:29:16: pid 10123: LOG:  watchdog node state changed from
>> [JOINING] to [INITIALIZING]
>> 2016-01-12 15:29:17: pid 10123: LOG:  I am the only alive node in the
>> watchdog cluster
>> 2016-01-12 15:29:17: pid 10123: HINT:  skiping stand for coordinator state
>> 2016-01-12 15:29:17: pid 10123: LOG:  watchdog node state changed from
>> [INITIALIZING] to [MASTER]
>> 2016-01-12 15:29:17: pid 10123: LOG:  I am announcing my self as
>> master/coordinator watchdog node
>> 2016-01-12 15:29:22: pid 10123: LOG:  I am the cluster leader node
>> 2016-01-12 15:29:22: pid 10123: DETAIL:  our declare coordinator message
>> is accepted by all nodes
>> 2016-01-12 15:29:22: pid 10123: LOG:  I am the cluster leader node.
>> Starting escalation process
>> 2016-01-12 15:29:22: pid 10123: LOG:  escalation process started with
>> PID:10212
>> 2016-01-12 15:29:22: pid 10212: LOG:  watchdog: escalation started
>> 2016-01-12 15:29:22: pid 10123: LOG:  watchdog escalation process with
>> pid: 10212 exit with SUCCESS.
>> 2016-01-12 15:29:22: pid 10123: LOG:  watchdog escalation process with
>> pid: 10212 exits with status 0
>> 2016-01-12 15:29:22: pid 10119: LOG:  received fast shutdown request
>> 2016-01-12 15:29:22: pid 10119: LOG:  shutdown request. closing listen
>> socket
>> 2016-01-12 15:29:22: pid 10130: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10141: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10144: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10133: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10142: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10136: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10135: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10137: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10147: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10143: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10138: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10149: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10150: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10145: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10139: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10153: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10151: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10140: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10162: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10160: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10163: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10148: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10161: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10152: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10159: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10154: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10158: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10155: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10123: LOG:  Watchdog is shutting down
>> 2016-01-12 15:29:22: pid 10157: LOG:  child process received shutdown
>> request signal 2
>> 2016-01-12 15:29:22: pid 10215: LOG:  watchdog: de-escalation started
>> 2016-01-12 15:29:22: pid 10156: LOG:  child process received shutdown
>> request signal 2
>>
>>


More information about the pgpool-hackers mailing list