<div dir="ltr"><div>Hi Yugo,<br><br></div>I am using CentOS 6.3 (x86_64).<br><div><div><br>[root@server1 ~]# uname -a<br>Linux server1 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux<br>

<br></div><div>After I shut down pgpool on server1, I wrote the process list to db.log; I don&#39;t see any pgpool process left over.<br></div><div>I also wrote the netstat output from both servers to db.log, which you may find interesting.<br>

</div><div><br></div><div>Thanks~<br>Ning<br></div></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Mar 8, 2013 at 5:44 AM, Yugo Nagata <span dir="ltr">&lt;<a href="mailto:nagata@sraoss.co.jp" target="_blank">nagata@sraoss.co.jp</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi ning,<br>

<br>

Thanks for detailed information. I&#39;ll try to reproduce the problem.<br>

<br>

In addition, could you please provide some more information?<br>

<br>

1. What&#39;s the OS version? (just to be sure)<br>

<br>

2. After shutdown of pgpool on server1, are there any pgpool processes left?<br>

In the log of server1, I see port 9999 is still open and listening while port<br>

9898 is closed. It might mean there is some problem in exiting pgpool.<br>

<br>

&gt; Mar  7 23:57:42 server1 pgpool[2555]: received smart shutdown request<br>

&gt; Mar  7 23:57:42 server1 pgpool[2555]: watchdog_pid: 2558<br>

&gt; Mar  7 23:57:49 server1 pgpool[4407]: wd_chk_sticky: ifup[/sbin/ip] doesn&#39;t have sticky bit<br>

&gt; Mar  7 23:57:49 server1 pgpool[4408]: bind(:) failed. reason: Success<br>

&gt; Mar  7 23:57:49 server1 pgpool[4408]: unlink(/tmp/.s.PGSQL.9898) failed: No such file or directory<br>

<div class="im">&gt; tcp        0      0 <a href="http://0.0.0.0:9999" target="_blank">0.0.0.0:9999</a>                0.0.0.0:*                   LISTEN<br>

</div>&gt; tcp        8      0 <a href="http://172.16.6.154:9999" target="_blank">172.16.6.154:9999</a>           <a href="http://172.16.6.153:34048" target="_blank">172.16.6.153:34048</a>          ESTABLISHED<br>

&gt; tcp        0      0 <a href="http://172.16.6.154:9999" target="_blank">172.16.6.154:9999</a>           <a href="http://172.16.6.153:33924" target="_blank">172.16.6.153:33924</a>          TIME_WAIT<br>

&gt; tcp        0      0 <a href="http://172.16.6.154:9999" target="_blank">172.16.6.154:9999</a>           <a href="http://172.16.6.154:36458" target="_blank">172.16.6.154:36458</a>          TIME_WAIT<br>

&gt; tcp        0      0 <a href="http://172.16.6.154:9999" target="_blank">172.16.6.154:9999</a>           <a href="http://172.16.6.154:36514" target="_blank">172.16.6.154:36514</a>          TIME_WAIT<br>

&gt; tcp        0      0 <a href="http://172.16.6.154:44297" target="_blank">172.16.6.154:44297</a>          <a href="http://172.16.6.153:9999" target="_blank">172.16.6.153:9999</a>           TIME_WAIT<br>

&gt; tcp        9      0 <a href="http://172.16.6.154:9999" target="_blank">172.16.6.154:9999</a>           <a href="http://172.16.6.153:34008" target="_blank">172.16.6.153:34008</a>          CLOSE_WAIT<br>

&gt; tcp        0      0 <a href="http://172.16.6.154:9999" target="_blank">172.16.6.154:9999</a>           <a href="http://172.16.6.154:36486" target="_blank">172.16.6.154:36486</a>          TIME_WAIT<br>

&gt; unix  2      [ ACC ]     STREAM     LISTENING     15867  /tmp/.s.PGSQL.9999<br>
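<br>
The check described above (port 9999 still listening while 9898 is closed) can be sketched in a few lines. This is only an illustrative helper, not part of pgpool: the function name listening_ports and the fixed port set are assumptions, and the sample rows are copied from the netstat output quoted above.<br>
<br>
```python
# Hypothetical helper: report which pgpool ports still have a TCP LISTEN socket.
PGPOOL_PORTS = {9898, 9999}  # pcp port and pgpool port as used in this thread


def listening_ports(netstat_text, ports=PGPOOL_PORTS):
    """Return the subset of `ports` that appear in TCP LISTEN state."""
    found = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        # TCP rows: proto, recv-q, send-q, local address, foreign address, state
        if len(fields) not in (6, 7) or not fields[0].startswith("tcp"):
            continue
        local, state = fields[3], fields[5]
        if state != "LISTEN":
            continue
        port = local.rsplit(":", 1)[-1]
        if port.isdigit() and int(port) in ports:
            found.add(int(port))
    return found


sample = """\
tcp        0      0 0.0.0.0:9999                0.0.0.0:*                   LISTEN
tcp        8      0 172.16.6.154:9999           172.16.6.153:34048          ESTABLISHED
tcp        9      0 172.16.6.154:9999           172.16.6.153:34008          CLOSE_WAIT
unix  2      [ ACC ]     STREAM     LISTENING     15867  /tmp/.s.PGSQL.9999
"""

print(sorted(listening_ports(sample)))
```
<br>
If the result is non-empty after pgpool has been stopped, something still holds the listening socket, which would be consistent with the failed bind on restart.<br>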

<br>

<br>

On Fri, 8 Mar 2013 00:28:20 -0600<br>

<div class="HOEnZb"><div class="h5">ning chan &lt;<a href="mailto:ninchan8328@gmail.com">ninchan8328@gmail.com</a>&gt; wrote:<br>

<br>

&gt; Hi Yugo,<br>

&gt; Thanks for looking at the issue. Here are the exact steps I took to get into<br>

&gt; the problem.<br>

&gt; 1) make sure replication is set up and pgpool on both servers has the<br>

&gt; backend value set to 2<br>

&gt; 2) shutdown postgresql on the primary, this will promote the<br>

&gt; standby(server1)  to become new primary<br>

&gt; 3) execute pcp_recovery on server1 which will  recover the failed node<br>

&gt; (server0) and connect to the new primary (server1), check backend status<br>

&gt; value<br>

&gt; 4) shut down postgresql on server1 (new primary), this should promote<br>

&gt; server0 to become primary again<br>

&gt; 5) execute pcp_recovery on server0 which will recover the failed node<br>

&gt; (server1) and connect to the new primary (server0 again), check backend<br>

&gt; status value<br>

&gt; 6) go to server1, shut down pgpool, and start it up again; pgpool at this<br>

&gt; point will not be able to start anymore, and a server reboot is required in order<br>

&gt; to bring pgpool online.<br>

&gt;<br>

&gt; I attached the db-server0 and db-server1.log files, to which I also redirected all the<br>

&gt; commands (search for &#39;Issue command&#39;) I executed in the above steps,<br>

&gt; so you should be able to follow them very easily.<br>

&gt; I also attached my postgresql and pgpool conf files as well as my<br>

&gt; basebackup.sh and remote start script, just in case you need them to<br>

&gt; reproduce the problem.<br>

&gt;<br>

&gt; Thanks~<br>

&gt; Ning<br>

&gt;<br>

&gt;<br>

&gt; On Thu, Mar 7, 2013 at 6:01 AM, Yugo Nagata &lt;<a href="mailto:nagata@sraoss.co.jp">nagata@sraoss.co.jp</a>&gt; wrote:<br>

&gt;<br>

&gt; &gt; Hi ning,<br>

&gt; &gt;<br>

&gt; &gt; I tried to reproduce the bind error by repeatedly starting/stopping pgpools<br>

&gt; &gt; with both watchdog enabled. But I cannot see the error.<br>

&gt; &gt;<br>

&gt; &gt; Can you tell me a reliable way to reproduce it?<br>

&gt; &gt;<br>

&gt; &gt;<br>

&gt; &gt; On Wed, 6 Mar 2013 11:21:01 -0600<br>

&gt; &gt; ning chan &lt;<a href="mailto:ninchan8328@gmail.com">ninchan8328@gmail.com</a>&gt; wrote:<br>

&gt; &gt;<br>

&gt; &gt; &gt; Hi Tatsuo,<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; Do you need any more data for your investigation?<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; Thanks~<br>

&gt; &gt; &gt; Ning<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; On Mon, Mar 4, 2013 at 4:08 PM, ning chan &lt;<a href="mailto:ninchan8328@gmail.com">ninchan8328@gmail.com</a>&gt; wrote:<br>

&gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt; Hi Tatsuo,<br>

&gt; &gt; &gt; &gt; I shutdown one watchdog instead of both, I can&#39;t reproduce the problem.<br>

&gt; &gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt; Here is the details:<br>

&gt; &gt; &gt; &gt; server0 pgpool watchdog is disabled<br>

&gt; &gt; &gt; &gt; server1 pgpool watchdog is enabled and it is a primary database for<br>

&gt; &gt; &gt; &gt; streaming replication, failover &amp; failback works just fine; except<br>

&gt; &gt; that the<br>

&gt; &gt; &gt; &gt; virtual ip will not be migrated to the other pgpool server because<br>

&gt; &gt; &gt; &gt; watchdog on server0 is not running.<br>

&gt; &gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt; FYI: as i reported on the other email thread, running watchdog on both<br>

&gt; &gt; &gt; &gt; server will not allow me to failover &amp; failback more than once which I<br>

&gt; &gt; am<br>

&gt; &gt; &gt; &gt; still looking for root cause.<br>

&gt; &gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt; 1) both node shows pool_nodes as state 2<br>

&gt; &gt; &gt; &gt; 2) shutdown database on server1, then cause the DB to failover to<br>

&gt; &gt; server0,<br>

&gt; &gt; &gt; &gt; server0 is now primary<br>

&gt; &gt; &gt; &gt; 3) execute pcp_recovery on server0 to bring the server1 failed database<br>

&gt; &gt; &gt; &gt; back online and connects to server0 as a standby; however, pool_nodes<br>

&gt; &gt; on<br>

&gt; &gt; &gt; &gt; server1 shows the following:<br>

&gt; &gt; &gt; &gt; [root@server1 data]# psql -c &quot;show pool_nodes&quot; -p 9999<br>

&gt; &gt; &gt; &gt;  node_id | hostname | port | status | lb_weight |  role<br>

&gt; &gt; &gt; &gt; ---------+----------+------+--------+-----------+---------<br>

&gt; &gt; &gt; &gt;  0       | server0  | 5432 | 2      | 0.500000  | primary<br>

&gt; &gt; &gt; &gt;  1       | server1  | 5432 | 3      | 0.500000  | standby<br>

&gt; &gt; &gt; &gt; (2 rows)<br>

&gt; &gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt; As shown, pgpool on server1 thinks server1 itself is in state 3.<br>

&gt; &gt; &gt; &gt; Replication however is working fine.<br>

&gt; &gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt; 4) i have to execute pcp_attach_node on server1 to bring its pool_nodes<br>

&gt; &gt; &gt; &gt; state to 2, however, server0 pool_nodes info about server1 becomes 3.<br>

&gt; &gt; see<br>

&gt; &gt; &gt; &gt; below for both servers output:<br>

&gt; &gt; &gt; &gt; [root@server1 data]# psql -c &quot;show pool_nodes&quot; -p 9999<br>

&gt; &gt; &gt; &gt;  node_id | hostname | port | status | lb_weight |  role<br>

&gt; &gt; &gt; &gt; ---------+----------+------+--------+-----------+---------<br>

&gt; &gt; &gt; &gt;  0       | server0  | 5432 | 2      | 0.500000  | primary<br>

&gt; &gt; &gt; &gt;  1       | server1  | 5432 | 2      | 0.500000  | standby<br>

&gt; &gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt; [root@server0 ~]# psql -c &quot;show pool_nodes&quot; -p 9999<br>

&gt; &gt; &gt; &gt;  node_id | hostname | port | status | lb_weight |  role<br>

&gt; &gt; &gt; &gt; ---------+----------+------+--------+-----------+---------<br>

&gt; &gt; &gt; &gt;  0       | server0  | 5432 | 2      | 0.500000  | primary<br>

&gt; &gt; &gt; &gt;  1       | server1  | 5432 | 3      | 0.500000  | standby<br>

&gt; &gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt; 5) execute the following command on server1 will bring the server1<br>

&gt; &gt; status<br>

&gt; &gt; &gt; &gt; to 2 on both node:<br>

&gt; &gt; &gt; &gt; /usr/local/bin/pcp_attach_node 10 server0 9898 pgpool [passwd] 1<br>

&gt; &gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt; [root@server1 data]# psql -c &quot;show pool_nodes&quot; -p 9999<br>

&gt; &gt; &gt; &gt;  node_id | hostname | port | status | lb_weight |  role<br>

&gt; &gt; &gt; &gt; ---------+----------+------+--------+-----------+---------<br>

&gt; &gt; &gt; &gt;  0       | server0  | 5432 | 2      | 0.500000  | primary<br>

&gt; &gt; &gt; &gt;  1       | server1  | 5432 | 2      | 0.500000  | standby<br>

&gt; &gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt; [root@server0 ~]# psql -c &quot;show pool_nodes&quot; -p 9999<br>

&gt; &gt; &gt; &gt;  node_id | hostname | port | status | lb_weight |  role<br>

&gt; &gt; &gt; &gt; ---------+----------+------+--------+-----------+---------<br>

&gt; &gt; &gt; &gt;  0       | server0  | 5432 | 2      | 0.500000  | primary<br>

&gt; &gt; &gt; &gt;  1       | server1  | 5432 | 2      | 0.500000  | standby<br>
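<br>
The side-by-side comparison of "show pool_nodes" output in steps 3 to 5 can be automated. The sketch below is an assumption-laden illustration, not a pgpool tool: parse_pool_nodes and disagreements are hypothetical names, and the sample text is the step 4 output quoted above, where the two pgpool instances disagree about node 1.<br>
<br>
```python
# Hypothetical helpers: compare backend status as seen by two pgpool instances.


def parse_pool_nodes(text):
    """Map node_id to status from psql `show pool_nodes` text output."""
    status = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.split("|")]
        # Data rows have six cells and a numeric node_id; header and rule lines do not.
        if len(cells) == 6 and cells[0].isdigit():
            status[int(cells[0])] = int(cells[3])
    return status


def disagreements(view_a, view_b):
    """Return {node_id: (status_a, status_b)} for nodes whose status differs."""
    return {
        node: (view_a[node], view_b.get(node))
        for node in view_a
        if view_a[node] != view_b.get(node)
    }


server1_view = """\
 node_id | hostname | port | status | lb_weight |  role
---------+----------+------+--------+-----------+---------
 0       | server0  | 5432 | 2      | 0.500000  | primary
 1       | server1  | 5432 | 2      | 0.500000  | standby
"""

server0_view = """\
 node_id | hostname | port | status | lb_weight |  role
---------+----------+------+--------+-----------+---------
 0       | server0  | 5432 | 2      | 0.500000  | primary
 1       | server1  | 5432 | 3      | 0.500000  | standby
"""

print(disagreements(parse_pool_nodes(server1_view), parse_pool_nodes(server0_view)))
```
<br>
An empty result corresponds to the state reached after step 5, where both instances report status 2 for both nodes.<br>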

&gt; &gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt; Please advise the next step.<br>

&gt; &gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt; Thanks~<br>

&gt; &gt; &gt; &gt; Ning<br>

&gt; &gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt; On Sun, Mar 3, 2013 at 6:03 PM, Tatsuo Ishii &lt;<a href="mailto:ishii@postgresql.org">ishii@postgresql.org</a>&gt;<br>

&gt; &gt; wrote:<br>

&gt; &gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt;&gt; &gt; Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason:<br>

&gt; &gt; Success<br>

&gt; &gt; &gt; &gt;&gt;<br>

&gt; &gt; &gt; &gt;&gt; This error message seems pretty strange. &quot;:&quot; should be something like<br>

&gt; &gt; &gt; &gt;&gt; &quot;/tmp/.s.PGSQL.9898&quot;. Also it&#39;s weird because of &quot;failed. reason:<br>

&gt; &gt; &gt; &gt;&gt; Success&quot;. To isolate the problem, can you please disable watchdog and try<br>

&gt; &gt; &gt; &gt;&gt; again?<br>

&gt; &gt; &gt; &gt;&gt; --<br>

&gt; &gt; &gt; &gt;&gt; Tatsuo Ishii<br>

&gt; &gt; &gt; &gt;&gt; SRA OSS, Inc. Japan<br>

&gt; &gt; &gt; &gt;&gt; English: <a href="http://www.sraoss.co.jp/index_en.php" target="_blank">http://www.sraoss.co.jp/index_en.php</a><br>

&gt; &gt; &gt; &gt;&gt; Japanese: <a href="http://www.sraoss.co.jp" target="_blank">http://www.sraoss.co.jp</a><br>

&gt; &gt; &gt; &gt;&gt;<br>

&gt; &gt; &gt; &gt;&gt;<br>

&gt; &gt; &gt; &gt;&gt; &gt; Hi All,<br>

&gt; &gt; &gt; &gt;&gt; &gt; After upgrade to pgPool-II 3.2.3 and I tested my failover/ failback<br>

&gt; &gt; &gt; &gt;&gt; setup,<br>

&gt; &gt; &gt; &gt;&gt; &gt; and start / stop pgpool multiple times, I see one of the pgpool goes<br>

&gt; &gt; in<br>

&gt; &gt; &gt; &gt;&gt; to an<br>

&gt; &gt; &gt; &gt;&gt; &gt; unrecoverable state.<br>

&gt; &gt; &gt; &gt;&gt; &gt;<br>

&gt; &gt; &gt; &gt;&gt; &gt; Mar  1 10:45:25 server1 pgpool[3007]: received smart shutdown<br>

&gt; &gt; request<br>

&gt; &gt; &gt; &gt;&gt; &gt; Mar  1 10:45:25 server1 pgpool[3007]: watchdog_pid: 3010<br>

&gt; &gt; &gt; &gt;&gt; &gt; Mar  1 10:45:31 server1 pgpool[3338]: wd_chk_sticky: ifup[/sbin/ip]<br>

&gt; &gt; &gt; &gt;&gt; doesn&#39;t<br>

&gt; &gt; &gt; &gt;&gt; &gt; have sticky bit<br>

&gt; &gt; &gt; &gt;&gt; &gt; Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason:<br>

&gt; &gt; Success<br>

&gt; &gt; &gt; &gt;&gt; &gt; Mar  1 10:45:31 server1 pgpool[3339]: unlink(/tmp/.s.PGSQL.9898)<br>

&gt; &gt; &gt; &gt;&gt; failed: No<br>

&gt; &gt; &gt; &gt;&gt; &gt; such file or directory<br>

&gt; &gt; &gt; &gt;&gt; &gt;<br>

&gt; &gt; &gt; &gt;&gt; &gt;<br>

&gt; &gt; &gt; &gt;&gt; &gt; netstat shows the following:<br>

&gt; &gt; &gt; &gt;&gt; &gt; [root@server1 ~]# netstat -na |egrep &quot;9898|9999&quot;<br>

&gt; &gt; &gt; &gt;&gt; &gt; tcp        0      0 <a href="http://0.0.0.0:9898" target="_blank">0.0.0.0:9898</a>                0.0.0.0:*<br>

&gt; &gt; &gt; &gt;&gt; &gt; LISTEN<br>

&gt; &gt; &gt; &gt;&gt; &gt; tcp        0      0 <a href="http://0.0.0.0:9999" target="_blank">0.0.0.0:9999</a>                0.0.0.0:*<br>

&gt; &gt; &gt; &gt;&gt; &gt; LISTEN<br>

&gt; &gt; &gt; &gt;&gt; &gt; tcp        0      0 <a href="http://172.16.6.154:46650" target="_blank">172.16.6.154:46650</a>          <a href="http://172.16.6.153:9999" target="_blank">172.16.6.153:9999</a><br>

&gt; &gt; &gt; &gt;&gt; &gt; TIME_WAIT<br>

&gt; &gt; &gt; &gt;&gt; &gt; tcp        9      0 <a href="http://172.16.6.154:9999" target="_blank">172.16.6.154:9999</a>           <a href="http://172.16.6.153:51868" target="_blank">172.16.6.153:51868</a><br>

&gt; &gt; &gt; &gt;&gt; &gt; CLOSE_WAIT<br>

&gt; &gt; &gt; &gt;&gt; &gt; tcp        9      0 <a href="http://172.16.6.154:9999" target="_blank">172.16.6.154:9999</a>           <a href="http://172.16.6.153:51906" target="_blank">172.16.6.153:51906</a><br>

&gt; &gt; &gt; &gt;&gt; &gt; CLOSE_WAIT<br>

&gt; &gt; &gt; &gt;&gt; &gt; tcp        0      0 <a href="http://172.16.6.154:9999" target="_blank">172.16.6.154:9999</a>           <a href="http://172.16.6.154:50624" target="_blank">172.16.6.154:50624</a><br>

&gt; &gt; &gt; &gt;&gt; &gt; TIME_WAIT<br>

&gt; &gt; &gt; &gt;&gt; &gt; tcp        9      0 <a href="http://172.16.6.154:9999" target="_blank">172.16.6.154:9999</a>           <a href="http://172.16.6.153:51946" target="_blank">172.16.6.153:51946</a><br>

&gt; &gt; &gt; &gt;&gt; &gt; CLOSE_WAIT<br>

&gt; &gt; &gt; &gt;&gt; &gt; unix  2      [ ACC ]     STREAM     LISTENING     18698<br>

&gt; &gt; &gt; &gt;&gt;  /tmp/.s.PGSQL.9898<br>

&gt; &gt; &gt; &gt;&gt; &gt; unix  2      [ ACC ]     STREAM     LISTENING     18685<br>

&gt; &gt; &gt; &gt;&gt;  /tmp/.s.PGSQL.9999<br>

&gt; &gt; &gt; &gt;&gt; &gt;<br>

&gt; &gt; &gt; &gt;&gt; &gt; Is this a known issue?<br>

&gt; &gt; &gt; &gt;&gt; &gt;<br>

&gt; &gt; &gt; &gt;&gt; &gt; I will have to reboot the server in order to start pgpool back<br>

&gt; &gt; online.<br>

&gt; &gt; &gt; &gt;&gt; &gt;<br>

&gt; &gt; &gt; &gt;&gt; &gt; My cluster has two servers (server0 &amp; server1) which each of them<br>

&gt; &gt; are<br>

&gt; &gt; &gt; &gt;&gt; &gt; running pgpool, and postgreSQL with streaming Replication setup.<br>

&gt; &gt; &gt; &gt;&gt; &gt;<br>

&gt; &gt; &gt; &gt;&gt; &gt; Thanks~<br>

&gt; &gt; &gt; &gt;&gt; &gt; Ning<br>

&gt; &gt; &gt; &gt;&gt;<br>

&gt; &gt; &gt; &gt;<br>

&gt; &gt; &gt; &gt;<br>

&gt; &gt;<br>

&gt; &gt;<br>

&gt; &gt; --<br>

&gt; &gt; Yugo Nagata &lt;<a href="mailto:nagata@sraoss.co.jp">nagata@sraoss.co.jp</a>&gt;<br>

&gt; &gt;<br>

<br>

<br>

</div></div><span class="HOEnZb"><font color="#888888">--<br>

Yugo Nagata &lt;<a href="mailto:nagata@sraoss.co.jp">nagata@sraoss.co.jp</a>&gt;<br>

</font></span></blockquote></div><br></div>