No subject


Sun Nov 18 18:09:07 JST 2012


Feb 19 13:47:01 server0 pgpool[4391]: execute command:
/home/pgpool/failover.py -d 1 -h server1 -p 5432 -D /opt/postgres/9.2/data
-m 0 -H server0 -M 0 -P 1 -r 5432 -R /opt/postgres/9.2/data
Feb 19 13:47:01 server0 postgres[4786]: [2-1] LOG:  incomplete startup
packet
Feb 19 13:47:02 server0 pgpool[4391]: find_primary_node_repeatedly: waiting
for finding a primary node
Feb 19 13:47:06 server0 postgres[3083]: [6-1] LOG:  trigger file found:
/opt/postgres/9.2/data/trigger_file0
Feb 19 13:47:06 server0 postgres[3083]: [7-1] LOG:  redo done at 0/91000020
Feb 19 13:47:06 server0 postgres[3083]: [8-1] LOG:  selected new timeline
ID: 18
Feb 19 13:47:06 server0 postgres[3083]: [9-1] LOG:  archive recovery
complete
Feb 19 13:47:06 server0 postgres[3081]: [2-1] LOG:  database system is
ready to accept connections
Feb 19 13:47:06 server0 postgres[4804]: [2-1] LOG:  autovacuum launcher
started
Feb 19 13:47:07 server0 pgpool[4391]: find_primary_node: primary node id is
0
Feb 19 13:47:07 server0 pgpool[4391]: starting follow degeneration.
shutdown host server1(5432)
Feb 19 13:47:07 server0 pgpool[4391]: failover: 1 follow backends have been
degenerated
Feb 19 13:47:07 server0 pgpool[4391]: failover: set new primary node: 0
Feb 19 13:47:07 server0 pgpool[4391]: failover: set new master node: 0
Feb 19 13:47:07 server0 pgpool[4817]: connection received:
host=server1.local port=54619
Feb 19 13:47:07 server0 pgpool[4816]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4818]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4819]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4815]: start triggering follow command.
Feb 19 13:47:07 server0 pgpool[4815]: execute command: echo 1 server1 5432
/opt/postgres/9.2/data 0 server0 0 1 5432 /opt/postgres/9.2/data %
Feb 19 13:47:07 server0 pgpool[4821]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4822]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4823]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4820]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4827]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4828]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4826]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4829]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4830]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4831]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4832]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4833]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4834]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4833]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4834]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4835]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4836]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4837]: do_child: failback event found.
restart myself.
Feb 19 13:47:07 server0 pgpool[4391]: failover done. shutdown host
server1(5432)

Any idea why failover did not work after the recovery? Do i need to restart
remote pgpool after recovery?


On Mon, Feb 18, 2013 at 4:24 PM, ning chan <ninchan8328 at gmail.com> wrote:

>
> Hi Tatsuo,
> Sorry for the late reply as I was traveling.
> I am able to reproduce the problem and here is what I did:
>
> *1) Make sure primary (server0) and standby (server1) connections are
> good.*
> [root at server0 tmp]# psql -c "show pool_nodes" -p 9999
>  node_id | hostname | port | status | lb_weight |  role
> ---------+----------+------+--------+-----------+---------
>  0       | server0  | 5432 | 2      | 0.500000  | primary
>  1       | server1  | 5432 | 2      | 0.500000  | standby
> (2 rows)
>
> [root at server1 tmp]# psql -c "show pool_nodes" -p 9999
>  node_id | hostname | port | status | lb_weight |  role
> ---------+----------+------+--------+-----------+---------
>  0       | server0  | 5432 | 2      | 0.500000  | primary
>  1       | server1  | 5432 | 2      | 0.500000  | standby
> (2 rows)
>
> *2) force shutdown primary (server0), and standby (server1) server was
> promoted, below is the pgpool log from server1 during the failover:*
> Feb 18 15:31:45 server1 pgpool[2691]: connect_inet_domain_socket:
> connect() failed: No route to host
> Feb 18 15:31:45 server1 pgpool[2691]: make_persistent_db_connection:
> connection to server0(5432) failed
> Feb 18 15:31:45 server1 pgpool[2691]: check_replication_time_lag: could
> not connect to DB node 0, check sr_check_user and sr_check_password
> Feb 18 15:31:45 server1 pgpool[2689]: connect_inet_domain_socket:
> connect() failed: No route to host
> Feb 18 15:31:45 server1 pgpool[2689]: connection to server0(5432) failed
> Feb 18 15:31:45 server1 pgpool[2689]: new_connection: create_cp() failed
> Feb 18 15:31:45 server1 pgpool[2689]: degenerate_backend_set: 0 fail over
> request from pid 2689
> Feb 18 15:31:48 server1 pgpool[2689]: wd_create_send_socket: connect()
> reports failure (No route to host). You can safely ignore this while
> starting up.
> Feb 18 15:31:48 server1 pgpool[2653]: connect_inet_domain_socket:
> connect() failed: No route to host
> Feb 18 15:31:48 server1 pgpool[2653]: make_persistent_db_connection:
> connection to server0(5432) failed
> Feb 18 15:31:48 server1 pgpool[2653]: health check failed. 0 th host
> server0 at port 5432 is down
> Feb 18 15:31:48 server1 pgpool[2653]: health check retry sleep time: 5
> second(s)
> Feb 18 15:31:48 server1 pgpool[2653]: starting degeneration. shutdown host
> server0(5432)
> Feb 18 15:31:48 server1 pgpool[2653]: Restart all children
> Feb 18 15:31:48 server1 pgpool[2653]: execute command:
> /home/pgpool/failover.py -d 0 -h server0 -p 5432 -D /opt/postgres/9.2/data
> -m 1 -H server1 -M 0 -P 0 -r 5432 -R /opt/postgres/9.2/data
> Feb 18 15:31:49 server1 pgpool[2653]: find_primary_node_repeatedly:
> waiting for finding a primary node
> Feb 18 15:31:53 server1 postgres[1970]: [7-1] LOG:  trigger file found:
> /tmp/trigger_file0
> Feb 18 15:31:53 server1 postgres[2597]: [3-1] FATAL:  terminating
> walreceiver process due to administrator command
> Feb 18 15:31:53 server1 postgres[1970]: [8-1] LOG:  record with zero
> length at 0/64000418
> Feb 18 15:31:53 server1 postgres[1970]: [9-1] LOG:  redo done at 0/640003B8
> Feb 18 15:31:53 server1 postgres[1970]: [10-1] LOG:  selected new timeline
> ID: 15
> Feb 18 15:31:53 server1 postgres[1970]: [11-1] LOG:  archive recovery
> complete
> Feb 18 15:31:53 server1 postgres[1968]: [2-1] LOG:  database system is
> ready to accept connections
> Feb 18 15:31:53 server1 postgres[3168]: [2-1] LOG:  autovacuum launcher
> started
> Feb 18 15:31:54 server1 pgpool[2653]: find_primary_node: primary node id
> is 1
> Feb 18 15:31:54 server1 pgpool[2653]: starting follow degeneration.
> shutdown host server0(5432)
> Feb 18 15:31:54 server1 pgpool[2653]: failover: 1 follow backends have
> been degenerated
> Feb 18 15:31:54 server1 pgpool[2653]: failover: set new primary node: 1
> Feb 18 15:31:54 server1 pgpool[2653]: failover: set new master node: 1
> Feb 18 15:31:54 server1 pgpool[3171]: connection received:
> host=server1.local port=58346
> Feb 18 15:31:54 server1 pgpool[3170]: start triggering follow command.
> Feb 18 15:31:54 server1 pgpool[3170]: execute command: echo 0 server0 5432
> /opt/postgres/9.2/data 1 server1 0 0 5432 /opt/postgres/9.2/data %
> Feb 18 15:31:54 server1 pgpool[2691]: worker process received restart
> request
> Feb 18 15:31:54 server1 pgpool[2653]: failover done. shutdown host
> server0(5432)
>
> *3) boot up the failed server (server0) as usual, pgpool and database
> started up as usual.
> *
> *after server0 boot up, i check the pool_nodes info, server0 pgpool still
> think server0 is the primary:*
> [root at server0 ~]# psql -c "show pool_nodes" -p 9999
>  node_id | hostname | port | status | lb_weight |  role
> ---------+----------+------+--------+-----------+---------
>  0       | server0  | 5432 | 2      | 0.500000  | primary
>  1       | server1  | 5432 | 2      | 0.500000  | standby
> (2 rows)
>
> server1 pgpool still think server1 is the primary:
> [root at server1 tmp]# psql -c "show pool_nodes" -p 9999
>  node_id | hostname | port | status | lb_weight |  role
> ---------+----------+------+--------+-----------+---------
>  0       | server0  | 5432 | 3      | 0.500000  | standby
>  1       | server1  | 5432 | 2      | 0.500000  | primary
> (2 rows)
>
> *4) kill the database on server0 to prepare of the recovery*
> *as soon as i kill the server0 database, pgpool log shows following:*
> Feb 18 15:33:56 server0 pgpool[2153]: execute command:
> /home/pgpool/failover.py -d 0 -h server0 -p 5432 -D /opt/postgres/9.2/data
> -m 1 -H server1 -M 0 -P 0 -r 5432 -R /opt/postgres/9.2/data
> Feb 18 15:33:57 server0 pgpool[2153]: find_primary_node_repeatedly:
> waiting for finding a primary node
> Feb 18 15:33:57 server0 pgpool[2153]: find_primary_node: primary node id
> is 1
> Feb 18 15:33:57 server0 pgpool[2153]: starting follow degeneration.
> shutdown host server0(5432)
> Feb 18 15:33:57 server0 pgpool[2153]: failover: 1 follow backends have
> been degenerated
> Feb 18 15:33:57 server0 pgpool[2153]: failover: set new primary node: 1
> Feb 18 15:33:57 server0 pgpool[2153]: failover: set new master node: 1
> Feb 18 15:33:57 server0 pgpool[2534]: start triggering follow command.
> Feb 18 15:33:57 server0 pgpool[2534]: execute command: echo 0 server0 5432
> /opt/postgres/9.2/data 1 server1 0 0 5432 /opt/postgres/9.2/data %
> Feb 18 15:33:57 server0 pgpool[2190]: worker process received restart
> request
> Feb 18 15:33:57 server0 pgpool[2153]: failover done. shutdown host
> server0(5432)
>
> pool_nodes shows:
> [root at server0 ~]# psql -c "show pool_nodes" -p 9999
>  node_id | hostname | port | status | lb_weight |  role
> ---------+----------+------+--------+-----------+---------
>  0       | server0  | 5432 | 3      | 0.500000  | standby
>  1       | server1  | 5432 | 2      | 0.500000  | primary
> (2 rows)
>
>
> *5) on server1, execute /usr/local/bin/pcp_recovery_node 10 localhost
> 9898 pgpool password 0*
> server 1 pgpool log shows following:
> Feb 18 15:34:52 server1 pgpool[3209]: starting recovery command: "SELECT
> pgpool_recovery('basebackup.sh', 'server0', '/opt/postgres/9.2/data')"
> Feb 18 15:34:53 server1 pgpool[3209]: 1st stage is done
> Feb 18 15:34:53 server1 pgpool[3209]: check_postmaster_started: try to
> connect to postmaster on hostname:server0 database:postgres user:postgres
> (retry 0 times)
> Feb 18 15:34:53 server1 pgpool[3209]: check_postmaster_started: failed to
> connect to postmaster on hostname:server0 database:postgres user:postgres
>
> *The message "fail to connect to postmaster on hostname server0" continue
> until the command timeout.
> *
> *it looks like the 1st stage command is done but pgpool on server1 did
> not call the pgpool_remote_start to start the server0 database engine
> although the script is there under postgresql data folder
> *
> [root at server1 data]# ls
> backup_label.old     data       pg_hba.conf    pg_notify
> pg_stat_tmp  PG_VERSION       postmaster.pid
> base                 global     pg_ident.conf  pgpool_remote_start
> pg_subtrans  pg_xlog          recovery.done
> basebackup.sh        ning.test  pg_log         pg_serial
> pg_tblspc    postgresql.conf  recovery.standby1
> copy-base-backup.sh  pg_clog    pg_multixact   pg_snapshots
> pg_twophase  postmaster.opts
>
> *during this period, pgpool log on server0 shows the following:*
> Feb 18 15:34:42 server0 pgpool[2567]: connection received:
> host=server0.local port=45366
> Feb 18 15:34:46 server0 pgpool[2567]: connection received:
> host=server1.local port=39535
> Feb 18 15:34:52 server0 pgpool[2567]: connection received:
> host=server0.local port=45381
> Feb 18 15:34:56 server0 pgpool[2565]: connection received:
> host=server1.local port=39557
> Feb 18 15:35:02 server0 pgpool[2565]: connection received:
> host=server0.local port=45396
> Feb 18 15:35:06 server0 pgpool[2565]: connection received:
> host=server1.local port=39576
> Feb 18 15:35:12 server0 pgpool[2565]: connection received:
> host=server0.local port=45410
> Feb 18 15:35:16 server0 pgpool[2565]: connection received:
> host=server1.local port=39593
> Feb 18 15:35:22 server0 pgpool[2565]: connection received:
> host=server0.local port=45425
> Feb 18 15:35:26 server0 pgpool[2565]: connection received:
> host=server1.local port=39612
> Feb 18 15:35:32 server0 pgpool[2565]: connection received:
> host=server0.local port=45440
> Feb 18 15:35:36 server0 pgpool[2565]: connection received:
> host=server1.local port=39630
> Feb 18 15:35:42 server0 pgpool[2565]: connection received:
> host=server0.local port=45455
> Feb 18 15:35:46 server0 pgpool[2565]: connection received:
> host=server1.local port=39648
>
> *Below is the script of recovery_1st_stage_command = 'basebackup.sh'*
> [root at server1 ~]# more /opt/postgres/9.2/data/basebackup.sh
> #! /bin/sh
> # Recovery script for streaming replication.
> # This script assumes that DB node 0 is primary, and 1 is standby.
> #
> datadir=$1
> desthost=$2
> destdir=$3
>
> echo "datadir=$datadir"
> echo "desthost=$desthost"
> echo "destdir=$destdir"
>
> echo "Executing basebackup.sh" >> /var/log/replication.log
>
> echo "Executing start backup" >> /var/log/replication.log
> psql -c "SELECT pg_start_backup('Streaming Replication', true)" postgres
>
> echo "Executing rsync" >> /var/log/replication.log
> echo "rsync -C -a --progress --delete -e ssh --exclude postgresql.conf
> --exclude postmaster.pid --exclude postmaster.
> opts --exclude pg_log --exclude pg_xlog --exclude recovery.conf $datadir/
> $desthost:$destdir/"
>
> echo "Renaming recovery conf file" >>/var/log/replication.log
> echo "ssh -T $desthost mv $destdir/recovery.done $destdir/recovery.conf"
> ssh -T $desthost mv $destdir/recovery.done $destdir/recovery.conf
>
> echo "Removing trigger file"
> ssh -T $desthost rm -f /tmp/trigger_file0
>
> echo "Executing stop backup" >> /var/log/replication.log
> psql -c "SELECT pg_stop_backup()" postgres
>
> echo "existing basebackup.sh" >> /var/log/replication.log
>
> *Manually execute the command on server1 return the following result:*
>
> [root at server1 data]# ./basebackup.sh /opt/postgres/9.2/data/ server0
> /opt/postgres/9.2/data/
> datadir=/opt/postgres/9.2/data/
> desthost=server0
> destdir=/opt/postgres/9.2/data/
>  pg_start_backup
> -----------------
>  0/76000020
> (1 row)
>
> rsync -C -a --progress --delete -e ssh --exclude postgresql.conf --exclude
> postmaster.pid --exclude postmaster.opts --exclude pg_log --exclude pg_xlog
> --exclude recovery.conf /opt/postgres/9.2/data//
> server0:/opt/postgres/9.2/data//
> ssh -T server0 mv /opt/postgres/9.2/data//recovery.done
> /opt/postgres/9.2/data//recovery.conf
> Removing trigger file
> NOTICE:  WAL archiving is not enabled; you must ensure that all required
> WAL segments are copied through other means to complete the backup
>  pg_stop_backup
> ----------------
>  0/760000E0
> (1 row)
>
> Thanks and please advise.
>
> Ning
>
>
> On Sun, Feb 17, 2013 at 7:05 AM, Tatsuo Ishii <ishii at postgresql.org>wrote:
>
>> Hi,
>>
>> It seems the standby was unable to start up. Can you show standby
>> PostgreSQL's log? Maybe we could find the cause of the problem.
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>>
>> > Hi Tatsuo,
>> > Thank you so much for your reply.
>> > Actually in my case, i was using the pcp_recovery command and execute
>> it on
>> > the current primary server.
>> > However, if the remote node (recover node) database is off, i got the
>> > following message on the primary server pgpool log:
>> >
>> > Jan 31 16:58:10 server0 pgpool[2723]: starting recovery command: "SELECT
>> > pgpool_recovery('basebackup.sh', 'server1', '/opt/postgres/9.2/data')"
>> > Jan 31 16:58:11 server0 pgpool[2723]: 1st stage is done
>> > Jan 31 16:58:11 server0 pgpool[2723]: check_postmaster_started: try to
>> > connect to postmaster on hostname:server1 database:postgres
>> user:postgres
>> > (retry 0 times)
>> > Jan 31 16:58:11 server0 pgpool[2723]: check_postmaster_started: failed
>> to
>> > connect to postmaster on hostname:server1 database:postgres
>> user:postgres
>> > Jan 31 16:58:13 server0 pgpool[2719]: connection received:
>> > host=server0.local port=58446
>> > Jan 31 16:58:14 server0 pgpool[2723]: check_postmaster_started: try to
>> > connect to postmaster on hostname:server1 database:postgres
>> user:postgres
>> > (retry 1 times)
>> > Jan 31 16:58:14 server0 pgpool[2723]: check_postmaster_started: failed
>> to
>> > connect to postmaster on hostname:server1 database:postgres
>> user:postgres
>> > Jan 31 16:58:14 server0 pgpool[2719]: connection received:
>> > host=server1.local port=39928
>> > Jan 31 16:58:17 server0 pgpool[2723]: check_postmaster_started: try to
>> > connect to postmaster on hostname:server1 database:postgres
>> user:postgres
>> > (retry 2 times)
>> > Jan 31 16:58:17 server0 pgpool[2723]: check_postmaster_started: failed
>> to
>> > connect to postmaster on hostname:server1 database:postgres
>> user:postgres
>> > Jan 31 16:58:20 server0 pgpool[2723]: check_postmaster_started: try to
>> > connect to postmaster on hostname:server1 database:postgres
>> user:postgres
>> > (retry 3 times)
>> > Jan 31 16:58:20 server0 pgpool[2723]: check_postmaster_started: failed
>> to
>> > connect to postmaster on hostname:server1 database:postgres
>> user:postgres
>> > Jan 31 16:58:23 server0 pgpool[2719]: connection received:
>> > host=server0.local port=58464
>> > Jan 31 16:58:23 server0 pgpool[2723]: check_postmaster_started: try to
>> > connect to postmaster on hostname:server1 database:template1
>> user:postgres
>> > (retry 0 times)
>> > Jan 31 16:58:23 server0 pgpool[2723]: check_postmaster_started: failed
>> to
>> > connect to postmaster on hostname:server1 database:template1
>> user:postgres
>> > Jan 31 16:58:26 server0 pgpool[2723]: check_postmaster_started: try to
>> > connect to postmaster on hostname:server1 database:template1
>> user:postgres
>> > (retry 1 times)
>> > Jan 31 16:58:26 server0 pgpool[2723]: check_postmaster_started: failed
>> to
>> > connect to postmaster on hostname:server1 database:template1
>> user:postgres
>> > Jan 31 16:58:26 server0 pgpool[2719]: connection received:
>> > host=server1.local port=39946
>> > Jan 31 16:58:29 server0 pgpool[2723]: check_postmaster_started: try to
>> > connect to postmaster on hostname:server1 database:template1
>> user:postgres
>> > (retry 2 times)
>> > Jan 31 16:58:29 server0 pgpool[2723]: check_postmaster_started: failed
>> to
>> > connect to postmaster on hostname:server1 database:template1
>> user:postgres
>> > Jan 31 16:58:32 server0 pgpool[2723]: check_postmaster_started: try to
>> > connect to postmaster on hostname:server1 database:template1
>> user:postgres
>> > (retry 3 times)
>> > Jan 31 16:58:32 server0 pgpool[2723]: check_postmaster_started: failed
>> to
>> > connect to postmaster on hostname:server1 database:template1
>> user:postgres
>> > Jan 31 16:58:33 server0 pgpool[2719]: connection received:
>> > host=server0.local port=58483
>> > Jan 31 16:58:35 server0 pgpool[2723]: check_postmaster_started: try to
>> > connect to postmaster on hostname:server1 database:template1
>> user:postgres
>> > (retry 4 times)
>> > Jan 31 16:58:35 server0 pgpool[2723]: check_postmaster_started: failed
>> to
>> > connect to postmaster on hostname:server1 database:template1
>> user:postgres
>> > Jan 31 16:58:38 server0 pgpool[2723]: check_postmaster_started: try to
>> > connect to postmaster on hostname:server1 database:template1
>> user:postgres
>> > (retry 5 times)
>> > Jan 31 16:58:38 server0 pgpool[2723]: check_postmaster_started: failed
>> to
>> > connect to postmaster on hostname:server1 database:template1
>> user:postgres
>> >
>> > here is the exact command i execute on server0 to recover server1,
>> > /usr/local/bin/pcp_recovery_node 10 localhost 9898 pgpool password 1
>> >
>> > Do you have any idea why?
>> >
>> > Just FYI, we cannot pgpoolAdmin in our environment.
>> >
>> >
>> > On Sun, Feb 17, 2013 at 12:13 AM, Tatsuo Ishii <ishii at postgresql.org>
>> wrote:
>> >
>> >> > Hi all,
>> >> > I have the following question regarding the recovery of a filed
>> Primary
>> >> > Database Server.
>> >> >
>> >> > Question 1: in the documentation, under Streaming Replication Online
>> >> > Recovery section.
>> >> >
>> >> > http://www.pgpool.net/docs/latest/pgpool-en.html#stream
>> >> >
>> >> > in steps 6:
>> >> >
>> >> >    1. After completing online recovery, pgpool-II will start
>> PostgreSQL
>> >> on
>> >> >    the standby node. Install the script for this purpose on each DB
>> >> > nodes. Sample
>> >> >    script <http://www.pgpool.net/docs/latest/pgpool_remote_start> is
>> >> >    included in "sample" directory of the source code. This script
>> uses
>> >> ssh.
>> >> >    You need to allow recovery_user to login from the primary node to
>> the
>> >> >    standby node without being asked password.
>> >> >
>> >> > To my understanding, i think the postgreSQL doesn't not need to be
>> online
>> >> > for the recovery process right? Since later on it mentions that
>> >> > pgpool_remote_start will start up the DB on the failed node.
>> >>
>> >> Acually standby PostgreSQL node should not be started.
>> >>
>> >> > Question 2: in my configuration, i have 2 pgpool server with two
>> >> backends.
>> >> > Will it work for oneline recovery?
>> >>
>> >> Yes, but online recovery process should be initiated by one of pgpool,
>> >> not both. If you enable pgpool-II 3.2's watchdog, it will take care of
>> >> neccessary interlocking.
>> >>
>> >> > Question 3: when the failed node comes back online, should i use
>> >> > pcp_recovery from DB primary or should i use pcp_attach on the failed
>> >> node
>> >> > to recover the failed system? Actually in my case, both methods do
>> not
>> >> > recover my system every time.
>> >>
>> >> I'm confused. Didn't you start the online recovery process by using
>> >> pcp_recovery_node?(of course you could do it via pgpoolAdmin).
>> >>
>> >> Anyway pcp_recovery_node automatically attach recovered node, and you
>> >> don't need to execute pcp_attach_node.
>> >>
>> >> I suggest you read tutorial:
>> >>
>> http://www.pgpool.net/pgpool-web/contrib_docs/simple_sr_setting2/index.html
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS, Inc. Japan
>> >> English: http://www.sraoss.co.jp/index_en.php
>> >> Japanese: http://www.sraoss.co.jp
>> >>
>>
>
>

--20cf306f796ca7bc5704d61a2a8e
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div><div><div><div><div><div><div><div><div>Hi Tatsuo,<br=
></div>I figured out the problem to the some scripting error in the pgpool_=
remote_start to have the incorrect path to pg_ctl.<br></div>As soon as I co=
rrected it i can recover the failed server and bring it online. <br>
</div>I now however facing another problem.<br></div>After I bring the fail=
ed master back in to the pgpool as a standby server, i then shutdown the cu=
rrect Primary server and expecting Standby server will be promoted to becom=
e the new Primary; however it did not happen.<br>
</div>I check the pgpool log, i see failover command is called but it did n=
ot execute. I check and confirm my failover script works just fine. <br><br=
></div>Here is the log:<br>Feb 19 14:51:40 server0 pgpool[3519]: set 1 th b=
ackend down status<br>
Feb 19 14:51:43 server0 pgpool[3554]: wd_create_send_socket: connect() repo=
rts failure (No route to host). You can safely ignore this while starting u=
p.<br>Feb 19 14:51:43 server0 pgpool[3522]: wd_lifecheck: lifecheck failed =
3 times. pgpool seems not to be working<br>
Feb 19 14:51:43 server0 pgpool[3522]: wd_IP_down: ifconfig down succeeded<b=
r>Feb 19 14:51:43 server0 pgpool[3519]: starting degeneration. shutdown hos=
t server1(5432)<br>Feb 19 14:51:43 server0 pgpool[3519]: Restart all childr=
en<br>
Feb 19 14:51:43 server0 pgpool[3519]: execute command: /home/pgpool/failove=
r.py -d 1 -h server1 -p 5432 -D /opt/postgres/9.2/data -m 0 -H server0 -M 0=
 -P 1 -r 5432 -R /opt/postgres/9.2/data<br>Feb 19 14:51:43 server0 postgres=
[3939]: [2-1] LOG:=A0 incomplete startup packet<br>
Feb 19 14:51:43 server0 postgres[3931]: [2-1] LOG:=A0 incomplete startup pa=
cket<br>Feb 19 14:51:43 server0 postgres[3935]: [2-1] LOG:=A0 incomplete st=
artup packet<br>Feb 19 14:51:43 server0 pgpool[3519]: find_primary_node_rep=
eatedly: waiting for finding a primary node<br>
Feb 19 14:51:46 server0 pgpool[3522]: wd_create_send_socket: connect() repo=
rts failure (No route to host). You can safely ignore this while starting u=
p.<br>Feb 19 14:51:46 server0 pgpool[3522]: wd_lifecheck: lifecheck failed =
3 times. pgpool seems not to be working<br>
Feb 19 14:51:54 server0 pgpool[3556]: connect_inet_domain_socket: connect()=
 failed: Connection timed out<br>Feb 19 14:51:54 server0 pgpool[3556]: make=
_persistent_db_connection: connection to server1(5432) failed<br>Feb 19 14:=
51:54 server0 pgpool[3556]: check_replication_time_lag: could not connect t=
o DB node 1, check sr_check_user and sr_check_password<br>
Feb 19 14:52:05 server0 pgpool[3522]: wd_lifecheck: lifecheck failed 3 time=
s. pgpool seems not to be working<br>Feb 19 14:52:05 server0 pgpool[3556]: =
connect_inet_domain_socket: connect() failed: No route to host<br>Feb 19 14=
:52:05 server0 pgpool[3556]: make_persistent_db_connection: connection to s=
erver1(5432) failed<br>
Feb 19 14:52:05 server0 pgpool[3556]: check_replication_time_lag: could not=
 connect to DB node 1, check sr_check_user and sr_check_password<br>Feb 19 =
14:52:05 server0 pgpool[3522]: wd_lifecheck: lifecheck failed 3 times. pgpo=
ol seems not to be working<br>
Feb 19 14:52:14 server0 pgpool[3519]: failover: no follow backends are dege=
nerated<br>Feb 19 14:52:14 server0 pgpool[3519]: failover: set new primary =
node: -1<br>Feb 19 14:52:14 server0 pgpool[3519]: failover: set new master =
node: 0<br>
Feb 19 14:52:14 server0 pgpool[3980]: connection received: host=3Dserver0.l=
ocal port=3D45361<br>Feb 19 14:52:14 server0 pgpool[3556]: worker process r=
eceived restart request<br>Feb 19 14:52:14 server0 pgpool[3519]: failover d=
one. shutdown host server1(5432)<br>
<br></div>As you can see, postgreSQL did not restarted.<br><br></div>From a=
 successful failover, the log should looks like this:<br>Feb 19 13:47:01 se=
rver0 pgpool[4391]: execute command: /home/pgpool/failover.py -d 1 -h serve=
r1 -p 5432 -D /opt/postgres/9.2/data -m 0 -H server0 -M 0 -P 1 -r 5432 -R /=
opt/postgres/9.2/data<br>
Feb 19 13:47:01 server0 postgres[4786]: [2-1] LOG:=A0 incomplete startup pa=
cket<br>Feb 19 13:47:02 server0 pgpool[4391]: find_primary_node_repeatedly:=
 waiting for finding a primary node<br>Feb 19 13:47:06 server0 postgres[308=
3]: [6-1] LOG:=A0 trigger file found: /opt/postgres/9.2/data/trigger_file0<=
br>
Feb 19 13:47:06 server0 postgres[3083]: [7-1] LOG:=A0 redo done at 0/910000=
20<br>Feb 19 13:47:06 server0 postgres[3083]: [8-1] LOG:=A0 selected new ti=
meline ID: 18<br>Feb 19 13:47:06 server0 postgres[3083]: [9-1] LOG:=A0 arch=
ive recovery complete<br>
Feb 19 13:47:06 server0 postgres[3081]: [2-1] LOG:=A0 database system is re=
ady to accept connections<br>Feb 19 13:47:06 server0 postgres[4804]: [2-1] =
LOG:=A0 autovacuum launcher started<br>Feb 19 13:47:07 server0 pgpool[4391]=
: find_primary_node: primary node id is 0<br>
Feb 19 13:47:07 server0 pgpool[4391]: starting follow degeneration. shutdow=
n host server1(5432)<br>Feb 19 13:47:07 server0 pgpool[4391]: failover: 1 f=
ollow backends have been degenerated<br>Feb 19 13:47:07 server0 pgpool[4391=
]: failover: set new primary node: 0<br>
Feb 19 13:47:07 server0 pgpool[4391]: failover: set new master node: 0<br>F=
eb 19 13:47:07 server0 pgpool[4817]: connection received: host=3Dserver1.lo=
cal port=3D54619<br>Feb 19 13:47:07 server0 pgpool[4816]: do_child: failbac=
k event found. restart myself.<br>
Feb 19 13:47:07 server0 pgpool[4818]: do_child: failback event found. resta=
rt myself.<br>Feb 19 13:47:07 server0 pgpool[4819]: do_child: failback even=
t found. restart myself.<br>Feb 19 13:47:07 server0 pgpool[4815]: start tri=
ggering follow command.<br>
Feb 19 13:47:07 server0 pgpool[4815]: execute command: echo 1 server1 5432 =
/opt/postgres/9.2/data 0 server0 0 1 5432 /opt/postgres/9.2/data %<br>Feb 1=
9 13:47:07 server0 pgpool[4821]: do_child: failback event found. restart my=
self.<br>
Feb 19 13:47:07 server0 pgpool[4822]: do_child: failback event found. resta=
rt myself.<br>Feb 19 13:47:07 server0 pgpool[4823]: do_child: failback even=
t found. restart myself.<br>Feb 19 13:47:07 server0 pgpool[4820]: do_child:=
 failback event found. restart myself.<br>
Feb 19 13:47:07 server0 pgpool[4827]: do_child: failback event found. resta=
rt myself.<br>Feb 19 13:47:07 server0 pgpool[4828]: do_child: failback even=
t found. restart myself.<br>Feb 19 13:47:07 server0 pgpool[4826]: do_child:=
 failback event found. restart myself.<br>
Feb 19 13:47:07 server0 pgpool[4829]: do_child: failback event found. resta=
rt myself.<br>Feb 19 13:47:07 server0 pgpool[4830]: do_child: failback even=
t found. restart myself.<br>Feb 19 13:47:07 server0 pgpool[4831]: do_child:=
 failback event found. restart myself.<br>
Feb 19 13:47:07 server0 pgpool[4832]: do_child: failback event found. resta=
rt myself.<br>Feb 19 13:47:07 server0 pgpool[4833]: do_child: failback even=
t found. restart myself.<br>Feb 19 13:47:07 server0 pgpool[4834]: do_child:=
 failback event found. restart myself.<br>
Feb 19 13:47:07 server0 pgpool[4833]: do_child: failback event found. resta=
rt myself.<br>Feb 19 13:47:07 server0 pgpool[4834]: do_child: failback even=
t found. restart myself.<br>Feb 19 13:47:07 server0 pgpool[4835]: do_child:=
 failback event found. restart myself.<br>
Feb 19 13:47:07 server0 pgpool[4836]: do_child: failback event found. resta=
rt myself.<br>Feb 19 13:47:07 server0 pgpool[4837]: do_child: failback even=
t found. restart myself.<br>Feb 19 13:47:07 server0 pgpool[4391]: failover =
done. shutdown host server1(5432)<br>
<br></div>Any idea why failover did not work after the recovery? Do i need =
to restart remote pgpool after recovery?<br></div><div class=3D"gmail_extra=
"><br><br><div class=3D"gmail_quote">On Mon, Feb 18, 2013 at 4:24 PM, ning =
chan <span dir=3D"ltr">&lt;<a href=3D"mailto:ninchan8328 at gmail.com" target=
=3D"_blank">ninchan8328 at gmail.com</a>&gt;</span> wrote:<br>
<blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p=
x #ccc solid;padding-left:1ex"><div dir=3D"ltr"><div><div><div><div><div><d=
iv><div><div><div><div><div><div><div><div><br>Hi Tatsuo,<br></div>Sorry fo=
r the late reply as I was traveling. <br>
</div>I am able to reproduce the problem and here is what I did:<br>
<br></div><b>1) Make sure primary (server0) and standby (server1) connectio=
ns are good.</b><br>[root at server0 tmp]# psql -c &quot;show pool_nodes&quot;=
 -p 9999<br>=A0node_id | hostname | port | status | lb_weight |=A0 role=A0=
=A0 <br>

---------+----------+------+--------+-----------+---------<br>=A00=A0=A0=A0=
=A0=A0=A0 | server0=A0 | 5432 | 2=A0=A0=A0=A0=A0 | 0.500000=A0 | primary<br=
>=A01=A0=A0=A0=A0=A0=A0 | server1=A0 | 5432 | 2=A0=A0=A0=A0=A0 | 0.500000=
=A0 | standby<br>(2 rows)<br><br>[root at server1 tmp]# psql -c &quot;show poo=
l_nodes&quot; -p 9999<br>

=A0node_id | hostname | port | status | lb_weight |=A0 role=A0=A0 <br>-----=
----+----------+------+--------+-----------+---------<br>=A00=A0=A0=A0=A0=
=A0=A0 | server0=A0 | 5432 | 2=A0=A0=A0=A0=A0 | 0.500000=A0 | primary<br>=
=A01=A0=A0=A0=A0=A0=A0 | server1=A0 | 5432 | 2=A0=A0=A0=A0=A0 | 0.500000=A0=
 | standby<br>

(2 rows)<br><br></div><b>2) force shutdown primary (server0), and standby (=
server1) server was promoted, below is the pgpool log from server1 during t=
he failover:</b><br>Feb 18 15:31:45 server1 pgpool[2691]: connect_inet_doma=
in_socket: connect() failed: No route to host<br>

Feb 18 15:31:45 server1 pgpool[2691]: make_persistent_db_connection: connec=
tion to server0(5432) failed<br>Feb 18 15:31:45 server1 pgpool[2691]: check=
_replication_time_lag: could not connect to DB node 0, check sr_check_user =
and sr_check_password<br>

Feb 18 15:31:45 server1 pgpool[2689]: connect_inet_domain_socket: connect()=
 failed: No route to host<br>Feb 18 15:31:45 server1 pgpool[2689]: connecti=
on to server0(5432) failed<br>Feb 18 15:31:45 server1 pgpool[2689]: new_con=
nection: create_cp() failed<br>

Feb 18 15:31:45 server1 pgpool[2689]: degenerate_backend_set: 0 fail over r=
equest from pid 2689<br>Feb 18 15:31:48 server1 pgpool[2689]: wd_create_sen=
d_socket: connect() reports failure (No route to host). You can safely igno=
re this while starting up.<br>

Feb 18 15:31:48 server1 pgpool[2653]: connect_inet_domain_socket: connect()=
 failed: No route to host<br>Feb 18 15:31:48 server1 pgpool[2653]: make_per=
sistent_db_connection: connection to server0(5432) failed<br>Feb 18 15:31:4=
8 server1 pgpool[2653]: health check failed. 0 th host server0 at port 5432=
 is down<br>

Feb 18 15:31:48 server1 pgpool[2653]: health check retry sleep time: 5 seco=
nd(s)<br>Feb 18 15:31:48 server1 pgpool[2653]: starting degeneration. shutd=
own host server0(5432)<br>Feb 18 15:31:48 server1 pgpool[2653]: Restart all=
 children<br>

Feb 18 15:31:48 server1 pgpool[2653]: execute command: /home/pgpool/failove=
r.py -d 0 -h server0 -p 5432 -D /opt/postgres/9.2/data -m 1 -H server1 -M 0=
 -P 0 -r 5432 -R /opt/postgres/9.2/data<br>Feb 18 15:31:49 server1 pgpool[2=
653]: find_primary_node_repeatedly: waiting for finding a primary node<br>

Feb 18 15:31:53 server1 postgres[1970]: [7-1] LOG:=A0 trigger file found: /=
tmp/trigger_file0<br>Feb 18 15:31:53 server1 postgres[2597]: [3-1] FATAL:=
=A0 terminating walreceiver process due to administrator command<br>Feb 18 =
15:31:53 server1 postgres[1970]: [8-1] LOG:=A0 record with zero length at 0=
/64000418<br>

Feb 18 15:31:53 server1 postgres[1970]: [9-1] LOG:=A0 redo done at 0/640003=
B8<br>Feb 18 15:31:53 server1 postgres[1970]: [10-1] LOG:=A0 selected new t=
imeline ID: 15<br>Feb 18 15:31:53 server1 postgres[1970]: [11-1] LOG:=A0 ar=
chive recovery complete<br>

Feb 18 15:31:53 server1 postgres[1968]: [2-1] LOG:=A0 database system is re=
ady to accept connections<br>Feb 18 15:31:53 server1 postgres[3168]: [2-1] =
LOG:=A0 autovacuum launcher started<br>Feb 18 15:31:54 server1 pgpool[2653]=
: find_primary_node: primary node id is 1<br>

Feb 18 15:31:54 server1 pgpool[2653]: starting follow degeneration. shutdow=
n host server0(5432)<br>Feb 18 15:31:54 server1 pgpool[2653]: failover: 1 f=
ollow backends have been degenerated<br>Feb 18 15:31:54 server1 pgpool[2653=
]: failover: set new primary node: 1<br>

Feb 18 15:31:54 server1 pgpool[2653]: failover: set new master node: 1<br>F=
eb 18 15:31:54 server1 pgpool[3171]: connection received: host=3Dserver1.lo=
cal port=3D58346<br>Feb 18 15:31:54 server1 pgpool[3170]: start triggering =
follow command.<br>

Feb 18 15:31:54 server1 pgpool[3170]: execute command: echo 0 server0 5432 =
/opt/postgres/9.2/data 1 server1 0 0 5432 /opt/postgres/9.2/data %<br>Feb 1=
8 15:31:54 server1 pgpool[2691]: worker process received restart request<br=
>

Feb 18 15:31:54 server1 pgpool[2653]: failover done. shutdown host server0(=
5432)<br><br></div><b>3) boot up the failed server (server0) as usual, pgpo=
ol and database started up as usual.<br></b></div><div><b>after server0 boo=
t up, i check the pool_nodes info, server0 pgpool still think server0 is th=
e primary:</b><br>

[root at server0 ~]# psql -c &quot;show pool_nodes&quot; -p 9999<br>=A0node_id=
 | hostname | port | status | lb_weight |=A0 role=A0=A0 <br>---------+-----=
-----+------+--------+-----------+---------<br>=A00=A0=A0=A0=A0=A0=A0 | ser=
ver0=A0 | 5432 | 2=A0=A0=A0=A0=A0 | 0.500000=A0 | primary<br>

=A01=A0=A0=A0=A0=A0=A0 | server1=A0 | 5432 | 2=A0=A0=A0=A0=A0 | 0.500000=A0=
 | standby<br>(2 rows)<br><br></div><div>server1 pgpool still think server1=
 is the primary:<br>[root at server1 tmp]# psql -c &quot;show pool_nodes&quot;=
 -p 9999<br>=A0node_id | hostname | port | status | lb_weight |=A0 role=A0=
=A0 <br>

---------+----------+------+--------+-----------+---------<br>=A00=A0=A0=A0=
=A0=A0=A0 | server0=A0 | 5432 | 3=A0=A0=A0=A0=A0 | 0.500000=A0 | standby<br=
>=A01=A0=A0=A0=A0=A0=A0 | server1=A0 | 5432 | 2=A0=A0=A0=A0=A0 | 0.500000=
=A0 | primary<br>(2 rows)<br></div><div><br></div><b>4) kill the database o=
n server0 to prepare of the recovery</b><br>

</div><div><b>as soon as i kill the server0 database, pgpool log shows foll=
owing:</b><br>Feb 18 15:33:56 server0 pgpool[2153]: execute command: /home/=
pgpool/failover.py -d 0 -h server0 -p 5432 -D /opt/postgres/9.2/data -m 1 -=
H server1 -M 0 -P 0 -r 5432 -R /opt/postgres/9.2/data<br>

Feb 18 15:33:57 server0 pgpool[2153]: find_primary_node_repeatedly: waiting=
 for finding a primary node<br>Feb 18 15:33:57 server0 pgpool[2153]: find_p=
rimary_node: primary node id is 1<br>Feb 18 15:33:57 server0 pgpool[2153]: =
starting follow degeneration. shutdown host server0(5432)<br>

Feb 18 15:33:57 server0 pgpool[2153]: failover: 1 follow backends have been=
 degenerated<br>Feb 18 15:33:57 server0 pgpool[2153]: failover: set new pri=
mary node: 1<br>Feb 18 15:33:57 server0 pgpool[2153]: failover: set new mas=
ter node: 1<br>

Feb 18 15:33:57 server0 pgpool[2534]: start triggering follow command.<br>F=
eb 18 15:33:57 server0 pgpool[2534]: execute command: echo 0 server0 5432 /=
opt/postgres/9.2/data 1 server1 0 0 5432 /opt/postgres/9.2/data %<br>Feb 18=
 15:33:57 server0 pgpool[2190]: worker process received restart request<br>

Feb 18 15:33:57 server0 pgpool[2153]: failover done. shutdown host server0(=
5432)<br><br></div><div>pool_nodes shows:<br>[root at server0 ~]# psql -c &quo=
t;show pool_nodes&quot; -p 9999<br>=A0node_id | hostname | port | status | =
lb_weight |=A0 role=A0=A0 <br>

---------+----------+------+--------+-----------+---------<br>=A00=A0=A0=A0=
=A0=A0=A0 | server0=A0 | 5432 | 3=A0=A0=A0=A0=A0 | 0.500000=A0 | standby<br=
>=A01=A0=A0=A0=A0=A0=A0 | server1=A0 | 5432 | 2=A0=A0=A0=A0=A0 | 0.500000=
=A0 | primary<br>(2 rows)<br></div><div><br><br></div>

<b>5) on server1, execute /usr/local/bin/pcp_recovery_node 10 localhost 989=
8 pgpool password 0</b><br></div>server 1 pgpool log shows following:<br>Fe=
b 18 15:34:52 server1 pgpool[3209]: starting recovery command: &quot;SELECT=
 pgpool_recovery(&#39;basebackup.sh&#39;, &#39;server0&#39;, &#39;/opt/post=
gres/9.2/data&#39;)&quot;<br>

Feb 18 15:34:53 server1 pgpool[3209]: 1st stage is done<br>Feb 18 15:34:53 =
server1 pgpool[3209]: check_postmaster_started: try to connect to postmaste=
r on hostname:server0 database:postgres user:postgres (retry 0 times)<br>

Feb 18 15:34:53 server1 pgpool[3209]: check_postmaster_started: failed to c=
onnect to postmaster on hostname:server0 database:postgres user:postgres<br=
><br></div><b>The message &quot;fail to connect to postmaster on hostname s=
erver0&quot; continue until the command timeout.<br>

</b></div><div><b>it looks like the 1st stage command is done but pgpool on=
 server1 did not call the pgpool_remote_start to start the server0 database=
 engine although the script is there under postgresql data folder<br></b><b=
r>

[root at server1 data]# ls<br>backup_label.old=A0=A0=A0=A0 data=A0=A0=A0=A0=A0=
=A0 pg_hba.conf=A0=A0=A0 pg_notify=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 pg_stat=
_tmp=A0 PG_VERSION=A0=A0=A0=A0=A0=A0 postmaster.pid<br>base=A0=A0=A0=A0=A0=
=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 global=A0=A0=A0=A0 pg_ident.conf=A0 pgpoo=
l_remote_start=A0 pg_subtrans=A0 pg_xlog=A0=A0=A0=A0=A0=A0=A0=A0=A0 recover=
y.done<br>

basebackup.sh=A0=A0=A0=A0=A0=A0=A0 ning.test=A0 pg_log=A0=A0=A0=A0=A0=A0=A0=
=A0 pg_serial=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 pg_tblspc=A0=A0=A0 postgresq=
l.conf=A0 recovery.standby1<br>copy-base-backup.sh=A0 pg_clog=A0=A0=A0 pg_m=
ultixact=A0=A0 pg_snapshots=A0=A0=A0=A0=A0=A0=A0=A0 pg_twophase=A0 postmast=
er.opts<br></div>

<div><br></div><b>during this period, pgpool log on server0 shows the follo=
wing:</b><br>Feb 18 15:34:42 server0 pgpool[2567]: connection received: hos=
t=3Dserver0.local port=3D45366<br>Feb 18 15:34:46 server0 pgpool[2567]: con=
nection received: host=3Dserver1.local port=3D39535<br>

Feb 18 15:34:52 server0 pgpool[2567]: connection received: host=3Dserver0.l=
ocal port=3D45381<br>Feb 18 15:34:56 server0 pgpool[2565]: connection recei=
ved: host=3Dserver1.local port=3D39557<br>Feb 18 15:35:02 server0 pgpool[25=
65]: connection received: host=3Dserver0.local port=3D45396<br>

Feb 18 15:35:06 server0 pgpool[2565]: connection received: host=3Dserver1.l=
ocal port=3D39576<br>Feb 18 15:35:12 server0 pgpool[2565]: connection recei=
ved: host=3Dserver0.local port=3D45410<br>Feb 18 15:35:16 server0 pgpool[25=
65]: connection received: host=3Dserver1.local port=3D39593<br>

Feb 18 15:35:22 server0 pgpool[2565]: connection received: host=3Dserver0.l=
ocal port=3D45425<br>Feb 18 15:35:26 server0 pgpool[2565]: connection recei=
ved: host=3Dserver1.local port=3D39612<br>Feb 18 15:35:32 server0 pgpool[25=
65]: connection received: host=3Dserver0.local port=3D45440<br>

Feb 18 15:35:36 server0 pgpool[2565]: connection received: host=3Dserver1.l=
ocal port=3D39630<br>Feb 18 15:35:42 server0 pgpool[2565]: connection recei=
ved: host=3Dserver0.local port=3D45455<br>Feb 18 15:35:46 server0 pgpool[25=
65]: connection received: host=3Dserver1.local port=3D39648<br>

<br></div><b>Below is the script of recovery_1st_stage_command =3D &#39;bas=
ebackup.sh&#39;</b><br>[root at server1 ~]# more /opt/postgres/9.2/data/baseba=
ckup.sh <br>#! /bin/sh<br># Recovery script for streaming replication.<br>

# This script assumes that DB node 0 is primary, and 1 is standby.<br>#<br>=
datadir=3D$1<br>desthost=3D$2<br>destdir=3D$3<br><br>echo &quot;datadir=3D$=
datadir&quot;<br>echo &quot;desthost=3D$desthost&quot;<br>echo &quot;destdi=
r=3D$destdir&quot;<br>

<br>echo &quot;Executing basebackup.sh&quot; &gt;&gt; /var/log/replication.=
log<br><br>echo &quot;Executing start backup&quot; &gt;&gt; /var/log/replic=
ation.log<br>psql -c &quot;SELECT pg_start_backup(&#39;Streaming Replicatio=
n&#39;, true)&quot; postgres<br>

<br>echo &quot;Executing rsync&quot; &gt;&gt; /var/log/replication.log<br>e=
cho &quot;rsync -C -a --progress --delete -e ssh --exclude postgresql.conf =
--exclude postmaster.pid --exclude postmaster.<br>opts --exclude pg_log --e=
xclude pg_xlog --exclude recovery.conf $datadir/ $desthost:$destdir/&quot;<=
br>

<br>echo &quot;Renaming recovery conf file&quot; &gt;&gt;/var/log/replicati=
on.log<br>echo &quot;ssh -T $desthost mv $destdir/recovery.done $destdir/re=
covery.conf&quot;<br>ssh -T $desthost mv $destdir/recovery.done $destdir/re=
covery.conf<br>

<br>echo &quot;Removing trigger file&quot;<br>ssh -T $desthost rm -f /tmp/t=
rigger_file0 <br><br>echo &quot;Executing stop backup&quot; &gt;&gt; /var/l=
og/replication.log<br>psql -c &quot;SELECT pg_stop_backup()&quot; postgres<=
br>

<br>echo &quot;existing basebackup.sh&quot; &gt;&gt; /var/log/replication.l=
og<br><br></div><b>Manually execute the command on server1 return the follo=
wing result:</b><br><br>[root at server1 data]# ./basebackup.sh /opt/postgres/=
9.2/data/ server0 /opt/postgres/9.2/data/<br>

datadir=3D/opt/postgres/9.2/data/<br>desthost=3Dserver0<br>destdir=3D/opt/p=
ostgres/9.2/data/<br>=A0pg_start_backup <br>-----------------<br>=A00/76000=
020<br>(1 row)<br><br>rsync -C -a --progress --delete -e ssh --exclude post=
gresql.conf --exclude postmaster.pid --exclude postmaster.opts --exclude pg=
_log --exclude pg_xlog --exclude recovery.conf /opt/postgres/9.2/data// ser=
ver0:/opt/postgres/9.2/data//<br>

ssh -T server0 mv /opt/postgres/9.2/data//recovery.done /opt/postgres/9.2/d=
ata//recovery.conf<br>Removing trigger file<br>NOTICE:=A0 WAL archiving is =
not enabled; you must ensure that all required WAL segments are copied thro=
ugh other means to complete the backup<br>

=A0pg_stop_backup <br>----------------<br>=A00/760000E0<br>(1 row)<br><br><=
/div>Thanks and please advise.<span class=3D"HOEnZb"><font color=3D"#888888=
"><br><br></font></span></div><span class=3D"HOEnZb"><font color=3D"#888888=
"><div>
Ning</div></font></span><div><div class=3D"h5"><div><div><div class=3D"gmai=
l_extra"><br><br><div class=3D"gmail_quote">On Sun, Feb 17, 2013 at 7:05 AM=
, Tatsuo Ishii <span dir=3D"ltr">&lt;<a href=3D"mailto:ishii at postgresql.org=
" target=3D"_blank">ishii at postgresql.org</a>&gt;</span> wrote:<br>

<blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1p=
x #ccc solid;padding-left:1ex">Hi,<br>
<br>
It seems the standby was unable to start up. Can you show standby<br>
PostgreSQL&#39;s log? Maybe we could find the cause of the problem.<br>
<div>--<br>
Tatsuo Ishii<br>
SRA OSS, Inc. Japan<br>
English: <a href=3D"http://www.sraoss.co.jp/index_en.php" target=3D"_blank"=
>http://www.sraoss.co.jp/index_en.php</a><br>
Japanese: <a href=3D"http://www.sraoss.co.jp" target=3D"_blank">http://www.=
sraoss.co.jp</a><br>
<br>
</div><div><div>&gt; Hi Tatsuo,<br>
&gt; Thank you so much for your reply.<br>
&gt; Actually in my case, i was using the pcp_recovery command and execute =
it on<br>
&gt; the current primary server.<br>
&gt; However, if the remote node (recover node) database is off, i got the<=
br>
&gt; following message on the primary server pgpool log:<br>
&gt;<br>
&gt; Jan 31 16:58:10 server0 pgpool[2723]: starting recovery command: &quot=
;SELECT<br>
&gt; pgpool_recovery(&#39;basebackup.sh&#39;, &#39;server1&#39;, &#39;/opt/=
postgres/9.2/data&#39;)&quot;<br>
&gt; Jan 31 16:58:11 server0 pgpool[2723]: 1st stage is done<br>
&gt; Jan 31 16:58:11 server0 pgpool[2723]: check_postmaster_started: try to=
<br>
&gt; connect to postmaster on hostname:server1 database:postgres user:postg=
res<br>
&gt; (retry 0 times)<br>
&gt; Jan 31 16:58:11 server0 pgpool[2723]: check_postmaster_started: failed=
 to<br>
&gt; connect to postmaster on hostname:server1 database:postgres user:postg=
res<br>
&gt; Jan 31 16:58:13 server0 pgpool[2719]: connection received:<br>
&gt; host=3Dserver0.local port=3D58446<br>
&gt; Jan 31 16:58:14 server0 pgpool[2723]: check_postmaster_started: try to=
<br>
&gt; connect to postmaster on hostname:server1 database:postgres user:postg=
res<br>
&gt; (retry 1 times)<br>
&gt; Jan 31 16:58:14 server0 pgpool[2723]: check_postmaster_started: failed=
 to<br>
&gt; connect to postmaster on hostname:server1 database:postgres user:postg=
res<br>
&gt; Jan 31 16:58:14 server0 pgpool[2719]: connection received:<br>
&gt; host=3Dserver1.local port=3D39928<br>
&gt; Jan 31 16:58:17 server0 pgpool[2723]: check_postmaster_started: try to=
<br>
&gt; connect to postmaster on hostname:server1 database:postgres user:postg=
res<br>
&gt; (retry 2 times)<br>
&gt; Jan 31 16:58:17 server0 pgpool[2723]: check_postmaster_started: failed=
 to<br>
&gt; connect to postmaster on hostname:server1 database:postgres user:postg=
res<br>
&gt; Jan 31 16:58:20 server0 pgpool[2723]: check_postmaster_started: try to=
<br>
&gt; connect to postmaster on hostname:server1 database:postgres user:postg=
res<br>
&gt; (retry 3 times)<br>
&gt; Jan 31 16:58:20 server0 pgpool[2723]: check_postmaster_started: failed=
 to<br>
&gt; connect to postmaster on hostname:server1 database:postgres user:postg=
res<br>
&gt; Jan 31 16:58:23 server0 pgpool[2719]: connection received:<br>
&gt; host=3Dserver0.local port=3D58464<br>
&gt; Jan 31 16:58:23 server0 pgpool[2723]: check_postmaster_started: try to=
<br>
&gt; connect to postmaster on hostname:server1 database:template1 user:post=
gres<br>
&gt; (retry 0 times)<br>
&gt; Jan 31 16:58:23 server0 pgpool[2723]: check_postmaster_started: failed=
 to<br>
&gt; connect to postmaster on hostname:server1 database:template1 user:post=
gres<br>
&gt; Jan 31 16:58:26 server0 pgpool[2723]: check_postmaster_started: try to=
<br>
&gt; connect to postmaster on hostname:server1 database:template1 user:post=
gres<br>
&gt; (retry 1 times)<br>
&gt; Jan 31 16:58:26 server0 pgpool[2723]: check_postmaster_started: failed=
 to<br>
&gt; connect to postmaster on hostname:server1 database:template1 user:post=
gres<br>
&gt; Jan 31 16:58:26 server0 pgpool[2719]: connection received:<br>
&gt; host=3Dserver1.local port=3D39946<br>
&gt; Jan 31 16:58:29 server0 pgpool[2723]: check_postmaster_started: try to=
<br>
&gt; connect to postmaster on hostname:server1 database:template1 user:post=
gres<br>
&gt; (retry 2 times)<br>
&gt; Jan 31 16:58:29 server0 pgpool[2723]: check_postmaster_started: failed=
 to<br>
&gt; connect to postmaster on hostname:server1 database:template1 user:post=
gres<br>
&gt; Jan 31 16:58:32 server0 pgpool[2723]: check_postmaster_started: try to=
<br>
&gt; connect to postmaster on hostname:server1 database:template1 user:post=
gres<br>
&gt; (retry 3 times)<br>
&gt; Jan 31 16:58:32 server0 pgpool[2723]: check_postmaster_started: failed=
 to<br>
&gt; connect to postmaster on hostname:server1 database:template1 user:post=
gres<br>
&gt; Jan 31 16:58:33 server0 pgpool[2719]: connection received:<br>
&gt; host=3Dserver0.local port=3D58483<br>
&gt; Jan 31 16:58:35 server0 pgpool[2723]: check_postmaster_started: try to=
<br>
&gt; connect to postmaster on hostname:server1 database:template1 user:post=
gres<br>
&gt; (retry 4 times)<br>
&gt; Jan 31 16:58:35 server0 pgpool[2723]: check_postmaster_started: failed=
 to<br>
&gt; connect to postmaster on hostname:server1 database:template1 user:post=
gres<br>
&gt; Jan 31 16:58:38 server0 pgpool[2723]: check_postmaster_started: try to=
<br>
&gt; connect to postmaster on hostname:server1 database:template1 user:post=
gres<br>
&gt; (retry 5 times)<br>
&gt; Jan 31 16:58:38 server0 pgpool[2723]: check_postmaster_started: failed=
 to<br>
&gt; connect to postmaster on hostname:server1 database:template1 user:post=
gres<br>
&gt;<br>
&gt; here is the exact command i execute on server0 to recover server1,<br>
&gt; /usr/local/bin/pcp_recovery_node 10 localhost 9898 pgpool password 1<b=
r>
&gt;<br>
&gt; Do you have any idea why?<br>
&gt;<br>
&gt; Just FYI, we cannot pgpoolAdmin in our environment.<br>
&gt;<br>
&gt;<br>
</div></div><div><div>&gt; On Sun, Feb 17, 2013 at 12:13 AM, Tatsuo Ishii &=
lt;<a href=3D"mailto:ishii at postgresql.org" target=3D"_blank">ishii at postgres=
ql.org</a>&gt; wrote:<br>
&gt;<br>
&gt;&gt; &gt; Hi all,<br>
&gt;&gt; &gt; I have the following question regarding the recovery of a fil=
ed Primary<br>
&gt;&gt; &gt; Database Server.<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; Question 1: in the documentation, under Streaming Replication=
 Online<br>
&gt;&gt; &gt; Recovery section.<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; <a href=3D"http://www.pgpool.net/docs/latest/pgpool-en.html#s=
tream" target=3D"_blank">http://www.pgpool.net/docs/latest/pgpool-en.html#s=
tream</a><br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; in steps 6:<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; =A0 =A01. After completing online recovery, pgpool-II will st=
art PostgreSQL<br>
&gt;&gt; on<br>
&gt;&gt; &gt; =A0 =A0the standby node. Install the script for this purpose =
on each DB<br>
&gt;&gt; &gt; nodes. Sample<br>
&gt;&gt; &gt; =A0 =A0script &lt;<a href=3D"http://www.pgpool.net/docs/lates=
t/pgpool_remote_start" target=3D"_blank">http://www.pgpool.net/docs/latest/=
pgpool_remote_start</a>&gt; is<br>
&gt;&gt; &gt; =A0 =A0included in &quot;sample&quot; directory of the source=
 code. This script uses<br>
&gt;&gt; ssh.<br>
&gt;&gt; &gt; =A0 =A0You need to allow recovery_user to login from the prim=
ary node to the<br>
&gt;&gt; &gt; =A0 =A0standby node without being asked password.<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; To my understanding, i think the postgreSQL doesn&#39;t not n=
eed to be online<br>
&gt;&gt; &gt; for the recovery process right? Since later on it mentions th=
at<br>
&gt;&gt; &gt; pgpool_remote_start will start up the DB on the failed node.<=
br>
&gt;&gt;<br>
&gt;&gt; Acually standby PostgreSQL node should not be started.<br>
&gt;&gt;<br>
&gt;&gt; &gt; Question 2: in my configuration, i have 2 pgpool server with =
two<br>
&gt;&gt; backends.<br>
&gt;&gt; &gt; Will it work for oneline recovery?<br>
&gt;&gt;<br>
&gt;&gt; Yes, but online recovery process should be initiated by one of pgp=
ool,<br>
&gt;&gt; not both. If you enable pgpool-II 3.2&#39;s watchdog, it will take=
 care of<br>
&gt;&gt; neccessary interlocking.<br>
&gt;&gt;<br>
&gt;&gt; &gt; Question 3: when the failed node comes back online, should i =
use<br>
&gt;&gt; &gt; pcp_recovery from DB primary or should i use pcp_attach on th=
e failed<br>
&gt;&gt; node<br>
&gt;&gt; &gt; to recover the failed system? Actually in my case, both metho=
ds do not<br>
&gt;&gt; &gt; recover my system every time.<br>
&gt;&gt;<br>
&gt;&gt; I&#39;m confused. Didn&#39;t you start the online recovery process=
 by using<br>
&gt;&gt; pcp_recovery_node?(of course you could do it via pgpoolAdmin).<br>
&gt;&gt;<br>
&gt;&gt; Anyway pcp_recovery_node automatically attach recovered node, and =
you<br>
&gt;&gt; don&#39;t need to execute pcp_attach_node.<br>
&gt;&gt;<br>
&gt;&gt; I suggest you read tutorial:<br>
&gt;&gt; <a href=3D"http://www.pgpool.net/pgpool-web/contrib_docs/simple_sr=
_setting2/index.html" target=3D"_blank">http://www.pgpool.net/pgpool-web/co=
ntrib_docs/simple_sr_setting2/index.html</a><br>
&gt;&gt; --<br>
&gt;&gt; Tatsuo Ishii<br>
&gt;&gt; SRA OSS, Inc. Japan<br>
&gt;&gt; English: <a href=3D"http://www.sraoss.co.jp/index_en.php" target=
=3D"_blank">http://www.sraoss.co.jp/index_en.php</a><br>
&gt;&gt; Japanese: <a href=3D"http://www.sraoss.co.jp" target=3D"_blank">ht=
tp://www.sraoss.co.jp</a><br>
&gt;&gt;<br>
</div></div></blockquote></div><br></div></div></div></div></div></div>
</blockquote></div><br></div>

--20cf306f796ca7bc5704d61a2a8e--


More information about the pgpool-general mailing list