[pgpool-general: 2396] Re: native replication PITR problems

Videanu Adrian videanuadrian at yahoo.com
Sat Jan 11 17:18:53 JST 2014

Hi all,

After further reading of the pgpool tutorial I think I found the problem, but I want to know if I understood this correctly:

"Data synchronization is finalized during what is called "second stage".
Before entering the second stage, pgpool-II waits until all clients have disconnected.
It blocks any new incoming connection until the second stage is over. 
After all connections have terminated, pgpool-II merges updated data between
the first stage and the second stage. This is the final data
synchronization step. 
Note that there is a restriction about online recovery. If pgpool-II itself
is installed on multiple hosts, online recovery does not work correctly,
because pgpool-II has to stop all clients during the 2nd stage of
online recovery. If there are several pgpool hosts, only one will have received
the online recovery command and will block connections. "  

So, what I understand from this is that when I perform online recovery I should stop the standby pgpool server. Also, there should be no connections left open on the primary server, and no new connections will be accepted until the recovery is finished. The entire cluster will basically be down while the 2nd stage of the recovery process is running.
Are these assumptions correct?
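If that reading is correct, the procedure would look something like the sketch below. This is only my guess at the steps: the paths, host names, pcp port, and password are made up, and the pcp_recovery_node argument order is taken from the 3.3-era docs.

```shell
# On the STANDBY pgpool host: stop pgpool first, so that only the
# active instance handles the recovery. Otherwise the standby keeps
# accepting client connections and the 2nd stage never finishes.
pgpool -m fast stop

# On the ACTIVE pgpool host: trigger online recovery of backend 0.
# pcp_recovery_node <timeout> <host> <pcp_port> <user> <password> <node_id>
pcp_recovery_node 90 localhost 9898 postgres secret 0
```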
Also, it sometimes happens that my standby pgpool node detects one PostgreSQL backend as down and degenerates it.
My question here is: why does the pgpool standby node detach Postgres backends at all, given that it is the STAND-BY and not the ACTIVE node?

Adrian Videanu

 From: Videanu Adrian <videanuadrian at yahoo.com>
To: "pgpool-general at pgpool.net" <pgpool-general at pgpool.net> 
Sent: Friday, January 10, 2014 12:21 PM
Subject: [pgpool-general: 2395] native replication PITR problems


I have a pgpool 3.3.2 cluster using native replication with 2 Postgresql 9.2 nodes with online recovery using PITR.

The problem is that from time to time one of the nodes gets disconnected (I do not know why, because the load is very low and the machines are on the same subnet), and when I try to recover it with the pgpool-admin recovery button, the recovery process apparently freezes after the first stage and nothing happens. During this time pgpool cannot be accessed; in fact I guess that the connections are made, but pgpool somehow waits for something...

Active Pgpool machine logs :

// the first postgresql node is declared dead (I have no idea why... how can I debug this kind of issue?)

Jan 10 09:36:51 pgpool133 pgpool[26722]: wd_send_response: WD_STAND_FOR_LOCK_HOLDER received it
Jan 10 09:36:51 pgpool133 pgpool[26722]: degenerate_backend_set: 0 fail over request from pid 26722
Jan 10 09:36:51 pgpool133 pgpool[26703]: wd_start_interlock: start interlocking
Jan 10 09:36:53 pgpool133 pgpool[26703]: starting degeneration. shutdown host
Jan 10 09:36:53 pgpool133 pgpool[26703]: Restart all children
Jan 10 09:37:00 pgpool133 pgpool[26703]: wd_end_interlock: end interlocking
Jan 10 09:37:01 pgpool133 pgpool[26703]: failover: set new primary node: -1
Jan 10 09:37:01 pgpool133 pgpool[26703]: failover: set new master node: 1
Jan 10 09:37:01 pgpool133 pgpool[26703]: failover done. shutdown host
Jan 10 09:37:01 pgpool133 pgpool[27029]: worker process received restart request
Jan 10 09:37:02 pgpool133 pgpool[27028]: pcp child process received restart request
Jan 10 09:37:02 pgpool133 pgpool[26703]: PCP child 27028 exits with status 256 in failover()
Jan 10 09:37:02 pgpool133 pgpool[26703]: fork a new PCP child pid 32576 in failover()
Jan 10 09:37:02 pgpool133 pgpool[26703]: worker child 27029 exits with status 256
Jan 10 09:37:02 pgpool133 pgpool[26703]: fork a new worker child pid 32577
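Since I cannot see any obvious cause for the node being declared dead, perhaps making the health check more tolerant of transient hiccups would help? A hypothetical pgpool.conf fragment with retry parameters (the values below are guesses, not tested recommendations):

```shell
# pgpool.conf (fragment, pgpool-II 3.3): retry the health check a few
# times before degenerating a backend, instead of failing on the first
# missed check. All values here are illustrative assumptions.
health_check_period = 10        # seconds between health checks
health_check_timeout = 20       # give a slow backend time to answer
health_check_max_retries = 3    # retry before declaring the node down
health_check_retry_delay = 5    # seconds between retries
```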


Before starting the recovery process I deleted everything in the archive directory and in the data directory of the node that was about to be recovered.

// start the recovery process
Jan 10 09:43:07 pgpool133 pgpool[32576]: starting recovering node 0
Jan 10 09:43:08 pgpool133 pgpool[32576]: CHECKPOINT in the 1st stage done
Jan 10 09:43:08 pgpool133 pgpool[32576]: starting recovery command: "SELECT pgpool_recovery('basebackup.sh', '', '/var/lib/postgresql/9.2/data')"
Jan 10 09:43:22 pgpool133 pgpool[32576]: 1st stage is done
Jan 10 09:43:22 pgpool133 pgpool[32576]: starting 2nd stage
... after that nothing happens
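Since the tutorial says the 2nd stage waits until all clients have disconnected, my guess is that something is still holding a session open. A quick way to check would be something like this (host/port/user are assumptions for my setup):

```shell
# List sessions still open on the live backend; the 2nd stage will not
# proceed while any client connection (including ones coming through
# the standby pgpool) is still alive. These columns exist in
# PostgreSQL 9.2's pg_stat_activity.
psql -h localhost -p 5432 -U postgres \
     -c "SELECT pid, client_addr, state FROM pg_stat_activity;"
```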

Online postgresql node logs : 
+ DATA=/var/lib/postgresql/9.2/data
+ RECOVERY_DATA=/var/lib/postgresql/9.2/data
+ ARCHIVE_DIR=/var/lib/postgresql/9.2/archive
+ psql -c 'SELECT pg_start_backup('\''pgpoo-recovery'\'')'
(1 row)

+ rsync -C -a -c -e 'ssh -p 2022' --delete --exclude postmaster.log --exclude postmaster.pid --exclude postmaster.opts --exclude pg_log --exclude recovery.conf --
+ cat
+ scp -P 2022 recovery.conf
+ rm -f recovery.conf
+ psql -c 'SELECT pg_stop_backup()' postgres
NOTICE:  pg_stop_backup complete, all required WAL segments have been archived
(1 row)

P.S. - I have experienced this kind of problem in the past, but if I tried multiple times it eventually worked. Now it seems that it doesn't want to work anymore :)

Just as I was writing this email, the second Postgres node (and the last one up) was declared down, and pgpool was not accepting connections because no backend was online. After a complete shutdown of both PostgreSQL servers and both pgpool servers I could recover node 1 as well...

I have attached my relevant conf files.

Adrian Videanu
pgpool-general mailing list
pgpool-general at pgpool.net
