[pgpool-general: 1981] Re: trouble with recovery of a downed node
Sean Hogan
sean at compusult.net
Fri Aug 2 04:15:25 JST 2013
I should say I am using pgpool-II 3.3.0 with PostgreSQL 9.2 on CentOS 6,
native replication.
Sean
On 13-08-01 04:41 PM, Sean Hogan wrote:
> Hi again,
>
> I am having difficulty achieving a consistent database recovery in my
> three-PostgreSQL, two-pgpool setup (native replication). In a
> previous post I mentioned that I often get inconsistent counts of
> updated rows for a table that is probably being updated during the
> recovery.
>
> The technique I'm using is a daily backup with continuous archiving as
> described at
> http://zetetic.net/blog/2012/3/9/point-in-time-recovery-from-backup-using-postgresql-continuo.html.
> As a result my setup has no recovery_1st_stage_command, uses
> pgpool_recovery_pitr as the second stage command, and my
> pgpool_remote_start takes care of the complete recovery and restart as
> in that blog post.
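>
> To give the shape of that setup without dumping my whole config (these
> are not my exact files; the archive path and user are placeholders), the
> relevant lines are roughly:
>
>     # postgresql.conf -- continuous archiving, as in the blog post
>     wal_level = archive
>     archive_mode = on
>     archive_command = 'cp "%p" /var/lib/pgsql/archive/%f'
>
>     # pgpool.conf -- online recovery
>     recovery_user = 'postgres'
>     recovery_1st_stage_command = ''                       # no first stage
>     recovery_2nd_stage_command = 'pgpool_recovery_pitr'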
>
> I would include my exact scripts but I don't want to be one of those
> people who say "here's my setup, now fix it". :-) Instead I'd like
> to better understand the recovery phases and how they impact database
> availability. Here is the order of things as I understand it:
>
> 1) Start recovery by running the stage 1 script (which I leave empty).
> Existing and new connections, and all queries and updates, are still
> allowed. Updates are written to the WAL.
>
> 2) Stop accepting connections and queries. (Question: Are existing
> connections terminated at this point?)
>
> 3) Run the stage 2 script. pgpool_recovery_pitr does pgpool_switch_xlog()
> as recommended, so all WAL segments are archived by the time it completes
> (see the first sketch after this list).
>
> 4) Run pgpool_remote_start (see the second sketch after this list).
>    a) This copies the base backup and the archived WALs to the downed
>       node, and writes recovery.conf.
>    b) Since I have hot_standby = off in postgresql.conf, connections
>       are blocked until recovery completes.
>    c) My pgpool_remote_start starts PostgreSQL on the recovered node
>       synchronously, so it does not return until the PostgreSQL startup
>       is finished.
>
> 5) Connections are allowed once more.
>
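> To make step 3 concrete, here is the shape of my second stage script
> (not the exact script; the port and archive path are placeholders, and
> it assumes the pgpool_recovery functions are installed in template1):
>
>     #!/bin/bash
>     # pgpool_recovery_pitr -- second stage, run against the master
>     ARCHIVE=/var/lib/pgsql/archive    # placeholder archive directory
>
>     # Switch to a new WAL segment and wait until it has been archived,
>     # so everything committed before this point is in the archive.
>     psql -p 5432 template1 -c "SELECT pgpool_switch_xlog('$ARCHIVE');"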
>
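> And the shape of pgpool_remote_start for step 4 (again not the exact
> script; the argument meanings follow the pgpool sample scripts, and the
> paths are placeholders):
>
>     #!/bin/bash
>     # pgpool_remote_start -- run on the master after the second stage
>     DEST=$1       # recovery target host
>     DESTDIR=$2    # recovery target data directory
>     ARCHIVE=/var/lib/pgsql/archive          # placeholder
>     BACKUP=/var/lib/pgsql/backups/latest    # placeholder
>
>     # 4a) copy the base backup and the archived WAL to the downed node
>     rsync -a --delete "$BACKUP"/ "$DEST:$DESTDIR"/
>     rsync -a "$ARCHIVE"/ "$DEST:$ARCHIVE"/
>
>     # 4a) recovery.conf: replay from the archive; no recovery_target_*,
>     # so recovery runs to the end of the archived WAL
>     echo "restore_command = 'cp $ARCHIVE/%f \"%p\"'" \
>         | ssh -T "$DEST" "cat > '$DESTDIR/recovery.conf'"
>
>     # 4c) start PostgreSQL and wait (-w) until startup is finished; with
>     # hot_standby = off (4b) nothing can connect before recovery is done
>     ssh -T "$DEST" "pg_ctl -w -t 3600 -D '$DESTDIR' -l '$DESTDIR/startup.log' start"
>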
> Is this correct? With this flow I can't see how the newly recovered
> node could be out of sync with the master. But the cluster behaves as
> if an update was not recorded in the archived WAL, or as if an update
> took place on the master and the other slave somewhere between steps 2
> and 5.
>
> pgpool-II's logs of the recovery look correct:
>
> 2013-08-01 16:16:09 LOG: pid 25265: send_failback_request: fail back 1 th node request from pid 25265
> 2013-08-01 16:16:09 LOG: pid 25257: wd_start_interlock: start interlocking
> 2013-08-01 16:16:09 LOG: pid 25265: wd_send_response: WD_STAND_FOR_LOCK_HOLDER received it
> 2013-08-01 16:16:09 LOG: pid 25257: starting fail back. reconnect host psql-vm2.compusult.net(5432)
> 2013-08-01 16:16:11 LOG: pid 25257: Restart all children
> 2013-08-01 16:16:11 LOG: pid 25257: wd_end_interlock: end interlocking
> 2013-08-01 16:16:12 LOG: pid 25257: failover: set new primary node: -1
> 2013-08-01 16:16:12 LOG: pid 25257: failover: set new master node: 0
> 2013-08-01 16:16:12 LOG: pid 6894: worker process received restart request
> 2013-08-01 16:16:12 LOG: pid 25257: failback done. reconnect host psql-vm2.compusult.net(5432)
> 2013-08-01 16:16:13 LOG: pid 6893: pcp child process received restart request
> 2013-08-01 16:16:13 LOG: pid 25257: PCP child 6893 exits with status 256 in failover()
> 2013-08-01 16:16:13 LOG: pid 25257: fork a new PCP child pid 7374 in failover()
> 2013-08-01 16:16:13 LOG: pid 25257: worker child 6894 exits with status 256
> 2013-08-01 16:16:13 LOG: pid 25257: fork a new worker child pid 7375
>
> But no matter how I do it, I get something like this after running my
> app for a while:
>
> 2013-08-01 17:08:28 ERROR: pid 6597: pgpool detected difference of the number of inserted, updated or deleted tuples. Possible last query was: " UPDATE ws_cached_searches SET ... "
> 2013-08-01 17:08:28 LOG: pid 6597: CommandComplete: Number of affected tuples are: 1 0 1
> 2013-08-01 17:08:28 LOG: pid 6597: ReadyForQuery: Degenerate backends: 1
> 2013-08-01 17:08:28 LOG: pid 6597: ReadyForQuery: Number of affected tuples are: 1 0 1
> 2013-08-01 17:08:28 LOG: pid 25257: starting degeneration. shutdown host psql-vm2.compusult.net(5432)
>
> Sometimes the complaint is about the node just recovered, sometimes
> about the other slave.
>
> I feel like I'm missing something critical in the backup or recovery
> process, but it is eluding me. Can anyone help?
>
> Thanks,
> Sean
>
>
> _______________________________________________
> pgpool-general mailing list
> pgpool-general at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-general