[pgpool-general: 1981] Re: trouble with recovery of a downed node
Sean Hogan
sean at compusult.net
Fri Aug 2 04:15:25 JST 2013
I should say I am using pgpool-II 3.3.0 with PostgreSQL 9.2 on CentOS 6,
native replication.
Sean
On 13-08-01 04:41 PM, Sean Hogan wrote:
> Hi again,
>
> I am having difficulty achieving a consistent database recovery in my
> three-PostgreSQL, two-pgpool setup (native replication). In a
> previous post I mentioned that I often get inconsistent counts of
> updated rows for a table that is probably being updated during the
> recovery.
>
> The technique I'm using is a daily backup with continuous archiving as
> described at
> http://zetetic.net/blog/2012/3/9/point-in-time-recovery-from-backup-using-postgresql-continuo.html.
> As a result my setup has no recovery_1st_stage_command, uses
> pgpool_recovery_pitr as the second stage command, and my
> pgpool_remote_start takes care of the complete recovery and restart as
> in that blog post.
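>
> To give the shape of that setup without dumping my whole config (these
> are not my exact files; the archive path and user are placeholders), the
> relevant lines are roughly:
>
>     # postgresql.conf -- continuous archiving, as in the blog post
>     wal_level = archive
>     archive_mode = on
>     archive_command = 'cp "%p" /var/lib/pgsql/archive/%f'
>
>     # pgpool.conf -- online recovery
>     recovery_user = 'postgres'
>     recovery_1st_stage_command = ''                       # no first stage
>     recovery_2nd_stage_command = 'pgpool_recovery_pitr'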
>
> I would include my exact scripts but I don't want to be one of those
> people who say "here's my setup, now fix it". :-) Instead I'd like
> to better understand the recovery phases and how they impact database
> availability. Here is the order of things as I understand it:
>
> 1) Start recovery by running the stage 1 script (which I leave empty).
> Existing and new connections, and all queries and updates, are still
> allowed. Updates are written to the WAL.
>
> 2) Stop accepting connections and queries. (Question: Are existing
> connections terminated at this point?)
>
> 3) Run the stage 2 script. pgpool_recovery_pitr does pgpool_switch_xlog()
> as recommended, so all WAL segments are archived by the time it completes
> (see the first sketch after this list).
>
> 4) Run pgpool_remote_start (see the second sketch after this list).
>    a) This copies the base backup and the archived WALs to the downed
>       node, and writes recovery.conf.
>    b) Since I have hot_standby = off in postgresql.conf, connections
>       are blocked until recovery completes.
>    c) My pgpool_remote_start starts PostgreSQL on the recovered node
>       synchronously, so it does not return until the PostgreSQL startup
>       is finished.
>
> 5) Connections are allowed once more.
>
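> To make step 3 concrete, here is the shape of my second stage script
> (not the exact script; the port and archive path are placeholders, and
> it assumes the pgpool_recovery functions are installed in template1):
>
>     #!/bin/bash
>     # pgpool_recovery_pitr -- second stage, run against the master
>     ARCHIVE=/var/lib/pgsql/archive    # placeholder archive directory
>
>     # Switch to a new WAL segment and wait until it has been archived,
>     # so everything committed before this point is in the archive.
>     psql -p 5432 template1 -c "SELECT pgpool_switch_xlog('$ARCHIVE');"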
>
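> And the shape of pgpool_remote_start for step 4 (again not the exact
> script; the argument meanings follow the pgpool sample scripts, and the
> paths are placeholders):
>
>     #!/bin/bash
>     # pgpool_remote_start -- run on the master after the second stage
>     DEST=$1       # recovery target host
>     DESTDIR=$2    # recovery target data directory
>     ARCHIVE=/var/lib/pgsql/archive          # placeholder
>     BACKUP=/var/lib/pgsql/backups/latest    # placeholder
>
>     # 4a) copy the base backup and the archived WAL to the downed node
>     rsync -a --delete "$BACKUP"/ "$DEST:$DESTDIR"/
>     rsync -a "$ARCHIVE"/ "$DEST:$ARCHIVE"/
>
>     # 4a) recovery.conf: replay from the archive; no recovery_target_*,
>     # so recovery runs to the end of the archived WAL
>     echo "restore_command = 'cp $ARCHIVE/%f \"%p\"'" \
>         | ssh -T "$DEST" "cat > '$DESTDIR/recovery.conf'"
>
>     # 4c) start PostgreSQL and wait (-w) until startup is finished; with
>     # hot_standby = off (4b) nothing can connect before recovery is done
>     ssh -T "$DEST" "pg_ctl -w -t 3600 -D '$DESTDIR' -l '$DESTDIR/startup.log' start"
>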
> Is this correct? With this flow I can't see how the newly recovered
> node could be out of sync with the master. But the cluster behaves as
> if an update was not recorded in the archived WAL, or as if an update
> took place on the master and the other slave somewhere between steps 2
> and 5.
>
> pgpool-II's logs of the recovery look correct:
>
> 2013-08-01 16:16:09 LOG: pid 25265: send_failback_request: fail back 1 th node request from pid 25265
> 2013-08-01 16:16:09 LOG: pid 25257: wd_start_interlock: start interlocking
> 2013-08-01 16:16:09 LOG: pid 25265: wd_send_response: WD_STAND_FOR_LOCK_HOLDER received it
> 2013-08-01 16:16:09 LOG: pid 25257: starting fail back. reconnect host psql-vm2.compusult.net(5432)
> 2013-08-01 16:16:11 LOG: pid 25257: Restart all children
> 2013-08-01 16:16:11 LOG: pid 25257: wd_end_interlock: end interlocking
> 2013-08-01 16:16:12 LOG: pid 25257: failover: set new primary node: -1
> 2013-08-01 16:16:12 LOG: pid 25257: failover: set new master node: 0
> 2013-08-01 16:16:12 LOG: pid 6894: worker process received restart request
> 2013-08-01 16:16:12 LOG: pid 25257: failback done. reconnect host psql-vm2.compusult.net(5432)
> 2013-08-01 16:16:13 LOG: pid 6893: pcp child process received restart request
> 2013-08-01 16:16:13 LOG: pid 25257: PCP child 6893 exits with status 256 in failover()
> 2013-08-01 16:16:13 LOG: pid 25257: fork a new PCP child pid 7374 in failover()
> 2013-08-01 16:16:13 LOG: pid 25257: worker child 6894 exits with status 256
> 2013-08-01 16:16:13 LOG: pid 25257: fork a new worker child pid 7375
>
> But no matter how I do it, I get something like this after running my
> app for a while:
>
> 2013-08-01 17:08:28 ERROR: pid 6597: pgpool detected difference of the number of inserted, updated or deleted tuples. Possible last query was: " UPDATE ws_cached_searches SET ... "
> 2013-08-01 17:08:28 LOG: pid 6597: CommandComplete: Number of affected tuples are: 1 0 1
> 2013-08-01 17:08:28 LOG: pid 6597: ReadyForQuery: Degenerate backends: 1
> 2013-08-01 17:08:28 LOG: pid 6597: ReadyForQuery: Number of affected tuples are: 1 0 1
> 2013-08-01 17:08:28 LOG: pid 25257: starting degeneration. shutdown host psql-vm2.compusult.net(5432)
>
> Sometimes the complaint is about the node just recovered, sometimes
> about the other slave.
>
> I feel like I'm missing something critical in the backup or recovery
> process, but it is eluding me. Can anyone help?
>
> Thanks,
> Sean
>
>
> _______________________________________________
> pgpool-general mailing list
> pgpool-general at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-general