<div class="moz-cite-prefix">I should say I am using pgpool-II 3.3.0
with PostgreSQL 9.2 on CentOS 6, native replication.<br>
<br>
Sean<br>
<br>
On 13-08-01 04:41 PM, Sean Hogan wrote:<br>
</div>

Hi again,

I am having difficulty achieving a consistent database recovery in my three-PostgreSQL, two-pgpool setup (native replication). In a previous post I mentioned that I often get inconsistent counts of updated rows for a table that is probably being updated during the recovery.

The technique I'm using is a daily base backup with continuous archiving, as described at http://zetetic.net/blog/2012/3/9/point-in-time-recovery-from-backup-using-postgresql-continuo.html. As a result my setup has no recovery_1st_stage_command, uses pgpool_recovery_pitr as the second-stage command, and my pgpool_remote_start takes care of the complete restore and restart as in that blog post.
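
For concreteness (these values are illustrative, not copied from my machines), that arrangement amounts to roughly the following. In postgresql.conf on each node:

  wal_level = archive
  archive_mode = on
  archive_command = 'cp %p /var/lib/pgsql/archive/%f'   # placeholder path

and in pgpool.conf:

  recovery_user = 'postgres'
  recovery_1st_stage_command = ''                   # no first-stage script
  recovery_2nd_stage_command = 'pgpool_recovery_pitr'

Recovery of a node is then kicked off through PCP, e.g. "pcp_recovery_node 20 localhost 9898 postgres secret 1" (timeout, pgpool host, PCP port, PCP user, PCP password, node id).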

I would include my exact scripts in full, but I don't want to be one of those people who says "here's my setup, now fix it". :-) Instead I'd like to better understand the recovery phases and how they affect database availability. Here is the order of things as I understand it:

1) Start recovery by running the stage 1 script. Existing and new connections, and all queries and updates, are allowed. Updates are written to the WAL.

2) Stop accepting connections and queries. (Question: are existing connections terminated at this point?)

3) Run the stage 2 script. pgpool_recovery_pitr calls pgpool_switch_xlog() as recommended, so all WAL segments are archived by the time it completes. (A sketch of this script follows the list.)

4) Run pgpool_remote_start (also sketched below).
   a) This copies the base backup and archived WAL segments, and writes recovery.conf.
   b) Since I have hot_standby = off in postgresql.conf, connections are blocked until recovery completes.
   c) My pgpool_remote_start starts PostgreSQL on the recovered node synchronously, so it does not return until the PostgreSQL startup is finished.

5) Connections are allowed once more.
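
To be concrete about step 3 without dumping my whole setup: the second-stage script is essentially a one-liner. This is a minimal sketch, assuming the pgpool-recovery functions are installed in template1 and the archive lives at /var/lib/pgsql/archive (both placeholders):

  #!/bin/sh
  # pgpool_recovery_pitr: second-stage command, run against the primary.
  # pgpool passes the master data dir, remote host, and remote data dir,
  # but this variant does not need them.
  ARCHIVE=/var/lib/pgsql/archive   # placeholder archive directory

  # pgpool_switch_xlog() forces a WAL segment switch and waits until the
  # old segment shows up in $ARCHIVE, so every update accepted before
  # step 2 is in the archive by the time this returns.
  psql -p 5432 -c "SELECT pgpool_switch_xlog('$ARCHIVE')" template1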
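
And the shape of my pgpool_remote_start (step 4) is roughly the following; the hostname handling, paths, and backup filename are placeholders, and passwordless ssh for the postgres user is assumed:

  #!/bin/sh
  # pgpool_remote_start: restore the base backup on the dead node, point
  # recovery.conf at the WAL archive, and start PostgreSQL.
  DEST=$1      # remote hostname, passed by pgpool
  DESTDIR=$2   # remote data directory, passed by pgpool

  ssh -T postgres@$DEST "
    rm -rf $DESTDIR &&
    tar xzf /var/lib/pgsql/backups/base.tar.gz -C /var/lib/pgsql &&
    cp /var/lib/pgsql/recovery.conf.template $DESTDIR/recovery.conf &&
    pg_ctl -w -D $DESTDIR start
  "

The template's restore_command is something like 'scp master:/var/lib/pgsql/archive/%f %p', and pg_ctl's -w flag is what makes step 4c synchronous: it does not return until the server has finished WAL replay and is accepting connections.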

Is this correct? With this flow I can't see how the newly recovered node could be out of sync with the master. But the cluster behaves as if an update was not recorded in the archived WAL, or as if an update took place on the master and the other slave somewhere between step 2 and step 5.

pgpool-II's logs of the recovery look correct:

2013-08-01 16:16:09 LOG: pid 25265: send_failback_request: fail back 1 th node request from pid 25265
2013-08-01 16:16:09 LOG: pid 25257: wd_start_interlock: start interlocking
2013-08-01 16:16:09 LOG: pid 25265: wd_send_response: WD_STAND_FOR_LOCK_HOLDER received it
2013-08-01 16:16:09 LOG: pid 25257: starting fail back. reconnect host psql-vm2.compusult.net(5432)
2013-08-01 16:16:11 LOG: pid 25257: Restart all children
2013-08-01 16:16:11 LOG: pid 25257: wd_end_interlock: end interlocking
2013-08-01 16:16:12 LOG: pid 25257: failover: set new primary node: -1
2013-08-01 16:16:12 LOG: pid 25257: failover: set new master node: 0
2013-08-01 16:16:12 LOG: pid 6894: worker process received restart request
2013-08-01 16:16:12 LOG: pid 25257: failback done. reconnect host psql-vm2.compusult.net(5432)
2013-08-01 16:16:13 LOG: pid 6893: pcp child process received restart request
2013-08-01 16:16:13 LOG: pid 25257: PCP child 6893 exits with status 256 in failover()
2013-08-01 16:16:13 LOG: pid 25257: fork a new PCP child pid 7374 in failover()
2013-08-01 16:16:13 LOG: pid 25257: worker child 6894 exits with status 256
2013-08-01 16:16:13 LOG: pid 25257: fork a new worker child pid 7375

But no matter how I do it, I get something like this after running my app for a while:

2013-08-01 17:08:28 ERROR: pid 6597: pgpool detected difference of the number of inserted, updated or deleted tuples. Possible last query was: " UPDATE ws_cached_searches SET ... "
2013-08-01 17:08:28 LOG: pid 6597: CommandComplete: Number of affected tuples are: 1 0 1
2013-08-01 17:08:28 LOG: pid 6597: ReadyForQuery: Degenerate backends: 1
2013-08-01 17:08:28 LOG: pid 6597: ReadyForQuery: Number of affected tuples are: 1 0 1
2013-08-01 17:08:28 LOG: pid 25257: starting degeneration. shutdown host psql-vm2.compusult.net(5432)

It varies whether the complaint is about the node just recovered or about the other slave.

I feel like I'm missing something critical in the backup or recovery process, but it is eluding me. Can anyone help?

Thanks,
Sean