<div class="moz-cite-prefix">I should say I am using pgpool-II 3.3.0
with PostgreSQL 9.2 on CentOS 6, native replication.<br>
<br>
Sean<br>
<br>
On 13-08-01 04:41 PM, Sean Hogan wrote:<br>
</div>

Hi again,

I am having difficulty achieving a consistent database recovery in my three-PostgreSQL, two-pgpool setup (native replication). In a previous post I mentioned that I often get inconsistent counts of updated rows for a table that is probably being updated during the recovery.

The technique I'm using is a daily base backup with continuous archiving, as described at http://zetetic.net/blog/2012/3/9/point-in-time-recovery-from-backup-using-postgresql-continuo.html. As a result my setup has no recovery_1st_stage_command, uses pgpool_recovery_pitr as the second-stage command, and my pgpool_remote_start takes care of the complete restore and restart as in that blog post.
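
For concreteness (these values are illustrative, not copied from my machines), that arrangement amounts to roughly the following. In postgresql.conf on each node:

  wal_level = archive
  archive_mode = on
  archive_command = 'cp %p /var/lib/pgsql/archive/%f'   # placeholder path

and in pgpool.conf:

  recovery_user = 'postgres'
  recovery_1st_stage_command = ''                   # no first-stage script
  recovery_2nd_stage_command = 'pgpool_recovery_pitr'

Recovery of a node is then kicked off through PCP, e.g. "pcp_recovery_node 20 localhost 9898 postgres secret 1" (timeout, pgpool host, PCP port, PCP user, PCP password, node id).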

I would include my exact scripts in full, but I don't want to be one of those people who says "here's my setup, now fix it". :-) Instead I'd like to better understand the recovery phases and how they affect database availability. Here is the order of things as I understand it:

1) Start recovery by running the stage 1 script. Existing and new connections, and all queries and updates, are allowed. Updates are written to the WAL.

2) Stop accepting connections and queries. (Question: are existing connections terminated at this point?)

3) Run the stage 2 script. pgpool_recovery_pitr calls pgpool_switch_xlog() as recommended, so all WAL segments are archived by the time it completes. (A sketch of this script follows the list.)

4) Run pgpool_remote_start (also sketched below).
   a) This copies the base backup and archived WAL segments, and writes recovery.conf.
   b) Since I have hot_standby = off in postgresql.conf, connections are blocked until recovery completes.
   c) My pgpool_remote_start starts PostgreSQL on the recovered node synchronously, so it does not return until the PostgreSQL startup is finished.

5) Connections are allowed once more.
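
To be concrete about step 3 without dumping my whole setup: the second-stage script is essentially a one-liner. This is a minimal sketch, assuming the pgpool-recovery functions are installed in template1 and the archive lives at /var/lib/pgsql/archive (both placeholders):

  #!/bin/sh
  # pgpool_recovery_pitr: second-stage command, run against the primary.
  # pgpool passes the master data dir, remote host, and remote data dir,
  # but this variant does not need them.
  ARCHIVE=/var/lib/pgsql/archive   # placeholder archive directory

  # pgpool_switch_xlog() forces a WAL segment switch and waits until the
  # old segment shows up in $ARCHIVE, so every update accepted before
  # step 2 is in the archive by the time this returns.
  psql -p 5432 -c "SELECT pgpool_switch_xlog('$ARCHIVE')" template1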
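
And the shape of my pgpool_remote_start (step 4) is roughly the following; the hostname handling, paths, and backup filename are placeholders, and passwordless ssh for the postgres user is assumed:

  #!/bin/sh
  # pgpool_remote_start: restore the base backup on the dead node, point
  # recovery.conf at the WAL archive, and start PostgreSQL.
  DEST=$1      # remote hostname, passed by pgpool
  DESTDIR=$2   # remote data directory, passed by pgpool

  ssh -T postgres@$DEST "
    rm -rf $DESTDIR &&
    tar xzf /var/lib/pgsql/backups/base.tar.gz -C /var/lib/pgsql &&
    cp /var/lib/pgsql/recovery.conf.template $DESTDIR/recovery.conf &&
    pg_ctl -w -D $DESTDIR start
  "

The template's restore_command is something like 'scp master:/var/lib/pgsql/archive/%f %p', and pg_ctl's -w flag is what makes step 4c synchronous: it does not return until the server has finished WAL replay and is accepting connections.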

Is this correct? With this flow I can't see how the newly recovered node could be out of sync with the master. But the cluster behaves as if an update was not recorded in the archived WAL, or as if an update took place on the master and the other slave somewhere between step 2 and step 5.

pgpool-II's logs of the recovery look correct:

2013-08-01 16:16:09 LOG: pid 25265: send_failback_request: fail back 1 th node request from pid 25265
2013-08-01 16:16:09 LOG: pid 25257: wd_start_interlock: start interlocking
2013-08-01 16:16:09 LOG: pid 25265: wd_send_response: WD_STAND_FOR_LOCK_HOLDER received it
2013-08-01 16:16:09 LOG: pid 25257: starting fail back. reconnect host psql-vm2.compusult.net(5432)
2013-08-01 16:16:11 LOG: pid 25257: Restart all children
2013-08-01 16:16:11 LOG: pid 25257: wd_end_interlock: end interlocking
2013-08-01 16:16:12 LOG: pid 25257: failover: set new primary node: -1
2013-08-01 16:16:12 LOG: pid 25257: failover: set new master node: 0
2013-08-01 16:16:12 LOG: pid 6894: worker process received restart request
2013-08-01 16:16:12 LOG: pid 25257: failback done. reconnect host psql-vm2.compusult.net(5432)
2013-08-01 16:16:13 LOG: pid 6893: pcp child process received restart request
2013-08-01 16:16:13 LOG: pid 25257: PCP child 6893 exits with status 256 in failover()
2013-08-01 16:16:13 LOG: pid 25257: fork a new PCP child pid 7374 in failover()
2013-08-01 16:16:13 LOG: pid 25257: worker child 6894 exits with status 256
2013-08-01 16:16:13 LOG: pid 25257: fork a new worker child pid 7375

But no matter how I do it, I get something like this after running my app for a while:

2013-08-01 17:08:28 ERROR: pid 6597: pgpool detected difference of the number of inserted, updated or deleted tuples. Possible last query was: " UPDATE ws_cached_searches SET ... "
2013-08-01 17:08:28 LOG: pid 6597: CommandComplete: Number of affected tuples are: 1 0 1
2013-08-01 17:08:28 LOG: pid 6597: ReadyForQuery: Degenerate backends: 1
2013-08-01 17:08:28 LOG: pid 6597: ReadyForQuery: Number of affected tuples are: 1 0 1
2013-08-01 17:08:28 LOG: pid 25257: starting degeneration. shutdown host psql-vm2.compusult.net(5432)

It varies whether the complaint is about the node just recovered or about the other slave.

I feel like I'm missing something critical in the backup or recovery process, but it is eluding me. Can anyone help?

Thanks,
Sean