[pgpool-general: 826] pgpool 3.1.3 Online Recovery gets stuck in 2nd stage

Taneli Riitaoja Taneli.Riitaoja at csc.fi
Mon Jul 30 17:47:13 JST 2012


Hello,

we've run in to the following problem when trying to set up pgpool Online Recovery with pgpool-3.1.3 and PostgreSQL 9.1.4 in RHEL6 (6.2).

We've got replication running fine and things seem to work without problems, but Online Recovery gets stuck in the 2nd stage after the checkpoint.

The simple test setup is as follows:

node0: pgpool + postgresql
node1: postgresql.

1) Start both postgresql's.
2) Start pgpool.
3) Wait until health check lists both node0 and node1 status as 1.
4) Kill postgresql on node1.
5) Wait until health check lists node1 as status 3.
6) Issue 'pcp_recovery_node -d 200 node0 9898 postgres password 1' from node0.
7) Check for errors.

Stuck after 2nd stage checkpoint:

pgpool_debug.log (node0):

...
LOG:  statement: CHECKPOINT
2012-07-27 15:58:26 LOG:   pid 24435: CHECKPOINT in the 2nd stage done
2012-07-27 15:58:26 LOG:   pid 24435: starting recovery command: "SELECT pgpool_recovery('pgpool_recovery_pitr', 'node1', '/data')"
LOG:  statement: SELECT pgpool_recovery('pgpool_recovery_pitr', 'node1', '/data')
DEBUG:  recovery_script: /data/pgpool_recovery_pitr /data node1 /data
2012-07-27 15:59:19 DEBUG: pid 24226: starting health checking
2012-07-27 16:00:19 DEBUG: pid 24226: starting health checking
2012-07-27 16:01:19 DEBUG: pid 24226: starting health checking
2012-07-27 16:02:19 DEBUG: pid 24226: starting health checking
2012-07-27 16:03:19 DEBUG: pid 24226: starting health checking
2012-07-27 16:04:19 DEBUG: pid 24226: starting health checking ...

postgresql.log (node0):

...
2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres LOG:  00000: statement: CHECKPOINT
2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION:  exec_simple_query, postgres.c:862
2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres LOG:  00000: statement: SELECT pgpool_recovery('copy_base_backup', 'node1', '/data')
2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION:  exec_simple_query, postgres.c:862
2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres DEBUG:  00000: recovery_script: /data/copy_base_backup /data node1 /data
2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION:  pgpool_recovery, pgpool-recovery.c:77
2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres STATEMENT:  SELECT pgpool_recovery('copy_base_backup', 'node1', '/data')
2012-07-27 15:58:24 EEST postgres node0.csc.fi(38665) psql postgres LOG:  00000: statement: select pg_start_backup(' 2012-07-27_15:58:24')
2012-07-27 15:58:24 EEST postgres node0.csc.fi(38665) psql postgres LOCATION:  exec_simple_query, postgres.c:862
2012-07-27 15:58:24 EEST     DEBUG:  00000: archived transaction log file "0000000100000000000000E4"
2012-07-27 15:58:24 EEST     LOCATION:  pgarch_archiveXlog, pgarch.c:612
2012-07-27 15:58:25 EEST postgres node0.csc.fi(38667) psql postgres LOG:  00000: statement: select pg_stop_backup()
2012-07-27 15:58:25 EEST postgres node0.csc.fi(38667) psql postgres LOCATION:  exec_simple_query, postgres.c:862
2012-07-27 15:58:25 EEST     DEBUG:  00000: archived transaction log file "0000000100000000000000E5"
2012-07-27 15:58:25 EEST     LOCATION:  pgarch_archiveXlog, pgarch.c:612
2012-07-27 15:58:25 EEST     DEBUG:  00000: archived transaction log file "0000000100000000000000E5.00000020.backup"
2012-07-27 15:58:25 EEST     LOCATION:  pgarch_archiveXlog, pgarch.c:612
2012-07-27 15:58:26 EEST postgres node0.csc.fi(38667) psql postgres NOTICE:  00000: pg_stop_backup complete, all required WAL segments have been archived
2012-07-27 15:58:26 EEST postgres node0.csc.fi(38667) psql postgres LOCATION:  do_pg_stop_backup, xlog.c:9405
2012-07-27 15:58:26 EEST postgres node0.csc.fi(38667) psql postgres STATEMENT:  select pg_stop_backup()
2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres LOG:  00000: statement: CHECKPOINT
2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION:  exec_simple_query, postgres.c:862
2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres LOG:  00000: statement: SELECT pgpool_recovery('pgpool_recovery_pitr', 'node1', '/data')
2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION:  exec_simple_query, postgres.c:862
2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres DEBUG:  00000: recovery_script: /data/pgpool_recovery_pitr /data node1 /data
2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION:  pgpool_recovery, pgpool-recovery.c:77
2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres STATEMENT:  SELECT pgpool_recovery('pgpool_recovery_pitr', 'node1', '/data')
2012-07-27 15:58:28 EEST     DEBUG:  00000: autovacuum: processing database "postgres"
2012-07-27 15:58:28 EEST     LOCATION:  AutoVacWorkerMain, autovacuum.c:1634
2012-07-27 15:58:58 EEST     DEBUG:  00000: autovacuum: processing database "template1"
2012-07-27 15:58:58 EEST     LOCATION:  AutoVacWorkerMain, autovacuum.c:1634
2012-07-27 15:59:28 EEST     DEBUG:  00000: autovacuum: processing database "postgres"
2012-07-27 15:59:28 EEST     LOCATION:  AutoVacWorkerMain, autovacuum.c:1634
2012-07-27 15:59:58 EEST     DEBUG:  00000: autovacuum: processing database "template1"
2012-07-27 15:59:58 EEST     LOCATION:  AutoVacWorkerMain, autovacuum.c:1634
...

----------------------
Troubleshooting so far
----------------------

Execution seems to freeze in the 'pgpool_recovery_pitr script' in the execution of psql-commands (revealed by ps aux | grep psql, and by inserting echo commands in the script).

The two psql-statements in question are:

#1 psql -p $port -t -c 'SELECT datname FROM pg_database WHERE NOT datistemplate AND datallowconn' template1
#2 psql -p $port -c "SELECT pgpool_switch_xlog('$archdir')" template1

The execution gets stuck on #1, but we also tried commenting out the #1 psql-command to see if the execution also gets stuck in the #2 psql-command. It does.

After killing the stuck connections, running the command manually produces results:

[root at node0 pgpool]# ps aux | grep psql
postgres  5928  0.0  0.2 168432  2368 ?        S    10:27   0:00 psql -p 5432 -t -c SELECT datname FROM pg_database WHERE NOT datistemplate AND datallowconn template1
root      5931  0.0  0.0 103240   848 pts/5    S+   10:27   0:00 grep psql
[root at node0 pgpool]# kill 5928
[root at node0 pgpool]# ps aux | grep psql
postgres  5932  0.0  0.2 168432  2364 ?        S    10:27   0:00 psql -p 5432 -c SELECT pgpool_switch_xlog('/backup') template1
root      6064  0.0  0.0 103240   844 pts/5    S+   10:27   0:00 grep psql
[root at node0 pgpool]# kill 5932
[root at node0 pgpool]# su - postgres
-bash-4.1$ psql -p 5432 -t -c 'SELECT datname FROM pg_database WHERE NOT datistemplate AND datallowconn' template1
LOG:  statement: SELECT datname FROM pg_database WHERE NOT datistemplate AND datallowconn  postgres -bash-4.1$ psql -p 5432 -c "SELECT pgpool_switch_xlog('/backup')" template1
LOG:  statement: SELECT pgpool_switch_xlog('/backup')
DEBUG:  get_function_oid: oid of "pg_catalog": 11
DEBUG:  get_function_oid: oid of "pg_switch_xlog": 2848
DEBUG:  get_function_oid: text typid: 25
DEBUG:  get_function_oid: oid of "pg_catalog": 11
DEBUG:  get_function_oid: oid of "pg_xlogfile_name": 2851
LOG:  pgpool_switch_xlog: waiting for "/backup/0000000100000002000000F0"
                 pgpool_switch_xlog                 
----------------------------------------------------
 /backup/0000000100000002000000F0
(1 row)
-bash-4.1$

Execution time was less than a second in both cases.

Any suggestions on how to proceed in debugging this? Been fighting with this for a week now, and things are starting to look a bit gloomy. I expect we have some connection related parameters wrong in pgpool.conf that causes blocking for psql-connections?

pgpool-log from the test setup is attached, as well as pgpool.conf and postgresql.conf (postgresql.conf are identical on both nodes, except for port number) and copy_base_backup and pgpool_recovery_pitr. Attachments use Linux line breaks.


-- Taneli Riitaoja


More information about the pgpool-general mailing list