[pgpool-general: 827] Re: pgpool 3.1.3 Online Recovery gets stuck in 2nd stage

Tatsuo Ishii ishii at postgresql.org
Tue Jul 31 10:05:42 JST 2012


I have tried this with PostgreSQL 9.2 beta2 and pgpool-II 3.2.0 RC2, and it
seems to work fine.

It seems your PostgreSQL backend process 38663 is stuck somewhere.

> 2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres STATEMENT:  SELECT pgpool_recovery('pgpool_recovery_pitr', 'node1', '/data')

To get further information, you might want to:

1) Attach gdb to PostgreSQL pid 38663 and see where it is stuck.

2) Inspect pg_locks.
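
For example (a rough sketch; adjust the pid and connection options to your environment):

  # 1) backtrace of the stuck backend
  gdb -p 38663
  (gdb) bt
  (gdb) detach

  # 2) from another psql session, look for ungranted locks
  psql -p 5432 -c "SELECT locktype, relation, pid, mode, granted FROM pg_locks WHERE NOT granted" template1
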
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Hello,
> 
> we've run into the following problem when trying to set up pgpool Online Recovery with pgpool 3.1.3 and PostgreSQL 9.1.4 on RHEL 6 (6.2).
> 
> We've got replication running fine and things seem to work without problems, but Online Recovery gets stuck in the 2nd stage after the checkpoint.
> 
> The simple test setup is as follows:
> 
> node0: pgpool + postgresql
> node1: postgresql.
> 
> 1) Start PostgreSQL on both nodes.
> 2) Start pgpool.
> 3) Wait until the health check lists both node0 and node1 with status 1 (checked with pcp_node_info; see the example after this list).
> 4) Kill PostgreSQL on node1.
> 5) Wait until the health check lists node1 with status 3.
> 6) Issue 'pcp_recovery_node -d 200 node0 9898 postgres password 1' from node0.
> 7) Check for errors.
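> 
> The status in steps 3 and 5 is checked with pcp_node_info, along these lines (a sketch reusing the port and credentials from the pcp_recovery_node call above; status 1 or 2 means the node is up, 3 means it is detached):
> 
>   pcp_node_info 10 node0 9898 postgres password 0   # node0
>   pcp_node_info 10 node0 9898 postgres password 1   # node1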
> 
> Stuck after 2nd stage checkpoint:
> 
> pgpool_debug.log (node0):
> 
> ...
> LOG:  statement: CHECKPOINT
> 2012-07-27 15:58:26 LOG:   pid 24435: CHECKPOINT in the 2nd stage done
> 2012-07-27 15:58:26 LOG:   pid 24435: starting recovery command: "SELECT pgpool_recovery('pgpool_recovery_pitr', 'node1', '/data')"
> LOG:  statement: SELECT pgpool_recovery('pgpool_recovery_pitr', 'node1', '/data')
> DEBUG:  recovery_script: /data/pgpool_recovery_pitr /data node1 /data
> 2012-07-27 15:59:19 DEBUG: pid 24226: starting health checking
> 2012-07-27 16:00:19 DEBUG: pid 24226: starting health checking
> 2012-07-27 16:01:19 DEBUG: pid 24226: starting health checking
> 2012-07-27 16:02:19 DEBUG: pid 24226: starting health checking
> 2012-07-27 16:03:19 DEBUG: pid 24226: starting health checking
> 2012-07-27 16:04:19 DEBUG: pid 24226: starting health checking ...
> 
> postgresql.log (node0):
> 
> ...
> 2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres LOG:  00000: statement: CHECKPOINT
> 2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION:  exec_simple_query, postgres.c:862
> 2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres LOG:  00000: statement: SELECT pgpool_recovery('copy_base_backup', 'node1', '/data')
> 2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION:  exec_simple_query, postgres.c:862
> 2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres DEBUG:  00000: recovery_script: /data/copy_base_backup /data node1 /data
> 2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION:  pgpool_recovery, pgpool-recovery.c:77
> 2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres STATEMENT:  SELECT pgpool_recovery('copy_base_backup', 'node1', '/data')
> 2012-07-27 15:58:24 EEST postgres node0.csc.fi(38665) psql postgres LOG:  00000: statement: select pg_start_backup(' 2012-07-27_15:58:24')
> 2012-07-27 15:58:24 EEST postgres node0.csc.fi(38665) psql postgres LOCATION:  exec_simple_query, postgres.c:862
> 2012-07-27 15:58:24 EEST     DEBUG:  00000: archived transaction log file "0000000100000000000000E4"
> 2012-07-27 15:58:24 EEST     LOCATION:  pgarch_archiveXlog, pgarch.c:612
> 2012-07-27 15:58:25 EEST postgres node0.csc.fi(38667) psql postgres LOG:  00000: statement: select pg_stop_backup()
> 2012-07-27 15:58:25 EEST postgres node0.csc.fi(38667) psql postgres LOCATION:  exec_simple_query, postgres.c:862
> 2012-07-27 15:58:25 EEST     DEBUG:  00000: archived transaction log file "0000000100000000000000E5"
> 2012-07-27 15:58:25 EEST     LOCATION:  pgarch_archiveXlog, pgarch.c:612
> 2012-07-27 15:58:25 EEST     DEBUG:  00000: archived transaction log file "0000000100000000000000E5.00000020.backup"
> 2012-07-27 15:58:25 EEST     LOCATION:  pgarch_archiveXlog, pgarch.c:612
> 2012-07-27 15:58:26 EEST postgres node0.csc.fi(38667) psql postgres NOTICE:  00000: pg_stop_backup complete, all required WAL segments have been archived
> 2012-07-27 15:58:26 EEST postgres node0.csc.fi(38667) psql postgres LOCATION:  do_pg_stop_backup, xlog.c:9405
> 2012-07-27 15:58:26 EEST postgres node0.csc.fi(38667) psql postgres STATEMENT:  select pg_stop_backup()
> 2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres LOG:  00000: statement: CHECKPOINT
> 2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION:  exec_simple_query, postgres.c:862
> 2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres LOG:  00000: statement: SELECT pgpool_recovery('pgpool_recovery_pitr', 'node1', '/data')
> 2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION:  exec_simple_query, postgres.c:862
> 2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres DEBUG:  00000: recovery_script: /data/pgpool_recovery_pitr /data node1 /data
> 2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION:  pgpool_recovery, pgpool-recovery.c:77
> 2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres STATEMENT:  SELECT pgpool_recovery('pgpool_recovery_pitr', 'node1', '/data')
> 2012-07-27 15:58:28 EEST     DEBUG:  00000: autovacuum: processing database "postgres"
> 2012-07-27 15:58:28 EEST     LOCATION:  AutoVacWorkerMain, autovacuum.c:1634
> 2012-07-27 15:58:58 EEST     DEBUG:  00000: autovacuum: processing database "template1"
> 2012-07-27 15:58:58 EEST     LOCATION:  AutoVacWorkerMain, autovacuum.c:1634
> 2012-07-27 15:59:28 EEST     DEBUG:  00000: autovacuum: processing database "postgres"
> 2012-07-27 15:59:28 EEST     LOCATION:  AutoVacWorkerMain, autovacuum.c:1634
> 2012-07-27 15:59:58 EEST     DEBUG:  00000: autovacuum: processing database "template1"
> 2012-07-27 15:59:58 EEST     LOCATION:  AutoVacWorkerMain, autovacuum.c:1634
> ...
> 
> ----------------------
> Troubleshooting so far
> ----------------------
> 
> Execution seems to freeze in the pgpool_recovery_pitr script while it runs its psql commands (revealed by ps aux | grep psql, and by inserting echo statements into the script).
> 
> The two psql statements in question are:
> 
> #1 psql -p $port -t -c 'SELECT datname FROM pg_database WHERE NOT datistemplate AND datallowconn' template1
> #2 psql -p $port -c "SELECT pgpool_switch_xlog('$archdir')" template1
> 
> Execution gets stuck on #1, but we also tried commenting out the #1 psql command to see whether execution then gets stuck on the #2 psql command. It does.
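> 
> While one of these psql commands is hanging, we can see what the backend is doing from another session on node0 (a sketch for PostgreSQL 9.1, where pg_stat_activity still has the procpid, waiting and current_query columns):
> 
>   psql -p 5432 -c "SELECT procpid, waiting, current_query FROM pg_stat_activity WHERE current_query <> '<IDLE>'" postgres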
> 
> After killing the stuck psql processes, running the commands manually produces results:
> 
> [root@node0 pgpool]# ps aux | grep psql
> postgres  5928  0.0  0.2 168432  2368 ?        S    10:27   0:00 psql -p 5432 -t -c SELECT datname FROM pg_database WHERE NOT datistemplate AND datallowconn template1
> root      5931  0.0  0.0 103240   848 pts/5    S+   10:27   0:00 grep psql
> [root@node0 pgpool]# kill 5928
> [root@node0 pgpool]# ps aux | grep psql
> postgres  5932  0.0  0.2 168432  2364 ?        S    10:27   0:00 psql -p 5432 -c SELECT pgpool_switch_xlog('/backup') template1
> root      6064  0.0  0.0 103240   844 pts/5    S+   10:27   0:00 grep psql
> [root@node0 pgpool]# kill 5932
> [root@node0 pgpool]# su - postgres
> -bash-4.1$ psql -p 5432 -t -c 'SELECT datname FROM pg_database WHERE NOT datistemplate AND datallowconn' template1
> LOG:  statement: SELECT datname FROM pg_database WHERE NOT datistemplate AND datallowconn
>  postgres
> 
> -bash-4.1$ psql -p 5432 -c "SELECT pgpool_switch_xlog('/backup')" template1
> LOG:  statement: SELECT pgpool_switch_xlog('/backup')
> DEBUG:  get_function_oid: oid of "pg_catalog": 11
> DEBUG:  get_function_oid: oid of "pg_switch_xlog": 2848
> DEBUG:  get_function_oid: text typid: 25
> DEBUG:  get_function_oid: oid of "pg_catalog": 11
> DEBUG:  get_function_oid: oid of "pg_xlogfile_name": 2851
> LOG:  pgpool_switch_xlog: waiting for "/backup/0000000100000002000000F0"
>                  pgpool_switch_xlog                 
> ----------------------------------------------------
>  /backup/0000000100000002000000F0
> (1 row)
> -bash-4.1$
> 
> Execution time was less than a second in both cases.
> 
> Any suggestions on how to proceed in debugging this? We've been fighting with it for a week now, and things are starting to look a bit gloomy. I suspect we have some connection-related parameters wrong in pgpool.conf that cause the psql connections to block?
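> 
> For reference, the online recovery behaviour is driven by these pgpool.conf parameters (a sketch with illustrative values only; our actual settings are in the attached pgpool.conf):
> 
>   recovery_user = 'postgres'                    # PostgreSQL user the recovery commands run as
>   recovery_password = ''                        # password for recovery_user
>   recovery_1st_stage_command = 'copy_base_backup'
>   recovery_2nd_stage_command = 'pgpool_recovery_pitr'
>   recovery_timeout = 90                         # seconds pgpool waits for recovery to finish
>   client_idle_limit_in_recovery = 0             # 0 = do not force idle clients to disconnect in the 2nd stage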
> 
> The pgpool log from the test setup is attached, as well as pgpool.conf, postgresql.conf (identical on both nodes except for the port number), copy_base_backup, and pgpool_recovery_pitr. Attachments use Linux line breaks.
> 
> 
> -- Taneli Riitaoja
> _______________________________________________
> pgpool-general mailing list
> pgpool-general at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-general

