[pgpool-general: 827] Re: pgpool 3.1.3 Online Recovery gets stuck in 2nd stage
Tatsuo Ishii
ishii at postgresql.org
Tue Jul 31 10:05:42 JST 2012
I have tried with PostgreSQL 9.2 beta2 and pgpool-II 3.2.0 RC2 and it
seems to work fine.
It seems your PostgreSQL backend process 38663 is stuck somewhere.
> 2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres STATEMENT: SELECT pgpool_recovery('pgpool_recovery_pitr', 'node1', '/data')
To get further info, you might want to do:
1) Attach gdb to PostgreSQL pid 38663 to see where it is stuck.
2) Inspect pg_locks.
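For example, both checks could look like this (a dry-run sketch: the commands are only echoed, since they need a live cluster; drop the echos and substitute the pid from your own log to actually run them):

```shell
# Hypothetical pid, taken from the log lines quoted above.
PID=38663

# 1) Attach gdb non-interactively and print the backend's backtrace.
echo gdb -p "$PID" -batch -ex bt

# 2) Check pg_locks for locks held or awaited by that backend.
echo psql -p 5432 -c "SELECT locktype, relation, mode, granted FROM pg_locks WHERE pid = $PID" template1
```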
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
> Hello,
>
> we've run into the following problem while trying to set up pgpool Online Recovery with pgpool 3.1.3 and PostgreSQL 9.1.4 on RHEL 6.2.
>
> We've got replication running fine and things seem to work without problems, but Online Recovery gets stuck in the 2nd stage after the checkpoint.
>
> The simple test setup is as follows:
>
> node0: pgpool + postgresql
> node1: postgresql
>
> 1) Start both postgresql's.
> 2) Start pgpool.
> 3) Wait until health check lists both node0 and node1 status as 1.
> 4) Kill postgresql on node1.
> 5) Wait until health check lists node1 as status 3.
> 6) Issue 'pcp_recovery_node -d 200 node0 9898 postgres password 1' from node0.
> 7) Check for errors.
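Steps 3 and 5 above can be scripted rather than eyeballed; a minimal sketch using pcp_node_info (shown as an echoed dry run, since it needs a running pgpool; the timeout, host, port, and credentials mirror the pcp_recovery_node call in step 6):

```shell
# Dry run -- drop the echo to query pgpool's view of node 1.
# The status field of the output is what the health check reports:
# 1/2 = up, 3 = down.
echo pcp_node_info 10 node0 9898 postgres password 1
```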
>
> Stuck after 2nd stage checkpoint:
>
> pgpool_debug.log (node0):
>
> ...
> LOG: statement: CHECKPOINT
> 2012-07-27 15:58:26 LOG: pid 24435: CHECKPOINT in the 2nd stage done
> 2012-07-27 15:58:26 LOG: pid 24435: starting recovery command: "SELECT pgpool_recovery('pgpool_recovery_pitr', 'node1', '/data')"
> LOG: statement: SELECT pgpool_recovery('pgpool_recovery_pitr', 'node1', '/data')
> DEBUG: recovery_script: /data/pgpool_recovery_pitr /data node1 /data
> 2012-07-27 15:59:19 DEBUG: pid 24226: starting health checking
> 2012-07-27 16:00:19 DEBUG: pid 24226: starting health checking
> 2012-07-27 16:01:19 DEBUG: pid 24226: starting health checking
> 2012-07-27 16:02:19 DEBUG: pid 24226: starting health checking
> 2012-07-27 16:03:19 DEBUG: pid 24226: starting health checking
> 2012-07-27 16:04:19 DEBUG: pid 24226: starting health checking ...
>
> postgresql.log (node0):
>
> ...
> 2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres LOG: 00000: statement: CHECKPOINT
> 2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION: exec_simple_query, postgres.c:862
> 2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres LOG: 00000: statement: SELECT pgpool_recovery('copy_base_backup', 'node1', '/data')
> 2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION: exec_simple_query, postgres.c:862
> 2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres DEBUG: 00000: recovery_script: /data/copy_base_backup /data node1 /data
> 2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION: pgpool_recovery, pgpool-recovery.c:77
> 2012-07-27 15:58:23 EEST template1 node0.csc.fi(38663) [unknown] postgres STATEMENT: SELECT pgpool_recovery('copy_base_backup', 'node1', '/data')
> 2012-07-27 15:58:24 EEST postgres node0.csc.fi(38665) psql postgres LOG: 00000: statement: select pg_start_backup(' 2012-07-27_15:58:24')
> 2012-07-27 15:58:24 EEST postgres node0.csc.fi(38665) psql postgres LOCATION: exec_simple_query, postgres.c:862
> 2012-07-27 15:58:24 EEST DEBUG: 00000: archived transaction log file "0000000100000000000000E4"
> 2012-07-27 15:58:24 EEST LOCATION: pgarch_archiveXlog, pgarch.c:612
> 2012-07-27 15:58:25 EEST postgres node0.csc.fi(38667) psql postgres LOG: 00000: statement: select pg_stop_backup()
> 2012-07-27 15:58:25 EEST postgres node0.csc.fi(38667) psql postgres LOCATION: exec_simple_query, postgres.c:862
> 2012-07-27 15:58:25 EEST DEBUG: 00000: archived transaction log file "0000000100000000000000E5"
> 2012-07-27 15:58:25 EEST LOCATION: pgarch_archiveXlog, pgarch.c:612
> 2012-07-27 15:58:25 EEST DEBUG: 00000: archived transaction log file "0000000100000000000000E5.00000020.backup"
> 2012-07-27 15:58:25 EEST LOCATION: pgarch_archiveXlog, pgarch.c:612
> 2012-07-27 15:58:26 EEST postgres node0.csc.fi(38667) psql postgres NOTICE: 00000: pg_stop_backup complete, all required WAL segments have been archived
> 2012-07-27 15:58:26 EEST postgres node0.csc.fi(38667) psql postgres LOCATION: do_pg_stop_backup, xlog.c:9405
> 2012-07-27 15:58:26 EEST postgres node0.csc.fi(38667) psql postgres STATEMENT: select pg_stop_backup()
> 2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres LOG: 00000: statement: CHECKPOINT
> 2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION: exec_simple_query, postgres.c:862
> 2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres LOG: 00000: statement: SELECT pgpool_recovery('pgpool_recovery_pitr', 'node1', '/data')
> 2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION: exec_simple_query, postgres.c:862
> 2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres DEBUG: 00000: recovery_script: /data/pgpool_recovery_pitr /data node1 /data
> 2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres LOCATION: pgpool_recovery, pgpool-recovery.c:77
> 2012-07-27 15:58:26 EEST template1 node0.csc.fi(38663) [unknown] postgres STATEMENT: SELECT pgpool_recovery('pgpool_recovery_pitr', 'node1', '/data')
> 2012-07-27 15:58:28 EEST DEBUG: 00000: autovacuum: processing database "postgres"
> 2012-07-27 15:58:28 EEST LOCATION: AutoVacWorkerMain, autovacuum.c:1634
> 2012-07-27 15:58:58 EEST DEBUG: 00000: autovacuum: processing database "template1"
> 2012-07-27 15:58:58 EEST LOCATION: AutoVacWorkerMain, autovacuum.c:1634
> 2012-07-27 15:59:28 EEST DEBUG: 00000: autovacuum: processing database "postgres"
> 2012-07-27 15:59:28 EEST LOCATION: AutoVacWorkerMain, autovacuum.c:1634
> 2012-07-27 15:59:58 EEST DEBUG: 00000: autovacuum: processing database "template1"
> 2012-07-27 15:59:58 EEST LOCATION: AutoVacWorkerMain, autovacuum.c:1634
> ...
>
> ----------------------
> Troubleshooting so far
> ----------------------
>
> Execution seems to freeze in the 'pgpool_recovery_pitr' script during the execution of its psql commands (revealed by ps aux | grep psql, and by inserting echo commands into the script).
>
> The two psql statements in question are:
>
> #1 psql -p $port -t -c 'SELECT datname FROM pg_database WHERE NOT datistemplate AND datallowconn' template1
> #2 psql -p $port -c "SELECT pgpool_switch_xlog('$archdir')" template1
>
> Execution gets stuck on #1. We also tried commenting out psql command #1 to see whether execution would then get stuck on psql command #2; it does.
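While the psql hangs, checking pg_stat_activity from another session would show whether the statement ever reached a backend at all; a sketch (echoed as a dry run) using the column names of the 9.1 series from this thread (procpid and current_query were renamed in 9.2):

```shell
# Dry run -- drop the echo to execute against a live 9.1 server.
echo psql -p 5432 -c "SELECT procpid, waiting, current_query FROM pg_stat_activity" template1
```

If the stuck statement never shows up there, the client is blocked before reaching a backend (e.g. while connecting), which points away from server-side locks.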
>
> After killing the stuck connections, running the command manually produces results:
>
> [root at node0 pgpool]# ps aux | grep psql
> postgres 5928 0.0 0.2 168432 2368 ? S 10:27 0:00 psql -p 5432 -t -c SELECT datname FROM pg_database WHERE NOT datistemplate AND datallowconn template1
> root 5931 0.0 0.0 103240 848 pts/5 S+ 10:27 0:00 grep psql
> [root at node0 pgpool]# kill 5928
> [root at node0 pgpool]# ps aux | grep psql
> postgres 5932 0.0 0.2 168432 2364 ? S 10:27 0:00 psql -p 5432 -c SELECT pgpool_switch_xlog('/backup') template1
> root 6064 0.0 0.0 103240 844 pts/5 S+ 10:27 0:00 grep psql
> [root at node0 pgpool]# kill 5932
> [root at node0 pgpool]# su - postgres
> -bash-4.1$ psql -p 5432 -t -c 'SELECT datname FROM pg_database WHERE NOT datistemplate AND datallowconn' template1
> LOG: statement: SELECT datname FROM pg_database WHERE NOT datistemplate AND datallowconn
> postgres
> -bash-4.1$ psql -p 5432 -c "SELECT pgpool_switch_xlog('/backup')" template1
> LOG: statement: SELECT pgpool_switch_xlog('/backup')
> DEBUG: get_function_oid: oid of "pg_catalog": 11
> DEBUG: get_function_oid: oid of "pg_switch_xlog": 2848
> DEBUG: get_function_oid: text typid: 25
> DEBUG: get_function_oid: oid of "pg_catalog": 11
> DEBUG: get_function_oid: oid of "pg_xlogfile_name": 2851
> LOG: pgpool_switch_xlog: waiting for "/backup/0000000100000002000000F0"
> pgpool_switch_xlog
> ----------------------------------------------------
> /backup/0000000100000002000000F0
> (1 row)
> -bash-4.1$
>
> Execution time was less than a second in both cases.
>
> Any suggestions on how to proceed in debugging this? We've been fighting with this for a week now, and things are starting to look a bit gloomy. We suspect we have some connection-related parameters wrong in pgpool.conf that cause psql connections to block.
>
> The pgpool log from the test setup is attached, along with pgpool.conf, postgresql.conf (identical on both nodes except for the port number), copy_base_backup, and pgpool_recovery_pitr. Attachments use Linux line breaks.
>
>
> -- Taneli Riitaoja
> _______________________________________________
> pgpool-general mailing list
> pgpool-general at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-general