<div dir="ltr">Hi, we've run into a very strange situation where a child process hangs in active state.<div><br></div><div><div><br></div><div>*** Versions</div><div><br></div><div>- os: RHEL 5.7</div><div>- pgpool: 3.3.2</div>
<div>- backend: EnterpriseDB 9.3.1.3 on x86_64</div><div>- edb jdbc driver</div><div><br></div><div><br></div><div>*** show pool_pools;</div><div><br></div><div> pool_pid | start_time | pool_id | backend_id | database | username | create_time | majorversion | minorversion | pool_counter | pool_backendpid | pool_connected</div>
<div>----------+---------------------+---------+------------+----------+--------------+---------------------+--------------+--------------+--------------+-----------------+----------------</div><div> 26462 | 2014-04-18 17:10:58 | 0 | 0 | cloud | enterprisedb | 2014-04-18 17:10:58 | 3 | 0 | 11518 | 20546 | 1</div>
<div> 26462 | 2014-04-18 17:10:58 | 0 | 1 | cloud | enterprisedb | 2014-04-18 17:10:58 | 3 | 0 | 11518 | 29160 | 1</div><div> 26462 | 2014-04-18 17:10:58 | 1 | 0 | | | | 0 | 0 | 0 | 0 | 0</div>
<div> 26462 | 2014-04-18 17:10:58 | 1 | 1 | | | | 0 | 0 | 0 | 0 | 0</div><div> 26462 | 2014-04-18 17:10:58 | 2 | 0 | | | | 0 | 0 | 0 | 0 | 0</div>
<div> 26462 | 2014-04-18 17:10:58 | 2 | 1 | | | | 0 | 0 | 0 | 0 | 0</div><div> 26462 | 2014-04-18 17:10:58 | 3 | 0 | | | | 0 | 0 | 0 | 0 | 0</div>
<div> 26462 | 2014-04-18 17:10:58 | 3 | 1 | | | | 0 | 0 | 0 | 0 | 0</div><div><br></div><div><br></div><div><div>*** pg_stat_activity</div>
<div><br></div><div>edb=# select * from pg_stat_activity where pid = 20546;</div><div> datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_</div>
<div>start | state_change | waiting | state | query</div><div>-------+---------+-------+----------+--------------+------------------+--------------+-----------------+-------------+----------------------------------+---------------------------------+-----------------</div>
<div>----------------+----------------------------------+---------+--------+----------------------------</div><div> 16391 | cloud | 20546 | 10 | enterprisedb | | 10.12.240.94 | | 30104 | 18-APR-14 17:10:58.329551 +09:00 | 18-APR-14 17:14:36.83873 +09:00 | 18-APR-14 17:14:</div>
<div>36.83873 +09:00 | 18-APR-14 17:14:36.838732 +09:00 | f | active | SET extra_float_digits = 3</div></div><div><br></div><div><br></div><div><div>*** ps output (pgpool server)</div><div><br></div><div>pgpool 32395 0.0 0.0 24160 4848 ? S Apr15 0:11 \_ /db/pgpool/bin/pgpool -n -D -f /db/pgpool/conf/pgm/pgpool.conf -F /db/pgpool/conf/pgm/pcp.conf -a /db/pgpool/conf/pgm/pool_hba.conf</div>
<div>pgpool 18534 0.0 0.0 24160 704 ? S Apr18 0:00 | \_ pgpool: PCP: wait for connection request</div><div>pgpool 18535 0.0 0.0 24160 1076 ? S Apr18 0:03 | \_ pgpool: worker process</div>
<div>pgpool 26462 0.0 0.0 830632 9032 ? S Apr18 0:10 | \_ pgpool: enterprisedb cloud ---------(redacted)---------(15878) idle</div><div>pgpool 11259 0.0 0.0 24160 876 ? S 13:31 0:00 | \_ pgpool: wait for connection request</div>
<div>pgpool 11261 0.0 0.0 24160 864 ? S 13:31 0:00 | \_ pgpool: wait for connection request</div><div>pgpool 11262 0.0 0.0 24160 864 ? S 13:31 0:00 | \_ pgpool: wait for connection request</div>
<div>pgpool 11263 0.0 0.0 24160 864 ? S 13:31 0:00 | \_ pgpool: wait for connection request</div><div>...</div></div><div><br></div><div><br></div><div><div>*** ps output (backend)</div><div><br></div>
<div>$ ps auxf | grep 20546</div><div>hanadmin 12046 0.0 0.0 61188 744 pts/8 S+ 15:07 0:00 \_ grep 20546</div><div>541 20546 0.0 0.0 8761172 7448 ? Ss Apr18 0:07 \_ postgres: enterprisedb cloud 10.12.240.94[30104] PARSE</div>
</div><div><br></div><div><br></div><div><br></div><div><div>*** gdb trace</div><div><br></div><div>(gdb) bt</div><div>#0 0x00000033308cd6d3 in __select_nocancel () from /lib64/libc.so.6</div><div>#1 0x0000000000418c71 in pool_check_fd (cp=<value optimized out>) at pool_process_query.c:951</div>
<div>#2 0x000000000041d0e4 in pool_read2 (cp=0xbadfc70, len=822083579) at pool_stream.c:264</div><div>#3 0x0000000000418803 in SimpleForwardToFrontend (kind=49 '1', frontend=0xbaf1180, backend=0xbad8330) at pool_process_query.c:1136</div>
<div>#4 0x00000000004499bb in ProcessBackendResponse (frontend=0xbaf1180, backend=0xbad8330, state=0x7fff94626060, num_fields=0x7fff94626066) at pool_proto_modules.c:2670</div><div>#5 0x000000000041b69d in pool_process_query (frontend=0xbaf1180, backend=0xbad8330, reset_request=0) at pool_process_query.c:289</div>
<div>#6 0x000000000040ad8a in do_child (unix_fd=4, inet_fd=5) at child.c:355</div><div>#7 0x000000000040456f in fork_a_child (unix_fd=4, inet_fd=5, id=0) at main.c:1258</div><div>#8 0x0000000000404897 in reaper () at main.c:2482</div>
<div>#9 0x0000000000404c25 in pool_sleep (second=<value optimized out>) at main.c:2679</div><div>#10 0x0000000000407a0a in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:856</div>
<div>(gdb) up</div><div>#1 0x0000000000418c71 in pool_check_fd (cp=<value optimized out>) at pool_process_query.c:951</div><div>951 fds = select(fd+1, &readmask, NULL, &exceptmask, timeoutp);</div>
<div>(gdb)</div><div>#2 0x000000000041d0e4 in pool_read2 (cp=0xbadfc70, len=822083579) at pool_stream.c:264</div><div>264 if (pool_check_fd(cp))</div><div>(gdb)</div><div>#3 0x0000000000418803 in SimpleForwardToFrontend (kind=49 '1', frontend=0xbaf1180, backend=0xbad8330) at pool_process_query.c:1136</div>
<div>1136 p = pool_read2(CONNECTION(backend, i), len);</div><div>(gdb) p len</div><div>$1 = 822083580</div><div>(gdb) p *backend->slots[i].con</div><div>$3 = {fd = 10, wbuf = 0xbae01d0 "P", wbufsz = 8192, wbufpo = 0, ssl_active = -1, hp = 0xbae21e0 "1", po = 0, bufsz = 1024, len = 0, sbuf = 0x0, sbufsz = 0, buf2 = 0x2aaf32c92010 "\004OMMIT",</div>
<div> bufsz2 = 822083584, buf3 = 0x0, bufsz3 = 0, isbackend = 1, db_node_id = 1, tstate = 73 'I', is_internal_transaction_started = 0 '\000', auth_kind = 0, pwd_size = 0,</div><div> password = '\000' <repeats 1023 times>, salt = "\000\000\000", params = {num = 0, names = 0xbae25f0, values = 0xbae2a00}, no_forward = 0, kind = 0 '\000', protoVersion = 0, raddr = {addr = {</div>
<div> ss_family = 0, __ss_align = 0, __ss_padding = '\000' <repeats 111 times>}, salen = 0}, auth_method = uaReject, auth_arg = 0x0, database = 0x0, username = 0x0}</div></div><div><br></div><div><br></div>
<div>***</div><div><br></div><div><div>`len` and `bufsz2` being over 800MB looks extremely odd. Could it be a bug of pgpool?</div><div><div>The ps output shows that the process alone actually allocated 800MB of address space.</div>
<div>This has happened several times for us, but we have no idea what caused it</div><div>and we just had to kill the process.</div></div></div><div><br></div><div>Any idea?</div><div><br></div>-- <br>cheers,<br>junegunn choi.
</div></div>