<div dir="ltr"><div>Hi,</div><div>We ran into an issue where every child process hangs while waiting</div><div>for the password packet from the frontend, eventually making the whole</div><div>pgpool cluster unresponsive. The call stack of each hung process was</div>
<div>as follows:</div><div><br></div><div><br></div><div>#0 0x0000003d222cdaf3 in __select_nocancel () from /lib64/libc.so.6</div><div>#1 0x0000000000418c61 in pool_check_fd (cp=&lt;value optimized out&gt;)</div><div>
at pool_process_query.c:951</div><div>#2 0x000000000041d534 in pool_read (cp=0x1da7a210, buf=0x7ffff5092d3f, len=1)</div><div> at pool_stream.c:139</div><div>#3 0x000000000040b9f0 in read_password_packet (frontend=0x1da7a210,</div>
<div> protoMajor=&lt;value optimized out&gt;,</div><div> password=0x70a460 "md55c81acdb03ea852f30d0630528697236", pwdSize=0x70a860)</div><div> at pool_auth.c:1047</div><div>#4 0x000000000040c8d2 in do_md5 (backend=0x1da63c70, frontend=0x1da7a210,</div>
<div> reauth=1, protoMajor=3) at pool_auth.c:867</div><div>#5 0x000000000040cceb in pool_do_reauth (frontend=0x1da7a210, cp=0x1da5ec70)</div><div> at pool_auth.c:421</div><div>#6 0x000000000040a9c5 in connect_using_existing_connection (unix_fd=4,</div>
<div> inet_fd=5) at child.c:1043</div><div>#7 do_child (unix_fd=4, inet_fd=5) at child.c:330</div><div>#8 0x000000000040455f in fork_a_child (unix_fd=4, inet_fd=5, id=0)</div><div> at main.c:1258</div><div>#9 0x0000000000404887 in reaper () at main.c:2482</div>
<div>#10 0x0000000000407a47 in main (argc=&lt;value optimized out&gt;, argv=0x0)</div><div> at main.c:714</div><div><br></div><div>We looked into the code and found that the send_md5auth_request</div><div>function, which is called just before read_password_packet,</div>
<div>always returns 0. So we suspect that those processes were waiting</div><div>for a response even though they had failed to send the request</div><div>correctly. Strangely, we could not reproduce the exact problem on</div>
<div>our test cluster, but it kept happening on our</div><div>production servers several times a day. We had to find a workaround</div><div>for this recurring problem quickly, so we commented out the</div><div>do_md5 part of pool_do_reauth, effectively disabling the</div>
<div>authentication, and the hang has not occurred since. Although</div><div>our cluster is running smoothly now, md5 authentication is</div><div>crippled. I believe this problem deserves attention and a</div><div>proper fix. Thanks.</div>
<div><br></div><div><br></div>-- <br>cheers,<br>junegunn choi.
</div>