[pgpool-hackers: 3252] Re: Segfault in a race condition

Yugo Nagata nagata at sraoss.co.jp
Mon Feb 25 19:01:41 JST 2019


Hi,

I found another race condition in 3.6.15 that causes a segfault. It was
reported by one of our customers.

On Tue, 08 Jan 2019 17:04:00 +0900 (JST)
Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> I found a segfault could happen in a race condition:
> 
> 1) frontend tries to connect to Pgpool-II
> 
> 2) there's no existing connection cache
> 
> 3) try to create new backend connections by calling connect_backend()
> 
> 4) inside connect_backend(), pool_create_cp() gets called
> 
> 5) pool_create_cp() calls new_connection()
> 
> 6) failover occurs and the global backend status is set to down, but
>    the pgpool main does not send kill signal to the child process yet
> 
> 7) inside new_connection() after checking VALID_BACKEND, it checks the
>    global backend status and finds it is set to down status, so that
>    it returns without creating new connection slot
> 
> 8) connect_backend() continues and accesses the downed connection slot
>    because local status says it's alive, which results in a segfault.
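
Steps 6) to 8) can be modelled with a small sketch in C. This is not
Pgpool-II code: the types, the two status arrays, and the functions
new_connection_sketch() and slot_usable() are hypothetical names I made
up for illustration. It only shows how a stale per-child status and an
updated shared status can leave a NULL slot behind, and how checking the
slot pointer itself avoids touching it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_BACKENDS 2                 /* hypothetical cluster size */

typedef enum { NODE_UP, NODE_DOWN } NodeStatus;

typedef struct { int major; } Slot;    /* stand-in for a connection slot */

typedef struct {
    Slot *slots[NUM_BACKENDS];         /* NULL means no slot was created */
} ConnectionPool;

/* Shared status: updated by the main process when failover starts. */
NodeStatus shared_status[NUM_BACKENDS];
/* Per-child snapshot: stays stale until the child is restarted. */
NodeStatus local_status[NUM_BACKENDS];

static Slot slot_storage[NUM_BACKENDS];

/* Step 7: create a slot only if both views agree that the node is up. */
void new_connection_sketch(ConnectionPool *cp)
{
    assert(cp != NULL);
    for (int i = 0; i < NUM_BACKENDS; i++) {
        if (local_status[i] == NODE_UP && shared_status[i] == NODE_UP) {
            slot_storage[i].major = 3;
            cp->slots[i] = &slot_storage[i];
        } else {
            cp->slots[i] = NULL;       /* returns without creating a slot */
        }
    }
}

/* Step 8 made safe: do not trust the local status alone. */
bool slot_usable(const ConnectionPool *cp, int node)
{
    return node >= 0 && node < NUM_BACKENDS &&
           local_status[node] == NODE_UP && cp->slots[node] != NULL;
}
```

If node 0 is marked down in shared_status while local_status still says
it is up, new_connection_sketch() leaves slots[0] as NULL, and a caller
that checks only local_status would dereference that NULL pointer.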
 
The situation is almost the same as the one quoted above, except that
the segfault occurs in pool_do_auth() (see the backtrace and log below).

I guess pool_do_auth() was called before Req_info->master_node_id was
updated in failover(), so MASTER_CONNECTION(cp) was still referring to
the downed connection, and dereferencing MASTER_CONNECTION(cp)->sp
caused the segfault.
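
To make the guess concrete, here is a minimal sketch in C of the kind of
check that would turn this crash into an ordinary error. It is not the
actual Pgpool-II code or a proposed patch: PoolSketch, SlotSketch,
StartupPacketSketch, master_slot_or_null() and proto_major_or_error()
are hypothetical names, and master_node_id is passed as a plain argument
instead of being read from Req_info.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BACKENDS 3                       /* hypothetical cluster size */

typedef struct { int major; } StartupPacketSketch;
typedef struct { StartupPacketSketch *sp; } SlotSketch;
typedef struct { SlotSketch *slots[NUM_BACKENDS]; } PoolSketch;

/*
 * Return the slot for master_node_id, or NULL when the id is out of
 * range, the slot was never created, or it has no startup packet.
 */
SlotSketch *master_slot_or_null(const PoolSketch *cp, int master_node_id)
{
    assert(cp != NULL);
    if (master_node_id < 0 || master_node_id >= NUM_BACKENDS)
        return NULL;

    SlotSketch *slot = cp->slots[master_node_id];
    if (slot == NULL || slot->sp == NULL)
        return NULL;
    return slot;
}

/* Read the protocol major version, or return -1 instead of crashing. */
int proto_major_or_error(const PoolSketch *cp, int master_node_id)
{
    SlotSketch *slot = master_slot_or_null(cp, master_node_id);
    return slot != NULL ? slot->sp->major : -1;
}
```

With a stale master_node_id that still points at the downed node, the
slot is NULL and proto_major_or_error() returns -1, so the caller could
report an error to the frontend rather than dereference the pointer.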

Here is the backtrace from core:
=================================
Core was generated by `pgpool: accept connection                       '.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000041b993 in pool_do_auth (frontend=0x1678f28, cp=0x1668f18)
    at auth/pool_auth.c:77
77		protoMajor = MASTER_CONNECTION(cp)->sp->major;
Missing separate debuginfos, use: debuginfo-install libmemcached-0.31-1.1.el6.x86_64
(gdb) bt
#0  0x000000000041b993 in pool_do_auth (frontend=0x1678f28, cp=0x1668f18)
    at auth/pool_auth.c:77
#1  0x000000000042377f in connect_backend (sp=0x167ae78, frontend=0x1678f28)
    at protocol/child.c:954
#2  0x0000000000423fdd in get_backend_connection (frontend=0x1678f28)
    at protocol/child.c:2396
#3  0x0000000000424b94 in do_child (fds=0x16584f0) at protocol/child.c:337
#4  0x000000000040682d in fork_a_child (fds=0x16584f0, id=372)
    at main/pgpool_main.c:758
#5  0x0000000000409941 in failover () at main/pgpool_main.c:2102
#6  0x000000000040cb40 in PgpoolMain (discard_status=<value optimized out>, 
    clear_memcache_oidmaps=<value optimized out>) at main/pgpool_main.c:476
#7  0x0000000000405c44 in main (argc=<value optimized out>, 
    argv=<value optimized out>) at main/main.c:317
(gdb) l
72		int authkind;
73		int i;
74		StartupPacket *sp;
75		
76	
77		protoMajor = MASTER_CONNECTION(cp)->sp->major;
78	
79		kind = pool_read_kind(cp);
80		if (kind < 0)
81			ereport(ERROR,
=================================

Here is a snippet of the pgpool log. The child process with PID 5067 is
the one that segfaulted.
==================
(snip)
2019-02-23 18:41:35:MAIN(2743):[No Connection]:[No Connection]: LOG:  starting degeneration. shutdown host xxxxxxxx(xxxx)
2019-02-23 18:41:35:MAIN(2743):[No Connection]:[No Connection]: LOG:  Restart all children
2019-02-23 18:41:35:CHILD(5067):[No Connection]:[No Connection]: LOG:  new connection received
2019-02-23 18:41:35:CHILD(5067):[No Connection]:[No Connection]: DETAIL:  connecting host=xxxxxx port=xxxx
(snip)
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG:  child process with pid: 5066 exits with status 0
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG:  child process with pid: 5066 exited with success and will not be restarted
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: WARNING:  child process with pid: 5067 was terminated by segmentation fault
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG:  child process with pid: 5067 exited with success and will not be restarted
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG:  child process with pid: 5068 exits with status 0
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG:  child process with pid: 5068 exited with success and will not be restarted
(snip)
===================



Regards,
-- 
Yugo Nagata <nagata at sraoss.co.jp>

