[pgpool-hackers: 3256] Re: Segfault in a race condition

Tatsuo Ishii ishii at sraoss.co.jp
Wed Feb 27 08:25:51 JST 2019


> Hi,
> 
> I found another race condition in 3.6.15 causing a segfault, which is
> reported by our customer.
> 
> On Tue, 08 Jan 2019 17:04:00 +0900 (JST)
> Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> I found a segfault could happen in a race condition:
>> 
>> 1) frontend tries to connect to Pgpool-II
>> 
>> 2) there's no existing connection cache
>> 
>> 3) try to create new backend connections by calling connect_backend()
>> 
>> 4) inside connect_backend(), pool_create_cp() gets called
>> 
>> 5) pool_create_cp() calls new_connection()
>> 
>> 6) failover occurs and the global backend status is set to down, but
>>    the pgpool main does not send kill signal to the child process yet
>> 
>> 7) inside new_connection() after checking VALID_BACKEND, it checks the
>>    global backend status and finds it is set to down status, so that
>>    it returns without creating new connection slot
>> 
>> 8) connect_backend() continues and accesses the downed connection slot
>>    because local status says it's alive, which results in a segfault.
>  
> The situation is almost the same as above, except that the segfault
> occurs in pool_do_auth().  (See backtrace and log below)
> 
> I guess pool_do_auth() was called before Req_info->master_node_id was updated
> in failover(), so MASTER_CONNECTION(cp) was referring to the downed connection
> and MASTER_CONNECTION(cp)->sp caused the segfault.

The situation is different: the segfault explained in
[pgpool-hackers: 3214] was caused by the local node status being too old
(the global status was up to date), while in this case the global
status has not been updated yet. So we cannot employ the same fix as before.

I think a possible fix would be to check whether Req_info->switching is
true before using the MASTER_CONNECTION macro. If it is true, refuse to
accept the new connection.

What do you think?
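
To make the idea concrete, here is a minimal, self-contained sketch of
such a guard. It is not the actual Pgpool-II code: the types
ReqInfoSketch, PoolSketch, ConnSlotSketch and StartupPacketSketch, and
the function get_proto_major_guarded(), are hypothetical stand-ins for
Req_info, the connection pool, the connection slot and the startup
packet. The extra NULL check on the slot is my own assumption and is not
part of the proposal above.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_BACKENDS 2   /* hypothetical; keeps the example small */

/* Hypothetical stand-in for the startup packet (only the field used here). */
typedef struct { int major; } StartupPacketSketch;

/* Hypothetical stand-in for a per-backend connection slot. */
typedef struct { StartupPacketSketch *sp; } ConnSlotSketch;

/* Hypothetical stand-in for the pool; a downed backend has a NULL slot. */
typedef struct { ConnSlotSketch *slots[SKETCH_MAX_BACKENDS]; } PoolSketch;

/* Hypothetical stand-in for the shared Req_info area. */
typedef struct {
    volatile bool switching;       /* true while failover is in progress */
    volatile int  master_node_id;  /* may be stale while switching is true */
} ReqInfoSketch;

/*
 * Fetch the protocol major version from the master slot.
 * Returns 0 and stores the version in *major on success.
 * Returns -1 if the caller should refuse the new connection.
 */
static int
get_proto_major_guarded(const ReqInfoSketch *req, const PoolSketch *cp, int *major)
{
    int id;
    ConnSlotSketch *slot;

    /* Proposed guard: do not trust master_node_id during failover. */
    if (req->switching)
        return -1;

    id = req->master_node_id;
    if (id < 0 || id >= SKETCH_MAX_BACKENDS)
        return -1;

    /* Assumption (not in the proposal): also reject an empty slot. */
    slot = cp->slots[id];
    if (slot == NULL || slot->sp == NULL)
        return -1;

    *major = slot->sp->major;
    return 0;
}
```

Note that the flag check alone still leaves a small window if failover
starts between the check and the dereference, which is why the sketch
also tests the slot pointer before using it.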

> Here is the backtrace from core:
> =================================
> Core was generated by `pgpool: accept connection                       '.
> Program terminated with signal 11, Segmentation fault.
> #0  0x000000000041b993 in pool_do_auth (frontend=0x1678f28, cp=0x1668f18)
>     at auth/pool_auth.c:77
> 77		protoMajor = MASTER_CONNECTION(cp)->sp->major;
> Missing separate debuginfos, use: debuginfo-install libmemcached-0.31-1.1.el6.x86_64
> (gdb) bt
> #0  0x000000000041b993 in pool_do_auth (frontend=0x1678f28, cp=0x1668f18)
>     at auth/pool_auth.c:77
> #1  0x000000000042377f in connect_backend (sp=0x167ae78, frontend=0x1678f28)
>     at protocol/child.c:954
> #2  0x0000000000423fdd in get_backend_connection (frontend=0x1678f28)
>     at protocol/child.c:2396
> #3  0x0000000000424b94 in do_child (fds=0x16584f0) at protocol/child.c:337
> #4  0x000000000040682d in fork_a_child (fds=0x16584f0, id=372)
>     at main/pgpool_main.c:758
> #5  0x0000000000409941 in failover () at main/pgpool_main.c:2102
> #6  0x000000000040cb40 in PgpoolMain (discard_status=<value optimized out>, 
>     clear_memcache_oidmaps=<value optimized out>) at main/pgpool_main.c:476
> #7  0x0000000000405c44 in main (argc=<value optimized out>, 
>     argv=<value optimized out>) at main/main.c:317
> (gdb) l
> 72		int authkind;
> 73		int i;
> 74		StartupPacket *sp;
> 75		
> 76	
> 77		protoMajor = MASTER_CONNECTION(cp)->sp->major;
> 78	
> 79		kind = pool_read_kind(cp);
> 80		if (kind < 0)
> 81			ereport(ERROR,
> =======================================-
> 
> Here is a snippet of the pgpool log. PID 5067 has a segfault.
> ==================
> (snip)
> 2019-02-23 18:41:35:MAIN(2743):[No Connection]:[No Connection]: LOG:  starting degeneration. shutdown host xxxxxxxx(xxxx)
> 2019-02-23 18:41:35:MAIN(2743):[No Connection]:[No Connection]: LOG:  Restart all children
> 2019-02-23 18:41:35:CHILD(5067):[No Connection]:[No Connection]: LOG:  new connection received
> 2019-02-23 18:41:35:CHILD(5067):[No Connection]:[No Connection]: DETAIL:  connecting host=xxxxxx port=xxxx
> (snip)
> 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG:  child process with pid: 5066 exits with status 0
> 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG:  child process with pid: 5066 exited with success and will not be restarted
> 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: WARNING:  child process with pid: 5067 was terminated by segmentation fault
> 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG:  child process with pid: 5067 exited with success and will not be restarted
> 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG:  child process with pid: 5068 exits with status 0
> 2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG:  child process with pid: 5068 exited with success and will not be restarted
> (snip)
> ===================
> 
> 
> 
> Regards,
> -- 
> Yugo Nagata <nagata at sraoss.co.jp>

