[pgpool-hackers: 28] Re: [pgpool-II 0000005]: occasional pgpool child segmentation faults

Wed Feb 22 15:13:02 JST 2012

Forwarded from Mantis.

I think the problem is that [pcp_attach_node 0] tries to attach a down
node. pcp_attach_node attaches the specified node regardless of whether
the node is actually up or down, i.e. it is entirely the user's
responsibility to make sure he is doing the right thing. However, a
segfault is not good, of course.

[technical aspect of the problem] A pgpool child keeps an internal
status cache (my_backend_status[(backend_id)]) so that it is not
disturbed by occasional backend status changes (e.g. a backend going
down). The cache is updated whenever convenient, usually when the next
session starts, and it is used by the VALID_BACKEND macro. We also have
data in the shared memory area that records which node is the master
(where "master" is the first live backend). In this case master was set
to 0 for the reason described above, so the pgpool child looks into the
down backend's info, which causes the segfault. It seems we should
employ a strategy similar to my_backend_status for the VALID_BACKEND
macro case as well. The problem is that the macro is used so widely
that I'm not sure whether it is safe to change it. I feel like I need
to study this issue more...
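
To make the mismatch concrete, here is a minimal, self-contained sketch
of the idea. It is not pgpool-II source code. Only my_backend_status is
a name taken from pgpool; NodeStatus, Slot, shared_status,
shared_master_node_id, my_master_node_id and every function below are
hypothetical stand-ins, and plain static variables play the role of the
shared memory area. It only illustrates why a master id read from
shared state can point at a slot the child never opened, while an id
derived from the child's own cache cannot.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BACKENDS 2

typedef enum { NODE_DOWN, NODE_UP } NodeStatus;   /* hypothetical */
typedef struct { int sp; } Slot;                  /* hypothetical connection slot */

/* Stand-ins for the shared memory area (assumption: updated by the parent). */
static NodeStatus shared_status[NUM_BACKENDS];
static int shared_master_node_id = -1;

/* Child-local state, refreshed only when a session starts. */
static NodeStatus my_backend_status[NUM_BACKENDS];
static int my_master_node_id = -1;
static Slot slot_storage[NUM_BACKENDS];
static Slot *slots[NUM_BACKENDS];

/* "master" is the first live backend; returns -1 if none is up. */
static int first_live_node(const NodeStatus *status)
{
    for (int i = 0; i < NUM_BACKENDS; i++)
        if (status[i] == NODE_UP)
            return i;
    return -1;
}

/* Parent side: change a node's status and recompute the shared master id. */
void parent_set_status(int node, NodeStatus st)
{
    assert(node >= 0 && node < NUM_BACKENDS);
    shared_status[node] = st;
    shared_master_node_id = first_live_node(shared_status);
}

/* Child side: snapshot the shared status and open slots for live nodes only. */
void child_session_start(void)
{
    for (int i = 0; i < NUM_BACKENDS; i++) {
        my_backend_status[i] = shared_status[i];
        slots[i] = (my_backend_status[i] == NODE_UP) ? &slot_storage[i] : NULL;
    }
    my_master_node_id = first_live_node(my_backend_status);
}

/* Lookup through the shared master id: may return NULL in mid-session. */
Slot *master_slot_from_shared(void)
{
    int id = shared_master_node_id;
    return (id < 0) ? NULL : slots[id];
}

/* Lookup through the cached master id: always consistent with slots[]. */
Slot *master_slot_from_cache(void)
{
    return (my_master_node_id < 0) ? NULL : slots[my_master_node_id];
}
```

If a child starts a session while only node 1 is up and node 0 is then
attached in the middle of that session, master_slot_from_shared()
returns NULL (slots[0] was never opened), whereas
master_slot_from_cache() keeps returning node 1's slot until the next
child_session_start().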
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

From: pgpool Bug Tracker <bugtracker at pgpool.net>
Subject: [pgpool-II 0000005]: occasional pgpool child segmentation faults
Date: Wed, 15 Feb 2012 20:28:06 +0900
Message-ID: <e0cea16d4b29b29ed5ddac67bd671243 at www.pgpool.net>

> 
> The following issue has been SUBMITTED. 
> ====================================================================== 
> http://www.pgpool.net/mantisbt/view.php?id=5 
> ====================================================================== 
> Reported By:                tuomas
> Assigned To:                
> ====================================================================== 
> Project:                    pgpool-II
> Issue ID:                   5
> Category:                   Bug
> Reproducibility:            sometimes
> Severity:                   minor
> Priority:                   normal
> Status:                     new
> ====================================================================== 
> Date Submitted:             2012-02-15 20:28 JST
> Last Modified:              2012-02-15 20:28 JST
> ====================================================================== 
> Summary:                    occasional pgpool child segmentation faults
> Description: 
> Pgpool: 3.1.2
> Postgresql: 9.1.2
> mode: master-slave streaming replication
> 
> We are seeing some segfaults from pgpool children, possibly because for some
> reason the primary node id is different from the master node id.
> 
> 
> Feb 15 08:26:33 pgpool[7920]: ProcessFrontendResponse: failed to read kind from
> frontend. frontend abnormally exited
> Feb 15 08:26:33 kernel: pgpool[7920]: segfault at 0 ip 000000000040a505 sp
> 00007fff92887400 error 4 in pgpool[400000+de000]
> Feb 15 08:26:33 kernel: Process pgpool (pid: 7920, threadinfo ffff880101006000,
> task ffff8800a263ada0)
> Feb 15 08:26:33 pgpool[10468]: Child process 7920 was terminated by segmentation
> fault
> 
> Here's the gdb output.
> 
> #0  0x000000000040a505 in do_child
> (unix_fd=5, inet_fd=<value optimized out>) at child.c:356
> 356                     sp = MASTER_CONNECTION(backend)->sp;
> (gdb) bt
> #0  0x000000000040a505 in do_child
> (unix_fd=5, inet_fd=<value optimized out>) at child.c:356
> #1  0x00000000004047e5 in fork_a_child
> (unix_fd=5, inet_fd=6, id=5) at main.c:1072
> #2  0x000000000040509a in reaper () at
> main.c:2150
> #3  0x0000000000405cfd in pool_sleep
> (second=<value optimized out>) at main.c:2347
> #4  0x00000000004074f8 in main
> (argc=<value optimized out>, argv=<value optimized out>) at main.c:708
> 
> (gdb) print backend
> $1 = (POOL_CONNECTION_POOL *) 0xe86070
> 
> (gdb) print backend->slots
> $3 = {0x0, 0xe89d00, 0x0 <repeats 126 times>}
> 
> (gdb) print status
> $4 = POOL_END
> 
> (gdb) print Req_info->master_node_id
> $5 = 0
> 
> (gdb) print Req_info->primary_node_id
> $6 = 1
> 
> 
> Another one in different place:
> 
> Program terminated with signal 11, Segmentation fault.
> #0  0x00000000004479a2 in
> ProcessFrontendResponse (frontend=0xe870a0, backend=0xe86070) at
> pool_proto_modules.c:2012
> 2012    if (MAJOR(backend) == PROTO_MAJOR_V3)
> (gdb) bt
> #0  0x00000000004479a2 in
> ProcessFrontendResponse (frontend=0xe870a0, backend=0xe86070) at
> pool_proto_modules.c:2012
> #1  0x0000000000414e36 in
> pool_process_query (frontend=0xe870a0, backend=0xe86070, reset_request=<value
> optimized out>) at pool_process_query.c:344
> #2  0x000000000040a4f2 in do_child
> (unix_fd=5, inet_fd=<value optimized out>) at child.c:354
> #3  0x00000000004047e5 in fork_a_child
> (unix_fd=5, inet_fd=6, id=1) at main.c:1072
> #4  0x000000000040509a in reaper () at
> main.c:2150
> #5  0x0000000000405cfd in pool_sleep
> (second=<value optimized out>) at main.c:2347
> #6  0x00000000004074f8 in main
> (argc=<value optimized out>, argv=<value optimized out>) at main.c:708
> 
> (gdb) print backend
> $1 = (POOL_CONNECTION_POOL *) 0xe86070
> 
> (gdb) print backend->slots
> $2 = {0x0, 0xe850a0, 0x0 <repeats 126 times>}
> 
> (gdb) print Req_info->master_node_id
> $3 = 0
> 
> (gdb) print Req_info->primary_node_id
> $4 = 1
> 
> In both cases backend->slots[0]->sp is accessed but backend->slots[0] is null
> 
> Events that lead to this:
> 
> - Node 1 was primary and node 0 slave (previously there was a failover that
> turned node 1 to primary/master and 0 as slave)
> - pgpool lost connectivity to both backends
> - node 1 was attached back first as it was primary and continued to be primary
> and master
> - node 0 was attached back and although node 1 stayed primary as it should, it
> seems node 0 became master:
> 
> [network issue broke all connections]
> Cannot accept() new connection. all backends are down
> Replication of node:0 is behind 0 bytes from the primary server (node:1)
> [pcp_attach_node 1 here]
> send_failback_request: fail back 1 th node request from pid 20679
> starting fail back. reconnect host fdqa03.fd-qa.flowdock-int.com(5432)
> Do not restart children because we are failbacking node id 1 hosta.b.c.d
> port:5432 and we are in streaming replication mode
> find_primary_node_repeatedly: waiting for finding a primary node
> find_primary_node: primary node id is 1
> failover: set new primary node: 1
> failover: set new master node: 1  <-- here master is still 0
> worker process received restart request
> failback done. reconnect host a.b.c.d(5432)
> ...
> do_child: failback event found. restart myself.
> ...
> 
> [pcp_attach_node 0]
> send_failback_request: fail back 0 th node request from pid 21988
> starting fail back. reconnect host d.c.b.a(5432)
> Do not restart children because we are failbacking node id 0 hostd.c.b.a
> port:5432 and we are in streaming replication mode
> find_primary_node_repeatedly: waiting for finding a primary node
> find_primary_node: primary node id is 1
> failover: set new primary node: 1
> failover: set new master node: 0  <--- here master node is set to 0 for some
> reason
> failback done. reconnect host d.c.b.a(5432)
> ...
> 
> Let me know if some additional info would be useful
> 
> I don't quite see the difference between master node and primary node, so I guess
> they should always be the same, at least in streaming replication mode.
> 
> ====================================================================== 
> 
> Issue History 
> Date Modified    Username       Field                    Change               
> ====================================================================== 
> 2012-02-15 20:28 tuomas         New Issue                                    
> ======================================================================
>