[pgpool-hackers: 1972] Re: Problem with log_line_prefix()

Tatsuo Ishii ishii at sraoss.co.jp
Tue Jan 10 12:08:01 JST 2017


Hi Usama,

Problem is, there are some places where standard library functions
that calls malloc() internally.
(For example, asprintf(), getaddrinfo()).

Also blocking signals/unblocking signals every time malloc() called
might hurt performance (I have not done an actual measurement. So this
just a guess).

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> Hi Ishii-San
> 
> I have been thinking about this and I think there is one solution we can
> adopt.
> 
> We can block signals before doing malloc() in, AllocSetAlloc() and
> AllocSetContextCreate() function, And since pgpool-II only uses palloc and
> friends to allocate memory and there are only few places where the malloc()
> are issued by MemoryManager API, So if we block signals before calling the
> malloc(), we can safely use elog/ereport from signal handler routines.
> What are your thoughts on this. I am attaching the sample patch for the
> reference.
> 
> Kind regards
> Muhammad Usama
> 
> 
> On Fri, Jan 6, 2017 at 12:03 PM, Muhammad Usama <m.usama at gmail.com> wrote:
> 
>> Hi Ishii-San
>>
>>
>> On Fri, Jan 6, 2017 at 11:22 AM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>>
>>> After some investigations, I found PostgreSQL does not block all
>>> signals: SIGTERM etc. still can interrupt.
>>>
>>> Probably we should take out elog/ereport call from child.c:die().
>>>
>>
>> Yes, this seems to be a good fix for now, But I am looking for some more
>> future proof fix. Give me a couple of more days to get to the solution,
>> otherwise we will go with removing the ereport from child.c:die.
>>
>> Thanks
>> Best regards
>> Muhammad Usama
>>
>>
>>> Best regards,
>>> --
>>> Tatsuo Ishii
>>> SRA OSS, Inc. Japan
>>> English: http://www.sraoss.co.jp/index_en.php
>>> Japanese:http://www.sraoss.co.jp
>>>
>>> > I have asked this how PostgreSQL solves similar problems like this.
>>> > It turned out that PostgreSQL mostly blocks all signals:
>>> >
>>> > https://www.postgresql.org/message-id/18405.1482794285%40sss.pgh.pa.us
>>> >
>>> > Probably we should do the same way...
>>> >
>>> > Best regards,
>>> > --
>>> > Tatsuo Ishii
>>> > SRA OSS, Inc. Japan
>>> > English: http://www.sraoss.co.jp/index_en.php
>>> > Japanese:http://www.sraoss.co.jp
>>> >
>>> >> Usama,
>>> >>
>>> >> If I hit CTRL-C key while executing the regression test, sometimes a
>>> >> orphan pgpool child remains. Here's a stack trace of the process:
>>> >>
>>> >> t-ishii at localhost: gdb temp/installed/bin/pgpool 3794
>>> >> GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
>>> >> [snip]
>>> >> Loaded symbols for /lib/x86_64-linux-gnu/libnss_files.so.2
>>> >> __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linu
>>> x/x86_64/lowlevellock.S:95
>>> >> 95   ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:
>>> そのようなファイルやディレクトリはありません.
>>> >> (gdb) bt
>>> >> #0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linu
>>> x/x86_64/lowlevellock.S:95
>>> >> #1  0x00007f67fe20ccba in _L_lock_12808 () from
>>> /lib/x86_64-linux-gnu/libc.so.6
>>> >> #2  0x00007f67fe20a6b5 in __GI___libc_malloc (bytes=15) at
>>> malloc.c:2887
>>> >> #3  0x00007f67fe21072a in __GI___strdup (s=0x7f67fe305dd8
>>> "/etc/localtime") at strdup.c:42
>>> >> #4  0x00007f67fe239f51 in tzset_internal (always=<optimized out>,
>>> explicit=explicit at entry=1)
>>> >>     at tzset.c:444
>>> >> #5  0x00007f67fe23a913 in __tz_convert (timer=timer at entry=0x7ffce1c1b
>>> 7f8,
>>> >>     use_localtime=use_localtime at entry=1, tp=tp at entry=0x7f67fe54bde0
>>> <_tmbuf>) at tzset.c:632
>>> >> #6  0x00007f67fe2387d1 in __GI_localtime (t=t at entry=0x7ffce1c1b7f8)
>>> at localtime.c:42
>>> >> #7  0x000000000045627b in log_line_prefix (buf=buf at entry=0x7ffce1c1b8d0,
>>> line_prefix=<optimized out>,
>>> >>     edata=<optimized out>) at ../../src/utils/error/elog.c:2059
>>> >> #8  0x000000000045894d in send_message_to_server_log (edata=0x753320
>>> <errordata>)
>>> >>     at ../../src/utils/error/elog.c:2084
>>> >> #9  EmitErrorReport () at ../../src/utils/error/elog.c:1129
>>> >> #10 0x0000000000456d8e in errfinish (dummy=<optimized out>) at
>>> ../../src/utils/error/elog.c:434
>>> >> #11 0x0000000000421f57 in die (sig=2) at protocol/child.c:925
>>> >> #12 <signal handler called>
>>> >> #13 _int_malloc (av=0x7f67fe546760 <main_arena>, bytes=4176) at
>>> malloc.c:3302
>>> >> #14 0x00007f67fe20a6c0 in __GI___libc_malloc (bytes=4176) at
>>> malloc.c:2891
>>> >> #15 0x000000000045561d in AllocSetAlloc (context=0xf90000, size=4128)
>>> at ../../src/utils/mmgr/aset.c:674
>>> >> #16 0x000000000042cd1e in pool_init_cp () at
>>> protocol/pool_connection_pool.c:70
>>> >> #17 0x0000000000423024 in do_child (fds=fds at entry=0xfa35c0) at
>>> protocol/child.c:194
>>> >> #18 0x00000000004086f5 in fork_a_child (fds=0xfa35c0, id=11) at
>>> main/pgpool_main.c:723
>>> >> #19 0x00000000004090cc in reaper () at main/pgpool_main.c:2427
>>> >> #20 0x000000000040d869 in PgpoolMain (discard_status=discard_status at entry=1
>>> '\001',
>>> >>     clear_memcache_oidmaps=clear_memcache_oidmaps at entry=0 '\000') at
>>> main/pgpool_main.c:457
>>> >> #21 0x0000000000406eb3 in main (argc=<optimized out>, argv=<optimized
>>> out>) at main/main.c:305
>>> >>
>>> >> As you can see, the process was stuck while calling malloc().
>>> >>
>>> >> pool_init_cp() was interrupted by SIGINT (because I hit the key), and
>>> >> the signal handler die() was called. In die(), elog() was
>>> >> called. elog() calls log_line_prefix(), which calls
>>> >> localtime(). localtime() calls malloc(), which seems to try to obtain
>>> >> a lock, then it causes a dead lock since pool_init_cp() was
>>> >> interrupted while calling malloc().
>>> >>
>>> >> log_line_prefix() was copied from PostgreSQL but I see subtle
>>> >> differences from the PostgreSQL's original log_line_prefix().:
>>> >>
>>> >> pg_strftime(strfbuf, sizeof(strfbuf),
>>> >>      "%Y-%m-%d %H:%M:%S %Z",
>>> >>      pg_localtime(&stamp_time, log_timezone));
>>> >>
>>> >> Ours:
>>> >>      strftime(strbuf, 128, "%Y-%m-%d %H:%M:%S", localtime(&now));
>>> >>
>>> >> Apparently the call stuck shows that the hang happened inside
>>> >> localtime() because it calls malloc(). On the other hand,
>>> >> pg_localtime() seems to avoid calling localtime() by using their home
>>> >> made version of localtime(), which does *not* call malloc(). I'm not
>>> >> sure if it's a intention but it seems it'd better to follow
>>> >> PostgreSQL's way to avoid the hang problem.
>>> >>
>>> >> What do you think, Usama?
>>> >>
>>> >> Best regards,
>>> >> --
>>> >> Tatsuo Ishii
>>> >> SRA OSS, Inc. Japan
>>> >> English: http://www.sraoss.co.jp/index_en.php
>>> >> Japanese:http://www.sraoss.co.jp
>>> >> _______________________________________________
>>> >> pgpool-hackers mailing list
>>> >> pgpool-hackers at pgpool.net
>>> >> http://www.pgpool.net/mailman/listinfo/pgpool-hackers
>>> > _______________________________________________
>>> > pgpool-hackers mailing list
>>> > pgpool-hackers at pgpool.net
>>> > http://www.pgpool.net/mailman/listinfo/pgpool-hackers
>>> _______________________________________________
>>> pgpool-hackers mailing list
>>> pgpool-hackers at pgpool.net
>>> http://www.pgpool.net/mailman/listinfo/pgpool-hackers
>>>
>>
>>


More information about the pgpool-hackers mailing list