[pgpool-hackers: 549] Re: Backend process hangs with pgpool

Muhammad Usama m.usama at gmail.com
Tue Jun 17 21:40:29 JST 2014


Hi Tatsuo

Did you get a chance to look into this issue, or the patch I shared?

Thanks
Usama


On Tue, May 20, 2014 at 6:11 PM, Muhammad Usama <m.usama at gmail.com> wrote:

> Hi Tatsuo
>
> I am trying to fix the backend hangs that happen because pgpool omits
> the sync message while using the extended query protocol. Although I
> believe this should ideally be fixed on the PostgreSQL backend side,
> in the meantime I have tried to work around the problem by sending a
> Sync message instead of a Flush message from the do_query() function
> when the backend transaction state is IDLE.
> Can you please take a look at the attached patch and see whether it
> serves the purpose?
> I am getting a clean regression run with the patch, and it also
> apparently fixes the backend-stuck-in-active-state issue. But I am not
> sure whether this approach might break other scenarios that are not
> covered by the regression tests.
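>
> Roughly, the idea is the following. This is only an illustrative
> sketch, not the patch itself; the raw-socket write stands in for
> pgpool's own connection routines:
>
> #include <stdint.h>
> #include <string.h>
> #include <unistd.h>
> #include <arpa/inet.h>
>
> /* End an internal extended-protocol query with Sync when the backend
>  * is idle, otherwise with Flush.  tx_status is the status byte from
>  * the backend's last ReadyForQuery message: 'I' = idle, 'T' = in
>  * transaction, 'E' = failed transaction.  Both messages are just a
>  * type byte plus a big-endian int32 length of 4 (no body). */
> static int end_internal_query(int backend_sock, char tx_status)
> {
>     char     msg[5];
>     uint32_t len = htonl(4);          /* length field counts itself */
>
>     msg[0] = (tx_status == 'I')
>         ? 'S'   /* Sync: ends no user transaction; state -> "idle" */
>         : 'H';  /* Flush: leaves an implicit transaction untouched */
>     memcpy(msg + 1, &len, 4);
>     return (write(backend_sock, msg, sizeof msg) == (ssize_t) sizeof msg) ? 0 : -1;
> }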
>
> Thanks
> Muhammad Usama
>
>
> On Tue, May 13, 2014 at 2:03 PM, Muhammad Usama <m.usama at gmail.com> wrote:
>
>> Thanks for looking into it. So this is a known issue, which needs to
>> be fixed on the PostgreSQL backend side. Or is there any possible way
>> to work around it in a production environment?
>>
>> Thanks
>> Usama
>>
>>
>>
>>
>> On Mon, May 12, 2014 at 7:32 PM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>>
>>> Hi Usama,
>>>
>>> It looks like the exact case for one of our FAQ items:
>>>
>>>
>>> http://www.pgpool.net/mediawiki/index.php/FAQ#When_I_check_pg_stat_activity_view.2C_I_see_a_query_like_.22SELECT_count.28.2A.29_FROM_pg_catalog.pg_class_AS_c_WHERE_c.oid_.3D_pgpool_regclass.28.27pgbench_accounts.27.29_AND_c.relpersistence_.3D_.27u.27.22_in_active_state_for_very_long_time._Why.3F
>>>
>>> Best regards,
>>> --
>>> Tatsuo Ishii
>>> SRA OSS, Inc. Japan
>>> English: http://www.sraoss.co.jp/index_en.php
>>> Japanese: http://www.sraoss.co.jp
>>>
>>>
>>> From: Muhammad Usama <m.usama at gmail.com>
>>> Subject: Backend process hangs with pgpool
>>> Date: Mon, 12 May 2014 18:35:08 +0500
>>> Message-ID: <CAEJvTzUkCtQFsMH4yvudB=M3jVp_Qf5xrP8GoqCjQ2n-CKKOYw at mail.gmail.com>
>>>
>>> > Hi Tatsuo
>>> >
>>> > I am looking at a problem faced by one of our customers. He has
>>> > reported that the PostgreSQL backend frequently gets stuck in his
>>> > production environment. The analysis so far suggests that pgpool-II's
>>> > internal query ("SELECT count(*) FROM pg_catalog.pg_class AS c WHERE
>>> > c.relname = 'foldershare' AND c.relpersistence = 'u' ") is responsible
>>> > for the backend getting stuck, by not issuing the necessary SYNC
>>> > message when the backend expects it.
>>> > I am pasting all the information and discussion we have so far
>>> > regarding the problem.
>>> > Can you please advise on what the problem could be and, if it has been
>>> > correctly identified, what the proper way to fix it would be?
>>> > I am also attaching a non-working patch (temp.patch) which tries to
>>> > send the sync message to the backend from do_query(), but it did not
>>> > work out.
>>> >
>>> > Please see below for the discussion and the information regarding this
>>> > problem.
>>> > The log, configuration files, and gdb trace are also attached to this
>>> > mail.
>>> >
>>> > #query info
>>> > ============
>>> > enterprisedb at edb# select * from pg_stat_activity where pid = 19497;
>>> >
>>> > -[ RECORD 1 ]----+---------------------------------------------------------------------------------------------------------
>>> > datid            | 1110943
>>> > datname          | cloud
>>> > pid              | 19497
>>> > usesysid         | 16584
>>> > usename          | cloud
>>> > application_name |
>>> > client_addr      | 10.27.28.241
>>> > client_hostname  |
>>> > client_port      | 2653
>>> > backend_start    | 10-APR-14 08:36:26.454733 +09:00
>>> > xact_start       | 10-APR-14 08:36:26.523032 +09:00
>>> > query_start      | 10-APR-14 08:36:26.551021 +09:00
>>> > state_change     | 10-APR-14 08:36:26.551022 +09:00
>>> > waiting          | f
>>> > state            | active
>>> > query            | SELECT count(*) FROM pg_catalog.pg_class AS c WHERE
>>> > c.relname = 'foldershare' AND c.relpersistence = 'u'
>>> >
>>> > # explain
>>> >
>>> > enterprisedb at edb# explain analyze SELECT count(*) FROM pg_catalog.pg_class
>>> > AS c WHERE c.relname = 'foldershare' AND c.relpersistence = 'u';
>>> >                                   QUERY PLAN
>>> > ---------------------------------------------------------------------------------------------------------------------------------------------
>>> > Aggregate  (cost=4.27..4.28 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=1)
>>> >   ->  Index Scan using pg_class_relname_nsp_index on pg_class c
>>> >       (cost=0.00..4.27 rows=1 width=0) (actual time=0.009..0.009 rows=0 loops=1)
>>> >         Index Cond: (relname = 'foldershare'::name)
>>> >         Filter: (relpersistence = 'u'::"char")
>>> > Total runtime: 0.050 ms
>>> > (5 rows)
>>> >
>>> > #OS process
>>> >
>>> > [09:56:21 enterprisedb at cloud-post-mdb /db/pgdba/logs]$ ps -ef | grep 19497
>>> >
>>> > 534      19497  7260  0 08:36 ?        00:00:00 postgres: cloud cloud
>>> > 10.27.28.241[2653] SELECT
>>> >
>>> >
>>> > Internal discussion on the problem
>>> > =============================
>>> >
>>> > Noah Misch Wrote:
>>> > ============
>>> >
>>> > Looking at the customer's PPAS stack trace, it could be explained by
>>> > PPAS not receiving a Sync message after a series of extended query
>>> > protocol messages. When this happens, pg_stat_activity and the "ps"
>>> > line will give the appearance of an active query. An affected backend
>>> > will appear to ignore pg_cancel_backend(), but it will promptly
>>> > respond to pg_terminate_backend(). Assuming that's a correct
>>> > diagnosis, I see three potential explanations:
>>> >
>>> > (1) A network problem between the pgpool server and the PPAS server
>>> > cuts off the transfer in the middle of an extended query protocol
>>> > message sequence. (Though I don't see that the customer ever stated
>>> > it directly, the files provided suggest pgpool is on a separate
>>> > system.)
>>> >
>>> > (2) pgpool, through some bug, omits a necessary Sync message.
>>> >
>>> > (3) The client software omits the Sync message, and pgpool just
>>> > passes on that omission. I think we can discount this possibility,
>>> > because one of the customer-reported hung queries is generated
>>> > internally by pgpool:
>>> > SELECT count(*) FROM pg_catalog.pg_class AS c WHERE c.relname = '%s'
>>> > AND c.relpersistence = 'u'
>>> >
>>> > Either (2) or (1) could be it. I feel (2) is more likely, because it
>>> > is a stretch to believe that this would be the sole symptom of
>>> > network trouble. Tracking down the bug behind (2) could be difficult,
>>> > alas.
>>> >
>>> > I propose that support file a Redmine case against pgpool to
>>> > investigate explanation (2). Usama, if you would like more detail
>>> > from the customer to aid the pgpool team in such analysis, now is a
>>> > good time to request it. Note that the customer is offering remote
>>> > access to their system.
>>> >
>>> > One can exercise this hang situation on the PPAS backend side by
>>> > having a client program call PQexecParams() with a libpq patched
>>> > like so:
>>> > --- a/src/interfaces/libpq/fe-exec.c
>>> > +++ b/src/interfaces/libpq/fe-exec.c
>>> > @@ -1558,10 +1558,12 @@ PQsendQueryGuts(PGconn *conn,
>>> >          pqPutMsgEnd(conn) < 0)
>>> >          goto sendFailed;
>>> >
>>> > +#if 0
>>> >      /* construct the Sync message */
>>> >      if (pqPutMsgStart('S', false, conn) < 0 ||
>>> >          pqPutMsgEnd(conn) < 0)
>>> >          goto sendFailed;
>>> > +#endif
>>> >
>>> >      /* remember we are using extended query protocol */
>>> >      conn->queryclass = PGQUERY_EXTENDED;
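>>> >
>>> > For illustration, a minimal client along those lines might look as
>>> > follows (a sketch only; the conninfo string is a placeholder):
>>> >
>>> > /* sync_hang.c: exercise the missing-Sync hang, using a libpq built
>>> >  * with the patch above.  Build (paths vary):  cc sync_hang.c -lpq */
>>> > #include <stdio.h>
>>> > #include <libpq-fe.h>
>>> >
>>> > int main(void)
>>> > {
>>> >     PGconn   *conn = PQconnectdb("host=localhost dbname=postgres");
>>> >     PGresult *res;
>>> >
>>> >     if (PQstatus(conn) != CONNECTION_OK)
>>> >     {
>>> >         fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
>>> >         PQfinish(conn);
>>> >         return 1;
>>> >     }
>>> >
>>> >     /* Uses the extended query protocol.  With the Sync suppressed in
>>> >      * libpq, the server never sends ReadyForQuery, this call blocks,
>>> >      * and pg_stat_activity shows the session as "active". */
>>> >     res = PQexecParams(conn, "SELECT 1", 0, NULL, NULL, NULL, NULL, 0);
>>> >
>>> >     PQclear(res);
>>> >     PQfinish(conn);
>>> >     return 0;
>>> > }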
>>> >
>>> >
>>> >
>>> > Noah Misch Wrote:
>>> > ================
>>> > The PostgreSQL wire protocol has two messages, "Flush" and "Sync",
>>> > for the client to signal that it is waiting for the result of its
>>> > previous requests.  Sync ends any implicit transaction; Flush does
>>> > not.  Sync causes the session to show as "idle" in pg_stat_activity;
>>> > Flush leaves it "active".  pgpool uses Flush for certain internal
>>> > queries, because it does not wish to disturb an outstanding implicit
>>> > transaction.  The consequence for pg_stat_activity output is
>>> > undesirable, but it is rightly considered more tolerable than closing
>>> > a user-initiated implicit transaction.
>>> >
>>> > In the context of this support ticket, it's enough to say that the
>>> > behavior is a known limitation of pg_stat_activity and does not
>>> > indicate a malfunction.
>>> >
>>> > As far as improving the situation, two options come to mind.  The
>>> > first option is to have pgpool track whether an implicit transaction
>>> > could be open; when pgpool sees a ReadyForQuery message from the
>>> > server, any implicit transaction is definitely closed.  For these
>>> > internal metadata queries, it can then issue a Sync instead of a
>>> > Flush if the session was idle (no implicit transaction) when the
>>> > internal query began.  A second option is to change the server to
>>> > report a pg_stat_activity.state value other than "active" (perhaps
>>> > "idle in transaction" or a novel "idle in implicit transaction") when
>>> > a query has ended with a Flush.
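>>> >
>>> > To make the first option concrete, here is a sketch of the tracking
>>> > (the names are illustrative, not pgpool's actual code):
>>> >
>>> > /* Remember the backend's transaction status from each ReadyForQuery
>>> >  * ('Z') message; its single payload byte is 'I' (idle), 'T' (in
>>> >  * transaction) or 'E' (failed transaction). */
>>> > typedef struct
>>> > {
>>> >     char xact_status;            /* last ReadyForQuery status byte */
>>> > } backend_state;
>>> >
>>> > /* Feed each backend message through this while relaying it. */
>>> > static void observe_backend_message(backend_state *st, char msg_type,
>>> >                                     const char *payload, int payload_len)
>>> > {
>>> >     if (msg_type == 'Z' && payload_len >= 1)
>>> >         st->xact_status = payload[0];
>>> > }
>>> >
>>> > /* An internal metadata query may safely end with Sync only when no
>>> >  * implicit transaction was open as it began. */
>>> > static int may_use_sync(const backend_state *st)
>>> > {
>>> >     return st->xact_status == 'I';
>>> > }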
>>> >
>>> >
>>> > Thanks
>>> > Usama
>>>
>>
>>
>