[pgpool-hackers: 550] Re: Backend process hangs with pgpool

Tatsuo Ishii ishii at postgresql.org
Wed Jun 18 00:22:45 JST 2014


Hi Usama,

I think your patch breaks the case where a client uses an unnamed
portal in an implicit transaction, because the Sync message effectively
closes the current transaction (remember that the extended protocol
automatically starts an internal transaction) and the unnamed portal
is destroyed.

For now I can't think of any way to work around the problem on the
pgpool-II side.

I once proposed fixing this on the PostgreSQL side, but the proposal
was rejected. Too bad.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> Hi Tatsuo
> 
> Did you get a chance to look into this issue, or the patch I shared?
> 
> Thanks
> Usama
> 
> 
> On Tue, May 20, 2014 at 6:11 PM, Muhammad Usama <m.usama at gmail.com> wrote:
> 
>> Hi Tatsuo
>>
>> I am trying to fix the backend hangs that happen because pgpool omits
>> the sync message while using the extended protocol. Although I believe
>> this should ideally be fixed on the PostgreSQL backend side, in the
>> meantime I have tried to work around the problem by sending a sync
>> message instead of a flush message from the do_query function when the
>> backend transaction state is IDLE.
>> Can you please take a look at the attached patch and see whether it
>> serves the purpose?
>> I am getting a clean regression run with the patch, and it also
>> apparently fixes the "backend stuck in active state" issue. But I am
>> not sure whether this approach might break other scenarios that are
>> not covered by the regression tests.
>>
>> Thanks
>> Muhammad Usama
>>
>>
>> On Tue, May 13, 2014 at 2:03 PM, Muhammad Usama <m.usama at gmail.com> wrote:
>>
>>> Thanks for looking into it. So this is a known issue that needs to be
>>> fixed on the PostgreSQL backend side. Is there any possible way to
>>> work around it in a production environment?
>>>
>>> Thanks
>>> Usama
>>>
>>>
>>>
>>>
>>> On Mon, May 12, 2014 at 7:32 PM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>>>
>>>> Hi Usama,
>>>>
>>>> It looks like the exact case for one of our FAQ items:
>>>>
>>>>
>>>> http://www.pgpool.net/mediawiki/index.php/FAQ#When_I_check_pg_stat_activity_view.2C_I_see_a_query_like_.22SELECT_count.28.2A.29_FROM_pg_catalog.pg_class_AS_c_WHERE_c.oid_.3D_pgpool_regclass.28.27pgbench_accounts.27.29_AND_c.relpersistence_.3D_.27u.27.22_in_active_state_for_very_long_time._Why.3F
>>>>
>>>> Best regards,
>>>> --
>>>> Tatsuo Ishii
>>>> SRA OSS, Inc. Japan
>>>> English: http://www.sraoss.co.jp/index_en.php
>>>> Japanese: http://www.sraoss.co.jp
>>>>
>>>>
>>>> From: Muhammad Usama <m.usama at gmail.com>
>>>> Subject: Backend process hangs with pgpool
>>>> Date: Mon, 12 May 2014 18:35:08 +0500
>>>> Message-ID: <CAEJvTzUkCtQFsMH4yvudB=M3jVp_Qf5xrP8GoqCjQ2n-CKKOYw at mail.gmail.com>
>>>>
>>>> > Hi Tatsuo
>>>> >
>>>> > I am looking at a problem faced by one of our customers. He has
>>>> > reported that the PostgreSQL backend is frequently getting stuck in
>>>> > his production environment. The analysis so far suggests that
>>>> > pgpool-II's internal query ("SELECT count(*) FROM pg_catalog.pg_class
>>>> > AS c WHERE c.relname = 'foldershare' AND c.relpersistence = 'u'") is
>>>> > causing the backend to get stuck, by not issuing the necessary SYNC
>>>> > message when the backend expects it.
>>>> > I am pasting all the information and discussion we have so far
>>>> > regarding the problem.
>>>> > Can you please advise on what the problem could be and, if it has
>>>> > been correctly identified, what the proper way to fix it would be?
>>>> > I am also attaching a non-working patch (temp.patch) which tries to
>>>> > send the sync message to the backend from do_query() but did not
>>>> > work out.
>>>> >
>>>> > Please see below for the discussion and the information regarding
>>>> > this problem.
>>>> > Logs, configuration files, and a gdb trace are also attached to the
>>>> > mail.
>>>> >
>>>> > #query info
>>>> > ============
>>>> > enterprisedb at edb# select * from pg_stat_activity where pid = 19497;
>>>> >
>>>> > -[ RECORD 1 ]----+---------------------------------------------------------------------------------------------------------
>>>> > datid            | 1110943
>>>> > datname          | cloud
>>>> > pid              | 19497
>>>> > usesysid         | 16584
>>>> > usename          | cloud
>>>> > application_name |
>>>> > client_addr      | 10.27.28.241
>>>> > client_hostname  |
>>>> > client_port      | 2653
>>>> > backend_start    | 10-APR-14 08:36:26.454733 +09:00
>>>> > xact_start       | 10-APR-14 08:36:26.523032 +09:00
>>>> > query_start      | 10-APR-14 08:36:26.551021 +09:00
>>>> > state_change     | 10-APR-14 08:36:26.551022 +09:00
>>>> > waiting          | f
>>>> > state            | active
>>>> > query            | SELECT count(*) FROM pg_catalog.pg_class AS c WHERE
>>>> > c.relname = 'foldershare' AND c.relpersistence = 'u'
>>>> >
>>>> > # explain
>>>> >
>>>> > enterprisedb at edb# explain analyze SELECT count(*) FROM
>>>> > pg_catalog.pg_class AS c WHERE c.relname = 'foldershare' AND
>>>> > c.relpersistence = 'u';
>>>> >                                  QUERY PLAN
>>>> > ---------------------------------------------------------------------------------------------------------------------------------------------
>>>> >  Aggregate  (cost=4.27..4.28 rows=1 width=0) (actual time=0.011..0.011 rows=1 loops=1)
>>>> >    ->  Index Scan using pg_class_relname_nsp_index on pg_class c  (cost=0.00..4.27 rows=1 width=0) (actual time=0.009..0.009 rows=0 loops=1)
>>>> >          Index Cond: (relname = 'foldershare'::name)
>>>> >          Filter: (relpersistence = 'u'::"char")
>>>> >  Total runtime: 0.050 ms
>>>> > (5 rows)
>>>> >
>>>> > #OS process
>>>> >
>>>> > [09:56:21 enterprisedb at cloud-post-mdb /db/pgdba/logs]$ ps -ef | grep 19497
>>>> >
>>>> > 534      19497  7260  0 08:36 ?        00:00:00 postgres: cloud cloud 10.27.28.241[2653] SELECT
>>>> >
>>>> >
>>>> > Internal discussion on the problem
>>>> > =============================
>>>> >
>>>> > Noah Misch Wrote:
>>>> > ============
>>>> >
>>>> > Looking at the customer's PPAS stack trace, it could be explained by
>>>> > PPAS not receiving a Sync message after a series of extended query
>>>> > protocol messages. When this happens, pg_stat_activity and the "ps"
>>>> > line will give the appearance of an active query. An affected backend
>>>> > will appear to ignore pg_cancel_backend(), but it will promptly
>>>> > respond to pg_terminate_backend(). Assuming that's a correct
>>>> > diagnosis, I see three potential explanations:
>>>> >
>>>> > (1) A network problem between the pgpool server and the PPAS server
>>>> > cuts off the transfer in the middle of an extended query protocol
>>>> > message sequence. (Though I don't see that the customer ever stated
>>>> > it directly, the files provided suggest pgpool is on a separate
>>>> > system.)
>>>> >
>>>> > (2) pgpool, through some bug, omits a necessary Sync message.
>>>> >
>>>> > (3) The client software omits the Sync message, and pgpool just
>>>> > passes on that omission. I think we can discount this possibility,
>>>> > because one of the customer-reported hung queries is generated
>>>> > internally by pgpool:
>>>> > SELECT count(*) FROM pg_catalog.pg_class AS c WHERE c.relname = '%s'
>>>> > AND c.relpersistence = 'u'
>>>> >
>>>> > Either (2) or (1) could be it. I feel (2) is more likely, because it
>>>> > is a stretch to believe that this would be the sole symptom of
>>>> > network trouble. Tracking down the bug behind (2) could be difficult,
>>>> > alas.
>>>> >
>>>> > I propose that support file a Redmine case against pgpool to
>>>> > investigate explanation (2). Usama, if you would like more detail
>>>> > from the customer to aid the pgpool team in such analysis, now is a
>>>> > good time to request it. Note that the customer is offering remote
>>>> > access to their system.
>>>> >
>>>> > One can exercise this hang situation on the PPAS backend side by
>>>> > having a client program call PQexecParams() with a libpq patched
>>>> > like so:
>>>> >
>>>> > --- a/src/interfaces/libpq/fe-exec.c
>>>> > +++ b/src/interfaces/libpq/fe-exec.c
>>>> > @@ -1558,10 +1558,12 @@ PQsendQueryGuts(PGconn *conn,
>>>> >          pqPutMsgEnd(conn) < 0)
>>>> >          goto sendFailed;
>>>> >
>>>> > +#if 0
>>>> >      /* construct the Sync message */
>>>> >      if (pqPutMsgStart('S', false, conn) < 0 ||
>>>> >          pqPutMsgEnd(conn) < 0)
>>>> >          goto sendFailed;
>>>> > +#endif
>>>> >
>>>> >      /* remember we are using extended query protocol */
>>>> >      conn->queryclass = PGQUERY_EXTENDED;
>>>> >
>>>> >
>>>> >
>>>> > Noah Misch Wrote:
>>>> > ================
>>>> > The PostgreSQL wire protocol has two messages, "Flush" and "Sync",
>>>> > for the client to signal that it is waiting for the result of its
>>>> > previous requests.  Sync ends any implicit transaction; Flush does
>>>> > not.  Sync causes the session to show as "idle" in pg_stat_activity;
>>>> > Flush leaves it "active".  pgpool uses Flush for certain internal
>>>> > queries, because it does not wish to disturb an outstanding implicit
>>>> > transaction.  The consequence for pg_stat_activity output is
>>>> > undesirable, but it is rightly considered more tolerable than
>>>> > closing a user-initiated implicit transaction.
>>>> >
>>>> > In the context of this support ticket, it's enough to say that the
>>>> > behavior is a known limitation of pg_stat_activity and does not
>>>> > indicate a malfunction.
>>>> >
>>>> > As far as improving the situation, two options come to mind.  The
>>>> > first option is to have pgpool track whether an implicit transaction
>>>> > could be open; when pgpool sees a ReadyForQuery message from the
>>>> > server, any implicit transaction is definitely closed.  For these
>>>> > internal metadata queries, it can then issue a Sync instead of a
>>>> > Flush if the session was idle (no implicit transaction) when the
>>>> > internal query began.  A second option is to change the server to
>>>> > report a pg_stat_activity.state value other than "active" (perhaps
>>>> > "idle in transaction" or a novel "idle in implicit transaction")
>>>> > when a query has ended with a Flush.
>>>> >
>>>> >
>>>> > Thanks
>>>> > Usama
>>>>
>>>
>>>
>>


More information about the pgpool-hackers mailing list