[pgpool-general: 2158] Re: pgpool latency, hesitation under load

Wed Sep 25 10:08:51 JST 2013

I have some questions.

> Hello,
> Thanks for your message.
> Yes, we were cascading pgpool-II.
> At first, we tried only one, but when connections exceeded around 10% the
> available pool, it increased latency for new connections to up 30 seconds.
> By cascading pooling the maximum latency reduced to around 3 seconds.
> But even 3 seconds were too much trouble at peak times. 
> The web application impose around 2 thousand transactions per second, and even
> more at peak times.
> By my limited knowledge about pgpool2, it seems that each process forked has its
> own queue. Therefore, one given connection could be assigned by kernel to a long
> queue composed of long running queries, while at the same moment a side process
> containing short queries has its queue flushed.

I think there's only one queue which is managed by the kernel.

Pgpool-II's process architecture is similar to Apache's model (if its
worker is configured as process). The parent process issues listen()
system call. The child process is forked off from the parent, thus
inherits the same file descriptor of the parent's listen fd and issues
accept() system call if it is not processing queries. When a client
sends a connection request to the pgpool-II listen port, the kernel
wakes up all the child process which have issued accept() system
call. The kernel selects one of the child process then let the child
process it. Because Apache's model is very successful as far as I
know, I really don't have an idea which part illustrated above takes
up to 30 seconds.

> As we observed many queries running at 0.5 seconds, it takes only 6 queries
> queued at same process to sum 3 seconds.
> Our team gave pgbouncer a try, as it uses threads and linux kernel assigns tcp
> connections to threads in the same process, afaik.
> The results were astounding for pooling, with clear visibility of queues by
> pgbouncer system queries. The latencies were reduced to around 0.5 seconds most
> of time. At some peak hours one or other frontend reach 2 seconds, rarely.
> (we have graphs of this difference, but this list may not allow attachments)
> It seems that the kernel logic to assign tcp connections to process queues is
> causing this behaviour.
> The custom queries for monitoring in pgbouncer show the waiting connections
> clearly at peak hours and they are flushed very fast. I guess it is because of
> the single queue.

What mode of pgbouncer did you use in the test? As far as I know,
"session mode" is similar to pgpool but other mode including
"transaction mode" is very different from pgpool-II.

> A very used feature by application is the reserved connections for slow queries.
> Is there a way to configure pgpool to behave similarly and show waiting
> connections by process?

One of the idea is, creating another pgpool instance which is solely
responsible for the slow queries.  If you want to process slow
queries, you connect to the pgpool port, rather than existing one.

> We use pg_activity, zabbix, log threshold, and queries agains pg_stats to
> monitor backend. This way we see that application issues thousands of short
> queries and many queries between 0.1 to 0.5 seconds all time. Sometimes it
> issues very long queries of 30 seconds.
> The server has enough cpu for the present load (after many tunings and
> indexing), as now is running around 12%.
> 
> There has been some weeks and app versions since we blacklisted some tables, but
> as far I can remember, some query cache data was invalidated during the
> execution of some queries before other portion of query was executed. The
> queries were complex and  execution plan was not simple, so was reasonable to
> pgpool, that do not have the execution plan, to not process the query the same
> way backend do.
> Since then, as our cache hit was reduced to 0.55, we disabled query cache (as
> stated on docs if below 0.7)  and penalty on backend, giving our present
> environment  (running on RAM) was negligible.
> I have one of such queries recorded and  its present execution plan and may
> attach the txt file if list allows it.
> 
> Regards.
> Andre Felipe
> http://www.techforce.com.br
> 
> 
>  
> 
> 
> On 18/Sep/2013 20:09 Tatsuo Ishii <ishii at postgresql.org> wrote ..
>> I am a little bit confused.
>> 
>> You said:
>> > Using pgpool 3.3.0 as connection pooling (1 backend) with shared memory query
>> and:
>> > The pgpool at each web frontend does not have query cache enabled.
>> 
>> So you are "cascading" pgpool-II?
>> 
>> Also I am confused by this:
>> > Most of queries are below 0,5 seconds.
>> and this:
>> > answer psql requests under 100 ms, about 12% cpu load. Zabbix monitoring.
>> 
>> In the mean time I'm interested in this:
>> > complex queries, because postgresql used a non literal execution and query cache
>> > caused errors.
>> 
>> What kind of errors exactly do you have?
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>> 
>> > Hello,
>> > Using pgpool 3.3.0 as connection pooling (1 backend) with shared memory query
>> > cache, and connection pooling only, we are trying to spot a misconfiguration
>> > that is causing a hesitation, latency, staggering under load.
>> > During these hesitation periods, sgbd transactions per second drops.
>> > If we connect web server directly to sgbd backend (too many processes open) 
>> > obviously %sys cpu is too high, but there are not staggering latencies.
>> > Sometimes a delay of up o 2.5 seconds to get a new connection for the web app.
>> > Also, we had to use 2 level pooling (one level at each web server, other in 
>> > front of sgbd) because when connections exceeded about 10% of num_init_children,
>> > it hesitated for about 30 seconds. 
>> > num_init_children = 160
>> > max_pool = 1
>> > When leaving child_max_connections at 0, the problem is reduced, but still 
>> > happens.
>> > connection_life_time = 360
>> > client_idle_limit = 360
>> > The backend is executing about 2000 transactions per second, around 300 
>> > connections per second.
>> > Most of queries are below 0,5 seconds.
>> > Query cache hit at same machine is 0.55. We had to blacklist some tables with
>> > complex queries, because postgresql used a non literal execution and query cache
>> > caused errors.
>> > The pgpool at each web frontend does not have query cache enabled.
>> > The postgresql 9.2.4 backend on debian linux 7.1, kernel 3.2, 40 cores 80 
>> > threads, 512 GB ram, raid controller 1GB NVRAM,  is running db on RAM and always
>> > answer psql requests under 100 ms, about 12% cpu load. Zabbix monitoring.
>> > max_connections = 256
>> > 
>> > How could we track our misconfiguration that cause pgpool latency staggering?
>> > 
>> > Regards.
>> > Andre Felipe Machado
>> > http://www.techforce.com.br