[pgpool-general: 2442] Re: PGPool tuning

Fri Jan 17 11:32:50 JST 2014

Thank you, Tatsuo.

We are still experiencing the problem once or twice per day.  I am making
incremental changes on our live cluster after testing them on the test
cluster.  So far we have done the following:

-Comment out unused 2nd backend in pgpool.conf
-Add a connect_timeout of 10 seconds to the pg_connect() connection string
in the PHP application
-set sysctl net.core.somaxconn = 1024

We just did the last step today so we will see if there is any impact.

When the fault happens, there is work being done in the database, yet
"select * from pg_stat_activity;" shows only a few running queries at the
time.  To me, this says that Apache+PHP still has the connection open to
pgpool.

I'll be sure to post back if we figure it out!

Justin

On Mon, Jan 13, 2014 at 7:55 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:

> Thanks for posting detailed analythis. It looks really interesting.
> I need more time to understanding full details.
>
> In the mean time I wonder if you care about listen queue
> setting. Currently pgpool listens up to num_init_children*2 (which 64,
> in your case). However Apache connects to pgpool up to 256, which is
> way too low compared with 64. Also Linux allows max the listen queue
> to up 128 by default on most systems. You can check it by looking at:
>
> $ sysctl net.core.somaxconn
> net.core.somaxconn = 128
>
> 128 is too low compared with 256, of course.
>
> If the allowed listen queue length (backlog) is too low, lots of retry
> happens in kernel's TCP layer.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
> > Greetings!
> >
> > We are having an issue with PGPool and I wanted to post my analysis to
> this
> > list to see if: A). My analysis seems correct to you all and B). To see
> if
> > you folks might have any advice on tuning.
> >
> >
> > For the last month plus, we have been experiencing an intermittent fault
> > state on our production cluster.  When the fault occurs, any request to
> the
> > Apache+PHP web server will either time out connecting, or will connect
> but
> > return with a "Could not connect to DB" message from PHP.  I've done some
> > analysis on the problem and this is what I've found.
> >
> > First let me describe the cluster as it is configured today.  We have one
> > web front end running Apache+PHP, which has a MaxClients setting of 256,
> > meaning that it's possible to have 256 concurrently running processes.
>  The
> > PHP application is configured to connect to PGPool 3.2.1 for its database
> > connection.  PGPool is configured with max_init_children of 32 and
> max_pool
> > of 8.  The application runs on 10-12 different databases, all with the
> same
> > Postgres username+password.
> >
> > When the fault occurs, it looks like this: Apache has 256 running
> processes
> > and load on the web front end drops to near 0.  PGPool has all 32 sockets
> > that face Apache filled, and all 256 sockets that face Postgres filled.
> >  Postgres has 256 connections and its load goes to near 0.  If you try to
> > connect to PGPool from the command line, it will time out in connecting,
> or
> > sometimes partially connect and then receive a connection closed message.
> >
> > Using our test cluster, I ran some tests that give me high confidence
> that
> > PGPool is actually working correctly, as are Apache and Postgres, and
> that
> > the fundamental problem is just a badly tuned configuration.  This is the
> > test that shows that best:
> >
> >
> >    1. Stop Apache, restart PGPool
> >    2. Start up 100 psql command line clients to connect to PGPool with a
> >    single database
> >    3. The first 32 psql clients connect and work fine
> >    4. The 33rd psql client blocks waiting to connect (it will time out
> >    after 30 seconds, but in this test we don't wait that long)
> >    5. fg the psql client #1, then exit the client, freeing up one of
> >    PGPool's connections
> >    6. One of the 68 blocking psql clients now gets through and can run
> >    queries
> >    7. Any of the 32 connected psql clients can get through as well
> >
> > This shows that PGPool is working as expected.
> >
> > Now we try a test that is more like the real world:
> >
> >    1. Restart PGPool
> >    2. Start up 10-20 psql command line clients.  These are simulating
> long
> >    running php processes.
> >    3. Start siege web testing tool with 100-200 concurrent requests to
> >    Apache.
> >    4. At 100 clients, the response time from Apache slows down and the
> time
> >    taken to service each request goes up to around 15s (from < 1s).  Psql
> >    command line client can get through most of the time, but it takes
> some
> >    time to connect as it is contending for one of the 32 slots to PGPool
> with
> >    all of the Apache processes.
> >    5. At 200 clients, response time goes up more and we start to see
> >    failures in Apache, as well as "Could not connect to DB" responses.
>  Psql
> >    command line client often will timeout before it gets a connection to
> >    PGPool.
> >    6. Once lots of failures are happening at the 200 clients level, load
> on
> >    Postgres goes to near 0 as well as load on Apache.
> >    7. Failure will also happen with 250 siege clients and no psql command
> >    line clients running.
> >
> >
> > In step 4, I believe the response time from Apache goes up due to PGPool
> > having to spend so much time managing incoming connections from Apache as
> > well as managing connections to Postgres.  Database load is not high in
> > this case, so the slowness is not due to Postgres being overloaded.
> >
> > I believe that on the live cluster the load is even more severe as there
> > are more databases being used, and occasionally high load, long running
> > queries.
> >
> > It's also notable that restarting Apache has been our fix to get
> everything
> > running again.  I believe that this is because PGPool gets a chance to
> > catch up, which it does fairly quickly, and resumes with 32 available
> > sockets for Apache.  If we do nothing, PGPool reaches a 10 minute timeout
> > specified in its config, and closes all 32 sockets, which causes
> everything
> > to resume working again.
> >
> >
> > In the end, I believe the problem is that Apache is just sending too many
> > requests to PGPool, and PGPool spends all of its time managing
> connections,
> > causing it to be slow at everything.  That slowness and contention for 32
> > slots among up to 256 Apache processes leads to connection timeouts (it
> > should be noted that Apache seems to have no connect timeout defined and
> > will wait for a connection until the PHP max execution time is reached).
> >  Once a threshold is reached, we enter a state where no Apache process is
> > able to connect to PGPool in enough time and we see the browser requests
> > either timing out entirely or returning the "Could not connect to DB"
> > message.
> >
> >
> > The proposed solution to all of this is to adjust the configuration of
> > PGPool and Apache to ensure that we can never reach this overwhelmed
> state.
> >  Specifically, we need to increase the number of PGPool processes and
> > decrease the maximum number of Apache processes.  We need to be careful
> as
> > we do this, as there is surely an upper limit to how many PGPool
> processes
> > can be sustained and increasing that increases overhead on Postgres since
> > it increases the number of persistent open connections between it and
> > PGPool.  The same for Apache, we need to lower MaxClients but not so low
> > that it turns away requests that could have been handled.
> >
> >
> > There are a few other adjustments that I believe will help that I'll
> > describe below.
> >
> > Apache MaxClients:
> > This is how many concurrent Apache processes can run at once.  The
> current
> > setting of 256 is clearly more than the system can handle.  I suggest we
> > drop it down to 128 to begin with and monitor the results.  I'd like to
> > make this change before the others.
> >
> > Apache PHP DB connection timeout:
> > I can see that it's waiting as long as 150s before returning with 'Could
> > not connect to DB' at times, which indicates that no timeout is being
> > specified.  This must be sent as part of the connection string, like:
> > "pgsql:host=127.0.0.1;port=5432;dbname=vw_bepensa;timeout=10".  I'm not
> > sure at this point what a reasonable value would be, but I'm thinking 10
> > seconds is a good start.
> >
> > PGPool backends:
> > We currently have 2 backends specified in the config.  One has
> > backend_weight of 1 and the other, that is not used, has backedn_weight
> of
> > 0.  I have confirmed that whenever a client connects to PGPool and
> requests
> > a connection to a database, for example, PGPool opens a persistent
> > connection to both backends.  We will comment out the backend that
> > specifies the backup server, which should help PGPool a lot.
> >
> >
> > PGPool max_init_children:
> > This is the config parameter that specifies how many PGPool processes can
> > run, and therefore how many sockets are available to Apache.  Increasing
> > this number by one increases the number of persistent connections to the
> DB
> > by max_pool, currently 8.  Postgres is currently configured to only allow
> > 300 connections maximum, so that would need to be changed as well.  More
> > research and testing is needed to find the sweet spot.
> >
> > PGPool max_pool:
> > This parameter specifies how many different DBs each PGPool process keeps
> > in its cache of persistent connections to Postgres.  It is currently set
> to
> > 8, yet we have more than 8 different databases in production (I see 12
> > connected right now).  If a connection to a database is requested of
> PGPool
> > by Apache, and the PGPool process servicing Apache's request does not
> have
> > a connection to that database, it will drop one and use the slot to make
> a
> > new connection to the requested DB on Postgres.  If max_pool was set to
> 12,
> > this would stop happening and there would always be a persistent
> connection
> > to the db requested ready to go when requested by apache.  Postgres would
> > ideally get no new db connections.  Increasing from 8 to 12 would mean
> that
> > total connections to Postgres would be 32*12 = 384, which is above
> > Postgres's connection limit.  So this parameter, max_init_children, and
> > Postgres's connection limit must all be tuned to eachother, and kept low
> > enough to not overwhelm Postgres.
> >
> >
> > I suggest that we begin by commenting out the second backend in
> > pgpool.conf, and lowering MaxClients on Apache to 128.  This should
> prevent
> > PGPool being hammered past the point that it can handle.  If PGPool does
> > fall behind, only 128 Apache connections will be hitting PGPool and it
> > seems to be able to handle that many in an orderly fashion.
> >
> > I also think adding a PHP connection timeout will help keep the system
> from
> > grinding to a stop.
> >
> >
> > Thank you for reading and any help or insight you can provide!
> >
> > Justin Cooper
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20140116/056113a2/attachment-0001.html>