[pgpool-general: 2471] Re: PGPool tuning

Justin Cooper jcooper at vendwatchtelematics.com
Thu Jan 23 01:50:45 JST 2014


Just to follow up on this, it turns out it was a problem in our
application...

We had a long-running job that was tying up the user's session, paired
with a second browser window that was making a "ping" type call every 10
seconds.  These pings were stacking up, and after 300s they would tie up
all of PGPool's available sockets to Apache.


Thanks for the help.  I did a lot of testing with PGPool and thought it was
the source of the problem, but the more I tested the more I became
convinced that PGPool is actually working great!


Cheers,
Justin









On Thu, Jan 16, 2014 at 10:24 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:

> When the lock up happens, what "select * from pg_stat_activity"
> and "select * from pg_locks" show?
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
> > Thank you, Tatsuo.
> >
> > We are still experiencing the problem once or twice per day.  I am making
> > incremental changes on our live cluster after testing them on the test
> > cluster.  So far we have done the following:
> >
> > -Comment out unused 2nd backend in pgpool.conf
> > -Add a connect_timeout of 10 seconds to the pg_connect() connection
> >  string in the PHP application
> > -set sysctl net.core.somaxconn = 1024
> >
> > We just did the last step today so we will see if there is any impact.
> >
> > When the fault happens, there is work being done in the database, yet
> > "select * from pg_stat_activity;" shows only a few running queries at the
> > time.  To me, this says that Apache+PHP still has the connection open to
> > pgpool.
> >
> > I'll be sure to post back if we figure it out!
> >
> > Justin
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Jan 13, 2014 at 7:55 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> >
> >> Thanks for posting a detailed analysis. It looks really interesting.
> >> I need more time to understand the full details.
> >>
> >> In the meantime, I wonder if you have looked at the listen queue
> >> setting. Currently pgpool sets its listen backlog to
> >> num_init_children*2 (which is 64 in your case). However, Apache
> >> connects to pgpool with up to 256 clients, so 64 is way too low
> >> compared with 256. Also, Linux limits the listen queue to 128 by
> >> default on most systems. You can check it by looking at:
> >>
> >> $ sysctl net.core.somaxconn
> >> net.core.somaxconn = 128
> >>
> >> 128 is too low compared with 256, of course.
> >>
> >> If the allowed listen queue length (backlog) is too low, lots of
> >> retries happen in the kernel's TCP layer.
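> >>
> >> If you decide to raise it, something like the following should work
> >> on most Linux systems (a sketch, using the 1024 value mentioned
> >> earlier in this thread):
> >>
> >> $ sudo sysctl -w net.core.somaxconn=1024
> >> $ echo 'net.core.somaxconn = 1024' | sudo tee -a /etc/sysctl.conf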
> >>
> >> Best regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS, Inc. Japan
> >> English: http://www.sraoss.co.jp/index_en.php
> >> Japanese: http://www.sraoss.co.jp
> >>
> >> > Greetings!
> >> >
> >> > We are having an issue with PGPool and I wanted to post my analysis
> >> > to this list to see if: A) my analysis seems correct to you all, and
> >> > B) you folks might have any advice on tuning.
> >> >
> >> >
> >> > For the last month plus, we have been experiencing an intermittent
> >> > fault state on our production cluster.  When the fault occurs, any
> >> > request to the Apache+PHP web server will either time out connecting,
> >> > or will connect but return with a "Could not connect to DB" message
> >> > from PHP.  I've done some analysis on the problem and this is what
> >> > I've found.
> >> >
> >> > First let me describe the cluster as it is configured today.  We have
> >> > one web front end running Apache+PHP, which has a MaxClients setting
> >> > of 256, meaning that it's possible to have 256 concurrently running
> >> > processes.  The PHP application is configured to connect to PGPool
> >> > 3.2.1 for its database connection.  PGPool is configured with
> >> > num_init_children of 32 and max_pool of 8.  The application runs on
> >> > 10-12 different databases, all with the same Postgres
> >> > username+password.
> >> >
> >> > When the fault occurs, it looks like this: Apache has 256 running
> >> > processes and load on the web front end drops to near 0.  PGPool has
> >> > all 32 sockets that face Apache filled, and all 256 sockets that face
> >> > Postgres filled.  Postgres has 256 connections and its load goes to
> >> > near 0.  If you try to connect to PGPool from the command line, it
> >> > will time out in connecting, or sometimes partially connect and then
> >> > receive a connection closed message.
> >> >
> >> > Using our test cluster, I ran some tests that give me high confidence
> >> > that PGPool is actually working correctly, as are Apache and Postgres,
> >> > and that the fundamental problem is just a badly tuned configuration.
> >> > This is the test that shows it best:
> >> >
> >> >
> >> >    1. Stop Apache, restart PGPool
> >> >    2. Start up 100 psql command line clients to connect to PGPool
> >> >       with a single database
> >> >    3. The first 32 psql clients connect and work fine
> >> >    4. The 33rd psql client blocks waiting to connect (it will time out
> >> >       after 30 seconds, but in this test we don't wait that long)
> >> >    5. fg psql client #1, then exit the client, freeing up one of
> >> >       PGPool's connections
> >> >    6. One of the 68 blocked psql clients now gets through and can run
> >> >       queries
> >> >    7. Any of the 32 connected psql clients can get through as well
> >> >
> >> > This shows that PGPool is working as expected.
> >> >
> >> > Now we try a test that is more like the real world:
> >> >
> >> >    1. Restart PGPool
> >> >    2. Start up 10-20 psql command line clients.  These are simulating
> >> >       long running php processes.
> >> >    3. Start the siege web testing tool with 100-200 concurrent
> >> >       requests to Apache (a sample invocation follows after this
> >> >       list).
> >> >    4. At 100 clients, the response time from Apache slows down and
> >> >       the time taken to service each request goes up to around 15s
> >> >       (from < 1s).  The psql command line client can get through most
> >> >       of the time, but it takes some time to connect as it is
> >> >       contending for one of the 32 slots to PGPool with all of the
> >> >       Apache processes.
> >> >    5. At 200 clients, response time goes up more and we start to see
> >> >       failures in Apache, as well as "Could not connect to DB"
> >> >       responses.  The psql command line client will often time out
> >> >       before it gets a connection to PGPool.
> >> >    6. Once lots of failures are happening at the 200 client level,
> >> >       load on Postgres goes to near 0, as does load on Apache.
> >> >    7. Failure will also happen with 250 siege clients and no psql
> >> >       command line clients running.
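> >> >
> >> > (For reference, the kind of siege invocation used in step 3; the URL
> >> > and duration here are just illustrative:)
> >> >
> >> >    siege -c 200 -t 5M http://webfrontend/some/php/page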
> >> >
> >> >
> >> > In step 4, I believe the response time from Apache goes up due to
> >> > PGPool having to spend so much time managing incoming connections
> >> > from Apache as well as managing connections to Postgres.  Database
> >> > load is not high in this case, so the slowness is not due to Postgres
> >> > being overloaded.
> >> >
> >> > I believe that on the live cluster the load is even more severe, as
> >> > there are more databases being used, and occasionally high-load,
> >> > long-running queries.
> >> >
> >> > It's also notable that restarting Apache has been our fix to get
> >> > everything running again.  I believe that this is because PGPool gets
> >> > a chance to catch up, which it does fairly quickly, and resumes with
> >> > 32 available sockets for Apache.  If we do nothing, PGPool reaches a
> >> > 10 minute timeout specified in its config, and closes all 32 sockets,
> >> > which causes everything to resume working again.
> >> >
> >> >
> >> > In the end, I believe the problem is that Apache is just sending too
> >> > many requests to PGPool, and PGPool spends all of its time managing
> >> > connections, causing it to be slow at everything.  That slowness and
> >> > contention for 32 slots among up to 256 Apache processes leads to
> >> > connection timeouts (it should be noted that Apache seems to have no
> >> > connect timeout defined and will wait for a connection until the PHP
> >> > max execution time is reached).  Once a threshold is reached, we enter
> >> > a state where no Apache process is able to connect to PGPool in
> >> > enough time, and we see the browser requests either timing out
> >> > entirely or returning the "Could not connect to DB" message.
> >> >
> >> >
> >> > The proposed solution to all of this is to adjust the configuration
> >> > of PGPool and Apache to ensure that we can never reach this
> >> > overwhelmed state.  Specifically, we need to increase the number of
> >> > PGPool processes and decrease the maximum number of Apache processes.
> >> > We need to be careful as we do this, as there is surely an upper
> >> > limit to how many PGPool processes can be sustained, and increasing
> >> > that number increases overhead on Postgres since it increases the
> >> > number of persistent open connections between it and PGPool.  The
> >> > same goes for Apache: we need to lower MaxClients, but not so low
> >> > that it turns away requests that could have been handled.
> >> >
> >> >
> >> > There are a few other adjustments that I believe will help that I'll
> >> > describe below.
> >> >
> >> > Apache MaxClients:
> >> > This is how many concurrent Apache processes can run at once.  The
> >> > current setting of 256 is clearly more than the system can handle.
> >> > I suggest we drop it down to 128 to begin with and monitor the
> >> > results.  I'd like to make this change before the others.
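> >> >
> >> > For reference, a sketch of what that would look like with the prefork
> >> > MPM (Apache 2.2 directive names; ServerLimit is assumed to be lowered
> >> > to match):
> >> >
> >> >    <IfModule mpm_prefork_module>
> >> >        ServerLimit  128
> >> >        MaxClients   128
> >> >    </IfModule>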
> >> >
> >> > Apache PHP DB connection timeout:
> >> > I can see that it's waiting as long as 150s before returning with
> >> > 'Could not connect to DB' at times, which indicates that no timeout
> >> > is being specified.  This must be sent as part of the connection
> >> > string, like:
> >> > "pgsql:host=127.0.0.1;port=5432;dbname=vw_bepensa;timeout=10".  I'm
> >> > not sure at this point what a reasonable value would be, but I'm
> >> > thinking 10 seconds is a good start.
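> >> >
> >> > As a sketch of the pg_connect() variant mentioned earlier in this
> >> > thread (libpq's parameter is connect_timeout; the host, port, and
> >> > credentials here are illustrative, with 9999 being pgpool's default
> >> > port):
> >> >
> >> >    <?php
> >> >    // Connect through pgpool and give up after 10 seconds instead of
> >> >    // waiting until PHP's max execution time is reached.
> >> >    $conn = pg_connect(
> >> >        'host=127.0.0.1 port=9999 dbname=mydb user=appuser ' .
> >> >        'password=secret connect_timeout=10'
> >> >    );
> >> >    if ($conn === false) {
> >> >        die('Could not connect to DB');
> >> >    }
> >> >
> >> > If the application connects via PDO instead, connect_timeout should
> >> > also be accepted as a DSN parameter, since the pgsql DSN is passed
> >> > through to libpq.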
> >> >
> >> > PGPool backends:
> >> > We currently have 2 backends specified in the config.  One has a
> >> > backend_weight of 1 and the other, which is not used, has a
> >> > backend_weight of 0.  I have confirmed that whenever a client
> >> > connects to PGPool and requests a connection to a database, PGPool
> >> > opens a persistent connection to both backends.  We will comment out
> >> > the backend that specifies the backup server, which should help
> >> > PGPool a lot.
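> >> >
> >> > Roughly, the change in pgpool.conf would look like this (hostnames
> >> > are illustrative):
> >> >
> >> >    backend_hostname0 = 'db-primary'
> >> >    backend_port0     = 5432
> >> >    backend_weight0   = 1
> >> >    # backend_hostname1 = 'db-backup'
> >> >    # backend_port1     = 5432
> >> >    # backend_weight1   = 0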
> >> >
> >> >
> >> > PGPool num_init_children:
> >> > This is the config parameter that specifies how many PGPool processes
> >> > can run, and therefore how many sockets are available to Apache.
> >> > Increasing this number by one increases the number of persistent
> >> > connections to the DB by max_pool, currently 8.  Postgres is
> >> > currently configured to only allow 300 connections maximum, so that
> >> > would need to be changed as well.  More research and testing is
> >> > needed to find the sweet spot.
> >> >
> >> > PGPool max_pool:
> >> > This parameter specifies how many different DBs each PGPool process
> keeps
> >> > in its cache of persistent connections to Postgres.  It is currently
> set
> >> to
> >> > 8, yet we have more than 8 different databases in production (I see 12
> >> > connected right now).  If a connection to a database is requested of
> >> PGPool
> >> > by Apache, and the PGPool process servicing Apache's request does not
> >> have
> >> > a connection to that database, it will drop one and use the slot to
> make
> >> a
> >> > new connection to the requested DB on Postgres.  If max_pool was set
> to
> >> 12,
> >> > this would stop happening and there would always be a persistent
> >> connection
> >> > to the db requested ready to go when requested by apache.  Postgres
> would
> >> > ideally get no new db connections.  Increasing from 8 to 12 would mean
> >> that
> >> > total connections to Postgres would be 32*12 = 384, which is above
> >> > Postgres's connection limit.  So this parameter, max_init_children,
> and
> >> > Postgres's connection limit must all be tuned to eachother, and kept
> low
> >> > enough to not overwhelm Postgres.
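> >> >
> >> > To make the arithmetic concrete, a sketch of how the three settings
> >> > relate (the exact values still need testing; superuser slots noted
> >> > for completeness):
> >> >
> >> >    # pgpool.conf
> >> >    num_init_children = 32
> >> >    max_pool          = 12    # one slot per production database
> >> >
> >> >    # postgresql.conf must then allow at least 32 * 12 = 384
> >> >    # connections, plus superuser_reserved_connections
> >> >    max_connections = 400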
> >> >
> >> >
> >> > I suggest that we begin by commenting out the second backend in
> >> > pgpool.conf, and lowering MaxClients on Apache to 128.  This should
> >> > prevent PGPool being hammered past the point that it can handle.  If
> >> > PGPool does fall behind, only 128 Apache connections will be hitting
> >> > PGPool, and it seems to be able to handle that many in an orderly
> >> > fashion.
> >> >
> >> > I also think adding a PHP connection timeout will help keep the
> >> > system from grinding to a stop.
> >> >
> >> >
> >> > Thank you for reading and any help or insight you can provide!
> >> >
> >> > Justin Cooper
> >>
>

