[pgpool-hackers: 2112] Re: Proposal to make backend node failover mechanism quorum aware

Muhammad Usama m.usama at gmail.com
Fri Mar 10 17:07:15 JST 2017


On Fri, Mar 10, 2017 at 11:05 AM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> Usama,
>
> I have a question regarding the zone partitioning case described in
> section 2 of your proposal.  In my understanding, after the network
> partitioning happens, Pgpool-II/watchdog in zone 2 will suicide
> because they cannot acquire the quorum. So split-brain or data
> inconsistency due to two master nodes will not happen even in
> Pgpool-II 3.6. Am I missing something?
>

With the current watchdog design, Pgpool-II commits suicide in only two
cases.

1- When all network interfaces on the machine become unavailable (the
machine loses all its IP addresses).
2- When the connection to the upstream trusted servers becomes unreachable
(if trusted_servers is configured).

So in the zone partitioning scenario described in section 2, the Pgpool-II
nodes in zone 2 will not commit suicide, because neither of the above two
conditions for node suicide exists.
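
For reference, a minimal sketch of the pgpool.conf setting behind case 2
(the hostnames below are placeholder values, not from this thread):

```
# pgpool.conf -- illustrative values only
# Comma-separated list of upstream servers the watchdog pings.
# If all of them become unreachable, the node assumes that it is
# the one that has been isolated and commits suicide (case 2 above).
trusted_servers = 'gateway.example.com,dns1.example.com'
```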

Also, committing suicide as soon as the cluster loses the quorum doesn't
feel like a good option, because if we implemented that, all the Pgpool-II
nodes would commit suicide as soon as the quorum was lost, the Pgpool-II
service would eventually become unavailable, and the administrator would
have to restart the Pgpool-II nodes manually. The current implementation
makes sure that split-brain does not happen while the quorum is
unavailable, and at the same time keeps looking for new or lost nodes to
rejoin the cluster, so that service disruption is kept to a minimum and
the cluster recovers automatically without any manual intervention.
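
As an aside, the majority rule behind the quorum can be sketched like this
(a simplified model for illustration only, not the actual pgpool-II source,
which is written in C):

```python
def has_quorum(alive_nodes: int, total_nodes: int) -> bool:
    """The cluster holds the quorum when strictly more than half of
    the configured watchdog nodes are alive and connected."""
    return alive_nodes > total_nodes // 2

# With 3 Pgpool-II nodes, losing one node keeps the quorum (2 > 1),
# while losing two loses it (1 > 1 is false).
print(has_quorum(2, 3))  # True
print(has_quorum(1, 3))  # False
```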


Thanks
Best regards
Muhammad Usama



> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
>
> From: Muhammad Usama <m.usama at gmail.com>
> Subject: Re: Proposal to make backend node failover mechanism quorum aware
> Date: Thu, 9 Mar 2017 00:57:58 +0500
> Message-ID: <CAEJvTzXap+qMGLt7SQ-1hPgf=aNuAYEsu_JQYd695hac0WagkA at mail.gmail.com>
>
> > Hi
> >
> > Please use this document. The image quality of the previously shared
> > version was not up to the mark.
> >
> > Thanks
> > Best regards
> > Muhammad Usama
> >
> > On Thu, Mar 9, 2017 at 12:53 AM, Muhammad Usama <m.usama at gmail.com>
> wrote:
> >
> >> Hi Ishii-San
> >>
> >> I have tried to create a detailed proposal to explain why and where the
> >> quorum aware backend failover mechanism would be useful.
> >> Can you please take a look at the attached pdf document and share your
> >> thoughts.
> >>
> >> Thanks
> >> Kind Regards
> >> Muhammad Usama
> >>
> >>
> >> On Wed, Jan 25, 2017 at 2:04 PM, Muhammad Usama <m.usama at gmail.com>
> wrote:
> >>
> >>>
> >>>
> >>> On Wed, Jan 25, 2017 at 9:05 AM, Tatsuo Ishii <ishii at sraoss.co.jp>
> wrote:
> >>>
> >>>> Usama,
> >>>>
> >>>> > This is correct. If Pgpool-II is used in master-standby mode (with
> >>>> > an elastic or virtual IP, and clients only connect to one Pgpool-II
> >>>> > server), then there are not many issues that could be caused by the
> >>>> > interruption of the link between AZ1 and AZ2 as you defined above.
> >>>> >
> >>>> > But the issue arises when Pgpool-II is used in master-master mode
> >>>> > (clients connect to all available Pgpool-II nodes); then consider
> >>>> > the following scenario.
> >>>> >
> >>>> > a) The link between AZ1 and AZ2 broke; at that time B1 was master
> >>>> > while B2 was standby.
> >>>> >
> >>>> > b) Pgpool-C in AZ2 promotes B2 to master, since Pgpool-C is not
> >>>> > able to connect to the old master (B1).
> >>>>
> >>>> I thought Pgpool-C suicides because it cannot get the quorum in this
> >>>> case, no?
> >>>>
> >>>
> >>> No, Pgpool-II commits suicide only when it loses all network
> >>> connections. Otherwise, the master watchdog node is de-escalated when
> >>> the quorum is lost.
> >>> Committing suicide every time the quorum is lost is very risky and
> >>> not feasible, since it would shut down the whole cluster as soon as
> >>> the quorum is lost, even because of a small glitch.
> >>>
> >>>
> >>>> > c) A client connects to Pgpool-C and issues a write statement. It
> >>>> > will land on the B2 PostgreSQL server, which was promoted to
> >>>> > master in step b.
> >>>> >
> >>>> > c-1) Another client connects to Pgpool-A and also issues a write
> >>>> > statement, which will land on the B1 PostgreSQL server, as it is
> >>>> > the master node in its AZ.
> >>>> >
> >>>> > d) The link between AZ1 and AZ2 is restored, but now the
> >>>> > PostgreSQL nodes B1 and B2 hold different sets of data, with no
> >>>> > easy way to get both changes in one place and restore the cluster
> >>>> > to its original state.
> >>>> >
> >>>> > The above scenario becomes more complicated if both availability
> >>>> > zones AZ1 and AZ2 have multiple Pgpool-II nodes, since the logic
> >>>> > for retiring multiple Pgpool-II nodes becomes more complex when
> >>>> > the link between AZ1 and AZ2 is disrupted.
> >>>> >
> >>>> > So the proposal tries to solve this by making sure that we always
> >>>> > have only one master PostgreSQL node in the cluster and never end
> >>>> > up in a situation where different PostgreSQL nodes hold different
> >>>> > sets of data.
> >>>> >
> >>>> >
> >>>> >
> >>>> >> > There is also a question ("[pgpool-general: 5179] Architecture
> >>>> >> > Questions
> >>>> >> > <http://www.sraoss.jp/pipermail/pgpool-general/2016-December/005237.html>")
> >>>> >> > posted by a user on the pgpool-general mailing list, who wants
> >>>> >> > a similar type of network that spans two AWS availability
> >>>> >> > zones, and Pgpool-II has no good answer to avoid split-brain of
> >>>> >> > the backend nodes if the corporate link between the two zones
> >>>> >> > suffers a glitch.
> >>>> >>
> >>>> >> That seems a totally different story to me, because there are
> >>>> >> two independent streaming replication primary servers in the
> >>>> >> east and west regions.
> >>>> >>
> >>>> > I think the original question statement was a little bit
> >>>> > confusing. My understanding of the user's requirements, from later
> >>>> > in the thread, is as follows.
> >>>> > The user has two PostgreSQL nodes in each of two availability
> >>>> > zones (four PG nodes in total), and all four nodes are part of a
> >>>> > single streaming replication setup.
> >>>> > Each zone has two Pgpool-II nodes (four Pgpool-II nodes in the
> >>>> > cluster in total).
> >>>> > Each availability zone has one application server that connects
> >>>> > to one of the two Pgpool-II nodes in that availability zone. (That
> >>>> > makes it a master-master Pgpool setup.) And the user is concerned
> >>>> > about split-brain of the PostgreSQL servers when the corporate
> >>>> > link between the zones becomes unavailable.
> >>>> >
> >>>> > Thanks
> >>>> > Best regards
> >>>> > Muhammad Usama
> >>>> >
> >>>> >
> >>>> >
> >>>> >> Best regards,
> >>>> >> --
> >>>> >> Tatsuo Ishii
> >>>> >> SRA OSS, Inc. Japan
> >>>> >> English: http://www.sraoss.co.jp/index_en.php
> >>>> >> Japanese:http://www.sraoss.co.jp
> >>>> >>
> >>>> >> > Thanks
> >>>> >> > Best regards
> >>>> >> > Muhammad Usama
> >>>> >> >
> >>>> >> >
> >>>> >> >
> >>>> >> >>
> >>>> >> >> Best regards,
> >>>> >> >> --
> >>>> >> >> Tatsuo Ishii
> >>>> >> >> SRA OSS, Inc. Japan
> >>>> >> >> English: http://www.sraoss.co.jp/index_en.php
> >>>> >> >> Japanese:http://www.sraoss.co.jp
> >>>> >> >>
> >>>> >> >> >> > Hi Hackers,
> >>>> >> >> >> >
> >>>> >> >> >> > This is a proposal to make the failover of backend
> >>>> >> >> >> > PostgreSQL nodes quorum aware, to make it more robust
> >>>> >> >> >> > and fault tolerant.
> >>>> >> >> >> >
> >>>> >> >> >> > Currently, Pgpool-II proceeds to fail over the backend
> >>>> >> >> >> > node as soon as the health check detects the failure, or
> >>>> >> >> >> > when an error occurs on the backend connection (if
> >>>> >> >> >> > fail_over_on_backend_error is set). This is good enough
> >>>> >> >> >> > for a standalone Pgpool-II server.
> >>>> >> >> >> >
> >>>> >> >> >> > But consider the scenario where we have more than one
> >>>> >> >> >> > Pgpool-II (say Pgpool-A, Pgpool-B and Pgpool-C) in the
> >>>> >> >> >> > cluster, connected through the watchdog, and each
> >>>> >> >> >> > Pgpool-II node is configured with two PostgreSQL backends
> >>>> >> >> >> > (B1 and B2).
> >>>> >> >> >> >
> >>>> >> >> >> > Now if, due to some network glitch or other issue,
> >>>> >> >> >> > Pgpool-A fails or loses its network connection with
> >>>> >> >> >> > backend B1, Pgpool-A will detect the failure, detach
> >>>> >> >> >> > (fail over) the B1 backend, and also pass this
> >>>> >> >> >> > information to the other Pgpool-II nodes (Pgpool-B and
> >>>> >> >> >> > Pgpool-C). Although backend B1 was perfectly healthy, and
> >>>> >> >> >> > it was also reachable from the Pgpool-B and Pgpool-C
> >>>> >> >> >> > nodes, it will still get detached from the cluster
> >>>> >> >> >> > because of a network glitch between Pgpool-A and backend
> >>>> >> >> >> > B1. And the worst part is, if B1 was the master
> >>>> >> >> >> > PostgreSQL node (in master-standby configuration), the
> >>>> >> >> >> > Pgpool-II failover would also promote the B2 PostgreSQL
> >>>> >> >> >> > node as the new master, hence making way for split-brain
> >>>> >> >> >> > and/or data corruption.
> >>>> >> >> >> >
> >>>> >> >> >> > So my proposal is that when the watchdog is configured
> >>>> >> >> >> > in Pgpool-II, the backend health check of Pgpool-II
> >>>> >> >> >> > should consult with the other attached Pgpool-II nodes
> >>>> >> >> >> > over the watchdog to decide whether the backend node has
> >>>> >> >> >> > actually failed or it is just a localized glitch/false
> >>>> >> >> >> > alarm. And the failover of the node should only be
> >>>> >> >> >> > performed when the majority of cluster members agrees on
> >>>> >> >> >> > the failure of the node.
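
To illustrate the proposed behaviour quoted above, the majority-agreement
check could look roughly like this (the function and parameter names are
hypothetical, for illustration only, and not part of any actual
implementation):

```python
def confirm_backend_failure(local_check_failed: bool,
                            peer_checks: list[bool]) -> bool:
    """Fail over a backend only when a majority of all watchdog
    cluster members (the local node plus its peers) report the
    backend as failed; a single local report is treated as a
    possible glitch or false alarm."""
    votes = int(local_check_failed) + sum(peer_checks)
    cluster_size = 1 + len(peer_checks)
    return votes > cluster_size // 2

# Only Pgpool-A sees B1 as failed (1 of 3 votes): no failover.
print(confirm_backend_failure(True, [False, False]))  # False
# Two of three members agree on the failure: the failover proceeds.
print(confirm_backend_failure(True, [True, False]))   # True
```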
> >>>> >> >> >> >
> >>>> >> >> >> > This quorum-aware failover architecture will prevent
> >>>> >> >> >> > false failovers and split-brain scenarios in the backend
> >>>> >> >> >> > nodes.
> >>>> >> >> >> >
> >>>> >> >> >> > What are your thoughts and suggestions on this?
> >>>> >> >> >> >
> >>>> >> >> >> > Thanks
> >>>> >> >> >> > Best regards
> >>>> >> >> >> > Muhammad Usama
> >>>> >> >> >>
> >>>> >> >>
> >>>> >>
> >>>>
> >>>
> >>>
> >>
>

