[pgpool-hackers: 2151] Re: Proposal to make backend node failover mechanism quorum aware

Muhammad Usama m.usama at gmail.com
Thu Mar 16 14:36:11 JST 2017


On Thu, Mar 16, 2017 at 4:14 AM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> > On Fri, Mar 10, 2017 at 11:05 AM, Tatsuo Ishii <ishii at sraoss.co.jp>
> wrote:
> >
> >> Usama,
> >>
> >> I have a question regarding Zone partitioning case described in
> >> section 2 in your proposal.  In my understanding after the network
> >> partitioning happens, Pgpool-II/watchdog in zone 2 will suicide
> >> because they cannot acquire quorum. So split-brain or data
> >> inconsistency due to two master node will not happen in even in
> >> Pgpool-II 3.6. Am I missing something?
> >>
> >
> > With the current design of the watchdog, a Pgpool-II/watchdog node
> > commits suicide in only two cases:
> >
> > 1- When all network interfaces on the machine become unavailable (the
> > machine has lost all its IP addresses).
> > 2- When the up-stream trusted server becomes unreachable (if
> > trusted_servers is configured).
> >
> > So in the zone-partitioning scenario described in section 2, the
> > Pgpool-II nodes in zone 2 will not commit suicide, because neither of
> > the above two conditions for node suicide exists.
> >
> > Also, committing suicide as soon as the cluster loses the quorum
> > doesn't feel like a good option: if we implemented that, all the
> > Pgpool-II nodes would commit suicide as soon as the quorum was lost,
> > the Pgpool-II service would eventually become unavailable, and the
> > administrator would have to restart the Pgpool-II nodes manually.
> > The current implementation makes sure that split-brain does not
> > happen when a quorum is not available
>
> How do you prevent split-brain without a quorum?
>

In a watchdog cluster, a split-brain scenario would mean that more than
one node becomes a delegate-IP holder. To prevent that, a watchdog node
does not acquire the VIP when the quorum is not present in the cluster,
and if the cluster loses the quorum at any point in time, the master
node performs the de-escalation and releases the VIP. This technique
makes sure that at most one delegate-IP-holding watchdog node exists in
the cluster, so split-brain never happens.
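
The VIP arbitration described above can be sketched roughly as follows (a minimal illustration only; the class and method names are hypothetical, not pgpool-II's actual internals):

```python
# Illustrative sketch of quorum-gated VIP handling: acquire the delegate
# IP only with quorum, and de-escalate as soon as quorum is lost.
# All names here are hypothetical, not pgpool-II identifiers.

class WatchdogNode:
    def __init__(self, cluster_size):
        self.cluster_size = cluster_size   # configured watchdog nodes
        self.alive_nodes = cluster_size    # currently reachable nodes
        self.holds_vip = False

    def has_quorum(self):
        # A strict majority of configured nodes must be reachable.
        return self.alive_nodes > self.cluster_size // 2

    def on_master_election(self):
        # Acquire the delegate IP only when the quorum is present.
        if self.has_quorum():
            self.holds_vip = True

    def on_node_lost(self):
        self.alive_nodes -= 1
        # De-escalate: release the VIP as soon as quorum is lost, so at
        # most one delegate-IP holder can ever exist in the cluster.
        if self.holds_vip and not self.has_quorum():
            self.holds_vip = False
```

In a 3-node cluster, the master keeps the VIP after losing one peer (2 of 3 is still a majority) but releases it after losing a second.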

Thanks
Best Regards
Muhammad Usama



> >  and at the same time keeps looking for new or previously lost
> > nodes to rejoin the cluster, to ensure the minimum possible service
> > disruption and to let the cluster recover automatically without any
> > manual intervention.
> >
> >
> > Thanks
> > Best regards
> > Muhammad Usama
> >
> >
> >
> >> Best regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS, Inc. Japan
> >> English: http://www.sraoss.co.jp/index_en.php
> >> Japanese:http://www.sraoss.co.jp
> >>
> >> From: Muhammad Usama <m.usama at gmail.com>
> >> Subject: Re: Proposal to make backend node failover mechanism quorum
> aware
> >> Date: Thu, 9 Mar 2017 00:57:58 +0500
> >> Message-ID: <CAEJvTzXap+qMGLt7SQ-1hPgf=aNuAYEsu_JQYd695hac0WagkA at mail.
> >> gmail.com>
> >>
> >> > Hi
> >> >
> >> > Please use this document. The image quality of the previously shared
> >> > version was not up to the mark.
> >> >
> >> > Thanks
> >> > Best regards
> >> > Muhammad Usama
> >> >
> >> > On Thu, Mar 9, 2017 at 12:53 AM, Muhammad Usama <m.usama at gmail.com>
> >> wrote:
> >> >
> >> >> Hi Ishii-San
> >> >>
> >> >> I have tried to create a detailed proposal to explain why and where
> the
> >> >> quorum aware backend failover mechanism would be useful.
> >> >> Can you please take a look at the attached pdf document and share
> your
> >> >> thoughts.
> >> >>
> >> >> Thanks
> >> >> Kind Regards
> >> >> Muhammad Usama
> >> >>
> >> >>
> >> >> On Wed, Jan 25, 2017 at 2:04 PM, Muhammad Usama <m.usama at gmail.com>
> >> wrote:
> >> >>
> >> >>>
> >> >>>
> >> >>> On Wed, Jan 25, 2017 at 9:05 AM, Tatsuo Ishii <ishii at sraoss.co.jp>
> >> wrote:
> >> >>>
> >> >>>> Usama,
> >> >>>>
> >> >>>> > This is correct. If Pgpool-II is used in master-standby mode
> >> >>>> > (with an elastic or virtual IP, and clients connect to only one
> >> >>>> > Pgpool-II server), then there are not many issues that could be
> >> >>>> > caused by the interruption of the link between AZ1 and AZ2 as
> >> >>>> > you defined above.
> >> >>>> >
> >> >>>> > But the issue arises when Pgpool-II is used in master-master
> >> >>>> > mode (clients connect to all available Pgpool-II nodes); then
> >> >>>> > consider the following scenario.
> >> >>>> >
> >> >>>> > a) The link between AZ1 and AZ2 broke; at that time B1 was the
> >> >>>> > master while B2 was a standby.
> >> >>>> >
> >> >>>> > b) Pgpool-C in AZ2 promotes B2 to master, since Pgpool-C is not
> >> >>>> > able to connect to the old master (B1).
> >> >>>>
> >> >>>> I thought Pgpool-C suicides because it cannot get quorum in this
> >> >>>> case, no?
> >> >>>>
> >> >>>
> >> >>> No, Pgpool-II commits suicide only when it loses all network
> >> >>> connections. Otherwise the master watchdog node is de-escalated
> >> >>> when the quorum is lost.
> >> >>> Committing suicide every time the quorum is lost is very risky and
> >> >>> not feasible, since it would shut down the whole cluster as soon
> >> >>> as the quorum was lost, even because of a small glitch.
> >> >>>
> >> >>>
> >> >>>> > c) A client connects to Pgpool-C and issues a write statement.
> >> >>>> > It will land on the B2 PostgreSQL server, which was promoted to
> >> >>>> > master in step b.
> >> >>>> >
> >> >>>> > c-1) Another client connects to Pgpool-A and also issues a write
> >> >>>> > statement, which will land on the B1 PostgreSQL server, as it is
> >> >>>> > the master node in AZ1.
> >> >>>> >
> >> >>>> > d) The link between AZ1 and AZ2 is restored, but now the
> >> >>>> > PostgreSQL servers B1 and B2 both have different sets of data,
> >> >>>> > with no easy way to get both changes in one place and restore
> >> >>>> > the cluster to its original state.
> >> >>>> >
> >> >>>> > The above scenario becomes more complicated if both availability
> >> >>>> > zones AZ1 and AZ2 have multiple Pgpool-II nodes, since the logic
> >> >>>> > for retiring the multiple Pgpool-II nodes becomes more complex
> >> >>>> > when the link between AZ1 and AZ2 is disrupted.
> >> >>>> >
> >> >>>> > So the proposal tries to solve this by making sure that we
> >> >>>> > always have only one master PostgreSQL node in the cluster and
> >> >>>> > never end up in a situation where we have different sets of data
> >> >>>> > in different PostgreSQL nodes.
> >> >>>> >
> >> >>>> >
> >> >>>> >
> >> >>>> >> > There is also a question ("[pgpool-general: 5179] Architecture
> >> >>>> Questions
> >> >>>> >> > <http://www.sraoss.jp/pipermail/pgpool-general/2016-December
> >> >>>> /005237.html
> >> >>>> >> >")
> >> >>>> >> > posted by a user in pgpool-general mailing list who wants a
> >> similar
> >> >>>> type
> >> >>>> >> of
> >> >>>> >> > network that spans over two AWS availability zones and
> Pgpool-II
> >> >>>> has no
> >> >>>> >> > good answer to avoid split-brain of backend nodes if the
> >> corporate
> >> >>>> link
> >> >>>> >> > between two zones suffers a glitch.
> >> >>>> >>
> >> >>>> >> That seems like a totally different story to me, because there
> >> >>>> >> are two independent streaming replication primary servers in
> >> >>>> >> the east and west regions.
> >> >>>> >>
> >> >>>> > I think the original question statement was a little bit
> >> >>>> > confusing. My understanding of the user's requirements, from
> >> >>>> > later in the thread, is as follows.
> >> >>>> > The user has a couple of PostgreSQL nodes in each of two
> >> >>>> > availability zones (4 PG nodes in total), and all four nodes are
> >> >>>> > part of a single streaming replication setup.
> >> >>>> > Each zone has two Pgpool-II nodes (4 Pgpool-II nodes in total in
> >> >>>> > the cluster).
> >> >>>> > Each availability zone has one application server that connects
> >> >>>> > to one of the two Pgpool-II nodes in that availability zone
> >> >>>> > (which makes it a master-master Pgpool-II setup). And the user
> >> >>>> > is concerned about split-brain of the PostgreSQL servers when
> >> >>>> > the corporate link between the zones becomes unavailable.
> >> >>>> >
> >> >>>> > Thanks
> >> >>>> > Best regards
> >> >>>> > Muhammad Usama
> >> >>>> >
> >> >>>> >
> >> >>>> >
> >> >>>> >> Best regards,
> >> >>>> >> --
> >> >>>> >> Tatsuo Ishii
> >> >>>> >> SRA OSS, Inc. Japan
> >> >>>> >> English: http://www.sraoss.co.jp/index_en.php
> >> >>>> >> Japanese:http://www.sraoss.co.jp
> >> >>>> >>
> >> >>>> >> > Thanks
> >> >>>> >> > Best regards
> >> >>>> >> > Muhammad Usama
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> >
> >> >>>> >> >>
> >> >>>> >> >> Best regards,
> >> >>>> >> >> --
> >> >>>> >> >> Tatsuo Ishii
> >> >>>> >> >> SRA OSS, Inc. Japan
> >> >>>> >> >> English: http://www.sraoss.co.jp/index_en.php
> >> >>>> >> >> Japanese:http://www.sraoss.co.jp
> >> >>>> >> >>
> >> >>>> >> >> >> > Hi Hackers,
> >> >>>> >> >> >> >
> >> >>>> >> >> >> > This is the proposal to make the failover of backend
> >> >>>> PostgreSQL
> >> >>>> >> nodes
> >> >>>> >> >> >> > quorum aware to make it more robust and fault tolerant.
> >> >>>> >> >> >> >
> >> >>>> >> >> >> > Currently Pgpool-II proceeds to fail over a backend
> >> >>>> >> >> >> > node as soon as the health check detects a failure, or
> >> >>>> >> >> >> > when an error occurs on the backend connection (when
> >> >>>> >> >> >> > fail_over_on_backend_error is set). This is good enough
> >> >>>> >> >> >> > for a standalone Pgpool-II server.
> >> >>>> >> >> >> >
> >> >>>> >> >> >> > But consider the scenario where we have more than one
> >> >>>> >> >> >> > Pgpool-II (say Pgpool-A, Pgpool-B and Pgpool-C) in the
> >> >>>> >> >> >> > cluster, connected through the watchdog, and each
> >> >>>> >> >> >> > Pgpool-II node is configured with two PostgreSQL
> >> >>>> >> >> >> > backends (B1 and B2).
> >> >>>> >> >> >> >
> >> >>>> >> >> >> > Now if, due to some network glitch or other issue,
> >> >>>> >> >> >> > Pgpool-A fails or loses its network connection to
> >> >>>> >> >> >> > backend B1, Pgpool-A will detect the failure, detach
> >> >>>> >> >> >> > (fail over) the B1 backend, and also pass this
> >> >>>> >> >> >> > information to the other Pgpool-II nodes (Pgpool-B and
> >> >>>> >> >> >> > Pgpool-C). Although backend B1 was perfectly healthy
> >> >>>> >> >> >> > and also reachable from the Pgpool-B and Pgpool-C
> >> >>>> >> >> >> > nodes, it will still get detached from the cluster
> >> >>>> >> >> >> > because of a network glitch between Pgpool-A and
> >> >>>> >> >> >> > backend B1. The worst part is that if B1 was the master
> >> >>>> >> >> >> > PostgreSQL (in a master-standby configuration), the
> >> >>>> >> >> >> > Pgpool-II failover would also promote the B2 PostgreSQL
> >> >>>> >> >> >> > node as the new master, hence making way for
> >> >>>> >> >> >> > split-brain and/or data corruption.
> >> >>>> >> >> >> >
> >> >>>> >> >> >> > So my proposal is that when the watchdog is configured
> >> >>>> >> >> >> > in Pgpool-II, the backend health check of Pgpool-II
> >> >>>> >> >> >> > should consult the other attached Pgpool-II nodes over
> >> >>>> >> >> >> > the watchdog to decide whether the backend node has
> >> >>>> >> >> >> > actually failed or whether it is just a localized
> >> >>>> >> >> >> > glitch/false alarm. The failover on the node should
> >> >>>> >> >> >> > only be performed when the majority of cluster members
> >> >>>> >> >> >> > agrees on the failure of the node.
> >> >>>> >> >> >> >
> >> >>>> >> >> >> > This quorum-aware failover architecture will prevent
> >> >>>> >> >> >> > false failovers and split-brain scenarios in the
> >> >>>> >> >> >> > backend nodes.
> >> >>>> >> >> >> >
> >> >>>> >> >> >> > What are your thoughts and suggestions on this?
> >> >>>> >> >> >> >
> >> >>>> >> >> >> > Thanks
> >> >>>> >> >> >> > Best regards
> >> >>>> >> >> >> > Muhammad Usama
> >> >>>> >> >> >>
> >> >>>> >> >>
> >> >>>> >>
> >> >>>>
> >> >>>
> >> >>>
> >> >>
> >>
>
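
The quorum-aware failover check proposed in the quoted text could be sketched as a simple majority vote (an illustrative sketch only; `should_failover` and its parameters are hypothetical names, not pgpool-II identifiers):

```python
# Illustrative sketch of the proposed quorum-aware backend failover:
# each Pgpool-II node reports its local health-check verdict, and the
# backend is detached only when a majority of cluster members agrees
# that it is down. All names are hypothetical, not pgpool-II's API.

def should_failover(local_sees_failure, peer_verdicts):
    """local_sees_failure: this node's health-check result.
    peer_verdicts: health-check results gathered over the watchdog
    from the other Pgpool-II nodes (True = backend looks failed)."""
    votes = [local_sees_failure] + list(peer_verdicts)
    failed_votes = sum(votes)  # True counts as 1
    # Fail over only when a strict majority agrees the backend is
    # down; a glitch seen by a single node is treated as a false alarm.
    return failed_votes > len(votes) // 2
```

With three nodes, a failure seen only locally, `should_failover(True, [False, False])`, is rejected as a localized glitch, while agreement from one peer, `should_failover(True, [True, False])`, lets the failover proceed.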

