[pgpool-hackers: 2149] Re: Proposal to make backend node failover mechanism quorum aware

Tatsuo Ishii ishii at sraoss.co.jp
Thu Mar 16 08:14:44 JST 2017


> On Fri, Mar 10, 2017 at 11:05 AM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> Usama,
>>
>> I have a question regarding the zone partitioning case described in
>> section 2 of your proposal. In my understanding, after the network
>> partitioning happens, the Pgpool-II/watchdog nodes in zone 2 will
>> commit suicide because they cannot acquire quorum. So split-brain or
>> data inconsistency due to two master nodes will not happen, even in
>> Pgpool-II 3.6. Am I missing something?
>>
> 
> With the current design of the watchdog, Pgpool-II/watchdog commits
> suicide in only two cases:
> 
> 1- When all network interfaces on the machine become unavailable (the
> machine has lost all its IP addresses).
> 2- When the up-stream trusted servers become unreachable (if
> trusted_servers is configured).
> 
> So in the zone partitioning scenario described in section 2, the
> Pgpool-II nodes in zone 2 will not commit suicide, because neither of
> the above two conditions for node suicide exists.
> 
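
(For reference, the two conditions described above could be sketched
roughly as follows, in pseudo-Python with hypothetical names; this is
not the actual watchdog source, which is written in C.)

    def host_is_reachable(host):
        """Placeholder for a reachability probe (e.g. an ICMP ping)."""
        raise NotImplementedError

    def should_commit_suicide(active_ip_addresses, trusted_servers):
        # Condition 1: the machine has lost every network interface,
        # i.e. it no longer holds any IP address.
        if not active_ip_addresses:
            return True
        # Condition 2: trusted_servers is configured, and none of the
        # configured up-stream servers is reachable.
        if trusted_servers and not any(host_is_reachable(s)
                                       for s in trusted_servers):
            return True
        # Nothing else (including loss of quorum) triggers suicide;
        # the node stays alive.
        return False
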
> Also, committing suicide as soon as the cluster loses the quorum does
> not feel like a good option, because if we implemented that, we would
> end up with all the Pgpool-II nodes committing suicide as soon as the
> quorum is lost in the cluster; eventually the Pgpool-II service would
> become unavailable, and the administrator would have to restart the
> Pgpool-II nodes manually. The current implementation makes sure that
> split-brain does not happen when a quorum is not available

How do you prevent split-brain without a quorum?

>  and at the same time keeps looking for new or previously lost nodes
> to rejoin the cluster, to ensure the minimum possible service
> disruption and to let the cluster recover automatically without any
> manual intervention.
> 
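
The recovery behaviour described above could be sketched like this
(again pseudo-Python with purely illustrative names, a sketch of the
idea rather than the actual implementation): on quorum loss the master
de-escalates instead of shutting down, and the cluster keeps polling
for lost nodes so it can heal itself.

    import time

    def watchdog_main_loop(cluster):
        while True:
            if cluster.has_quorum():
                # Quorum is available: (re-)elect a master watchdog
                # node if needed, so the service resumes without any
                # administrator action.
                cluster.elect_master_if_needed()
            elif cluster.local_node.is_master:
                # Quorum lost: de-escalate rather than shut down.
                # Releasing the delegate/virtual IP keeps two
                # partitions from both advertising it, while the
                # process itself stays alive.
                cluster.local_node.release_delegate_ip()
            # Keep looking for new or previously lost nodes to
            # rejoin the cluster.
            cluster.probe_for_missing_nodes()
            time.sleep(cluster.poll_interval)
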
> 
> Thanks
> Best regards
> Muhammad Usama
> 
> 
> 
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>>
>> From: Muhammad Usama <m.usama at gmail.com>
>> Subject: Re: Proposal to make backend node failover mechanism quorum aware
>> Date: Thu, 9 Mar 2017 00:57:58 +0500
>> Message-ID:
>> <CAEJvTzXap+qMGLt7SQ-1hPgf=aNuAYEsu_JQYd695hac0WagkA at mail.gmail.com>
>>
>> > Hi
>> >
>> > Please use this document. The image quality of the previously shared
>> > version was not up to the mark.
>> >
>> > Thanks
>> > Best regards
>> > Muhammad Usama
>> >
>> > On Thu, Mar 9, 2017 at 12:53 AM, Muhammad Usama <m.usama at gmail.com>
>> > wrote:
>> >
>> >> Hi Ishii-San
>> >>
>> >> I have tried to create a detailed proposal to explain why and where
>> >> the quorum-aware backend failover mechanism would be useful.
>> >> Can you please take a look at the attached pdf document and share your
>> >> thoughts?
>> >>
>> >> Thanks
>> >> Kind Regards
>> >> Muhammad Usama
>> >>
>> >>
>> >> On Wed, Jan 25, 2017 at 2:04 PM, Muhammad Usama <m.usama at gmail.com>
>> >> wrote:
>> >>
>> >>>
>> >>>
>> >>> On Wed, Jan 25, 2017 at 9:05 AM, Tatsuo Ishii <ishii at sraoss.co.jp>
>> >>> wrote:
>> >>>
>> >>>> Usama,
>> >>>>
>> >>>> > This is correct. If Pgpool-II is used in master-standby mode (with
>> >>>> > an elastic or virtual IP, and clients connect to only one Pgpool-II
>> >>>> > server), then there are not many issues that could be caused by the
>> >>>> > interruption of the link between AZ1 and AZ2 as you defined above.
>> >>>> >
>> >>>> > But the issue arises when Pgpool-II is used in master-master mode
>> >>>> > (clients connect to all available Pgpool-II nodes). Then consider
>> >>>> > the following scenario.
>> >>>> >
>> >>>> > a) The link between AZ1 and AZ2 breaks; at that time, B1 was the
>> >>>> > master while B2 was the standby.
>> >>>> >
>> >>>> > b) Pgpool-C in AZ2 promotes B2 to master, since Pgpool-C is not
>> >>>> > able to connect to the old master (B1).
>> >>>>
>> >>>> I thought Pgpool-C commits suicide because it cannot get quorum in
>> >>>> this case, no?
>> >>>>
>> >>>
>> >>> No, Pgpool-II commits suicide only when it loses all network
>> >>> connections. Otherwise, the master watchdog node is de-escalated when
>> >>> the quorum is lost.
>> >>> Committing suicide every time the quorum is lost is very risky and
>> >>> not feasible, since it would shut down the whole cluster as soon as
>> >>> the quorum is lost, even because of a small glitch.
>> >>>
>> >>>
>> >>>> > c) A client connects to Pgpool-C and issues a write statement. It
>> >>>> > will land on the B2 PostgreSQL server, which was promoted to master
>> >>>> > in step b.
>> >>>> >
>> >>>> > c-1) Another client connects to Pgpool-A and also issues a write
>> >>>> > statement, which will land on the B1 PostgreSQL server, as it is
>> >>>> > the master node in AZ1.
>> >>>> >
>> >>>> > d) The link between AZ1 and AZ2 is restored, but now the PostgreSQL
>> >>>> > servers B1 and B2 both have different sets of data, with no easy
>> >>>> > way to get both changes in one place and restore the cluster to its
>> >>>> > original state.
>> >>>> >
>> >>>> > The above scenario becomes more complicated if both availability
>> >>>> > zones AZ1 and AZ2 have multiple Pgpool-II nodes, since the logic
>> >>>> > for retiring multiple Pgpool-II nodes becomes more complex when the
>> >>>> > link between AZ1 and AZ2 is disrupted.
>> >>>> >
>> >>>> > So the proposal tries to solve this by making sure that we always
>> >>>> > have only one master PostgreSQL node in the cluster, and never end
>> >>>> > up in a situation where we have different sets of data on different
>> >>>> > PostgreSQL nodes.
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >> > There is also a question ("[pgpool-general: 5179] Architecture
>> >>>> >> > Questions
>> >>>> >> > <http://www.sraoss.jp/pipermail/pgpool-general/2016-December/005237.html>")
>> >>>> >> > posted by a user on the pgpool-general mailing list who wants a
>> >>>> >> > similar type of network that spans two AWS availability zones,
>> >>>> >> > and Pgpool-II has no good answer to avoid split-brain of the
>> >>>> >> > backend nodes if the corporate link between the two zones
>> >>>> >> > suffers a glitch.
>> >>>> >>
>> >>>> >> That seems a totally different story to me, because there are two
>> >>>> >> independent streaming replication primary servers in the east and
>> >>>> >> west regions.
>> >>>> >>
>> >>>> >>
>> >>>> > I think the original question statement was a little bit confusing.
>> >>>> > As I understood the user's requirements later in the thread:
>> >>>> > The user has a couple of PostgreSQL nodes in each of two
>> >>>> > availability zones (4 PG nodes in total), and all four nodes are
>> >>>> > part of a single streaming replication setup.
>> >>>> > Both zones have two Pgpool-II nodes each (4 Pgpool-II nodes in the
>> >>>> > cluster in total).
>> >>>> > Each availability zone has one application server that connects to
>> >>>> > one of the two Pgpool-II nodes in that availability zone. (That
>> >>>> > makes it a master-master Pgpool-II setup.) And the user is
>> >>>> > concerned about split-brain of the PostgreSQL servers when the
>> >>>> > corporate link between the zones becomes unavailable.
>> >>>> >
>> >>>> > Thanks
>> >>>> > Best regards
>> >>>> > Muhammad Usama
>> >>>> >
>> >>>> >
>> >>>> >
>> >>>> >> Best regards,
>> >>>> >> --
>> >>>> >> Tatsuo Ishii
>> >>>> >> SRA OSS, Inc. Japan
>> >>>> >> English: http://www.sraoss.co.jp/index_en.php
>> >>>> >> Japanese:http://www.sraoss.co.jp
>> >>>> >>
>> >>>> >> > Thanks
>> >>>> >> > Best regards
>> >>>> >> > Muhammad Usama
>> >>>> >> >
>> >>>> >> >
>> >>>> >> >
>> >>>> >> >>
>> >>>> >> >> Best regards,
>> >>>> >> >> --
>> >>>> >> >> Tatsuo Ishii
>> >>>> >> >> SRA OSS, Inc. Japan
>> >>>> >> >> English: http://www.sraoss.co.jp/index_en.php
>> >>>> >> >> Japanese:http://www.sraoss.co.jp
>> >>>> >> >>
>> >>>> >> >> >> > Hi Hackers,
>> >>>> >> >> >> >
>> >>>> >> >> >> > This is the proposal to make the failover of backend
>> >>>> >> >> >> > PostgreSQL nodes quorum aware, to make it more robust
>> >>>> >> >> >> > and fault tolerant.
>> >>>> >> >> >> >
>> >>>> >> >> >> > Currently, Pgpool-II proceeds to fail over a backend
>> >>>> >> >> >> > node as soon as the health check detects a failure, or
>> >>>> >> >> >> > when an error occurs on the backend connection (if
>> >>>> >> >> >> > fail_over_on_backend_error is set). This is good enough
>> >>>> >> >> >> > for a standalone Pgpool-II server.
>> >>>> >> >> >> >
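
(A minimal sketch of this current behaviour, in pseudo-Python with
hypothetical names: a single node's local health-check failure
triggers the failover immediately, with no consultation.)

    def on_local_health_check_failure(pgpool, backend_id):
        # The local observation alone is trusted, so one node's
        # network glitch is enough to detach the backend.
        pgpool.detach_backend(backend_id)      # immediate failover
        if pgpool.backend_is_master(backend_id):
            pgpool.promote_new_master()        # may promote a standby
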
>> >>>> >> >> >> > But consider the scenario where we have more than one
>> >>>> >> >> >> > Pgpool-II (say Pgpool-A, Pgpool-B, and Pgpool-C) in the
>> >>>> >> >> >> > cluster, connected through the watchdog, and each
>> >>>> >> >> >> > Pgpool-II node is configured with two PostgreSQL
>> >>>> >> >> >> > backends (B1 and B2).
>> >>>> >> >> >> >
>> >>>> >> >> >> > Now if, due to some network glitch or other issue,
>> >>>> >> >> >> > Pgpool-A fails to reach backend B1 or loses its network
>> >>>> >> >> >> > connection to it, Pgpool-A will detect the failure,
>> >>>> >> >> >> > detach (fail over) the B1 backend, and also pass this
>> >>>> >> >> >> > information to the other Pgpool-II nodes (Pgpool-B and
>> >>>> >> >> >> > Pgpool-C). Although backend B1 was perfectly healthy,
>> >>>> >> >> >> > and it was also reachable from the Pgpool-B and Pgpool-C
>> >>>> >> >> >> > nodes, it will still get detached from the cluster
>> >>>> >> >> >> > because of a network glitch between Pgpool-A and backend
>> >>>> >> >> >> > B1. And the worst part is that if B1 was the master
>> >>>> >> >> >> > PostgreSQL node (in a master-standby configuration), the
>> >>>> >> >> >> > Pgpool-II failover would also promote the B2 PostgreSQL
>> >>>> >> >> >> > node as the new master, hence opening the way for
>> >>>> >> >> >> > split-brain and/or data corruption.
>> >>>> >> >> >> >
>> >>>> >> >> >> > So my proposal is that when the watchdog is configured
>> >>>> >> >> >> > in Pgpool-II, the backend health check of Pgpool-II
>> >>>> >> >> >> > should consult the other attached Pgpool-II nodes over
>> >>>> >> >> >> > the watchdog to decide whether the backend node has
>> >>>> >> >> >> > actually failed or whether it is just a localized
>> >>>> >> >> >> > glitch/false alarm. And the failover of the node should
>> >>>> >> >> >> > only be performed when a majority of the cluster members
>> >>>> >> >> >> > agrees on the failure of the node.
>> >>>> >> >> >> >
>> >>>> >> >> >> > This quorum-aware failover architecture will prevent
>> >>>> >> >> >> > false failovers and split-brain scenarios in the backend
>> >>>> >> >> >> > nodes.
>> >>>> >> >> >> >
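
(A minimal sketch of the proposed decision, in pseudo-Python with
hypothetical names. Counting the majority over all configured cluster
members, rather than only the reachable ones, is one possible choice
that keeps a minority partition from acting on its own.)

    def backend_failover_allowed(cluster, backend_id):
        # Ask every watchdog-attached member, including the local
        # node, to health-check the backend from its own side.
        votes_down = sum(1 for member in cluster.alive_members()
                         if not member.backend_is_healthy(backend_id))
        # Proceed only if a majority of all cluster members agrees
        # that the backend has failed.
        return votes_down > cluster.total_members() // 2

    def on_health_check_failure(cluster, backend_id):
        if backend_failover_allowed(cluster, backend_id):
            cluster.detach_backend(backend_id)   # perform failover
        else:
            # Localized glitch or false alarm: keep the backend
            # attached and let the health check retry later.
            pass
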
>> >>>> >> >> >> > What are your thoughts and suggestions on this?
>> >>>> >> >> >> >
>> >>>> >> >> >> > Thanks
>> >>>> >> >> >> > Best regards
>> >>>> >> >> >> > Muhammad Usama
>> >>>> >> >> >>
>> >>>> >> >>
>> >>>> >>
>> >>>>
>> >>>
>> >>>
>> >>
>>

