[pgpool-hackers: 2109] Re: Proposal to make backend node failover mechanism quorum aware

Tatsuo Ishii ishii at sraoss.co.jp
Fri Mar 10 15:05:32 JST 2017


Usama,

I have a question regarding the zone partitioning case described in
section 2 of your proposal.  In my understanding, after the network
partition happens, the Pgpool-II/watchdog nodes in zone 2 will commit
suicide because they cannot acquire quorum. So split-brain or data
inconsistency due to two master nodes will not happen even in
Pgpool-II 3.6. Am I missing something?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

From: Muhammad Usama <m.usama at gmail.com>
Subject: Re: Proposal to make backend node failover mechanism quorum aware
Date: Thu, 9 Mar 2017 00:57:58 +0500
Message-ID: <CAEJvTzXap+qMGLt7SQ-1hPgf=aNuAYEsu_JQYd695hac0WagkA at mail.gmail.com>

> Hi
> 
> Please use this document. The image quality of the previously shared
> version was not up to the mark.
> 
> Thanks
> Best regards
> Muhammad Usama
> 
> On Thu, Mar 9, 2017 at 12:53 AM, Muhammad Usama <m.usama at gmail.com> wrote:
> 
>> Hi Ishii-San
>>
>> I have tried to create a detailed proposal to explain why and where the
>> quorum aware backend failover mechanism would be useful.
>> Can you please take a look at the attached pdf document and share your
>> thoughts.
>>
>> Thanks
>> Kind Regards
>> Muhammad Usama
>>
>>
>> On Wed, Jan 25, 2017 at 2:04 PM, Muhammad Usama <m.usama at gmail.com> wrote:
>>
>>>
>>>
>>> On Wed, Jan 25, 2017 at 9:05 AM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>>>
>>>> Usama,
>>>>
>>>> > This is correct. If Pgpool-II is used in master-standby mode (with
>>>> > an elastic or virtual IP, and clients connect to only one Pgpool-II
>>>> > server) then there are not many issues that could be caused by the
>>>> > interruption of the link between AZ1 and AZ2 as you defined above.
>>>> >
>>>> > But the issue arises when Pgpool-II is used in master-master mode
>>>> > (clients connect to all available Pgpool-II nodes). Consider the
>>>> > following scenario.
>>>> >
>>>> > a) The link between AZ1 and AZ2 broke; at that time B1 was the
>>>> > master while B2 was the standby.
>>>> >
>>>> > b) Pgpool-C in AZ2 promotes B2 to master since Pgpool-C is not able
>>>> > to connect to the old master (B1).
>>>>
>>>> I thought Pgpool-C commits suicide because it cannot get quorum in
>>>> this case, no?
>>>>
>>>
>>> No, Pgpool-II commits suicide only when it loses all of its network
>>> connections. Otherwise the master watchdog node is de-escalated when
>>> the quorum is lost.
>>> Committing suicide every time the quorum is lost is very risky and not
>>> feasible, since it would shut down the whole cluster as soon as the
>>> quorum is lost, even because of a small glitch.
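>>>
>>> To illustrate, here is a minimal Python sketch of the behaviour
>>> described above (not actual Pgpool-II code; the names are
>>> hypothetical):
>>>
>>>     def react_to_membership_change(has_network, has_quorum, is_coordinator):
>>>         """What a watchdog node does when cluster membership changes."""
>>>         if not has_network:
>>>             # Lost every network connection: the node shuts itself down.
>>>             return "shutdown"
>>>         if not has_quorum and is_coordinator:
>>>             # Quorum lost but the node is still reachable: the master
>>>             # watchdog node only de-escalates (e.g. gives up the
>>>             # virtual IP) and keeps running.
>>>             return "de-escalate"
>>>         # Quorum present (or a plain standby node): carry on as usual.
>>>         return "continue"
>>>
>>>     # A transient quorum loss does not shut the whole cluster down:
>>>     print(react_to_membership_change(True, False, True))   # de-escalate
>>>     print(react_to_membership_change(False, False, True))  # shutdown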
>>>
>>>
>>>> > c) A client connects to Pgpool-C and issues a write statement. It
>>>> > will land on the B2 PostgreSQL server, which was promoted to master
>>>> > in step b.
>>>> >
>>>> > c-1) Another client connects to Pgpool-A and also issues a write
>>>> > statement that will land on the B1 PostgreSQL server, as it is the
>>>> > master node in AZ1.
>>>> >
>>>> > d) The link between AZ1 and AZ2 is restored, but now the PostgreSQL
>>>> > servers B1 and B2 both have different sets of data, with no easy way
>>>> > to get both changes in one place and restore the cluster to its
>>>> > original state.
>>>> >
>>>> > The above scenario will become more complicated if both availability
>>>> > zones AZ1 and AZ2 have multiple Pgpool-II nodes, since the logic for
>>>> > retiring the multiple Pgpool-II nodes becomes more complex when the
>>>> > link between AZ1 and AZ2 is disrupted.
>>>> >
>>>> > So the proposal tries to solve this by making sure that we always
>>>> > have only one master PostgreSQL node in the cluster and never end up
>>>> > in a situation where we have different sets of data on different
>>>> > PostgreSQL nodes.
>>>> >
>>>> >
>>>> >
>>>> >> > There is also a question ("[pgpool-general: 5179] Architecture
>>>> >> > Questions
>>>> >> > <http://www.sraoss.jp/pipermail/pgpool-general/2016-December/005237.html>")
>>>> >> > posted by a user on the pgpool-general mailing list who wants a
>>>> >> > similar type of network that spans two AWS availability zones,
>>>> >> > and Pgpool-II has no good answer to avoid split-brain of backend
>>>> >> > nodes if the corporate link between the two zones suffers a
>>>> >> > glitch.
>>>> >>
>>>> >> That seems like a totally different story to me because there are
>>>> >> two independent streaming replication primary servers in the east
>>>> >> and west regions.
>>>> >>
>>>> >>
>>>> > I think the original question statement was a little bit confusing.
>>>> > As I understand the user's requirements later in the thread:
>>>> > The user has a couple of PostgreSQL nodes in each of two
>>>> > availability zones (4 PG nodes in total) and all four nodes are part
>>>> > of a single streaming replication setup.
>>>> > Both zones have two Pgpool-II nodes each (4 Pgpool-II nodes in the
>>>> > cluster in total).
>>>> > Each availability zone has one application server that connects to
>>>> > one of the two Pgpool-II nodes in that availability zone. (That
>>>> > makes it a master-master Pgpool-II setup.) And the user is concerned
>>>> > about split-brain of the PostgreSQL servers when the corporate link
>>>> > between the zones becomes unavailable.
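>>>> >
>>>> > Just to make the quorum arithmetic for this topology concrete, here
>>>> > is an illustrative Python sketch assuming a strict-majority rule
>>>> > (how Pgpool-II actually counts an exact 2-2 split may differ):
>>>> >
>>>> >     TOTAL_PGPOOL_NODES = 4          # two Pgpool-II nodes per zone
>>>> >
>>>> >     def has_quorum(reachable_nodes):
>>>> >         # Strict majority of all watchdog nodes must be reachable.
>>>> >         return reachable_nodes > TOTAL_PGPOOL_NODES // 2
>>>> >
>>>> >     print(has_quorum(4))  # True  - link up, every node sees the others
>>>> >     print(has_quorum(2))  # False - 2-2 split after the link breaks
>>>> >
>>>> > Under this rule, neither zone holds a majority on its own once the
>>>> > link breaks.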
>>>> >
>>>> > Thanks
>>>> > Best regards
>>>> > Muhammad Usama
>>>> >
>>>> >
>>>> >
>>>> >> Best regards,
>>>> >> --
>>>> >> Tatsuo Ishii
>>>> >> SRA OSS, Inc. Japan
>>>> >> English: http://www.sraoss.co.jp/index_en.php
>>>> >> Japanese:http://www.sraoss.co.jp
>>>> >>
>>>> >> > Thanks
>>>> >> > Best regards
>>>> >> > Muhammad Usama
>>>> >> >
>>>> >> >
>>>> >> >
>>>> >> >>
>>>> >> >> Best regards,
>>>> >> >> --
>>>> >> >> Tatsuo Ishii
>>>> >> >> SRA OSS, Inc. Japan
>>>> >> >> English: http://www.sraoss.co.jp/index_en.php
>>>> >> >> Japanese:http://www.sraoss.co.jp
>>>> >> >>
>>>> >> >> >> > Hi Hackers,
>>>> >> >> >> >
>>>> >> >> >> > This is the proposal to make the failover of backend
>>>> >> >> >> > PostgreSQL nodes quorum aware to make it more robust and
>>>> >> >> >> > fault tolerant.
>>>> >> >> >> >
>>>> >> >> >> > Currently Pgpool-II proceeds to fail over the backend node
>>>> >> >> >> > as soon as the health check detects the failure, or when an
>>>> >> >> >> > error occurs on the backend connection (when
>>>> >> >> >> > fail_over_on_backend_error is set). This is good enough for
>>>> >> >> >> > a standalone Pgpool-II server.
>>>> >> >> >> >
>>>> >> >> >> > But consider the scenario where we have more than one
>>>> >> >> >> > Pgpool-II (say Pgpool-A, Pgpool-B and Pgpool-C) in the
>>>> >> >> >> > cluster, connected through the watchdog, and each Pgpool-II
>>>> >> >> >> > node is configured with two PostgreSQL backends (B1 and B2).
>>>> >> >> >> >
>>>> >> >> >> > Now if, due to some network glitch or other issue, Pgpool-A
>>>> >> >> >> > fails or loses its network connection with backend B1,
>>>> >> >> >> > Pgpool-A will detect the failure, detach (fail over) the B1
>>>> >> >> >> > backend, and also pass this information to the other
>>>> >> >> >> > Pgpool-II nodes (Pgpool-B and Pgpool-C). Although the
>>>> >> >> >> > backend B1 was perfectly healthy and also reachable from the
>>>> >> >> >> > Pgpool-B and Pgpool-C nodes, because of a network glitch
>>>> >> >> >> > between Pgpool-A and backend B1 it will still get detached
>>>> >> >> >> > from the cluster. The worst part is, if B1 was the master
>>>> >> >> >> > PostgreSQL node (in a master-standby configuration), the
>>>> >> >> >> > Pgpool-II failover would also promote the B2 PostgreSQL node
>>>> >> >> >> > as the new master, hence paving the way for split-brain
>>>> >> >> >> > and/or data corruption.
>>>> >> >> >> >
>>>> >> >> >> > So my proposal is that, when the watchdog is configured in
>>>> >> >> >> > Pgpool-II, the backend health check of Pgpool-II should
>>>> >> >> >> > consult the other attached Pgpool-II nodes over the watchdog
>>>> >> >> >> > to decide whether the backend node has actually failed or
>>>> >> >> >> > whether it is just a localized glitch/false alarm. The
>>>> >> >> >> > failover of the node should only be performed when the
>>>> >> >> >> > majority of cluster members agree on the failure of the
>>>> >> >> >> > node.
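>>>> >> >> >> >
>>>> >> >> >> > A minimal sketch of the intended decision rule (illustrative
>>>> >> >> >> > Python only, with made-up names; not the actual
>>>> >> >> >> > implementation):
>>>> >> >> >> >
>>>> >> >> >> >     def should_fail_over(local_sees_failure, peer_failure_votes,
>>>> >> >> >> >                          cluster_size):
>>>> >> >> >> >         """Detach a backend only when a majority of the
>>>> >> >> >> >         Pgpool-II nodes report it as failed."""
>>>> >> >> >> >         votes = peer_failure_votes + (1 if local_sees_failure else 0)
>>>> >> >> >> >         return votes > cluster_size // 2
>>>> >> >> >> >
>>>> >> >> >> >     # Only Pgpool-A loses its link to B1; Pgpool-B and
>>>> >> >> >> >     # Pgpool-C still reach it -> no failover, no false
>>>> >> >> >> >     # promotion of B2.
>>>> >> >> >> >     print(should_fail_over(True, 0, 3))   # False
>>>> >> >> >> >     # All three Pgpool-II nodes see B1 as down -> genuine
>>>> >> >> >> >     # failure, failover proceeds.
>>>> >> >> >> >     print(should_fail_over(True, 2, 3))   # True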
>>>> >> >> >> >
>>>> >> >> >> > This quorum-aware failover architecture will prevent false
>>>> >> >> >> > failovers and split-brain scenarios on the backend nodes.
>>>> >> >> >> >
>>>> >> >> >> > What are your thoughts and suggestions on this?
>>>> >> >> >> >
>>>> >> >> >> > Thanks
>>>> >> >> >> > Best regards
>>>> >> >> >> > Muhammad Usama
>>>> >> >> >>
>>>> >> >>
>>>> >>
>>>>
>>>
>>>
>>


More information about the pgpool-hackers mailing list