[pgpool-hackers: 2001] Re: Proposal to make backend node failover mechanism quorum aware

Muhammad Usama m.usama at gmail.com
Wed Jan 25 00:01:09 JST 2017


On Mon, Jan 23, 2017 at 8:15 AM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> > On Fri, Jan 20, 2017 at 7:37 AM, Tatsuo Ishii <ishii at sraoss.co.jp>
> wrote:
> >
> >> > On Mon, Jan 16, 2017 at 12:10 PM, Tatsuo Ishii <ishii at sraoss.co.jp>
> >> wrote:
> >> >
> >> >> Hi Usama,
> >> >>
> >> >> If my understanding is correct, by using the quorum, Pgpool-B and
> >> >> Pgpool-C decide that B1 is healthy. What happens when Pgpool-A tries
> >> >> to connect to B1 if the network failure between Pgpool-A and B1
> >> >> continues? I guess clients connecting to Pgpool-A get an error and
> >> >> fail to connect to the database?
> >> >>
> >> >
> >> > Yes, that is correct. I think what we can do in this scenario is: if
> >> > Pgpool-A is not allowed to fail over B1 because the other nodes in the
> >> > cluster (Pgpool-B and Pgpool-C) do not agree with the failure of B1,
> >> > then Pgpool-A will throw an error to its clients if B1 was the
> >> > master/primary backend server. Otherwise, if B1 was a standby server,
> >> > Pgpool-A would continue serving its clients without using the
> >> > unreachable PostgreSQL server B1.
> >>
> >> Well, that sounds overly complex to me. In this case it is likely that
> >> the network devices or switch ports used by Pgpool-A are broken. In
> >> this situation, as a member of the watchdog cluster, Pgpool-A cannot
> >> be trusted any more, thus we can let it retire from the watchdog
> >> cluster.
> >>
> >
> > Basically, the scenario mentioned in the initial proposal is very
> > simplistic: it has all the Pgpool-II and database servers located
> > inside a single network, and as you pointed out, the failure would
> > more likely be because of a network device failure.
> >
> > But if we consider a situation where the Pgpool-II servers and
> > PostgreSQL servers are distributed across multiple, or even just two,
> > availability zones, then network partitioning can happen because of
> > disruption of the link connecting the networks in the different
> > availability zones.
>
> Ok. Suppose we have:
>
> AZ1: Pgpool-A, Pgpool-B, B1
> AZ2: Pgpool-C, B2
>
> They are configured as shown in
> http://www.pgpool.net/docs/latest/en/html/example-aws.html.
>
> If AZ1 and AZ2 are disconnected, then I expect the following to happen
> in Pgpool-II 3.6:
>
> 1) Pgpool-A and Pgpool-B detect the failure of B2 because they cannot
>    reach B2, and detach B2. They may promote B1 if B1 is a standby.
>
> 2) Pgpool-A and Pgpool-B decide that new watchdog master should be
>    elected from one of Pgpool-A and Pgpool-B.
>
> 3) Pgpool-C decides that it should retire from the watchdog cluster,
>    which makes it impossible for users in AZ2 to access B2 through
>    the elastic IP.
>
>    Pgpool-C may or may not promote B2 (if it's a standby).
>
> According to the proposal, the only difference would be in #3:
>
> 3a) Pgpool-C decides that it should retire from the watchdog cluster,
>    which makes it impossible for users in AZ2 to access B2 through
>    the elastic IP. Users in AZ2 need to access Pgpool-C using
>    Pgpool-C's real IP address.
>
>    Pgpool-C does not promote B2 (if it's a standby).
>
>    Pgpool-C refuses access to B2 (if it's a primary).
>
> If my understanding is correct, the proposal seems to add little
> benefit because:
>
> - Users in AZ2 need to switch from the elastic IP to the real IP to
>   access the DB when the link between the two regions goes down.
>
> - Even without the proposal, users in AZ2 could access B2 in this
>   case. Users need to switch IP anyway, so switching from the elastic
>   IP to the real standby IP is no big deal.
>
> Am I missing something?
>

This is correct. If Pgpool-II is used in master-standby mode (with an
elastic or virtual IP, and clients connect to only one Pgpool-II
server), then there are not many issues that could be caused by the
interruption of the link between AZ1 and AZ2 as you defined above.

But the issue arises when Pgpool-II is used in master-master mode
(clients connect to all available Pgpool-II nodes). Consider the
following scenario:

a) The link between AZ1 and AZ2 breaks; at that time B1 is the master
while B2 is a standby.

b) Pgpool-C in AZ2 promotes B2 to master, since Pgpool-C is not able to
connect to the old master (B1).

c) A client connects to Pgpool-C and issues a write statement. It lands
on the B2 PostgreSQL server, which was promoted to master in step b.

c-1) Another client connects to Pgpool-A and also issues a write
statement that lands on the B1 PostgreSQL server, as it is still the
master node in AZ1.

d) The link between AZ1 and AZ2 is restored, but now the PostgreSQL
servers B1 and B2 have diverging sets of data, with no easy way to merge
both sets of changes and restore the cluster to its original state.

The above scenario becomes even more complicated if both availability
zones AZ1 and AZ2 have multiple Pgpool-II nodes, since the logic for
retiring multiple Pgpool-II nodes on a link disruption between AZ1 and
AZ2 becomes more complex.

So the proposal tries to solve this by making sure that we always have
only one master PostgreSQL node in the cluster and never end up in a
situation where different PostgreSQL nodes hold different sets of data.
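The failover rule in the proposal amounts to a simple majority vote
among the watchdog members. A minimal sketch of the idea (hypothetical
Python for illustration only; the function name and the vote
representation are assumptions, not actual Pgpool-II internals):

```python
# Hypothetical sketch of the quorum-aware failover decision; names are
# illustrative, not real Pgpool-II code.

def quorum_allows_failover(local_sees_failed, peer_votes):
    """Decide whether a backend node may be failed over.

    local_sees_failed: this Pgpool-II node's own health-check result.
    peer_votes: one boolean per other watchdog node, True if that node
    also considers the backend failed.
    """
    cluster_size = len(peer_votes) + 1           # peers plus ourselves
    votes_for_failure = sum(peer_votes) + (1 if local_sees_failed else 0)
    majority = cluster_size // 2 + 1             # strict majority
    return votes_for_failure >= majority

# Only Pgpool-A sees B1 as failed; Pgpool-B and Pgpool-C disagree:
print(quorum_allows_failover(True, [False, False]))   # False: no failover
# All three nodes agree that the backend is down:
print(quorum_allows_failover(True, [True, True]))     # True: fail over
```

In the scenario above, Pgpool-C alone (a minority of the three watchdog
nodes) would not reach the majority, so it would not promote B2 and the
split-brain in step b would be avoided.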



> > There is also a question ("[pgpool-general: 5179] Architecture Questions
> > <http://www.sraoss.jp/pipermail/pgpool-general/2016-December/005237.html>")
> > posted by a user on the pgpool-general mailing list who wants a similar
> > type of network that spans two AWS availability zones, and Pgpool-II has
> > no good answer to avoid split-brain of the backend nodes if the corporate
> > link between the two zones suffers a glitch.
>
> That seems like a totally different story to me, because there are two
> independent streaming replication primary servers in the east and west
> regions.
>
>
I think the original question statement was a little bit confusing. My
understanding of the user's requirements, from later in the thread, is
as follows.
The user has a couple of PostgreSQL nodes in each of two availability
zones (four PG nodes in total), and all four nodes are part of a single
streaming replication setup.
Both zones have two Pgpool-II nodes each (four Pgpool-II nodes in the
cluster in total).
Each availability zone has one application server that connects to one
of the two Pgpool-II nodes in that availability zone (which makes it a
master-master Pgpool-II setup). And the user is concerned about
split-brain of the PostgreSQL servers when the corporate link between
the zones becomes unavailable.
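For context, the watchdog side of such a four-node deployment is wired
together in pgpool.conf roughly like this (a sketch for one of the four
Pgpool-II nodes; host names and values are placeholders, not the user's
actual configuration):

```ini
# pgpool.conf fragment (illustrative values) for one Pgpool-II node in AZ1.
use_watchdog = on
wd_hostname = 'pgpool-a.az1.example.com'   # this node's own address
wd_port = 9000

# The other three Pgpool-II nodes in the watchdog cluster:
other_pgpool_hostname0 = 'pgpool-b.az1.example.com'
other_pgpool_port0 = 9999
other_wd_port0 = 9000
other_pgpool_hostname1 = 'pgpool-c.az2.example.com'
other_pgpool_port1 = 9999
other_wd_port1 = 9000
other_pgpool_hostname2 = 'pgpool-d.az2.example.com'
other_pgpool_port2 = 9999
other_wd_port2 = 9000
```

With two watchdog nodes on each side of the zone link, neither side
holds a majority of the four-node cluster on its own after a partition,
which is exactly why the quorum rule for backend failover matters here.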

Thanks
Best regards
Muhammad Usama



> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
>
> > Thanks
> > Best regards
> > Muhammad Usama
> >
> >
> >
> >>
> >> Best regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS, Inc. Japan
> >> English: http://www.sraoss.co.jp/index_en.php
> >> Japanese:http://www.sraoss.co.jp
> >>
> >> >> > Hi Hackers,
> >> >> >
> >> >> > This is a proposal to make the failover of backend PostgreSQL
> >> >> > nodes quorum aware, to make it more robust and fault tolerant.
> >> >> >
> >> >> > Currently, Pgpool-II proceeds to fail over the backend node as
> >> >> > soon as the health check detects a failure, or when an error
> >> >> > occurs on the backend connection (if fail_over_on_backend_error
> >> >> > is set). This is good enough for a standalone Pgpool-II server.
> >> >> >
> >> >> > But consider the scenario where we have more than one Pgpool-II
> >> >> > (say Pgpool-A, Pgpool-B and Pgpool-C) in the cluster, connected
> >> >> > through the watchdog, and each Pgpool-II node is configured with
> >> >> > two PostgreSQL backends (B1 and B2).
> >> >> >
> >> >> > Now, if due to some network glitch or issue Pgpool-A fails or
> >> >> > loses its network connection with backend B1, Pgpool-A will
> >> >> > detect the failure and detach (fail over) the B1 backend, and
> >> >> > also pass this information to the other Pgpool-II nodes
> >> >> > (Pgpool-B and Pgpool-C). Although backend B1 was perfectly
> >> >> > healthy and was also reachable from the Pgpool-B and Pgpool-C
> >> >> > nodes, because of a network glitch between Pgpool-A and backend
> >> >> > B1 it will get detached from the cluster. The worst part is, if
> >> >> > B1 was the master PostgreSQL node (in a master-standby
> >> >> > configuration), the Pgpool-II failover would also promote the B2
> >> >> > PostgreSQL node as a new master, hence making way for
> >> >> > split-brain and/or data corruption.
> >> >> >
> >> >> > So my proposal is that when the watchdog is configured in
> >> >> > Pgpool-II, the backend health check of Pgpool-II should consult
> >> >> > the other attached Pgpool-II nodes over the watchdog to decide
> >> >> > if the backend node has actually failed or if it is just a
> >> >> > localized glitch/false alarm. The failover of the node should
> >> >> > only be performed when the majority of the cluster members agree
> >> >> > on the failure of the node.
> >> >> >
> >> >> > This quorum-aware failover architecture will prevent false
> >> >> > failovers and split-brain scenarios in the backend nodes.
> >> >> >
> >> >> > What are your thoughts and suggestions on this?
> >> >> >
> >> >> > Thanks
> >> >> > Best regards
> >> >> > Muhammad Usama
> >> >>
> >>
>