[pgpool-hackers: 2095] Re: Proposal to make backend node failover mechanism quorum aware

Muhammad Usama m.usama at gmail.com
Thu Mar 9 04:57:58 JST 2017


Hi

Please use this document. The image quality of the previously shared
version was not up to the mark.

Thanks
Best regards
Muhammad Usama

On Thu, Mar 9, 2017 at 12:53 AM, Muhammad Usama <m.usama at gmail.com> wrote:

> Hi Ishii-San
>
> I have tried to create a detailed proposal to explain why and where a
> quorum-aware backend failover mechanism would be useful.
> Can you please take a look at the attached PDF document and share your
> thoughts?
>
> Thanks
> Kind Regards
> Muhammad Usama
>
>
> On Wed, Jan 25, 2017 at 2:04 PM, Muhammad Usama <m.usama at gmail.com> wrote:
>
>>
>>
>> On Wed, Jan 25, 2017 at 9:05 AM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>>
>>> Usama,
>>>
>>> > This is correct. If Pgpool-II is used in master-standby mode (with an
>>> > elastic or virtual IP, and clients connecting to only one Pgpool-II
>>> > server), then there are not many issues that could be caused by the
>>> > interruption of the link between AZ1 and AZ2 as you defined above.
>>> >
>>> > But the issue arises when Pgpool-II is used in master-master mode
>>> > (clients connect to all available Pgpool-II nodes). In that case,
>>> > consider the following scenario.
>>> >
>>> > a) The link between AZ1 and AZ2 breaks; at that time B1 was the master
>>> > while B2 was a standby.
>>> >
>>> > b) Pgpool-C in AZ2 promotes B2 to master, since Pgpool-C is not able
>>> > to connect to the old master (B1).
>>>
>>> I thought Pgpool-C commits suicide because it cannot get quorum in this case, no?
>>>
>>
>> No, Pgpool-II commits suicide only when it loses all network
>> connections. Otherwise, the master watchdog node is de-escalated when
>> the quorum is lost.
>> Committing suicide every time the quorum is lost would be very risky
>> and not feasible, since it would shut down the whole cluster as soon as
>> the quorum is lost, even because of a small glitch.
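>>
>> To make that concrete, here is a rough Python sketch (illustrative
>> only, not the actual Pgpool-II code; all names are made up) of the
>> reaction to a cluster membership change described above:
>>
>>     from dataclasses import dataclass
>>
>>     @dataclass
>>     class WatchdogState:
>>         total_nodes: int      # watchdog nodes configured in the cluster
>>         reachable_peers: int  # peers this node can still reach
>>         any_link_alive: bool  # does this node have any working network link?
>>         holds_vip: bool       # is this node the master (holding the VIP)?
>>
>>     def react_to_network_event(ws: WatchdogState) -> str:
>>         alive = ws.reachable_peers + 1     # count ourselves
>>         quorum = ws.total_nodes // 2 + 1   # strict majority
>>         if not ws.any_link_alive:
>>             return "shutdown"      # total isolation: the only suicide case
>>         if ws.holds_vip and alive < quorum:
>>             return "de-escalate"   # release the VIP but keep running
>>         return "keep-state"
>>
>>     # A 3-node cluster: the master loses both peers, but its own link is
>>     # still up, so it de-escalates instead of shutting down.
>>     print(react_to_network_event(WatchdogState(3, 0, True, True)))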
>>
>>
>>> > c) A client connects to Pgpool-C and issues a write statement. It will
>>> > land on the B2 PostgreSQL server, which was promoted to master in step b.
>>> >
>>> > c-1) Another client connects to Pgpool-A and also issues a write
>>> > statement, which will land on the B1 PostgreSQL server, as it is still
>>> > the master node in AZ1.
>>> >
>>> > d) The link between AZ1 and AZ2 is restored, but now the PostgreSQL
>>> > servers B1 and B2 have different sets of data, with no easy way to get
>>> > both sets of changes into one place and restore the cluster to its
>>> > original state.
>>> >
>>> > The above scenario becomes even more complicated if both availability
>>> > zones AZ1 and AZ2 have multiple Pgpool-II nodes, since the logic for
>>> > retiring the multiple Pgpool-II nodes becomes more complex when the
>>> > link between AZ1 and AZ2 is disrupted.
>>> >
>>> > So the proposal tries to solve this by making sure that we always have
>>> > only one master PostgreSQL node in the cluster and never end up in a
>>> > situation where different PostgreSQL nodes hold different sets of data.
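>>> >
>>> > As a rough illustration (a Python sketch of mine, not Pgpool-II
>>> > code), the guard this proposal puts in front of a promotion is just
>>> > a majority test: a standby is promoted only when a majority of the
>>> > Pgpool-II nodes agrees that the old master has failed.
>>> >
>>> >     def may_promote(total_pgpool_nodes: int, votes_master_dead: int) -> bool:
>>> >         quorum = total_pgpool_nodes // 2 + 1   # strict majority
>>> >         return votes_master_dead >= quorum
>>> >
>>> >     # In step b) above, Pgpool-C alone (1 vote out of 3) cannot reach
>>> >     # B1, so the promotion of B2 is refused and the divergence in
>>> >     # step d) never happens.
>>> >     print(may_promote(3, 1))   # False -> B2 stays a standby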
>>> >
>>> >
>>> >
>>> >> > There is also a question ("[pgpool-general: 5179] Architecture Questions
>>> >> > <http://www.sraoss.jp/pipermail/pgpool-general/2016-December/005237.html>")
>>> >> > posted by a user on the pgpool-general mailing list who wants a
>>> >> > similar type of network spanning two AWS availability zones, and
>>> >> > Pgpool-II has no good answer for avoiding split-brain of the backend
>>> >> > nodes if the corporate link between the two zones suffers a glitch.
>>> >>
>>> >> That seems like a totally different story to me, because there are two
>>> >> independent streaming replication primary servers in the east and west
>>> >> regions.
>>> >>
>>> >>
>>> > I think the original question statement was a little bit confusing. As
>>> > I understand the user's requirements from later in the thread:
>>> > The user has a couple of PostgreSQL nodes in each of two availability
>>> > zones (four PG nodes in total), and all four nodes are part of a single
>>> > streaming replication setup.
>>> > Both zones have two Pgpool-II nodes each (four Pgpool-II nodes in total
>>> > in the cluster).
>>> > Each availability zone has one application server that connects to one
>>> > of the two Pgpool-II nodes in that availability zone. (That makes it a
>>> > master-master Pgpool-II setup.) And the user is concerned about
>>> > split-brain of the PostgreSQL servers when the corporate link between
>>> > the zones becomes unavailable.
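>>> >
>>> > Incidentally, simple majority arithmetic shows why this 2+2 setup
>>> > would stay safe under the proposal when the inter-zone link breaks,
>>> > and also what it costs in availability (a quick illustration of
>>> > mine, not Pgpool-II code):
>>> >
>>> >     total_pgpool_nodes = 4
>>> >     quorum = total_pgpool_nodes // 2 + 1   # = 3, a strict majority
>>> >     nodes_per_zone = 2
>>> >     # Neither isolated zone reaches quorum, so neither side promotes a
>>> >     # new master; a tie-breaker node in a third zone would let one side
>>> >     # keep operating during the outage.
>>> >     print(nodes_per_zone >= quorum)   # False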
>>> >
>>> > Thanks
>>> > Best regards
>>> > Muhammad Usama
>>> >
>>> >
>>> >
>>> >> Best regards,
>>> >> --
>>> >> Tatsuo Ishii
>>> >> SRA OSS, Inc. Japan
>>> >> English: http://www.sraoss.co.jp/index_en.php
>>> >> Japanese:http://www.sraoss.co.jp
>>> >>
>>> >> > Thanks
>>> >> > Best regards
>>> >> > Muhammad Usama
>>> >> >
>>> >> >
>>> >> >
>>> >> >>
>>> >> >> Best regards,
>>> >> >> --
>>> >> >> Tatsuo Ishii
>>> >> >> SRA OSS, Inc. Japan
>>> >> >> English: http://www.sraoss.co.jp/index_en.php
>>> >> >> Japanese:http://www.sraoss.co.jp
>>> >> >>
>>> >> >> >> > Hi Hackers,
>>> >> >> >> >
>>> >> >> >> > This is a proposal to make the failover of backend PostgreSQL
>>> >> >> >> > nodes quorum aware, to make it more robust and fault tolerant.
>>> >> >> >> >
>>> >> >> >> > Currently, Pgpool-II proceeds to fail over the backend node as
>>> >> >> >> > soon as the health check detects a failure, or when an error
>>> >> >> >> > occurs on the backend connection (if fail_over_on_backend_error
>>> >> >> >> > is set). This is good enough for a standalone Pgpool-II server.
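>>> >> >> >> >
>>> >> >> >> > In other words (my pseudocode paraphrase in Python, not the
>>> >> >> >> > actual source), a single local observation is enough to
>>> >> >> >> > trigger the failover today:
>>> >> >> >> >
>>> >> >> >> >     def failover_triggered(health_check_failed: bool,
>>> >> >> >> >                            backend_error: bool,
>>> >> >> >> >                            fail_over_on_backend_error: bool) -> bool:
>>> >> >> >> >         return health_check_failed or (
>>> >> >> >> >             backend_error and fail_over_on_backend_error)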
>>> >> >> >> >
>>> >> >> >> > But consider the scenario where we have more than one Pgpool-II
>>> >> >> >> > node (say Pgpool-A, Pgpool-B, and Pgpool-C) in the cluster,
>>> >> >> >> > connected through the watchdog, and each Pgpool-II node is
>>> >> >> >> > configured with two PostgreSQL backends (B1 and B2).
>>> >> >> >> >
>>> >> >> >> > Now, if due to some network glitch or other issue Pgpool-A
>>> >> >> >> > fails or loses its network connection to backend B1, Pgpool-A
>>> >> >> >> > will detect the failure, detach (fail over) the B1 backend, and
>>> >> >> >> > also pass this information to the other Pgpool-II nodes
>>> >> >> >> > (Pgpool-B and Pgpool-C). Although backend B1 was perfectly
>>> >> >> >> > healthy and also reachable from the Pgpool-B and Pgpool-C
>>> >> >> >> > nodes, it will still get detached from the cluster because of a
>>> >> >> >> > network glitch between Pgpool-A and backend B1. The worst part
>>> >> >> >> > is that if B1 was the master PostgreSQL node (in a
>>> >> >> >> > master-standby configuration), the Pgpool-II failover would
>>> >> >> >> > also promote the B2 PostgreSQL node as the new master, hence
>>> >> >> >> > opening the way for split-brain and/or data corruption.
>>> >> >> >> >
>>> >> >> >> > So my proposal is that when the watchdog is configured in
>>> >> >> >> > Pgpool-II, the backend health check of Pgpool-II should consult
>>> >> >> >> > the other attached Pgpool-II nodes over the watchdog to decide
>>> >> >> >> > whether the backend node has actually failed or whether it is
>>> >> >> >> > just a localized glitch/false alarm. The failover of the node
>>> >> >> >> > should only be performed when a majority of the cluster members
>>> >> >> >> > agrees on the failure of the node.
>>> >> >> >> >
>>> >> >> >> > This quorum-aware failover architecture will prevent false
>>> >> >> >> > failovers and split-brain scenarios on the backend nodes.
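>>> >> >> >> >
>>> >> >> >> > A minimal sketch of the proposed consensus step (illustrative
>>> >> >> >> > Python; the function and node names are invented, not an
>>> >> >> >> > existing API):
>>> >> >> >> >
>>> >> >> >> >     from typing import Dict
>>> >> >> >> >
>>> >> >> >> >     def quorum_agrees_on_failure(votes: Dict[str, bool],
>>> >> >> >> >                                  total_nodes: int) -> bool:
>>> >> >> >> >         """votes maps a Pgpool-II node name to True if that
>>> >> >> >> >         node also sees the backend as failed; unreachable
>>> >> >> >> >         nodes simply cast no vote."""
>>> >> >> >> >         failed = sum(1 for v in votes.values() if v)
>>> >> >> >> >         return failed >= total_nodes // 2 + 1
>>> >> >> >> >
>>> >> >> >> >     # Pgpool-A's local glitch from the example above: only A
>>> >> >> >> >     # sees B1 down, so the failover is rejected as a false
>>> >> >> >> >     # alarm and B1 stays attached.
>>> >> >> >> >     votes = {"pgpool-A": True, "pgpool-B": False, "pgpool-C": False}
>>> >> >> >> >     print(quorum_agrees_on_failure(votes, total_nodes=3))   # False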
>>> >> >> >> >
>>> >> >> >> > What are your thoughts and suggestions on this?
>>> >> >> >> >
>>> >> >> >> > Thanks
>>> >> >> >> > Best regards
>>> >> >> >> > Muhammad Usama
>>> >> >> >>
>>> >> >>
>>> >>
>>>
>>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: quorum aware failover.pdf
Type: application/pdf
Size: 336676 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-hackers/attachments/20170309/b764960c/attachment-0001.pdf>

