<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Mar 16, 2017 at 4:14 AM, Tatsuo Ishii <span dir="ltr"><<a href="mailto:ishii@sraoss.co.jp" target="_blank">ishii@sraoss.co.jp</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">> On Fri, Mar 10, 2017 at 11:05 AM, Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp">ishii@sraoss.co.jp</a>> wrote:<br>
><br>
>> Usama,<br>
>><br>
>> I have a question regarding Zone partitioning case described in<br>
>> section 2 in your proposal. In my understanding after the network<br>
>> partitioning happens, Pgpool-II/watchdog in zone 2 will suicide<br>
>> because they cannot acquire quorum. So split-brain or data<br>
>> inconsistency due to two master nodes will not happen even in<br>
>> Pgpool-II 3.6. Am I missing something?<br>
>><br>
><br>
> With the current design of the watchdog, the Pgpool-II/watchdog commits suicide<br>
> in only two cases.<br>
><br>
> 1- When all network interfaces on the machine become unavailable (the machine<br>
> has lost all its IP addresses).<br>
> 2- When the upstream trusted server becomes unreachable (if<br>
> trusted_servers are configured).<br>
><br>
> So in the zone-partitioning scenario described in section 2, the Pgpool-II nodes<br>
> in zone 2 will not commit suicide, because neither<br>
> of the above two conditions for node suicide holds.<br>
><br>
> Also, committing suicide as soon as the cluster loses the quorum doesn't<br>
> feel like a good option, because if we implemented that we would end up with<br>
> all the Pgpool-II nodes committing suicide as soon as the quorum is lost in<br>
> the cluster, and eventually the Pgpool-II service would become unavailable;<br>
> the administrator would then have to manually restart the Pgpool-II nodes.<br>
> The current implementation makes sure that split-brain does not happen when<br>
> the quorum is not available<br>
<br>
</span>How do you prevent split-brain without a quorum?<br></blockquote><div><br></div><div>In a watchdog cluster, a split-brain scenario means that more than one node becomes the delegate-IP holder.</div><div>To prevent that, a watchdog node never acquires the VIP while the quorum is not present in the cluster, and if the cluster loses the quorum at any point in time, the master node performs de-escalation and releases the VIP.</div><div>This technique makes sure that at most one delegate-IP-holding watchdog exists in the cluster, so split-brain never happens.</div><div><br></div><div>Thanks</div><div>Best Regards</div><div>Muhammad Usama</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
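To illustrate, the rule described above can be sketched as follows; this is a minimal sketch assuming a simple strict-majority quorum, with illustrative function names rather than Pgpool-II's actual internals:

```python
def has_quorum(alive_nodes: int, total_nodes: int) -> bool:
    # Quorum requires a strict majority of all configured watchdog nodes.
    return alive_nodes > total_nodes // 2

def should_hold_vip(is_master: bool, alive_nodes: int, total_nodes: int) -> bool:
    # Only the master/coordinator node may hold the delegate IP (VIP), and only
    # while the quorum is present; on quorum loss the master de-escalates and
    # releases the VIP, so at most one VIP holder can ever exist.
    return is_master and has_quorum(alive_nodes, total_nodes)

# A 5-node cluster partitioned 3/2: only a master on the majority side keeps the VIP.
assert should_hold_vip(True, 3, 5) is True    # master in the majority partition
assert should_hold_vip(True, 2, 5) is False   # master in the minority partition de-escalates
```

Because two disjoint partitions can never both hold a strict majority at the same time, this rule rules out two simultaneous VIP holders by construction.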
<div class="HOEnZb"><div class="h5"><br>
> and at the same time keeps looking for new or previously lost<br>
> nodes to rejoin the cluster, so that service disruption is kept to a<br>
> minimum and the cluster recovers automatically without any manual<br>
> intervention.<br>
><br>
><br>
> Thanks<br>
> Best regards<br>
> Muhammad Usama<br>
><br>
><br>
><br>
>> Best regards,<br>
>> --<br>
>> Tatsuo Ishii<br>
>> SRA OSS, Inc. Japan<br>
>> English: <a href="http://www.sraoss.co.jp/index_en.php" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_<wbr>en.php</a><br>
>> Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.<wbr>jp</a><br>
>><br>
>> From: Muhammad Usama <<a href="mailto:m.usama@gmail.com">m.usama@gmail.com</a>><br>
>> Subject: Re: Proposal to make backend node failover mechanism quorum aware<br>
>> Date: Thu, 9 Mar 2017 00:57:58 +0500<br>
>> Message-ID: <CAEJvTzXap+qMGLt7SQ-1hPgf=<wbr>aNuAYEsu_JQYd695hac0WagkA@<wbr>mail.<br>
>> <a href="http://gmail.com" rel="noreferrer" target="_blank">gmail.com</a>><br>
>><br>
>> > Hi<br>
>> ><br>
>> > Please use this document. The image quality of the previously shared<br>
>> > version was not up to the mark.<br>
>> ><br>
>> > Thanks<br>
>> > Best regards<br>
>> > Muhammad Usama<br>
>> ><br>
>> > On Thu, Mar 9, 2017 at 12:53 AM, Muhammad Usama <<a href="mailto:m.usama@gmail.com">m.usama@gmail.com</a>><br>
>> wrote:<br>
>> ><br>
>> >> Hi Ishii-San<br>
>> >><br>
>> >> I have tried to create a detailed proposal to explain why and where the<br>
>> >> quorum aware backend failover mechanism would be useful.<br>
>> >> Can you please take a look at the attached pdf document and share your<br>
>> >> thoughts.<br>
>> >><br>
>> >> Thanks<br>
>> >> Kind Regards<br>
>> >> Muhammad Usama<br>
>> >><br>
>> >><br>
>> >> On Wed, Jan 25, 2017 at 2:04 PM, Muhammad Usama <<a href="mailto:m.usama@gmail.com">m.usama@gmail.com</a>><br>
>> wrote:<br>
>> >><br>
>> >>><br>
>> >>><br>
>> >>> On Wed, Jan 25, 2017 at 9:05 AM, Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp">ishii@sraoss.co.jp</a>><br>
>> wrote:<br>
>> >>><br>
>> >>>> Usama,<br>
>> >>>><br>
>> >>>> > This is correct. If Pgpool-II is used in master-standby mode (with an<br>
>> >>>> > elastic or virtual IP, and clients connect to only one Pgpool-II server)<br>
>> >>>> > then there are not many issues that could be caused by the<br>
>> >>>> > interruption of the link between AZ1 and AZ2 as you defined above.<br>
>> >>>> ><br>
>> >>>> > But the issue arises when Pgpool-II is used in master-master<br>
>> >>>> > mode (clients connect to all available Pgpool-II nodes); consider the<br>
>> >>>> > following scenario.<br>
>> >>>> ><br>
>> >>>> > a) The link between AZ1 and AZ2 broke; at that time, B1 was the master<br>
>> >>>> > while B2 was the standby.<br>
>> >>>> ><br>
>> >>>> > b) Pgpool-C in AZ2 promotes B2 to master, since Pgpool-C is not able<br>
>> >>>> > to connect to the old master (B1).<br>
>> >>>><br>
>> >>>> I thought Pgpool-C suicides because it cannot get quorum in this<br>
>> >>>> case, no?<br>
>> >>>><br>
>> >>><br>
>> >>> No, Pgpool-II commits suicide only when it loses all network<br>
>> >>> connections. Otherwise, the master watchdog node is de-escalated when the<br>
>> >>> quorum is lost.<br>
>> >>> Committing suicide every time the quorum is lost is very risky and not<br>
>> >>> feasible, since it would shut down the whole cluster as soon as the quorum<br>
>> >>> is lost, even because of a small glitch.<br>
>> >>><br>
>> >>><br>
>> >>>> > c) A client connects to Pgpool-C and issues a write statement. It<br>
>> will<br>
>> >>>> land<br>
>> >>>> > on the B2 PostgreSQL server, which was promoted as master in step<br>
>> b.<br>
>> >>>> ><br>
>> >>>> > c-1) Another client connects to Pgpool-A and also issues a write<br>
>> >>>> > statement that will land on the B1 PostgreSQL server, as it is the<br>
>> >>>> > master node in AZ1.<br>
>> >>>> ><br>
>> >>>> > d) The link between AZ1 and AZ2 is restored, but now the PostgreSQL<br>
>> >>>> > nodes B1 and B2 have different sets of data, with no easy way to get<br>
>> >>>> > both sets of changes in one place and restore the cluster to its<br>
>> >>>> > original state.<br>
>> >>>> ><br>
>> >>>> > The above scenario becomes more complicated if both availability zones<br>
>> >>>> > AZ1 and AZ2 have multiple Pgpool-II nodes, since the logic for retiring<br>
>> >>>> > the multiple Pgpool-II nodes becomes more complex when the link between<br>
>> >>>> > AZ1 and AZ2 is disrupted.<br>
>> >>>> ><br>
>> >>>> > So the proposal tries to solve this by making sure that we always<br>
>> >>>> > have only one master PostgreSQL node in the cluster, and never end up<br>
>> >>>> > in the situation where we have different sets of data on different<br>
>> >>>> > PostgreSQL nodes.<br>
>> >>>> ><br>
>> >>>> ><br>
>> >>>> ><br>
>> >>>> >> > There is also a question ("[pgpool-general: 5179] Architecture<br>
>> >>>> Questions<br>
>> >>>> >> > <<a href="http://www.sraoss.jp/pipermail/pgpool-general/2016-December" rel="noreferrer" target="_blank">http://www.sraoss.jp/<wbr>pipermail/pgpool-general/2016-<wbr>December</a><br>
>> >>>> /005237.html<br>
>> >>>> >> >")<br>
>> >>>> >> > posted by a user in pgpool-general mailing list who wants a<br>
>> similar<br>
>> >>>> type<br>
>> >>>> >> of<br>
>> >>>> >> > network that spans over two AWS availability zones and Pgpool-II<br>
>> >>>> has no<br>
>> >>>> >> > good answer to avoid split-brain of backend nodes if the<br>
>> corporate<br>
>> >>>> link<br>
>> >>>> >> > between two zones suffers a glitch.<br>
>> >>>> >><br>
>> >>>> >> That seems a totally different story to me, because there are two<br>
>> >>>> >> independent streaming replication primary servers in the east and<br>
>> >>>> >> west regions.<br>
>> >>>> >><br>
>> >>>> >><br>
>> >>>> > I think the original question statement was a little bit confusing.<br>
>> >>>> > My understanding of the user's requirements, from later in the thread,<br>
>> >>>> > is as follows.<br>
>> >>>> > The user has a couple of PostgreSQL nodes in each of two availability<br>
>> >>>> > zones (4 PG nodes in total), and all four nodes are part of a single<br>
>> >>>> > streaming replication setup.<br>
>> >>>> > Both zones have two Pgpool-II nodes each (4 Pgpool-II nodes in the<br>
>> >>>> > cluster in total).<br>
>> >>>> > Each availability zone has one application server that connects to one<br>
>> >>>> > of the two Pgpool-II nodes in that availability zone. (That makes it a<br>
>> >>>> > master-master Pgpool-II setup.) And the user is concerned about<br>
>> >>>> > split-brain of the PostgreSQL servers when the corporate link between<br>
>> >>>> > zones becomes unavailable.<br>
>> >>>> ><br>
>> >>>> > Thanks<br>
>> >>>> > Best regards<br>
>> >>>> > Muhammad Usama<br>
>> >>>> ><br>
>> >>>> ><br>
>> >>>> ><br>
>> >>>> >> Best regards,<br>
>> >>>> >> --<br>
>> >>>> >> Tatsuo Ishii<br>
>> >>>> >> SRA OSS, Inc. Japan<br>
>> >>>> >> English: <a href="http://www.sraoss.co.jp/index_en.php" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_<wbr>en.php</a><br>
>> >>>> >> Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.<wbr>jp</a><br>
>> >>>> >><br>
>> >>>> >> > Thanks<br>
>> >>>> >> > Best regards<br>
>> >>>> >> > Muhammad Usama<br>
>> >>>> >> ><br>
>> >>>> >> ><br>
>> >>>> >> ><br>
>> >>>> >> >><br>
>> >>>> >> >> Best regards,<br>
>> >>>> >> >> --<br>
>> >>>> >> >> Tatsuo Ishii<br>
>> >>>> >> >> SRA OSS, Inc. Japan<br>
>> >>>> >> >> English: <a href="http://www.sraoss.co.jp/index_en.php" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_<wbr>en.php</a><br>
>> >>>> >> >> Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.<wbr>jp</a><br>
>> >>>> >> >><br>
>> >>>> >> >> >> > Hi Hackers,<br>
>> >>>> >> >> >> ><br>
>> >>>> >> >> >> > This is the proposal to make the failover of backend<br>
>> >>>> PostgreSQL<br>
>> >>>> >> nodes<br>
>> >>>> >> >> >> > quorum aware to make it more robust and fault tolerant.<br>
>> >>>> >> >> >> ><br>
>> >>>> >> >> >> > Currently, Pgpool-II proceeds to fail over a backend node as<br>
>> >>>> >> >> >> > soon as the health check detects a failure, or when an error<br>
>> >>>> >> >> >> > occurs on the backend connection (when<br>
>> >>>> >> >> >> > fail_over_on_backend_error is set). This is good enough for a<br>
>> >>>> >> >> >> > standalone Pgpool-II server.<br>
>> >>>> >> >> >> ><br>
>> >>>> >> >> >> > But consider a scenario where we have more than one<br>
>> >>>> >> >> >> > Pgpool-II (say Pgpool-A, Pgpool-B and Pgpool-C) in the<br>
>> >>>> >> >> >> > cluster, connected through the watchdog, and each Pgpool-II<br>
>> >>>> >> >> >> > node is configured with two PostgreSQL backends (B1 and B2).<br>
>> >>>> >> >> >> ><br>
>> >>>> >> >> >> > Now, if due to some network glitch or issue Pgpool-A fails or<br>
>> >>>> >> >> >> > loses its network connection with backend B1, Pgpool-A will<br>
>> >>>> >> >> >> > detect the failure, detach (fail over) the B1 backend, and<br>
>> >>>> >> >> >> > also pass this information to the other Pgpool-II nodes<br>
>> >>>> >> >> >> > (Pgpool-B and Pgpool-C). Although backend B1 was perfectly<br>
>> >>>> >> >> >> > healthy and also reachable from the Pgpool-B and Pgpool-C<br>
>> >>>> >> >> >> > nodes, because of a network glitch between Pgpool-A and<br>
>> >>>> >> >> >> > backend B1 it will get detached from the cluster. The worst<br>
>> >>>> >> >> >> > part is, if B1 was the master PostgreSQL node (in a<br>
>> >>>> >> >> >> > master-standby configuration), the Pgpool-II failover would<br>
>> >>>> >> >> >> > also promote the B2 PostgreSQL node as the new master, hence<br>
>> >>>> >> >> >> > opening the way for split-brain and/or data corruption.<br>
>> >>>> >> >> >> ><br>
>> >>>> >> >> >> > So my proposal is that when the watchdog is configured in<br>
>> >>>> >> >> >> > Pgpool-II, the backend health check of Pgpool-II should<br>
>> >>>> >> >> >> > consult the other attached Pgpool-II nodes over the watchdog<br>
>> >>>> >> >> >> > to decide whether the backend node has actually failed or<br>
>> >>>> >> >> >> > whether it is just a localized glitch/false alarm. And the<br>
>> >>>> >> >> >> > failover of the node should only be performed when the<br>
>> >>>> >> >> >> > majority of cluster members agrees on the failure of the<br>
>> >>>> >> >> >> > node.<br>
>> >>>> >> >> >> ><br>
>> >>>> >> >> >> > This quorum-aware architecture of failover will prevent<br>
>> >>>> >> >> >> > false failovers and split-brain scenarios in the backend<br>
>> >>>> >> >> >> > nodes.<br>
>> >>>> >> >> >> ><br>
>> >>>> >> >> >> > What are your thoughts and suggestions on this?<br>
>> >>>> >> >> >> ><br>
>> >>>> >> >> >> > Thanks<br>
>> >>>> >> >> >> > Best regards<br>
>> >>>> >> >> >> > Muhammad Usama<br>
>> >>>> >> >> >><br>
>> >>>> >> >><br>
>> >>>> >><br>
>> >>>><br>
>> >>><br>
>> >>><br>
>> >><br>
>><br>
</div></div></blockquote></div><br></div></div>
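Postscript: the majority-agreement failover check described in the quoted proposal could be sketched roughly like this (illustrative names and data shapes only, not Pgpool-II's actual implementation):

```python
def decide_failover(reports: dict) -> bool:
    """Each Pgpool-II watchdog member reports whether it sees the backend as failed.

    Fail over only when a strict majority of members agrees on the failure,
    so a glitch seen by a single node does not detach a healthy backend.
    """
    failure_votes = sum(1 for failed in reports.values() if failed)
    return failure_votes > len(reports) // 2

# Only Pgpool-A lost its link to backend B1: no failover, just a local alarm.
assert decide_failover({"Pgpool-A": True, "Pgpool-B": False, "Pgpool-C": False}) is False
# Two of three members confirm B1 is down: the failover proceeds.
assert decide_failover({"Pgpool-A": True, "Pgpool-B": True, "Pgpool-C": False}) is True
```

In the scenario from the proposal, Pgpool-A alone would cast one failure vote out of three, so the healthy (and possibly master) backend B1 would stay attached and no false promotion of B2 would occur.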