<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Apr 16, 2019 at 12:14 PM Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp">ishii@sraoss.co.jp</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">> On Tue, Apr 16, 2019 at 7:55 AM Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp" target="_blank">ishii@sraoss.co.jp</a>> wrote:<br>
> <br>
>> Hi Usama,<br>
>><br>
>> > Hi TAKATSUKA Haruka,<br>
>> ><br>
>> > Thanks for the patch, but your patch effectively disables the node<br>
>> > quarantine, which doesn't seem like the right way.<br>
>> > Since the backend node that was quarantined because of the absence of<br>
>> > quorum and/or consensus is already unreachable<br>
>> > from the Pgpool-II node, we don't want to select it as the load-balance<br>
>> > node (in case the node was secondary), or consider it<br>
>> > available when it is not, which is what would happen if we did not mark it as quarantined.<br>
>> ><br>
>> > In my opinion the right way to tackle the issue is to keep setting the<br>
>> > quarantine state as is done currently, but<br>
>> > also keep the health check working on quarantined nodes, so that as<br>
>> > soon as connectivity to the<br>
>> > quarantined node resumes, it becomes part of the cluster again automatically.<br>
>><br>
>> What if the connection failure between the primary PostgreSQL and one<br>
>> of the Pgpool-II servers is permanent? Doesn't health checking continue<br>
>> forever?<br>
>><br>
> <br>
> Yes, but only for the quarantined PostgreSQL nodes, and I don't think<br>
> there is a problem<br>
> with that. Conceptually, quarantined nodes are not failed nodes (they are<br>
> just unusable at that moment),<br>
> and taking a node out of the quarantine zone shouldn't require manual<br>
> intervention. So I think it's the correct<br>
> way to continue the health checking on quarantined nodes.<br>
> <br>
> Do you see an issue with the approach?<br>
<br>
Yes. Think about the case when the PostgreSQL node is the primary. Users<br>
cannot issue write queries while the retrying is in progress. The network<br>
failure could persist for days, and the whole database cluster would be<br>
unusable for that period.<br></blockquote><div><br></div><div>Yes, that's true, but not allowing the node to go into the quarantine state still won't solve it,</div><div>because the primary would be unavailable anyway whether we set the quarantine state</div><div>or not. So the whole idea of this patch is to recover from the quarantine state automatically as soon as</div><div>connectivity resumes.</div><div>Similarly, failover of that node is again not an option if the user wants to perform failover only when</div><div>consensus exists; otherwise they should just disable failover_require_consensus.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
BTW,<br>
<br>
> > When the communication between the master/coordinator pgpool and the<br>
> > primary PostgreSQL node is down for a short period<br>
><br>
> I wonder why you don't set appropriate health check retry parameters<br>
> to avoid such a temporary communication failure in the first place.<br>
> Brain surgery to ignore the error reports from Pgpool-II does not seem<br>
> like a sane choice.<br>
<br>
The original reporter didn't answer my question. I think it is likely<br>
a problem of misconfiguration (a longer health check retry should be used).<br>
<br>
In summary, I think that for a short-lived communication failure, just<br>
increasing the health check parameters is enough. However, for a long-lasting<br>
communication failure, the watchdog node should decline the<br>
role.<br></blockquote><div><br></div><div>I am sorry, I didn't totally get what you mean here.</div><div>Do you mean that the Pgpool-II node that has the primary node in the quarantine state should resign from the master/coordinator</div><div>role (if it was the master/coordinator) in that case?</div><div> </div><div>Thanks</div><div>Best Regards</div><div>Muhammad Usama</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
>> > Can you please try out the attached patch to see if the solution works<br>
>> > for the situation?<br>
>> > The patch is generated against the current master branch.<br>
>> ><br>
>> > Thanks<br>
>> > Best Regards<br>
>> > Muhammad Usama<br>
>> ><br>
>> > On Wed, Apr 10, 2019 at 2:04 PM TAKATSUKA Haruka <<a href="mailto:harukat@sraoss.co.jp" target="_blank">harukat@sraoss.co.jp</a>><br>
>> > wrote:<br>
>> ><br>
>> >> Hello, Pgpool developers<br>
>> >><br>
>> >><br>
>> >> I found that the Pgpool-II watchdog is too strict about duplicate failover<br>
>> >> requests with the allow_multiple_failover_requests_from_node = off setting.<br>
>> >><br>
>> >> For example, consider a watchdog cluster with 3 pgpool instances.<br>
>> >> Their backends are PostgreSQL servers using streaming replication.<br>
>> >><br>
>> >> When the communication between the master/coordinator pgpool and the<br>
>> >> primary PostgreSQL node is down for a short period<br>
>> >> (or pgpool makes a false-positive judgement for various reasons),<br>
>> >> the pgpool then tries to fail over but cannot get consensus,<br>
>> >> so it puts the primary node into quarantine status. That status cannot<br>
>> >> be reset automatically. As a result, the service becomes unavailable.<br>
>> >><br>
>> >> This case generates logs like the following:<br>
>> >><br>
>> >> pid 1234: LOG: new IPC connection received<br>
>> >> pid 1234: LOG: watchdog received the failover command from local<br>
>> >> pgpool-II on IPC interface<br>
>> >> pid 1234: LOG: watchdog is processing the failover command<br>
>> >> [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC<br>
>> interface<br>
>> >> pid 1234: LOG: Duplicate failover request from "pg1:5432 Linux pg1"<br>
>> node<br>
>> >> pid 1234: DETAIL: request ignored<br>
>> >> pid 1234: LOG: failover requires the majority vote, waiting for<br>
>> consensus<br>
>> >> pid 1234: DETAIL: failover request noted<br>
>> >> pid 4321: LOG: degenerate backend request for 1 node(s) from pid<br>
>> [4321],<br>
>> >> is changed to quarantine node request by watchdog<br>
>> >> pid 4321: DETAIL: watchdog is taking time to build consensus<br>
>> >><br>
>> >> Note that this case doesn't have any communication trouble among<br>
>> >> the Pgpool watchdog nodes.<br>
>> >> You can reproduce it by changing one PostgreSQL's pg_hba.conf to<br>
>> >> reject the health check access from one pgpool node for a short period.<br>
>> >><br>
>> >> The documentation doesn't say that duplicate failover requests put the node<br>
>> >> into quarantine immediately. I think it should just ignore the request.<br>
>> >><br>
>> >> A patch file against the head of V3_7_STABLE is attached.<br>
>> >> Pgpool with this patch still blocks failover triggered by a single<br>
>> >> pgpool's repeated failover requests, but it can recover when the<br>
>> >> connection trouble is gone.<br>
>> >><br>
>> >> Does this change cause any problems?<br>
>> >><br>
>> >><br>
>> >> with best regards,<br>
>> >> TAKATSUKA Haruka <<a href="mailto:harukat@sraoss.co.jp" target="_blank">harukat@sraoss.co.jp</a>><br>
>> >> _______________________________________________<br>
>> >> pgpool-hackers mailing list<br>
>> >> <a href="mailto:pgpool-hackers@pgpool.net" target="_blank">pgpool-hackers@pgpool.net</a><br>
>> >> <a href="http://www.pgpool.net/mailman/listinfo/pgpool-hackers" rel="noreferrer" target="_blank">http://www.pgpool.net/mailman/listinfo/pgpool-hackers</a><br>
>> >><br>
>><br>
</blockquote></div></div>
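[Editor's note] For reference, the pgpool.conf parameters this thread revolves around are sketched below with illustrative values only (tune them to your own network; the parameter names are those of the Pgpool-II 3.7-era configuration discussed here):

```ini
# Health check: retry long enough to ride out a transient network failure,
# so a brief outage never produces a failover/quarantine request at all
# (Ishii-san's suggested remedy for the short-outage case).
health_check_period = 10        # seconds between health checks
health_check_timeout = 20      # give up on a single check after this long
health_check_max_retries = 10  # retries before reporting the node as failed
health_check_retry_delay = 5   # seconds between retries

# Watchdog consensus: with this on, a failover request that cannot win a
# majority vote quarantines the node instead of failing it over; turning
# it off makes a single pgpool's request trigger failover immediately.
failover_require_consensus = on
allow_multiple_failover_requests_from_node = off
```

With generous retry settings, a short pg_hba.conf-style rejection (as in the reproduction scenario above) is absorbed by the retry loop and the quarantine path is never entered.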