<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Apr 16, 2019 at 2:03 PM Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp">ishii@sraoss.co.jp</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">>> Thanks. However this will change existing behavior. Probably we should<br>
>> make the change against master branch only?<br>
>><br>
> <br>
> Probably yes, because the current fix I have for this in my mind involves<br>
> the configurable timeout parameter<br>
> to make the master pgpool resign. Let me come up with the patch and then we<br>
> can work on the part of it that<br>
> needs to be back-ported.<br>
> And regarding the patch I shared upthread to continue the health check on<br>
> quarantined nodes, do you think we should<br>
> also back-patch it to older versions as well?<br>
<br>
I am not sure we should back-port both patches, since they will<br>
change existing behaviors (and one of them is even documented).<br>
<br>
What do you think?<br></blockquote><div><br></div><div>Totally agreed. So I will go on to make it for master branch only. </div><div>Many thanks for the valuable inputs.</div><div><br></div><div>Best regards</div><div>Muhammad Usama</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
> Thanks<br>
> Best Regards<br>
> Muhammad Usama<br>
> <br>
> <br>
>> >> >> >> > Can you please try out the attached patch, to see if the solution works<br>
>> >> >> >> > for the situation?<br>
>> >> >> >> > The patch is generated against current master branch.<br>
>> >> >> >> ><br>
>> >> >> >> > Thanks<br>
>> >> >> >> > Best Regards<br>
>> >> >> >> > Muhammad Usama<br>
>> >> >> >> ><br>
>> >> >> >> > On Wed, Apr 10, 2019 at 2:04 PM TAKATSUKA Haruka <<a href="mailto:harukat@sraoss.co.jp" target="_blank">harukat@sraoss.co.jp</a>> wrote:<br>
>> >> >> >> ><br>
>> >> >> >> >> Hello, Pgpool developers<br>
>> >> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >> >> I found that the Pgpool-II watchdog is too strict about duplicate failover<br>
>> >> >> >> >> requests with the allow_multiple_failover_requests_from_node=off setting.<br>
>> >> >> >> >><br>
>> >> >> >> >> For example, consider a watchdog cluster with 3 pgpool instances.<br>
>> >> >> >> >> Their backends are PostgreSQL servers using streaming replication.<br>
>> >> >> >> >><br>
>> >> >> >> >> When the communication between the master/coordinator pgpool and<br>
>> >> >> >> >> the primary PostgreSQL node is down for a short period<br>
>> >> >> >> >> (or pgpool makes a false-positive judgement for various reasons),<br>
>> >> >> >> >> the pgpool tries to fail over but cannot get consensus,<br>
>> >> >> >> >> so it puts the primary node into quarantine status. It cannot<br>
>> >> >> >> >> be reset automatically. As a result, the service becomes unavailable.<br>
>> >> >> >> >><br>
>> >> >> >> >> This case generates logs like the following:<br>
>> >> >> >> >><br>
>> >> >> >> >> pid 1234: LOG: new IPC connection received<br>
>> >> >> >> >> pid 1234: LOG: watchdog received the failover command from local pgpool-II on IPC interface<br>
>> >> >> >> >> pid 1234: LOG: watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface<br>
>> >> >> >> >> pid 1234: LOG: Duplicate failover request from "pg1:5432 Linux pg1" node<br>
>> >> >> >> >> pid 1234: DETAIL: request ignored<br>
>> >> >> >> >> pid 1234: LOG: failover requires the majority vote, waiting for consensus<br>
>> >> >> >> >> pid 1234: DETAIL: failover request noted<br>
>> >> >> >> >> pid 4321: LOG: degenerate backend request for 1 node(s) from pid [4321], is changed to quarantine node request by watchdog<br>
>> >> >> >> >> pid 4321: DETAIL: watchdog is taking time to build consensus<br>
>> >> >> >> >><br>
>> >> >> >> >> Note that this case doesn't involve any communication trouble among<br>
>> >> >> >> >> the Pgpool watchdog nodes.<br>
>> >> >> >> >> You can reproduce it by changing one PostgreSQL's pg_hba.conf to<br>
>> >> >> >> >> reject the health check access from one pgpool node for a short period.<br>
>> >> >> >> >><br>
>> >> >> >> >> The documentation doesn't say that duplicate failover requests put the node<br>
>> >> >> >> >> into quarantine immediately. I think the request should just be ignored.<br>
>> >> >> >> >><br>
>> >> >> >> >> A patch file for head of V3_7_STABLE is attached.<br>
>> >> >> >> >> Pgpool with this patch still prevents a failover triggered only by a single<br>
>> >> >> >> >> pgpool's repeated failover requests, but it can recover when the connection<br>
>> >> >> >> >> trouble is gone.<br>
>> >> >> >> >><br>
>> >> >> >> >> Does this change have any problem?<br>
>> >> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >> >> with best regards,<br>
>> >> >> >> >> TAKATSUKA Haruka <<a href="mailto:harukat@sraoss.co.jp" target="_blank">harukat@sraoss.co.jp</a>><br>
>> >> >> >> >> _______________________________________________<br>
>> >> >> >> >> pgpool-hackers mailing list<br>
>> >> >> >> >> <a href="mailto:pgpool-hackers@pgpool.net" target="_blank">pgpool-hackers@pgpool.net</a><br>
>> >> >> >> >> <a href="http://www.pgpool.net/mailman/listinfo/pgpool-hackers" rel="noreferrer" target="_blank">http://www.pgpool.net/mailman/listinfo/pgpool-hackers</a><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >><br>
>> >><br>
>><br>
</blockquote></div></div>
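<div><br></div><div>For readers following the thread, the duplicate-request behavior described in the original report can be sketched as a toy model. This is only an illustration of the description in the report, not Pgpool-II's actual watchdog code: the class <code>FailoverVotes</code>, the method <code>request</code>, and the flag <code>quarantine_on_duplicate</code> are hypothetical names invented for this sketch. With the flag set, a repeated request from the same node leads to quarantine (the reported behavior); with it cleared, the repeat is just ignored (what the report proposes).</div>

```python
from dataclasses import dataclass, field


@dataclass
class FailoverVotes:
    """Toy model of failover vote counting for one backend node.

    Hypothetical illustration only; it does not mirror Pgpool-II internals.
    """
    total_watchdog_nodes: int
    # Stands in for allow_multiple_failover_requests_from_node (off by default).
    allow_multiple_requests_from_node: bool = False
    voters: set = field(default_factory=set)
    votes: int = 0

    def request(self, from_node: str, quarantine_on_duplicate: bool) -> str:
        """Record a failover request and return the resulting action."""
        duplicate = from_node in self.voters
        if duplicate and not self.allow_multiple_requests_from_node:
            # Reported behavior: the duplicate ends in quarantine.
            # Proposed behavior: the duplicate is simply ignored.
            return "quarantine" if quarantine_on_duplicate else "ignored"
        self.voters.add(from_node)
        self.votes += 1
        # Failover proceeds only once a strict majority has voted.
        if self.votes * 2 > self.total_watchdog_nodes:
            return "failover"
        return "waiting"


if __name__ == "__main__":
    reported = FailoverVotes(total_watchdog_nodes=3)
    print(reported.request("pg1", quarantine_on_duplicate=True))
    print(reported.request("pg1", quarantine_on_duplicate=True))

    proposed = FailoverVotes(total_watchdog_nodes=3)
    print(proposed.request("pg1", quarantine_on_duplicate=False))
    print(proposed.request("pg1", quarantine_on_duplicate=False))
    print(proposed.request("pg2", quarantine_on_duplicate=False))
```

<div>In this model a single node can never reach a majority in a 3-node cluster on its own, which matches the report's point that the repeated request alone should not change the backend's state.</div>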