[pgpool-hackers: 2062] Re: New feature candidate: verify standby node while finding primary node

Muhammad Usama m.usama at gmail.com
Fri Feb 24 05:31:49 JST 2017


On Thu, Jan 12, 2017 at 7:34 AM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> This is a proposal for a new feature toward Pgpool-II 3.7.
>
> Currently Pgpool-II finds the primary node and the standby nodes like this
> (this happens while Pgpool-II is starting up or during failover):
>
> 1) Issue "SELECT pg_is_in_recovery()" to a node in question.
>
> 2) If it returns "t", then decide the node is a standby. Go to the next
>    node (go back to step 1).
>
> 3) If it returns anything else, then decide the node is
>    the primary. The other nodes are regarded as standbys.
>
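
Just to spell out the check above: this is what the query returns on each
kind of node, as seen from psql (a minimal sketch):

    -- On the primary, recovery is not in progress:
    SELECT pg_is_in_recovery();
     pg_is_in_recovery
    -------------------
     f
    (1 row)

    -- On a standby, recovery is in progress:
    SELECT pg_is_in_recovery();
     pg_is_in_recovery
    -------------------
     t
    (1 row)
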
> This logic works mostly well except in an unusual scenario like this:
>
> i) We have two nodes: node 0 is primary, node 1 is standby.
>
> ii) A stupid admin issues "pg_ctl promote" on the standby node, and
>    node 1 becomes a standalone PostgreSQL.
>
> In this case, node 1 will eventually fall behind node 0, because no
> replication happens. If the replication delay check is enabled, Pgpool-II
> avoids sending queries to node 1 because of the replication
> delay. However, if the replication delay check is not enabled or the
> replication delay threshold is large, users will not notice the
> situation.
>
> This scenario is also known as "split brain", which users want to
> avoid. I think we need to do something here.
>
> Here is the modified procedure to avoid it.
>
> 1) Issue "SELECT pg_is_in_recovery()" to a node in question.
>
> 2) If it returns "t", then decide the node is a standby. Go to the next
>    node (go back to step 1).
>
> 3) If it returns anything else, then decide the node is the
>    primary. Check whether the remaining nodes are actually standbys
>    by issuing "SELECT pg_is_in_recovery()". Additionally, we could
>    use the pg_stat_wal_receiver view to check whether each of them
>    actually connects to the primary node, if the PostgreSQL version
>    is 9.6 or higher.
>
>
I think it's a very good feature to have, especially with PostgreSQL
version 9.6 and higher, if we can verify that all the configured backends
belong to the same cluster and are in the correct state. Although
unlikely, misconfigurations like that are very hard to catch otherwise.
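
For 9.6 and later, the extra verification in step 3) could be something
along these lines (just a rough sketch of the idea; how Pgpool-II would
match the connection info against the node it has chosen as primary still
needs to be worked out):

    -- Run on every node that answered 't' to pg_is_in_recovery():
    SELECT status, conninfo
      FROM pg_stat_wal_receiver;
    -- Expect one row with status = 'streaming' and a conninfo whose
    -- host/port point at the node Pgpool-II decided is the primary.
    -- No row at all, or a conninfo pointing somewhere else, means the
    -- node is not a proper standby of that primary.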



> The question is: what if the check in #3 reveals that the node in
> question is not a "proper" standby?
>
> - Do we want to add a new status code other than "up", "down", "not
>   connected" and "unused"?
>

In my opinion, adding a new status such as "not connected" or "invalid
node" would be more helpful.

>
> - Do we want to automatically detach the node so that Pgpool-II does
>   not use the node?
>

I think that as soon as Pgpool-II detects that a backend node is not
connected to the valid primary PostgreSQL, we should perform a failover on
that node, so that all sessions using the node immediately stop using it,
and so that the database administrator can take the appropriate actions on
that node in the failover script.


> - Do we want to do the check more frequently, say at a similar timing
>   to health checking?
>

Maybe we can change the health check query so that it serves both
purposes, i.e. checks the health and also the status of the backend node
at the same time.
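
Something along these lines, perhaps (only a sketch of the idea, and only
usable on 9.6 and later, since pg_stat_wal_receiver does not exist in
older releases):

    -- One round trip that tells us the node is alive, whether it is in
    -- recovery, and whether a WAL receiver process is running on it.
    SELECT pg_is_in_recovery() AS in_recovery,
           (SELECT count(*) FROM pg_stat_wal_receiver) AS wal_receivers;
    -- in_recovery = f                     -> the node claims to be a primary
    -- in_recovery = t, wal_receivers = 1  -> standby with a running WAL receiver
    -- in_recovery = t, wal_receivers = 0  -> standby that is not receiving WAL

If more than one node reports in_recovery = f, that is exactly the
split-brain situation described above.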

Thanks
Best regards
Muhammad Usama



> Comments, suggestions are welcome.

> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
> _______________________________________________
> pgpool-hackers mailing list
> pgpool-hackers at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-hackers
>

