[pgpool-hackers: 2767] Re: New feature candidate: verify standby node while finding primary node

Wed Apr 4 11:12:13 JST 2018

Note.

>> I think as soon as Pgpool-II detects that the backend node is not connected
>> to the valid primary PostgreSQL, we should perform the failover on that
>> node to make sure that all sessions using the node should immediately stop
>> using it and also the database administrator can take desired and
>> appropriate actions on that node in the failover script.
> 
> I agree. However we may want to have a new switch to choose existig
> behavior.

If we perform the failover as soon as Pgpool-II detects the the
backend node is not connected to the valid primary, the node status
will be changed to "down" or "quarantine" status depending on quorum
settings. So it seems we don't need to invent a new status for invalid
primary/standby nodes.

Still we need a new switch though.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> - Verify that pg_stat_wal_receiver.status is not 'stopping'.
> 
> - Parse pg_stat_wal_receiver.conninfo to get primary connection info.
> 
> - Verify the connection info matches primary. If not, this does not
>   connect to our primary.
>   
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
> 
>> I visited this again and want to have it in 3.8:-)
>> 
>>> On Thu, Jan 12, 2017 at 7:34 AM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>>> 
>>>> This is a proposal for a new feature toward Pgpool-II 3.7.
>>>>
>>>> Currently Pgpool-II finds a primary node and standby node like this
>>>> (it happens while Pgpool-II starting up or failover):
>>>>
>>>> 1) Issue "SELECT pg_is_in_recovery()" to a node in question.
>>>>
>>>> 2) If it returns "t", then decide the node is standby. Go to next node
>>>>    (go back to step 1).
>>>>
>>>> 3) If it returns other than that, then decide the node is
>>>>    the primary. Other nodes are regarded as standby.
>>>>
>>>> This logic works mostly well except in an unusual scenario like this:
>>>>
>>>> i) We have two nodes: node 0 is primary, node 1 is standby.
>>>>
>>>> ii) A stupid admin issues "pg_ctl promote" to the standby node and node 1
>>>> becomes
>>>>   a stand alone PostgreSQL.
>>>>
>>>> In this case, eventually node 1 will be behind to node 0, because no
>>>> replication happens. If replication delay check is enabled, Pgpool-II
>>>> avoids to send queries to node 1 because of the replication
>>>> delay. However, if the replication delay check is not enabled or the
>>>> replication delay threshold is large, user will not notice the
>>>> situation.
>>>>
>>>> Also the scenario is known as "split brain" which users want to
>>>> avoid. I think we need to do something here.
>>>>
>>>> Here is the modified procedure to avoid it.
>>>>
>>>> 1) Issue "SELECT pg_is_in_recovery()" to a node in question.
>>>>
>>>> 2) If it returns "t", then decide the node is standby. Go to next node
>>>>    (go back to step 1).
>>>>
>>>> 3) If it returns other than that, then decide the node is the
>>>>    primary. Check remaining nodes whether they are actually standby or
>>>>    not by issuing "SELECT pg_is_in_recovery()".  Additionally we could
>>>>    use pg_stat_wal_receiver view to check if it actually connects to
>>>>    the primary node if the PostgreSQL version is 9.6 or higher.
>>>>
>>>>
>>> I think it's a very good feature to have especially with the PostgreSQL
>>> version is 9.6 and higher if we can verify that all the configured backends
>>> belongs to the same cluster and are in the correct state. Because although
>>> unlikely, but the misconfigurations like that are very hard to catch
>>> otherwise.
>> 
>> Right.
>> 
>>>> Question is, what if the checking in #3 reveals that the node in
>>>> question is not "proper" standby.
>>>>
>>>> - Do we want to add new status code other than "up", "down", "not
>>>>   connected" and "unused"?
>>>
>>> In my opinion, adding a new status  "not connected" or "invalid node"
>>> should be more helpful.
>> 
>> After thinking while, I tend to agree with you we should invent a new
>> status name for the node because it makes easier to recognize such
>> that node by using, for example, "show pool_nodes".
>> 
>> I prefer "invalid node" over "Not connected".
>> 
>>>> - Do we want to automatically detach the node so that Pgpool-II does
>>>>   not use the node?
>>>>
>>> 
>>> I think as soon as Pgpool-II detects that the backend node is not connected
>>> to the valid primary PostgreSQL, we should perform the failover on that
>>> node to make sure that all sessions using the node should immediately stop
>>> using it and also the database administrator can take desired and
>>> appropriate actions on that node in the failover script.
>> 
>> I agree. However we may want to have a new switch to choose existig
>> behavior.