[pgpool-hackers: 2759] Re: New feature candidate: verify standby node while finding primary node

Tatsuo Ishii ishii at sraoss.co.jp
Fri Mar 30 16:16:27 JST 2018


I revisited this and would like to have it in 3.8 :-)

> On Thu, Jan 12, 2017 at 7:34 AM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> This is a proposal for a new feature toward Pgpool-II 3.7.
>>
>> Currently Pgpool-II finds the primary node and standby nodes like this
>> (this happens while Pgpool-II is starting up or during failover):
>>
>> 1) Issue "SELECT pg_is_in_recovery()" to a node in question.
>>
>> 2) If it returns "t", then decide the node is a standby. Go to the
>>    next node (go back to step 1).
>>
>> 3) If it returns anything else, then decide the node is the
>>    primary. The other nodes are regarded as standbys.
>>
>> This logic mostly works well, except in an unusual scenario like this:
>>
>> i) We have two nodes: node 0 is primary, node 1 is standby.
>>
>> ii) A stupid admin issues "pg_ctl promote" on the standby, and node 1
>>     becomes a standalone PostgreSQL.
>>
>> In this case, node 1 will eventually fall behind node 0, because
>> replication is no longer happening. If replication delay checking is
>> enabled, Pgpool-II avoids sending queries to node 1 because of the
>> replication delay. However, if replication delay checking is not
>> enabled or the replication delay threshold is large, users will not
>> notice the situation.
>>
>> This scenario is also a form of "split brain", which users want to
>> avoid. I think we need to do something here.
>>
>> Here is the modified procedure to avoid it.
>>
>> 1) Issue "SELECT pg_is_in_recovery()" to a node in question.
>>
>> 2) If it returns "t", then decide the node is a standby. Go to the
>>    next node (go back to step 1).
>>
>> 3) If it returns anything else, then decide the node is the
>>    primary. Check whether the remaining nodes are actually standbys
>>    by issuing "SELECT pg_is_in_recovery()". Additionally, if the
>>    PostgreSQL version is 9.6 or higher, we could use the
>>    pg_stat_wal_receiver view to check whether a standby is actually
>>    connected to the primary node.
>>
>>
> I think it's a very good feature to have, especially with PostgreSQL
> 9.6 and higher, if we can verify that all the configured backends
> belong to the same cluster and are in the correct state. Although
> unlikely, misconfigurations like that are very hard to catch
> otherwise.

Right.
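
For reference, here is a minimal sketch of the two checks in SQL (the
pg_stat_wal_receiver part assumes PostgreSQL 9.6 or later; how exactly
Pgpool-II would issue these is not decided yet):

    -- Run against the node in question:
    SELECT pg_is_in_recovery();
    -- 't' -> the node is in recovery, i.e. a standby
    -- 'f' -> the node claims to be a primary (or is standalone)

    -- On a node that claims to be a standby, verify that it is
    -- actually streaming WAL from somewhere (9.6 or later only):
    SELECT status, conninfo FROM pg_stat_wal_receiver;
    -- A genuine standby returns one row with status = 'streaming' and
    -- a conninfo pointing at the primary; a promoted or standalone
    -- node returns no rows.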

>> The question is: what if the check in #3 reveals that the node in
>> question is not a "proper" standby?
>>
>> - Do we want to add a new status code other than "up", "down", "not
>>   connected" and "unused"?
>
> In my opinion, adding a new status such as "not connected" or
> "invalid node" would be more helpful.

After thinking about it for a while, I tend to agree with you that we
should invent a new status name for the node, because it makes such a
node easier to recognize by using, for example, "show pool_nodes".

I prefer "invalid node" over "not connected".
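
Purely as an illustration (columns trimmed, and the status name is
only a proposal at this point), "show pool_nodes" could then report
something like:

     node_id | hostname | port | status  | role
    ---------+----------+------+---------+---------
     0       | host0    | 5432 | up      | primary
     1       | host1    | 5432 | invalid | standby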

>> - Do we want to automatically detach the node so that Pgpool-II does
>>   not use the node?
>>
> 
> I think that as soon as Pgpool-II detects that a backend node is not
> connected to the valid primary PostgreSQL, we should perform a
> failover on that node, so that all sessions immediately stop using it
> and the database administrator can take appropriate actions on that
> node in the failover script.

I agree. However, we may want to have a new switch so that users can
keep the existing behavior; see the sketch below.
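
For example, something like this in pgpool.conf (the parameter name is
purely hypothetical; nothing has been decided):

    # hypothetical switch
    # on:  fail over a node that turns out not to be a proper standby
    #      of the primary Pgpool-II selected
    # off: keep the existing behavior (only report the new status)
    detach_false_primary = off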

>> - Do we want to do the check more frequently, say at a similar
>>   timing to health checking?
>>
> 
> Maybe we can change the health check query so that it can serve both
> purposes, i.e. check the health and also the status of the backend
> node at the same time.

Sounds nice. However, maybe we need one more new switch to enable the
feature.
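
Something along these lines, perhaps (health_check_period is an
existing parameter; the verification switch name is made up):

    health_check_period = 10    # existing parameter, in seconds
    # hypothetical new switch: also verify each node's role and
    # replication status on every health check round
    verify_backend_node = off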

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

