[pgpool-general: 9057] Re: 0 th backend is not valid after upgrading to 4.5.1

Tatsuo Ishii ishii at sraoss.co.jp
Sat Mar 30 16:42:51 JST 2024


> Op vr 29 mrt 2024 om 15:26 schreef Tatsuo Ishii <ishii at sraoss.co.jp>:
> 
>> > Hi,
>> >
>> > After upgrading from 4.5.0 to 4.5.1 we see lots of regressions in our
>> > test-suite when failovers are involved with the following exception in
>> > WildFly:
>> > Caused by: org.postgresql.util.PSQLException: ERROR: unable to read
>> message
>> > kind from backend
>> >   Detail: 0 th backend is not valid
>> >
>> > I'm not sure what is causing this, but I've got the feeling that pgpool
>> > keeps trying to send queries to a database that is not available. I've
>> > attached logging from pgpool. You can find the first error at 04:10:16
>> > (or 2024-03-27T03:10:16.735933246Z pgpool timestamp).
>>
>> Can you share the backend status information at 04:10:16?
>> - Each backend node id status (up or down)
>> - Which was the primary node id?
>>
> 
> I've attached the output of pcp_watchdog_info and pcp_node_info from nodes
> 1 and 3. As this is historical data, I don't have the status at exactly
> 04:10:16, but the output from node 3 is close (04:10:31). The error is
> persistent. It keeps throwing the error until the test is aborted at 04:29.
> 
> As you can see, all 3 databases are up and synchronized. Node 2 runs the
> primary database and Node 3 is the pgpool leader.

Ok. The commit you pointed out:
> is the change that is causing the problems:
> https://github.com/pgpool/pgpool2/commit/a2a86804cb838b416f317cc083d521a5c691f2ec

has a bug: when load_balance_mode is off, I found that it does not
work unless the primary node id is 0. It should set the
query_context->where_to_send map for the primary node id, but when
the primary node id is not 0, the map was not set at all. This makes
Pgpool-II believe that the 0th backend needs to be accessed, which
leads to the error you are seeing. Another error case I found is
(incorrectly) reading from node 0, which leads to a hang. This is
easy to reproduce; the steps follow the sketch below.
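
To make it concrete, here is a minimal standalone sketch of the
difference (illustrative only: set_target_buggy/set_target_fixed and
the node layout are made up for the example; this is not the pgpool
source and not the attached patch):

/* Illustrative sketch, not pgpool code: one flag per backend saying
 * whether the query should be sent there, as in where_to_send. */
#include <stdio.h>
#include <stdbool.h>
#include <string.h>

#define NUM_BACKENDS 2

static bool where_to_send[NUM_BACKENDS];

/* Buggy behaviour described above: the map is only filled when the
 * primary happens to be node 0, so with primary node id 1 nothing is
 * marked and node 0 ends up being used. */
static void set_target_buggy(int primary_node_id)
{
    if (primary_node_id == 0)
        where_to_send[0] = true;
}

/* Fixed behaviour: always mark the current primary, whatever its id. */
static void set_target_fixed(int primary_node_id)
{
    where_to_send[primary_node_id] = true;
}

int main(void)
{
    int primary = 1;    /* as in the reproduction steps below */

    memset(where_to_send, 0, sizeof(where_to_send));
    set_target_buggy(primary);
    printf("buggy: send to node 1 (primary)? %d\n", where_to_send[1]); /* 0 */

    memset(where_to_send, 0, sizeof(where_to_send));
    set_target_fixed(primary);
    printf("fixed: send to node 1 (primary)? %d\n", where_to_send[1]); /* 1 */
    return 0;
}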

(1) create a two-node cluster using pgpool_setup

(2) shut down node 0 so that node 1 is promoted, then recover node 0
(pcp_recovery_node 0). This makes node 0 a standby and node 1 the
primary.

(3) add the following to pgpool.conf and restart the whole cluster.

load_balance_mode = off
backend_weight1 = 0

(4) type "begin" in psql. It gets stuck.
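
If you want to double-check the roles before step (4), "show
pool_nodes" run through pgpool reports each backend's status and
role; node 0 should be a standby and node 1 the primary. A small
libpq sketch of that check (the connection parameters are assumptions
based on pgpool_setup defaults; adjust them for your environment):

/* Build with: cc check_roles.c -I"$(pg_config --includedir)" -lpq */
#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    /* assumed pgpool_setup defaults: pgpool on localhost:11000 */
    PGconn *conn = PQconnectdb("host=localhost port=11000 dbname=postgres");
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* "show pool_nodes" is answered by pgpool itself, not PostgreSQL */
    PGresult *res = PQexec(conn, "show pool_nodes");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "show pool_nodes failed: %s", PQerrorMessage(conn));
        PQclear(res);
        PQfinish(conn);
        return 1;
    }

    int id_col = PQfnumber(res, "node_id");
    int status_col = PQfnumber(res, "status");
    int role_col = PQfnumber(res, "role");
    if (id_col < 0 || status_col < 0 || role_col < 0)
    {
        fprintf(stderr, "unexpected column layout\n");
        PQclear(res);
        PQfinish(conn);
        return 1;
    }

    for (int i = 0; i < PQntuples(res); i++)
        printf("node %s: status=%s role=%s\n",
               PQgetvalue(res, i, id_col),
               PQgetvalue(res, i, status_col),
               PQgetvalue(res, i, role_col));

    PQclear(res);
    PQfinish(conn);
    return 0;
}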

Attached is the patch against 4.5.1.
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fix_load_balance.patch
Type: text/x-patch
Size: 1147 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20240330/2aefe394/attachment.bin>

