[pgpool-general: 2347] Problems with recovered master node.

Thu Dec 19 08:04:41 JST 2013

Hi everyone.

I'm having an issue with pgpool.
First of all my configuration:
1 Server with pgpool-II version 3.3.2 (tokakiboshi)
3 Servers with:
postgresql93-contrib-9.3.2-1PGDG.rhel6.x86_64
postgresql93-devel-9.3.2-1PGDG.rhel6.x86_64
postgresql93-libs-9.3.2-1PGDG.rhel6.x86_64
postgresql93-9.3.2-1PGDG.rhel6.x86_64
postgresql93-server-9.3.2-1PGDG.rhel6.x86_64

All 4 servers are RHEL 6 machines. (Amazon AMI's actually)

Pgpool is configured to have Master/Slave Nodes with replication
and Load Balancing.

#------------------------------------------------------------------------------
# LOAD BALANCING MODE
#------------------------------------------------------------------------------

load_balance_mode = on
                                   # Activate load balancing mode
                                   # (change requires restart)
ignore_leading_white_space = on
                                   # Ignore leading white spaces of each
query
white_function_list = ''
                                   # Comma separated list of function names
                                   # that don't write to database
                                   # Regexp are accepted
black_function_list = 'nextval,setval'
                                   # Comma separated list of function names
                                   # that write to database
                                   # Regexp are accepted

#------------------------------------------------------------------------------
# MASTER/SLAVE MODE
#------------------------------------------------------------------------------

master_slave_mode = on
                                   # Activate master/slave mode
                                   # (change requires restart)
master_slave_sub_mode = 'stream'
                                   # Master/slave sub mode
                                   # Valid values are combinations slony or
                                   # stream. Default is slony.
                                   # (change requires restart)

# - Streaming -

sr_check_period = 1
                                   # Streaming replication check period
                                   # Disabled (0) by default
sr_check_user = 'postgres'
                                   # Streaming replication check user
                                   # This is necessary even if you disable
                                   # streaming replication delay check with
                                   # sr_check_period = 0
sr_check_password = ''
                                   # Password for streaming replication
check user
delay_threshold = 1
                                   # Threshold before not dispatching
query to standby node
                                   # Unit is in bytes
                                   # Disabled (0) by default

# - Special commands -

follow_master_command = '/bin/follow_master.sh %d %h %D %m %H %R'
                                   # Executes this command after master
failover
                                   # Special values:
                                   #   %d = node id
                                   #   %h = host name
                                   #   %p = port number
                                   #   %D = database cluster path
                                   #   %m = new master node id
                                   #   %H = hostname of the new master node
                                   #   %M = old master node id
                                   #   %P = old primary node id
                                   #   %r = new master port number
                                   #   %R = new master database cluster path
                                   #   %% = '%' character

#All three nodes are configured like this
backend_hostnamen = 'ip_node_n'
backend_portn = 5432
backend_weightn = 0
backend_data_directoryn = '/var/lib/pgsql/9.3/data/'
backend_flagn = 'ALLOW_TO_FAILOVER'

I'm using repmgr to handle the replication in the nodes.

My problem occurs when I kill the master node, everything works as
expected and a slave takes it's place.
My application continues working with no major issues until I decide to
recover the dead master node.

After that pgool starts sending this message:
2013-12-18 17:19:07 LOG:   pid 30801: do_child: exits with status 1 due
to error
2013-12-18 17:19:10 LOG:   pid 30802: pool_send_and_wait: Error or
notice message from backend: : DB node id: 0 backend pid: 27585
statement: ROLLBACK message: there is no transaction in progress
2013-12-18 17:19:10 ERROR: pid 30802: read_kind_from_backend: 1 th kind
C does not match with master or majority connection kind N
2013-12-18 17:19:10 ERROR: pid 30802: kind mismatch among backends.
Possible last query was: "ROLLBACK" kind details are: 0[N: there is no
transaction in progress] 1[C]

And I identified that the problem is that pgpool is sending a rollback
to the old master node [node 0] (Something illegal as far as I know) and
then all the transaction is cancelled and my application fails.
If I promote the old master again, then the application works with no
problem.

Anyone has an idea why is this ocurring?

Regards!

-- 

Be aware that if you reply directly to this message, your reply may not be 
secure. Do not send us communications that contain unencrypted confidential 
information such as passwords, account numbers, or Social Security numbers.

This message, including any attachments, is for the sole use of the 
intended recipient(s) and may contain confidential and privileged 
information. If you are not the intended recipient, please destroy all 
copies of this message and any attachments. In addition, please notify 
Gazzang immediately by email to info at gazzang.com.