<div dir="ltr">I actually tried to replicate this scenario by issuing a select statement to pgpool and running a pg_terminate_backend command on that query in the backend cluster. This did not trigger a cluster eviction. Also, after sometime, it evicted one of the clusters when there was no pg_terminate_backend running. Can pg_terminate_backend on any session cause this or does it need to terminate the specific session that is being handled by pgpool. Either way, I could not replicate this scenario in my testing. What else could have caused this?<div><br></div><div><div>2016-11-08 21:00:19: pid 14686: LOG: reading and processing packets</div><div>2016-11-08 21:00:19: pid 14686: DETAIL: postmaster on DB node 0 was shutdown by administrative command</div><div>2016-11-08 21:00:19: pid 14686: LOG: received degenerate backend request for node_id: 0 from pid [14686]</div><div>2016-11-08 21:00:19: pid 6098: LOG: starting degeneration. shutdown host <a href="http://armada.cutpdh4v4uf7.us-east-1.redshift.amazonaws.com">armada.cutpdh4v4uf7.us-east-1.redshift.amazonaws.com</a>(5439)</div><div>2016-11-08 21:00:19: pid 6098: LOG: Restart all children</div><div>2016-11-08 21:00:19: pid 6098: LOG: failover: set new primary node: -1</div><div>2016-11-08 21:00:19: pid 18576: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6098: LOG: failover: set new master node: 1</div><div>2016-11-08 21:00:19: pid 10896: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 18491: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6099: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6556: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 18495: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 9492: LOG: child process received shutdown request signal 
3</div><div>2016-11-08 21:00:19: pid 6128: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6549: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 19025: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6103: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6127: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6554: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 19026: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 14007: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 18494: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6553: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 18575: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 14686: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 18492: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6123: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6552: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6113: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6119: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 18497: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6117: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6104: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 18493: LOG: child process received shutdown request 
signal 3</div><div>2016-11-08 21:00:19: pid 18496: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 12497: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6109: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6106: LOG: child process received shutdown request signal 3</div><div>2016-11-08 21:00:19: pid 6132: LOG: worker process received restart request</div><div>failover done. shutdown host <a href="http://armada.amazonaws.com">armada.amazonaws.com</a>(5439)2016-11-08 21:00:19: pid 6098: LOG: failover done. shutdown host <a href="http://armada.amazonaws.com">armada.amazonaws.com</a>(5439)</div><div>2016-11-08 21:00:20: pid 6131: LOG: restart request received in pcp child process</div><div>2016-11-08 21:00:20: pid 6098: LOG: PCP child 6131 exits with status 0 in failover()</div><div>2016-11-08 21:00:20: pid 6098: LOG: fork a new PCP child pid 26992 in failover()</div></div><div><br></div><div><br></div><div>For the health check error below, why is this happening only with db node 0 and not with db node 1 when both of the underlying clusters have the exact same redshift configuration?</div><div><br></div><div><div>2016-11-08 21:07:46: pid 6098: LOG: pool_ssl: "SSL_read": "no SSL error reported"</div><div>2016-11-08 21:07:46: pid 6098: LOG: notice_backend_error: called from pgpool main. ignored.</div><div>2016-11-08 21:07:46: pid 6098: WARNING: child_exit: called from invalid process. ignored.</div><div>2016-11-08 21:07:46: pid 6098: ERROR: unable to read data from DB node 0</div><div>2016-11-08 21:07:46: pid 6098: DETAIL: socket read failed with an error "Success"</div></div><div><br></div><div class="gmail_extra"><br clear="all"><div><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">Thanks,<div>- Manoj</div></div></div></div>
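One hedged way to make the reproduction deterministic (a sketch; the user name and process id below are placeholders, and it assumes Redshift's system tables) is to terminate the exact session pgpool holds rather than an arbitrary one:

```sql
-- List open sessions on the Redshift cluster and pick the one whose
-- user/start time matches pgpool's connection ('username' is an assumption).
SELECT process, user_name, starttime
FROM stv_sessions
WHERE user_name = 'username';

-- Terminate that specific session. Terminating an unrelated session should
-- not, by itself, make pgpool degenerate the node.
SELECT pg_terminate_backend(12345);  -- 12345 = process id from the query above
```

If degeneration only fires for pgpool's own session, that would explain why terminating an arbitrary backend did not reproduce it.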
<br><div class="gmail_quote">On Wed, Nov 9, 2016 at 1:46 AM, Tatsuo Ishii <span dir="ltr"><<a href="mailto:ishii@sraoss.co.jp" target="_blank">ishii@sraoss.co.jp</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">>>> Whenever Pgpool-II thinks a backend is being down, there should be a<br>
>>>log entry in the Pgpool-II log file. Please check.<br>
>><br>
>> This is the error in the log file when this happens<br>
>><br>
>> 2016-11-02 00:00:07: pid 9217: DETAIL: postmaster on DB node 0 was<br>
>> shutdown by administrative command<br>
>> 2016-11-02 00:00:07: pid 9217: LOG: received degenerate backend request<br>
>> for node_id: 0 from pid [9217]<br>
>> 2016-11-02 00:00:07: pid 9188: LOG: starting degeneration. shutdown host<br>
>> <a href="http://prod1.amazonaws.com" rel="noreferrer" target="_blank">prod1.amazonaws.com</a>(5439)<br>
>> 2016-11-02 00:00:07: pid 9188: LOG: Restart all children<br>
>><br>
>> What does "postmaster on DB node 0 was shutdown by administrative command".<br>
>> I havent sent any shutdown commands to pgpool.<br>
><br>
> Someone shut down PostgreSQL (or used pg_cancel_backend).<br>
<br>
</span>Correction. I meant pg_terminate_backend, rather than<br>
pg_cancel_backend (it just cancels the current query, if any, and is harmless<br>
for Pgpool-II).<br>
<div><div class="h5"><br>
>> I verify connectivity to the<br>
>> cluster whenever this happens and it is always fine. Why does the health<br>
>> check that I configured to run every 30 secs not sense that the cluster is<br>
>> back up again and update the pgpool_status file?<br>
><br>
> See the FAQ.<br>
> <a href="http://www.pgpool.net/mediawiki/index.php/FAQ#Why_does_not_Pgpool-II_automatically_recognize_a_database_comes_back_online.3F" rel="noreferrer" target="_blank">http://www.pgpool.net/<wbr>mediawiki/index.php/FAQ#Why_<wbr>does_not_Pgpool-II_<wbr>automatically_recognize_a_<wbr>database_comes_back_online.3F</a><br>
><br>
>> Health check details from<br>
>> the log are below<br>
>><br>
>> 2016-11-01 23:59:54: pid 9188: LOG: notice_backend_error: called from<br>
>> pgpool main. ignored.<br>
>> 2016-11-01 23:59:54: pid 9188: WARNING: child_exit: called from invalid<br>
>> process. ignored.<br>
><br>
> No worry for this part. There was a race condition inside Pgpool-II<br>
> but was resolved.<br>
><br>
>> 2016-11-01 23:59:54: pid 9188: ERROR: unable to read data from DB node 0<br>
>> 2016-11-01 23:59:54: pid 9188: DETAIL: socket read failed with an error<br>
>> "Success"<br>
>><br>
>> What does the above log indicate?<br>
><br>
> DB node 0 disconnected the socket to Pgpool-II.<br>
><br>
>>>Yes, it randomly routes to backends. You can control the possibility<br>
>>>of the routing.<br>
>><br>
>> Is it possible to control routing using a round-robin approach or least-used<br>
>> cluster? If so, where do I configure this?<br>
><br>
> No.<br>
><br>
>> Thanks,<br>
>> - Manoj<br>
>><br>
>> On Mon, Nov 7, 2016 at 12:08 AM, Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp">ishii@sraoss.co.jp</a>> wrote:<br>
>><br>
>>> > I have pgpool configured against two redshift backend clusters to do<br>
>>> > parallel writes. Seemingly at random, pgpool determines that one or both<br>
>>> > of the clusters are down and stops accepting connections even when they are<br>
>>> > not down. I have health check configured every 30 seconds but that does<br>
>>> not<br>
>>> > help as it checks health and still determines they are down in<br>
>>> pgpool_status<br>
>>> > file. How is health status determined and written to the file<br>
>>> > /var/log/pgpool/pgpool_status and why does pgpool think the clusters are<br>
>>> > down when they are not?<br>
>>><br>
>>> Whenever Pgpool-II thinks a backend is down, there should be a<br>
>>> log entry in the Pgpool-II log file. Please check.<br>
>>><br>
>>> > I also tested read query routing and noticed they were being routed<br>
>>> > randomly to the backend clusters. Is there a specific algorithm that<br>
>>> pgpool<br>
>>> > uses for read query routing?<br>
>>><br>
>>> Yes, it randomly routes to backends. You can control the possibility<br>
>>> of the routing.<br>
>>><br>
>>> ><br>
>>> ><br>
>>> ><br>
>>> ><br>
>>> > My config parameters are below<br>
>>> ><br>
>>> ><br>
>>> ><br>
>>> > backend_hostname0 = 'cluster1'<br>
>>> ><br>
>>> > backend_port0 = 5439<br>
>>> ><br>
>>> > backend_weight0 = 1<br>
>>> ><br>
>>> > backend_data_directory0 = '/data1'<br>
>>> ><br>
>>> > backend_flag0 = 'ALLOW_TO_FAILOVER'<br>
>>> ><br>
>>> ><br>
>>> ><br>
>>> > backend_hostname1 = 'cluster2'<br>
>>> ><br>
>>> > backend_port1 = 5439<br>
>>> ><br>
>>> > backend_weight1 = 1<br>
>>> ><br>
>>> > backend_data_directory1 = '/data1'<br>
>>> ><br>
>>> > backend_flag1 = 'ALLOW_TO_FAILOVER'<br>
>>> ><br>
>>> ><br>
>>> ><br>
>>> > #-----------------------------<wbr>------------------------------<br>
>>> > -------------------<br>
>>> ><br>
>>> > # HEALTH CHECK<br>
>>> ><br>
>>> > #-----------------------------<wbr>------------------------------<br>
>>> > -------------------<br>
>>> ><br>
>>> ><br>
>>> ><br>
>>> > health_check_period = 30<br>
>>> ><br>
>>> > # Health check period<br>
>>> ><br>
>>> > # Disabled (0) by default<br>
>>> ><br>
>>> > health_check_timeout = 20<br>
>>> ><br>
>>> > # Health check timeout<br>
>>> ><br>
>>> > # 0 means no timeout<br>
>>> ><br>
>>> > health_check_user = 'username'<br>
>>> ><br>
>>> > # Health check user<br>
>>> ><br>
>>> > health_check_password = 'password'<br>
>>> ><br>
>>> > # Password for health check user<br>
>>> ><br>
>>> > health_check_max_retries = 10<br>
>>> ><br>
>>> > # Maximum number of times to retry a<br>
>>> > failed health check before giving up.<br>
>>> ><br>
>>> > health_check_retry_delay = 1<br>
>>> ><br>
>>> > # Amount of time to wait (in seconds)<br>
>>> > between retries.<br>
>>><br>
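Incidentally, with the settings quoted above, a rough worst-case window before pgpool gives up on a node can be estimated. This is a back-of-the-envelope sketch, assuming each failed check blocks for the full timeout before the retry delay:

```shell
# Back-of-the-envelope: worst-case seconds before pgpool degenerates a node,
# assuming every failed health check waits out the full timeout, then the
# retry delay, for each of the configured retries.
timeout=20   # health_check_timeout
retries=10   # health_check_max_retries
delay=1      # health_check_retry_delay
worst_case=$(( retries * (timeout + delay) ))
echo "$worst_case seconds"   # 210 seconds
```

That puts worst-case detection around three and a half minutes, which may matter when correlating log timestamps with an eviction.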
</div></div>> ______________________________<wbr>_________________<br>
> pgpool-general mailing list<br>
> <a href="mailto:pgpool-general@pgpool.net">pgpool-general@pgpool.net</a><br>
> <a href="http://www.pgpool.net/mailman/listinfo/pgpool-general" rel="noreferrer" target="_blank">http://www.pgpool.net/mailman/<wbr>listinfo/pgpool-general</a><br>
</blockquote></div><br></div></div>