RPC server gets stuck in TLS handshake protocol and becomes unresponsive

Description

The listener thread of the RPC server performs the TLS handshake upon accepting a connection. Under high load, the loop performing the TLS handshake might never exit.

Normally the handshake continues until it completes or until the server reads EOF from the underlying TCP socket. Under high load, a connection can be dropped while the socket still appears connected. Calling read on that socket then returns 0 (zero) instead of EOF, so the handshake loop spins without ever exiting. Because the listener thread runs the handshake itself, all incoming requests are blocked behind this problematic connection.
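One possible way to keep such a loop from spinning forever is to bound the time spent on zero-byte reads. The following is only a minimal sketch under stated assumptions, not the fix applied to the RPC server: the class HandshakeReadGuard, the method readOrFail and the timeout parameter are hypothetical names introduced for illustration. It treats a return value of -1 as EOF, returns as soon as data arrives, and gives up once the deadline passes while read keeps returning 0.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

// Hypothetical helper, not part of Hadoop: bounds the time a handshake
// read may spend on a channel that keeps returning 0 bytes.
final class HandshakeReadGuard {

    private HandshakeReadGuard() {
    }

    /**
     * Reads from the channel until at least one byte arrives.
     *
     * @return the number of bytes read (always greater than 0)
     * @throws EOFException if the channel reports end-of-stream (-1)
     * @throws IOException  if no data arrives within timeoutMillis
     */
    public static int readOrFail(ReadableByteChannel channel, ByteBuffer dst,
                                 long timeoutMillis) throws IOException {
        if (!dst.hasRemaining()) {
            // A full buffer would make every read return 0.
            throw new IllegalArgumentException("destination buffer is full");
        }
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            int n = channel.read(dst);
            if (n > 0) {
                return n; // progress was made, the caller can continue the handshake
            }
            if (n < 0) {
                throw new EOFException("peer closed the connection during handshake");
            }
            // n == 0: nothing available. Without this check the loop never exits.
            if (System.nanoTime() - deadline >= 0) {
                throw new IOException("handshake read timed out after "
                        + timeoutMillis + " ms");
            }
            Thread.yield(); // let other threads run while waiting
        }
    }
}
```

A caller inside the handshake loop would catch the IOException, close the offending connection and go back to accepting new ones. Moving the handshake off the listener thread entirely would be another option, but that is beyond this sketch.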

A thread dump shows the listener thread (IPC Server Listener) stuck at:

at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
- locked <0x00000001b2fb4e98> (a java.lang.Object)
at org.apache.hadoop.ipc.RpcSSLEngineAbstr.doHandshake(RpcSSLEngineAbstr.java:95)
at org.apache.hadoop.ipc.Server$Connection.doHandshake(Server.java:1660)
at org.apache.hadoop.ipc.Server$Listener.doAccept(Server.java:1137)
at org.apache.hadoop.ipc.Server$Listener.run(Server.java:1049)

Assignee: Antonis Kouzoupis
Reporter: Antonis Kouzoupis
Labels: None
Priority: Highest