SmartSockets has been designed to handle many different kinds of network failures, and this robust behavior provides a certain level of fault tolerance. The core of SmartSockets fault tolerance is implemented in connections.
This section describes the features of connections that implement fault tolerance. For a discussion of the features specific to RTserver and RTclient that add more fault tolerance, such as hot switchover from primary RTclients to backup RTclients, see Handling Network Failures In Publish Subscribe, and Running an RTclient With a Hot Backup.
In addition to detecting network failures, connections can completely recover from these failures by using guaranteed message delivery, covered in Chapter 4, Guaranteed Message Delivery.
Fault tolerance is a term used to describe computer systems that continue to function even when some of their hardware or software fails. Examples of failure conditions include a node crashing or being turned off, a network link being ruptured, and a process failing.
Fault tolerance can be implemented in hardware by mirrored file systems on multiple disks, redundant networks, redundant CPUs, redundant memory, and so on. Hardware-based fault-tolerant systems are more expensive than non-fault-tolerant systems because of the extra components. Fault tolerance can also be implemented in software by products such as SmartSockets. Surprisingly, enabling the fault-tolerant features of connections has little effect on message throughput.
The general mechanisms that SmartSockets uses for fault tolerance are timeouts on connection operations, keep alives that check the health of a connection, and guaranteed message delivery for complete recovery from failures.
Connections use sockets as the communication link between two processes, and thus can use the features of sockets and each network protocol for detecting network failures. There are three areas of connections where problems can occur: creating a connection, sending data on a connection, and receiving data on a connection.
In each area, SmartSockets builds on top of the features of sockets and network protocols to provide faster detection of network problems. Each IPC protocol (local and TCP/IP) handles failures differently, which complicates matters. For the local protocol, there are no possible network failures, because this protocol does not use a network, although processes that use the local IPC protocol can still fail.
For creating a connection, the server connection does not need any special handling, because creation of the server connection with the function TipcConnCreateServer completes immediately. The process with the server connection can use TipcConnCheck to check whether a client has connected before calling TipcConnAccept to accept the client connection. Creation of the client connection with the function TipcConnCreateClient may not complete immediately if the node of the server connection is unavailable, for example because it has crashed, is turned off, or the network is ruptured. For TCP/IP client connections, the option Socket_Connect_Timeout can be used to set a limit on how long (in seconds) to wait for availability. The default value for Socket_Connect_Timeout is 5.0. If Socket_Connect_Timeout is set to 0.0, the client connection creation timeout is disabled, and TCP/IP clients block for an operating-system-dependent amount of time (typically about 75 seconds) if the server node is not available.
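As an illustration only (not an example from this manual), the following sketch shows how a TCP/IP client might lower Socket_Connect_Timeout before creating its connection. The TutOptionLookup and TutOptionSetNum calls and the placeholder connection name "tcp:server_node:5252" are assumptions that should be verified against the API reference for your release.

T_OPTION option;
T_IPC_CONN conn;

/* assumption: lower the connect timeout from the default 5.0 to 2.0 seconds */
option = TutOptionLookup("socket_connect_timeout");
if (option == NULL || !TutOptionSetNum(option, 2.0)) {
  /* error */
}

/* "tcp:server_node:5252" is a placeholder connection name */
conn = TipcConnCreateClient("tcp:server_node:5252");
if (conn == NULL) {
  /* server node unavailable or the 2.0-second connect timeout expired */
}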
When sending data on a connection, if the data cannot be sent, either the receiving process is not keeping up or a network failure has occurred. By default, the TCP/IP protocol does not send any packets during periods of inactivity and does not forcefully break a link for many types of network problems (for example, a broken network cable). TCP/IP does have the concept of an optional keepalive that can be enabled. This network-level TCP/IP keepalive is different from an application-level connection keep alive, but serves the same purpose. From this point on, the term keepalive (one word) refers to a TCP/IP health check, and the term keep alive (two words) refers to an application-level health check. The default TCP/IP keepalive timeout is very large on most systems (typically two hours), cannot be changed by non-privileged users, and can only be changed for all TCP/IP links, not just one. This makes the TCP/IP keepalive unusable for most applications. It is still available to SmartSockets programs, however, through the socket option (not to be confused with a SmartSockets option) SO_KEEPALIVE. Refer to your operating system manuals for full information on this socket option.
This code fragment enables TCP/IP keepalives on a connection’s socket:
T_INT4 conn_socket;
int one = 1;

if (!TipcConnGetSocket(conn, &conn_socket)) {
  /* error */
}
if (setsockopt(conn_socket, SOL_SOCKET, SO_KEEPALIVE,
               (char *)&one, sizeof(one)) != 0) {
  /* error */
}
For receiving data on a connection, if data cannot be received, then either the sending process has not sent anything or a network failure has occurred. The above-mentioned features of TCP/IP also apply to receiving data: by default, TCP/IP does not send packets during periods of inactivity. If no data is received within a certain period of time, a connection can initiate a connection keep alive (not to be confused with a TCP/IP keepalive) to check the health of the connection. Keep alives are discussed in detail in the next section.
A connection keep alive is a very simple way to check the health of a connection, including the network and the process at the other end of a connection. The function TipcConnKeepAlive is used to perform a keep alive. Connection keep alives are remote procedure calls that send a KEEP_ALIVE_CALL message through a connection and then wait for a KEEP_ALIVE_RESULT message back from the other process. If the other process is alive, it receives the KEEP_ALIVE_CALL message and sends back a KEEP_ALIVE_RESULT message. If the keep alive originator does not receive a response within a certain period of time, it assumes there has been a network failure and destroys the connection or takes other actions.
For most uses, you can simply set the block mode, read timeout, write timeout, and keep alive timeout properties of a connection to automatically control checking for network failures (see Connection Composition for details). The function TipcConnCheck automatically calls TipcConnKeepAlive if the amount of time that has elapsed since data was last read from the connection is greater than the read timeout property of the connection. A connection by default processes a KEEP_ALIVE_CALL message with the process callback function TipcCbConnProcessKeepAliveCall. This function handles sending back a KEEP_ALIVE_RESULT message to the process that originated the keep alive. While timeout checking is normally done automatically and transparently by TipcConnCheck, you can call TipcConnKeepAlive directly to explicitly check the health of a connection.
You should not try to explicitly send or receive KEEP_ALIVE_CALL and KEEP_ALIVE_RESULT messages; instead, always use TipcConnCheck, TipcConnKeepAlive, and TipcCbConnProcessKeepAliveCall to handle the details of keep alives. Because keep alives check the health of both the network and the other process, a process must be careful to read and process messages at a regular interval; otherwise, keep alives directed at it fail.
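For illustration, this minimal sketch checks a connection's health directly; it assumes TipcConnKeepAlive takes the connection and a timeout in seconds and returns FALSE if no KEEP_ALIVE_RESULT arrives in time, which should be confirmed against the API reference for your release.

/* explicitly verify that the process at the other end is still alive */
if (!TipcConnKeepAlive(conn, 10.0)) {
  /* assumed behavior: no KEEP_ALIVE_RESULT within 10 seconds, so treat
     the connection as failed and destroy it */
  TipcConnDestroy(conn);
}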
As described in Block Mode, for read timeouts, write timeouts, and automatic keep alives to be enabled, the connection block mode must be set to FALSE to enable non-blocking read and write operations. If the connection block mode is TRUE, read and write operations can block indefinitely, and many network failures cannot be detected.
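The sketch below is an assumption-laden illustration, not a verbatim manual example: it configures a connection so that TipcConnCheck can detect failures automatically. The property setters TipcConnSetBlockMode and TipcConnSetTimeout and the T_IPC_TIMEOUT_READ, T_IPC_TIMEOUT_WRITE, and T_IPC_TIMEOUT_KEEP_ALIVE constants should be confirmed against Connection Composition and the API reference.

/* non-blocking I/O is required for timeout and keep alive checking */
if (!TipcConnSetBlockMode(conn, FALSE)) {
  /* error */
}
/* assumed setters: read, write, and keep alive timeouts in seconds */
if (!TipcConnSetTimeout(conn, T_IPC_TIMEOUT_READ, 30.0)
    || !TipcConnSetTimeout(conn, T_IPC_TIMEOUT_WRITE, 30.0)
    || !TipcConnSetTimeout(conn, T_IPC_TIMEOUT_KEEP_ALIVE, 10.0)) {
  /* error */
}
/* with these properties set, TipcConnCheck initiates a keep alive when more
   than 30 seconds pass without any data being read from the connection */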
Connection read and write operations are handled differently. If no data can be read within a certain period of time, some kind of failure may have occurred, or there may simply be no data to read. Thus if a read timeout occurs, a keep alive is initiated to check if the process at the other end of the connection is still alive.
If no data can be written within a certain period of time, however, this indicates a problem, because the connection's socket is full. There is no point in initiating a keep alive when a write timeout occurs, because the keep alive call itself most likely could not be written to the already-full socket.