New TCP Socket Architecture

From Nsnam
Revision as of 08:28, 16 June 2010 by Adriantam

This page describes an ongoing rework of NS-3 TCP.

In the following, a new architecture for the TCP socket implementation is proposed. It is intended to replace the old TcpSocketImpl class in NS-3.8 so that different flavors of TCP can be implemented easily.

The current working directory is located at

Old Structure

As of change set 6273:8d70de29d514 in the Mercurial repository, TCP simulation is implemented by the class TcpSocketImpl, in src/internet-stack/tcp-socket-impl.h and src/internet-stack/ The TcpSocketImpl class implements TCP NewReno, although the Doxygen comment claims that it implements Tahoe.

The TcpSocketImpl class is derived from TcpSocket class, which in turn, is derived from Socket class. The TcpSocket class is merely an empty class defining the interface for attribute get/set. Examples of the attributes configured by the interface of TcpSocket class are the send and receive buffer sizes, initial congestion window size, etc. The Socket class, however, provides the interface for the L7 application to call.

How to use TcpSocketImpl

TCP state machine transitions are defined in tcp-l4-protocol.h and The class TcpSocketImpl does not maintain the transition rules but keeps track of the state of the current socket.

When an application needs a TCP connection, it has to get a socket from TcpL4Protocol::CreateSocket(). This call allocates a TcpSocketImpl object and configures it (e.g. assigns it to a particular node). The TcpL4Protocol object is unique in a TCP/IP stack and serves as a mux/demux layer for the real sockets, namely, the TcpSocketImpl objects.

Once the TcpSocketImpl object is created, it is in the CLOSED state. The application can instruct it to Bind() to a port number and then Connect() or Listen(), as with a traditional BSD socket.

The Bind() call registers the socket's port in the TcpL4Protocol object as an Ipv4EndPoint and sets up the callback functions (by way of FinishBind()), so that mux/demux can be done.

The Listen() call puts the socket into the LISTEN state. The Connect() call, on the other hand, puts the socket into the SYN_SENT state and initiates the three-way handshake.

The application can close the socket by calling Close(), which in turn destroys the Ipv4EndPoint after the FIN packets have been exchanged.

Once the socket is ready to send, the application invokes the Send() call in TcpSocketImpl. The receiver-side application calls Recv() to get the packet data.

Inside TcpSocketImpl

The operation of TcpSocketImpl is carried out by two parallel mechanisms. To communicate with higher-level applications, the Send() and Recv() calls deal with the buffers directly: they append data to the send buffer and retrieve data from the receive buffer, respectively. To send and receive data over the network through the lower layers, the functions ProcessEvent(), ProcessAction(), and ProcessPacketAction() are called.

Two functions are crucial in triggering these three process functions. The function ForwardUp() is invoked when the lower layer (Ipv4L3Protocol) receives a packet destined for this TCP socket. The function SendPendingData() is invoked whenever the application appends anything to the send buffer.

ForwardUp() converts the incoming packet's TCP flags into an event. Then it updates the current state machine by calling ProcessEvent() and performs the subsequent action with ProcessPacketAction(). ProcessEvent() handles only connection setup and teardown. All other cases are handled by ProcessPacketAction() and ProcessAction(). The function ProcessPacketAction() handles those cases that need to reference the TCP header of the packet; other cases are handed over to ProcessAction().
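The table-driven flow described above can be pictured with a small standalone sketch. It is not the NS-3 code: the enum names, the FlagsToEvent() helper and the three-entry table are hypothetical simplifications. Only the idea of turning flags into an event and then looking up the next state is taken from the text.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical, reduced state and event sets; the real tables are larger.
enum class State { CLOSED, LISTEN, SYN_SENT, SYN_RCVD, ESTABLISHED };
enum class Event { SYN_RX, SYN_ACK_RX, ACK_RX, OTHER };

// TCP header flag bits as laid out in the TCP header.
constexpr uint8_t FLAG_SYN = 0x02;
constexpr uint8_t FLAG_ACK = 0x10;

// Convert the flags of an incoming segment into an event.
Event FlagsToEvent(uint8_t flags) {
  if ((flags & FLAG_SYN) && (flags & FLAG_ACK)) return Event::SYN_ACK_RX;
  if (flags & FLAG_SYN) return Event::SYN_RX;
  if (flags & FLAG_ACK) return Event::ACK_RX;
  return Event::OTHER;
}

// Look up the next state in a transition table; pairs that are not
// listed leave the state unchanged (an assumption of this sketch).
State ProcessEvent(State current, Event e) {
  static const std::map<std::pair<State, Event>, State> table = {
      {{State::LISTEN, Event::SYN_RX}, State::SYN_RCVD},
      {{State::SYN_SENT, Event::SYN_ACK_RX}, State::ESTABLISHED},
      {{State::SYN_RCVD, Event::ACK_RX}, State::ESTABLISHED},
  };
  auto it = table.find({current, e});
  return it == table.end() ? current : it->second;
}
```

For example, a listening socket that receives a SYN moves to SYN_RCVD, while an ACK arriving at a closed socket changes nothing in this toy table.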

SendPendingData() manages the send window. When the send window is big enough to send a packet, it extracts data from the send buffer, packages it with a TCP header, and passes it to the lower layers.

New Structure

The new structure, the TcpSocketBase class, has the same relationship to the TcpSocket and Socket classes as TcpSocketImpl. However, instead of providing a concrete TCP implementation, it is designed to meet the following goals:

  • Provide only the function common to all TCP classes, namely, the implementation of TCP state machine
  • Minimize the code footprint and make it modular to make it easier to understand

From a lower layer's point of view, TCP has not changed since 1980. The TCP state machine has remained the same. The only difference between variants of TCP is in congestion control and fairness. The TcpSocketBase class keeps the state machine operation, i.e. the ProcessEvent() and ProcessAction() calls, the same as in the TcpSocketImpl class. These functions, however, will be tidied up in the future.

In the current TcpSocketBase class, two auxiliary classes are used, namely, TcpRxBuffer and TcpTxBuffer.

The TcpRxBuffer is the receive (Rx) buffer for TCP. It accepts packet fragments at any position. The function TcpRxBuffer::Add() inserts a packet into the Rx buffer, obtaining the sequence number of the data from the provided TcpHeader. The Rx buffer has a maximum size, which defaults to 32KiB and can be set by TcpRxBuffer::SetMaxBufferSize(). The sequence number of the head of the buffer can be set by TcpRxBuffer::SetNextRxSeq(). This is supposed to be called when the connection is established so that the buffer can report out-of-sequence packets. TcpRxBuffer handles all the reordering work so that TcpSocketBase can simply extract data from it in a single call, TcpRxBuffer::Extract().
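The reordering idea can be illustrated with a toy buffer. This is only a sketch and not the TcpRxBuffer class: ToyRxBuffer is a hypothetical name, it takes a bare sequence number instead of a TcpHeader, and it ignores overlapping segments, sequence number wrap-around and the maximum buffer size.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy reordering buffer, keyed by the sequence number of each segment.
class ToyRxBuffer {
 public:
  // Sequence number expected at the head of the buffer.
  void SetNextRxSeq(uint32_t seq) { m_nextRxSeq = seq; }

  // Store a segment at any position; data that starts before the head
  // (already delivered) is rejected.
  bool Add(uint32_t seq, const std::string& data) {
    if (data.empty() || seq < m_nextRxSeq) return false;
    m_segments[seq] = data;
    return true;
  }

  // Return all in-order bytes at the head, stopping at the first gap.
  std::string Extract() {
    std::string out;
    auto it = m_segments.find(m_nextRxSeq);
    while (it != m_segments.end()) {
      out += it->second;
      m_nextRxSeq += static_cast<uint32_t>(it->second.size());
      m_segments.erase(it);
      it = m_segments.find(m_nextRxSeq);
    }
    return out;
  }

 private:
  uint32_t m_nextRxSeq = 0;
  std::map<uint32_t, std::string> m_segments;
};
```

A segment that arrives ahead of a gap stays in the buffer, and Extract() returns nothing until the missing segment is added.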

The TcpTxBuffer is the transmit (Tx) buffer for TCP. The upper-layer application sends data to TcpSocketBase, and the data is then appended to the TcpTxBuffer by the TcpTxBuffer::Add() call. As with TcpRxBuffer, the maximum buffer size can be set by TcpTxBuffer::SetMaxBufferSize(). Appending data fails if the buffer would then hold more data than its maximum size. Because appending to TcpTxBuffer is supposed to be sequential, without overlap, the TcpTxBuffer::Add() call merely puts the data at the end of a list. TcpTxBuffer, however, supports extracting data from anywhere in the buffer. This is done by the TcpTxBuffer::CopyFromSeq() call.
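A minimal sketch of the same idea for the transmit side follows. ToyTxBuffer is a hypothetical stand-in that stores bytes in a std::string rather than a list of packets, and the argument order of CopyFromSeq() and the default size are assumptions of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Toy transmit buffer: sequential append, random-access copy.
class ToyTxBuffer {
 public:
  // firstSeq is the sequence number of the first byte ever appended.
  explicit ToyTxBuffer(uint32_t firstSeq) : m_firstSeq(firstSeq) {}

  void SetMaxBufferSize(std::size_t n) { m_maxSize = n; }

  // Append at the tail; fail if the buffer would exceed its maximum size.
  bool Add(const std::string& data) {
    if (m_data.size() + data.size() > m_maxSize) return false;
    m_data += data;
    return true;
  }

  // Copy up to numBytes starting at sequence number seq, without
  // removing them, so that they can be retransmitted later.
  std::string CopyFromSeq(std::size_t numBytes, uint32_t seq) const {
    assert(seq >= m_firstSeq);
    std::size_t offset = seq - m_firstSeq;
    if (offset >= m_data.size()) return std::string();
    return m_data.substr(offset, numBytes);
  }

 private:
  uint32_t m_firstSeq;
  std::size_t m_maxSize = 32768;  // arbitrary default for this sketch
  std::string m_data;
};
```

Keeping the data in place after a copy is what makes retransmission from an arbitrary sequence number possible.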


Although the API (specifically, the public functions) did not change from TcpSocketImpl to TcpSocketBase, the internal operations of the two classes are vastly different. The most important change is the breakdown of the TCP state machine into different functions. This design is in line with the TCP code in Linux. The change of state is done explicitly in various functions instead of by a lookup in the state transition table. Accordingly, in TcpSocketBase, the functions ProcessEvent, ProcessAction, and ProcessPacketAction are removed.

The following describes how TcpSocketBase interacts with the upper and lower layers:


The upper layer (application) can call Bind() in TcpSocketBase to bind a socket to an address/port. It allocates an end point (i.e. a mux/demux hook in TcpL4Protocol) and sets up callback functions by SetupCallback(). One of the most important callback functions is ForwardUp(), which is invoked when a packet is passed from the lower layers to this TCP socket.
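The mux/demux hook can be sketched as a table from port number to callback. ToyDemux and its member names are hypothetical, and keying on the local port alone is a simplification; a real end point would also carry addresses.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Toy demultiplexer standing in for the L4 protocol object.
class ToyDemux {
 public:
  using ForwardUpCallback = std::function<void(const std::string&)>;

  // Bind step: allocate an end point for the port and remember the
  // callback of the socket; fails if the port is already taken.
  bool Allocate(uint16_t port, ForwardUpCallback cb) {
    return m_endPoints.emplace(port, std::move(cb)).second;
  }

  // Deliver an incoming payload to the socket bound to the port, if any.
  bool Receive(uint16_t port, const std::string& payload) const {
    auto it = m_endPoints.find(port);
    if (it == m_endPoints.end()) return false;
    it->second(payload);
    return true;
  }

 private:
  std::map<uint16_t, ForwardUpCallback> m_endPoints;
};
```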


The upper layer (application) initiates a connection by calling Connect(). It configures the end point to specify the peer's address, sends a SYN packet to initiate the three-way handshake, and moves to the SYN_SENT state. The exception is when this socket already has a connection, in which case a RST packet is sent and the connection is torn down. Such state checking is done in DoConnect().


Instead of actively starting a connection, the application can also wait for an incoming connection by calling Listen(). All it does is move the socket from the CLOSED state to the LISTEN state. If the socket is not in the CLOSED state, an error is reported.
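The state checks in Connect() and Listen() can be sketched as explicit state changes. The struct below is hypothetical: it treats only ESTABLISHED as "already has a connection", counts packets instead of sending them, and uses -1 as an assumed error code.

```cpp
#include <cassert>

enum class State { CLOSED, LISTEN, SYN_SENT, ESTABLISHED };

// Toy socket that records what it would have sent.
struct ToySocket {
  State state = State::CLOSED;
  int synSent = 0;
  int rstSent = 0;

  // Passive open: only legal from CLOSED.
  int Listen() {
    if (state != State::CLOSED) return -1;
    state = State::LISTEN;
    return 0;
  }

  // Active open: send a SYN and move to SYN_SENT, unless a connection
  // already exists, in which case a RST is sent and the socket closes.
  int Connect() {
    if (state == State::ESTABLISHED) {
      ++rstSent;
      state = State::CLOSED;
      return -1;
    }
    ++synSent;
    state = State::SYN_SENT;
    return 0;
  }
};
```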


When the application decides to close the connection, it calls Close(). This function checks whether the close can be done immediately, in which case DoClose() is called. If not, it sets m_closeOnEmpty so that the close is withheld until all data are transmitted. The close, either by way of DoClose() or by the packet-sending routines with m_closeOnEmpty set, sends a FIN packet to the peer.
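The deferred close can be sketched as follows. ToyCloser, pendingBytes and OnDataSent() are hypothetical names; only m_closeOnEmpty and DoClose() come from the text, and clearing the flag after the FIN is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of Close() with data still waiting in the Tx buffer.
struct ToyCloser {
  std::size_t pendingBytes = 0;  // unsent bytes in the Tx buffer
  bool m_closeOnEmpty = false;
  int finSent = 0;

  // Immediate close: send the FIN now.
  void DoClose() { ++finSent; }

  // Close immediately if nothing is pending, otherwise defer.
  void Close() {
    if (pendingBytes == 0) {
      DoClose();
    } else {
      m_closeOnEmpty = true;
    }
  }

  // Called by the sending routine after n bytes have left the buffer;
  // the FIN goes out once the buffer has drained.
  void OnDataSent(std::size_t n) {
    pendingBytes -= (n < pendingBytes ? n : pendingBytes);
    if (pendingBytes == 0 && m_closeOnEmpty) {
      m_closeOnEmpty = false;
      DoClose();
    }
  }
};
```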

Application send data

Data is sent by the function Send(). The function SendTo(), which allows an address to be specified as a parameter, is identical to Send(). It stores the supplied data into m_txBuffer and calls SendPendingData() if the state allows transmission of data. SendPendingData() is basically a loop that sends as much data as possible within the limit of the sending window. It extracts data from m_txBuffer and packages the outgoing packet with a TCP header. Once the packet is ready, it passes the packet to m_tcp by invoking TcpL4Protocol::SendPacket().
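The window-limited loop can be sketched as a free function. The signature, the Segment struct and the use of a std::string as the buffer are assumptions made for illustration; headers, timers and the real window computation are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// A segment as it would be handed to the lower layer.
struct Segment {
  uint32_t seq;
  std::string payload;
};

// Send as many segments as the window allows. txData holds the unsent
// bytes, nextSeq is the sequence number of its first byte, and window
// is the number of bytes that may still be put in flight.
std::vector<Segment> SendPendingData(std::string& txData, uint32_t& nextSeq,
                                     uint32_t window, uint32_t segmentSize) {
  assert(segmentSize > 0);
  std::vector<Segment> sent;
  while (!txData.empty() && window > 0) {
    std::size_t n = std::min<std::size_t>(segmentSize, window);
    n = std::min(n, txData.size());
    sent.push_back(Segment{nextSeq, txData.substr(0, n)});
    txData.erase(0, n);
    nextSeq += static_cast<uint32_t>(n);
    window -= static_cast<uint32_t>(n);
  }
  return sent;
}
```

With 1300 bytes queued, a window of 1072 bytes and 536-byte segments, the loop emits two segments and leaves 228 bytes in the buffer for a later call.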

Application receive data

The application can call Recv() or RecvFrom() to extract data from the TCP socket. RecvFrom() is identical to Recv() except that it also returns the remote peer's address. Recv() extracts data from m_rxBuffer and returns it.

Lower layer forward an incoming packet to socket

When the lower layer (e.g. Ipv4L3Protocol) receives a TCP packet, it passes the packet on to TcpL4Protocol, which forwards it to the socket if it matches the fingerprint in the socket's endpoint. The forwarding function is a callback function in the endpoint; in TcpSocketBase, it is ForwardUp(). This function performs three tasks: (1) invoke RTT calculation if the incoming packet has ACK set, (2) adjust the Rx window size to the value reported by the peer, and (3) based on the current state, invoke the corresponding processing function to handle the incoming packet. The last task is implemented as a switch structure with each state handled independently.

There is essentially one process function for each state, mimicking the behaviour of Linux. For example, in tcp_input.c of the Linux kernel, the function tcp_rcv_established() handles all incoming packets when the TCP socket is in the ESTABLISHED state. In TcpSocketBase, this role is played by ProcessEstablished(). The similar functions are ProcessListen(), ProcessSynSent(), ProcessSynRcvd(), ProcessWait(), ProcessClosing() and ProcessLastAck(). The function ProcessWait() is responsible for the states CLOSE_WAIT, FIN_WAIT_1, and FIN_WAIT_2. The rest are self-explanatory.
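The mapping from state to process function can be written out as a switch. The sketch below only returns the name of the handler instead of calling it, and HandlerFor() is a hypothetical helper. The text does not say what happens in CLOSED and TIME_WAIT, so the sketch returns an empty name for them.

```cpp
#include <cassert>
#include <string>

enum class State {
  CLOSED, LISTEN, SYN_SENT, SYN_RCVD, ESTABLISHED, CLOSE_WAIT,
  LAST_ACK, FIN_WAIT_1, FIN_WAIT_2, CLOSING, TIME_WAIT
};

// Name of the process function that the per-state switch would pick.
std::string HandlerFor(State s) {
  switch (s) {
    case State::ESTABLISHED: return "ProcessEstablished";
    case State::LISTEN:      return "ProcessListen";
    case State::SYN_SENT:    return "ProcessSynSent";
    case State::SYN_RCVD:    return "ProcessSynRcvd";
    case State::CLOSE_WAIT:
    case State::FIN_WAIT_1:
    case State::FIN_WAIT_2:  return "ProcessWait";
    case State::CLOSING:     return "ProcessClosing";
    case State::LAST_ACK:    return "ProcessLastAck";
    case State::CLOSED:
    case State::TIME_WAIT:   return "";  // not covered by the description
  }
  return "";
}
```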

Difference in Behaviour

From the packet trace, there can be two behavioural differences between TcpSocketImpl and TcpSocketBase.

First, when a socket moves from the SYN_RCVD state to ESTABLISHED, TcpSocketBase sets m_delAckCount to m_delAckMaxCount so that the first incoming data is acknowledged immediately. This is not done in TcpSocketImpl. Thus, in TcpSocketImpl, there must be a delayed ACK timeout between the sender's first and second data packets. The following is an excerpt from the reference trace showing that the delayed ACK timeout blocks the sender from sending the second data packet:

r 0.0602016 /NodeList/2/DeviceList/0/$ns3::PointToPointNetDevice/MacRx ns3::Ipv4Header (tos 0x0 ttl 63 id 1 protocol 6 offset 0 flags [none] length: 40 > ns3::TcpHeader (49153 > 50000 [ ACK ] Seq=1 Ack=1 Win=65535)
r 0.0610928 /NodeList/2/DeviceList/0/$ns3::PointToPointNetDevice/MacRx ns3::Ipv4Header (tos 0x0 ttl 63 id 2 protocol 6 offset 0 flags [none] length: 576 > ns3::TcpHeader (49153 > 50000 [ ACK ] Seq=1 Ack=1 Win=65535) Payload Fragment [0:536]
+ 0.261093 /NodeList/2/DeviceList/0/$ns3::PointToPointNetDevice/TxQueue/Enqueue ns3::PppHeader (Point-to-Point Protocol: IP (0x0021)) ns3::Ipv4Header (tos 0x0 ttl 64 id 1 protocol 6 offset 0 flags [none] length: 40 > ns3::TcpHeader (50000 > 49153 [ ACK ] Seq=1 Ack=537 Win=65535)
- 0.261093 /NodeList/2/DeviceList/0/$ns3::PointToPointNetDevice/TxQueue/Dequeue ns3::PppHeader (Point-to-Point Protocol: IP (0x0021)) ns3::Ipv4Header (tos 0x0 ttl 64 id 1 protocol 6 offset 0 flags [none] length: 40 > ns3::TcpHeader (50000 > 49153 [ ACK ] Seq=1 Ack=537 Win=65535)
r 0.271126 /NodeList/1/DeviceList/1/$ns3::PointToPointNetDevice/MacRx ns3::Ipv4Header (tos 0x0 ttl 64 id 1 protocol 6 offset 0 flags [none] length: 40 > ns3::TcpHeader (50000 > 49153 [ ACK ] Seq=1 Ack=537 Win=65535)

In the above, at time 0.0610928, the first data packet arrives at the receiver node (node 2). Not until time 0.261093, i.e. 0.2 second later, is the ACK sent. In the meantime, nothing is sent from the sender because its sending window is only one packet wide and the outstanding packet is not yet acknowledged.
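The effect of presetting the counter can be seen in a toy model of the delayed ACK logic. ToyDelayedAck, PrimeForFirstData() and OnDataReceived() are hypothetical names, the limit of two segments per ACK is an assumption, and the timer itself is not modelled: a return value of false simply means that the ACK waits for the timeout.

```cpp
#include <cassert>
#include <cstdint>

// Toy delayed ACK counter.
struct ToyDelayedAck {
  uint32_t m_delAckCount = 0;
  uint32_t m_delAckMaxCount = 2;  // assumed: one ACK per two segments

  // What TcpSocketBase is described as doing on entering ESTABLISHED.
  void PrimeForFirstData() { m_delAckCount = m_delAckMaxCount; }

  // Returns true if an ACK is sent immediately, false if the ACK is
  // left to the delayed ACK timer.
  bool OnDataReceived() {
    if (++m_delAckCount >= m_delAckMaxCount) {
      m_delAckCount = 0;
      return true;
    }
    return false;
  }
};
```

Without priming, the first data segment is not acknowledged until the timer fires, which corresponds to the 0.2 second gap in the trace. With priming, the first segment is acknowledged at once and later segments fall back to the usual counting.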

The second difference is at the termination. In TcpSocketImpl, if a FIN is piggybacked on a data packet, two back-to-back ACK packets are sent in response, one for the data and one for the FIN. In TcpSocketBase, this is avoided by sending only one ACK, for the FIN, because the FIN's sequence number is one plus the last data byte's sequence number. The following excerpt from shows the behaviour of TcpSocketImpl:

r 2.35371 /NodeList/2/DeviceList/0/$ns3::PointToPointNetDevice/MacRx ns3::Ipv4Header (tos 0x0 ttl 63 id 3733 protocol 6 offset 0 flags [none] length: 224 > ns3::TcpHeader (49153 > 50000 [ FIN  ACK ] Seq=1999817 Ack=1 Win=65535) Payload Fragment [784:888] Payload (size=80)
+ 2.35371 /NodeList/2/DeviceList/0/$ns3::PointToPointNetDevice/TxQueue/Enqueue ns3::PppHeader (Point-to-Point Protocol: IP (0x0021)) ns3::Ipv4Header (tos 0x0 ttl 64 id 1867 protocol 6 offset 0 flags [none] length: 40 > ns3::TcpHeader (50000 > 49153 [ ACK ] Seq=1 Ack=2000001 Win=65535)
- 2.35371 /NodeList/2/DeviceList/0/$ns3::PointToPointNetDevice/TxQueue/Dequeue ns3::PppHeader (Point-to-Point Protocol: IP (0x0021)) ns3::Ipv4Header (tos 0x0 ttl 64 id 1867 protocol 6 offset 0 flags [none] length: 40 > ns3::TcpHeader (50000 > 49153 [ ACK ] Seq=1 Ack=2000001 Win=65535)
+ 2.35371 /NodeList/2/DeviceList/0/$ns3::PointToPointNetDevice/TxQueue/Enqueue ns3::PppHeader (Point-to-Point Protocol: IP (0x0021)) ns3::Ipv4Header (tos 0x0 ttl 64 id 1868 protocol 6 offset 0 flags [none] length: 40 > ns3::TcpHeader (50000 > 49153 [ ACK ] Seq=1 Ack=2000002 Win=65535)
- 2.35374 /NodeList/2/DeviceList/0/$ns3::PointToPointNetDevice/TxQueue/Dequeue ns3::PppHeader (Point-to-Point Protocol: IP (0x0021)) ns3::Ipv4Header (tos 0x0 ttl 64 id 1868 protocol 6 offset 0 flags [none] length: 40 > ns3::TcpHeader (50000 > 49153 [ ACK ] Seq=1 Ack=2000002 Win=65535)
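The two acknowledgment numbers in this excerpt can be checked by hand. The IP length of the FIN packet is 224 bytes; assuming a 20-byte IPv4 header and a 20-byte TCP header (consistent with the 40-byte pure ACKs in the same trace), the payload is 184 bytes, which matches the 104-byte fragment plus the 80-byte payload. The helper AckFor() below is a hypothetical name used only for this arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// ACK number that acknowledges a segment: every payload byte consumes
// one sequence number, and so does the FIN flag.
uint32_t AckFor(uint32_t seq, uint32_t payloadBytes, bool fin) {
  return seq + payloadBytes + (fin ? 1u : 0u);
}
```

With Seq=1999817 and 184 payload bytes, the data alone is acknowledged by Ack=2000001 and the data plus FIN by Ack=2000002, which are the two values seen above. The second ACK covers everything the first one does, so the first is redundant.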

In the current repository, one can search for the keyword "old NS-3" in for these two differences. The current code in the repository is made to behave exactly the same as TcpSocketImpl so that it can pass the regression test.

Pluggable Congestion Control in Linux TCP

The next step would be to port the pluggable congestion control from Linux to NS-3. The ideal outcome would be a converter that takes Linux source code as input and produces an NS-3 module for each TCP congestion control variant. In include/net/tcp.h of the Linux kernel source, the following structure is defined:

 struct tcp_congestion_ops {
       struct list_head        list;
       unsigned long flags;
       /* initialize private data (optional) */
       void (*init)(struct sock *sk);
       /* cleanup private data  (optional) */
       void (*release)(struct sock *sk);
       /* return slow start threshold (required) */
       u32 (*ssthresh)(struct sock *sk);
       /* lower bound for congestion window (optional) */
       u32 (*min_cwnd)(const struct sock *sk);
       /* do new cwnd calculation (required) */
       void (*cong_avoid)(struct sock *sk, u32 ack, u32 in_flight);
       /* call before changing ca_state (optional) */
       void (*set_state)(struct sock *sk, u8 new_state);
       /* call when cwnd event occurs (optional) */
       void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev);
       /* new value of cwnd after loss (optional) */
       u32  (*undo_cwnd)(struct sock *sk);
       /* hook for packet ack accounting (optional) */
       void (*pkts_acked)(struct sock *sk, u32 num_acked, s32 rtt_us);
       /* get info for inet_diag (optional) */
       void (*get_info)(struct sock *sk, u32 ext, struct sk_buff *skb);
       char            name[TCP_CA_NAME_MAX];
       struct module   *owner;
 };

A new congestion control for TCP would be a copy of this structure with at least the ssthresh and cong_avoid function pointers set. Because the variables used in Linux TCP are limited, the ideal way to port would be to find a one-to-one mapping between Linux TCP's variables and NS-3's variables. Besides the conversion, TcpSocketBase should also call these hooks, for example cong_avoid.
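How such hooks might be called from the socket can be sketched with a reduced operations table. Everything here is hypothetical: ToySocketState stands in for the Linux or NS-3 socket variables, the hook signatures are simplified, and the Reno-like rules (windows counted in segments, cwnd set to ssthresh on loss) are a simplification rather than a port of the Linux code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical per-socket variables, counted in segments.
struct ToySocketState {
  uint32_t cwnd = 1;       // congestion window
  uint32_t ssthresh = 64;  // slow start threshold
  uint32_t cwndCnt = 0;    // ACKs counted towards the next linear step
};

// Reduced operations table with only the two required hooks.
struct ToyCongestionOps {
  uint32_t (*ssthresh)(const ToySocketState&);
  void (*cong_avoid)(ToySocketState&);
};

// Halve the window on loss, but never go below two segments.
uint32_t RenoSsthresh(const ToySocketState& s) {
  return std::max<uint32_t>(s.cwnd / 2, 2);
}

// One segment per ACK in slow start, one per window afterwards.
void RenoCongAvoid(ToySocketState& s) {
  if (s.cwnd < s.ssthresh) {
    ++s.cwnd;
    return;
  }
  if (++s.cwndCnt >= s.cwnd) {
    s.cwndCnt = 0;
    ++s.cwnd;
  }
}

const ToyCongestionOps kReno = {RenoSsthresh, RenoCongAvoid};

// Where the socket would invoke the hooks.
void OnNewAck(ToySocketState& s, const ToyCongestionOps& ops) {
  ops.cong_avoid(s);
}

void OnLoss(ToySocketState& s, const ToyCongestionOps& ops) {
  s.ssthresh = ops.ssthresh(s);
  s.cwnd = s.ssthresh;
}
```

Swapping kReno for another table would change the congestion behaviour without touching OnNewAck() or OnLoss(), which is the point of the pluggable design.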