Bug 2150 - The TCP sender keeps retransmitting and does not terminate the connection after some retries.
Status: RESOLVED FIXED
Product: ns-3
Classification: Unclassified
Component: tcp
Version: ns-3.23
Hardware: PC Linux
Importance: P5 major
Assigned To: natale.patriciello
Reported: 2015-07-07 22:46 EDT by Mahdi
Modified: 2015-10-27 05:49 EDT (History)

Attachments
includes test case, modified point-to-point channel and the pcap files (123.28 KB, application/zip)
2015-07-07 22:46 EDT, Mahdi
Introduced DataRetries (11.07 KB, patch)
2015-08-04 09:42 EDT, natale.patriciello

Description Mahdi 2015-07-07 22:46:02 EDT
Created attachment 2083 [details]
includes test case, modified point-to-point channel and the pcap files

When a TCP sender does not receive the ack for the packet sent, it is trapped into a retransmission state indefinitely. This can happen for many reasons: for example, the receiver becomes unreachable or deliberately stops answering. The sender nevertheless keeps retransmitting the same packet forever. I ran the test for extremely long periods of time, and the sender never gave up and terminated the connection. Our original experiment includes a network discriminator that makes some flows suffer, but a simpler test case is described below.

We have a simple point-to-point topology:

n0 ----- n1

n0 is the sender and n1 is the receiver. By changing the TransmitStart() method in point-to-point-channel.cc, we disable the channel 3 seconds after the start of the experiment, so the nodes can no longer exchange packets. Once the link stops working, the sender never receives the ACK for its last packet and retransmits that packet indefinitely. The retransmitted packets can be observed in the resulting .pcap files. In the attached test case the simulation time is set to 30,000 seconds. It uses an on-off application for the sender, but the same behavior is observed with the bulk send application.

I would be grateful if you confirm that this is actually a bug.
Comment 1 natale.patriciello 2015-07-22 15:47:52 EDT
(In reply to Mahdi from comment #0)
> Created attachment 2083 [details]
> includes test case, modified point-to-point channel and the pcap files
> 
> When a TCP sender does not receive the ack for the packet sent, it is
> trapped into a retransmission state indefinitely.
[cut]
> I would be grateful if you confirm that this is actually a bug.

As I wrote in email, I can confirm this behavior. However, we have many ways to resolve it.

-> Modify the behavior of the attributes ConnCount or ConnTimeout, indicating that they apply also during the connection

-> Add a new attribute (e.g. TransmissionCount or TransmissionTimeout). If one of the two expires (max number of retries or time passed without any contact) the connection should be terminated.

What is the best way to terminate such a connection? Send a reset and notify the upper layer, doing for example:

  SendEmptyPacket (TcpHeader::RST);
  NotifyErrorClose ();
  DeallocateEndPoint ();

(which is done in SendRST ()) ?
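
To make the proposal concrete, here is a minimal, self-contained sketch of a retry budget that ends the connection once it is exhausted. It is not ns-3 code: SketchSender, OnRetransmitTimeout, OnNewAck and the logged strings are hypothetical names used only for illustration, and the three termination steps are simply recorded in the order quoted above.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a TCP sender; NOT ns-3's TcpSocketBase.
class SketchSender
{
public:
  explicit SketchSender (uint32_t maxDataRetries)
    : m_maxDataRetries (maxDataRetries),
      m_dataRetriesLeft (maxDataRetries)
  {
  }

  // Called each time the retransmission timer fires without an ACK.
  void OnRetransmitTimeout ()
  {
    if (m_closed)
      {
        return; // nothing to do on a closed connection
      }
    if (m_dataRetriesLeft == 0)
      {
        // Budget exhausted: record the three steps proposed above.
        m_log.push_back ("SendEmptyPacket(RST)");
        m_log.push_back ("NotifyErrorClose");
        m_log.push_back ("DeallocateEndPoint");
        m_closed = true;
        return;
      }
    --m_dataRetriesLeft;
    m_log.push_back ("Retransmit");
  }

  // Called when an ACK for new data arrives: the peer is alive again,
  // so the retry budget is restored (an assumption of this sketch).
  void OnNewAck ()
  {
    m_dataRetriesLeft = m_maxDataRetries;
  }

  bool IsClosed () const { return m_closed; }
  const std::vector<std::string>& Log () const { return m_log; }

private:
  uint32_t m_maxDataRetries;
  uint32_t m_dataRetriesLeft;
  bool m_closed = false;
  std::vector<std::string> m_log;
};
```

With a budget of 3, the first three timeouts retransmit and the fourth one closes the connection; further timeouts are ignored.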
Comment 2 Tom Henderson 2015-08-03 02:00:52 EDT
(In reply to natale.patriciello from comment #1)
> (In reply to Mahdi from comment #0)
> > Created attachment 2083 [details]
> > includes test case, modified point-to-point channel and the pcap files
> > 
> > When a TCP sender does not receive the ack for the packet sent, it is
> > trapped into a retransmission state indefinitely.
> [cut]
> > I would be grateful if you confirm that this is actually a bug.
> 
> As I wrote in email, I can confirm this behavior. However, we have many ways
> to resolve it.
> 
> -> Modify the behavior of the attributes ConnCount or ConnTimeout,
> indicating that they apply also during the connection
> 
> -> Add a new attribute (e.g. TransmissionCount or TransmissionTimeout). If
> one of the two expires (max number of retries or time passed without any
> contact) the connection should be terminated.

I'd like to see two attributes supported:

- MaxDataRetransmissions
- MaxSynRetransmissions (for SYN failures; may be different than data failures)

I think most TCP implementations support this in some way.   I'm open to other name suggestions.

> 
> What is the best way to terminate such a connection? Send a reset and notify
> the upper layer, doing for example:
> 
>   SendEmptyPacket (TcpHeader::RST);
>   NotifyErrorClose ();
>   DeallocateEndPoint ();
> 
> (which is done in SendRST ()) ?

I think this can be checked as to how implementations handle this.  We may want to move to TIME_WAIT in some cases rather than just deallocate.
Comment 3 natale.patriciello 2015-08-03 13:25:16 EDT
(In reply to Tom Henderson from comment #2)
> I'd like to see two attributes supported:
> 
> - MaxDataRetransmissions
> - MaxSynRetransmissions (for SYN failures; may be different than data
> failures)
> 
> I think most TCP implementations support this in some way.   I'm open to
> other name suggestions.

Linux:
#define TCP_RETR1       3       /*
                                 * This is how many retries it does before it
                                 * tries to figure out if the gateway is
                                 * down. Minimal RFC value is 3; it corresponds
                                 * to ~3sec-8min depending on RTO.
                                 */

#define TCP_RETR2       15      /*
                                 * This should take at least
                                 * 90 minutes to time out.
                                 * RFC1122 says that the limit is 100 sec.
                                 * 15 is ~13-30min depending on RTO.
                                 */

#define TCP_SYN_RETRIES  6      /* This is how many retries are done
                                 * when active opening a connection.
                                 * RFC1122 says the minimum retry MUST
                                 * be at least 180secs.  Nevertheless
                                 * this value is corresponding to
                                 * 63secs of retransmission with the
                                 * current initial RTO.
                                 */

#define TCP_SYNACK_RETRIES 5    /* This is how may retries are done
                                 * when passive opening a connection.
                                 * This is corresponding to 31secs of
                                 * retransmission with the current
                                 * initial RTO.
                                 */

#define TCP_TIMEWAIT_LEN (60*HZ) /* how long to wait to destroy TIME-WAIT
                                  * state, about 60 seconds     */
#define TCP_FIN_TIMEOUT TCP_TIMEWAIT_LEN
                                 /* BSD style FIN_WAIT2 deadlock breaker.
                                  * It used to be 3min, new value is 60sec,
                                  * to combine FIN-WAIT-2 timeout with
                                  * TIME-WAIT timer.
                                  */

(I don't know why we also have a timer for opening a connection, ConnTimeout, which defaults to 3 seconds. What is its purpose? As it stands, this timer fires before we even reach the third SYN probe. We need to choose one of the two mechanisms, but the choice also affects NSC, since the attribute is declared in tcp-socket.cc.)
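
As a quick check of the 63 s and 31 s figures in the Linux comments above: they match the sum of the timeout intervals if one assumes an initial RTO of 1 second that doubles after every timeout with no upper cap (1+2+4+8+16+32 = 63 and 1+2+4+8+16 = 31). The helper below is only a sketch of that arithmetic; CumulativeBackoffSec is a hypothetical name, and both the 1 s initial RTO and the uncapped doubling are assumptions.

```cpp
#include <cstdint>

// Sum of the first `retries` timeout intervals, ASSUMING the RTO starts
// at initialRtoSec and doubles after every timeout, with no upper cap.
uint64_t CumulativeBackoffSec (uint32_t retries, uint64_t initialRtoSec)
{
  uint64_t total = 0;
  uint64_t rto = initialRtoSec;
  for (uint32_t i = 0; i < retries; ++i)
    {
      total += rto; // wait one full RTO before the next retry
      rto *= 2;     // exponential backoff
    }
  return total;
}
```

Calling CumulativeBackoffSec (6, 1) gives 63 and CumulativeBackoffSec (5, 1) gives 31, the values quoted for TCP_SYN_RETRIES and TCP_SYNACK_RETRIES.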

> 
> > 
> > What is the best way to terminate such a connection? Send a reset and notify
> > the upper layer, doing for example:
> > 
> >   SendEmptyPacket (TcpHeader::RST);
> >   NotifyErrorClose ();
> >   DeallocateEndPoint ();
> > 
> > (which is done in SendRST ()) ?
> 
> I think this can be checked as to how implementations handle this.  We may
> want to move to TIME_WAIT in some cases rather than just deallocate.

In Linux, after TCP_RETR1 retries, the stack checks whether the route is still valid. (In ns-3 this check is done in TcpL4Protocol for each packet, in TcpL4Protocol::SendPacketV{4,6}.)

After TCP_RETR2 retries, Linux closes the socket: the error is ETIMEDOUT and tcp_done is called (http://lxr.free-electrons.com/source/net/ipv4/tcp.c#L2987). From what I can see, no packet is sent; the socket is closed and nothing more happens (apart from some statistics being collected).
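
The two-threshold behaviour described above can be sketched as a small decision function. This is a simplified, count-based illustration, not Linux or ns-3 code: TimeoutAction and DecideOnTimeout are hypothetical names, the thresholds are copied from the header quoted earlier in this thread, and the use of ">=" at the boundaries is an assumption.

```cpp
#include <cstdint>

// Thresholds copied from the Linux header quoted earlier in this thread.
constexpr uint32_t kRetr1 = 3;  // TCP_RETR1
constexpr uint32_t kRetr2 = 15; // TCP_RETR2

// Hypothetical outcome of a retransmission timeout.
enum class TimeoutAction
{
  Retransmit,               // below the first threshold
  CheckRouteThenRetransmit, // at or above kRetr1: verify the route first
  CloseWithTimedOut         // at or above kRetr2: close, error ETIMEDOUT
};

// Decide what to do when the retransmission timer fires, given how many
// retransmissions of the current segment have already been made.
// The ">=" comparisons are an assumption of this sketch.
TimeoutAction DecideOnTimeout (uint32_t retransmitsSoFar)
{
  if (retransmitsSoFar >= kRetr2)
    {
      return TimeoutAction::CloseWithTimedOut;
    }
  if (retransmitsSoFar >= kRetr1)
    {
      return TimeoutAction::CheckRouteThenRetransmit;
    }
  return TimeoutAction::Retransmit;
}
```

Note that the close branch sends nothing, which matches the observation that Linux closes the socket without emitting a packet.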

My proposal:

-> Remove ConnTimeout
-> Rename ConnCount to SynRetries (same semantics as before)
-> Add SynAckCount (how many times we retry when passively opening a connection)
-> Add DataRetries (which is your MaxDataRetransmissions)
-> Rename MaxSegLifetime to TimeWaitTimeout (? To me the name TimeWaitTimeout is clearer, but the RFC calls it MaxSegLifetime)
-> Add FinTimeout

All these attributes are defined in tcp-socket.cc and set/get through pure virtual methods. They will be used in tcp-socket-base.cc and only set in NSC (like the current ConnCount attribute there, which is set but not used anywhere).
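
The layering described above (accessors declared as pure virtual methods in the abstract socket, with one implementation that uses the value and one that only stores it) might look roughly like this. The class names below are hypothetical stand-ins, not the real ns-3 classes, and the default values are arbitrary placeholders.

```cpp
#include <cstdint>

// Hypothetical abstract socket: declares the accessors, stores nothing.
class SketchTcpSocket
{
public:
  virtual ~SketchTcpSocket () = default;
  virtual void SetDataRetries (uint32_t retries) = 0;
  virtual uint32_t GetDataRetries () const = 0;
};

// Implementation that stores the value and acts on it.
class SketchTcpSocketBase : public SketchTcpSocket
{
public:
  void SetDataRetries (uint32_t retries) override { m_dataRetries = retries; }
  uint32_t GetDataRetries () const override { return m_dataRetries; }

  // True once the number of timeouts has reached the configured limit.
  bool ShouldGiveUp (uint32_t timeoutsSoFar) const
  {
    return timeoutsSoFar >= m_dataRetries;
  }

private:
  uint32_t m_dataRetries = 6; // arbitrary placeholder default
};

// Implementation that only records the value and never consults it,
// mirroring how the NSC socket is described in this comment.
class SketchNscSocket : public SketchTcpSocket
{
public:
  void SetDataRetries (uint32_t retries) override { m_dataRetries = retries; }
  uint32_t GetDataRetries () const override { return m_dataRetries; }

private:
  uint32_t m_dataRetries = 6; // stored only; no behaviour depends on it
};
```

Code that configures sockets can then work through the abstract interface without knowing which implementation it is talking to.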

What's your opinion?
Comment 4 natale.patriciello 2015-08-04 09:42:27 EDT
Created attachment 2107 [details]
Introduced DataRetries

*) Introduced and used DataRetries
*) Renamed, inside the implementation, every reference to cnCount and cnRetries to synCount and synRetries, to distinguish them from dataCount and dataRetries.
Comment 5 natale.patriciello 2015-10-27 05:49:25 EDT
Fixed in 11711:902e19ae0511