Bug 1112 - TCP goes to dead-lock when both segment-retransmit and ACK-loss occurs
Status: RESOLVED FIXED
Product: ns-3
Classification: Unclassified
Component: tcp
Version: ns-3.11
Hardware: Other Linux
Importance: P5 normal
Assigned To: Adrian S.-W. Tam
Reported: 2011-04-25 03:48 EDT by Kirill Andreev
Modified: 2011-09-29 23:48 EDT (History)

Attachments
Example (with wscript), which reproduces this testcase (3.69 KB, application/x-gzip)
2011-04-25 03:48 EDT, Kirill Andreev

Description Kirill Andreev 2011-04-25 03:48:20 EDT
Created attachment 1096
Example (with wscript), which reproduces this testcase

I built a simple simulation with a chain topology, WiFi, and OLSR.
I limited the number of transmission attempts in WiFi and started a
TCP stream.

When a path (5-6 relay hops) with high delay and low reliability is used
for a TCP transfer, the following scenario may occur:

    A -----*-----*-----*-----*-----*----- B

A: The TCP retransmit timer expires, and A sets its window to one
   segment (tcp-newreno.cc, line 185).
B: At the same moment, B sends an ACK, which is lost.
A: Sends the single segment that the reduced window allows.
B: Drops this segment as OutOfRange (tcp-socket-base.cc, line 620).

After this, no data can be transferred, and there is no timer that can
break the deadlock.

Of course, this situation is very unusual for WiFi operation, but is this
normal TCP behavior or a bug?

I have written a simple example in which all socket callbacks are set (i.e.
normal and erroneous shutdown, and the DataIsSent callback), but no socket
error is reported for a long time.
Comment 1 Ryan Padilla 2011-08-29 15:49:37 EDT
I had a similar problem running TCP in ns-3.11.

I created an application that synchronized a 4-node topology using the Mesh Mac, so that each node sent new data to every other node in the network only after it had received data from all other nodes. This kept all the nodes on the same round of data: everyone sent round 1, then round 2, and so on. When contention became high, all the nodes simply stopped sending. I logged everything and created a packet trace, and I believe I found the problem (in fact, I found two bugs).

Bug 1:

At 5.40875 sec. packet with sequence number 13 and size 13 is sent.

The packet is received and ACKed at 5.41042

The ACK is lost

At 5.48522 new data arrives and is sent with sequence number 97 and size 26.

The packet is received at 5.49065, but is discarded because it's outside the receiving window.

In the current implementation, no future data will ever be received, because it will always be outside the receiving window.

The TCP implementation should remove the already received data from the packet and accept the new data.

At line 621 of tcp-socket-base.cc I inserted the following to fix the problem:

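// The segment ends beyond the next expected byte, so it carries some new data.
if (tcpHeader.GetSequenceNumber () + packet->GetSize () > m_rxBuffer.NextRxSequence ())
    {
      // Number of bytes at the end of the packet that have not been received yet.
      uint32_t goodData = tcpHeader.GetSequenceNumber () + packet->GetSize () - m_rxBuffer.NextRxSequence ();
      // Strip the already received prefix and realign the sequence number.
      packet->RemoveAtStart (packet->GetSize () - goodData);
      tcpHeader.SetSequenceNumber (m_rxBuffer.NextRxSequence ());
    }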
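
For readers without an ns-3 tree at hand, the trimming idea can be sketched in isolation. The following is a minimal, self-contained illustration, not ns-3 code: the `Segment` struct and the `TrimAlreadyReceived` function are hypothetical names, sequence-number wraparound is ignored, and the values in the usage note are illustrative rather than taken from the trace.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a received TCP segment; not an ns-3 type.
struct Segment
{
  uint32_t seq;                 // sequence number of the first payload byte
  std::vector<uint8_t> payload; // payload bytes
};

// Removes the bytes of `seg` that lie before `nextRx` (data that was
// already delivered) and realigns the sequence number.
// Returns true if the segment still carries new data afterwards.
// Assumption: sequence-number wraparound is ignored for brevity.
bool TrimAlreadyReceived (Segment &seg, uint32_t nextRx)
{
  uint64_t end = static_cast<uint64_t> (seg.seq) + seg.payload.size ();
  if (end <= nextRx)
    {
      return false; // pure duplicate: nothing new in this segment
    }
  if (seg.seq < nextRx)
    {
      std::size_t stale = nextRx - seg.seq; // bytes already received
      seg.payload.erase (seg.payload.begin (), seg.payload.begin () + stale);
      seg.seq = nextRx;
    }
  return true;
}
```

For example, a segment with sequence number 13 and 26 bytes of payload, checked against a next expected byte of 26, would keep only its last 13 bytes and be renumbered to 26.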

Bug 2:

During a normal transmission in ns-3.11, the m_nextTxSequence variable is incremented by the size of the packet being sent.  When a timeout occurs, m_nextTxSequence must be greater than m_txBuffer.HeadSequence () (which is set to the first byte of unACKed data); otherwise it is assumed that the data has already been received and ACKed.  However, during a retransmission m_nextTxSequence is not incremented again by the size of the packet being sent.  If a second timeout occurs, it therefore appears as though the packet has been received and ACKed, and the packet is not retransmitted.  As a result, a packet is retransmitted only after the first timeout.  At the second timeout, the connection essentially dies if the ACK has not been received.

To fix the above problem, at line 1643 of tcp-socket-base.cc I inserted:

  m_nextTxSequence += p->GetSize ();
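
The effect of this one-line change can be illustrated with a toy model of the sender state. This is a sketch under stated assumptions, not the ns-3 implementation: `ToySender` and its members are hypothetical, and the step that rewinds the next-to-send counter to the first unACKed byte on retransmission is assumed from the behaviour described above.

```cpp
#include <cstdint>

// Toy model of the sender state described above; not ns-3 code.
struct ToySender
{
  uint32_t headSeq;          // first unACKed byte (m_txBuffer.HeadSequence () in the report)
  uint32_t nextTxSeq;        // next byte to send (m_nextTxSequence in the report)
  uint32_t segmentSize;      // size of the retransmitted segment
  bool advanceOnRetransmit;  // true models the proposed fix

  // Returns true if the timeout actually triggered a retransmission.
  bool OnRetransmitTimeout ()
  {
    if (nextTxSeq <= headSeq)
      {
        return false; // nothing appears to be outstanding, so nothing is resent
      }
    nextTxSeq = headSeq; // assumed: retransmission restarts from the first unACKed byte
    if (advanceOnRetransmit)
      {
        nextTxSeq += segmentSize; // the proposed fix: count the resent bytes again
      }
    return true;
  }
};
```

With `advanceOnRetransmit` set to false, the first timeout retransmits and the second one does nothing, which matches the hang described above. With it set to true, every timeout retransmits for as long as the data remains unACKed.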

If there is something I missed in trying to fix this problem, please let me know. I applied the changes I suggested above and everything works fine now.
Comment 2 Ryan Padilla 2011-08-30 16:06:32 EDT
I suppose I should have tested this before I posted my previous comment, but I have now also run the script from Kirill Andreev, and it runs to completion without dropped segments.

Should I change the status of this bug to verified or resolved?  Would anyone be able to check this?

This is my first time working with bug reports, so any help would be appreciated.

Thanks.
Comment 3 Kirill Andreev 2011-08-31 08:25:00 EDT
(In reply to comment #2)
> I suppose I should have tested this before I posted my previous comment, but I
> also have run the script from Kirill Andreev and it runs to completion without
> dropped segments.
> 
> Should I change the status of this bug to verified or resolved?  Would anyone
> be able to check this?
> 
> This is my first time working with bug reports, so any help would be
> appreciated.
> 
> Thanks.

The best way is to write a test case that fails without the fix and passes with it, and to add it to the NS tests.
Comment 4 Ryan Padilla 2011-09-01 16:55:06 EDT
I'm not sure how to create a good test case for this because, without directly changing the TCP code, I can't directly test the problem.  I can only create a situation where the problem is likely to appear, not guaranteed to appear.

The two things that need to happen for the two bugs to occur are:

* a lost ACK, to which the sender responds by sending more data with the same sequence number (and potentially a larger packet size because more data has arrived)

* two back-to-back timeouts, which cause the connection to hang because sequence numbers aren't updated correctly in the buffer during retransmissions

I am able to verify the bugs right now by running scenarios where contention is high, enabling TcpSocketBase logging, and looking through the logs to find which connections died (if any) and why.

I can only create high contention among nodes; I cannot guarantee or force ACK loss to occur. At least, I don't know how to do it right now.  Any suggestions would be appreciated.
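
One possible direction for a deterministic test is to replace random contention with scripted loss, so that a chosen packet (for example, a specific ACK) is always dropped. The following is a minimal sketch of that idea only: `ScriptedLoss` is a hypothetical helper, not an ns-3 class, and how such a filter would be hooked into the receive path of a device is not shown here and would need to be checked against the ns-3 version in use.

```cpp
#include <cstdint>
#include <set>
#include <utility>

// Hypothetical helper, not an ns-3 class: decides per packet whether to
// drop it, based on the order in which packets are seen (0-based index).
class ScriptedLoss
{
public:
  explicit ScriptedLoss (std::set<uint32_t> dropIndices)
    : m_drop (std::move (dropIndices)),
      m_count (0)
  {
  }

  // Call once per packet seen. Returns true if this packet should be dropped.
  bool ShouldDrop ()
  {
    return m_drop.count (m_count++) > 0;
  }

private:
  std::set<uint32_t> m_drop; // indices of the packets to drop
  uint32_t m_count;          // number of packets seen so far
};
```

A test built this way would drop the same packet on every run, so it could be expected to fail without the fix and pass with it, instead of depending on contention to produce the loss.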
Comment 5 Adrian S.-W. Tam 2011-09-29 23:48:42 EDT
This bug will be closed by the patch in bug 1274.