Bug 2141 - TCP data sent callback is not called for certain send events
TCP data sent callback is not called for certain send events
Status: RESOLVED FIXED
Product: ns-3
Classification: Unclassified
Component: tcp
ns-3.23
PC Linux
: P5 normal
Assigned To: natale.patriciello
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2015-06-19 08:43 EDT by Mark Keaton
Modified: 2015-12-17 06:40 EST (History)
3 users (show)

See Also:


Attachments
Packet capture file for the TCP flow exhibiting the data sent callback bug (129.70 KB, application/octet-stream)
2015-06-19 08:43 EDT, Mark Keaton
Details
Proposed patch for TCP data sent callback bug (677 bytes, patch)
2015-06-19 08:48 EDT, Mark Keaton
Details | Diff
Updated patch (737 bytes, patch)
2015-12-16 14:15 EST, natale.patriciello
Details | Diff
Proposed test case (3.66 KB, text/x-c++src)
2015-12-16 14:24 EST, natale.patriciello
Details
One-file patch for test case and bug correction (8.74 KB, patch)
2015-12-17 06:31 EST, natale.patriciello
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Mark Keaton 2015-06-19 08:43:01 EDT
Created attachment 2067 [details]
Packet capture file for the TCP flow exhibiting the data sent callback bug

I am observing a condition where a TcpNewReno socket does not report all of the data sent using the SetDataSentCallback() feature.  The callback works as it should almost all of the time, but we have observed one simulation run where one of these callbacks did not occur.

This issue is occurring with ns-3 3.23.

Here is our setup.  We have a custom NetDevice class that utilizes TCP sockets for tunneling data, and this class monitors the amount of data queued on the send side of each TCP socket.  It computes the amount of data queued by keeping track of the amount of data passed into the TCP socket Send() method minus the amount of data sent as reported via the SetDataSentCallback() callback.  Our NetDevice subclass limits how large this queue size can be, and stops sending if too much data is queued up.  In the simulation run in question, one of the NetDevices had delivered all of its data to the NetDevice at the other end of the TCP connection, yet the queue size measurement was stuck at a value greater than zero.  In fact, it was large enough that our NetDevice code stopped sending any more data into the TCP socket.

This bug occurs as part of a large and complex simulation, and has only ever been observed happening once.  Thus, I am unable to provide a simple test case that reproduces the bug.  However, the issue was repeatable when run on one of our simulation computers, and I was able to instrument the code and figure out what was happening.  What follows is a summary of what I found.

During the period of interest, the TCP m_txBuffer has the following user data in it (sequence numbers shown at the bottom):


    +------------+--------------+--------------+--------------+
    |  40 bytes  |  1052 bytes  |  1052 bytes  |  1052 bytes  |
    +------------+--------------+--------------+--------------+
     ^            ^              ^              ^              ^
   72697        72737          73789          74841          75893


Here is a summary of what happens at various times.  My instrumentation was not all-inclusive, so I do not have some of the exact event times, but all of the events that were captured are listed in the sequence that they occurred.  Note that the TCP connection is already established and has been sending packets for a short time before these events.  I am including the resulting m_txBuffer head and tail values, as well as the resulting TCP m_nextTxSeq and m_highTxMark values, after the event is over.  I also marked the transmissions that resulted in a "data sent callback" as "DSCB".


Time     Events                            Head   Tail   nextTxSeq  highTxMark
------   --------------------------------  -----  -----  ---------  ----------
3.0303   Queue 72697 (40 bytes)            67437  72737  69541      70593

3.03351  Recv ACK 69541,                   69541  72737  72697      72697
         Retrans 69541 (1052 bytes),
         Send 70593 (1052 bytes) DSCB,
         Send 71645 (1052 bytes) DSCB

3.03859  Queue 72737 (1052 bytes)          69541  73789  72697      72697

3.05121  Recv ACK 70593,                   70593  73789  73789      73789
         Send 72697 (1052 b) DSCB,
         Send 73749 (40 b) DSCB

3.05121  Queue 73789 (1052 bytes)          70593  74841  73789      73789

3.05628  Queue 74841 (1052 bytes)          70593  75893  73789      73789

3.?????  Retrans 70593 (1052 bytes)        70593  75893  73789      73789

3.?????  Retrans 70593 (1052 bytes)        70593  75893  71645      73789

3.?????  Recv ACK 72697,                   72697  75893  74801      74801
         Retrans 72697 (1052 bytes),
         Retrans 73749 (1052 bytes) ****

3.10437  Recv ACK 73789,                   73789  75893  75893      75893
         Send 74801 (1052 bytes) DSCB,
         Send 75853 (40 bytes) DSCB

3.?????  Retrans 73789 (1052 bytes)


In the event marked with "****", the packet is a retransmission of seq 73749 through 73788 (40 bytes), and the original transmission of seq 73789 through 74800 (1012 bytes).  However, the data sent callback is not called for the transmission of 1012 bytes.

Attached is a pcap file for the TCP flow.  In viewing this file in wireshark, I noticed that the time stamps are all a bit early compared to the times listed above.  I'm not sure why this is.  Packet number 181 in this trace is the packet that contains data that should be reported to the data sent callback.
Comment 1 Mark Keaton 2015-06-19 08:47:46 EDT
I have found the code that is causing this issue.  It is in the tcp-socket-base.cc file, in the TcpSocketBase::SendDataPacket() method.  On line 2068, the conditional test is not sufficient for cases where a packet sent includes both retransmission bytes and original transmission bytes.  Likewise, the number of bytes passed to the callback on line 2070 needs to be updated.

I have made a fix for this issue and tested is using the problematic simulation.  With this fix applied, the data sent callbacks are all occurring like they should be.

I am attaching a patch for this file.
Comment 2 Mark Keaton 2015-06-19 08:48:52 EDT
Created attachment 2068 [details]
Proposed patch for TCP data sent callback bug
Comment 3 Tom Henderson 2015-06-19 09:51:27 EDT
Thanks for a detailed bug report and the patch.  We should also create a test suite for this callback operation (over a lossy channel) before closing this bug.
Comment 4 natale.patriciello 2015-12-16 14:15:33 EST
Created attachment 2210 [details]
Updated patch

Hi,

I have investigated this bug and I don't understand why you use such a conditional statement. Wouldn't that be enough?

See my patch.

(Test case is coming)
Comment 5 natale.patriciello 2015-12-16 14:24:50 EST
Created attachment 2211 [details]
Proposed test case

This is the testcase: we run various transmission, and check if at the end the amount of data notified is the same than the amount of data transmitted at application level.

To simulate your environment, I used a TcpSocket that, the 50% of the time, saves in its buffer only the 50% of the packet, forcing on one packet the situation you described. Actual ns-3-dev fails; with any of our patches, it does not raise any error.
Comment 6 natale.patriciello 2015-12-17 06:31:20 EST
Created attachment 2212 [details]
One-file patch for test case and bug correction

I've checked. Your condition is the following:

(seq + sz > highTxMark && highTxMark >= seq)

I advocate that the latter part is not necessary, since highTxMark is always >= seq (or we have holes in the transmission.. For instance, if highTxMark < seq, it means that we have not transmitted the piece from highTxMark to seq).

I'm pushing these after checking that other tests does not fail, adding you as contributor. Thanks!
Comment 7 natale.patriciello 2015-12-17 06:40:39 EST
Fixed in 11789:fbe8b99cb680