Bug 3008

Summary:	assert error when sack is disabled
Product:	ns-3	Reporter:	RainSia <rainsia>
Component:	tcp	Assignee:	natale.patriciello
Status:	PATCH PENDING ---
Severity:	critical	CC:	ns-bugs, rainsia
Priority:	P5
Version:	ns-3.29
Hardware:	PC
OS:	Linux
Attachments:	the script to generate the error; pcap on node 68; pcap on hub node (for node 68)

Description RainSia 2018-11-10 01:56:08 EST

Created attachment 3206 [details]
the script to generate the error

I get this assertion error when I disable SACK in TCP (NewReno):

assert failed. cond="m_sentList.size () > 1", file=../src/internet/model/tcp-tx-buffer.cc, line=1343
terminate called without an active exception

The script which will raise the error is attached.

Comment 1 natale.patriciello 2018-11-10 09:27:36 EST

Hi!

What version are you using? Are you sure you are using 3.29? Because for me ns-3-dev does not have any error:

Reading symbols from /tmp/asd-nat/home/nat/Work/ns-3-dev-git/build/scratch/bug-3008...done.
(gdb) run
Starting program: /tmp/asd-nat/home/nat/Work/ns-3-dev-git/build/scratch/bug-3008 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/gcc/x86_64-pc-linux-gnu/8.2.1/../../../../lib/libthread_db.so.1".
[Inferior 1 (process 24035) exited normally]
(gdb)

Comment 2 RainSia 2018-11-10 20:47:26 EST

Sorry, I forgot to mention that you have to increase the number of spoke nodes to 200 via the command-line parameter.

Comment 3 natale.patriciello 2018-11-11 14:59:10 EST

Can you try the following patch?

I don't understand why this is happening. The node has 3 RTOs, all of them covered, and then a dupack arrives. The patch should fix the behavior, but I would like to investigate more.

The error also happens with 100 nodes; would you mind generating the pcap and seeing what is going on with node 68?

Thanks!

diff --git i/src/internet/model/tcp-tx-buffer.cc w/src/internet/model/tcp-tx-buffer.cc
index bb1c17e24..72b992e53 100644
--- i/src/internet/model/tcp-tx-buffer.cc
+++ w/src/internet/model/tcp-tx-buffer.cc
@@ -1340,6 +1340,13 @@ void
 TcpTxBuffer::AddRenoSack (void)
 {
   NS_LOG_FUNCTION (this);
+
+  if (m_sentList.size () <= 1)
+    {
+      NS_LOG_INFO ("Request to add a reno SACK, but the sent list is 1. Ignore the request.");
+      return;
+    }
+
   NS_ASSERT (m_sentList.size () > 1);
 
   m_renoSack = true;

Comment 4 RainSia 2018-11-11 22:29:04 EST

Created attachment 3211 [details]
pcap on node 68

Comment 5 RainSia 2018-11-11 22:29:35 EST

Created attachment 3212 [details]
pcap on hub node (for node 68)

Comment 6 RainSia 2018-11-11 22:46:04 EST

Yes, the patch does the trick and bypasses the assertion. But the strange TCP behavior still exists.

I investigated the pcap file on both node 68 and the hub node. I found a weird phenomenon in the network:

At node 68, there are four RTOs when the flow starts. This is expected, because the network is congested at that time. But after the fourth RTO, once node 68 had retransmitted the packet, the receiver asked for seq=5201, and node 68 did send that packet out through its net device; however, the hub node did not receive it at the other end (even before it was enqueued). It seems that the packet was magically dropped on the channel! Thus, a dupack is received, and the RTO occurs again. Moreover, this behavior repeats until the end of the flow, even after the network is no longer congested.

I don't know whether I'm right, but my question is: why was the packet lost on the channel? Shouldn't the packet be received by the net device at the other end of the channel?

Comment 7 RainSia 2018-11-12 04:01:42 EST

I was mistaken: the packet was received at the hub node, not dropped. But the behavior is still very strange. Is this because of the delayed ACK?

Comment 8 natale.patriciello 2018-11-12 04:34:16 EST

Can you take a look into the receiver node, and print out the expected Rx sequence? It seems like the sender is thinking that the receiver needs seq=5201, but the receiver needs some other (higher) sequence.

Comment 9 RainSia 2018-11-12 06:11:16 EST

I think it has to do with delayedAck. When I change the delayed ack count to 1 (the default is 2), all flows behave correctly.

For example, in the pcap file of node 68, the sender sends out the packet with seq=5201 while the cwnd is 1 (because of the RTO), and then waits for an ACK to increase the cwnd. However, the receiver, having received the packet, delays the ACK while waiting for another packet to arrive. Thus, the sender can only retransmit the packet when the RTO expires, and the window shrinks to 1 again. This repeats until the end of the flow.
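
To make the stall described above concrete, here is a minimal toy model of a single send round. It is not ns-3 code, and it ignores queueing, loss, and the details of duplicate ACKs; the names SimulateRound and RoundResult, and the RTT values used with it, are assumptions made for illustration only.

```cpp
#include <cstdint>

// Toy model (NOT ns-3 code): the outcome of one send round.
struct RoundResult
{
  bool rtoFiresFirst; // true if the sender times out before an ACK arrives
  uint32_t ackAtMs;   // time the ACK reaches the sender, counted from the send
};

// cwndSegments:    segments the sender may send in this round (1 after an RTO)
// delAckCount:     segments the receiver waits for before ACKing immediately
// delAckTimeoutMs: receiver's delayed-ACK timer
// rtoMs:           sender's retransmission timeout
// rttMs:           round-trip propagation time (assumed, illustrative)
RoundResult
SimulateRound (uint32_t cwndSegments, uint32_t delAckCount,
               uint32_t delAckTimeoutMs, uint32_t rtoMs, uint32_t rttMs)
{
  // The receiver ACKs at once only if enough segments arrive in the round;
  // otherwise it holds the ACK until its delayed-ACK timer expires.
  bool immediateAck = cwndSegments >= delAckCount;
  uint32_t ackAtMs = rttMs + (immediateAck ? 0 : delAckTimeoutMs);

  // If the ACK cannot arrive before the RTO, the sender retransmits and
  // its window collapses to one segment again, so the next round repeats.
  return RoundResult{ackAtMs >= rtoMs, ackAtMs};
}
```

With the values from this report (cwnd of 1, delayed ACK count of 2, both timers at 200 ms) and an assumed 10 ms RTT, SimulateRound (1, 2, 200, 200, 10) reports that the RTO fires first, while the same call with a delayed ACK count of 1 does not.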

The limited transmit attribute is set to true, but it does not help.

Do you think this is the problem?

Comment 10 natale.patriciello 2018-11-12 08:17:53 EST

Now I understand the problem. 

Your RTO is set to 200 ms, and your delayed ACK timeout is set to 200 ms as well. If you reduce the RTO, you have to reduce the ACK delay as well.
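
As a sketch of how the two timers could be kept consistent in a script, assuming the RTO was lowered through the MinRto attribute of ns3::TcpSocketBase (the attached script is not reproduced here, so this is a guess). ConfigureTcpTimers is a hypothetical helper name, and the 40 ms value in the usage note is only an illustrative choice.

```cpp
#include "ns3/core-module.h"
#include "ns3/internet-module.h"

using namespace ns3;

// Hypothetical helper (not part of ns-3): set both timers together so the
// delayed-ACK timer always expires before the sender's minimum RTO.
// Call it before any TCP socket is created.
void
ConfigureTcpTimers (Time minRto, Time delAckTimeout)
{
  NS_ABORT_MSG_IF (delAckTimeout >= minRto,
                   "Delayed ACK timeout must be smaller than the minimum RTO");

  Config::SetDefault ("ns3::TcpSocketBase::MinRto", TimeValue (minRto));
  Config::SetDefault ("ns3::TcpSocket::DelAckTimeout", TimeValue (delAckTimeout));
}
```

A call such as ConfigureTcpTimers (MilliSeconds (200), MilliSeconds (40)) would keep the 200 ms RTO from this report while letting a held ACK go out well before the sender times out.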

Comment 11 RainSia 2018-11-12 08:37:42 EST

OK, thank you very much! I understand the problem now.

By the way, I found that some releases of Linux have an option called TCP_QUICKACK, which can calculate and adjust the ATO (ACK timeout) from the RTO, the timestamp of the last packet, and the timestamp of the current packet. It's very interesting. Do you have any plans to support this feature in ns-3 in later releases?

Comment 12 natale.patriciello 2018-11-12 08:45:28 EST

No, we don't. But patches welcome :)

Anyway, I don't know how to solve this bug. Would you mind writing a patch for the documentation sharing your experience? I believe the most appropriate thing is to write a paragraph in doxygen explaining the relationship between the RTO and the delayed ACK timeout.

Comment 13 RainSia 2018-11-12 08:58:03 EST

OK. I will do some research into the Linux kernel to see how it calculates the ATO, and then submit a patch for the problem.