Bug 730 - Enabling fragmentation at run-time breaks simulation
Status: RESOLVED FIXED
Product: ns-3
Classification: Unclassified
Component: wifi
Version: ns-3-dev
Hardware: PC Linux
Importance: P3 normal
Assigned To: sebastien.deronne
Reported: 2009-10-22 09:11 EDT by Christian
Modified: 2015-08-15 07:19 EDT (History)

Attachments
Simulation script showing the bug (4.98 KB, text/plain), 2009-10-22 09:11 EDT, Christian
new sim program showing the bug (4.98 KB, text/x-c++src), 2010-04-21 11:40 EDT, Nicola Baldo
Simulation results (comment #11) (275.99 KB, text/plain), 2010-07-20 03:08 EDT, Christian
Updated simulation showing problem (5.14 KB, text/x-c++src), 2011-03-15 13:54 EDT, Josh Pelkey
Fragmentation fix (2.45 KB, patch), 2012-06-22 10:56 EDT, Brian Swenson
testcase to show the bug (5.52 KB, text/x-csrc), 2015-07-06 12:50 EDT, sebastien.deronne
proposed fix for bug 730 (3.52 KB, patch), 2015-07-06 13:18 EDT, sebastien.deronne
patch to add a unit test (6.78 KB, patch), 2015-07-30 11:56 EDT, sebastien.deronne

Description Christian 2009-10-22 09:11:09 EDT
Created attachment 638
Simulation script showing the bug

When turning fragmentation on _during the simulation_, the simulation breaks, in that the node with fragmentation turned on does not send packets anymore (still, it can receive packets).

To reproduce this bug, just set 'FragmentationThreshold' to a value lower than the packet size after the simulation has begun. That is:
  Simulator::Schedule (Seconds (10.0), Config::Set, "/NodeList/0/DeviceList/0/RemoteStationManager/FragmentationThreshold", StringValue ("800"));

The attached script reproduces this behavior.

Turning on/off other mechanisms at run-time works fine (e.g. setting RtsCtsThreshold).
Changing parameters like the physical data rate also works smoothly.
Comment 1 Mathieu Lacage 2010-01-07 09:50:34 EST
I am sorry but I do not see the behavior you are describing: the code appears to work well after the fragmentation has been enabled.
Comment 2 Christian 2010-01-08 14:40:53 EST
(In reply to comment #1)

I downloaded the latest dev and the behavior seems normal to me, too.
I will investigate and try to see what was wrong. In the meantime, I would suggest that this bug be closed.
Comment 3 Christian 2010-01-22 08:02:27 EST
(In reply to comment #1)
> I am sorry but I do not see the behavior you are describing: the code appears
> to work well after the fragmentation has been enabled.
I dug into the problem a bit more, and I found that the program behaves incorrectly when the number of nodes (nNodes in the code) is less than or equal to 3.

I could not find what causes the problem, though. As a side note, the routing procedure written in the code seems fine to me.
Comment 4 Nicola Baldo 2010-04-13 04:34:21 EDT
(In reply to comment #3)
> I dug into the problem a bit more, and I found that the program behaves
> incorrectly when the number of nodes (nNodes in the code) is less than or
> equal to 3.

I ran the program with nNodes = 3 and it still seems to work correctly. 
Closing the bug.
Comment 5 Christian 2010-04-16 08:53:22 EDT
(In reply to comment #4)
> I ran the program with nNodes = 3 and it still seems to work correctly. 
> Closing the bug.

Actually, with 3 nodes it seems to work correctly.

However, could you please try and set nNodes = 2, 4, or 5?
Using ns-3-dev (rev a6ee8748aee7) the script seems to break at 10 seconds (when nNodes=2 or nNodes=4) and at about 26 seconds (when nNodes=5).
Comment 6 Nicola Baldo 2010-04-21 11:40:07 EDT
Created attachment 842
new sim program showing the bug
Comment 7 Christian 2010-04-22 03:19:01 EDT
The bug seems related to the TCP variable DelAckCount.
Here's a matrix of tests I performed with different numbers of nodes and different DelAckCount values:
+--------------+----+----+----+----+ 
|      nNodes= |  2 |  3 |  4 |  5 | 
+--------------+----+----+----+----+ 
|DelAckCount=1 | 10 |  - | 10 | 26 | 
+--------------+----+----+----+----+ 
|DelAckCount=2 |  - | 10 |  - |  - | 
+--------------+----+----+----+----+ 
|DelAckCount=3 | 10 |  - |  - |  - | 
+--------------+----+----+----+----+ 
|DelAckCount=5 | 10 |  - | 10 |  - | 
+--------------+----+----+----+----+ 

Numbers in the table stand for "the second in which the simulation breaks".
No number means the simulation seems not to break.

I further investigated the case in which DelAckCount=1 and nNodes=2.
After fragmentation is enabled, when sending a TCP data packet, the sender seems to repeatedly enter the if-test at internet-stack/tcp-socket-impl.cc:416 (rev a6ee8748aee7), reproduced below for convenience:
if (p->GetSize() > GetTxAvailable ())
{
  m_errno = ERROR_MSGSIZE;
  return -1;
}

The sink stops sending new ACKs, which prevents the source from sending new packets (m_highestRxAck stays the same and so, accordingly, does the value of GetTxAvailable ()).
However, it is not clear to me why the sink stops sending ACKs.
Comment 8 Nicola Baldo 2010-04-22 06:46:36 EDT
Christian, thank you very much for your good work in identifying the cause of the problem. I am reassigning the bug to the internet-stack component, so that it can be addressed by the right people.
Comment 9 Tom Henderson 2010-05-09 23:37:26 EDT
reassigning to Josh (Tcp)
Comment 10 Josh Pelkey 2010-07-19 12:43:40 EDT
It seems like this is working now, using latest ns-3-dev (bd6bf8901c92).  I tried the simulation script provided by Nicola and used DelAck=1,2,3 and nNodes=3,4,5.  I couldn't get anything strange to happen (ie, it ran to the end and NodeList/2/Device/0 was still transmitting).  Could anyone confirm this is still an issue?

Note, I had to change the DataMode in the script for the remote station manager, since it wouldn't build otherwise.  I used DsssRate1Mbps.
Comment 11 Christian 2010-07-20 03:08:20 EDT
Created attachment 938
Simulation results (comment #11)
Comment 12 Christian 2010-07-20 03:08:44 EDT
(In reply to comment #10)
> It seems like this is working now, using latest ns-3-dev (bd6bf8901c92).  I
> tried the simulation script provided by Nicola and used DelAck=1,2,3 and
> nNodes=3,4,5.  I couldn't get anything strange to happen (ie, it ran to the end
> and NodeList/2/Device/0 was still transmitting).  Could anyone confirm this is
> still an issue?
I think this is still an issue. I ran a test using rev bd6bf8901c92 with nNodes=3, DelAck=2, DataMode=DsssRate1Mbps and it broke around 10 secs. I have attached the simulation result: is it the same as yours?

Also with DataMode=DsssRate11Mbps it breaks around 10 seconds.
Strangely, with nNodes=2, DelAck=1, DataMode=DsssRate11Mbps, trace sources are not hooked.

If you need more tests, don't hesitate to ask.
Comment 13 Josh Pelkey 2010-07-20 11:19:26 EDT
(In reply to comment #12)
> (In reply to comment #10)
> > It seems like this is working now, using latest ns-3-dev (bd6bf8901c92).  I
> > tried the simulation script provided by Nicola and used DelAck=1,2,3 and
> > nNodes=3,4,5.  I couldn't get anything strange to happen (ie, it ran to the end
> > and NodeList/2/Device/0 was still transmitting).  Could anyone confirm this is
> > still an issue?
> I think this is still an issue. I ran a test using rev bd6bf8901c92 with
> nNodes=3, DelAck=2, DataMode=DsssRate1Mbps and it broke around 10 secs. I have
> attached the simulation result: is it the same as yours?
> 
> Also with DataMode=DsssRate11Mbps it breaks around 10 seconds.
> Strangely, with nNodes=2, DelAck=1, DataMode=DsssRate11Mbps, trace sources are
> not hooked.
> 
> If you need more tests, don't hesitate to ask.

My results are different with nNodes=3 and DelAck=2.  Maybe you should post the exact test case you are using?  Again, I am using the one Nicola posted, with the slight modification of DsssRate1Mbps.  I tried to attach the simulation results, but the file was too big.  It looks like yours in the beginning, but doesn't have a huge jump from 10s to 51s.
Comment 14 Christian 2010-07-20 11:40:57 EDT
(In reply to comment #13)
> My results are different with nNodes=3 and DelAck=2.  Maybe you should post the
> exact test case you are using?  Again, I am using the one Nicola posted, with
> the slight modification of DsssRate1Mbps.  I tried to attach the simulation
> results, but the file was too big.  It looks like yours in the beginning, but
> doesn't have a huge jump from 10s to 51s.

Actually, I am also using the version Nicola posted.
I've only changed the rate and the DelAck number:
--- frag_runtime.cc	2010-07-20 17:27:42.000000000 +0200
+++ frag_runtime2.cc	2010-07-20 17:34:21.000000000 +0200
@@ -48,7 +48,7 @@
   WifiHelper wifi = WifiHelper::Default ();
   wifi.SetStandard (WIFI_PHY_STANDARD_80211b);
   wifi.SetRemoteStationManager ("ns3::ConstantRateWifiManager",
-      "DataMode", StringValue ("wifib-11mbs"));
+      "DataMode", StringValue ("DsssRate1Mbps"));
 
   NqosWifiMacHelper mac = NqosWifiMacHelper::Default ();
   Ssid ssid = Ssid ("test-ssid-test");
@@ -70,7 +70,7 @@
   InternetStackHelper internet;
   internet.InstallAll ();
 
-  Config::SetDefault ("ns3::TcpSocket::DelAckCount", UintegerValue (1));
+  Config::SetDefault ("ns3::TcpSocket::DelAckCount", UintegerValue (2));
   Config::SetDefault ("ns3::TcpSocket::SegmentSize", UintegerValue (1460));
 
   Ipv4AddressHelper addresses;

I confirm I still see the bug.

The system I am running the simulations on is a Linux machine (Debian), Intel(R) Core(TM)2 Duo CPU E7200 @ 2.53GHz.
Other information:
$ free
             total       used       free     shared    buffers     cached
Mem:       3362876    2860500     502376          0     126412    1769524
-/+ buffers/cache:     964564    2398312
Swap:      1951888        156    1951732

$ uname -a
Linux harakei 2.6.27.7-20081126 #1 SMP Wed Nov 26 16:30:50 CET 2008 i686 GNU/Linux

I can test this bug on an amd64 machine (not sooner than a couple of hours from now).
Comment 15 Josh Pelkey 2010-07-20 12:33:26 EDT
Odd, I'm doing the same:

diff scratch/fragmentation.cc scratch/frag_runtime.cc 

44d43
<   Config::SetDefault ("ns3::TcpSocket::DelAckCount", UintegerValue (2));
52c51
<         "DataMode", StringValue ("DsssRate1Mbps"));
---
>       "DataMode", StringValue ("wifib-11mbs"));

I am using a 64-bit machine.  Did you ./waf build in debug or optimised?  I am using debug.
Comment 16 Christian 2010-07-20 13:32:21 EDT
(In reply to comment #15)
> Odd, I'm doing the same:
Odd, indeed!

> I am using a 64-bit machine.  Did you ./waf build in debug or optimised?  I am
> using debug.
I am using "-d debug" as well.

I've just run the script (with DelAck=2,rate=DsssRate1Mbps) on my amd64 machine (Debian) with rev bd6bf8901c92 and I confirm the bug.
Comment 17 Josh Pelkey 2010-07-20 13:41:55 EDT
(In reply to comment #16)
> (In reply to comment #15)
> > Odd, I'm doing the same:
> Odd, indeed!
> 
> > I am using a 64-bit machine.  Did you ./waf build in debug or optimised?  I am
> > using debug.
> I am using "-d debug" as well.
> 
> I've just run the script (with DelAck=2,rate=DsssRate1Mbps) on my amd64 machine
> (Debian) with rev bd6bf8901c92 and I confirm the bug.

Ok, I figured it out.  My problem as usual :)  I set the DelAck at the top of the file, but failed to delete the one that was farther down (setting it back to one).  Now that it's failing, I'll start looking into the TCP code.
Comment 18 Christian 2010-07-20 14:19:31 EDT
(In reply to comment #17)
> Ok, I figured it out.  My problem as usual :)  I set the DelAck at the top of
> the file, but failed to delete the one that was farther down (setting it back
> to one).  Now that it's failing, I'll start looking into the TCP code.

"Glad" that we found the reason! I was starting to run out of machines to test the bug :P

Don't know if I can be of help, but in case, drop me a line.
Comment 19 Josh Pelkey 2010-07-20 15:55:11 EDT
(In reply to comment #17)
> (In reply to comment #16)
> > (In reply to comment #15)
> > > Odd, I'm doing the same:
> > Odd, indeed!
> > 
> > > I am using a 64-bit machine.  Did you ./waf build in debug or optimised?  I am
> > > using debug.
> > I am using "-d debug" as well.
> > 
> > I've just run the script (with DelAck=2,rate=DsssRate1Mbps) on my amd64 machine
> > (Debian) with rev bd6bf8901c92 and I confirm the bug.
> 
> Ok, I figured it out.  My problem as usual :)  I set the DelAck at the top of
> the file, but failed to delete the one that was farther down (setting it back
> to one).  Now that it's failing, I'll start looking into the TCP code.

While I haven't been able to figure out the problem exactly, it does look like Adrian's new TCP code (http://code.nsnam.org/adrian/ns-3-tcp/) fixes this problem.  We plan to merge this code, hopefully, in an ns-3.9.1 release.

So I guess my next step will just be to create a test case that shows this failing with the old code and hopefully will start passing once Adrian's code gets merged.  I will post a testcase here when I create one.
Comment 20 Josh Pelkey 2011-03-15 13:54:44 EDT
Created attachment 1042
Updated simulation showing problem

This still seems to be a problem. Here is an updated example to show the issue. Run with:

./waf --run 'scratch/frag_runtime --nNodes=3 --delAck=2'

Packets seem to stop flowing at 10 seconds, when the fragmentation is scheduled. Not sure if the above table still applies -- don't think so -- but will use this program for some more debugging.

Note:

./waf --run 'scratch/frag_runtime --nNodes=3 --delAck=1' seems to work, as does anything greater than delAck=3.
Comment 21 Josh Pelkey 2011-08-13 12:12:10 EDT
This is still an issue in the latest ns-3-dev (710d6ff1d227, Aug 11, 2011)

I've run some tests using the most recent simulation above. Results in table below (blank indicates no issue, number is approximately where the simulation seems to break -- it's more like 10.2272):

+--------------+----+----+----+----+----+----+----+----+ 
|      nNodes= |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 |
+--------------+----+----+----+----+----+----+----+----+
|DelAckCount=1 |  - |  - |  - |  - |  - |  - |  - | 10 |
+--------------+----+----+----+----+----+----+----+----+
|DelAckCount=2 | 10 | 10 |  - |  - |  - |  - |  - | 10 |
+--------------+----+----+----+----+----+----+----+----+
|DelAckCount=3 | 10 | 10 |  - | 10 |  - |  - | 10 |  - |
+--------------+----+----+----+----+----+----+----+----+
|DelAckCount=4 |  - |  - |  - |  - | 10 |  - |  - |  - |
+--------------+----+----+----+----+----+----+----+----+
|DelAckCount=5 |  - |  - |  - |  - |  - | 10 |  - |  - |
+--------------+----+----+----+----+----+----+----+----+

Note, I didn't do 2 nodes simply because the script is set up to monitor the third node's RxOk and RxError. I went ahead and changed it and ran with two nodes, and it seems to break on all DelAckCounts I tried. Also, if you change the time at which the fragmentation threshold is changed (for example, 15s), the simulation seems to break at that point (15s).

Viewing the full log, this is pretty much what you get over and over again, after the simulation seems to "break":

50.939s OnOffApplication:SendPacket()
50.939s OnOffApplication:SendPacket(): sending packet at +50938967997.0ns
50.939s Simulator:IsExpired(0x1295a50)
50.939s Buffer:Buffer(0x129a558, 1460)
50.939s Buffer:Initialize(0x129a558, 1460)
50.939s ByteTagList:ByteTagList(0x129a578)
50.939s Simulator:GetSystemId()
50.939s PacketMetadata:Create(): create size=10, max=10
50.939s PacketMetadata:Create(): create alloc size=10
50.939s PacketMetadata:DoAddHeader(0x129a590, 0, 1460)
50.939s Socket:Send()
50.939s 50.939 [node 0] TcpSocketBase:Send(0x1296d30, 0x129a550)
50.939s TcpTxBuffer:Add(0x1296f30, 0x129a550)
50.939s TcpTxBuffer:Add(): Packet of size 1460 appending to window starting at 400041, availSize=1132
50.939s TcpTxBuffer:Add(): Rejected. Not enough room to buffer packet.

So the TcpTxBuffer is filling up early and after the frag threshold change, for some reason, it doesn't get emptied. The last "SendPendingData" is at 10.2272s
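
The log excerpt above can be read against a small model of a bounded transmit buffer. The sketch below is not the ns-3 TcpTxBuffer class: ToyTxBuffer, its members, and the 65535-byte capacity used in the usage note are assumptions made purely for illustration. Only the packet size (1460) and the free space (1132) are taken from the log.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a bounded transmit buffer; NOT the ns-3 TcpTxBuffer class.
// All names are hypothetical.
struct ToyTxBuffer
{
  uint32_t maxSize; // capacity in bytes (assumed value, not from the thread)
  uint32_t used;    // bytes buffered and not yet acknowledged

  // Free space left for new application data.
  uint32_t Available () const
  {
    assert (used <= maxSize);
    return maxSize - used;
  }

  // Accept a packet only if it fits entirely; otherwise reject it.
  bool Add (uint32_t packetSize)
  {
    if (packetSize > Available ())
      {
        return false; // corresponds to "Rejected. Not enough room to buffer packet."
      }
    used += packetSize;
    return true;
  }

  // An ACK is the only event that frees space in this model.
  void DiscardAcked (uint32_t ackedBytes)
  {
    used -= (ackedBytes > used ? used : ackedBytes);
  }
};
```

Starting from `ToyTxBuffer b{65535, 65535 - 1132}`, every `b.Add (1460)` is rejected until `DiscardAcked` frees some space. This matches the repeated rejections in the log once ACKs stop being processed.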
Comment 22 Brian Swenson 2012-06-22 10:56:14 EDT
Created attachment 1416
Fragmentation fix

The problem was changing the fragmentation threshold in the middle of packet fragmentation.  In the test case provided by Josh, the packet was originally sent as a single fragment.  When it reached its destination, the simulator expected several fragments because the threshold had changed.  Since no further fragments ever arrived, the transaction never finished, so the TCP buffer was never cleaned out; it filled up and the node stopped sending packets.  The attached patch addresses this.
Comment 23 Tom Henderson 2012-06-22 12:42:53 EDT
(In reply to comment #22)
> Created attachment 1416
> Fragmentation fix
> 
> Problem was changing fragmentation threshold in middle of packet fragmentation.
>  In the test case provided by Josh, when the packet was originally sent it all
> went in one fragment.  When it reached its destination the simulator was
> expecting many fragments because the threshold had changed.  Since it never
> received any more fragments the transaction never finished, therefore it never
> cleaned out the TCP buffer and therefore it filled up and it stopped sending
> packets.  The attached patch addresses this.

Changing the component to "WiFi" since the patch suggests that this is localized to a WiFi bug; perhaps also the test case could be turned into a regression test.
Comment 24 sebastien.deronne 2015-05-27 10:31:07 EDT
Fix seems ok, but I would also update EdcaTxopN:

+  m_stationManager->UpdateFragmentationThreshold();


From a realism point of view, will this really occur in real systems?
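
The patch itself is not reproduced in this thread, so the following is only one plausible reading of what a call such as UpdateFragmentationThreshold() could do: store a requested threshold and apply it only at a point where no frame is being fragmented. ToyStationManager and its members are hypothetical names for this sketch and are not the ns-3 WifiRemoteStationManager implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a deferred threshold change; NOT ns-3 code.
class ToyStationManager
{
public:
  explicit ToyStationManager (uint32_t threshold)
    : m_threshold (threshold),
      m_nextThreshold (threshold)
  {
    assert (threshold > 0);
  }

  // May be called at any time, e.g. from a scheduled configuration change.
  // The new value is only recorded, not applied.
  void SetFragmentationThreshold (uint32_t threshold)
  {
    assert (threshold > 0);
    m_nextThreshold = threshold;
  }

  // Meant to be called by the MAC between frames, when no fragment
  // exchange is in progress; only then does the new value take effect.
  void UpdateFragmentationThreshold ()
  {
    m_threshold = m_nextThreshold;
  }

  // Value used for all fragmentation decisions on the current frame.
  uint32_t GetFragmentationThreshold () const
  {
    return m_threshold;
  }

private:
  uint32_t m_threshold;     // threshold currently in force
  uint32_t m_nextThreshold; // threshold requested but not yet applied
};
```

Under this model, a frame that is sent and acknowledged between two calls to UpdateFragmentationThreshold() always sees a single, consistent threshold, regardless of when the configuration change was scheduled.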
Comment 25 Matías Richart 2015-06-30 10:03:44 EDT
The patch seems OK to me.

However, looking at the code, I don't completely agree with Brian's explanation.
I think the problem is not on the receiver side when the frame is received, given that the information on whether more fragments follow is carried in the frame itself (the MoreFrag flag).
The problem arises when the sender receives the ACK frame. At that point the sender checks whether more ACKs (for further fragments of the "same frame") should be expected, based on the original frame size and the fragmentation threshold. If the threshold was changed between the time the frame was sent and the time the ACK was received, the sender keeps waiting for more ACKs.

The proposed patch solves the problem, but I think a more correct (though more complex) solution would be to remember the number of fragments last sent.
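
The sender-side mismatch described in this comment can be illustrated with a few lines of arithmetic. This is a simplified sketch, not the ns-3 MAC code: the function and struct names are hypothetical, header overhead is ignored, and the initial threshold of 2346 bytes used in the usage note is an assumption. The 1460-byte payload and the 800-byte threshold are taken from earlier in the thread.

```cpp
#include <cassert>
#include <cstdint>

// Number of fragments needed for a payload under a given threshold.
// Simplified: header overhead is ignored.
uint32_t
FragmentCount (uint32_t payloadSize, uint32_t threshold)
{
  assert (threshold > 0);
  if (payloadSize == 0)
    {
      return 1;
    }
  return (payloadSize + threshold - 1) / threshold; // ceiling division
}

// Problematic variant: on each ACK, the expected number of fragments is
// recomputed from whatever the threshold is *now*.
bool
IsCompleteRecomputed (uint32_t payloadSize, uint32_t currentThreshold,
                      uint32_t fragmentsAcked)
{
  return fragmentsAcked >= FragmentCount (payloadSize, currentThreshold);
}

// Alternative suggested in the comment: remember how many fragments were
// actually sent and compare ACKs against that stored count.
struct PendingFrame
{
  uint32_t fragmentsSent;  // fixed at the time the frame was fragmented
  uint32_t fragmentsAcked; // ACKs received so far

  bool IsComplete () const
  {
    return fragmentsAcked >= fragmentsSent;
  }
};
```

With a 1460-byte payload, FragmentCount gives 1 for a threshold of 2346 and 2 for a threshold of 800. A frame sent as one fragment and acknowledged once is therefore incomplete according to IsCompleteRecomputed if the threshold dropped to 800 in the meantime, while a PendingFrame that stored the count at send time reports completion.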
Comment 26 Tom Henderson 2015-06-30 10:25:43 EDT
OK with me to commit (with Sebastien's extension).  

Regarding Matías's comment that a more direct check on the sender side would be nice, I agree, but since this may be a more limited use case than normal, perhaps we can get by with this.

Rather than invest time in a more sophisticated fix now, I would suggest instead to turn the simulation that shows the problem (perhaps scheduling a few such events with different threshold sizes) into a test case that is broken on ns-3-dev but for which the patch fixes it.  Then, we may catch future breakage of this and work on better patches then (since we probably will not think of this use case much during our future development).
Comment 27 sebastien.deronne 2015-07-06 12:50:28 EDT
Created attachment 2081
testcase to show the bug

updated testcase to be in line with the latest ns-3-dev and to better show the issue
Comment 28 sebastien.deronne 2015-07-06 13:18:57 EDT
Created attachment 2082
proposed fix for bug 730

updated fix to be in line with the latest ns-3-dev (+ add doxygen comments)
Comment 29 sebastien.deronne 2015-07-30 11:56:39 EDT
Created attachment 2106
patch to add a unit test

As suggested by Tom, this patch turns the simulation that shows the problem into a test case that is broken on ns-3-dev but for which the patch fixes it.

I still have to add a few comments and document the test case a bit, but it can at least be reviewed.
Comment 30 sebastien.deronne 2015-08-15 07:19:22 EDT
Fixed in changeset 11580:4bf4b6dfdf64, unit test added in changeset 11581:e205cbdadc69