Bugzilla – Bug 730
Enabling fragmentation at run-time breaks simulation
Last modified: 2015-08-15 07:19:22 EDT
Created attachment 638 [details]
Simulation script showing the bug

When turning fragmentation on _during the simulation_, the simulation breaks: the node with fragmentation turned on does not send packets anymore (it can still receive packets). To reproduce this bug, just set 'FragmentationThreshold' to a value lower than the packet size after the simulation has begun. That is:

Simulator::Schedule (Seconds (10.0), Config::Set, "/NodeList/0/DeviceList/0/RemoteStationManager/FragmentationThreshold", StringValue ("800"));

The attached script reproduces this behavior. Turning other mechanisms on/off at run-time works fine (e.g. setting RtsCtsThreshold). Changing parameters like the physical data rate also works smoothly.
I am sorry but I do not see the behavior you are describing: the code appears to work well after the fragmentation has been enabled.
(In reply to comment #1)
I downloaded the latest ns-3-dev and the behavior seems normal to me, too. I will investigate and try to see what was wrong. In the meantime, I would suggest that this bug be closed.
(In reply to comment #1)
> I am sorry but I do not see the behavior you are describing: the code appears
> to work well after the fragmentation has been enabled.
I dug into the problem a bit more, and I found that the program behaves incorrectly when the number of nodes (nNodes in the code) is less than or equal to 3. I could not find what causes the problem, though. As a side note, the routing procedure written in the code seems fine to me.
(In reply to comment #3)
> I dug into the problem a bit more, and I found that the program behaves
> incorrectly when the number of nodes (nNodes in the code) is less than or
> equal to 3.
I ran the program with nNodes = 3 and it still seems to work correctly. Closing the bug.
(In reply to comment #4)
> I ran the program with nNodes = 3 and it still seems to work correctly.
> Closing the bug.
Actually, with 3 nodes it seems to work correctly. However, could you please try setting nNodes = 2, 4, or 5? Using ns-3-dev (rev a6ee8748aee7) the script seems to break at 10 seconds (when nNodes=2 or nNodes=4) and at about 26 seconds (when nNodes=5).
Created attachment 842 [details] new sim program showing the bug
The bug seems related to the TCP variable DelAckCount. Here is a matrix of tests I performed with different numbers of nodes and different DelAckCount values:

+--------------+----+----+----+----+
| nNodes=      |  2 |  3 |  4 |  5 |
+--------------+----+----+----+----+
|DelAckCount=1 | 10 |  - | 10 | 26 |
+--------------+----+----+----+----+
|DelAckCount=2 |  - | 10 |  - |  - |
+--------------+----+----+----+----+
|DelAckCount=3 | 10 |  - |  - |  - |
+--------------+----+----+----+----+
|DelAckCount=5 | 10 |  - | 10 |  - |
+--------------+----+----+----+----+

Numbers in the table stand for "the second at which the simulation breaks". No number means the simulation seems not to break.

I further investigated the case in which DelAckCount=1 and nNodes=2. After fragmentation is enabled, when sending a TCP data packet, the sender seems to repeatedly enter the if-test at internet-stack/tcp-socket-impl.cc:416 (rev a6ee8748aee7), reported hereunder for convenience:

if (p->GetSize () > GetTxAvailable ())
  {
    m_errno = ERROR_MSGSIZE;
    return -1;
  }

The sink stops sending new ACKs, thus preventing the source from sending new packets (m_highestRxAck stays the same and, accordingly, so does GetTxAvailable ()). However, it is not clear to me why the sink stops sending ACKs.
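The ACK-starvation loop described above can be modeled with a small standalone sketch (a simplified model, not the actual ns-3 classes): the send space reported by GetTxAvailable () is the buffer capacity minus the unacknowledged bytes, so when m_highestRxAck freezes, every subsequent send fails the same size check.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model (not the real ns-3 code): the socket tracks the highest
// ACKed sequence number and the next sequence to send. Send space is the
// send-buffer capacity minus the bytes sent but not yet ACKed.
struct TcpSendWindowModel {
  uint32_t bufferSize;     // total send-buffer capacity in bytes
  uint32_t nextTxSequence; // next byte to be sent
  uint32_t highestRxAck;   // highest ACK received so far

  uint32_t GetTxAvailable () const {
    uint32_t inFlight = nextTxSequence - highestRxAck;
    return bufferSize > inFlight ? bufferSize - inFlight : 0;
  }

  // Mirrors the check at tcp-socket-impl.cc:416: reject a packet that
  // does not fit into the remaining send space (ERROR_MSGSIZE).
  bool Send (uint32_t packetSize) {
    if (packetSize > GetTxAvailable ()) return false;
    nextTxSequence += packetSize;
    return true;
  }
};
```

With a 4096-byte buffer and 1460-byte segments, two unacknowledged segments already shrink the available space below a segment size; only an advancing highestRxAck restores it, which matches the observed stall once the sink stops ACKing.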
Christian, thank you very much for your good work in identifying the cause of the problem. I am reassigning the bug to the internet-stack component, so that it can be addressed by the right people.
reassigning to Josh (Tcp)
It seems like this is working now, using the latest ns-3-dev (bd6bf8901c92). I tried the simulation script provided by Nicola and used DelAck=1,2,3 and nNodes=3,4,5. I couldn't get anything strange to happen (i.e., it ran to the end and NodeList/2/Device/0 was still transmitting). Could anyone confirm this is still an issue?

Note: I had to change the DataMode in the script for the remote station manager, since it wouldn't build otherwise. I used DsssRate1Mbps.
Created attachment 938 [details] Simulation results (comment #11)
(In reply to comment #10)
> It seems like this is working now, using latest ns-3-dev (bd6bf8901c92). I
> tried the simulation script provided by Nicola and used DelAck=1,2,3 and
> nNodes=3,4,5. I couldn't get anything strange to happen (ie, it ran to the
> end and NodeList/2/Device/0 was still transmitting). Could anyone confirm
> this is still an issue?
I think this is still an issue. I ran a test using rev bd6bf8901c92 with nNodes=3, DelAck=2, DataMode=DsssRate1Mbps and it broke around 10 seconds. I have attached the simulation result: is it the same as yours?

Also with DataMode=DsssRate11Mbps it breaks around 10 seconds. Strangely, with nNodes=2, DelAck=1, DataMode=DsssRate11Mbps, trace sources are not hooked.

If you need more tests, don't hesitate to ask.
(In reply to comment #12)
> I think this is still an issue. I run a test using rev bd6bf8901c92 with
> nNodes=3, DelAck=2, DataMode=DsssRate1Mbps and it broke around 10 secs. I
> have attached the simulation result: is it the same as yours?
>
> Also with DataMode=DsssRate11Mbps it breaks around 10 seconds.
> Strangely, with nNodes=2, DelAck=1, DataMode=DsssRate11Mbps, trace sources
> are not hooked.
>
> If you need more tests, don't hesitate to ask.
My results are different with nNodes=3 and DelAck=2. Maybe you should post the exact test case you are using? Again, I am using the one Nicola posted, with the slight modification of DsssRate1Mbps. I tried to attach the simulation results, but the file was too big. It looks like yours in the beginning, but doesn't have a huge jump from 10s to 51s.
(In reply to comment #13)
> My results are different with nNodes=3 and DelAck=2. Maybe you should post
> the exact test case you are using? Again, I am using the one Nicola posted,
> with the slight modification of DsssRate1Mbps. I tried to attach the
> simulation results, but the file was too big. It looks like yours in the
> beginning, but doesn't have a huge jump from 10s to 51s.
Actually, I too am using the version Nicola posted. I've only changed the rate and the DelAck number:

--- frag_runtime.cc	2010-07-20 17:27:42.000000000 +0200
+++ frag_runtime2.cc	2010-07-20 17:34:21.000000000 +0200
@@ -48,7 +48,7 @@
   WifiHelper wifi = WifiHelper::Default ();
   wifi.SetStandard (WIFI_PHY_STANDARD_80211b);
   wifi.SetRemoteStationManager ("ns3::ConstantRateWifiManager",
-                                "DataMode", StringValue ("wifib-11mbs"));
+                                "DataMode", StringValue ("DsssRate1Mbps"));
 
   NqosWifiMacHelper mac = NqosWifiMacHelper::Default ();
   Ssid ssid = Ssid ("test-ssid-test");
@@ -70,7 +70,7 @@
   InternetStackHelper internet;
   internet.InstallAll ();
 
-  Config::SetDefault ("ns3::TcpSocket::DelAckCount", UintegerValue (1));
+  Config::SetDefault ("ns3::TcpSocket::DelAckCount", UintegerValue (2));
   Config::SetDefault ("ns3::TcpSocket::SegmentSize", UintegerValue (1460));
 
   Ipv4AddressHelper addresses;

I confirm I still see the bug. The system I am running the simulations on is a Linux machine (Debian), Intel(R) Core(TM)2 Duo CPU E7200 @ 2.53GHz. Other information:

$ free
             total       used       free     shared    buffers     cached
Mem:       3362876    2860500     502376          0     126412    1769524
-/+ buffers/cache:     964564    2398312
Swap:      1951888        156    1951732

$ uname -a
Linux harakei 2.6.27.7-20081126 #1 SMP Wed Nov 26 16:30:50 CET 2008 i686 GNU/Linux

I can test this bug on an amd64 machine (not sooner than a couple of hours from now).
Odd, I'm doing the same:

diff scratch/fragmentation.cc scratch/frag_runtime.cc
44d43
< Config::SetDefault ("ns3::TcpSocket::DelAckCount", UintegerValue (2));
52c51
< "DataMode", StringValue ("DsssRate1Mbps"));
---
> "DataMode", StringValue ("wifib-11mbs"));

I am using a 64-bit machine. Did you ./waf build in debug or optimised? I am using debug.
(In reply to comment #15)
> Odd, I'm doing the same:
Odd, indeed!

> I am using a 64-bit machine. Did you ./waf build in debug or optimised? I am
> using debug.
I am using "-d debug" as well.

I've just run the script (with DelAck=2, rate=DsssRate1Mbps) on my amd64 machine (Debian) with rev bd6bf8901c92 and I confirm the bug.
(In reply to comment #16)
> (In reply to comment #15)
> > Odd, I'm doing the same:
> Odd, indeed!
>
> > I am using a 64-bit machine. Did you ./waf build in debug or optimised?
> > I am using debug.
> I am using "-d debug" as well.
>
> I've just run the script (with DelAck=2, rate=DsssRate1Mbps) on my amd64
> machine (Debian) with rev bd6bf8901c92 and I confirm the bug.
Ok, I figured it out. My problem as usual :) I set the DelAck at the top of the file, but failed to delete the one that was farther down (setting it back to one). Now that it's failing, I'll start looking into the TCP code.
(In reply to comment #17)
> Ok, I figured it out. My problem as usual :) I set the DelAck at the top of
> the file, but failed to delete the one that was farther down (setting it
> back to one). Now that it's failing, I'll start looking into the TCP code.
"Glad" that we found the reason! I was starting to run out of machines to test the bug :P Don't know if I can be of help, but in case, drop me a line.
(In reply to comment #17)
> Ok, I figured it out. My problem as usual :) I set the DelAck at the top of
> the file, but failed to delete the one that was farther down (setting it
> back to one). Now that it's failing, I'll start looking into the TCP code.
While I haven't been able to figure out the problem exactly, it does look like Adrian's new TCP code (http://code.nsnam.org/adrian/ns-3-tcp/) fixes this problem. We plan to merge this code, hopefully, in an ns-3.9.1 release. So I guess my next step will just be to create a test case that shows this failing with the old code and that hopefully will start passing once Adrian's code gets merged. I will post a test case here when I create one.
Created attachment 1042 [details]
Updated simulation showing problem

This still seems to be a problem. Here is an updated example to show the issue. Run with:

./waf --run 'scratch/frag_runtime --nNodes=3 --delAck=2'

Packets seem to stop flowing at 10 seconds, when the fragmentation is scheduled. Not sure if the above table still applies -- don't think so -- but I will use this program for some more debugging.

Note:

./waf --run 'scratch/frag_runtime --nNodes=3 --delAck=1'

and anything greater than delAck=3 seems to work.
This is still an issue in the latest ns-3-dev (710d6ff1d227, Aug 11, 2011). I've run some tests using the most recent simulation above. Results are in the table below (a blank indicates no issue; a number is approximately where the simulation seems to break -- it's more like 10.2272):

+--------------+----+----+----+----+----+----+----+----+
| nNodes=      |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 |
+--------------+----+----+----+----+----+----+----+----+
|DelAckCount=1 |  - |  - |  - |  - |  - |  - |  - | 10 |
+--------------+----+----+----+----+----+----+----+----+
|DelAckCount=2 | 10 | 10 |  - |  - |  - |  - |  - | 10 |
+--------------+----+----+----+----+----+----+----+----+
|DelAckCount=3 | 10 | 10 |  - | 10 |  - |  - | 10 |  - |
+--------------+----+----+----+----+----+----+----+----+
|DelAckCount=4 |  - |  - |  - |  - | 10 |  - |  - |  - |
+--------------+----+----+----+----+----+----+----+----+
|DelAckCount=5 |  - |  - |  - |  - |  - | 10 |  - |  - |
+--------------+----+----+----+----+----+----+----+----+

Note, I didn't do 2 nodes simply because the script is set up to monitor the third node's RxOk and RxError. I went ahead and changed it and ran with two nodes, and it seems to break on all DelAckCounts I tried. Also, if you change the time at which the fragmentation threshold is changed (for example, 15s), the simulation seems to break at that point (15s).
Viewing the full log, this is pretty much what you get over and over again after the simulation seems to "break":

50.939s OnOffApplication:SendPacket()
50.939s OnOffApplication:SendPacket(): sending packet at +50938967997.0ns
50.939s Simulator:IsExpired(0x1295a50)
50.939s Buffer:Buffer(0x129a558, 1460)
50.939s Buffer:Initialize(0x129a558, 1460)
50.939s ByteTagList:ByteTagList(0x129a578)
50.939s Simulator:GetSystemId()
50.939s PacketMetadata:Create(): create size=10, max=10
50.939s PacketMetadata:Create(): create alloc size=10
50.939s PacketMetadata:DoAddHeader(0x129a590, 0, 1460)
50.939s Socket:Send()
50.939s 50.939 [node 0] TcpSocketBase:Send(0x1296d30, 0x129a550)
50.939s TcpTxBuffer:Add(0x1296f30, 0x129a550)
50.939s TcpTxBuffer:Add(): Packet of size 1460 appending to window starting at 400041, availSize=1132
50.939s TcpTxBuffer:Add(): Rejected. Not enough room to buffer packet.

So the TcpTxBuffer is filling up early, and after the frag threshold change, for some reason, it doesn't get emptied. The last "SendPendingData" is at 10.2272s.
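A minimal model (a hypothetical stand-in, not the real ns-3 TcpTxBuffer) of why the buffer fills up: space is only reclaimed when acknowledged bytes are discarded, so once ACK processing stalls, every subsequent Add () is rejected exactly as in the log above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for ns-3's TcpTxBuffer: Add () succeeds
// only while there is room, and room is only freed by discarding ACKed data.
struct TxBufferModel {
  uint32_t maxBuffer;  // capacity in bytes
  uint32_t size = 0;   // bytes buffered and not yet acknowledged

  bool Add (uint32_t pktSize) {
    if (pktSize > maxBuffer - size) return false; // "Rejected. Not enough room."
    size += pktSize;
    return true;
  }

  // Called when an ACK arrives: drop the acknowledged bytes from the buffer.
  void DiscardUpTo (uint32_t ackedBytes) {
    size = (size > ackedBytes) ? size - ackedBytes : 0;
  }
};
```

With no calls to DiscardUpTo (), the buffer reaches a state where every 1460-byte Add () fails, matching the repeating "availSize" rejection in the trace; the underlying question is why the ACKs that would trigger the discard stop arriving.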
Created attachment 1416 [details]
Fragmentation fix

The problem was changing the fragmentation threshold in the middle of packet fragmentation. In the test case provided by Josh, when the packet was originally sent it all went in one fragment. When it reached its destination, the simulator was expecting many fragments because the threshold had changed. Since it never received any more fragments, the transaction never finished; therefore it never cleaned out the TCP buffer, which filled up and stopped sending packets. The attached patch addresses this.
(In reply to comment #22)
> Created attachment 1416 [details]
> Fragmentation fix
>
> The problem was changing the fragmentation threshold in the middle of packet
> fragmentation. In the test case provided by Josh, when the packet was
> originally sent it all went in one fragment. When it reached its destination,
> the simulator was expecting many fragments because the threshold had changed.
> Since it never received any more fragments, the transaction never finished;
> therefore it never cleaned out the TCP buffer, which filled up and stopped
> sending packets. The attached patch addresses this.
Changing the component to "WiFi", since the patch suggests that this is localized to a WiFi bug; perhaps the test case could also be turned into a regression test.
The fix seems OK, but I would also update EdcaTxopN:

+ m_stationManager->UpdateFragmentationThreshold();

From a realism point of view, will this really occur in real systems?
The patch seems OK to me. However, looking at the code, I don't completely agree with Brian's explanation. I think the problem is not when receiving the frame at the receiver side, given that the information about whether there are more fragments or not is carried in the frame itself (the MoreFrag flag). The problem arises when the sender receives the ACK frame. What is done there is checking whether more ACKs (from more fragments of the "same frame") should be expected or not. This is done by looking at the original frame size and the fragmentation threshold. If the threshold was changed between when the frame was sent and when the ACK was received, the sender keeps waiting for more ACKs. The proposed patch solves the problem, but I think a more correct (and more complex) solution would be to remember the last number of fragments sent.
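The sender-side accounting described above can be illustrated with a tiny standalone helper (a simplified ceil-division sketch; the real ns-3 MAC also accounts for header overhead): a sender that recomputes the fragment count from the original frame size and the *current* threshold gets a different answer after the threshold changes, so it waits for fragment ACKs that will never come.

```cpp
#include <cassert>
#include <cstdint>

// Simplified fragment-count rule (ignores MAC header overhead): a frame of
// `size` bytes is split into ceil(size / threshold) fragments.
uint32_t GetNFragments (uint32_t size, uint32_t threshold) {
  uint32_t n = size / threshold;
  if (size % threshold != 0) ++n;
  return n;
}
```

For example, a 1500-byte frame sent while the threshold is 2346 goes out as a single fragment, but if the threshold drops to 800 before the ACK arrives, recomputing yields two expected fragments; remembering the count used at transmission time, as suggested above, avoids the mismatch.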
OK with me to commit (with Sebastien's extension). Regarding Matias's comment that a more direct check on the sender side would be nice, I agree, but since this may be a more limited use case than normal, perhaps we can get by with this. Rather than invest time in a more sophisticated fix now, I would suggest instead to turn the simulation that shows the problem (perhaps scheduling a few such events with different threshold sizes) into a test case that is broken on ns-3-dev but for which the patch fixes it. Then, we may catch future breakage of this and work on better patches then (since we probably will not think of this use case much during our future development).
Created attachment 2081 [details] testcase to show the bug updated testcase to be in line with the latest ns-3-dev and to better show the issue
Created attachment 2082 [details] proposed fix for bug 730 updated fix to be in line with the latest ns-3-dev (+ add doxygen comments)
Created attachment 2106 [details]
patch to add a unit test

As suggested by Tom, this patch turns the simulation that shows the problem into a test case that is broken on ns-3-dev but for which the patch fixes it. I still have to add a few comments and document the test case a bit, but it can at least be reviewed.
Fixed in changeset 11580:4bf4b6dfdf64, unit test added in changeset 11581:e205cbdadc69