Bugzilla – Bug 2788
test-runner crashes running multiple tests
Last modified: 2019-11-05 14:22:53 EST
On Sep 4, 2017, at 10:37 AM, Ammon, Robert (ammo6818@vandals.uidaho.edu) <ammo6818@vandals.uidaho.edu> wrote:

I am working with Tom on a project to set up an NS-3 development environment on Visual Studio. I have created the Visual Studio environment and made all code changes necessary to build the NS-3 development branch under Visual Studio. I started from John Abraham's previous work, but it needed a lot of work (the details fill a 10-page e-mail that I previously sent Tom; I have left them out here for brevity, and because I don't think they are germane to the discussion below).

I have now started executing TestRunner to verify all of the tests and have encountered an issue. The issue appears to be in the design of the Time class, and by my review it should be platform neutral (we should be seeing the same issue under Linux), even though it does not appear there. Since the issue only appears under Windows, I am looking for your help in understanding the difference between Windows and Linux, and why we don't see the same issue under Linux (since it appears that we should).

The following is the output from test-runner in the command window when run under Windows. The problem is the assert triggered on the last line. The interesting thing is that the assert is valid: it matches the current implementation of the Time class and the affected test case (AttributeTestCase).

PASS angles 0.000 s
PASS cosine-antenna-model 0.013 s
PASS degrees-radians 0.000 s
PASS isotropic-antenna-model 0.000 s
PASS parabolic-antenna-model 0.011 s
PASS aodv-routing-id-cache 1.217 s
FAIL routing-aodv-regression 3.085 s
PASS routing-aodv 2.176 s
PASS routing-aodv-loopback 0.783 s
PASS udp-client-server 3.085 s
assert failed. cond="g_markingTimes != 0", msg="No MarkedTimes registry. Time::SetResolution () called more than once?", file=c:\ns3\ns-3-win-dev\src\core\model\time.cc, line=351

I have attached a PDF file with screen shots of the particular execution. As can be seen in the PDF, the first breakpoint is from the CORE.DLL initialization of the static elements. Next, the simulation initialization executes SetDefaultNsResolution, which calls SetResolution to initialize the time resolution for the simulation. Then ClearMarkedTimes is called when the first test case (AODV) is executed. I have left them out, but ClearMarkedTimes is called for each test case shown in the output window. Finally, the AttributeValue test case is executed. That test case calls Time::SetResolution to ensure the time units are nanoseconds, and eventually that call triggers the assertion.

In reviewing the current implementation, I can't see any way this should behave differently under Linux (the same assertion should be thrown). The affected member of the Time class is g_markingTimes. This member only gets set during static variable initialization at program load, and gets cleared (set to null) by Time::ClearMarkedTimes and by one execution path of Time::ConvertTimes. Once cleared by either of these two functions, there is no way to set the value again, and since ClearMarkedTimes is called by Simulator::Run for each test case executed, I can't see how Time::SetResolution could ever be used successfully in any of the test cases. I suspect there is some little thing someplace that I am missing, and since the current implementation was done by you as part of the Bug 954 updates, I am hoping you can point out the little something I might be missing. Any help or ideas you can provide would be greatly appreciated.
This is indeed a puzzle. After a quick look I agree with most of your summary of the execution paths, except for one point: while Simulator::Run *does* clear the marked times, I don't see where *tests* call Simulator::Run. If that is indeed the case, then there shouldn't be an issue with a test setting the resolution. I think we should modify execution so that calling SetResolution a second time with the current resolution *does not* throw an exception; a no-op is a no-op, after all. But this will hide your assert, which indicates a real issue. I'm not sure what to suggest, other than watching g_markingTimes, and/or setting breakpoints on all the relevant functions and examining the call stack on each hit.
On Sep 5, 2017, at 6:18 PM, Ammon, Robert (ammo6818@vandals.uidaho.edu) <ammo6818@vandals.uidaho.edu> wrote:

I explored this some more and here is the issue. Under Linux, ./test.py runs a new instance of test-runner for each test being executed. What I was doing on Windows was executing test-runner.exe with no parameters to execute every test. In this situation, SetResolution gets called by multiple test cases, which looks to the Time class like multiple calls to SetResolution within a single program (which it is). If I create a batch file to execute test-runner.exe once for each test case, I get the same behavior as under Linux.

My follow-on question is why the implementation of the Time class sets the value of g_markingTimes to null in ConvertTimes and ClearMarkedTimes. I assume that is to prevent SetResolution from being called more than once, but I am not sure what the reason for that implementation is.
On Sep 5, 2017, at 6:18 PM, Ammon, Robert (ammo6818@vandals.uidaho.edu) <ammo6818@vandals.uidaho.edu> wrote:

> I explored this some more and here is the issue.
>
> Under Linux, ./test.py runs a new instance of test-runner for each test
> being executed.
>
> What I was doing on Windows was executing test-runner.exe with no
> parameters to execute every test.

Ah, that explains it.

> In this situation, SetResolution gets called by multiple test cases, which
> looks to the Time class like multiple calls to SetResolution within a
> single program (which it is).

It looks like accepting a SetResolution call that matches the current resolution would only fix a few of those cases. I found Time::SetResolution here:

attribute-test-suite.cc: Time::SetResolution (Time::NS);
first.cc: Time::SetResolution (Time::NS);
sample-log-time-format.cc: Time::SetResolution (search->second);
time-test-suite.cc: Time::SetResolution (Time::PS);

So only the first two would be safe.

> If I create a batch file to execute test-runner.exe once for each test case,
> I get the same behavior as under Linux.

Yes, that will have to be the real solution. It would be nice if TestRunnerImpl could do something clever when running all tests: print a warning, omit the tests that call SetResolution, or run only one of them (I'd vote for time-test-suite). It's not obvious to me how to do any of these. Perhaps add TestCase::SetsTimeResolution(void) to mark those tests?

> My follow-on question is why the implementation of the Time class sets
> the value of g_markingTimes to null in ConvertTimes and ClearMarkedTimes.
> I assume that is to prevent SetResolution from being called more than
> once, but I am not sure what the reason for that implementation is.

Since Time is used everywhere, we didn't want the resolution option to slow things down while running. Therefore we only support changing the resolution during setup, and freeze the resolution during execution.
The implementation mechanics are to set a flag, which happens to be a pointer, to indicate whether we're recording Time instances or not. Notice that the Time constructors all check that flag. Simulator::Run and Time::SetResolution clear the flag.
On Sep 28, 2017, at 10:21 AM, Tom Henderson <tomh@tomh.org> wrote: I'm not sure this was driven to resolution; do we need to open a Bugzilla issue to not forget about the details?
On Sep 28, 2017, at 5:40 PM, Ammon, Robert (ammo6818@vandals.uidaho.edu) <ammo6818@vandals.uidaho.edu> wrote:

This originally got written because when I tried to execute test-runner without any parameters in the Windows environment, it started throwing assertion failures. It turns out that even though test-runner is designed to automatically execute all test cases when run without command-line parameters, that won't work because of the limitations on changing the time resolution. Without significantly relaxing this restriction, test-runner can't be executed without command-line parameters on either Linux or Windows. So my plan is to use the same approach to execute tests on Windows as on Linux, namely ./test.py. I need to do this anyway in order to pick up the examples that get executed by ./test.py, so that the same set of tests and examples gets executed on both Linux and Windows. Bottom line: I don't ever see this behavior changing, so I don't see the need for a Bugzilla issue.
Yes, I don't think we want to change the behavior of Simulator::Run. We *could* change Time::SetResolution not to dump the g_markingTimes list, but those tests which then call Simulator::Run will still fail when run together. It looks like first.cc and sample-log-time-format.cc are the only ones which call both SetResolution and Simulator::Run. We could hard-code an exclusion list in test-runner to run only one of those at a time. Seems a little hokey, but it would prevent this error. Which one should be enabled?
The underlying issue is running each test suite in a separate process (Linux and Mac) versus running all suites in the same process (Windows port). There didn't appear to be any good alternatives, other than running tests on Windows the same way as on the other systems. Can we close this?
I am OK to close as resolved/wontfix, and if the Windows VS port is revived someday in a similar fashion, it can be revisited then.
Resolving as WONTFIX. Could also be considered as INVALID, since the Windows test framework was running all tests in a single process context, not as designed.