Bugzilla – Bug 1397
threaded-simulator slow on OS X
Last modified: 2012-04-10 19:18:45 EDT
Test crashes on Fedora 17 with GCC 4.7, and it doesn't on Fedora 16. Is there anything I can do to help debug it?
Created attachment 1368 [details] Debug info Backtrace and other stuff
provide a more useful backtrace with gdb, please
works under gdb, fails outside.
I observed that as well, and thought at first that I hadn't waited long enough. Any ideas why? Could it be a gcc/glibc bug?
It crashes even in gdb if you wait long enough.
changeset: 61fc92f6d7f4

This was some crazy corner case in some never-used piece of code. A great example of the cost of code that is never used.
Great, thanks.
I still see this occurring intermittently, on different systems. Over the weekend, I was running the tests a lot and would see this pop up from time to time. See, for example, from earlier today: https://ns-buildmaster.ee.washington.edu:8010/job/osx-lion/label=master/97/consoleText
(In reply to comment #8)
> I still see this occurring intermittently, on different systems. Over the
> weekend, I was running the tests a lot and would see this pop up from time to
> time.
>
> See, for example, from earlier today:
> https://ns-buildmaster.ee.washington.edu:8010/job/osx-lion/label=master/97/consoleText

The URL does not respond from my side of the internet.
Works here. Can you check again?
(In reply to comment #10)
> Works here. Can you check again?

I believe that Alina is looking into this at the moment.
For the moment, I disabled the ns2-calendar-scheduler test (which seems to have been the culprit). I will leave this bug open until we remove ns2-calendar-scheduler from the codebase, if and when that happens.
So, has Mathieu fixed something else entirely when fixing this on F17?
(In reply to comment #13)
> So, has Mathieu fixed something else entirely when fixing this on F17?

I believe that there was a fix to the calendar scheduler in changeset 61fc92f6d7f4, but this fix exposed some fragility in the ns2-calendar-scheduler, which began to fail intermittently. I was able to get rid of the intermittent failures by disabling ns2-calendar-scheduler in that test.

There is still a lingering issue: this test fails on the OS X ns-buildmaster machine with the error below, which I haven't looked at yet.

assert failed. cond="uid != 0", msg="Assert in TypeId::LookupByName: ns3::RealtimeSimulatorImpl not found"
Created attachment 1373 [details] Only test RealtimeSimulatorImpl if librt is available
It is possible this is happening because env['ENABLE_REAL_TIME'] is False in src/core/wscript. In this case realtime-simulator-impl will not be included in the build, but the threaded-simulator test suite will.

The previously attached patch (attachment 1373) fixes this by only using RealtimeSimulatorImpl in the threaded-simulator suite if HAVE_RT is defined.

I don't have access to an OS X box, so I could not test this on OS X.
(In reply to comment #16)
> It is possible this is happening because env['ENABLE_REAL_TIME'] is False in
> src/core/wscript. In this case the realtime-simulator-impl will not be
> included, but the threaded-simulator test suite will.
>
> The (before)attached patch fixes this by only using the RealtimeSimulatorImpl
> in the threaded-simulator suite if HAVE_RT is defined.
>
> I don't have access to a OS X box, so I could not test this in OS X.

Yes, I am testing this, and also conditionally including threaded-test-suite.cc in the build only if PTHREAD is enabled.

@@ -289,6 +288,7 @@
 'model/unix-system-mutex.cc',
 'model/unix-system-condition.cc',
 ])
+ core_test.source.extend(['test/threaded-test-suite.cc'])
 core.use.append('PTHREAD')
 core_test.use.append('PTHREAD')
 headers.source.extend([

I will check this in if it passes tests.
So, did I understand correctly, ns2-calendar-scheduler wasn't the issue, but realtime stuff?
(In reply to comment #18)
> So, did I understand correctly, ns2-calendar-scheduler wasn't the issue, but
> realtime stuff?

There are two independent issues:

* One is caused by using the ns2-calendar-scheduler in the threaded-simulator suite, which intermittently produces the following error:

assert failed. cond="next.key.m_ts >= m_currentTs", msg="RealtimeSimulatorImpl::ProcessOneEvent(): next.GetTs() earlier than m_currentTs (list order error), file=../src/core/model/realtime-simulator-impl.cc, line=330
terminate called without an active exception
Aborted (core dumped)

* The other is caused by using the realtime-simulator-impl in the threaded-simulator suite when realtime-simulator-impl is not included in ns-3 (because librt is not present in the system). This produces the error:

assert failed. cond="uid != 0", msg="Assert in TypeId::LookupByName: ns3::RealtimeSimulatorImpl not found"

Additionally, as Tom pointed out, the threaded-simulator suite should only be included when pthread is found in the system.
Thanks for the explanation Alina.
I have a patch to fix this but, in testing, it raised some concerns about the performance of the threaded test suite on OS X (which has pthread but not librt).

Here are some performance results from this latest patch on ns-buildmaster (an OS X Mac Pro) and ns-test (a Fedora 14 machine):

ns-buildmaster: time ./test.py -s threaded-simulator

PASS: TestSuite threaded-simulator
1 of 1 tests passed (1 passed, 0 skipped, 0 failed, 0 crashed, 0 valgrind errors)

real 8m57.021s
user 6m29.876s
sys 52m23.002s

ns-test: time ./test.py -s threaded-simulator

PASS: TestSuite threaded-simulator
1 of 1 tests passed (1 passed, 0 skipped, 0 failed, 0 crashed, 0 valgrind errors)

real 0m19.791s
user 0m15.807s
sys 0m9.165s
Created attachment 1374 [details] only test threaded-simulator cases appropriate for the configuration
(In reply to comment #21)
> I have a patch to fix but, in testing, it raised some concerns about the
> performance of the threaded test suite on OS X (which has pthread but not
> librt).
>
> Here are some performance results from this latest patch on ns-buildmaster (an
> OS X Mac Pro) and ns-test (a Fedora 14 machine):
>
> ns-buildmaster: time ./test.py -s threaded-simulator
>
> PASS: TestSuite threaded-simulator
> 1 of 1 tests passed (1 passed, 0 skipped, 0 failed, 0 crashed, 0 valgrind
> errors)
>
> real 8m57.021s
> user 6m29.876s
> sys 52m23.002s
>
>
> ns-test: time ./test.py -s threaded-simulator
>
> PASS: TestSuite threaded-simulator
> 1 of 1 tests passed (1 passed, 0 skipped, 0 failed, 0 crashed, 0 valgrind
> errors)
>
> real 0m19.791s
> user 0m15.807s
> sys 0m9.165s

The default-simulator-impl uses unix-system-mutex for synchronization, which internally uses pthread_mutex_t and pthread_mutex_lock to lock the mutex. Apparently OS X has performance issues with pthread_mutex_lock. Following [1], I ran the proposed test program on ns-buildmaster and ns-test, and got the following results:

ns-buildmaster:repos alina$ time ./pthread_mutex_test

real 0m21.786s
user 0m2.869s
sys 0m18.533s

[alina@ns-test repos]$ time ./pthread_mutex_test

real 0m3.230s
user 0m2.380s
sys 0m10.333s

This shows a significant performance difference: about 21.8 s of wall-clock time on OS X versus 3.2 s on Fedora. [2] suggests using spin locks instead of mutex locks ...

[1] http://lists.apple.com/archives/perfoptimization-dev/2008/Feb/msg00011.html
[2] http://lists.apple.com/archives/perfoptimization-dev/2008/Feb/msg00017.html
(In reply to comment #23)

As an additional comment on this problem: people who use ns-3 on OS X can only use the DefaultSimulatorImpl. Before the 'thread-safe-fixes patch' was applied to ns-3, simulations using DefaultSimulatorImpl with multiple threads would crash or misbehave (in contrast, they now work but are really slow on OS X). For this reason, it is probable that nobody is using multithreading + DefaultSimulatorImpl on OS X, so what really matters is to evaluate the performance impact of the thread-safety changes to the DefaultSimulatorImpl for single-threaded simulations on OS X. (I will test this and let you know.)

My point is that another possible way to fix the problem would be to re-implement all the classes that use pthreads in ns-3 for OS X, using native threading (such as NSThread). This is probably not worth it if nobody is using multithreaded simulations on OS X.
Added this patch (changeset 5ed75237e75a) so we can unbreak the build.

Leaving this bug open (with a different title) for two issues:

1) remove Ns2CalendarScheduler (also possibly Heap and Calendar)
2) resolve or wontfix OS X issues