Bug 1397

Summary: threaded-simulator slow on OS X
Product: ns-3
Reporter: Vedran Miletić <vedran>
Component: core
Assignee: Mathieu Lacage <mathieu.lacage>
Status: REOPENED
Severity: major
CC: aquereilhac, ns-bugs, tomh
Priority: P5
Version: pre-release
Hardware: PC
OS: Mac OS
Attachments:
  Debug info
  Only test RealtimeSimulatorImpl if librt is available
  only test threaded-simulator cases appropriate for the configuration

Description Vedran Miletić 2012-03-27 08:44:01 EDT
Test crashes on Fedora 17 with GCC 4.7, and it doesn't on Fedora 16. Is there anything I can do to help debug it?
Comment 1 Vedran Miletić 2012-03-27 08:54:19 EDT
Created attachment 1368 [details]
Debug info

Backtrace and other stuff
Comment 2 Mathieu Lacage 2012-03-27 09:10:18 EDT
provide a more useful backtrace with gdb, please
Comment 3 Mathieu Lacage 2012-03-28 06:26:07 EDT
works under gdb, fails outside.
Comment 4 Vedran Miletić 2012-03-28 10:26:22 EDT
I observed that as well, and thought at first I didn't wait enough.

Any ideas why? Could it be a gcc/glibc bug?
Comment 5 Vedran Miletić 2012-03-28 13:38:59 EDT
It crashes even in gdb if you wait long enough.
Comment 6 Mathieu Lacage 2012-04-01 10:02:24 EDT
changeset: 61fc92f6d7f4

This was some crazy corner case in some never-used piece of code. A great example of the cost of code that is never used.
Comment 7 Vedran Miletić 2012-04-01 10:11:48 EDT
Great, thanks.
Comment 8 Tom Henderson 2012-04-02 09:38:20 EDT
I still see this occurring intermittently, on different systems.  Over the weekend, I was running the tests a lot and would see this pop up from time to time.

See, for example, from earlier today:  https://ns-buildmaster.ee.washington.edu:8010/job/osx-lion/label=master/97/consoleText
Comment 9 Mathieu Lacage 2012-04-02 09:41:35 EDT
(In reply to comment #8)
> I still see this occurring intermittently, on different systems.  Over the
> weekend, I was running the tests a lot and would see this pop up from time to
> time.
> 
> See, for example, from earlier today: 
> https://ns-buildmaster.ee.washington.edu:8010/job/osx-lion/label=master/97/consoleText

url does not answer from my side of the internet.
Comment 10 Vedran Miletić 2012-04-05 18:19:27 EDT
Works here. Can you check again?
Comment 11 Tom Henderson 2012-04-06 00:29:34 EDT
(In reply to comment #10)
> Works here. Can you check again?

I believe that Alina is looking into this at the moment.
Comment 12 Tom Henderson 2012-04-06 18:29:19 EDT
For the moment, I disabled the ns2-calendar-scheduler test (which seems to have been the culprit).  I will leave this bug open until we remove ns2-calendar-scheduler from the codebase, if we do.
Comment 13 Vedran Miletić 2012-04-07 04:19:30 EDT
So, has Mathieu fixed something else entirely when fixing this on F17?
Comment 14 Tom Henderson 2012-04-07 11:56:53 EDT
(In reply to comment #13)
> So, has Mathieu fixed something else entirely when fixing this on F17?

I believe that there was a fix to the calendar scheduler in changeset 61fc92f6d7f4, but this fix exposed some fragility in the ns2-calendar-scheduler, which began to fail intermittently.

I was able to get rid of the intermittent failures by removing ns2-calendar-scheduler from that test.

There is still a lingering issue: this test fails on the OS X ns-buildmaster machine with the following assertion:

assert failed. cond="uid != 0", msg="Assert in TypeId::LookupByName: ns3::RealtimeSimulatorImpl not found"

which I haven't looked at yet.
Comment 15 alina 2012-04-08 13:42:40 EDT
Created attachment 1373 [details]
Only test RealtimeSimulatorImpl if librt is available
Comment 16 alina 2012-04-08 13:43:29 EDT
It is possible this is happening because env['ENABLE_REAL_TIME'] is False in src/core/wscript. In this case the realtime-simulator-impl will not be included, but the threaded-simulator test suite will.

The patch attached above (attachment 1373) fixes this by only using the RealtimeSimulatorImpl in the threaded-simulator suite if HAVE_RT is defined.

I don't have access to an OS X box, so I could not test this on OS X.
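
To illustrate the kind of guard described in this comment, here is a minimal sketch (not the attached patch itself). HAVE_RT is the macro named above; the function name and the idea of returning a list of type names are hypothetical and only serve the example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the simulator implementation type names
// that a test suite should exercise in the current build configuration.
std::vector<std::string> SimulatorTypesToTest()
{
  std::vector<std::string> types;
  // Assumed to be always built, so always tested.
  types.push_back("ns3::DefaultSimulatorImpl");
#ifdef HAVE_RT
  // Only compiled in when librt was found at configure time, so the
  // lookup by name would otherwise fail with "not found".
  types.push_back("ns3::RealtimeSimulatorImpl");
#endif
  return types;
}
```

A test suite could loop over the returned names and create one test case per name, so that a build without librt never asks for a type that was not compiled in.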
Comment 17 Tom Henderson 2012-04-08 13:52:32 EDT
(In reply to comment #16)
> It is possible this is happening because env['ENABLE_REAL_TIME'] is False in
> src/core/wscript. In this case the realtime-simulator-impl will not be
> included, but the threaded-simulator test suite will.
> 
> The (before)attached patch fixes this by only using the RealtimeSimulatorImpl
> in the threaded-simulator suite if HAVE_RT is defined.
> 
> I don't have access to a OS X box, so I could not test this in OS X.

Yes, I am testing this, and I am also including threaded-test-suite.cc in the build only if PTHREAD is enabled.

@@ -289,6 +288,7 @@
             'model/unix-system-mutex.cc',
             'model/unix-system-condition.cc',
             ])
+        core_test.source.extend(['test/threaded-test-suite.cc'])
         core.use.append('PTHREAD')
         core_test.use.append('PTHREAD')
         headers.source.extend([

I will check this in if it passes tests.
Comment 18 Vedran Miletić 2012-04-08 15:08:29 EDT
So, did I understand correctly, ns2-calendar-scheduler wasn't the issue, but realtime stuff?
Comment 19 alina 2012-04-08 15:59:16 EDT
(In reply to comment #18)
> So, did I understand correctly, ns2-calendar-scheduler wasn't the issue, but
> realtime stuff?

There are two independent issues:

* One is caused by using the ns2-calendar-scheduler in the threaded-simulator suite, which intermittently produces the following error:

assert failed. cond="next.key.m_ts >= m_currentTs", msg="RealtimeSimulatorImpl::ProcessOneEvent(): next.GetTs() earlier than m_currentTs (list order error), file=../src/core/model/realtime-simulator-impl.cc, line=330
terminate called without an active exception
Aborted (core dumped)

* The other is caused by using the realtime-simulator-impl in the threaded-simulator suite when realtime-simulator-impl is not included in ns-3 (because librt is not present in the system). This produces the error:

assert failed. cond="uid != 0", msg="Assert in TypeId::LookupByName:
ns3::RealtimeSimulatorImpl not found"

Additionally, as Tom pointed out, the threaded-simulator suite should only be included when pthread is found on the system.
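
For readers unfamiliar with the first assertion: it checks that the scheduler never hands out an event whose timestamp is earlier than the current simulation time. A minimal sketch of that invariant follows, assuming plain integer timestamps; the function name and the vector-based interface are hypothetical and are not ns-3 code.

```cpp
#include <cstddef>
#include <stdint.h>
#include <vector>

// Hypothetical check: given timestamps in the order a scheduler handed
// the events out, return the index of the first event that is earlier
// than the current time, or -1 if the order is valid.
long FirstOrderViolation(const std::vector<uint64_t> &timestamps)
{
  uint64_t currentTs = 0;
  for (std::size_t i = 0; i < timestamps.size(); ++i)
    {
      if (timestamps[i] < currentTs)
        {
          return static_cast<long>(i); // the "list order error" case
        }
      currentTs = timestamps[i]; // time advances to the processed event
    }
  return -1;
}
```

Equal timestamps are accepted, matching the `>=` in the asserted condition.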
Comment 20 Vedran Miletić 2012-04-08 17:10:35 EDT
Thanks for the explanation Alina.
Comment 21 Tom Henderson 2012-04-09 02:05:23 EDT
I have a patch to fix this but, in testing, it raised some concerns about the performance of the threaded test suite on OS X (which has pthread but not librt).  

Here are some performance results from this latest patch on ns-buildmaster (an OS X Mac Pro) and ns-test (a Fedora 14 machine):

ns-buildmaster:  time ./test.py -s threaded-simulator

PASS: TestSuite threaded-simulator
1 of 1 tests passed (1 passed, 0 skipped, 0 failed, 0 crashed, 0 valgrind errors)

real	8m57.021s
user	6m29.876s
sys	52m23.002s


ns-test:  time ./test.py -s threaded-simulator

PASS: TestSuite threaded-simulator
1 of 1 tests passed (1 passed, 0 skipped, 0 failed, 0 crashed, 0 valgrind errors)

real	0m19.791s
user	0m15.807s
sys	0m9.165s
Comment 22 Tom Henderson 2012-04-09 02:07:53 EDT
Created attachment 1374 [details]
only test threaded-simulator cases appropriate for the configuration
Comment 23 alina 2012-04-09 20:22:18 EDT
(In reply to comment #21)
> I have a patch to fix but, in testing, it raised some concerns about the
> performance of the threaded test suite on OS X (which has pthread but not
> librt).  
> 
> Here are some performance results from this latest patch on ns-buildmaster (an
> OS X Mac Pro) and ns-test (a Fedora 14 machine):
> 
> ns-buildmaster:  time ./test.py -s threaded-simulator
> 
> PASS: TestSuite threaded-simulator
> 1 of 1 tests passed (1 passed, 0 skipped, 0 failed, 0 crashed, 0 valgrind
> errors)
> 
> real    8m57.021s
> user    6m29.876s
> sys    52m23.002s
> 
> 
> ns-test:  time ./test.py -s threaded-simulator
> 
> PASS: TestSuite threaded-simulator
> 1 of 1 tests passed (1 passed, 0 skipped, 0 failed, 0 crashed, 0 valgrind
> errors)
> 
> real    0m19.791s
> user    0m15.807s
> sys    0m9.165s

The default-simulator-impl uses unix-system-mutex for synchronization, which internally uses pthread_mutex_t and pthread_mutex_lock to lock the mutex.

Apparently OS X has performance issues with pthread_mutex_lock. Following [1], I ran the proposed test program on ns-buildmaster and ns-test, and got the following results:

ns-buildmaster:repos alina$ time ./pthread_mutex_test

real	0m21.786s
user	0m2.869s
sys	0m18.533s


[alina@ns-test repos]$ time ./pthread_mutex_test

real	0m3.230s
user	0m2.380s
sys	0m10.333s


This shows a significant performance difference (about 21.8 s versus 3.2 s of wall-clock time). [2] suggests using spin locks instead of mutex locks ...  


[1] http://lists.apple.com/archives/perfoptimization-dev/2008/Feb/msg00011.html
[2] http://lists.apple.com/archives/perfoptimization-dev/2008/Feb/msg00017.html
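
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a contended-mutex workload. It is not the program from [1]; the names, thread count, and iteration count are hypothetical. Several threads increment one shared counter, taking the same pthread mutex for every increment.

```cpp
#include <pthread.h>
#include <vector>

namespace {

// State shared by all worker threads (hypothetical layout).
struct Shared
{
  pthread_mutex_t mutex;
  long counter;
  long iterations;
};

// Each worker locks the mutex once per increment to maximize contention.
void *Worker(void *arg)
{
  Shared *shared = static_cast<Shared *>(arg);
  for (long i = 0; i < shared->iterations; ++i)
    {
      pthread_mutex_lock(&shared->mutex);
      ++shared->counter;
      pthread_mutex_unlock(&shared->mutex);
    }
  return 0;
}

} // namespace

// Runs the workload and returns the final counter value, which equals
// (threads actually started) * iterationsPerThread.
long RunContendedIncrements(int numThreads, long iterationsPerThread)
{
  Shared shared;
  pthread_mutex_init(&shared.mutex, 0);
  shared.counter = 0;
  shared.iterations = iterationsPerThread;

  std::vector<pthread_t> threads;
  for (int i = 0; i < numThreads; ++i)
    {
      pthread_t thread;
      // Keep only threads that were really created, so the join loop is safe.
      if (pthread_create(&thread, 0, Worker, &shared) == 0)
        {
          threads.push_back(thread);
        }
    }
  for (std::vector<pthread_t>::size_type i = 0; i < threads.size(); ++i)
    {
      pthread_join(threads[i], 0);
    }
  pthread_mutex_destroy(&shared.mutex);
  return shared.counter;
}
```

Calling RunContendedIncrements from a small main and running the binary under `time` on both machines should give a rough comparison of the real, user, and sys figures; absolute numbers will depend on core count and OS version.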
Comment 24 alina 2012-04-09 20:53:20 EDT
(In reply to comment #23)

As an additional comment on this problem: people who use ns-3 on OS X can only use the DefaultSimulatorImpl. Before the 'thread-safe-fixes patch' was applied to ns-3, simulations using DefaultSimulatorImpl with multiple threads would crash or misbehave (in contrast, they now work but are very slow on OS X). For this reason, probably nobody is using multithreading + DefaultSimulatorImpl on OS X, so what really matters is evaluating the performance impact of the thread-safety changes to the DefaultSimulatorImpl on single-threaded simulations on OS X. (I will test this and let you know.)

My point is that another option for fixing the problem would be to re-implement, for OS X, all the ns-3 classes that use pthreads, using native threading (such as NSThread). This is probably not worth it if nobody is using multithreaded simulations on OS X.
Comment 25 Tom Henderson 2012-04-10 19:15:26 EDT
Added this patch (changeset 5ed75237e75a) so we can unbreak the build. Leaving this bug open (with a different title) for two issues:

1) remove Ns2CalendarScheduler (also possibly Heap and Calendar)
2) resolve or wontfix OS X issues