Bug 695 - DcfManager::UpdateBackoff () uses slow HighPrecision::Div()
DcfManager::UpdateBackoff () uses slow HighPrecision::Div()
Status: RESOLVED FIXED
Product: ns-3
Classification: Unclassified
Component: wifi
ns-3-dev
All All
: P3 enhancement
Assigned To: Mathieu Lacage
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2009-09-29 15:04 EDT by Andrey Mazo
Modified: 2009-11-30 09:39 EST (History)
5 users (show)

See Also:


Attachments
Solution #1 (via long doubles) (1.13 KB, patch)
2009-09-29 17:23 EDT, Andrey Mazo
Details | Diff
Solution #2 (via TimeInvert) (1.49 KB, patch)
2009-09-29 17:24 EDT, Andrey Mazo
Details | Diff
Modifications to examples to measure perfomance gain. (1.59 KB, patch)
2009-09-29 17:26 EDT, Andrey Mazo
Details | Diff
solution #3 (2.46 KB, patch)
2009-11-24 06:34 EST, Mathieu Lacage
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Andrey Mazo 2009-09-29 15:04:22 EDT
UpdateBackoff() function seems to be a bottleneck for wifi-based simulations eating up to 30% in "examples/mesh".

The really slow line is (src/devices/wifi/dcf-manager.cc:522 at rev 684768c10c59):
Scalar nSlots = (Simulator::Now () - backoffStart) / m_slotTime;

This line executes slow HighPrecision division each time the Updatebackoff() function is called (and moreover each for() loop iteration).

I see at least 2 possible solutions:
1) retrieve nanoseconds from all involved Time variables and run all calculations using long double precision;
2) Calculate (1 / m_slotTime) once in DcfManager::SetSlot() and use faster 128bit multiplication instead of slow 128bit devision here.

Solution #1 theoretically can be less precise than original code, but:
1) GetNanoSeconds () returns uint64_t which won't overflow until 584-year-long simulation which is more than enough;
2) difference (Simulator::Now () - backoffStart) isn't supposed to be very large, so long double easily handles it;
3) ratio ((Simulator::Now () - backoffStart) / m_slotTime) isn't supposed to be very large too, so again long double is enough;
4) nSlots is anyway converted to double, thus losing the HighPrecision.
So I think solution #1 isn't less precise in practice.

Solution #2 gives less runtime speed up than solution #1, but is still as precise as original code.

All unit tests and regression tests are passed for both solutions.
Personally I vote for solution #1.

More detailed benchmarks and patches will be posted soon.
Comment 1 Andrey Mazo 2009-09-29 17:21:41 EDT
(In reply to comment #0)
> More detailed benchmarks and patches will be posted soon.
Static optimized build compiled with gcc-4.3.2 run on Turion 64 X2 TL-60 @ 2Ghz.
I've slightly modified the following examples to make them run longer (patch attached).

0) Original code (revision c1bd4ffb5e47)
 Command being timed: "./build/optimized/examples/mesh"
 User time (seconds): 182.09
 System time (seconds): 0.04
 Percent of CPU this job got: 99%
 Elapsed (wall clock) time (h:mm:ss or m:ss): 3:02.24

 Command being timed: "build/optimized/examples/wifi-wired-bridging --nStas=4"
 User time (seconds): 280.16
 System time (seconds): 2.41
 Percent of CPU this job got: 98%
 Elapsed (wall clock) time (h:mm:ss or m:ss): 4:46.7

1) Solution #1
 Command being timed: "build/optimized/examples/mesh"
 User time (seconds): 120.45
 System time (seconds): 0.09
 Percent of CPU this job got: 99%
 Elapsed (wall clock) time (h:mm:ss or m:ss): 2:01.58

 Command being timed: "build/optimized/examples/wifi-wired-bridging --nStas=4"
 User time (seconds): 240.95
 System time (seconds): 2.87
 Percent of CPU this job got: 94%
 Elapsed (wall clock) time (h:mm:ss or m:ss): 4:18.66

2) Solution #2
 Command being timed: "build/optimized/examples/mesh"
 User time (seconds): 122.12
 System time (seconds): 0.08
 Percent of CPU this job got: 97%
 Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.70

 Command being timed: "build/optimized/examples/wifi-wired-bridging --nStas=4"
 User time (seconds): 243.98
 System time (seconds): 3.34
 Percent of CPU this job got: 97%
 Elapsed (wall clock) time (h:mm:ss or m:ss): 4:14.37



So, we're getting a speed up of about 15-30%.
Comment 2 Andrey Mazo 2009-09-29 17:23:39 EDT
Created attachment 606 [details]
Solution #1 (via long doubles)
Comment 3 Andrey Mazo 2009-09-29 17:24:45 EDT
Created attachment 607 [details]
Solution #2 (via TimeInvert)
Comment 4 Andrey Mazo 2009-09-29 17:26:02 EDT
Created attachment 608 [details]
Modifications to examples to measure perfomance gain.
Comment 5 Pavel Boyko 2009-10-01 07:53:26 EDT
(In reply to comment #0)
> UpdateBackoff() function seems to be a bottleneck for wifi-based simulations
> eating up to 30% in "examples/mesh".

  Mathieu, could you explain me why do we need 128 bits for time at all? We always find high precision arithmetics on top of the profiles (oprofile, cachegrind) -- can we try to avoid them using "native" 64 bit time? More globally, I have an impression that ns-3 is too slow to practically simulate O(100) wifi stations with dense traffic -- do you agree? 

> The really slow line is (src/devices/wifi/dcf-manager.cc:522 at rev
> 684768c10c59):
> Scalar nSlots = (Simulator::Now () - backoffStart) / m_slotTime;
> 
> This line executes slow HighPrecision division each time the Updatebackoff()
> function is called (and moreover each for() loop iteration).
> 
> I see at least 2 possible solutions:
> 1) retrieve nanoseconds from all involved Time variables and run all
> calculations using long double precision;
> 2) Calculate (1 / m_slotTime) once in DcfManager::SetSlot() and use faster
> 128bit multiplication instead of slow 128bit devision here.
> 
> Solution #1 theoretically can be less precise than original code, but:
> 1) GetNanoSeconds () returns uint64_t which won't overflow until 584-year-long
> simulation which is more than enough;
> 2) difference (Simulator::Now () - backoffStart) isn't supposed to be very
> large, so long double easily handles it;
> 3) ratio ((Simulator::Now () - backoffStart) / m_slotTime) isn't supposed to be
> very large too, so again long double is enough;
> 4) nSlots is anyway converted to double, thus losing the HighPrecision.
> So I think solution #1 isn't less precise in practice.
> 
> Solution #2 gives less runtime speed up than solution #1, but is still as
> precise as original code.
> 
> All unit tests and regression tests are passed for both solutions.
> Personally I vote for solution #1.
> 
> More detailed benchmarks and patches will be posted soon.
> 

Comment 6 Mathieu Lacage 2009-11-24 06:34:15 EST
Created attachment 672 [details]
solution #3

How about this 3rd solution ?
Comment 7 Mathieu Lacage 2009-11-24 07:12:51 EST
ns-3-dev:
bash-3.2$ time -p ./build/optimized/examples/wireless/wifi-wired-bridging
--nStas=4
real 278.54
user 249.31
sys 3.84
bash-3.2$ time -p ./build/optimized/examples/mesh/mesh
real 216.39
user 206.39
sys 0.02

ns-3-dev + solution #3:
bash-3.2$ time -p ./build/optimized/examples/wireless/wifi-wired-bridging
--nStas=4
real 226.54
user 209.00
sys 3.03
bash-3.2$ time -p ./build/optimized/examples/mesh/mesh
real 63.63
user 62.24
sys 0.19
Comment 8 Mathieu Lacage 2009-11-24 07:28:31 EST
(In reply to comment #5)
>   Mathieu, could you explain me why do we need 128 bits for time at all? We

That goes back to a long discussion a long long time ago. To make it short:
  - we want to make it easy for model developers to get the same simulation results on different platforms with different floating point arithmetic units
  - hence, we want to encourage avoiding the use of floating point arithmetic
  - but users still need to make simple calculations which need more precision than simple integer arithmetic
  - our internal time is 64bits
  - hence, user calculations need more precision than 64 bits
  - hence, the easiest thing to do is to give them 64.64 (128) precision.

> always find high precision arithmetics on top of the profiles (oprofile,
> cachegrind) -- can we try to avoid them using "native" 64 bit time? More

In some models, (say, DcfManager), we use Time because it's convenient, not because we need to. i.e., a lot of the manager code could work with simple integer arithmetics. In fact, it was originally written that way.

> globally, I have an impression that ns-3 is too slow to practically simulate
> O(100) wifi stations with dense traffic -- do you agree? 

I suspect that if you try to make 100 stations _interfere_, then, yes, you will run into interesting problems but mainly because the underlying interference PHY model is O(n2) and, hopefully, you can easily fix this by using a different PHY model. Otherwise, I would be happy to help fix any optimization issue you have with the wifi models to make them useful to you.
Comment 9 Andrey Mazo 2009-11-24 07:53:16 EST
(In reply to comment #6)
> Created an attachment (id=672) [details]
> solution #3
> 
> How about this 3rd solution ?
Good for me.
It gets much better speed gain than others!
Comment 10 Mathieu Lacage 2009-11-24 07:54:25 EST
can you review it for correctness and push if it is ok ? (all tests appear to pass, but, who knows...)
Comment 11 Andrey Mazo 2009-11-24 08:11:58 EST
(In reply to comment #10)
> can you review it for correctness and push if it is ok ? (all tests appear to
> pass, but, who knows...)
I'll try to compare some traces for several (relatively) long-running simulations and then push.
Comment 12 Andrey Mazo 2009-11-24 08:51:11 EST
(In reply to comment #10)
> can you review it for correctness and push if it is ok ? (all tests appear to
> pass, but, who knows...)
I've noticed, that your patch and original code will behave differently in several situations, because you've replaced lrint() with integer devision, which works more like floor().

Old code:
Simulator::Now() = 27000;
backoffStart = 10000;
m_slotTime = 9000;
nSlots = (Simulator::Now () - backoffStart) / m_slotTime = 1.88888;
nIntSlots = lrint (nSlots.GetDouble ()) = 2;

New code:
Simulator::Now().GetMicroSeconds() = 27;
backoffStart.GetMicroSeconds() = 10;
m_slotTimeUs = 9;
nus = 27 - 10 = 17;
nIntSlots = nus / m_slotTimeUs = 17/9 = 1;

I don't know, which of the answers is correct according to the standard.
Comment 13 Mathieu Lacage 2009-11-24 09:51:15 EST
oh, uhm. I suspect that the correct answer is either round to upper always or round to lower integer always, but never round to the closest.

I have no idea what we should be doing here though.
Comment 14 Pavel Boyko 2009-11-24 10:01:36 EST
  Mathieu,

  thank you for explanations about history of Time.

> I suspect that if you try to make 100 stations _interfere_, then, yes, you will
> run into interesting problems but mainly because the underlying interference
> PHY model is O(n2) and, 

  Yes, now I know this too. Typical profile of large scale wifi simulation I see:

samples  %        image name               symbol name
2361908309 68.2932  mesh-sc                  ns3::InterferenceHelper::GetEnergyDuration(double)
185529257  5.3645  mesh-sc                  ns3::InterferenceHelper::CalculateNoiseInterferenceW(ns3::Ptr<ns3::InterferenceHelper::Event>, std::vector<ns3::InterferenceHelper::NiChange, std::allocator<ns3::InterferenceHelper::NiChange> >*) const
91399806  2.6428  mesh-sc                  ns3::DcfManager::GetBackoffStartFor(ns3::DcfState*)
88964737  2.5724  mesh-sc                  ns3::DcfManager::GetAccessGrantStart() const

> hopefully, you can easily fix this by using a different
> PHY model. 

  Do you have one? We will try to optimize interference model when we'll have some time (= students). 

> Otherwise, I would be happy to help fix any optimization issue you
> have with the wifi models to make them useful to you.

  Thank you! We will definitely return to wifi runtime optimizations.
Comment 15 Andrey Mazo 2009-11-24 10:24:43 EST
(In reply to comment #13)
> oh, uhm. I suspect that the correct answer is either round to upper always or
> round to lower integer always, but never round to the closest.
> 
> I have no idea what we should be doing here though.
I think, I would go for rounding to upper integer always to avoid situations with nIntSlots == 0 and updating backoffStart to the past (but I'm not sure, that this is bad):
Simulator::Now().GetMicroSeconds() = 17;
backoffStart.GetMicroSeconds() = 10;
m_slotTimeUs = 9;
nus = 17 - 10 = 7;
nIntSlots = nus / m_slotTimeUs = 7/9 = 0;
Comment 16 Mathieu Lacage 2009-11-24 10:27:26 EST
(In reply to comment #14)
> > hopefully, you can easily fix this by using a different
> > PHY model. 
> 
>   Do you have one? We will try to optimize interference model when we'll have
> some time (= students). 

No, I don't have one but I wish I had time to add one.
Comment 17 Mathieu Lacage 2009-11-25 03:14:54 EST
(In reply to comment #15)
> (In reply to comment #13)
> > oh, uhm. I suspect that the correct answer is either round to upper always or
> > round to lower integer always, but never round to the closest.
> > 
> > I have no idea what we should be doing here though.
> I think, I would go for rounding to upper integer always to avoid situations
> with nIntSlots == 0 and updating backoffStart to the past (but I'm not sure,
> that this is bad):
> Simulator::Now().GetMicroSeconds() = 17;
> backoffStart.GetMicroSeconds() = 10;
> m_slotTimeUs = 9;
> nus = 17 - 10 = 7;
> nIntSlots = nus / m_slotTimeUs = 7/9 = 0;

I would support whatever you feel is appropriate if you have tested it. So, feel free to commit a version which works for you.
Comment 18 Andrey Mazo 2009-11-30 09:26:11 EST
(In reply to comment #15)
> (In reply to comment #13)
> > oh, uhm. I suspect that the correct answer is either round to upper always or
> > round to lower integer always, but never round to the closest.
> > 
> > I have no idea what we should be doing here though.
> I think, I would go for rounding to upper integer always to avoid situations
> with nIntSlots == 0 and updating backoffStart to the past (but I'm not sure,
> that this is bad):
> Simulator::Now().GetMicroSeconds() = 17;
> backoffStart.GetMicroSeconds() = 10;
> m_slotTimeUs = 9;
> nus = 17 - 10 = 7;
> nIntSlots = nus / m_slotTimeUs = 7/9 = 0;

Seems, that I misunderstood the purpose of UpdateBackoff() function.
As I was explained to, this function removes already passed slots from backoff.
So it's a normal situation, when it removes no slots.

Now we're going to push Mathieu's patch and update regression traces (for now, only third example fails regression test).
Comment 19 Andrey Mazo 2009-11-30 09:39:58 EST
(In reply to comment #17)
> I would support whatever you feel is appropriate if you have tested it. So,
> feel free to commit a version which works for you.

changeset d3fdc15b065f