Bug 695 - DcfManager::UpdateBackoff () uses slow HighPrecision::Div()
: DcfManager::UpdateBackoff () uses slow HighPrecision::Div()
Status: RESOLVED FIXED
: ns-3
wifi
: ns-3-dev
: All All
: P3 enhancement
Assigned To:
:
:
:
:
  Show dependency treegraph
 
Reported: 2009-09-29 15:04 EDT by
Modified: 2009-11-30 09:39 EDT (History)


Attachments
Solution #1 (via long doubles) (1.13 KB, patch)
2009-09-29 17:23 EDT, Andrey Mazo
Details | Diff
Solution #2 (via TimeInvert) (1.49 KB, patch)
2009-09-29 17:24 EDT, Andrey Mazo
Details | Diff
Modifications to examples to measure perfomance gain. (1.59 KB, patch)
2009-09-29 17:26 EDT, Andrey Mazo
Details | Diff
solution #3 (2.46 KB, patch)
2009-11-24 06:34 EDT, Mathieu Lacage
Details | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2009-09-29 15:04:22 EDT
UpdateBackoff() function seems to be a bottleneck for wifi-based simulations
eating up to 30% in "examples/mesh".

The really slow line is (src/devices/wifi/dcf-manager.cc:522 at rev
684768c10c59):
Scalar nSlots = (Simulator::Now () - backoffStart) / m_slotTime;

This line executes slow HighPrecision division each time the Updatebackoff()
function is called (and moreover each for() loop iteration).

I see at least 2 possible solutions:
1) retrieve nanoseconds from all involved Time variables and run all
calculations using long double precision;
2) Calculate (1 / m_slotTime) once in DcfManager::SetSlot() and use faster
128bit multiplication instead of slow 128bit devision here.

Solution #1 theoretically can be less precise than original code, but:
1) GetNanoSeconds () returns uint64_t which won't overflow until 584-year-long
simulation which is more than enough;
2) difference (Simulator::Now () - backoffStart) isn't supposed to be very
large, so long double easily handles it;
3) ratio ((Simulator::Now () - backoffStart) / m_slotTime) isn't supposed to be
very large too, so again long double is enough;
4) nSlots is anyway converted to double, thus losing the HighPrecision.
So I think solution #1 isn't less precise in practice.

Solution #2 gives less runtime speed up than solution #1, but is still as
precise as original code.

All unit tests and regression tests are passed for both solutions.
Personally I vote for solution #1.

More detailed benchmarks and patches will be posted soon.
------- Comment #1 From 2009-09-29 17:21:41 EDT -------
(In reply to comment #0)
> More detailed benchmarks and patches will be posted soon.
Static optimized build compiled with gcc-4.3.2 run on Turion 64 X2 TL-60 @
2Ghz.
I've slightly modified the following examples to make them run longer (patch
attached).

0) Original code (revision c1bd4ffb5e47)
 Command being timed: "./build/optimized/examples/mesh"
 User time (seconds): 182.09
 System time (seconds): 0.04
 Percent of CPU this job got: 99%
 Elapsed (wall clock) time (h:mm:ss or m:ss): 3:02.24

 Command being timed: "build/optimized/examples/wifi-wired-bridging --nStas=4"
 User time (seconds): 280.16
 System time (seconds): 2.41
 Percent of CPU this job got: 98%
 Elapsed (wall clock) time (h:mm:ss or m:ss): 4:46.7

1) Solution #1
 Command being timed: "build/optimized/examples/mesh"
 User time (seconds): 120.45
 System time (seconds): 0.09
 Percent of CPU this job got: 99%
 Elapsed (wall clock) time (h:mm:ss or m:ss): 2:01.58

 Command being timed: "build/optimized/examples/wifi-wired-bridging --nStas=4"
 User time (seconds): 240.95
 System time (seconds): 2.87
 Percent of CPU this job got: 94%
 Elapsed (wall clock) time (h:mm:ss or m:ss): 4:18.66

2) Solution #2
 Command being timed: "build/optimized/examples/mesh"
 User time (seconds): 122.12
 System time (seconds): 0.08
 Percent of CPU this job got: 97%
 Elapsed (wall clock) time (h:mm:ss or m:ss): 2:05.70

 Command being timed: "build/optimized/examples/wifi-wired-bridging --nStas=4"
 User time (seconds): 243.98
 System time (seconds): 3.34
 Percent of CPU this job got: 97%
 Elapsed (wall clock) time (h:mm:ss or m:ss): 4:14.37



So, we're getting a speed up of about 15-30%.
------- Comment #2 From 2009-09-29 17:23:39 EDT -------
Created an attachment (id=606) [details]
Solution #1 (via long doubles)
------- Comment #3 From 2009-09-29 17:24:45 EDT -------
Created an attachment (id=607) [details]
Solution #2 (via TimeInvert)
------- Comment #4 From 2009-09-29 17:26:02 EDT -------
Created an attachment (id=608) [details]
Modifications to examples to measure perfomance gain.
------- Comment #5 From 2009-10-01 07:53:26 EDT -------
(In reply to comment #0)
> UpdateBackoff() function seems to be a bottleneck for wifi-based simulations
> eating up to 30% in "examples/mesh".

  Mathieu, could you explain me why do we need 128 bits for time at all? We
always find high precision arithmetics on top of the profiles (oprofile,
cachegrind) -- can we try to avoid them using "native" 64 bit time? More
globally, I have an impression that ns-3 is too slow to practically simulate
O(100) wifi stations with dense traffic -- do you agree? 

> The really slow line is (src/devices/wifi/dcf-manager.cc:522 at rev
> 684768c10c59):
> Scalar nSlots = (Simulator::Now () - backoffStart) / m_slotTime;
> 
> This line executes slow HighPrecision division each time the Updatebackoff()
> function is called (and moreover each for() loop iteration).
> 
> I see at least 2 possible solutions:
> 1) retrieve nanoseconds from all involved Time variables and run all
> calculations using long double precision;
> 2) Calculate (1 / m_slotTime) once in DcfManager::SetSlot() and use faster
> 128bit multiplication instead of slow 128bit devision here.
> 
> Solution #1 theoretically can be less precise than original code, but:
> 1) GetNanoSeconds () returns uint64_t which won't overflow until 584-year-long
> simulation which is more than enough;
> 2) difference (Simulator::Now () - backoffStart) isn't supposed to be very
> large, so long double easily handles it;
> 3) ratio ((Simulator::Now () - backoffStart) / m_slotTime) isn't supposed to be
> very large too, so again long double is enough;
> 4) nSlots is anyway converted to double, thus losing the HighPrecision.
> So I think solution #1 isn't less precise in practice.
> 
> Solution #2 gives less runtime speed up than solution #1, but is still as
> precise as original code.
> 
> All unit tests and regression tests are passed for both solutions.
> Personally I vote for solution #1.
> 
> More detailed benchmarks and patches will be posted soon.
> 
------- Comment #6 From 2009-11-24 06:34:15 EDT -------
Created an attachment (id=672) [details]
solution #3

How about this 3rd solution ?
------- Comment #7 From 2009-11-24 07:12:51 EDT -------
ns-3-dev:
bash-3.2$ time -p ./build/optimized/examples/wireless/wifi-wired-bridging
--nStas=4
real 278.54
user 249.31
sys 3.84
bash-3.2$ time -p ./build/optimized/examples/mesh/mesh
real 216.39
user 206.39
sys 0.02

ns-3-dev + solution #3:
bash-3.2$ time -p ./build/optimized/examples/wireless/wifi-wired-bridging
--nStas=4
real 226.54
user 209.00
sys 3.03
bash-3.2$ time -p ./build/optimized/examples/mesh/mesh
real 63.63
user 62.24
sys 0.19
------- Comment #8 From 2009-11-24 07:28:31 EDT -------
(In reply to comment #5)
>   Mathieu, could you explain me why do we need 128 bits for time at all? We

That goes back to a long discussion a long long time ago. To make it short:
  - we want to make it easy for model developers to get the same simulation
results on different platforms with different floating point arithmetic units
  - hence, we want to encourage avoiding the use of floating point arithmetic
  - but users still need to make simple calculations which need more precision
than simple integer arithmetic
  - our internal time is 64bits
  - hence, user calculations need more precision than 64 bits
  - hence, the easiest thing to do is to give them 64.64 (128) precision.

> always find high precision arithmetics on top of the profiles (oprofile,
> cachegrind) -- can we try to avoid them using "native" 64 bit time? More

In some models, (say, DcfManager), we use Time because it's convenient, not
because we need to. i.e., a lot of the manager code could work with simple
integer arithmetics. In fact, it was originally written that way.

> globally, I have an impression that ns-3 is too slow to practically simulate
> O(100) wifi stations with dense traffic -- do you agree? 

I suspect that if you try to make 100 stations _interfere_, then, yes, you will
run into interesting problems but mainly because the underlying interference
PHY model is O(n2) and, hopefully, you can easily fix this by using a different
PHY model. Otherwise, I would be happy to help fix any optimization issue you
have with the wifi models to make them useful to you.
------- Comment #9 From 2009-11-24 07:53:16 EDT -------
(In reply to comment #6)
> Created an attachment (id=672) [details] [details]
> solution #3
> 
> How about this 3rd solution ?
Good for me.
It gets much better speed gain than others!
------- Comment #10 From 2009-11-24 07:54:25 EDT -------
can you review it for correctness and push if it is ok ? (all tests appear to
pass, but, who knows...)
------- Comment #11 From 2009-11-24 08:11:58 EDT -------
(In reply to comment #10)
> can you review it for correctness and push if it is ok ? (all tests appear to
> pass, but, who knows...)
I'll try to compare some traces for several (relatively) long-running
simulations and then push.
------- Comment #12 From 2009-11-24 08:51:11 EDT -------
(In reply to comment #10)
> can you review it for correctness and push if it is ok ? (all tests appear to
> pass, but, who knows...)
I've noticed, that your patch and original code will behave differently in
several situations, because you've replaced lrint() with integer devision,
which works more like floor().

Old code:
Simulator::Now() = 27000;
backoffStart = 10000;
m_slotTime = 9000;
nSlots = (Simulator::Now () - backoffStart) / m_slotTime = 1.88888;
nIntSlots = lrint (nSlots.GetDouble ()) = 2;

New code:
Simulator::Now().GetMicroSeconds() = 27;
backoffStart.GetMicroSeconds() = 10;
m_slotTimeUs = 9;
nus = 27 - 10 = 17;
nIntSlots = nus / m_slotTimeUs = 17/9 = 1;

I don't know, which of the answers is correct according to the standard.
------- Comment #13 From 2009-11-24 09:51:15 EDT -------
oh, uhm. I suspect that the correct answer is either round to upper always or
round to lower integer always, but never round to the closest.

I have no idea what we should be doing here though.
------- Comment #14 From 2009-11-24 10:01:36 EDT -------
  Mathieu,

  thank you for explanations about history of Time.

> I suspect that if you try to make 100 stations _interfere_, then, yes, you will
> run into interesting problems but mainly because the underlying interference
> PHY model is O(n2) and, 

  Yes, now I know this too. Typical profile of large scale wifi simulation I
see:

samples  %        image name               symbol name
2361908309 68.2932  mesh-sc                 
ns3::InterferenceHelper::GetEnergyDuration(double)
185529257  5.3645  mesh-sc                 
ns3::InterferenceHelper::CalculateNoiseInterferenceW(ns3::Ptr<ns3::InterferenceHelper::Event>,
std::vector<ns3::InterferenceHelper::NiChange,
std::allocator<ns3::InterferenceHelper::NiChange> >*) const
91399806  2.6428  mesh-sc                 
ns3::DcfManager::GetBackoffStartFor(ns3::DcfState*)
88964737  2.5724  mesh-sc                 
ns3::DcfManager::GetAccessGrantStart() const

> hopefully, you can easily fix this by using a different
> PHY model. 

  Do you have one? We will try to optimize interference model when we'll have
some time (= students). 

> Otherwise, I would be happy to help fix any optimization issue you
> have with the wifi models to make them useful to you.

  Thank you! We will definitely return to wifi runtime optimizations.
------- Comment #15 From 2009-11-24 10:24:43 EDT -------
(In reply to comment #13)
> oh, uhm. I suspect that the correct answer is either round to upper always or
> round to lower integer always, but never round to the closest.
> 
> I have no idea what we should be doing here though.
I think, I would go for rounding to upper integer always to avoid situations
with nIntSlots == 0 and updating backoffStart to the past (but I'm not sure,
that this is bad):
Simulator::Now().GetMicroSeconds() = 17;
backoffStart.GetMicroSeconds() = 10;
m_slotTimeUs = 9;
nus = 17 - 10 = 7;
nIntSlots = nus / m_slotTimeUs = 7/9 = 0;
------- Comment #16 From 2009-11-24 10:27:26 EDT -------
(In reply to comment #14)
> > hopefully, you can easily fix this by using a different
> > PHY model. 
> 
>   Do you have one? We will try to optimize interference model when we'll have
> some time (= students). 

No, I don't have one but I wish I had time to add one.
------- Comment #17 From 2009-11-25 03:14:54 EDT -------
(In reply to comment #15)
> (In reply to comment #13)
> > oh, uhm. I suspect that the correct answer is either round to upper always or
> > round to lower integer always, but never round to the closest.
> > 
> > I have no idea what we should be doing here though.
> I think, I would go for rounding to upper integer always to avoid situations
> with nIntSlots == 0 and updating backoffStart to the past (but I'm not sure,
> that this is bad):
> Simulator::Now().GetMicroSeconds() = 17;
> backoffStart.GetMicroSeconds() = 10;
> m_slotTimeUs = 9;
> nus = 17 - 10 = 7;
> nIntSlots = nus / m_slotTimeUs = 7/9 = 0;

I would support whatever you feel is appropriate if you have tested it. So,
feel free to commit a version which works for you.
------- Comment #18 From 2009-11-30 09:26:11 EDT -------
(In reply to comment #15)
> (In reply to comment #13)
> > oh, uhm. I suspect that the correct answer is either round to upper always or
> > round to lower integer always, but never round to the closest.
> > 
> > I have no idea what we should be doing here though.
> I think, I would go for rounding to upper integer always to avoid situations
> with nIntSlots == 0 and updating backoffStart to the past (but I'm not sure,
> that this is bad):
> Simulator::Now().GetMicroSeconds() = 17;
> backoffStart.GetMicroSeconds() = 10;
> m_slotTimeUs = 9;
> nus = 17 - 10 = 7;
> nIntSlots = nus / m_slotTimeUs = 7/9 = 0;

Seems, that I misunderstood the purpose of UpdateBackoff() function.
As I was explained to, this function removes already passed slots from backoff.
So it's a normal situation, when it removes no slots.

Now we're going to push Mathieu's patch and update regression traces (for now,
only third example fails regression test).
------- Comment #19 From 2009-11-30 09:39:58 EDT -------
(In reply to comment #17)
> I would support whatever you feel is appropriate if you have tested it. So,
> feel free to commit a version which works for you.

changeset d3fdc15b065f