Bug 418 - RttEstimator::RetransmitTimeout can return a negative value
: RttEstimator::RetransmitTimeout can return a negative value
Status: RESOLVED FIXED
: ns-3
internet-stack
: pre-release
: All All
: P1 normal
Assigned To:
:
:
:
:
  Show dependency treegraph
 
Reported: 2008-11-24 02:53 EDT by
Modified: 2008-12-15 12:10 EDT (History)


Attachments
avoid crash (896 bytes, patch)
2008-11-24 02:53 EDT, Mathieu Lacage
Details | Diff
maybe a fix? (580 bytes, patch)
2008-12-11 13:39 EDT, Rajib Bhattacharjea
Details | Diff
don't make negative estimates in rtt (889 bytes, patch)
2008-12-14 15:15 EDT, Craig Dowell
Details | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2008-11-24 02:53:12 EDT
Created an attachment (id=314) [details]
avoid crash

When it returns a negative value, the simulator crashes because that value is
fed to Simulator::Schedule.
Attached patch makes sure that the returned value is not negative.
------- Comment #1 From 2008-12-10 02:37:42 EDT -------
again, waiting for approval from raj. reassigning.
------- Comment #2 From 2008-12-11 13:16:50 EDT -------
The estimate should not be negative ever, and rounding negative values up to
zero isn't the right fix.  I'm working on a fix...
------- Comment #3 From 2008-12-11 13:39:39 EDT -------
Created an attachment (id=332) [details]
maybe a fix?

Will you see if this works for you?

Also, provide a test case?
------- Comment #4 From 2008-12-11 13:52:18 EDT -------
(In reply to comment #3)
> Created an attachment (id=332) [details] [details]
> maybe a fix?
> 
> Will you see if this works for you?
> 
> Also, provide a test case?

The testcase is coming from the ns-3-simu tree which is broken at the moment
so, I can't test this.
------- Comment #5 From 2008-12-12 00:39:35 EDT -------
I see a path to a simple test case.  It involves lowering the delay of the
PointToPointChannel between RTT measurements.  I am not having luck with:

Simulator::Schedule
  (Seconds(7),
   &Config::SetDefault,
   "ns3::PointToPointChannel::Delay",
   TimeValue (MilliSeconds(250)));

Any ideas?
------- Comment #6 From 2008-12-12 02:11:45 EDT -------
(In reply to comment #5)
> I see a path to a simple test case.  It involves lowering the delay of the
> PointToPointChannel between RTT measurements.  I am not having luck with:
> 
> Simulator::Schedule
>   (Seconds(7),
>    &Config::SetDefault,
>    "ns3::PointToPointChannel::Delay",
>    TimeValue (MilliSeconds(250)));
> 
> Any ideas?

use the loopback interface for the server and client: This is what I have been
doing in my ns-3-simu tests

> 
------- Comment #7 From 2008-12-12 15:23:48 EDT -------
(From update of attachment 332 [details])
This patch is no good either.  Its redundant (the absolute value of err is
already taken).
------- Comment #8 From 2008-12-13 23:49:08 EDT -------
In the interest of closing this, I applied a modified version of the patch
which makes the minimum RTO value a configurable attribute.  I'd like to
investigate this further though; the value being less than zero means there are
bigger problems in the logic (i.e. it should never happen).

Mathieu, a concrete test case would be great.  I had no luck with the loopback
device.

4023:d320dea20aca
------- Comment #9 From 2008-12-14 11:51:21 EDT -------
testcase:
http://code.nsnam.org/mathieu/ns-3-simu
cd ns-3-simu
hg backout --merge -r 4379
./waf
./waf --shell
./build/debug/examples/process-bench --Benchmark=tcp --Iterations=1000
assert failed. file=../src/simulator/default-simulator-impl.cc, line=189,
cond="tAbsolute >= TimeStep (m_currentTs)"



enjoy
------- Comment #10 From 2008-12-14 15:07:35 EDT -------
The problem is that that in RttMeanDeviation::Measurement an estimated RTT is
made that is negative.  The variable <est> needs to be clamped to be zero or
positive.  You can watch the control loop tick down and go negative as it
approaches an estimate of 0.

Code should be more like,

  Time err = m - est;
  est = est + Scalar (gain) * err; // estimated rtt
  if (est.IsNegative ())
    {
      est = Seconds (0.); // don't make negative estimates
    }
------- Comment #11 From 2008-12-14 15:15:55 EDT -------
Created an attachment (id=335) [details]
don't make negative estimates in rtt
------- Comment #12 From 2008-12-14 15:21:30 EDT -------
What's wrong with the patch I applied to ns-3-dev?
------- Comment #13 From 2008-12-14 15:23:43 EDT -------
@Craig:

The logic implemented is from RFC2988 and a Van Jacobson paper, and the RTO
should NEVER return a negative value unless the current estimate, or the
measurement, is negative (which should never happen with the correct logic in a
causal simulator).  Look carefully at the math.  Regardless of gain [0,1],  the
new estimate should be a weighted average of m and the current estimate.

My point is that the fact that RTO<0 at all implies that we are failing to
implement the logic correctly.
------- Comment #14 From 2008-12-14 15:55:22 EDT -------
> The logic implemented is from RFC2988 and a Van Jacobson paper, and the RTO
> should NEVER return a negative value unless the current estimate, or the
> measurement, is negative (which should never happen with the correct logic 
> in a causal simulator).  Look carefully at the math.  Regardless of gain
> [0,1],  the new estimate should be a weighted average of m and the current
> estimate.

I started by looking at the math, but I didn't immediately see how the estimate
could become zero, so I looked in a debugger since I didn't have a great handle
on possible numeric errors.  Despite what I seen in the math, in the program
the current estimate becomes negative.

In Mathieu's case, every measurement is 0 ns.  What happens when I watch the
control loop is that it begins converging toward zero seconds. Close to the
edge, you have the estimate finally at a point where it displays 0 ns (perhaps
this really means within epsilon of 0).  The error is calculated as m - est
which "the math" says should be 0-0 = 0.  This is displayed as -1 ns so there
is at least some tiny error present.  This tiny error persists in the
calculations, but the variance now begins dropping toward zero.  Eventually 4
times the variance becomes less than est and RetransmitTimeout returns a (timy)
negative number.

> My point is that the fact that RTO<0 at all implies that we are failing to
> implement the logic correctly.

Perhaps the logic is fine, but we aren't admitting the possibility of
vanishingly small numeric errors?
------- Comment #15 From 2008-12-14 16:12:33 EDT -------
That should have read,

  I started by looking at the math, but I didn't immediately see how the
  estimate could become *less than* zero, so I looked in a debugger since I
  didn't have a great handle on possible numeric errors.

Anyway, this is "precisely" the kind of situation where floating point can
inflict nasty wounds ...
------- Comment #16 From 2008-12-14 17:38:30 EDT -------
I'm holding up ns-3.3-RC4 until a resolution of this issue.  I'd like to get it
done by tonight (PST) so our European friends can have at the RC on their
Monday morning.

The attached patch does fix the above test case, so are you okay with it
Mathieu?

What do you think, Raj?  is it okay to apply this patch or do you want to look
deeper?
------- Comment #17 From 2008-12-14 23:14:23 EDT -------
FYI I am looking deeper.  Will have an answer in a few minutes.
------- Comment #18 From 2008-12-14 23:57:54 EDT -------
The following changeset, which was applied to ns-3-dev yesterday:
http://code.nsnam.org/ns-3-dev/rev/d320dea20aca

When applied to Mathieu's branch, "fixes" the problem, i.e. there is no
segfault.

Instead of pinning the minimum estimate (est) to zero, it pins the minimum RTO
to a configurable value, exported through the attributes system (i.e. old BSDs
use 1.0s, Linux uses 0.2s, etc)

The fact that 0-0 = -1e-9 is another bug I think?  As far as I can tell, it has
something to do with fast and slow values...

As it stands, the changeset checked-in to ns-3-dev fixes the issue.  We need to
understand the rounding / precision issue separately and file another bug
report.

I think this bug is okay to close.  Craig?
------- Comment #19 From 2008-12-15 00:13:21 EDT -------
I'm fine with clamping the RTO at some minimum value > 0 -- makes sense.  

If the RTO is pinned to some positive minimum value I don't see how this can
avoid fixing the problem; but I'd like to let Mathieu close the bug if he
considers the issue resolved to avoid any further close/repoen games.
------- Comment #20 From 2008-12-15 00:23:07 EDT -------
Works for me.  The token has been passed to Mathieu.
------- Comment #21 From 2008-12-15 08:37:23 EDT -------
The only thing I really care about is that the segfault is fixed in my
testcase: how it is fixed is not something I can comment on because I did not
try to look at the issue very much. From a high-level perspective, this looked
like a classical example of a floating point substraction rounding error where
you substract 2 values which are almost equal but not exactly equal and the
result is basically undefined unless you perform a post-processing explicit
rounding/clamping.

To summarize, I have no idea where the problem really came from or if your fix
is correct. I am just a bug reporter.

FYI, I have no access to a test machine right now so, I don't even know if your
patch fixes the segfault but I assume that you have tested this already.
------- Comment #22 From 2008-12-15 11:24:29 EDT -------
(In reply to comment #21)
> To summarize, I have no idea where the problem really came from or if your fix
> is correct. I am just a bug reporter.
> 
> FYI, I have no access to a test machine right now so, I don't even know if your
> patch fixes the segfault but I assume that you have tested this already.
> 

Yes, it was tested.  It fixes your segfault.