Bug 1351 - TCP not able to take RTT samples on long delay network
TCP not able to take RTT samples on long delay network
Status: RESOLVED FIXED
Product: ns-3
Classification: Unclassified
Component: tcp
pre-release
All All
: P5 normal
Assigned To: Tommaso Pecorella
:
: 1333 (view as bug list)
Depends on:
Blocks: 1333
  Show dependency treegraph
 
Reported: 2012-01-30 09:51 EST by Tom Henderson
Modified: 2012-03-21 13:54 EDT (History)
3 users (show)

See Also:


Attachments
Patch fixing the bug (1.54 KB, patch)
2012-02-01 14:58 EST, Tommaso Pecorella
Details | Diff
Patch fixing the bug and many other things (14.92 KB, patch)
2012-02-03 16:01 EST, Tommaso Pecorella
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Tom Henderson 2012-01-30 09:51:47 EST
(reported here on ns-3-users list: http://groups.google.com/group/ns-3-users/browse_thread/thread/3d788c5c01bf9d6a)

I was using NS to test TCP congestion control variants. Since last week decided to migrate from NS-3.10 to NS-3.13.
On NS-3.13 I am having several problems with the evolution of the TCP's congestion window (the results are totally different in NS-3.10 and NS-3.13).
As far i could see when there is high delay (250 ms or more) on the simulated network TCP is not able to take RTT samples, and due to this is not able to calculate RTT estimation and the RTO (congestion window is restarted to 1MSS very often).

The network i have been simulating is simple:

n0  ============== n2 ----------------  n2 ==============  n1
              csma                       p2p                       csma

n0 is the source node (onoffapplication)
n2 and n3 acts as router and implements droptail queue , p2p delay is set to 250ms
n3 is the sink node (sink application)

Attached to the email are the graph of the congestion window on n0 for NS-3.10 and NS-3.13

Is there any bug related to RTT calculation for NS-3.13?

Regards
Mauro Viudotto
Comment 1 Tommaso Pecorella 2012-02-01 14:58:04 EST
Created attachment 1324 [details]
Patch fixing the bug

Mauro is right, there's a bug.

When the RttEstimator is reset, the initial RttEstimation is reset to the wrong value: "1" instead of "Seconds(1)".

The attached patch fix the issue (along with a typo and two attribute descriptions).
Comment 2 Tommaso Pecorella 2012-02-02 19:11:57 EST
Can I push this change ? Comments ?

T.
Comment 3 Tom Henderson 2012-02-03 01:10:42 EST
(In reply to comment #2)
> Can I push this change ? Comments ?
> 
> T.

Regarding this change and comment:

-  est = 1; // XXX: we should go back to the 'initial value' here. Need to add support in Object for this.
+  SetEstimate(Seconds (1.0)); // XXX: we should go back to the 'initial value' here. Need to add support in Object for this.


I do not agree with the comment (I realize it has always been in this code, but we should delete it).  If we need to keep the initial value around, then we should not store the current estimate.  Rather, keep est and InitialEstimation separate state variables.

So, I recommend to:

1) make new state variable 'Time m_initialEst'.  

2) change the setter/getter for attribute InitialEstimation into methods SetInitialEstimate and GetInitialEstimate.  

3) est should initialize from GetInitialEstimate().

4) Reset() should set est to GetInitialEstimate().

5) est, minrto, nSamples should be private; will it break code to hide them?  While you are at it, name them properly (m_est, m_minRto, m_nSamples).

6) delete this XXX comment

Sorry for increasing the scope of the solution.
Comment 4 Tom Henderson 2012-02-03 01:17:59 EST
By the way, is anyone sure that this method works as expected?

void RttMeanDeviation::Measurement (Time m)
{
  if (nSamples)
    { // Not first
      int64x64_t err = m - est;
      est = est + gain * err;         // estimated rtt
      variance = variance + gain * (Abs (err) - variance); // variance of rtt


err is int64x64_t, m is Time, est is Time, gain is double.  Can a double multiply a Time?  why not make err a Time?

A good unit test of this method is probably needed...
Comment 5 Tommaso Pecorella 2012-02-03 16:01:20 EST
Created attachment 1326 [details]
Patch fixing the bug and many other things

Changed a lot of things.

1) the bug is gone
2) Code has been cleaned up:
2a) Variable names are m_xxx now
2b) Functions and classes have their documentation
2c) Attributes have their descriptions
2d) No inline functions
3) Code improvements with:
3a) No unneeded public variables (some are private tho)
3b) Ironed out some variables, e.g., double -> uint, int64x64 -> Time, etc.
4) Can't remember.

Hell, this code had only a small bug but for sure needed some cleaning.
Comment 6 Tom Henderson 2012-02-06 00:30:44 EST
(In reply to comment #5)
> Created attachment 1326 [details]
> Patch fixing the bug and many other things
> 
> Changed a lot of things.
> 
> 1) the bug is gone
> 2) Code has been cleaned up:
> 2a) Variable names are m_xxx now
> 2b) Functions and classes have their documentation
> 2c) Attributes have their descriptions
> 2d) No inline functions
> 3) Code improvements with:
> 3a) No unneeded public variables (some are private tho)
> 3b) Ironed out some variables, e.g., double -> uint, int64x64 -> Time, etc.
> 4) Can't remember.
> 
> Hell, this code had only a small bug but for sure needed some cleaning.

Thanks for doing this.  Two questions:

1) is this the extent of the patch or are you just omitting for brevity the TCP changes that would arise (GetEstimate -> GetCurrentEstimate)?

2) it seems like there is a lot of unnecessary mixing of int64x64_t and Time types, not encapsulating the implementation of Time as int64x64_t; could all of those be changed to Time?
Comment 7 Tommaso Pecorella 2012-02-06 13:05:14 EST
Hi,

short reply:
1) GetEstimate was never used in any part of the code, so for GetCurrentEstimate. It's a sort of dead code, and I'd had removed it if not for its usefulness for (future) TCP variants.

2) yes, you'r right. The reason is: originally almost everything was int64x64_t, and I kept it where it is of paramunt importance to expose the hidden Time data structure. The reason is: some calcs should (in my opinion) force the coder to be aware of the fact that the stuff he/she is working with is NOT a double. Since there are some multiplications there, it's good to keep it in mind.

This patch is not intended to be THE definitive patch of this code, as checking the original paper I found some quite interesting points where we can improve the code, e.g.;
a) m_variance should be m_mDev for real (mean deviation).
b) The gain should be different between two different calcs there, while right now it's the same one. 
c) The gain(s) are supposed to be 1/2^n, so to use shifts instead of casts and float math,

Before changing those, however, I wanted to do a double check about what's used in real Linux kernel, so to know if the code we have is a "pure" implementation of the paper's code or it does have a connection with Linux kernels.
In the first case I'll improve the code, in the second case I'll document the differences.

On the other hand, the latter part goes under "enhancements", while this is a bugfix... with some minor code enhancements mostly invisible to the general public.

Cheers,

Tommaso
Comment 8 Tommaso Pecorella 2012-02-13 12:09:28 EST
Sorry for the bump, but...

can I merge this ?

Cheers,

T.
Comment 9 Tom Henderson 2012-02-14 00:20:52 EST
(In reply to comment #8)
> Sorry for the bump, but...
> 
> can I merge this ?
> 
> Cheers,
> 
> T.

Sorry, I wasn't sure whether you were planning to do more research on this.

I am OK with the merge now.  thanks for doing extra work on it.  We do have a TCP maintainer who hasn't weighed in so I will remind him.

I would recommend after +1 from Adrian:
- keep the bug open (possibly renamed) with a summary of open issues that you think remain (including lack of any unit tests)
- edit CHANGES.html and RELEASE_NOTES accordingly
Comment 10 Tommaso Pecorella 2012-02-23 19:27:17 EST
I found out that there was an "old" bug where Adrian proposed a patch that is, basically, the very same.

The bug number is 1333, and it is still open.

The patches are functionally identical, with this only being more lengthy due to type changes, variable names changes and such.

Since the bug is affecting the TCP behaviour (and right now it's quite a bug), I'd ask kindly Adrian to give an opinion,

Cheers,

T.
Comment 11 Tommaso Pecorella 2012-03-09 14:38:05 EST
Can I merge ?
Comment 12 Tommaso Pecorella 2012-03-21 13:52:58 EDT
fixed in changeset 7776 - 19174ba45009
Comment 13 Tommaso Pecorella 2012-03-21 13:54:01 EDT
*** Bug 1333 has been marked as a duplicate of this bug. ***