Bugzilla – Bug 1900
time arithmetic consistency across platforms
Last modified: 2019-11-04 18:11:21 EST
The test suites 'global-routing' and 'routing-aodv-regression' are failing, in optimized builds only, on the 32-bit buildslaves running Fedora Core 19 and 20; e.g.
These two tests involve ascii trace file comparison so perhaps there are some subtle timing differences involved.
In bug 1868, Peter bisected this problem to revision 10139, which relates to mesh and wifi changes from last August. The problem later started to appear once Fedora Core 19 became available (gcc-4.8.2), and continued with Fedora Core 20. 64-bit builds do not seem to be affected.
I looked at routing-aodv-regression; the test fails because the timestamp of one of the packets is slightly off from the reference pcap.
I'm not sure if this is related to bug 1868 or bug 1761 yet.
Created attachment 1825 [details]
Minimal code where optimized and debug builds give different values.
I attached the minimal code to reproduce the problem here. I tested it with Fedora 19 32bit (gcc version 4.8.2 20131212 (Red Hat 4.8.2-7)).
Output for optimized build:
Output for debug build:
The optimized build gives the correct value. So I think there is overflow in the debug build of ns3::Time.
For the aodv test, the application happens to use these same values so the timing is a bit off between optimized and debug.
Another thing (even more important?), if you try the simple code:
Time nextTime (Seconds (0.15));
std::cout << nextTime << std::endl;
*both* optimized and debug builds print +149999999.0ns. At least on the machine
For routing-aodv-regression (both UDP and TCP):
The application starts at t = 1 s, where it schedules its next transmission (9600/64000) = 0.15 seconds later.
The optimized build calculates the time to +150000000.0ns.
The debug build calculates the time to +149999999.0ns.
However, ns-3 stores Seconds (0.15) as +149999999.0ns (as far as I can tell, on both 32- and 64-bit and in both optimized and debug builds). So the debug build passes the test but the optimized build fails, even though the optimized build actually gets the correct(?) time of +150000000.0ns.
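To illustrate how a tiny floating-point discrepancy can turn into a full nanosecond: if the conversion to integer nanoseconds truncates, two doubles that differ by a single unit in the last place can land on different integers. This is a plain C++ sketch, not ns-3 code. `TruncateToNs`, `RoundToNs` and `JustBelow150ms` are hypothetical helpers, and it is only an assumption that this is what separates the optimized and debug results here.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: drop the fractional part, as a plain integer cast does.
int64_t TruncateToNs(double ns)
{
  assert(ns >= 0.0);
  return static_cast<int64_t>(ns);
}

// Hypothetical helper: round to the nearest integer nanosecond instead.
int64_t RoundToNs(double ns)
{
  assert(ns >= 0.0);
  return static_cast<int64_t>(std::llround(ns));
}

// The largest double strictly below 150000000.0 (one unit in the last place lower).
double JustBelow150ms()
{
  return std::nextafter(150000000.0, 0.0);
}
```

With truncation, `TruncateToNs(150000000.0)` and `TruncateToNs(JustBelow150ms())` differ by 1 ns, whereas `RoundToNs` maps both inputs to the same value.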
Not sure if these will be of any help, but just in case (an asterisk marks a result that is 1 ns below the input):
Seconds(0.160000002) = +160000002.0ns
Seconds(0.160000001) = +160000001.0ns
Seconds(0.160000000) = +160000000.0ns
Seconds(0.159999999) = +159999999.0ns
Seconds(0.159999998) = +159999998.0ns
Seconds(0.150000002) = +150000001.0ns*
Seconds(0.150000001) = +150000000.0ns*
Seconds(0.150000000) = +149999999.0ns*
Seconds(0.149999999) = +149999998.0ns*
Seconds(0.149999998) = +149999997.0ns*
Seconds(0.140000002) = +140000002.0ns
Seconds(0.140000001) = +140000001.0ns
Seconds(0.140000000) = +140000000.0ns
Seconds(0.139999999) = +139999999.0ns
Seconds(0.139999998) = +139999997.0ns*
Seconds(0.139999997) = +139999996.0ns*
Seconds(0.139999996) = +139999995.0ns*
Seconds(0.139999995) = +139999994.0ns*
Seconds(0.130000002) = +130000002.0ns
Seconds(0.130000001) = +130000001.0ns
Seconds(0.130000000) = +130000000.0ns
Seconds(0.129999999) = +129999999.0ns
Seconds(0.129999998) = +129999998.0ns
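The 0.13, 0.14, 0.15 and 0.16 rows above are consistent with truncating the exact binary value of the argument multiplied by 1e9: the double nearest to 0.15 lies slightly below 0.15, while the doubles nearest to 0.13, 0.14 and 0.16 lie slightly above. The sketch below emulates that in plain C++ using `std::fma` to recover the rounding error of the product. It is not the ns-3 conversion code; `ExactTruncatedNs` is a hypothetical name, and the sketch assumes IEEE 754 doubles evaluated in double precision.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical emulation: floor of the *exact* product seconds * 1e9, where
// `seconds` is the double actually stored (not the decimal literal typed in).
int64_t ExactTruncatedNs(double seconds)
{
  assert(seconds >= 0.0 && seconds < 1e6);
  const double product = seconds * 1e9;                   // rounded product
  const double error = std::fma(seconds, 1e9, -product);  // exact product minus rounded product
  double whole = std::floor(product);
  if (whole == product && error < 0.0)
    {
      // The rounded product landed on an integer, but the exact value is just below it.
      whole -= 1.0;
    }
  return static_cast<int64_t>(whole);
}
```

Under these assumptions, `ExactTruncatedNs(0.15)` yields 149999999 while `ExactTruncatedNs(0.16)` yields 160000000, matching the corresponding rows of the table.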
For global-routing: same issue as routing-aodv-regression.
The debug build calculates the interval between packets of the on-off application as the correct value of 0.2 seconds, but the optimized build calculates it as 199999999.0ns. As a result, the optimized build is able to 'sneak in' one extra packet before Application::Stop is called. (With the correct timing, one packet is scheduled to be sent at exactly the time Application::Stop is called, so that packet is not sent.)
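The 'sneak in' effect can be sketched with integer nanoseconds alone. The start time (1 s) and stop time (10 s) used in the note below are made up for illustration and are not taken from the global-routing test; `CountSends` is a hypothetical helper that counts sends scheduled strictly before the stop time.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the first send happens one interval after start, and a
// send scheduled at or after stopNs does not happen.
int64_t CountSends(int64_t startNs, int64_t stopNs, int64_t intervalNs)
{
  assert(intervalNs > 0);
  int64_t count = 0;
  for (int64_t t = startNs + intervalNs; t < stopNs; t += intervalNs)
    {
      ++count;
    }
  return count;
}
```

With these illustrative values, an interval of 200000000 ns gives 44 sends, because the 45th falls exactly on the stop time. An interval of 199999999 ns gives 45, because the 45th now falls 45 ns before the stop time.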
Discussed this with Peter and Daniel; the conclusion was:
1) this bug to be split into two, since two different problems occur here. This bug will stay open to deal with the time arithmetic problems resulting from unstable arithmetic when Time objects are configured with floating point inputs. There may be some limitations in fixing this completely, but recommending use of integer arguments where possible as a best practice, and providing some more API for scalar multiplication options of time objects, may help. This is viewed as a post-ns-3.20 release fix.
2) the second problem is how to make the current tests and models pass consistently between debug and optimized across all platforms. Some tweaks to the tests and examples may accomplish this for ns-3.20.
3) open a new bug for resolving the crashing mesh code (probably unrelated to this issue).
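As a sketch of the integer-argument practice mentioned in item 1, the 0.15 s interval from the AODV case can be computed entirely in integer nanoseconds. This assumes that 9600 is a packet size in bits and 64000 a data rate in bit/s, which the comments above do not state explicitly; `TransmissionIntervalNs` is a hypothetical helper, not an ns-3 function.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: transmission time in whole nanoseconds, using only
// integer arithmetic (multiply first, then divide, to avoid losing precision).
int64_t TransmissionIntervalNs(int64_t packetBits, int64_t bitsPerSecond)
{
  assert(packetBits >= 0 && bitsPerSecond > 0);
  const int64_t nsPerSecond = 1000000000;
  return packetBits * nsPerSecond / bitsPerSecond;
}
```

`TransmissionIntervalNs(9600, 64000)` is exactly 150000000 regardless of build type or floating-point mode. In ns-3 the result could then be passed to `NanoSeconds ()` rather than going through `Seconds ()` with a double.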
(In reply to Tom Henderson from comment #7)
> 3) open a new bug for resolving the crashing mesh code (probably unrelated
> to this issue).
Actually, bug 1770 is already open for this separate issue.
I bisected the first AODV failing test (bug 772) to changeset 10637. It manifests itself in an interesting way, in that the only difference for UDP is the ephemeral port number used in the trace. I'll track down further what is going on with port allocation and try to find a consistent solution.
*** Bug 2558 has been marked as a duplicate of this bug. ***
Time should support the following arithmetic operations:
Time operator + (Time, Time);
Time operator - (Time, Time);
Time operator * (Time, int); // Looks like long int, long long int are also required
Time operator * (Time, int64x64_t);
Time operator * (Time, double); // Converts double to int64x64_t
int64x64_t operator / (Time, Time); // Retains 64 bit precision
// Convert to double
Time operator / (Time, int64x64_t); // Retains 64 bit precision
Time operator / (Time, double); // Converts double to int64x64_t
Time operator % (Time, Time); // Modular division
Time Div (Time, Time); // Modular division
Time Rem (Time, Time); // Modular remainder
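A minimal sketch of a subset of these operations on a toy type, to make the overload question concrete. `TinyTime` is a hypothetical stand-in that stores integer nanoseconds; it is not ns3::Time, it omits the int64x64_t overloads and Div/Rem, and rounding to nearest in the double overload is a choice made for this sketch only.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for a time value stored as integer nanoseconds.
struct TinyTime
{
  int64_t ns;
};

inline TinyTime operator+(TinyTime a, TinyTime b) { return TinyTime{a.ns + b.ns}; }
inline TinyTime operator-(TinyTime a, TinyTime b) { return TinyTime{a.ns - b.ns}; }

// Integer scaling is exact. Separate int and int64_t overloads are provided
// because, next to the double overload, a plain int argument would otherwise
// be ambiguous between converting to int64_t and converting to double.
inline TinyTime operator*(TinyTime t, int64_t k) { return TinyTime{t.ns * k}; }
inline TinyTime operator*(TinyTime t, int k) { return t * static_cast<int64_t>(k); }

// Scaling by a double rounds to the nearest nanosecond (sketch-only choice).
inline TinyTime operator*(TinyTime t, double k)
{
  return TinyTime{static_cast<int64_t>(std::llround(static_cast<double>(t.ns) * k))};
}

// Remainder of one time value with respect to another.
inline TinyTime operator%(TinyTime a, TinyTime b)
{
  assert(b.ns != 0);
  return TinyTime{a.ns % b.ns};
}

inline bool operator==(TinyTime a, TinyTime b) { return a.ns == b.ns; }
```

For example, `TinyTime{200000000} * 3` is exactly 600000000 ns, and `TinyTime{1000000000} % TinyTime{300000000}` is 100000000 ns. The need for the extra `int` overload matches the remark above that further integer overloads look to be required.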