Bug 556 - DcfManager does not handle AckTimeout properly when there are multiple queues
: DcfManager does not handle AckTimeout properly when there are multiple queues
Status: RESOLVED FIXED
: ns-3
wifi
: ns-3-dev
: All All
: P5 normal
Assigned To:
:
:
:
:
  Show dependency treegraph
 
Reported: 2009-04-24 08:43 EDT by
Modified: 2009-06-08 07:39 EDT (History)


Attachments
the first version of patch (10.16 KB, text/plain)
2009-04-24 15:22 EDT, Kirill V. Andreev
Details
Second version of patch: changed names (11.58 KB, application/octet-stream)
2009-04-28 11:55 EDT, Kirill V. Andreev
Details
Final version of patch (11.66 KB, patch)
2009-04-29 06:04 EDT, Kirill V. Andreev
Details | Diff
Full patch: bugfix + test-cases (18.55 KB, patch)
2009-06-04 07:54 EDT, Kirill V. Andreev
Details | Diff
Infinite loop in shedule method example (734 bytes, application/octet-stream)
2009-06-08 06:22 EDT, Kirill V. Andreev
Details


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2009-04-24 08:43:16 EDT
see http://mailman.isi.edu/pipermail/ns-developers/2009-April/005808.html

To quote the original report:

---------------------------------
MacLow sends a packet while ACK timeout has not expired (In 802.11 standard in
chapter 9.2.8 was said, that we must wait ack timeout before next
transmission).
I have added NS_ASSERT(false) after line, which cancels ACK timeout in
CancelAllEvents () method in MacLow, which is called from StartTransmission ()
and my simulation has failed with tis assert.
When I have looked through output more detailed - I have seen that a packet
from another queue (beacon queue) has accessed the medium before ACK timeout of
data packet has expired after crash of ACK. I think, that the problem may be
resolved by adding a moment of time when AckTimeout expires in DcfManager. Now
DcfManager knows nothing about AckTimeout and can grant access before ACK
timeout has expired.
---------------------------------

my reply:
---------------------------------
I assume that you reference 9.9.1.3-c.1/2 from 802.11-2007: in the
future, when you are arguing against invalid implementation behavior, it
would be nice to point me to the spec sections rather than make me dig
out the relevant items: it would save you a lot of time to convince me
that what you are describing is a real bug.

To answer your point, yes, I think that you are right and that I was
wrong: the DcfManager appears to be in need of some extra knowledge
about pending transmissions waiting for an ack: I suspect that you will
have to rename MacLowNavListener to MacLowDcfListener and extend it to
report WaitAck and GotAck events. If you do this, you will have to
provide first a new testcase in dcf-manager-test.cc to show that it
fails with the current implementation, and, then, that the new
implementation passes that new testcase.
---------------------------------

the referenced sections of the standard:
---------------------------------
On specific slot boundaries, each EDCAF shall make a determination to perform
one and only one of the following functions:
   —    Initiate the transmission of a frame exchange sequence for that access
function.

[...]

The specific slot boundaries at which exactly one of these operations shall be
performed are defined as follows,
for each EDCAF:

[...]

When any other EDCAF at this STA transmitted a frame requiring acknowledgment,
the earlier of
1) The end of the ACK-Timeout interval timed from the PHY_TXEND.confirm
primitive, followed by AIFSN[AC] x aSlotTime + aSIFSTime – aRxTxTurnaroundTime
of idle medium,    and
2) The end of the first AIFSN[AC] × aSlotTime – aRxTxTurnaroundTime of idle
medium after SIFS (not necessarily medium idle during the SIFS duration, the
start of the SIFS duration implied by the length in the PLCP header of the
previous frame) when a PHY-RXEND.indication primitive occurs as specified in
9.2.8.
---------------------------------
------- Comment #1 From 2009-04-24 15:22:57 EDT -------
Created an attachment (id=429) [details]
the first version of patch

I have fixed this bug, but have not written a test for it. I would like to You
to make somecomments about it. As soon as I modify dcf-manager-test, I will
post another patch
------- Comment #2 From 2009-04-27 03:18:54 EDT -------
(From update of attachment 429 [details])
I don't know if the basic idea of adding a couple of extra 'end' variables is
correct: if the regression tests pass, then, it probably is fine.

Small details:

This should change should be completely unneeded, right ?

>-          if (tmp > Simulator::Now ())
>+          if (tmp.GetNanoSeconds () > Simulator::Now ().GetNanoSeconds ())

NS_LOG_UNCOND should not be used in the final patch:

>@@ -283,6 +283,7 @@
>   bool oneRunning = false;
>   if (m_normalAckTimeoutEvent.IsRunning ()) 
>     {
>+      NS_LOG_UNCOND(Simulator::Now ());
>       m_normalAckTimeoutEvent.Cancel ();
>       oneRunning = true;
>     }

How about prefixing these method names with "Notify" ? Don't forget to change
the name of the type NavListeners+CI too.

>+void
>+MacLow::AckTimeoutStartNow (Time duration)
>+{
>+  for (NavListenersCI i = m_navListeners.begin (); i != m_navListeners.end (); i++) 
>+    {
>+      (*i)->AckTimeoutStart (duration);
>+    }
>+}
>+void
>+MacLow::AckTimeoutResetNow ()
>+{
>+  for (NavListenersCI i = m_navListeners.begin (); i != m_navListeners.end (); i++) 
>+    {
>+      (*i)->AckTimeoutReset ();
>+    }
>+}
>+void
>+MacLow::CtsTimeoutStartNow (Time duration)
>+{
>+  for (NavListenersCI i = m_navListeners.begin (); i != m_navListeners.end (); i++) 
>+    {
>+      (*i)->CtsTimeoutStart (duration);
>+    }
>+}
>+void
>+MacLow::CtsTimeoutResetNow ()
>+{
>+  for (NavListenersCI i = m_navListeners.begin (); i != m_navListeners.end (); i++) 
>+    {
>+      (*i)->CtsTimeoutReset ();
>+    }
>+}

Here: s/NavListeners/DcfListeners/

>-  typedef std::vector<MacLowNavListener *>::const_iterator NavListenersCI;
>-  typedef std::vector<MacLowNavListener *> NavListeners;
>+  typedef std::vector<MacLowDcfListener *>::const_iterator NavListenersCI;
>+  typedef std::vector<MacLowDcfListener *> NavListeners;
>   NavListeners m_navListeners;
------- Comment #3 From 2009-04-27 06:35:21 EDT -------
I would like to tell some words about yhe following changes:
>-          if (tmp > Simulator::Now ())
>+          if (tmp.GetNanoSeconds () > Simulator::Now ().GetNanoSeconds ())
When I have not added this, the tmp was grater than Simulator::Now (), but when
I have printed the moments of time, the nanoseconds value turned out to be
equal. I have observed it because while (1) caused. That was strange a bit(I
did not look at the code of Time reailzation and how operator > was
implemented).


OK, all variables I will fix and attach another version of patch.
------- Comment #4 From 2009-04-27 06:49:11 EDT -------
(In reply to comment #3)
> I would like to tell some words about yhe following changes:
> >-          if (tmp > Simulator::Now ())
> >+          if (tmp.GetNanoSeconds () > Simulator::Now ().GetNanoSeconds ())
> When I have not added this, the tmp was grater than Simulator::Now (), but when
> I have printed the moments of time, the nanoseconds value turned out to be
> equal. I have observed it because while (1) caused. That was strange a bit(I

Time stores values with 64.64 precision and GetNanoSeconds throws away the
fractional part. It would be nice to know exactly why that behavior is
problematic in your use-case.
------- Comment #5 From 2009-04-28 11:55:52 EDT -------
Created an attachment (id=433) [details]
Second version of patch: changed names

I have still now way to solve the problem with comparsion between Time. I still
have to compare nanoseconds values rather than compare Time. I will think about
this problem.
------- Comment #6 From 2009-04-29 05:50:28 EDT -------
Well, it seems to me that I know why it does not work without GetNanoSeconds ()
incomparsion.
When we compare Time and Time - we use the whole 128 bits of data, and when we
call Schedule (...), it uses only 64 bits of data (it uses GetTimeStep ()),
when it inserts an event.
So. it occurs that we schedule event immediately and catch while(1), but the
moments of time we compared before were different, so the following code: (in
DcfManager::DoRestartAccessTimeoutIfNeeded)
m_accessTimeout = Simulator::Schedule (expectedBackoffDelay,
                                     &DcfManager::AccessTimeout, this);

shall be executed all the time.
So, as I think, replacing GetNanoSeconds () with GetTimeStap() is true solution
------- Comment #7 From 2009-04-29 06:04:48 EDT -------
Created an attachment (id=435) [details]
Final version of patch
------- Comment #8 From 2009-05-02 04:46:28 EDT -------
(In reply to comment #7)
> Created an attachment (id=435) [details] [details]
> Final version of patch
> 

ok, but as I said before, I won't apply any patch to this code without an
associated reduced testcase/patch in dcf-manager-test.cc. 
------- Comment #9 From 2009-05-19 09:42:27 EDT -------
any news on trying to get a reduced testcase ?
------- Comment #10 From 2009-05-20 11:47:20 EDT -------
(In reply to comment #9)
> any news on trying to get a reduced testcase ?
> 

Well, I was away all last week, and I need to pass some exams. But as soon as I
can returtn to work with NS, I will write a testcase (I think, I need about a
week to do it). Sorry for delay
------- Comment #11 From 2009-06-04 07:54:31 EDT -------
Created an attachment (id=455) [details]
Full patch: bugfix + test-cases

Test cases are described in picture in comments. All previous tests are filled
with ack-timeout starting and cancelling
------- Comment #12 From 2009-06-08 04:11:32 EDT -------
(In reply to comment #11)
> Created an attachment (id=455) [details] [details]
> Full patch: bugfix + test-cases
> 
> Test cases are described in picture in comments. All previous tests are filled
> with ack-timeout starting and cancelling
> 

Thank you very much: this looks totally awesome. A couple more comments (the
idea of touching that piece of code makes me crazy which is why I am so super
careful):

1) ./waf --regression does not pass: I am looking into it but you can beat me
to it.

2) I know that you are going to hate me but if I remove your calls to
GetTimeStep, "./waf check" still passes ("./waf --regression hangs"). If you
really want me to take these 2 lines in the patch, you would have to put
together a testcase which fails without these 2 lines so that we understand
really why they are here.
------- Comment #13 From 2009-06-08 04:14:48 EDT -------
(In reply to comment #12)

> 1) ./waf --regression does not pass: I am looking into it but you can beat me
> to it.

here is the change in output:
--- a   2009-06-08 10:13:20.000000000 +0200
+++ b   2009-06-08 10:13:44.000000000 +0200
@@ -3,10 +3,10 @@
 02:00:00.000291 291us tsft 6.0 Mb/s 2437 MHz (0x00c0) Acknowledgment
RA:00:00:00:00:00:04 (oui Ethernet) 
 02:00:00.000369 369us tsft 6.0 Mb/s 2437 MHz (0x00c0) Assoc Response AID(0) ::
Succesful
 02:00:00.000513 513us tsft 6.0 Mb/s 2437 MHz (0x00c0) -50dB signal -66dB noise
Acknowledgment RA:00:00:00:00:00:03 (oui Eth
ernet) 
-02:00:00.000950 950us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB
noise Assoc Request () [6.0 9.0 12.0 18.0 24.0 36
.0 48.0 54.0 Mbit]
-02:00:00.000966 966us tsft 6.0 Mb/s 2437 MHz (0x00c0) Acknowledgment
RA:00:00:00:00:00:05 (oui Ethernet) 
-02:00:00.001044 1044us tsft 6.0 Mb/s 2437 MHz (0x00c0) Assoc Response AID(0)
:: Succesful
-02:00:00.001188 1188us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB
noise Acknowledgment RA:00:00:00:00:00:03 (oui E
thernet) 
+02:00:00.000965 965us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB
noise Assoc Request () [6.0 9.0 12.0 18.0 24.0 36
.0 48.0 54.0 Mbit]
+02:00:00.000981 981us tsft 6.0 Mb/s 2437 MHz (0x00c0) Acknowledgment
RA:00:00:00:00:00:05 (oui Ethernet) 
+02:00:00.001059 1059us tsft 6.0 Mb/s 2437 MHz (0x00c0) Assoc Response AID(0)
:: Succesful
+02:00:00.001203 1203us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB
noise Acknowledgment RA:00:00:00:00:00:03 (oui E
thernet) 
 02:00:00.508948 508948us tsft 6.0 Mb/s 2437 MHz (0x00c0) -48dB signal -101dB
noise 00:00:00:00:00:07 (oui Ethernet) > 00:00
:00:00:00:04 (oui Ethernet) SNAP Unnumbered, ui, Flags [Command], length 524
 02:00:00.508964 508964us tsft 6.0 Mb/s 2437 MHz (0x00c0) Acknowledgment
RA:00:00:00:00:00:04 (oui Ethernet) 
 02:00:00.517140 517140us tsft 6.0 Mb/s 2437 MHz (0x00c0) -48dB signal -101dB
noise 00:00:00:00:00:07 (oui Ethernet) > 00:00
:00:00:00:04 (oui Ethernet) SNAP Unnumbered, ui, Flags [Command], length 524
------- Comment #14 From 2009-06-08 04:28:24 EDT -------
This is the only meaningful change: the assoc req is sent exactly 15us later

-02:00:00.000950 950us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB
noise Assoc Request () [6.0 9.0 12.0 18.0 24.0 36.0 48.0 54.0 Mbit]
+02:00:00.000965 965us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB
noise Assoc Request () [6.0 9.0 12.0 18.0 24.0 36.0 48.0 54.0 Mbit]
------- Comment #15 From 2009-06-08 05:42:59 EDT -------
(In reply to comment #14)
> This is the only meaningful change: the assoc req is sent exactly 15us later
> 
> -02:00:00.000950 950us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB
> noise Assoc Request () [6.0 9.0 12.0 18.0 24.0 36.0 48.0 54.0 Mbit]
> +02:00:00.000965 965us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB
> noise Assoc Request () [6.0 9.0 12.0 18.0 24.0 36.0 48.0 54.0 Mbit]
> 

In both cases, we observe the same scenario: ":05" sends assoc req twice to
":03", first fails with ack timeout because ":03" is busy receiving another req
from ":04", second is the one we see here. the difference between the two
scenarios is 15.666us.
------- Comment #16 From 2009-06-08 06:22:13 EDT -------
Created an attachment (id=457) [details]
Infinite loop in shedule method example

I have written a short example, which repeats inifinite loop situation in wifi.
Shedule simply ignores precision less than 1 nanosecond (GetTimeStep -- see
line 183 in rc/simulator/default-simulator-impl.cc). Similar sitation was
casued when Ack timeout expires.
------- Comment #17 From 2009-06-08 07:17:48 EDT -------
(In reply to comment #15)
> (In reply to comment #14)
> > This is the only meaningful change: the assoc req is sent exactly 15us later
> > 
> > -02:00:00.000950 950us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB
> > noise Assoc Request () [6.0 9.0 12.0 18.0 24.0 36.0 48.0 54.0 Mbit]
> > +02:00:00.000965 965us tsft 6.0 Mb/s 2437 MHz (0x00c0) -59dB signal -101dB
> > noise Assoc Request () [6.0 9.0 12.0 18.0 24.0 36.0 48.0 54.0 Mbit]
> > 
> 
> In both cases, we observe the same scenario: ":05" sends assoc req twice to
> ":03", first fails with ack timeout because ":03" is busy receiving another req
> from ":04", second is the one we see here. the difference between the two
> scenarios is 15.666us.
> 

Ok, this change is indeed compatible with the specification I believe: 15.666 =
9 (slot) + 2*1000 / 300000000 (round trip of 1000m if speed of light is
300000000m/s for max propagation delay) is the extra time which an ack timeout
lasts compared to the nav setting.
------- Comment #18 From 2009-06-08 07:39:23 EDT -------
changeset b2654e0f071d

The patch I finally pushed gets rid of the GetTimeStep calls and, instead,
modifies wifi-mac.cc: the problem was simply that all these values were
expected to be rounded to microseconds

diff -r 3135a5c24358 -r b2654e0f071d src/devices/wifi/wifi-mac.cc
--- a/src/devices/wifi/wifi-mac.cc    Mon Jun 08 09:43:47 2009 +0200
+++ b/src/devices/wifi/wifi-mac.cc    Mon Jun 08 13:37:30 2009 +0200
@@ -63,7 +63,7 @@
   */
   Time ctsTimeout = GetDefaultSifs ();
   ctsTimeout += GetDefaultCtsAckDelay ();
-  ctsTimeout += GetDefaultMaxPropagationDelay () * Scalar (2);
+  ctsTimeout += MicroSeconds (GetDefaultMaxPropagationDelay ().GetMicroSeconds
() * 2);
   ctsTimeout += GetDefaultSlot ();
   return ctsTimeout;
 }