Bug 2470

Summary: handshake to setup the Block Ack Agreement is not protected
Product: ns-3 Reporter: Tom Henderson <tomh>
Component: wifiAssignee: sebastien.deronne
Status: RESOLVED FIXED    
Severity: normal CC: krotov, muh.iqbal.cr, ns-bugs, redieteab.orange, selinis.g
Priority: P5    
Version: ns-3.29   
Hardware: All   
OS: All   
Attachments: proposed solution
Final patch
Final patch reworked

Description Tom Henderson 2016-08-07 14:54:49 EDT
When conditions exist to trigger a ADDBA Request/Response, packets to that remote address and TID are added to a list of QosBlockedDestinations (blocked until the handshake completes).

However, it may be the case that either the Request or Response exceed the retry limit, and is lost.  There is no mechanism to restart this handshake, and even subsequent frames that arrive will not trigger a new handshake attempt.  The destination/tid becomes blocked for the rest of the simulation.

In my opinion, some kind of a timer protecting this is needed, such that if a timeout expires without successful setup, the station either retries the handshake or unblocks the address/tid so that a subsequent AccessGranted event could trigger a new handshake.  Also, I recommend a test program that explicitly forces a loss of either the Request or Response and checks that the proper state is restored on both endpoints.
Comment 1 sebastien.deronne 2016-08-13 06:08:05 EDT
I agree we should have a kind of protection here.
Is there anything mentioned in the std about that?
Comment 2 sebastien.deronne 2016-11-01 11:38:46 EDT
We should unblock this thread.
Any one knows whether some kind of protection is implemented in the field?
Comment 3 sebastien.deronne 2016-11-23 14:26:42 EST
Tom, do you have a scenario to reproduce?
Comment 4 sebastien.deronne 2016-11-23 16:43:05 EST
(In reply to sebastien.deronne from comment #3)
> Tom, do you have a scenario to reproduce?

I could reproduce with some quick hack in the code.
Comment 5 sebastien.deronne 2016-11-23 16:53:21 EST
This might help:

http://linux-wireless.vger.kernel.narkive.com/izrVkBAO/patch-mac80211-fix-the-way-addba-request-count-being-modified

My view is that we should add a counter (parameter) of the maximum number of allowed ADDBA attempts (and 0 meaning infinite) iso using maxSlrc. When this maximum number of attempts is reached, A-MPDU should be disabled for that TID. Also, AP that does not support Block Ack should not reply to ADDBA, but this is maybe beyond the scope of this issue.

Any other view or complement on this?
Comment 6 Ioannis 2016-11-26 15:55:39 EST
I found the following

- IEEE-802.11-2012std (10.5.2.2 Procedure at the originator):

"If there is no response from the recipient within dot11ADDBAFailureTimeout, the STA has not established a Block Ack mechanism with the recipient STA; and the MLME shall issue an MLME-ADDBA.confirm primitive with a result code of TIMEOUT."

- Next Generation Wireless LANs: 802.11n and 802.11ac (2nd edition), E. Perahia, R. Stacey:

"The block ack session is initiated by the originator sending an ADDBA Request frame. In response to a correctly received ADDBA Request frame, the responder sends an ACK. After further processing the responder sends an ADDBA Response frame to which the originator responds with an ACK if correctly received. The originator and responder will retransmit ADDBA Request/Response if the expected ACK is not received. An inactivity timeout at the originator will detect a failed session setup." 

Regarding the SLRC (long retry counter) or SSRC (short retry counter) there are two sections in IEEE-802.11-2012std;
 
- 9.3.3 Random backoff time:
The SSRC shall be incremented when any short retry count (SRC) associated with any MPDU of type Data is incremented. The SLRC shall be incremented when any long retry count (LRC) associated with any MPDU of type Data is incremented. 

- 9.19.2.6 Retransmit procedures:
QoS STAs shall maintain a short retry counter and a long retry counter for each MSDU, A-MSDU, or MMPDU that belongs to a TC requiring acknowledgment. The initial value for the short and long retry counters shall be 0. QoS STAs also maintain a short retry counter and a long retry counter for each AC. They
are defined as QSRC[AC] and QLRC[AC], respectively, and each is initialized to a value of 0. After transmitting a frame that requires an immediate acknowledgment, the STA shall perform either of the acknowledgment procedures, as appropriate, that are defined in 9.3.2.8 and 9.21.3. The short retry count for an MSDU or A-MSDU that is not part of a Block Ack agreement or for an MMPDU shall be incremented every time transmission of a frame of length less than or equal to dot11RTSThreshold fails for that MSDU, A-MSDU, or MMPDU. QSRC[AC] shall be incremented every time transmission of an A-MPDU or frame of length less than or equal to dot11RTSThreshold fails. This short retry count and the QoS STA QSRC[AC] shall be reset when an A-MPDU or frame of length less than or equal to dot11RTSThreshold succeeds. The long retry count for an MSDU or A-MSDU that is not part of a Block Ack agreement or for an MMPDU shall be incremented every time transmission of a MAC frame of length greater than dot11RTSThreshold fails for that MSDU, A-MSDU, or MMPDU. QLRC[AC] shall be incremented every time transmission of an A-MPDU or frame of length greater than or equal to dot11RTSThreshold fails. This long retry count and the QLRC[AC] shall be reset when an A-MPDU or frame of length greater than dot11RTSThreshold succeeds. ...
Retries for failed transmission attempts shall continue until the short retry count for the MSDU, A-MSDU, or MMPDU is equal to dot11ShortRetryLimit or until the long retry count for the MSDU, A-MSDU, or MMPDU is equal to dot11LongRetryLimit. When either of these limits is reached, retry attempts shall cease, and the MSDU, A-MSDU, or MMPDU shall be discarded.


So, I think that indeed a timer and Ssrc are needed for these frames.
Comment 7 sebastien.deronne 2016-12-09 14:32:21 EST
What would happen if the timeout has elapsed? Does it continuously retry to establish the session? Also, I do not get your point wrt ssrc, could you please explain?
Comment 8 sebastien.deronne 2016-12-14 12:49:57 EST
Created attachment 2701 [details]
proposed solution

I made a patch with a proposed solution for this bug.

I am not sure there is a strict rule in the standard for this, but they do mention an addbafailuretimeout. I thus added a timeout in EdcaTxopN: once elapsed and no addba response was received yet, it restarts the addba request procedure.
Comment 9 Ioannis 2016-12-17 10:23:15 EST
(In reply to sebastien.deronne from comment #7)
> What would happen if the timeout has elapsed? Does it continuously retry to
> establish the session? 

I think that during that session, if the originator(recipient) doesn't receive an ack to establish the session, they retransmit the missed frame up to ssrc or dot11ADDBAFailureTimeout. So, if this timeout elapses the BlockAck is not established and the frames are transmitted as normal PPDUs. I'm not sure if after some time the nodes retry to establish the session (i.e. after BlockAckTimeout, since if BlockAckTimeout occurs the Block Ack is considered to be torn down (6.3.29.2.2 Semantics of the service primitive))


I haven't gone through the patch yet, but I think there is a typo in the attribute
"AddBaFailureTimeout",
Specifies a time limit (in multiple of 104 microseconds) should be 
Specifies a time limit (in multiple of 1024 microseconds)

and that the minimum value should be 1 (according to 6.3.29.2.2 Semantics of the service primitive).
Comment 10 sebastien.deronne 2016-12-17 10:27:05 EST
(In reply to Ioannis from comment #9)
> (In reply to sebastien.deronne from comment #7)
> > What would happen if the timeout has elapsed? Does it continuously retry to
> > establish the session? 
> 
> I think that during that session, if the originator(recipient) doesn't
> receive an ack to establish the session, they retransmit the missed frame up
> to ssrc or dot11ADDBAFailureTimeout. So, if this timeout elapses the
> BlockAck is not established and the frames are transmitted as normal PPDUs.
> I'm not sure if after some time the nodes retry to establish the session
> (i.e. after BlockAckTimeout, since if BlockAckTimeout occurs the Block Ack
> is considered to be torn down (6.3.29.2.2 Semantics of the service
> primitive))
> 

I understand the opposite: if the timeout elapsed, it starts a new BA session establishment, otherwise in a scenario with a lot of stations trying to establish a BA session we have a high chance to not be able to use Block ACKs, which is not really acceptable in my opinion.

> 
> I haven't gone through the patch yet, but I think there is a typo in the
> attribute
> "AddBaFailureTimeout",
> Specifies a time limit (in multiple of 104 microseconds) should be 
> Specifies a time limit (in multiple of 1024 microseconds)
> 
> and that the minimum value should be 1 (according to 6.3.29.2.2 Semantics of
> the service primitive).
Comment 11 Ioannis 2016-12-17 10:47:52 EST
(In reply to sebastien.deronne from comment #10)
> I understand the opposite: if the timeout elapsed, it starts a new BA
> session establishment, otherwise in a scenario with a lot of stations trying
> to establish a BA session we have a high chance to not be able to use Block
> ACKs, which is not really acceptable in my opinion.

You might be right but 
In a dense deployment under full buffer conditions, most of the nodes won't be able to establish the session. What if the nodes continuously retry to establish a BA session and end up in a situation where most of the frames flying around are management frames.
Comment 12 Tom Henderson 2016-12-20 01:35:28 EST
> I haven't gone through the patch yet, but I think there is a typo in the
> attribute
> "AddBaFailureTimeout",
> Specifies a time limit (in multiple of 104 microseconds) should be 
> Specifies a time limit (in multiple of 1024 microseconds)
> 
> and that the minimum value should be 1 (according to 6.3.29.2.2 Semantics of
> the service primitive).

I tend to agree on this point; why should we provide an option (special value of zero) to disable this behavior?
Comment 13 Tom Henderson 2016-12-20 01:42:55 EST
Regarding the rest of the patch and the patch on what to do upon a timeout, I might lean towards Ioannis's viewpoint that if the conditions do not support setting up agreement, that it may be more harmful to just try incessantly without trying instead normal PPDUs (at least for a while).  But I suspect we won't get clear guidance on this detail and we are just guessing.  I wonder if conditions that lead to such failure will also lead to disassociation, allowing reassociation and perhaps reset/retry at that point?

Also, I'd prefer to add a regression test case where these frames are forcibly dropped to check the behavior.
Comment 14 sebastien.deronne 2016-12-21 15:01:51 EST
(In reply to Tom Henderson from comment #12)
> > I haven't gone through the patch yet, but I think there is a typo in the
> > attribute
> > "AddBaFailureTimeout",
> > Specifies a time limit (in multiple of 104 microseconds) should be 
> > Specifies a time limit (in multiple of 1024 microseconds)
> > 
> > and that the minimum value should be 1 (according to 6.3.29.2.2 Semantics of
> > the service primitive).
> 
> I tend to agree on this point; why should we provide an option (special
> value of zero) to disable this behavior?

correct, standard says valid range starts from 1
Comment 15 sebastien.deronne 2016-12-21 15:04:16 EST
(In reply to Tom Henderson from comment #13)
> Regarding the rest of the patch and the patch on what to do upon a timeout,
> I might lean towards Ioannis's viewpoint that if the conditions do not
> support setting up agreement, that it may be more harmful to just try
> incessantly without trying instead normal PPDUs (at least for a while).  But
> I suspect we won't get clear guidance on this detail and we are just
> guessing.  I wonder if conditions that lead to such failure will also lead
> to disassociation, allowing reassociation and perhaps reset/retry at that
> point?
> 
> Also, I'd prefer to add a regression test case where these frames are
> forcibly dropped to check the behavior.

I indeed think best would be to restart association, to avoid blocking scenario. 

We are guessing what would be the most appropriate solution, standard does not mention anything and I have actually no idea what is implemented in practice.

I agree we should add a regression test case.
Comment 16 sebastien.deronne 2017-04-21 06:06:04 EDT
Still pending action: investigate how this work in practice.
Comment 17 Tom Henderson 2018-06-01 14:01:09 EDT
I found a useful summary here on block ack:
http://wirelesslearnings.blogspot.com/2015/09/block-ack-session.html
Comment 18 Tom Henderson 2018-06-02 11:14:30 EDT
Muhammad is working on this bug.  The approach is to add a minimal test case to the 'devices-wifi' test suite in wifi-test.cc that uses data sent via packet sockets to induce a successful agreement setup.  Then, force losses of specific packets to cause a failure.  Use NS_LOG to trace out the faulty behavior, and propose and implement a solution.  Once the solution is in place, add some NS_TEST macros to check that the association eventually completes despite the lost frames.

To induce specific frame drops, it may be convenient to add the 'ReceiveErrorModel' similar to the PointToPointNetDevice.
Comment 19 Ioannis 2018-06-02 12:28:58 EDT
(In reply to Tom Henderson from comment #18)
> Muhammad is working on this bug.  The approach is to add a minimal test case
> to the 'devices-wifi' test suite in wifi-test.cc that uses data sent via
> packet sockets to induce a successful agreement setup.  Then, force losses
> of specific packets to cause a failure.  Use NS_LOG to trace out the faulty
> behavior, and propose and implement a solution.  Once the solution is in
> place, add some NS_TEST macros to check that the association eventually
> completes despite the lost frames.
> 
> To induce specific frame drops, it may be convenient to add the
> 'ReceiveErrorModel' similar to the PointToPointNetDevice.

That would be great. When you say though "force losses", you mean to specific control frames during the Block-ACK establishment? 

A proposed solution will be based on the implementation of Block-ACK establishment as it occurs in real devices (Not sure if this has been addressed on real devices)? Is there any information on how a failed Block-ACK est. is managed?
Comment 20 Tom Henderson 2018-06-02 12:53:05 EDT
(In reply to Ioannis from comment #19)
> > To induce specific frame drops, it may be convenient to add the
> > 'ReceiveErrorModel' similar to the PointToPointNetDevice.
> 
> That would be great. When you say though "force losses", you mean to
> specific control frames during the Block-ACK establishment? 

Yes

> 
> A proposed solution will be based on the implementation of Block-ACK
> establishment as it occurs in real devices (Not sure if this has been
> addressed on real devices)? Is there any information on how a failed
> Block-ACK est. is managed?

So far, I couldn't obtain any specific information by talking to vendors, other than general feedback that the device would probably try again to establish.
Comment 21 Ioannis 2018-06-02 13:13:12 EDT
(In reply to Tom Henderson from comment #20)
> (In reply to Ioannis from comment #19)
> > > To induce specific frame drops, it may be convenient to add the
> > > 'ReceiveErrorModel' similar to the PointToPointNetDevice.
> > 
> > That would be great. When you say though "force losses", you mean to
> > specific control frames during the Block-ACK establishment? 
> 
> Yes
> 
> > 
> > A proposed solution will be based on the implementation of Block-ACK
> > establishment as it occurs in real devices (Not sure if this has been
> > addressed on real devices)? Is there any information on how a failed
> > Block-ACK est. is managed?
> 
> So far, I couldn't obtain any specific information by talking to vendors,
> other than general feedback that the device would probably try again to
> establish.

From a quick search that's what I've found (ERROR exceeded max RX BA sessions), the device re-tries again to establish Block-ACK after a short period of time, as you already mentioned. Within two consecutive tries, does the device send any data frame under normal-ack? I suppose that based on what I found it does not and waits until BA to be established. In any case, if BA cannot be established then I guess data frames would also have slim chances to be correctly received.

https://e2e.ti.com/support/wireless_connectivity/wilink_wifi_bluetooth/f/307/t/352435?WL127X-Producing-Errors-wl12xx-ERROR-exceeded-max-RX-BA-sessions-#
Comment 22 sebastien.deronne 2018-06-02 13:33:41 EDT
(In reply to Tom Henderson from comment #18)
> Muhammad is working on this bug.  The approach is to add a minimal test case
> to the 'devices-wifi' test suite in wifi-test.cc that uses data sent via
> packet sockets to induce a successful agreement setup.  Then, force losses
> of specific packets to cause a failure.  Use NS_LOG to trace out the faulty
> behavior, and propose and implement a solution.  Once the solution is in
> place, add some NS_TEST macros to check that the association eventually
> completes despite the lost frames.

Great, it sounds good!

> 
> To induce specific frame drops, it may be convenient to add the
> 'ReceiveErrorModel' similar to the PointToPointNetDevice.

Yes this is something we need, I sometimes do similar stuff manually, but if we would have a model for dropping frames having a given pattern, that would be very convenient.
Comment 23 sebastien.deronne 2018-06-02 13:35:15 EDT
(In reply to Tom Henderson from comment #20)
> (In reply to Ioannis from comment #19)
> > > To induce specific frame drops, it may be convenient to add the
> > > 'ReceiveErrorModel' similar to the PointToPointNetDevice.
> > 
> > That would be great. When you say though "force losses", you mean to
> > specific control frames during the Block-ACK establishment? 
> 
> Yes
> 
> > 
> > A proposed solution will be based on the implementation of Block-ACK
> > establishment as it occurs in real devices (Not sure if this has been
> > addressed on real devices)? Is there any information on how a failed
> > Block-ACK est. is managed?
> 
> So far, I couldn't obtain any specific information by talking to vendors,
> other than general feedback that the device would probably try again to
> establish.

For soft-mac devices, is this something implemented in firmware/hardware or on host? If it's on host, we could have a look in Linux driver how they implement that.
Comment 24 Muhammad Iqbal CR 2018-06-04 05:06:26 EDT
Thanks for the inputs, everyone.

I worked on the regression test and found out that since the ADDBA request is considered as a DATA and needs ACK, when it misses ACK it will be retransmitted until MaxSlrc. The SLRC usage aside, we can be confident to says that retransmission happen for ADDBA packets in ns-3. After exceeding MaxSlrc, ACK is declared as failed, the originator just stopped the transmission and the scheduled packets are lost.

You can check it yourself by cloning this branch and running addba-test test suite:
https://bitbucket.org/kyuucr/ns-3-dev-lbt/commits/branch/exp_addba-test

Also as a minor fix, I properly cancelled the BA agreement when missing the ADDBA request's ACK. This does not fix the lost packet, just providing proper closure.
Comment 25 Muhammad Iqbal CR 2018-06-10 11:40:42 EDT
I found out that the simulation always reply ADDBA request with success code. If I force a failure code, the transmission completely stopped similar to this case, indicating that this behavior is not handled.

I have an idea, if the standard does not strictly specified what happen on missing ADDBA request, does it specify what happen when the originator received failure code on ADDBA response? If it is, could we use that as a solution for this bug? Considering that the two cases is similar, which is the behavior when BA is requested but unavailable.

Sorry since I do not have the standard on me, I cannot find the specifics myself.
Comment 26 Tom Henderson 2018-06-10 20:42:45 EDT
(In reply to Muhammad Iqbal CR from comment #25)
> I found out that the simulation always reply ADDBA request with success
> code. If I force a failure code, the transmission completely stopped similar
> to this case, indicating that this behavior is not handled.
> 
> I have an idea, if the standard does not strictly specified what happen on
> missing ADDBA request, does it specify what happen when the originator
> received failure code on ADDBA response? If it is, could we use that as a
> solution for this bug? Considering that the two cases is similar, which is
> the behavior when BA is requested but unavailable.
> 

It may not be handled the same, because an explicit failure may indicate to the sender that it is pointless to retry, while the soft failure (lost frame) we are talking about here can be due to frame losses.

I suggest to try to implement the following behavior.  If the ADDBA handshake fails, unblock any packets that are held due to QosBlockedDestinations and send as normal MPDU.  This may require that a new timer is implemented to protect this handshake.  After this expiry, and after any unblocked MPDUs are sent, then return the state to original (i.e. conditions may again arise where the destination becomes blocked waiting for a new ADDBA handshake).
Comment 27 Muhammad Iqbal CR 2018-07-04 09:32:42 EDT
Hi everyone,

This is my attempt to fix this bug, please review the commits on this branch [1]. The first 3 commits on the branch are for preparation to the bug fix. The first commit is to introduce ErrorModel class to WifiPhy which used to force packet drop for the bug's unit test. The second commit changed BA state diagram on the comment block of originator-block-ack-agreement.h to verbatim block so it can be viewed on doxygen.

The third commit changed the behavior of GetBaAgreementExist to GetBaAgreementEstablished. Previously, MacLow and QosTxop used it to check if the agreement is exist, then send packets as AMPDU or do operations that related to the BA agreement. This could lead to the packets sent as AMPDU even if the agreement hasn't been established. It is also needed by the last commit so the originator can send packets as normal MPDU when the agreement is not established.

On the last commit I added two transitory states to the block ACK:

- NO_REPLY, which triggers when the transmission of ADDBA request has failed or when the ADDBA response is not received after a timeout; in this state the rest of the packet in the queue will be transmitted using normal MPDU for a period of time

- RESET, after NO_REPLY state timed-out, the BA agreement will transition to this state, so it can be recreated for the next queued packet

You can see the change to the state machine at this commit [2]. The unit test for this bug on is wifi-test.cc. Also, since I set the timeout arbirarily so I would like to hear your input. The timeout to wait for ADDBA response is 25 ms, and for dwelling in NO_REPLY state is 200 ms [3].

[1]: https://bitbucket.org/kyuucr/ns-3-dev-lbt/commits/branch/bug2470_rc1
[2]: https://bitbucket.org/kyuucr/ns-3-dev-lbt/commits/06531e169fc96abd447ff09a41ece80077af84ce#Lsrc/wifi/model/originator-block-ack-agreement.hF34
[3]: https://bitbucket.org/kyuucr/ns-3-dev-lbt/commits/06531e169fc96abd447ff09a41ece80077af84ce?at=bug2470_rc1#Lsrc/wifi/model/qos-txop.ccT55
Comment 28 Muhammad Iqbal CR 2018-07-10 23:09:35 EDT
Hi everyone,

I have revised the fix based on your comments, you can check it here: https://bitbucket.org/kyuucr/ns-3-dev-lbt/commits/branch/bug2470_rc2
Comment 29 Alexander Krotov 2018-07-23 06:44:31 EDT
(In reply to Muhammad Iqbal CR from comment #28)
> Hi everyone,
> 
> I have revised the fix based on your comments, you can check it here:
> https://bitbucket.org/kyuucr/ns-3-dev-lbt/commits/branch/bug2470_rc2

It does not compile for me. I had to #include "ns3/error-model.h" from wifi-phy.h (and remove ErrorModel forward declaration).
Comment 30 Alexander Krotov 2018-07-23 07:40:42 EDT
I have tested the patch with my experiments, works for me.
Comment 31 Alexander Krotov 2018-08-01 06:12:12 EDT
BlockAckManager::NeedBarRetransmission sometimes fails on assertion NS_ASSERT (it != m_agreements.end ());

I had replaced it with an "if" that returns "false" if agreement is not found as a quick fix.
Comment 32 Alexander Krotov 2018-08-01 06:46:22 EDT
It seems like enabling inactivity timeout with this patch causes a lot of assertion fails, in MacLow etc.
Comment 33 Muhammad Iqbal CR 2018-08-01 07:50:06 EDT
(In reply to Alexander Krotov from comment #32)
> It seems like enabling inactivity timeout with this patch causes a lot of
> assertion fails, in MacLow etc.

Yeah, I'm interested if the NO_REPLY and RESET timeout works properly in other cases. Can you provide test cases where it fails?
Comment 34 Alexander Krotov 2018-08-01 12:39:46 EDT
I don't have testcases, only some huge scenario.

Here is what happens: block ack request is transmitted, but then channel fails (let's say due to mobility STA goes away). Because of the bug https://www.nsnam.org/bugzilla/show_bug.cgi?id=2928 when inactivity timeout is 0, block ack request keeps being retransmitted forever. So it blocks the queue, even though I have data to transmit to other stations.

To fix this, I try to enable inactivity timeout. It worked before I applied your patch, but now it leads to assertion fails, because block ack agreement is removed on timeout and some things still expect it to be around (BlockAckManager::NeedBarRetransmission expects that it can find the agreement, for example).
Comment 35 sebastien.deronne 2018-08-02 16:56:03 EDT
I tend to agree with the patch.

But in the meantime, did someone get feedback on how this is implemented on real devices? Or is it very vendor specific?
Comment 36 Tom Henderson 2018-08-02 17:22:38 EDT
(In reply to sebastien.deronne from comment #35)
> I tend to agree with the patch.
> 
> But in the meantime, did someone get feedback on how this is implemented on
> real devices? Or is it very vendor specific?

I tried but could not obtain such feedback.
Comment 37 sebastien.deronne 2018-08-08 15:34:36 EDT
(In reply to Tom Henderson from comment #36)
> (In reply to sebastien.deronne from comment #35)
> > I tend to agree with the patch.
> > 
> > But in the meantime, did someone get feedback on how this is implemented on
> > real devices? Or is it very vendor specific?
> 
> I tried but could not obtain such feedback.

On QCA, I see they have a debug mode to disable ADDBA exchange and use normal ACKs, but this does not say whether they use this mode upon excessive transmission failures.
Comment 38 sebastien.deronne 2018-10-27 05:22:31 EDT
I see the last patch is from Muhammad, I'd like to have a conclusion on the design we will choose, so that I can rework and integrate Muhammad's patch to ns-3-dev.

In the meantime, did anybody encounter some code from a manufacturer that would provide information about how this is dealt in practice?
Comment 39 sebastien.deronne 2018-11-04 09:08:38 EST
I reviewed Muhammad's patches and I had a look at Linux mac80211 implementation.
I agree with Muhammad's patch which is in the end quite similar to what mac80211 implements, but there is a significant difference in the default timeout value.
This is typically 1 ms on x86_64 Linux machines, whereas a value of 25 ms has been selected here. The other timeout value seems acceptable.

I am now starting with merging Muhammad's work to ns-3-dev, and I would like to know whether this is fine for everybody to use a smaller timeout value (1ms)?
Comment 40 Tom Henderson 2018-11-04 09:48:38 EST
> I am now starting with merging Muhammad's work to ns-3-dev, and I would like
> to know whether this is fine for everybody to use a smaller timeout value
> (1ms)?

Agree.
Comment 41 sebastien.deronne 2018-11-04 10:26:22 EST
OK, I will prepare and post a final patch here before pushing.
Comment 42 sebastien.deronne 2018-11-04 11:58:58 EST
I already fixed all regression issues, but I still face 2 issues:

1/ Changing timeout to 1 ms is causing the second test case to fail. I am further investigating.

2/ Test is randomly crashing (fine if run alone), so I fear there is still some randomness somewhere. Any idea how I can try to solve this?
Comment 43 Tom Henderson 2018-11-04 13:09:23 EST
(In reply to sebastien.deronne from comment #42)
> I already fixed all regression issues, but I still face 2 issues:
> 
> 1/ Changing timeout to 1 ms is causing the second test case to fail. I am
> further investigating.
> 
> 2/ Test is randomly crashing (fine if run alone), so I fear there is still
> some randomness somewhere. Any idea how I can try to solve this?

Crashing or failing?

When faced with this, I use low-level test-runner:
./waf --command-template="gdb %s" --run test-runner
(gdb) r --suite=name-of-test-suite

If it is randomly crashing when run in the context of other tests, then force use of different RngRun values until you find a value that causes it (possibly NS_GLOBAL_VALUE="RngRun=value" ./waf ... ).  If this doesn't work, it is possible that some other upstream test changed a default value?  This can be possibly traced by modifying the test to print out default values via ConfigStore.
Comment 44 sebastien.deronne 2018-11-04 14:01:32 EST
It is not crashing but failing, randomly but almost always.
Comment 45 sebastien.deronne 2018-11-04 14:33:12 EST
If I disabled those two tests, this is working fine:

  //AddTestCase (new Bug2483TestCase, TestCase::QUICK); //Bug 2483
  //AddTestCase (new Bug2831TestCase, TestCase::QUICK); //Bug 2831

I have this at the begin of my test to make sure seed/run are correct:

  RngSeedManager::SetSeed (1);
  RngSeedManager::SetRun (1);

Is there something else than rng that can stay over test cases?
Comment 46 sebastien.deronne 2018-11-05 17:34:13 EST
Created attachment 3203 [details]
Final patch

I integrated and fixed last solution from Muhammad. If fine for everybody, I'll push this later this week.
Comment 47 sebastien.deronne 2018-11-06 03:22:29 EST
I would like to push it this week.
Comment 48 Muhammad Iqbal CR 2018-11-06 03:53:01 EST
(In reply to sebastien.deronne from comment #46)
> Created attachment 3203 [details]
> Final patch
> 
> I integrated and fixed last solution from Muhammad. If fine for everybody,
> I'll push this later this week.

I'm fine with the patch. Does removing the Config::SetDefault declaration fixes the test? I had some suspicion while working on my GSoC project a while ago, that it can broke tests after that declaration since the next tests after that will use that default attribute. Maybe we need to take a look at all of the test and remove those default config declaration.
Comment 49 sebastien.deronne 2018-11-06 04:04:14 EST
Yes all tests are passing. I will have a look further this week to clean all Config::Default in wifi tests.
Comment 50 Rediet 2018-11-06 06:28:27 EST
(In reply to sebastien.deronne from comment #46)
> Created attachment 3203 [details]
> Final patch
> 
> I integrated and fixed last solution from Muhammad. If fine for everybody,
> I'll push this later this week.

Minor comment:
- Line 105 of originator-block-ack-agreement.h, UNSUCCESSFUL state is still documented
Comment 51 sebastien.deronne 2018-11-06 06:34:28 EST
Good point, I still have to review the documentation.
Comment 52 sebastien.deronne 2018-11-06 12:11:19 EST
Created attachment 3204 [details]
Final patch reworked

Updated documentation and removed remaining Config::SetDefault in wifi test.
Comment 53 sebastien.deronne 2018-11-10 13:39:39 EST
Fixed in changeset 13850:aa1df67d0c6a