Bug 451 - Race condition in ARP cache
Race condition in ARP cache
Status: RESOLVED FIXED
Product: ns-3
Classification: Unclassified
Component: internet
ns-3-dev
All All
: P1 normal
Assigned To: Rajib Bhattacharjea
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-12-19 13:44 EST by Gustavo J. A. M. Carneiro
Modified: 2009-01-15 00:51 EST (History)
3 users (show)

See Also:


Attachments
patch (3.03 KB, patch)
2008-12-22 07:09 EST, Gustavo J. A. M. Carneiro
Details | Diff
minor tweaks to Gustavo's previous patch (4.18 KB, patch)
2008-12-30 14:38 EST, Tom Henderson
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Gustavo J. A. M. Carneiro 2008-12-19 13:44:13 EST
Logs:
[...]
22s [node 4 dev 0] ArpCache:HandleWaitReplyTimeout()
22s [node 4 dev 0] ArpCache:IsWaitReply()
22s [node 4 dev 0] ArpCache:IsExpired()
22s [node 4 dev 0] ArpCache:GetWaitReplyTimeout()
22s [node 4 dev 0] ArpCache:IsExpired(): delta=1s
22s [node 4 dev 0] ArpCache:GetRetries(10.1.1.2, 1)
22s [node 4 dev 0] ArpCache:HandleWaitReplyTimeout(): 22s [node 4 dev 0] ArpCache:GetRetries(10.1.1.2, 1)
22s [node 4 dev 0] ArpCache:GetIpv4Address()
22s [node 4 dev 0] ArpCache:GetIpv4Address()
22s [node 4] ArpL3Protocol:SendArpRequest()
22s [node 4] ArpL3Protocol:SendArpRequest(): 22s [node 4 dev 0] ArpCache:GetDevice()
22s [node 4 dev 0] ArpCache:GetInterface()
22s [node 4 dev 0] ArpCache:GetDevice()
ARP: sending request from node 4 || src: 02-06-d0:00:00:00:00:07 / 10.1.1.1 || dst: 02-06-ff:ff:ff:ff:ff:ff / 10.1.1.2
22s [node 4 dev 0] ArpCache:GetDevice()
22s [node 4 dev 0] ArpCache:GetInterface()
22s [node 4 dev 0] ArpCache:GetDevice()
22s [node 4 dev 0] ArpCache:GetDevice()
22s [node 4 dev 0] ArpCache:GetDevice()
22s [node 4 dev 0] ArpCache:IncrementRetries()
22s [node 4 dev 0] ArpCache:UpdateSeen(00-00-00, 10.1.1.2)
22s [node 4 dev 0] ArpCache:HandleWaitReplyTimeout(): Restarting WaitReplyTimer at 22
23s [node 4] ArpL3Protocol:Lookup(10.1.1.2)
23s [node 4 dev 0] ArpCache:Lookup()
23s [node 4 dev 0] ArpCache:IsExpired()
23s [node 4 dev 0] ArpCache:GetWaitReplyTimeout()
23s [node 4 dev 0] ArpCache:IsExpired(): delta=1s
23s [node 4 dev 0] ArpCache:IsDead()
23s [node 4 dev 0] ArpCache:IsAlive()
23s [node 4 dev 0] ArpCache:IsWaitReply()
../src/internet-stack/arp-l3-protocol.cc:227: Test for possibly unreachable code-- please file a bug report, with a test case, if this is ever hit

What I think is happening here is the following:

  0. (some past history that is not important)
  1. ARP retries to send request at t=22s
  2. At exactly t=23s, two separate events are scheduled to be executed simultaneously:
    a) A new packet is queued for destination 10.1.1.2;
    b) The pending ARP resolution is retried (HandleWaitReplyTimeout).

The problem here seems to be that the event handler for a) is being executed before the event for b).  If b) were to execute before a) then ARP would not complain, because it would update m_lastSeen in the Arp entry, and so when the new Lookup arrives the entry appears to be in WaitReply state but at least not expired.  Because a) happens before b), the cache entry appears expired and in WaitReply state, and so the assertion is triggered.

Suggestions for fixing this?  I can see two possible solutions, but I didn't try either, neither do I have an opinion which one is better:

  1. remove the assertion and call UpdateWaitReply instead, as if the entry wasn't expired;

  2. Change the expiration condition in ArpCache::Entry::IsExpired, use the expression (delta > timeout) instead of (delta >= timeout).
Comment 1 Tom Henderson 2008-12-22 00:35:18 EST
(In reply to comment #0)

> Suggestions for fixing this?  I can see two possible solutions, but I didn't
> try either, neither do I have an opinion which one is better:
> 
>   1. remove the assertion and call UpdateWaitReply instead, as if the entry
> wasn't expired;
> 
>   2. Change the expiration condition in ArpCache::Entry::IsExpired, use the
> expression (delta > timeout) instead of (delta >= timeout).
> 

I looked at your reasoning and agree that is probably what is going on.  I would opt for option 2 since the present assertion seems to be helping to find bugs.

--- a/src/internet-stack/arp-cache.cc   Thu Dec 18 22:33:33 2008 -0800
+++ b/src/internet-stack/arp-cache.cc   Sun Dec 21 21:37:09 2008 -0800
@@ -381,7 +381,7 @@ ArpCache::Entry::IsExpired (void)
     break;
   }
   Time delta = Simulator::Now () - m_lastSeen;
-  if (delta >= timeout) 
+  if (delta > timeout) 
     {
       return true;
     } 


Comment 2 Gustavo J. A. M. Carneiro 2008-12-22 06:08:37 EST
(In reply to comment #1)
> (In reply to comment #0)
> 
> > Suggestions for fixing this?  I can see two possible solutions, but I didn't
> > try either, neither do I have an opinion which one is better:
> > 
> >   1. remove the assertion and call UpdateWaitReply instead, as if the entry
> > wasn't expired;
> > 
> >   2. Change the expiration condition in ArpCache::Entry::IsExpired, use the
> > expression (delta > timeout) instead of (delta >= timeout).
> > 
> 
> I looked at your reasoning and agree that is probably what is going on.  I
> would opt for option 2 since the present assertion seems to be helping to find
> bugs.
> 
> --- a/src/internet-stack/arp-cache.cc   Thu Dec 18 22:33:33 2008 -0800
> +++ b/src/internet-stack/arp-cache.cc   Sun Dec 21 21:37:09 2008 -0800
> @@ -381,7 +381,7 @@ ArpCache::Entry::IsExpired (void)
>      break;
>    }
>    Time delta = Simulator::Now () - m_lastSeen;
> -  if (delta >= timeout) 
> +  if (delta > timeout) 
>      {
>        return true;
>      } 
> 

Hmm.. on second thought, if we do this then ARP retries might stop working, because:

  1. At t=23s HandleWaitReplyTimeout is called;

  2. The expression:

      if (entry != 0 && entry->IsWaitReply () && entry->IsExpired ())

fails because entry->IsExpired () will return false.

I'll research / test a solution and report back later.
Comment 3 Gustavo J. A. M. Carneiro 2008-12-22 07:09:08 EST
Created attachment 340 [details]
patch

I think this is a better way to fix.
Comment 4 Tom Henderson 2008-12-22 09:08:13 EST
(In reply to comment #2)
> (In reply to comment #1)
> > (In reply to comment #0)
> > 
> > > Suggestions for fixing this?  I can see two possible solutions, but I didn't
> > > try either, neither do I have an opinion which one is better:
> > > 
> > >   1. remove the assertion and call UpdateWaitReply instead, as if the entry
> > > wasn't expired;
> > > 
> > >   2. Change the expiration condition in ArpCache::Entry::IsExpired, use the
> > > expression (delta > timeout) instead of (delta >= timeout).
> > > 
> > 
> > I looked at your reasoning and agree that is probably what is going on.  I
> > would opt for option 2 since the present assertion seems to be helping to find
> > bugs.
> > 
> > --- a/src/internet-stack/arp-cache.cc   Thu Dec 18 22:33:33 2008 -0800
> > +++ b/src/internet-stack/arp-cache.cc   Sun Dec 21 21:37:09 2008 -0800
> > @@ -381,7 +381,7 @@ ArpCache::Entry::IsExpired (void)
> >      break;
> >    }
> >    Time delta = Simulator::Now () - m_lastSeen;
> > -  if (delta >= timeout) 
> > +  if (delta > timeout) 
> >      {
> >        return true;
> >      } 
> > 
> 
> Hmm.. on second thought, if we do this then ARP retries might stop working,
> because:
> 
>   1. At t=23s HandleWaitReplyTimeout is called;
> 
>   2. The expression:
> 
>       if (entry != 0 && entry->IsWaitReply () && entry->IsExpired ())
> 
> fails because entry->IsExpired () will return false.

what about just deleting the test entry->IsExpired () in that statement?  I am not seeing what it adds there. 

It seems to me that having to have nearly identical cases "IsExpired()" and "IsNearlyExpired()" smells like there is probably something wrong with the state machine logic, and we should fix that instead of resorting to another method.
Comment 5 Gustavo J. A. M. Carneiro 2008-12-22 09:20:17 EST
(In reply to comment #4)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > (In reply to comment #0)
> > > 
> > > > Suggestions for fixing this?  I can see two possible solutions, but I didn't
> > > > try either, neither do I have an opinion which one is better:
> > > > 
> > > >   1. remove the assertion and call UpdateWaitReply instead, as if the entry
> > > > wasn't expired;
> > > > 
> > > >   2. Change the expiration condition in ArpCache::Entry::IsExpired, use the
> > > > expression (delta > timeout) instead of (delta >= timeout).
> > > > 
> > > 
> > > I looked at your reasoning and agree that is probably what is going on.  I
> > > would opt for option 2 since the present assertion seems to be helping to find
> > > bugs.
> > > 
> > > --- a/src/internet-stack/arp-cache.cc   Thu Dec 18 22:33:33 2008 -0800
> > > +++ b/src/internet-stack/arp-cache.cc   Sun Dec 21 21:37:09 2008 -0800
> > > @@ -381,7 +381,7 @@ ArpCache::Entry::IsExpired (void)
> > >      break;
> > >    }
> > >    Time delta = Simulator::Now () - m_lastSeen;
> > > -  if (delta >= timeout) 
> > > +  if (delta > timeout) 
> > >      {
> > >        return true;
> > >      } 
> > > 
> > 
> > Hmm.. on second thought, if we do this then ARP retries might stop working,
> > because:
> > 
> >   1. At t=23s HandleWaitReplyTimeout is called;
> > 
> >   2. The expression:
> > 
> >       if (entry != 0 && entry->IsWaitReply () && entry->IsExpired ())
> > 
> > fails because entry->IsExpired () will return false.
> 
> what about just deleting the test entry->IsExpired () in that statement?  I am
> not seeing what it adds there. 

I think this is because there is a single timer for multiple ARP cache entries, so the timer callback has to check each entry individually to see if it is expired to decide whether to retransmit the ARP request.

> 
> It seems to me that having to have nearly identical cases "IsExpired()" and
> "IsNearlyExpired()" smells like there is probably something wrong with the
> state machine logic, and we should fix that instead of resorting to another
> method.
> 

OK.  But I really have no more clue, I'll leave it to the maintainer to figure it out :P
Comment 6 Tom Henderson 2008-12-22 09:52:48 EST
(In reply to comment #5)
> 
> I think this is because there is a single timer for multiple ARP cache entries,
> so the timer callback has to check each entry individually to see if it is
> expired to decide whether to retransmit the ARP request.

Yes, I think you are right.

> 
> > 
> > It seems to me that having to have nearly identical cases "IsExpired()" and
> > "IsNearlyExpired()" smells like there is probably something wrong with the
> > state machine logic, and we should fix that instead of resorting to another
> > method.
> > 
> 
> OK.  But I really have no more clue, I'll leave it to the maintainer to figure
> it out :P
> 

It seems to me that the public method IsExpired() should return true if delta > timeout.  Internally within the arp cache, there seems to be a need to detect the >= case, for retransmissions.  Maybe we can slightly adapt your patch by moving IsNearlyExpired() to private and possibly renaming it to "IsExpiring()" or "NeedsRetransmission()". 

I agree with the other aspect of your patch to make the GetTimeout() piece of code a separate method. 

Comment 7 Tom Henderson 2008-12-30 14:38:58 EST
Created attachment 343 [details]
minor tweaks to Gustavo's previous patch

I ended up pretty much back at the same point as Gustavo's previous patch (just changed "IsNearlyExpired()" to "IsExpiring()" and I could not make it private since it is needed by ArpCache code.  This has a bit more doxygen surrounding the two methods to allow users to distinguish.

Will transfer bug to Raj for approval and pushing to ns-3-dev.
Comment 8 Tom Henderson 2008-12-30 14:40:09 EST
Waiting for review and push by Raj.
Comment 9 Rajib Bhattacharjea 2009-01-12 15:14:12 EST
4084:dd514b5e0bc6
Comment 10 Tom Henderson 2009-01-15 00:51:09 EST
*** Bug 468 has been marked as a duplicate of this bug. ***