Bug 451 - Race condition in ARP cache
: Race condition in ARP cache
Status: RESOLVED FIXED
: ns-3
internet-stack
: ns-3-dev
: All All
: P1 normal
Assigned To:
:
:
:
:
  Show dependency treegraph
 
Reported: 2008-12-19 13:44 EDT by
Modified: 2009-01-15 00:51 EDT (History)


Attachments
patch (3.03 KB, patch)
2008-12-22 07:09 EDT, Gustavo J. A. M. Carneiro
Details | Diff
minor tweaks to Gustavo's previous patch (4.18 KB, patch)
2008-12-30 14:38 EDT, Tom Henderson
Details | Diff


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2008-12-19 13:44:13 EDT
Logs:
[...]
22s [node 4 dev 0] ArpCache:HandleWaitReplyTimeout()
22s [node 4 dev 0] ArpCache:IsWaitReply()
22s [node 4 dev 0] ArpCache:IsExpired()
22s [node 4 dev 0] ArpCache:GetWaitReplyTimeout()
22s [node 4 dev 0] ArpCache:IsExpired(): delta=1s
22s [node 4 dev 0] ArpCache:GetRetries(10.1.1.2, 1)
22s [node 4 dev 0] ArpCache:HandleWaitReplyTimeout(): 22s [node 4 dev 0]
ArpCache:GetRetries(10.1.1.2, 1)
22s [node 4 dev 0] ArpCache:GetIpv4Address()
22s [node 4 dev 0] ArpCache:GetIpv4Address()
22s [node 4] ArpL3Protocol:SendArpRequest()
22s [node 4] ArpL3Protocol:SendArpRequest(): 22s [node 4 dev 0]
ArpCache:GetDevice()
22s [node 4 dev 0] ArpCache:GetInterface()
22s [node 4 dev 0] ArpCache:GetDevice()
ARP: sending request from node 4 || src: 02-06-d0:00:00:00:00:07 / 10.1.1.1 ||
dst: 02-06-ff:ff:ff:ff:ff:ff / 10.1.1.2
22s [node 4 dev 0] ArpCache:GetDevice()
22s [node 4 dev 0] ArpCache:GetInterface()
22s [node 4 dev 0] ArpCache:GetDevice()
22s [node 4 dev 0] ArpCache:GetDevice()
22s [node 4 dev 0] ArpCache:GetDevice()
22s [node 4 dev 0] ArpCache:IncrementRetries()
22s [node 4 dev 0] ArpCache:UpdateSeen(00-00-00, 10.1.1.2)
22s [node 4 dev 0] ArpCache:HandleWaitReplyTimeout(): Restarting WaitReplyTimer
at 22
23s [node 4] ArpL3Protocol:Lookup(10.1.1.2)
23s [node 4 dev 0] ArpCache:Lookup()
23s [node 4 dev 0] ArpCache:IsExpired()
23s [node 4 dev 0] ArpCache:GetWaitReplyTimeout()
23s [node 4 dev 0] ArpCache:IsExpired(): delta=1s
23s [node 4 dev 0] ArpCache:IsDead()
23s [node 4 dev 0] ArpCache:IsAlive()
23s [node 4 dev 0] ArpCache:IsWaitReply()
../src/internet-stack/arp-l3-protocol.cc:227: Test for possibly unreachable
code-- please file a bug report, with a test case, if this is ever hit

What I think is happening here is the following:

  0. (some past history that is not important)
  1. ARP retries to send request at t=22s
  2. At exactly t=23s, two separate events are scheduled to be executed
simultaneously:
    a) A new packet is queued for destination 10.1.1.2;
    b) The pending ARP resolution is retried (HandleWaitReplyTimeout).

The problem here seems to be that the event handler for a) is being executed
before the event for b).  If b) were to execute before a) then ARP would not
complain, because it would update m_lastSeen in the Arp entry, and so when the
new Lookup arrives the entry appears to be in WaitReply state but at least not
expired.  Because a) happens before b), the cache entry appears expired and in
WaitReply state, and so the assertion is triggered.

Suggestions for fixing this?  I can see two possible solutions, but I didn't
try either, neither do I have an opinion which one is better:

  1. remove the assertion and call UpdateWaitReply instead, as if the entry
wasn't expired;

  2. Change the expiration condition in ArpCache::Entry::IsExpired, use the
expression (delta > timeout) instead of (delta >= timeout).
------- Comment #1 From 2008-12-22 00:35:18 EDT -------
(In reply to comment #0)

> Suggestions for fixing this?  I can see two possible solutions, but I didn't
> try either, neither do I have an opinion which one is better:
> 
>   1. remove the assertion and call UpdateWaitReply instead, as if the entry
> wasn't expired;
> 
>   2. Change the expiration condition in ArpCache::Entry::IsExpired, use the
> expression (delta > timeout) instead of (delta >= timeout).
> 

I looked at your reasoning and agree that is probably what is going on.  I
would opt for option 2 since the present assertion seems to be helping to find
bugs.

--- a/src/internet-stack/arp-cache.cc   Thu Dec 18 22:33:33 2008 -0800
+++ b/src/internet-stack/arp-cache.cc   Sun Dec 21 21:37:09 2008 -0800
@@ -381,7 +381,7 @@ ArpCache::Entry::IsExpired (void)
     break;
   }
   Time delta = Simulator::Now () - m_lastSeen;
-  if (delta >= timeout) 
+  if (delta > timeout) 
     {
       return true;
     } 
------- Comment #2 From 2008-12-22 06:08:37 EDT -------
(In reply to comment #1)
> (In reply to comment #0)
> 
> > Suggestions for fixing this?  I can see two possible solutions, but I didn't
> > try either, neither do I have an opinion which one is better:
> > 
> >   1. remove the assertion and call UpdateWaitReply instead, as if the entry
> > wasn't expired;
> > 
> >   2. Change the expiration condition in ArpCache::Entry::IsExpired, use the
> > expression (delta > timeout) instead of (delta >= timeout).
> > 
> 
> I looked at your reasoning and agree that is probably what is going on.  I
> would opt for option 2 since the present assertion seems to be helping to find
> bugs.
> 
> --- a/src/internet-stack/arp-cache.cc   Thu Dec 18 22:33:33 2008 -0800
> +++ b/src/internet-stack/arp-cache.cc   Sun Dec 21 21:37:09 2008 -0800
> @@ -381,7 +381,7 @@ ArpCache::Entry::IsExpired (void)
>      break;
>    }
>    Time delta = Simulator::Now () - m_lastSeen;
> -  if (delta >= timeout) 
> +  if (delta > timeout) 
>      {
>        return true;
>      } 
> 

Hmm.. on second thought, if we do this then ARP retries might stop working,
because:

  1. At t=23s HandleWaitReplyTimeout is called;

  2. The expression:

      if (entry != 0 && entry->IsWaitReply () && entry->IsExpired ())

fails because entry->IsExpired () will return false.

I'll research / test a solution and report back later.
------- Comment #3 From 2008-12-22 07:09:08 EDT -------
Created an attachment (id=340) [details]
patch

I think this is a better way to fix.
------- Comment #4 From 2008-12-22 09:08:13 EDT -------
(In reply to comment #2)
> (In reply to comment #1)
> > (In reply to comment #0)
> > 
> > > Suggestions for fixing this?  I can see two possible solutions, but I didn't
> > > try either, neither do I have an opinion which one is better:
> > > 
> > >   1. remove the assertion and call UpdateWaitReply instead, as if the entry
> > > wasn't expired;
> > > 
> > >   2. Change the expiration condition in ArpCache::Entry::IsExpired, use the
> > > expression (delta > timeout) instead of (delta >= timeout).
> > > 
> > 
> > I looked at your reasoning and agree that is probably what is going on.  I
> > would opt for option 2 since the present assertion seems to be helping to find
> > bugs.
> > 
> > --- a/src/internet-stack/arp-cache.cc   Thu Dec 18 22:33:33 2008 -0800
> > +++ b/src/internet-stack/arp-cache.cc   Sun Dec 21 21:37:09 2008 -0800
> > @@ -381,7 +381,7 @@ ArpCache::Entry::IsExpired (void)
> >      break;
> >    }
> >    Time delta = Simulator::Now () - m_lastSeen;
> > -  if (delta >= timeout) 
> > +  if (delta > timeout) 
> >      {
> >        return true;
> >      } 
> > 
> 
> Hmm.. on second thought, if we do this then ARP retries might stop working,
> because:
> 
>   1. At t=23s HandleWaitReplyTimeout is called;
> 
>   2. The expression:
> 
>       if (entry != 0 && entry->IsWaitReply () && entry->IsExpired ())
> 
> fails because entry->IsExpired () will return false.

what about just deleting the test entry->IsExpired () in that statement?  I am
not seeing what it adds there. 

It seems to me that having to have nearly identical cases "IsExpired()" and
"IsNearlyExpired()" smells like there is probably something wrong with the
state machine logic, and we should fix that instead of resorting to another
method.
------- Comment #5 From 2008-12-22 09:20:17 EDT -------
(In reply to comment #4)
> (In reply to comment #2)
> > (In reply to comment #1)
> > > (In reply to comment #0)
> > > 
> > > > Suggestions for fixing this?  I can see two possible solutions, but I didn't
> > > > try either, neither do I have an opinion which one is better:
> > > > 
> > > >   1. remove the assertion and call UpdateWaitReply instead, as if the entry
> > > > wasn't expired;
> > > > 
> > > >   2. Change the expiration condition in ArpCache::Entry::IsExpired, use the
> > > > expression (delta > timeout) instead of (delta >= timeout).
> > > > 
> > > 
> > > I looked at your reasoning and agree that is probably what is going on.  I
> > > would opt for option 2 since the present assertion seems to be helping to find
> > > bugs.
> > > 
> > > --- a/src/internet-stack/arp-cache.cc   Thu Dec 18 22:33:33 2008 -0800
> > > +++ b/src/internet-stack/arp-cache.cc   Sun Dec 21 21:37:09 2008 -0800
> > > @@ -381,7 +381,7 @@ ArpCache::Entry::IsExpired (void)
> > >      break;
> > >    }
> > >    Time delta = Simulator::Now () - m_lastSeen;
> > > -  if (delta >= timeout) 
> > > +  if (delta > timeout) 
> > >      {
> > >        return true;
> > >      } 
> > > 
> > 
> > Hmm.. on second thought, if we do this then ARP retries might stop working,
> > because:
> > 
> >   1. At t=23s HandleWaitReplyTimeout is called;
> > 
> >   2. The expression:
> > 
> >       if (entry != 0 && entry->IsWaitReply () && entry->IsExpired ())
> > 
> > fails because entry->IsExpired () will return false.
> 
> what about just deleting the test entry->IsExpired () in that statement?  I am
> not seeing what it adds there. 

I think this is because there is a single timer for multiple ARP cache entries,
so the timer callback has to check each entry individually to see if it is
expired to decide whether to retransmit the ARP request.

> 
> It seems to me that having to have nearly identical cases "IsExpired()" and
> "IsNearlyExpired()" smells like there is probably something wrong with the
> state machine logic, and we should fix that instead of resorting to another
> method.
> 

OK.  But I really have no more clue, I'll leave it to the maintainer to figure
it out :P
------- Comment #6 From 2008-12-22 09:52:48 EDT -------
(In reply to comment #5)
> 
> I think this is because there is a single timer for multiple ARP cache entries,
> so the timer callback has to check each entry individually to see if it is
> expired to decide whether to retransmit the ARP request.

Yes, I think you are right.

> 
> > 
> > It seems to me that having to have nearly identical cases "IsExpired()" and
> > "IsNearlyExpired()" smells like there is probably something wrong with the
> > state machine logic, and we should fix that instead of resorting to another
> > method.
> > 
> 
> OK.  But I really have no more clue, I'll leave it to the maintainer to figure
> it out :P
> 

It seems to me that the public method IsExpired() should return true if delta >
timeout.  Internally within the arp cache, there seems to be a need to detect
the >= case, for retransmissions.  Maybe we can slightly adapt your patch by
moving IsNearlyExpired() to private and possibly renaming it to "IsExpiring()"
or "NeedsRetransmission()". 

I agree with the other aspect of your patch to make the GetTimeout() piece of
code a separate method. 
------- Comment #7 From 2008-12-30 14:38:58 EDT -------
Created an attachment (id=343) [details]
minor tweaks to Gustavo's previous patch

I ended up pretty much back at the same point as Gustavo's previous patch (just
changed "IsNearlyExpired()" to "IsExpiring()" and I could not make it private
since it is needed by ArpCache code.  This has a bit more doxygen surrounding
the two methods to allow users to distinguish.

Will transfer bug to Raj for approval and pushing to ns-3-dev.
------- Comment #8 From 2008-12-30 14:40:09 EDT -------
Waiting for review and push by Raj.
------- Comment #9 From 2009-01-12 15:14:12 EDT -------
4084:dd514b5e0bc6
------- Comment #10 From 2009-01-15 00:51:09 EDT -------
*** Bug 468 has been marked as a duplicate of this bug. ***