Bug 2211

Summary: Ipv{4,6}EndPoint can cause memory corruption
Product: ns-3 Reporter: Tommaso Pecorella <tommaso.pecorella>
Component: internetAssignee: Tommaso Pecorella <tommaso.pecorella>
Status: REOPENED ---    
Severity: normal CC: ammo6818, natale.patriciello, ns-bugs, tomh
Priority: P5    
Version: ns-3-dev   
Hardware: All   
OS: All   
Bug Depends on: 417    
Bug Blocks:    
Attachments: Alexander's patch (slightly modified)
Massif output without the proposed patch
Massif output with the proposed patcha
test program

Description Tommaso Pecorella 2015-11-08 06:22:16 EST
Alexander Krotov noted in ns-dev mailing list that the patch is actually opening a bug.

The short summary is that a class should always cancel its pending events when it's destroyed, and Ipv{4,6}EndPoint doesn't do that.

The solution could be to track the events, but this particular case is slightly different.
Due to changes in the Loopback NetDevice (back the days it wasn't even a NetDevice), the bug is invalid. I.e., using the Loopback NetDevice will not cause anymore an infinite recursion.
Hence, the obvious solution is to *not* have deferred events in Ipv{4,6}EndPoint.

The full mail is here:
----------------------
Ipv4EndPoint and Ipv6EndPoint classes contain DoForwardUp and
DoForwardIcmp functions that call m_rxCallback and m_icmpCallback.
ForwardUp and ForwardIcmp functions are just wrappers that call these
functions through ScheduleNow.  These changes can be tracked down to
commit [1], which references bug 417 [2].

The problem was that if both endpoints of a connection were on the same
node, calling m_rxCallback directly triggered immediate response. The
chain of events consisted only of function calls and could be arbitrarily
long, enough to cause stack overflow sometimes.

However, the change [1] introduced another bug. By the time DoForwardUp
is called, Ipv4Endpoint object may already be destroyed. For example,
if TCP RST segment arrives followed immediately by a data segment,
first segment will destroy Ipv4EndPoint and the second segment should
not be forwarded up at all.

This bug is hard to reproduce as it involves packet reordering,
simultaneous arrival of packets and some memory corruption that I will
describe below.  Still, it can happen in LTE or Wi-Fi, where packets
can be aggregated and delivered all at the same time.  Other people were
hit by this bug before [3]. Note that this bug was triggered in ns-3.17.

It seems this bug was attemted to fix in Ipv4EndPoint by checking
that m_rxCallback is not nullified in DoForwardUp [4].  Ipv4EndPoint
destructor nullifies all callbacks.  By the way, Ipv6EndPoint lacks
this modification.  From the log [5] you can see that this "bugfix" was
applied shortly after ns-3.16 release and the code was already present
in ns-3.17. Yet it was not enough to avoid segfault.

The problem is that by the time m_rxCallback.IsNull () check in
DoForwardUp is performed, it is either false, as the m_rxCallback.IsNull ()
was false in ForwardUp, or Ipv4EndPoint is already destroyed and memory
allocated for m_rxCallback is freed.  Usually memory is not reused yet
and you can get away with this use-after-free, but in some cases the
code that processes other simultaneous events will allocate this memory
and corrupt m_rxCallback. It makes the bug even harder to reproduce,
because you need a large-scale simulation that has many simultaneous
events.

I suggest to remove all these wrappers and just call m_rxCallback
immediately.  Ipv4EndPoint is not the right place for the bugfix, because
it is a transient object that may be destroyed before postponed events
are triggered.  Postponing callback itself is not a solution either,
because bound arguments may become invalid too.

One may suggest to cancel scheduled events in Ipv4EndPoint
destructor. However, it would require keeping track of all scheduled
DoForward{Up,Icmp} events until packets are actually forwarded to the
upper layer. Storing an EventId vector for each EndPoint is too expensive.

Logical place to stop the chain of function calls is when a packet is
transmitted. When packet is placed in the queue, the owner of the queue
should schedule or re-schedule TX event. Another trigger for the TX
event scheduling should be medium indicating that it is free if there
was previous transmission. Only one TX event per queue is enough.

There is some L2.5 (Traffic Control Layer) activity on the list that I have
not looked in yet. There is a high chance that just introducing it will fix
the loopback problem as there will be at least one Schedule(Now) call in the
loop.

Attached is a patch that removes all the wrapping from Ipv{4,6}EndPoint.
Callback nullifying is also removed from the destructor, as it misleadingly
suggests that IsNull can be checked later, while it is a use-after-free
actually.

[1] http://code.nsnam.org/ns-3-dev/rev/142c13a3975f
[2] https://www.nsnam.org/bugzilla/show_bug.cgi?id=417
[3] https://groups.google.com/forum/#!msg/ns-3-users/TbKuPTSEKi0/R3BYhs9KhQQJ
[4] http://code.nsnam.org/ns-3-dev/rev/55055ecdbd47
[5] http://code.nsnam.org/ns-3-dev/shortlog/9217
Comment 1 Tommaso Pecorella 2015-11-08 06:33:57 EST
Created attachment 2174 [details]
Alexander's patch (slightly modified)
Comment 2 Tom Henderson 2015-11-08 14:33:07 EST
I agree with the modified patch.  I also think it is useful to see if the loopback recursion can still be found with this patch applied, but if so, to still apply this patch and look for a different solution to any recursion problem found.
Comment 3 Tommaso Pecorella 2015-11-08 17:34:50 EST
Created attachment 2176 [details]
Massif output without the proposed patch
Comment 4 Tommaso Pecorella 2015-11-08 17:36:18 EST
Created attachment 2177 [details]
Massif output with the proposed patcha
Comment 5 Tommaso Pecorella 2015-11-08 17:38:16 EST
Created attachment 2178 [details]
test program

The command line to get the massif report is:
./waf --run scratch/first --command-template="valgrind --tool=massif --stacks=yes --time-unit=B %s"
Comment 6 Tommaso Pecorella 2015-11-08 17:41:18 EST
According to Massif, there's no infinite recursion anymore.
Just waiting for Nat's green lights.
Comment 7 natale.patriciello 2015-11-09 07:27:11 EST
While I have arguments to say "ScheduleNow () messes up everything, better to remove it", I wondering on offering a consistent look in all the internet files. If ForwardUp/DoForwardUp isn't needed anymore, I'd like to remove it also from other protocol files. Do you agree?


By the way, the reason I think ScheduleNow() is bad is because there isn't a strict ordering of events (if I remember well). I mean, I could do something like that:

static void foo (int *a) { *a = 5; }
static void bar (int *a) {delete a;}
static void print (int *a) { print *a; }

ScheduleNow (foo, a);
ScheduleNow (print, a);
ScheduleNow (bar, a);

and having a bad runtime error, while executing as

foo (a);
print (a);
bar (a);

is perfectly fine.
Comment 8 natale.patriciello 2015-11-09 08:00:04 EST
Also: we need to define "best pratices" in deconstructor. Should callbacks be zeroed?
Comment 9 Tommaso Pecorella 2015-11-09 08:33:16 EST
This goes a bit beyond the scope of this bug. However, let me share two cents (just for the sake of future readers).

ScheduleNow is a deferred call that happens at the same time of the caller. The execution order of multiple ScheduleNow is undefined, exactly as it's undefined the execution order of two or more scheduled events having the same execution time.

My rule of thumb is: if a function uses Schedule (including ScheduleNow) or is used by Schedule (including ScheduleNow) **must not** mess up with an object status beyond what is acceptable for other functions operating on the same object.

In other words, I'd consider function bar in your example a bad mistake. You're deleting an object you don't own and whose pointer is being used by other functions.

With respect to the statement "ScheduleNow () messes up everything, better to remove it", I could show you that yes, it can be removed. However, the equivalent code will be a complete mess. E.g., the function:
void foo (somehting)
{
   ...
   if (something) ScheduleNow (...);
   [more code here]
}

Will have to be rewritten to:
void foo (somehting)
{
   ...
   if (something) {Put up a flag to call the right function at the end of the function - save the right params so the code below will not change them}
   [more code here]
   if (that flag we did set before)
   {
     do here what I wanted to do in the ScheduleNow
   }
}

The point is that "more code here" can change the object status, and this is wanted / needed in many cases.



(In reply to natale.patriciello from comment #7)
> While I have arguments to say "ScheduleNow () messes up everything, better
> to remove it", I wondering on offering a consistent look in all the internet
> files. If ForwardUp/DoForwardUp isn't needed anymore, I'd like to remove it
> also from other protocol files. Do you agree?
> 
> 
> By the way, the reason I think ScheduleNow() is bad is because there isn't a
> strict ordering of events (if I remember well). I mean, I could do something
> like that:
> 
> static void foo (int *a) { *a = 5; }
> static void bar (int *a) {delete a;}
> static void print (int *a) { print *a; }
> 
> ScheduleNow (foo, a);
> ScheduleNow (print, a);
> ScheduleNow (bar, a);
> 
> and having a bad runtime error, while executing as
> 
> foo (a);
> print (a);
> bar (a);
> 
> is perfectly fine.
Comment 10 Tom Henderson 2015-11-09 10:49:25 EST
(In reply to Tommaso Pecorella from comment #9)
> This goes a bit beyond the scope of this bug. However, let me share two
> cents (just for the sake of future readers).
> 
> ScheduleNow is a deferred call that happens at the same time of the caller.
> The execution order of multiple ScheduleNow is undefined, exactly as it's
> undefined the execution order of two or more scheduled events having the
> same execution time.

It seems to me from reading the code now that the execution order is defined behavior (depending on the scheduler and simulator implementations).  Using the default simulator implementation and default map scheduler, I would expect that ScheduleNow (a), ScheduleNow (b), etc. would cause them to execute in that order, because the event Id assignment is not deferred and is used in the comparison key for map insertion (and the less-than operator is defined for EventKey to sort on the basis of uid if the timestamp is equal).  I don't see a unit test checking this, however. 

Natale, if you are using TimeStep to avoid reordering the segment transmissions, I would suggest that we remove that and debug why you might be getting reordering when you iteratively call ScheduleNow().
Comment 11 Tommaso Pecorella 2015-11-09 11:47:21 EST
(In reply to Tom Henderson from comment #10)
> (In reply to Tommaso Pecorella from comment #9)
> > This goes a bit beyond the scope of this bug. However, let me share two
> > cents (just for the sake of future readers).
> > 
> > ScheduleNow is a deferred call that happens at the same time of the caller.
> > The execution order of multiple ScheduleNow is undefined, exactly as it's
> > undefined the execution order of two or more scheduled events having the
> > same execution time.
> 
> It seems to me from reading the code now that the execution order is defined
> behavior (depending on the scheduler and simulator implementations).  Using
> the default simulator implementation and default map scheduler, I would
> expect that ScheduleNow (a), ScheduleNow (b), etc. would cause them to
> execute in that order, because the event Id assignment is not deferred and
> is used in the comparison key for map insertion (and the less-than operator
> is defined for EventKey to sort on the basis of uid if the timestamp is
> equal).  I don't see a unit test checking this, however. 
> 
> Natale, if you are using TimeStep to avoid reordering the segment
> transmissions, I would suggest that we remove that and debug why you might
> be getting reordering when you iteratively call ScheduleNow().

I'd propose to move the discussion to ns-dev and/or to another bug (if needed).

As a side note, then I said "undefined", I meant that it's not guaranteed that the order is preserved among different schedulers. The order itself is not random, it's just not guaranteed that there is a specific binding between the Schedule calls order and the Event execution order.
Comment 12 Tom Henderson 2015-11-09 12:34:55 EST
(In reply to Tommaso Pecorella from comment #11)
> (In reply to Tom Henderson from comment #10)
> > (In reply to Tommaso Pecorella from comment #9)
> > > This goes a bit beyond the scope of this bug. However, let me share two
> > > cents (just for the sake of future readers).
> > > 
> > > ScheduleNow is a deferred call that happens at the same time of the caller.
> > > The execution order of multiple ScheduleNow is undefined, exactly as it's
> > > undefined the execution order of two or more scheduled events having the
> > > same execution time.
> > 
> > It seems to me from reading the code now that the execution order is defined
> > behavior (depending on the scheduler and simulator implementations).  Using
> > the default simulator implementation and default map scheduler, I would
> > expect that ScheduleNow (a), ScheduleNow (b), etc. would cause them to
> > execute in that order, because the event Id assignment is not deferred and
> > is used in the comparison key for map insertion (and the less-than operator
> > is defined for EventKey to sort on the basis of uid if the timestamp is
> > equal).  I don't see a unit test checking this, however. 
> > 
> > Natale, if you are using TimeStep to avoid reordering the segment
> > transmissions, I would suggest that we remove that and debug why you might
> > be getting reordering when you iteratively call ScheduleNow().
> 
> I'd propose to move the discussion to ns-dev and/or to another bug (if
> needed).

I agree to move to a different bug; not sure that ns-dev discussion is needed?
> 
> As a side note, then I said "undefined", I meant that it's not guaranteed
> that the order is preserved among different schedulers. The order itself is
> not random, it's just not guaranteed that there is a specific binding
> between the Schedule calls order and the Event execution order.

I disagree; this should be guaranteed.  The doxygen says

   * Schedule an event to expire Now. All events scheduled to
   * to expire "Now" are scheduled FIFO, after all normal events
   * have expired.

so I would like to know if there are any instances where this is not true.  The schedulers should not change event ordering; different schedulers should just offer tradeoffs on event insertion/removal performance.
Comment 13 natale.patriciello 2015-11-10 04:47:37 EST
I would like to keep here only the discussion inherent to this bug.. so, my previous question is still unanswered:

If ForwardUp/DoForwardUp mechanism isn't needed anymore, I'd like to remove it also from TcpSocketBase, do you agree? If yes (there isn't neither any Simulator::Schedule involved...) the fix could hit ns-3-dev at the speed of light :)


====

BTW, scheduling of one timestep was due to this bug: https://www.nsnam.org/bugzilla/show_bug.cgi?id=2067 (check mail from Feb.2015, Problems with tcp-bulk-send Wireless and TCP issues-serious ones) which is not related to segment reordering but interactions with applications and sockets.

I'm still thinking that using ScheduleNow is dangerous, and (like in the example proposed) is used wrongly as "deferred" call. If you know that the event is happening *ever* at the same time, you don't need ScheduleNow, it is just to write good code. For all other cases, you use Schedule, which (eventually) could be happening at the same time (e.g. 0 propagation delay and infinite bandwidth in a simplenetdevice).

Everything IMHO, obviously :) 
Nat
Comment 14 Tommaso Pecorella 2015-11-10 04:52:22 EST
I'd open a new bug / enhancement for L4 (we could do the same changes to UDP and ICMP).
I guess that we'll need a lot of testing on those, but I'm not against the idea.

(In reply to natale.patriciello from comment #13)
> I would like to keep here only the discussion inherent to this bug.. so, my
> previous question is still unanswered:
> 
> If ForwardUp/DoForwardUp mechanism isn't needed anymore, I'd like to remove
> it also from TcpSocketBase, do you agree? If yes (there isn't neither any
> Simulator::Schedule involved...) the fix could hit ns-3-dev at the speed of
> light :)
> 
> 
> ====
> 
> BTW, scheduling of one timestep was due to this bug:
> https://www.nsnam.org/bugzilla/show_bug.cgi?id=2067 (check mail from
> Feb.2015, Problems with tcp-bulk-send Wireless and TCP issues-serious ones)
> which is not related to segment reordering but interactions with
> applications and sockets.
> 
> I'm still thinking that using ScheduleNow is dangerous, and (like in the
> example proposed) is used wrongly as "deferred" call. If you know that the
> event is happening *ever* at the same time, you don't need ScheduleNow, it
> is just to write good code. For all other cases, you use Schedule, which
> (eventually) could be happening at the same time (e.g. 0 propagation delay
> and infinite bandwidth in a simplenetdevice).
> 
> Everything IMHO, obviously :) 
> Nat
Comment 15 Tom Henderson 2015-11-11 23:45:49 EST
Is there anything more here to review before committing the slightly modified patch?  I opened 2214 for the TCP issue.  I did not include ForwardUp/DoForwardUp changes suggested by Natale (please open a new bug if you want that also).
Comment 16 Tommaso Pecorella 2015-11-12 03:52:28 EST
I was waiting for Natale's tests with "crafted" out of order TCP packets (involving a FIN, mainly).


(In reply to Tom Henderson from comment #15)
> Is there anything more here to review before committing the slightly
> modified patch?  I opened 2214 for the TCP issue.  I did not include
> ForwardUp/DoForwardUp changes suggested by Natale (please open a new bug if
> you want that also).
Comment 17 Tommaso Pecorella 2015-11-15 13:14:36 EST
changeset:   11766:61acba4a788a
Comment 18 Robert Ammon 2017-09-05 23:09:57 EDT
The bug identified in this report has not been fully resolved.  In testing the port of NS-3 to Visual Studio, there still remains a use after free bug.

In function TcpSocketBase::ReceivedData, the following code has a use after free error.  The error can be created using test-endpoint-bug2211.

The error occurs when NotifyDataRecv makes a callback to test-endpoint-bug2211 for the second packet.  In TcpEndPointBug2211Test::Recv, the socket is closed.  This frees all of the SocketBase object PTR members including m_rxBuffer which is then used in the second IF check.

  if (expectedSeq < m_rxBuffer->NextRxSequence ())
    { // NextRxSeq advanced, we have something to send to the app
      if (!m_shutdownRecv)
        {
          NotifyDataRecv ();
        }
      // Handle exceptions
      if (m_closeNotified)
        {
          NS_LOG_WARN ("Why TCP " << this << " got data after close notification?");
        }
      // If we received FIN before and now completed all "holes" in rx buffer,
      // invoke peer close procedure
      if (m_rxBuffer->Finished () && (tcpHeader.GetFlags () & TcpHeader::FIN) == 0)
        {
          DoPeerClose ();
          return;
        }
    }