Bug 1791

Summary: TCP Endpoint never deallocates when closing
Product: ns-3 Reporter: Marcel Flores <marcel-flores>
Component: internetAssignee: Tom Henderson <tomh>
Status: RESOLVED FIXED    
Severity: normal CC: bswenson3, j.ivey, ns-bugs, pmms, tomh
Priority: P5    
Version: ns-3.17   
Hardware: PC   
OS: All   
Attachments: Simple example using modified tcp-bulk-send.cc
Log with if condition removed
Original log with if condition
Patch to tcp-socket-base.cc
Patch to fix tcp-socket-base.cc deallocation bug.
revised patch for clean valgrind

Description Marcel Flores 2013-11-07 11:08:26 EST
Created attachment 1700 [details]
Simple example using modified tcp-bulk-send.cc

The TcpSocketBase never deallocates its endpoint during an active close. In particular, it moves from FIN_WAIT1 through to TIME_WAIT and CLOSED without ever calling DeAllocateEndPoint. While it does call CloseAndNotify, this explicitly doesn't deallocate when in TIME_WAIT.

Attached is a modified version of tcp-bulk-send.cc that extends the total simulation time (i.e. > 2 * msl, so it has time to move to CLOSED), and turns on TcpSocketBase LOGIC logging. While Node 1 takes care of its Endpoint without a problem, Node0 never does. 

The end result is the Ipv4EndpointDemux on the node can eventually fill up (though obviously not in this example).
Comment 1 Brian Swenson 2013-12-12 11:48:28 EST
So here is the code for CloseAndNotify:

void
TcpSocketBase::CloseAndNotify (void)
{
  NS_LOG_FUNCTION (this);

  if (!m_closeNotified)
    {
      NotifyNormalClose ();
    }
  if (m_state != TIME_WAIT)
    {
      DeallocateEndPoint ();
    }
  m_closeNotified = true;
  NS_LOG_INFO (TcpStateName[m_state] << " -> CLOSED");
  CancelAllTimers ();
  m_state = CLOSED;
}

At the end of this function the socket is always closed so the endpoint can be deallocated correct?  I don't see the point of the "if (m_state != TIME_WAIT)".  You certainly don't want to delete the endpoint while in TIME_WAIT but at this point the socket is closing, there is no way out of this function without the socket being in state CLOSED, so why not always deallocate?
Comment 2 Brian Swenson 2013-12-12 12:08:19 EST
Created attachment 1733 [details]
Log with if condition removed

Log with if condition removed.
Comment 3 Brian Swenson 2013-12-12 12:08:50 EST
Created attachment 1734 [details]
Original log with if condition
Comment 4 Jared Ivey 2013-12-24 11:17:53 EST
Created attachment 1748 [details]
Patch to tcp-socket-base.cc

Patch to tcp-socket-base.cc to remove conditional for TIME_WAIT from CloseAndNotify.
Comment 5 Brian Swenson 2014-04-22 11:55:18 EDT
10694:4af272d94cfd
Comment 6 Brian Swenson 2014-05-01 17:00:21 EDT
Patch caused valgrind errors on tcp test case.

test.py -g --suite=tcp

For more info:
./waf --command-template="valgrind %s --suite=tcp" --run "test-runner"

Reverting.  Needs to be looked into.

10771:aa839638df6c
Comment 7 Pedro Silva 2014-10-17 06:12:19 EDT
Created attachment 1907 [details]
Patch to fix tcp-socket-base.cc deallocation bug.

The if condition for TIME_WAIT state seems to be there to prevent self destruction while executing CloseAndNotify function: when socket pointer is removed from m_tcp->m_sockets, since it is the last reference, the destructor is invoked. Obviously, the socket cannot be destroyed while its member function is still executing. This problem does not seem to affect any other state as, I suppose, there is another reference somewhere else; probably because the TIME_WAIT state is only one that calls CloseAndNotify through the scheduler.

To get the destructor called only after the function ends, I created another reference to the socket. For me, this seems to do the trick.
Comment 8 Jared Ivey 2014-10-17 17:37:08 EDT
As far as I can tell, Pedro's patch looks appropriate. From the example script, the endpoint for node 0 gets deallocated. Additionally, no new valgrind errors seem to be introduced.
Comment 9 Tom Henderson 2014-10-18 16:58:06 EDT
I'll schedule to apply this patch next week, thanks.
Comment 10 Tom Henderson 2014-10-29 13:14:43 EDT
Pedro's patch was pushed in changeset 11040:cd2eda848730
Comment 11 Tom Henderson 2014-10-31 15:31:22 EDT
Created attachment 1911 [details]
revised patch for clean valgrind

Pedro contributed this cleaner fix for the valgrind issues; allowing the removal of the statement that creates a lingering reference:
  Ptr<TcpSocketBase> l_ptr = (m_state == TIME_WAIT ? this : 0);

The rationale (provided by Pedro) is as follows:

Within CloseAndNotify function:

    "CancelAllTimers ();" removal -> it is already called by DeallocateEndPoint function (unless there is a case where the timers are active and there is no endpoint);
    "m_closeNotified = true;" move to within the if statement -> it just needs to be set to true if it isn't yet;
    "DeallocateEndPoint ();" move to the end -> if this function is the last to be executed, there is no need to keep an additional reference.

The last point implies also minor changes to DeallocationEndPoint function:

    "CancelAllTimers ();" move to the beginning of the if statements -> we cannot cancel timers after removing the last reference from m_tcp->m_sockets as the object will be freed;
    else if instead of plain if statement -> "m_tcp->m_sockets.erase (it);" statement must be executed at last so the "else" prevents an additional check at m_endPoint6 pointer.
Comment 12 Tom Henderson 2014-10-31 16:39:25 EDT
revised patch pushed in changeset 11044:6b153691af7c