Bugzilla – Bug 1791
TCP Endpoint never deallocates when closing
Last modified: 2014-10-31 16:39:25 EDT
Created attachment 1700 [details]
Simple example using modified tcp-bulk-send.cc
The TcpSocketBase never deallocates its endpoint during an active close. In particular, it moves from FIN_WAIT1 through to TIME_WAIT and CLOSED without ever calling DeAllocateEndPoint. While it does call CloseAndNotify, this explicitly doesn't deallocate when in TIME_WAIT.
Attached is a modified version of tcp-bulk-send.cc that extends the total simulation time (i.e. > 2 * msl, so it has time to move to CLOSED), and turns on TcpSocketBase LOGIC logging. While Node 1 takes care of its Endpoint without a problem, Node0 never does.
The end result is the Ipv4EndpointDemux on the node can eventually fill up (though obviously not in this example).
So here is the code for CloseAndNotify:
if (m_state != TIME_WAIT)
m_closeNotified = true;
NS_LOG_INFO (TcpStateName[m_state] << " -> CLOSED");
m_state = CLOSED;
At the end of this function the socket is always closed so the endpoint can be deallocated correct? I don't see the point of the "if (m_state != TIME_WAIT)". You certainly don't want to delete the endpoint while in TIME_WAIT but at this point the socket is closing, there is no way out of this function without the socket being in state CLOSED, so why not always deallocate?
Created attachment 1733 [details]
Log with if condition removed
Log with if condition removed.
Created attachment 1734 [details]
Original log with if condition
Created attachment 1748 [details]
Patch to tcp-socket-base.cc
Patch to tcp-socket-base.cc to remove conditional for TIME_WAIT from CloseAndNotify.
Patch caused valgrind errors on tcp test case.
test.py -g --suite=tcp
For more info:
./waf --command-template="valgrind %s --suite=tcp" --run "test-runner"
Reverting. Needs to be looked into.
Created attachment 1907 [details]
Patch to fix tcp-socket-base.cc deallocation bug.
The if condition for TIME_WAIT state seems to be there to prevent self destruction while executing CloseAndNotify function: when socket pointer is removed from m_tcp->m_sockets, since it is the last reference, the destructor is invoked. Obviously, the socket cannot be destroyed while its member function is still executing. This problem does not seem to affect any other state as, I suppose, there is another reference somewhere else; probably because the TIME_WAIT state is only one that calls CloseAndNotify through the scheduler.
To get the destructor called only after the function ends, I created another reference to the socket. For me, this seems to do the trick.
As far as I can tell, Pedro's patch looks appropriate. From the example script, the endpoint for node 0 gets deallocated. Additionally, no new valgrind errors seem to be introduced.
I'll schedule to apply this patch next week, thanks.
Pedro's patch was pushed in changeset 11040:cd2eda848730
Created attachment 1911 [details]
revised patch for clean valgrind
Pedro contributed this cleaner fix for the valgrind issues; allowing the removal of the statement that creates a lingering reference:
Ptr<TcpSocketBase> l_ptr = (m_state == TIME_WAIT ? this : 0);
The rationale (provided by Pedro) is as follows:
Within CloseAndNotify function:
"CancelAllTimers ();" removal -> it is already called by DeallocateEndPoint function (unless there is a case where the timers are active and there is no endpoint);
"m_closeNotified = true;" move to within the if statement -> it just needs to be set to true if it isn't yet;
"DeallocateEndPoint ();" move to the end -> if this function is the last to be executed, there is no need to keep an additional reference.
The last point implies also minor changes to DeallocationEndPoint function:
"CancelAllTimers ();" move to the beginning of the if statements -> we cannot cancel timers after removing the last reference from m_tcp->m_sockets as the object will be freed;
else if instead of plain if statement -> "m_tcp->m_sockets.erase (it);" statement must be executed at last so the "else" prevents an additional check at m_endPoint6 pointer.
revised patch pushed in changeset 11044:6b153691af7c