Bug 1022 – Possibly inappropriate ASSERT in tcp-socket-impl.cc

Bug 1022 - Possibly inappropriate ASSERT in tcp-socket-impl.cc


Summary:	Possibly inappropriate ASSERT in tcp-socket-impl.cc

Status:	RESOLVED FIXED

Product:	ns-3
Classification:	Unclassified
Component:	internet
Version:	ns-3.9
Hardware:	All All

Importance:	P5 normal
Assigned To:	ns-bugs

URL:
Keywords:

Depends on:
Blocks:
	Show dependency tree / graph

Reported:	2010-11-03 16:38 EDT by Bill Roome
Modified:	2010-11-17 19:24 EST (History)
CC List:	1 user (show)

See Also:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Bill Roome 2010-11-03 16:38:43 EDT

In a long-running sim, the following assert failed -- lines 1516-1519 in tcp-socket-impl.cc, version 3.9 released:

      //buffer this, it'll be read by call to Recv
      UnAckData_t::iterator i = 
          m_bufferedData.find (tcpHeader.GetSequenceNumber () );
      NS_ASSERT(i == m_bufferedData.end ()); //no way it should have been found

This is in TcpSocketImpl::NewRx(), when it handles a duplicate packet.

I gather m_bufferedData holds data that the tcp stack has received from the peer, but which the application hasn't read yet. So this assert fails when the stack receives a duplicate packet before the application reads the original packet.

Granted that's unlikely, but is that really an error? Doesn't that just mean that the application is slow? Or (as I think happened in our case) the network was slow and the original packet and the duplicate arrived back-to-back?

So I think this assert isn't appropriate. If that condition is true, the stack should either discard the newly-arrived duplicate, or replace the original with the duplicate. I think that if we just remove the assert (and the iterator declaration) the subsequent code will do the latter, and the Ptr<> mechanism will release the original packet without a memory leak.

When we commented out that assert out and re-ran the sim, it completed without a problem.

Comment 1 Tom Henderson 2010-11-04 21:52:20 EDT

(In reply to comment #0)
> In a long-running sim, the following assert failed -- lines 1516-1519 in
> tcp-socket-impl.cc, version 3.9 released:
> 
>       //buffer this, it'll be read by call to Recv
>       UnAckData_t::iterator i = 
>           m_bufferedData.find (tcpHeader.GetSequenceNumber () );
>       NS_ASSERT(i == m_bufferedData.end ()); //no way it should have been found
> 
> This is in TcpSocketImpl::NewRx(), when it handles a duplicate packet.
> 
> I gather m_bufferedData holds data that the tcp stack has received from the
> peer, but which the application hasn't read yet. So this assert fails when the
> stack receives a duplicate packet before the application reads the original
> packet.
> 
> Granted that's unlikely, but is that really an error? Doesn't that just mean
> that the application is slow? Or (as I think happened in our case) the network
> was slow and the original packet and the duplicate arrived back-to-back?
> 
> So I think this assert isn't appropriate. If that condition is true, the stack
> should either discard the newly-arrived duplicate, or replace the original with
> the duplicate. I think that if we just remove the assert (and the iterator
> declaration) the subsequent code will do the latter, and the Ptr<> mechanism
> will release the original packet without a memory leak.
> 
> When we commented out that assert out and re-ran the sim, it completed without
> a problem.

I tend to agree with you.  

How long is your long-running test case (bytes transferred, link data rate, propagation delay)?

Comment 2 Bill Roome 2010-11-05 12:55:51 EDT

We're simulating a p2p sharing network with 800 nodes in 5 point-to-point-star clusters. 4 clusters are 100Gbps; 1 is 200 Kbps. We do not simulate packet loss, but (for obvious reasons!) the peers on the 200 Kbps cluster experience considerable delays.

The assert failed after about 1200 sec of sim time, or about a day of wall-clock time. Lord only knows how many gigabytes were transferred. When we removed the assert and tried again, it ran for almost 2 days of real time, and terminated cleanly at our scheduled stop time of 2000 sim seconds.

In other words, it's not a practical to use that as a regression test.

Comment 3 Tom Henderson 2010-11-17 19:24:24 EST

pushed in changeset 3e7336abae57