Bug 2250 - An assert fires when a large amount of data is being sent and received by nodes
Status: RESOLVED WONTFIX
Product: ns-3
Classification: Unclassified
Component: internet
Version: ns-3.24
Hardware: PC Linux
Importance: P5 major
Assigned To: Tommaso Pecorella
Keywords: bug
Reported: 2015-12-30 13:52 EST by Ubaid ur Rahman
Modified: 2016-01-02 09:25 EST (History)

Attachments
Screen shot of terminal when the assert happened (118.72 KB, image/png)
2015-12-30 13:52 EST, Ubaid ur Rahman
TCP socket base log (10.87 KB, application/x-gzip)
2016-01-02 08:25 EST, Ubaid ur Rahman

Description Ubaid ur Rahman 2015-12-30 13:52:11 EST
Created attachment 2219 [details]
Screen shot of terminal when the assert happened

I am trying to simulate VM communication on a single access network. The access network consists of one router and 20 nodes. Each point-to-point link uses IPv4 with a /30 prefix. Routing is Ipv4GlobalRouting.
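As an illustration of the addressing described above, here is a minimal sketch of how 20 consecutive /30 subnets could be carved out for the point-to-point links. The base address 10.1.0.0, the choice of which end gets which host address, and the names P2pSubnet, ToDotted and AllocateSlash30 are assumptions made for this example; the report does not state the actual addresses used.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One /30 point-to-point subnet: the network address plus its two usable hosts.
struct P2pSubnet {
    std::uint32_t network;   // e.g. 10.1.0.0
    std::uint32_t routerIf;  // network + 1 (assumed to be the router side)
    std::uint32_t nodeIf;    // network + 2 (assumed to be the node side)
};

// Render a 32-bit IPv4 address in dotted-decimal form.
std::string ToDotted(std::uint32_t a) {
    return std::to_string((a >> 24) & 0xFF) + "." + std::to_string((a >> 16) & 0xFF) + "." +
           std::to_string((a >> 8) & 0xFF) + "." + std::to_string(a & 0xFF);
}

// Carve `links` consecutive /30 subnets starting at `base` (hypothetical base address).
std::vector<P2pSubnet> AllocateSlash30(std::uint32_t base, unsigned links) {
    assert(base % 4 == 0);  // a /30 network address must be 4-aligned
    std::vector<P2pSubnet> out;
    for (unsigned i = 0; i < links; ++i) {
        const std::uint32_t net = base + 4u * i;  // each /30 spans 4 addresses
        out.push_back({net, net + 1, net + 2});
    }
    return out;
}
```

Calling AllocateSlash30((10u << 24) | (1u << 16), 20) yields one subnet per node, each with a router-side and a node-side address, which is the layout the description implies for a star of point-to-point links.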

The communication pattern is as follows: there are three classes of VM, acting as Producer, ConsumerProducer, and Consumer. I have one application that responds to requests; it is like a packet sink, except that it replies with the requested amount of data. A ConsumerProducer or Consumer can request data from the server.

There is no issue when only one or two VMs are communicating or requesting data. But when I increase the number to 10, I get this error:
"assert failed. cond="m_state == ESTABLISHED || m_state == SYN_RCVD", file=../src/internet/model/tcp-socket-base.cc, line=1554
terminate called without an active exception"

All 10 VMs run at the same time.

P.S. By VM I mean a combination of modified OnOff and PacketSink applications.
Comment 1 natale.patriciello 2015-12-31 05:35:15 EST
Hi Ubaid, 

Thank you for reporting this bug. However, we are missing some information:

- the code for the modified producer/consumer
- a simple script to trigger this bug
- the version of ns-3 you are using

Some comments: 

I presume that the error is triggered in DoPeerClose. This means that the node received a FIN while the connection was not established; please check that.

Nat
Comment 2 Ubaid ur Rahman 2015-12-31 10:45:35 EST
Hello,

Those files are missing because they are part of an extension that we are working on and that is not yet finished. We plan to publish it soon. However, I spent hours finding a temporary fix for it.

The temp fix:
Following Mr. Tommaso Pecorella's comment on the group, I increased the DataRate of the point-to-point links to far more than a single application's transmission rate (e.g. the transmission rate of OnOffApplication).

The problem was due to packet losses and timers expiring. TCP may not be adjusting its congestion window correctly to control the flow of traffic.

I'll explain the setup:

Take OnOffApplication and PacketSink, and enable the PacketSink to respond to requests. The OnOff should request more than 20 MB of data at its own rate (it sends the DataRate inside the packet). I set the OnOff transmission rate to 100 Mbps.

A basic network: 20 nodes in total, each connected to a router via a point-to-point link (set to 1 Gbps at first; now, with the temp fix, 10 Gbps). Two of the nodes were assigned the PacketSink and the rest were dynamically assigned the OnOff. The total number of OnOff applications was 30.
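To put rough numbers on this setup, the sketch below estimates the load that each responding sink would have to push onto its access link. It assumes, hypothetically, that the 30 requesters are split evenly over the 2 sinks, that all requests overlap in time, and that protocol overhead can be ignored; none of this is confirmed by the report, and the function names are made up for the example.

```cpp
#include <cassert>

// Aggregate rate (Mbps) one responding sink must send when `requesters`
// clients, spread evenly over `sinks`, each ask for data at `rateMbps`.
// The even split is an assumption, not something stated in the report.
double OfferedLoadPerSinkMbps(unsigned requesters, unsigned sinks, double rateMbps) {
    assert(sinks > 0);
    return (static_cast<double>(requesters) / sinks) * rateMbps;
}

// True when the offered load exceeds the capacity of the sink's access link.
bool LinkIsOversubscribed(double offeredMbps, double linkMbps) {
    return offeredMbps > linkMbps;
}
```

Under these assumptions, OfferedLoadPerSinkMbps(30, 2, 100.0) gives 1500 Mbps, which exceeds a 1 Gbps link but fits comfortably within a 10 Gbps one. That would be consistent with packet losses appearing at 1 Gbps and disappearing after the temp fix, though it says nothing about why the assert itself fires.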
Comment 3 natale.patriciello 2015-12-31 12:59:19 EST
(In reply to Ubaid ur Rahman from comment #2)
> Those files are missing because they are a part of extension that we are
> working on and is not yet finished. We plan to publish it soon. However, I
> had spent hours in finding a temporary fix for it.

OK, but keep in mind that if these classes do anything that "interferes" with TCP, I cannot establish whether the bug is present in the ns-3 release or is introduced by your code.
 
> The problem was due to packet losses and time expiring. The TCP may not be
> correctly adjusting its congestion window in order to control the flow of
> traffic.

I don't see why the congestion window plays a role in this. Can you set up an example script that reproduces this bug (e.g. place the 20 OnOff and the 2 sink applications on a simple setup, using the default OnOff and sink) and check whether it crashes?

Otherwise, can you please provide the following information:

- What is the value of m_state before the assert? (You can check it by inserting an NS_LOG_UNCOND statement before the assert.)
- Can you please produce a full log of TcpSocketBase? You can do that by typing:

export NS_LOG="TcpSocketBase=level_all|prefix_func|prefix_time"
./waf --run "your-experiment" 2>out.txt 
gzip out.txt

and then upload out.txt.gz.

Just to be sure: are you using ns-3.24? Can you test whether ns-3-dev still has this bug?

Thank you, and happy new year!
Comment 4 Ubaid ur Rahman 2016-01-01 02:16:58 EST
(In reply to natale.patriciello from comment #3)
> (In reply to Ubaid ur Rahman from comment #2)
> > Those files are missing because they are a part of extension that we are
> > working on and is not yet finished. We plan to publish it soon. However, I
> > had spent hours in finding a temporary fix for it.
> 
> Ok but keep in mind that, if these classes do anything that "interfere" with
> TCP, I cannot establish if the bug is present in the ns-3 release or is
> introduced by your code.

Looks like it was a false alarm, sorry for that. Upon deeper investigation of my classes, I found that at some point the link DataRate was reset to the default 100 Mbps. I have corrected it and it is working fine now.

> 
> Thank you, and happy new year!

Thank you for your support, and a happy new year to you too!
Comment 5 natale.patriciello 2016-01-02 07:11:20 EST
(In reply to Ubaid ur Rahman from comment #4)
> Looks like it was a false alarm, sorry for that, upon deep investigation of
> my classes at some point the link DataRate was reset to default 100Mbps.
> Corrected it now and it's working fine.

Be aware that, if a bug exists in ns-3, this "fix" did not resolve it. I'm closing the bug for now; if you want to help with the investigation, please provide the information I asked for in my post above (log and so on) and ask to reopen the bug.

Meanwhile, I wish you the best for your project!

Nat
Comment 6 Ubaid ur Rahman 2016-01-02 08:25:32 EST
Created attachment 2220 [details]
TCP socket base log
Comment 7 Ubaid ur Rahman 2016-01-02 08:27:28 EST
(In reply to Ubaid ur Rahman from comment #6)
> Created attachment 2220 [details]
> TCP socket base log

Hello,

Weird! When I added the global configuration line for Ipv4GlobalRouting::RandomEcmpRouting, I got the error again. I have attached the log file.
Comment 8 natale.patriciello 2016-01-02 09:05:46 EST
OK, from the beginning of the log there is no suspicious activity (it is a normal exchange between two nodes).

Then, near the end, node 4 closes its socket by sending a FIN and going into FIN_WAIT_1. The socket is then destroyed and another socket is created, which sends a SYN and enters SYN_SENT. This is clearly an error, because the socket shouldn't be destroyed so early.

Since the newer socket is bound to the same ip/port as the older one, it receives the FIN from the old communication partner (remember the FIN_WAIT_1 state?). The socket then crashes, since a FIN isn't expected in the SYN_SENT state.

How is your code creating and destroying sockets? The reason for the assert lies there.
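The sequence described in this comment can be replayed with a small toy model. This is only a sketch: ToySocket, ToyDemux and the port number are hypothetical simplifications and are not the ns-3 classes. The one element taken from the report is the asserted condition (ESTABLISHED or SYN_RCVD).

```cpp
#include <map>

// Subset of TCP states relevant to this thread (toy model, not the ns-3 enum).
enum class State { SYN_SENT, SYN_RCVD, ESTABLISHED, FIN_WAIT_1 };

struct ToySocket {
    State state;
};

// Mirrors the asserted condition quoted in the report. A real stack handles
// a FIN in other states through different code paths; this models one check.
bool CanHandlePeerFin(const ToySocket& s) {
    return s.state == State::ESTABLISHED || s.state == State::SYN_RCVD;
}

// Toy demultiplexer: a segment goes to whichever socket currently owns the
// local port (a hypothetical simplification of endpoint lookup).
class ToyDemux {
public:
    void Bind(unsigned short port, ToySocket* s) { m_byPort[port] = s; }
    void Unbind(unsigned short port) { m_byPort.erase(port); }
    // True if a FIN arriving on `port` reaches a socket able to handle it.
    bool DeliverFin(unsigned short port) const {
        auto it = m_byPort.find(port);
        return it != m_byPort.end() && CanHandlePeerFin(*it->second);
    }
private:
    std::map<unsigned short, ToySocket*> m_byPort;
};

// Replays the log: the old socket sends a FIN (FIN_WAIT_1) and is destroyed,
// then a new socket on the same port sends a SYN before the peer's FIN arrives.
bool ReplayEarlyDestroy() {
    ToyDemux demux;
    ToySocket oldSock{State::ESTABLISHED};
    demux.Bind(49153, &oldSock);        // port number is arbitrary
    oldSock.state = State::FIN_WAIT_1;  // our FIN has been sent
    demux.Unbind(49153);                // socket destroyed too early
    ToySocket newSock{State::SYN_SENT};
    demux.Bind(49153, &newSock);        // same ip/port reused
    return demux.DeliverFin(49153);     // peer's FIN reaches the new socket
}
```

In this model ReplayEarlyDestroy() returns false, which corresponds to the failed assert: the stale FIN lands on a socket in SYN_SENT. Keeping the old socket alive until the peer's FIN has been processed, or not reusing the same local port, would avoid the mismatch in the toy model.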
Comment 9 Ubaid ur Rahman 2016-01-02 09:25:55 EST
First, a socket is created to send a request for data. When the other application has finished sending the requested amount, it closes the socket. On close, the requesting application checks whether the amount of data received equals the amount I want. If yes, okay; otherwise it sends another request.

Now, in case the requesting application has to transmit some data to another VM, the sequence becomes:

First receive -> close socket -> create socket for the other receiver -> send data

I think I have to check the handlers for each close and read. Maybe the issue is there.