Bug 812

Summary: Assert when getting socket in RecvReply for AODV
Product: ns-3
Reporter: Kevin Peters <kevjay>
Component: routing
Assignee: Pavel Boyko <boyko>
Status: RESOLVED FIXED
Severity: critical
CC: jpelkey, kevjay, ns-bugs, sunnmy, yacoubmassad
Priority: P2
Version: ns-3.7
Hardware: All
OS: All
Attachments: Script to show assert in RecvReply
             Script to show assert in SendRerrMessage

Description Kevin Peters 2010-02-09 21:18:57 EST
Created attachment 759
Script to show assert in RecvReply

I've applied the preliminary patch provided at
http://www.nsnam.org/bugzilla/show_bug.cgi?id=772 because I ran into
that problem early on.  When my load testing reaches 40 non-mobile nodes, the
simulation fails at second 78 on NS_ASSERT (socket) in RecvReply:

assert failed. file=../src/routing/aodv/aodv-routing-protocol.cc, line=1131, cond="socket"
[New Thread 0x2aaf0569a150 (LWP 25388)]

Program received signal SIGSEGV, Segmentation fault.
0x00002aaf04e84665 in ns3::aodv::RoutingProtocol::RecvReply (this=0xaf44070, p={m_ptr = 0x7fffeecf2d80}, receiver={m_address = 167837976}, sender=
      {m_address = 167837954}) at ../src/routing/aodv/aodv-routing-protocol.cc:1131
1131      NS_ASSERT (socket);
(gdb) where
#0  0x00002aaf04e84665 in ns3::aodv::RoutingProtocol::RecvReply (this=0xaf44070, p={m_ptr = 0x7fffeecf2d80}, receiver={m_address = 167837976}, sender=
      {m_address = 167837954}) at ../src/routing/aodv/aodv-routing-protocol.cc:1131
#1  0x00002aaf04e872a8 in ns3::aodv::RoutingProtocol::RecvAodv (this=0xaf44070, socket={m_ptr = 0x7fffeecf2e20})
    at ../src/routing/aodv/aodv-routing-protocol.cc:748
#2  0x00002aaf04e968be in ns3::MemPtrCallbackImpl<ns3::aodv::RoutingProtocol*, void (ns3::aodv::RoutingProtocol::*)(ns3::Ptr<ns3::Socket>), void, ns3::Ptr<ns3::Socket>, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty>::operator() (this=0xaf67bf0, a1=
      {m_ptr = 0x7fffeecf2e70}) at debug/ns3/callback.h:223
#3  0x00002aaf04bbb512 in ns3::Callback<void, ns3::Ptr<ns3::Socket>, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty>::operator() (this=0xaf67ae8, a1={m_ptr = 0x7fffeecf2eb0}) at debug/ns3/callback.h:410
#4  0x00002aaf04bb7aed in ns3::Socket::NotifyDataRecv (this=0xaf67a80) at ../src/node/socket.cc:284
#5  0x00002aaf04c6045a in ns3::UdpSocketImpl::ForwardUp (this=0xaf67a80, packet={m_ptr = 0x7fffeecf3150}, ipv4={m_address = 167837954}, port=654)
    at ../src/internet-stack/udp-socket-impl.cc:602
#6  0x00002aaf04c67e1f in ns3::MemPtrCallbackImpl<ns3::Ptr<ns3::UdpSocketImpl>, void (ns3::UdpSocketImpl::*)(ns3::Ptr<ns3::Packet>, ns3::Ipv4Address, unsigned short), void, ns3::Ptr<ns3::Packet>, ns3::Ipv4Address, unsigned short, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty>::operator()
    (this=0xaf67c80, a1={m_ptr = 0x7fffeecf31a0}, a2={m_address = 167837954}, a3=654) at debug/ns3/callback.h:229
#7  0x00002aaf04c38ce0 in ns3::Callback<void, ns3::Ptr<ns3::Packet>, ns3::Ipv4Address, unsigned short, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty, ns3::empty>::operator() (this=0xaf67c38, a1={m_ptr = 0x7fffeecf3200}, a2={m_address = 167837954}, a3=654) at debug/ns3/callback.h:416
#8  0x00002aaf04c38577 in ns3::Ipv4EndPoint::DoForwardUp (this=0xaf67c20, p={m_ptr = 0x7fffeecf3250}, saddr={m_address = 167837954}, sport=654)
    at ../src/internet-stack/ipv4-end-point.cc:119
#9  0x00002aaf04c38704 in Notify (this=0xb018a90) at debug/ns3/make-event.h:167


I printed the destination and origin from the RREP header of the packet
that triggers this failure; both are legitimate addresses:

RREP destination 10.1.1.26 RREP origin 10.1.1.31
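
The raw m_address values in the backtrace can be read the same way. The following is a minimal sketch, assuming gdb prints the address as a 32-bit integer in host byte order; ToDottedQuad and CheckBacktraceAddresses are illustrative names, not ns-3 functions.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Format a 32-bit IPv4 address held in host byte order as a dotted quad.
// Assumption: the m_address values printed by gdb use this layout.
std::string ToDottedQuad(std::uint32_t address)
{
    return std::to_string((address >> 24) & 0xff) + "." +
           std::to_string((address >> 16) & 0xff) + "." +
           std::to_string((address >> 8) & 0xff) + "." +
           std::to_string(address & 0xff);
}

// Decode the two addresses shown in frame #0 of the backtrace.
void CheckBacktraceAddresses()
{
    assert(ToDottedQuad(167837976u) == "10.1.1.24"); // receiver
    assert(ToDottedQuad(167837954u) == "10.1.1.2");  // sender
}
```

Under that assumption the receiver is 10.1.1.24 and the sender is 10.1.1.2, which is the same 10.1.1.x range as the RREP addresses printed above.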

Attached is a script that reproduces the problem.  Starting directly at 40 nodes does not reproduce it; the script has to start at 10 nodes and build up to 40 nodes before the problem occurs.
Comment 1 Kevin Peters 2010-02-15 11:24:53 EST
Changed priority to P2-Critical because this is blocking the use of AODV for several people who were going to use AODV in their thesis.  We will have to abandon testing with AODV if it is not stable.
Comment 2 Pavel Boyko 2010-03-01 06:45:00 EST
(In reply to comment #1)
> Changed priority to P2-Critical because this is blocking the use of AODV for
> several people who were going to use AODV in their thesis.  We will have to
> abandon testing with AODV if it is not stable.

  Fixed in changeset 6090:1961fb8188e1. Please confirm that your scripts no longer fail (or reopen the bug with a new test case). Also, please take a look at bug 737, since it seriously affects broadcast collision probability in multihop wifi networks. I'd suggest either applying the patch attached to that bug when studying congested AODV scenarios, or not running AODV scenarios until that bug is fixed.
Comment 3 Kevin Peters 2010-03-05 13:57:53 EST
Thanks, I updated my ns3-7 code base with the three relevant AODV source files from that changeset.  I no longer see any crashes.  The only remaining problem seems to be specific to the flow monitor, which does not recognize the drop reason code used when a route is not found; that may be because I am still on ns3-7:

Unexpected drop reason code 5

Program received signal SIGSEGV, Segmentation fault.
0x6f9ad1f2 in ns3::Ipv4FlowProbe::DropLogger (this=0xbae8f8, ipHeader=@0x228b30, ipPayload={m_ptr = 0xbc2e00},
    reason=ns3::Ipv4L3Protocol::DROP_ROUTE_ERROR, ifIndex=0) at ../src/contrib/flow-monitor/ipv4-flow-probe.cc:167
167               NS_FATAL_ERROR ("Unexpected drop reason code " << reason);


I can easily get around this by disabling the flow monitor.
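
For illustration, the failure pattern in that trace (a drop logger that aborts on a reason code it does not know) can be modeled as follows. This is a minimal standalone sketch, not the flow-monitor source: only DROP_ROUTE_ERROR with value 5 is taken from the trace, and DropStats, RecordDrop and the placeholder enum entries are hypothetical.

```cpp
#include <cassert>

// Model of a drop-reason enum. Only DROP_ROUTE_ERROR = 5 comes from the
// trace above; the other entries are placeholders for illustration.
enum DropReason
{
    DROP_PLACEHOLDER_A = 1, // hypothetical
    DROP_PLACEHOLDER_B = 2, // hypothetical
    DROP_ROUTE_ERROR = 5
};

// Hypothetical counters kept by a drop logger.
struct DropStats
{
    unsigned counted = 0; // drops with a recognized reason
    unsigned unknown = 0; // drops with an unrecognized reason
};

// Record one drop. Unrecognized codes are tallied instead of aborting
// the whole simulation run.
void RecordDrop(DropStats &stats, int reason)
{
    switch (reason)
    {
    case DROP_PLACEHOLDER_A:
    case DROP_PLACEHOLDER_B:
    case DROP_ROUTE_ERROR:
        ++stats.counted;
        break;
    default:
        ++stats.unknown;
        break;
    }
}

// Usage: code 5 is counted, an arbitrary unknown code is only tallied.
void CheckRouteErrorIsCounted()
{
    DropStats stats;
    RecordDrop(stats, 5);
    RecordDrop(stats, 42);
    assert(stats.counted == 1 && stats.unknown == 1);
}
```

Whether an unknown code should be tallied, logged, or treated as fatal is a design choice; the sketch only shows that a non-fatal default keeps a long run alive.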

As far as bug 737, there are a lot of us here using ns-3 for MANET research.  We do not see very good packet delivery in higher node and data rate scenarios so are very interested if this bug is related.  Do you know when there will be a more final solution/patch or is it suggested to use the simple patch that is already attached to this bug?
Comment 4 Kevin Peters 2010-03-07 17:00:22 EST
I may have spoken too soon.  After extensive testing, I'm seeing another assertion on socket at line 1511 of aodv-routing-protocol.cc.  This time it's in SendRerrMessage.

So far, I've been unable to make a simple standalone script outside my performance testing that recreates this.  I will continue to work on this.  In the meantime, is this enough information to determine what could be wrong?
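
Both assertions follow the same pattern: a socket is looked up for a local address, the lookup yields nothing, and the assert fires. A minimal standalone model of that pattern is sketched below. It is not the ns-3 AODV code; SocketTable, SendOnMatchingSocket and the integer socket handles are hypothetical, and this report does not state what changeset 6090 changed or whether silently skipping the send is the right response.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-ins: a socket is an integer handle and a local
// interface address is a host-order 32-bit IPv4 address.
using SocketId = int;
using Ipv4 = std::uint32_t;

class SocketTable
{
public:
    void Add(Ipv4 local, SocketId socket) { m_byAddress[local] = socket; }
    void Remove(Ipv4 local) { m_byAddress.erase(local); }

    // Return a pointer to the socket bound to `local`, or nullptr if no
    // socket is registered for that address.
    const SocketId *Find(Ipv4 local) const
    {
        auto it = m_byAddress.find(local);
        return it == m_byAddress.end() ? nullptr : &it->second;
    }

private:
    std::map<Ipv4, SocketId> m_byAddress;
};

// Returns true if a matching socket was found (and would be used to send),
// false if the lookup failed. A failed lookup is handled, not asserted on.
bool SendOnMatchingSocket(const SocketTable &table, Ipv4 receiver)
{
    const SocketId *socket = table.Find(receiver);
    if (socket == nullptr)
    {
        return false;
    }
    // A real implementation would transmit on *socket here.
    return true;
}

// Usage: the lookup succeeds while the address is registered and fails
// once it has been removed.
void CheckLookupAfterRemoval()
{
    SocketTable table;
    table.Add(167837976u, 1);
    assert(SendOnMatchingSocket(table, 167837976u));
    table.Remove(167837976u);
    assert(!SendOnMatchingSocket(table, 167837976u));
}
```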
Comment 5 Kevin Peters 2010-03-08 12:34:14 EST
Created attachment 783
Script to show assert in SendRerrMessage

Unfortunately, I cannot reproduce the problem with a simple single simulation run.  Attached is a new script that will take quite some time to run.  By the time it gets to 40 nodes and 64Kbps send rate, the crash happens at second 149.
Comment 6 Pavel Boyko 2010-03-09 03:43:41 EST
(In reply to comment #3)
> As far as bug 737, there are a lot of us here using ns-3 for MANET research. 
> We do not see very good packet delivery in higher node and data rate scenarios
> so are very interested if this bug is related.  Do you know when there will be
> a more final solution/patch or is it suggested to use the simple patch that is
> already attached to this bug?

  We have been using the patch attached to bug 737 in our research for 3 months already, so I suggest you try it and report your experience at http://www.nsnam.org/bugzilla/show_bug.cgi?id=737

  Pavel
Comment 7 Pavel Boyko 2010-03-09 03:44:11 EST
  Thank you for the test case, I will take a look ASAP.

(In reply to comment #5)
> Created an attachment (id=783) [details]
> Script to show assert in SendRerrMessage
> 
> Unfortunately, I cannot reproduce the problem with a simple single simulation
> run.  Attached is a new script that will take quite some time to run.  By the
> time it gets to 40 nodes and 64Kbps send rate, the crash happens at second 149.
Comment 8 Pavel Boyko 2010-03-11 04:36:07 EST
> Created an attachment (id=783) [details]
> Script to show assert in SendRerrMessage

  I can't reproduce the bug. Your script finished successfully (in ~18 hours on my Core2 @ 2.80GHz with a debug build). Please confirm that you can reproduce this bug using the latest ns-3-dev and, if so, how long it takes. Do you need your output.csv?
Comment 9 Kevin Peters 2010-03-13 09:15:12 EST
(In reply to comment #8)
>   I can't reproduce the bug. your script has successfully finished (in ~18
> hours  on my Core2 @ 2.80GHz with debug build). Please confirm that you can
> reproduce this bug using latest ns-3-dev, and, if so, how long does it take. Do
> you need your output.csv?

I applied the simple patch from bug 737 (unfortunately did not see performance improvements) on my ns3-7 base with AODV from dev, and I see the crash occur even earlier (30 nodes at 40Kbps).  Do you know if there are any related fixes in dev that would fix what I am seeing?  It will take me some time to test in dev, and I need to stay on a stable release build for my research.  Thanks!
Comment 10 Elena Buchatskaya 2010-03-25 05:24:07 EDT
(In reply to comment #9)
> I applied the simple patch from bug 737 (unfortunately did not see performance
> improvements) on my ns3-7 base with AODV from dev, and I see the crash occur
> even earlier (30 nodes at 40Kbps).  Do you know if there are any related fixes
> in dev that would fix what I am seeing?  It will take me some time to test in
> dev, and I need to stay on a stable release build for my research.  Thanks!

I can't reproduce the bug. I use ns-3.7.1 with AODV from ns-dev plus the patch from bug 737. I tested your script with normal and optimized builds, and it finished successfully in both cases. Could you please provide a more detailed description of how to reproduce this bug?
Comment 11 Kevin Peters 2010-03-26 09:14:39 EDT
After upgrading to ns-3.7.1 and using the latest AODV from dev, I no longer see any crashes.  Thanks!