Bug 939 - EmuNetDevice uses too much memory when reading packet bursts
EmuNetDevice uses too much memory when reading packet bursts
Status: RESOLVED FIXED
Product: ns-3
Classification: Unclassified
Component: devices
ns-3-dev
All All
: P5 normal
Assigned To: ns-bugs
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-06-09 10:10 EDT by Gustavo J. A. M. Carneiro
Modified: 2010-08-18 01:17 EDT (History)
4 users (show)

See Also:


Attachments
patch (2.69 KB, patch)
2010-06-09 10:10 EDT, Gustavo J. A. M. Carneiro
Details | Diff
additional patch: add RxQueueSize attribute (1.77 KB, patch)
2010-06-24 13:49 EDT, Gustavo J. A. M. Carneiro
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Gustavo J. A. M. Carneiro 2010-06-09 10:10:30 EDT
Created attachment 915 [details]
patch

EmuNetDevice has a "read thread" that does a malloc, and a main simulation thread that converts the data buffer to a Packet and then frees the buffer.

If there is a lot of packets to read from the network, it happens that the read thread easily receives the packets and allocates memory to them, but the simulation thread is much slower to process them.  The result is a lot of malloc'ed buffers that are not freed for a long time, leading to the virtual memory of the process to grow enormously.

The attached patch limits the number of concurrent pending reads and so improves the memory profile considerably.

PS: the read packet burst happens also when transmitting packet bursts, due to an unfortunate side effect of putting the interface in promiscuous mode, which then receives back packets that we transmitted in the first place :(
Comment 1 Gustavo J. A. M. Carneiro 2010-06-24 06:08:47 EDT
In heavy load experiments I have found the patch to cause EmuNetDevice to miss too many packets.  I changed 

#define MAX_PENDING_READS 1

to

#define MAX_PENDING_READS 800

This solves the performance issue.  No, I don't know what is the "best" value, but that value seems to work pretty well.

Contrary to what I expected, I guess kernel-side buffering of packets is not that great.  But without the patch, userspace buffering of packets is infinite, not that great either...
Comment 2 Craig Dowell 2010-06-24 09:53:52 EDT
> I changed 
> 
> #define MAX_PENDING_READS 1
>  
> to
> 
> #define MAX_PENDING_READS 800

Seems like an Attribute is in order here.
Comment 3 Gustavo J. A. M. Carneiro 2010-06-24 13:49:18 EDT
Created attachment 929 [details]
additional patch: add RxQueueSize attribute

Agreed about the attribute.  Consider this additional patch, on top of the previous one, to add the attribute.
Comment 4 Andrey Mazo 2010-07-13 20:15:41 EDT
(In reply to comment #0)
> Created an attachment (id=915) [details]
> patch
> 
> EmuNetDevice has a "read thread" that does a malloc, and a main simulation
> thread that converts the data buffer to a Packet and then frees the buffer.
> 
> If there is a lot of packets to read from the network, it happens that the read
> thread easily receives the packets and allocates memory to them, but the
> simulation thread is much slower to process them.  The result is a lot of
> malloc'ed buffers that are not freed for a long time, leading to the virtual
> memory of the process to grow enormously.
> 
> The attached patch limits the number of concurrent pending reads and so
> improves the memory profile considerably.

Maybe another approach is to read all available packets from the network and then apply some kind of a queue management algorithm in the "read thread" like random early drop or simple tail drop?
It's maybe useful for user to have such a control on EmuNetDevice instead of leaving queue management in kernel.
Isn't it nice to apply the same algorithm for both fully simulated devices and EmuNetDevice?
Comment 5 Gustavo J. A. M. Carneiro 2010-07-14 05:25:18 EDT
(In reply to comment #4)
> Maybe another approach is to read all available packets from the network and
> then apply some kind of a queue management algorithm in the "read thread" like
> random early drop or simple tail drop?

Well, what we have right now is an infinite queue in disguise.  Each packet available to read is queued using ScheduleRealtimeNow.  The queue is in fact the in the event scheduler itself.

My patch(es) adds a limit to this "queue", configurable by the user.

> It's maybe useful for user to have such a control on EmuNetDevice instead of
> leaving queue management in kernel.

Yes, in fact there is a big difference in performance (I measured), for the worse, if you remove the queuing system at the NS-3 side.  The Linux kernel doesn't seem to do a great job with this...

> Isn't it nice to apply the same algorithm for both fully simulated devices and
> EmuNetDevice?

I guess we could add the packets into a Queue object instead, but we'd still need a ScheduleRealtimeNow because the Queue object is not thread-safe.  And then we arrive at a similar solution as we have now, except slightly more complex.  I don't think that it's worth it, to be honest.
Comment 6 Tom Henderson 2010-08-12 19:28:59 EDT
Do you have an example testcase that we can add to the set of programs that is periodically tested to avoid regressions?  

I am inclined to push these two patches now for ns-3.9 if you are confident with the patches and they still test well for you against ns-3-dev.
Comment 7 Tom Henderson 2010-08-12 19:53:32 EDT
Note, I have seen this crasher twice in testing the above two patches but it is difficult to reproduce as I type this.

sudo ./waf --run emu-ping
Waf: Entering directory `/home/tomh/emu/ns-3-allinone/ns-3-dev/build'
'/home/tomh/emu/ns-3-allinone/ns-3-dev/bindings/python/ns3/__init__.py' -> '/home/tomh/emu/ns-3-allinone/ns-3-dev/build/debug/bindings/python/ns3/__init__.py'
Waf: Leaving directory `/home/tomh/emu/ns-3-allinone/ns-3-dev/build'
'build' finished successfully (0.582s)
Received Response with RTT = 29245000ns
Received Response with RTT = 27753000ns
*** glibc detected *** /home/tomh/emu/ns-3-allinone/ns-3-dev/build/debug/scratch/emu-ping: double free or corruption (out): 0x00000000022c24b0 ***
======= Backtrace: =========
/lib64/libc.so.6[0x350e674a56]
msg="SystemMutexPrivate::Lock()pthread_mutex_lock failed: 22 = "Invalid argument"", file=../src/core/unix-system-mutex.cc, line=82
terminate called without an active exception
Command ['/home/tomh/emu/ns-3-allinone/ns-3-dev/build/debug/scratch/emu-ping'] terminated with signal SIGIOT. Run it under a debugger to get more information (./waf --run <program> --command-template="gdb --args %s <args>").
Comment 8 Gustavo J. A. M. Carneiro 2010-08-13 06:00:24 EDT
(In reply to comment #7)
> Note, I have seen this crasher twice in testing the above two patches but it is
> difficult to reproduce as I type this.

I see, that is worrisome.  However, I am read-reading the patches I produced and cannot find a fault in them (there isn't even any 'delete' or 'free' in them), so probably the bug is elsewhere?

Regarding unit testing, I don't think it's possible to include EmuNetDevice in unit tests.  It's very complicated thing to do (at the very least it needs a program to create a TAP interface, to which EmuNetDevice will write as fast as it can), and would require the tests to run with sudo privilege.

Anyway, I'm ok with leaving the bug for post 3.9.  The bug has always been there.  I found the problem when writing a program that tries to send packets through EmuNetDevice as fast as the network interface can handle.  Since each packet transmitted is also received back (promiscuous mode), this lead to memory exhaustion.  But I guess it is not something most people will want to do, so it is rarely triggered.  It must be fixed IMHO, but it's not urgent.
Comment 9 Gustavo J. A. M. Carneiro 2010-08-13 06:02:56 EDT
(In reply to comment #7)
> Note, I have seen this crasher twice in testing the above two patches but it is
> difficult to reproduce as I type this.
> 

Also be ware of WAF that sometimes forgets to rebuild example programs when just a structure changes.  If the layout of a structure changes and the program is not rebuilt, this type of error may happen.  In that case, you need to remove the example program binary and object file manually to have it rebuild.  /me kicks WAF.
Comment 10 Josh Pelkey 2010-08-16 13:45:25 EDT
changeset: 29512368dd2e
Comment 11 Tom Henderson 2010-08-17 01:08:07 EDT
Unfortunately this seems to not be quite there yet.  I am seeing this error occasionally with this patch (which is in ns-3.9-RC2):
msg="SystemMutexPrivate::Lock()pthread_mutex_lock failed: 22 = "Invalid
argument"", file=../src/core/unix-system-mutex.cc, line=82
terminate called without an active exception

I ran comparison tests this evening, running 400 instances of emu-ping.cc example on a version of ns-3 with and without this patch.  I found 24 instances of the above fatal error in the version with the patch, and 0 instances without it.

I'll try to track this down further tomorrow, but I think we will need to either fix now or pull from ns-3.9.
Comment 12 Gustavo J. A. M. Carneiro 2010-08-17 05:57:03 EDT
I have never seen crashes _during_ the simulation itself, but I am aware that EmuNetDevice does not always cleanup gracefully when the process terminates, due to the multi-threaded nature.  Perhaps calling StopDevice in DoDispose will improve the situation:

diff -r 2c0e1ae4c00d src/devices/emu/emu-net-device.cc
--- a/src/devices/emu/emu-net-device.cc	Sat Aug 14 15:37:02 2010 -0700
+++ b/src/devices/emu/emu-net-device.cc	Tue Aug 17 10:54:55 2010 +0100
@@ -201,6 +201,10 @@
 EmuNetDevice::DoDispose()
 {
   NS_LOG_FUNCTION_NOARGS ();
+  if (m_readThread != 0)
+    {
+      StopDevice ();
+    }
   m_node = 0;
   NetDevice::DoDispose ();
 }
Comment 13 Tom Henderson 2010-08-17 14:06:38 EDT
(In reply to comment #12)
> I have never seen crashes _during_ the simulation itself, but I am aware that
> EmuNetDevice does not always cleanup gracefully when the process terminates,
> due to the multi-threaded nature.  Perhaps calling StopDevice in DoDispose will
> improve the situation:
> 
> diff -r 2c0e1ae4c00d src/devices/emu/emu-net-device.cc
> --- a/src/devices/emu/emu-net-device.cc    Sat Aug 14 15:37:02 2010 -0700
> +++ b/src/devices/emu/emu-net-device.cc    Tue Aug 17 10:54:55 2010 +0100
> @@ -201,6 +201,10 @@
>  EmuNetDevice::DoDispose()
>  {
>    NS_LOG_FUNCTION_NOARGS ();
> +  if (m_readThread != 0)
> +    {
> +      StopDevice ();
> +    }
>    m_node = 0;
>    NetDevice::DoDispose ();
>  }

The above patch cleared all errors (400 consecutive clean runs); thanks!
Comment 14 Tom Henderson 2010-08-18 01:17:53 EDT
patch applied:  ecb0f9a3849a