For a detailed description you should also read this thread: http://mailman.isi.edu/pipermail/ns-developers/2008-July/004454.html
This approach analyses the current NS3 architecture, spots areas for parallelization, and builds the fundamental algorithms to achieve performance gains! The main goal is CPU-local parallelization, but a powerful architecture should, on the other hand, also scale up in large (distributed) environments.
The approach should be universal and transparent for all major subsystems within the simulator. Therefore an additional abstraction layer should be introduced to hide all implementation issues and make it possible to disable the parallelization completely, or to substitute or enhance the algorithms. The additional layer is an increment; the first usable results should show where an interface is suitable. The focus is still a working implementation!
- Literature study
- Basic parallelization and packet serialization/deserialization
- Synchronization approach
- Node local (CMP/SMP)
- Distributed (MPI)
- Balance subsystem isolation (WHERE to split the NS3 system for parallelization)
- Clean parallelization layer with the following characteristics
- as little interaction with other subsystems as possible
- minimal overhead
- new technologies should be implementable without knowledge of the underlying algorithm (e.g. interference calculation for wireless nodes)
- last but not least: the introduced algorithm should scale well on uniprocessor systems as well as on TOP500.org clusters! ;)
The current approach and fundamental algorithm are based on a space-parallel paradigm. Nodes are merged into subsets (called federates), where each subset represents a worker (consider this a thread, a local process, or a distributed task - for example via MPI).
The underlying synchronization method is based on MPI. Therefore you need some additional libraries to build the parallelized ns-3. On a Debian-based system you should type
aptitude install libopenmpi1 libopenmpi-dev openmpi-common openmpi-bin
to install the required dependencies.
To compile the branch (ns-3-para) you should always call all ./waf commands with a leading "CXX=/usr/bin/mpicxx". This tells waf to replace the default compiler with an MPI wrapper compiler (which itself calls the appropriate compiler). In the end you will use a line similar to the following to compile the branch:
CXX=/usr/bin/mpicxx ./waf configure && CXX=/usr/bin/mpicxx ./waf
Currently no modifications to the simulated scenario files are required, except for one: you must add the line
in front of "Simulator::Run ();". If you do not add this line, the simulator behaves like a normal run. To start the simulation you must set up the MPI environment; therefore you must execute the mpirun(1) command. To start the point-to-point-udp-discard scenario (bundled with ns-3) you could execute the following:
./waf --shell
mpirun --np 2 --mca btl \^udapl,openib build/debug/examples/point-to-point-udp-discard
--np 2 means that two instances are spawned on the local machine
--mca btl \^udapl,openib tells MPI that you are not running over a low-latency bus system like InfiniBand and suppresses some warnings.
In the end you invoke the normal program - no magic here. That's all! Eventually some wrapper scripts should be supplied, and the compile-time environment variables should be replaced by waf configure options.
- Synchronization between federates (packet as well as time information) - 95 %
- MPI support nearly completed (let's say 95 %)
- Outlook: shared memory based approach
- Time synchronization - 0 %
- This relates to the question of how federates can act and execute events when they do not know whether a neighboring federate wants to execute an event earlier in the timeline. The main challenge is to reduce the synchronization overhead to a minimum; the choice of a proper algorithm is of existential importance. This is an open question and will be treated after the data synchronization is completed.
- Input/Output handling - 0 %
- Currently the output is not synchronized when several instances of NS-3 are executed in parallel. This must be fixed! The idea is to introduce a final phase after the simulation is done, synchronize the data (e.g. send it to the main instance - rank 0 for MPI) and output it. This could be one answer, but on the other hand it introduces additional overhead (time as well as data synchronization).
There are several profiling tools and several areas of profiling (like cache miss rate, I/O vs. CPU impact, ...). This section discusses a call-graph-based approach via valgrind - a popular profiling tool. The interpretation of the generated data is left to the reader; this paragraph shows the steps required to obtain the data.
First of all, you need the required dependencies; these include valgrind and kcachegrind (a kdelibs-based visualization tool). On a Debian-based system you can install them via
aptitude install valgrind kcachegrind
To generate and visualize the call tree for a particular scenario file you invoke ns-3 like this, then open the result in kcachegrind:
./waf --run tcp-large-transfer --command-template="valgrind --tool=callgrind --trace-children=yes --collect-jumps=yes %s"
kcachegrind callgrind.out.*
- GloMoSim: A Library for Parallel Simulation of Large-scale Wireless Networks 
- Space-parallel network simulations using ghosts 
- Lock-free Scheduling of Logical Processes in Parallel Simulation 
- Learning Not to Share 
- Towards Realistic Million-Node Internet Simulations 
- A Generic Framework for Parallelization of Network Simulations