Parallel Simulations

{{TOC}}

Link [http://www.nsnam.org/wiki/index.php/Suggested_Projects back to Suggested Projects page]

= General =

For a detailed description you should also read this thread: http://mailman.isi.edu/pipermail/ns-developers/2008-July/004454.html

== Suggestions from Mathieu Lacage ==

The first task is to decide what the scope of the project should be. There are really two possible options:

* provide a parallel tool for clusters and, more generally, for physically distributed systems
* provide a parallel tool for multicore systems (NUMA/SMP)

Since I doubt very much that the same solution can be applied with useful performance to both options, I would really urge interested students to focus on one of the above use-cases.

Now, since I was asked what I would recommend, the following is really a summary of one of the many implementation options you could choose to pursue, so don't take this as any kind of mandatory path. This is merely a hint as to what might be implementable within the requested timeframe and what might bring interesting and useful output to users.

1) The basic idea is to implement a conservative distributed time synchronization algorithm. The classic algorithm is described in "Parallel and Distributed Simulation Systems" by Richard M. Fujimoto. The fundamental assumption here is that the simulation models make no use of global variables and that simulated nodes are connected through channels which have a non-null minimum propagation delay. This allows you to parallelize two events e1 and e2 generated on two simulated nodes (n1 and n2) if e2 > e1 + min_delay(n1,n2) or if e1 > e2 + min_delay(n1,n2).
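
As a minimal illustration (plain C++, not ns-3 code), the following sketch encodes that safety test exactly as stated above; <tt>MinDelay</tt> is a hypothetical helper returning the minimum propagation delay of the channels connecting two nodes:

<pre>
#include <stdint.h>

typedef uint64_t SimTime;  // stand-in for ns3::Time, e.g. in nanoseconds

// Hypothetical helper: minimum propagation delay of the channels
// connecting nodes n1 and n2 (assumed to be non-null).
SimTime MinDelay (uint32_t n1, uint32_t n2);

// Events e1 (on node n1) and e2 (on node n2) can be executed in
// parallel if either event lies beyond the other's lookahead horizon.
bool
CanRunInParallel (SimTime e1, uint32_t n1, SimTime e2, uint32_t n2)
{
  return (e2 > e1 + MinDelay (n1, n2)) || (e1 > e2 + MinDelay (n1, n2));
}
</pre>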

2) We could, of course, implement the classic algorithm in the classic way to allow a user to build very large scale simulations by distributing the simulated nodes on a large set of simulation machines. This kind of parallelization is useful but it has a number of drawbacks, the main one being that it requires the user to describe a static partitioning of the simulated nodes: the total set of nodes is split in a set of partitions, each of which contains a subset of the total set of nodes. Then, each partition is scheduled to run on a single machine. Once you have made this decision, you cannot undo it since you are not using a language like java which allows code migration. In practice, finding a partitioning which does not generate slowdowns is non-trivial.

3) So, from my perspective, a desirable solution would require that the user does not have to describe a partition set, and it would work really hard to get automatically _some_ speedup from a multicore system.

A simple solution would be to re-use a classic synchronization algorithm with a very fine-grained partitioning (i.e., each partition contains exactly one node). Then, you could schedule each partition on one of the cpus of the multicore machine until the synchronization algorithm tells you that this partition has no events to schedule anymore. At this point, you can pick another available partition to schedule it on one of the available cpus (a sketch of this loop follows the lists below).

Pluses:
* no user-visible partitioning on a single multicore machine.
* can be re-used if you have multiple multicore machines to schedule events within each multicore machine. So, you can reuse this work if you want to provide a more generic distributed simulation solution.

Minuses:
* no idea if you will get any kind of measurable speedup, and you will need to work hard to optimize local cache thrashing, etc.
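
A minimal sketch of that scheduling loop, under stated assumptions: <tt>Partition</tt> and <tt>RunSafeEvents()</tt> are hypothetical stand-ins for a per-node partition and for "run every event the synchronization algorithm currently allows", not existing ns-3 types:

<pre>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical single-node partition: RunSafeEvents() executes all
// currently safe events and returns true while events are still pending.
struct Partition
{
  bool RunSafeEvents ();
};

// Schedule partitions onto nCpus workers: each worker picks an available
// partition, drains its safe events, and requeues it if it is unfinished.
void
RunPartitions (std::vector<Partition> &partitions, unsigned nCpus)
{
  std::mutex m;
  std::queue<Partition *> ready;
  for (Partition &p : partitions)
    {
      ready.push (&p);
    }

  auto worker = [&] () {
    for (;;)
      {
        Partition *p = 0;
        {
          std::lock_guard<std::mutex> lock (m);
          if (ready.empty ())
            {
              return; // simplification: a real scheduler would idle here
            }
          p = ready.front ();
          ready.pop ();
        }
        if (p->RunSafeEvents ())
          {
            std::lock_guard<std::mutex> lock (m);
            ready.push (p); // still has pending events
          }
      }
  };

  std::vector<std::thread> pool;
  for (unsigned i = 0; i < nCpus; i++)
    {
      pool.emplace_back (worker);
    }
  for (std::thread &t : pool)
    {
      t.join ();
    }
}
</pre>

A production version would also have to pin partitions to cpus to limit the cache thrashing mentioned above.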

4) How do you implement this?

The core synchronization algorithm should live in the 'simulator' module: src/simulator/simulator.cc should basically be replaced by a new version which performs the synchronization/partition scheduling. It is trivial to make this code dynamically replaceable by exporting something like SimulatorImpl defined in src/simulator/simulator.cc, so I will skip this detail.

The biggest problem from an API point of view is that you need to be able to tell which partition an event belongs to. Traditionally, this is not a problem because there is a single partition per physical machine/address-space. However, here, we have multiple partitions within a single address-space (physical machine). Since every partition is made of a single Node, we just need to tell which node an event is coming from. There are many options, but here are two:

* Simulator::ScheduleWithObject (Ptr<Object> node, Time delay, function, ...);
* Node::Schedule (Time delay, function, ...);

And, then, you need to make sure that all your models use these scheduling functions rather than the functions they are now using.
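
To make the intent concrete, here is a purely illustrative sketch of the second option (hypothetical names and simplified types, not the actual ns-3 API): the node id is forwarded automatically with every event, so the scheduler can map the event onto that node's partition:

<pre>
#include <stdint.h>

typedef uint64_t Time;            // stand-in for ns3::Time
typedef void (*EventFn) (void);   // stand-in for a bound event callback

// Core entry point: every event carries the id of the node it belongs to.
void
ScheduleOnNode (uint32_t nodeId, Time delay, EventFn fn)
{
  // ... insert (delay, nodeId, fn) into the event queue ...
}

class Node
{
public:
  uint32_t GetId (void) const { return m_id; }

  // Models call node->Schedule (...) instead of Simulator::Schedule (...),
  // and the originating node id travels with the event automatically.
  void Schedule (Time delay, EventFn fn)
  {
    ScheduleOnNode (GetId (), delay, fn);
  }

private:
  uint32_t m_id;
};
</pre>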

5) A small tricky detail to get right from an algorithmic perspective is how you will handle events which expire at the same time. There are two issues to deal with here:

* you need to ensure that the parallel simulations provide a deterministic ordering of these events.
* you need to ensure that the deterministic ordering of these events is the same in both non-parallel and parallel simulations, to ensure that models are trivially portable from a non-parallel simulation to a parallel one.

I suspect that there are multiple solutions to this problem, but here is an ordering algorithm which should solve it:

* all same-time events within a single node should be scheduled FIFO.
* if you have two same-time events on node A and B, and if A->GetNodeId () < B->GetNodeId (), then the event on A should be scheduled first.

This means that you need to add the node id to the event ordering key as a secondary ordering key (in fact, there are already 2 keys in there, so this would be a third ordering key).
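
A sketch of such a three-part key (illustrative field names; the real ns-3 scheduler key differs): the timestamp orders events globally, the node id breaks ties between nodes, and the insertion counter keeps same-time events on one node in FIFO order:

<pre>
#include <stdint.h>

// Hypothetical event ordering key: compared lexicographically, so the
// node id acts as the tie-breaker between same-time events on
// different nodes, and m_uid preserves FIFO order within one node.
struct EventKey
{
  uint64_t m_ts;      // expiration timestamp
  uint32_t m_nodeId;  // originating node (the new key)
  uint32_t m_uid;     // insertion order

  bool operator< (const EventKey &o) const
  {
    if (m_ts != o.m_ts)         return m_ts < o.m_ts;
    if (m_nodeId != o.m_nodeId) return m_nodeId < o.m_nodeId;
    return m_uid < o.m_uid;
  }
};
</pre>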

= Project Background =

This project analyses the current ns-3 architecture, spots areas of parallelization, and builds the fundamental algorithms to achieve performance gains. The main goal is CPU-local parallelization, but a powerful architecture should on the other hand also scale in the large (in distributed environments).

The approach should be universal and transparent for all major subsystems within the simulator. Therefore an additional abstraction layer should be introduced to hide all implementation issues and make it possible to disable the parallelization completely, or to substitute or enhance the algorithms. The additional layer is an increment; the first usable results should show where an interface is suitable. The focus is still a working implementation!

= Project Plan =

* Literature study
* Basic parallelization and packet serialization/deserialization
* Synchronization approach
** Node local (CMP/SMP)
** Distributed (MPI)
* Balance subsystem isolation (WHERE to split the ns-3 system for parallelization)
* Clean parallelization layer with the following characteristics:
** as little interaction with other subsystems as possible
** minimal overhead
** new technologies should be implementable without knowledge of the underlying algorithm (e.g. interference calculation for wireless nodes)
** last but not least: the introduced algorithm should scale well for uniprocessor systems as well as TOP500.org clusters! ;)

= Approach =

The current approach and fundamental algorithm are based on a space-parallel paradigm. Nodes are merged into subsets (called federates) where each subset represents a working thread (consider this as a thread, local process or distributed working task - for example via MPI).
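
For illustration only, the simplest static node-to-federate assignment is round-robin (a sketch under that assumption; the branch may partition differently, e.g. by topology):

<pre>
#include <stdint.h>
#include <vector>

// Map each node to a federate (e.g. an MPI rank) round-robin.
// Returns a vector where entry i is the federate owning node i.
std::vector<uint32_t>
AssignNodesToFederates (uint32_t nNodes, uint32_t nFederates)
{
  std::vector<uint32_t> federateOf (nNodes);
  for (uint32_t node = 0; node < nNodes; node++)
    {
      federateOf[node] = node % nFederates;
    }
  return federateOf;
}
</pre>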

= Usage =

== Dependencies ==

The underlying synchronization method is based on MPI, so you need some additional libraries to build the parallelized ns-3. On a Debian based system you should type

<pre>aptitude install libopenmpi1 libopenmpi-dev openmpi-common openmpi-bin</pre>

to install the required dependencies.

== Compile Instructions ==

To compile the branch (ns-3-para) you should always call all <tt>./waf</tt> commands with a leading "<tt>CXX=/usr/bin/mpicxx</tt>". This tells waf to replace the default compiler with an MPI wrapper compiler (which itself calls the appropriate compiler). In the end you will use a line similar to the following to compile the branch:

<pre>CXX=/usr/bin/mpicxx ./waf configure && CXX=/usr/bin/mpicxx ./waf</pre>

== Parallelized Simulation ==

Currently no modifications to the simulated scenario files are required, except for one: you must add the line

<pre>Simulator::EnableParallelSimulation();</pre>

in front of "<tt>Simulator::Run ();</tt>" (a minimal scenario skeleton is sketched at the end of this section). If you do not add this line, the simulator behaves like a normal, serial run. To start the simulation you must set up the MPI environment, which means executing the <tt>mpirun(1)</tt> command. To start the point-to-point-udp-discard scenario (bundled with ns-3) you could execute the following:

<pre>
./waf --shell
mpirun --np 2 --mca btl \^udapl,openib  build/debug/examples/point-to-point-udp-discard
</pre>

<tt>--np 2</tt> means that two instances are spawned on the local machine.<br />
<tt>--mca btl \^udapl,openib</tt> signals MPI that you are not running over a low-latency bus system like InfiniBand, and suppresses some warnings.

In the end you invoke the normal program, no magic here. That's all! Eventually some wrapper scripts should be supplied, and the compile-time environment variable should be replaced by waf configure options.
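
For orientation, a minimal scenario skeleton showing where the call goes; only <tt>Simulator::EnableParallelSimulation ()</tt> is specific to the ns-3-para branch, the rest is ordinary ns-3 boilerplate:

<pre>
#include "ns3/simulator.h"

using namespace ns3;

int
main (int argc, char *argv[])
{
  // ... usual scenario setup: create nodes, channels, applications ...

  // The one branch-specific line: switch the scheduler into its
  // parallel mode before the event loop starts.
  Simulator::EnableParallelSimulation ();

  Simulator::Run ();
  Simulator::Destroy ();
  return 0;
}
</pre>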

= Milestones =

* Synchronization between federates (packet as well as time information) - 95 %
** MPI nearly completed (let's say 95%)
** Outlook: shared-memory based approach
* Time synchronization - 0 %
** This relates to the question of how federates can act and execute events when they do not know whether a neighboring federate wants to execute an event earlier in the timeline. The main challenge is to reduce the synchronization overhead to a minimum; the choice of a proper algorithm is of existential impact. This is an open question and will be treated after the data synchronization is completed.
* Input/Output handling - 0 %
** Currently the output isn't synchronized when several instances of ns-3 are executed in parallel. This must be fixed! The idea is to introduce a final phase after the simulation is done: synchronize the data (e.g. send it to the main instance - rank 0 for MPI) and output it. This could be one answer, but on the other hand it introduces additional overhead (time as well as data synchronization).

= Profiling =

== Calltree Analysis ==

There are several profiling tools and several areas of profiling (like cache miss rate, IO/CPU impact, ...). This section discusses a call-graph based approach via valgrind - a popular profiling tool. The interpretation of the generated data is left to the reader; this section only shows the steps required to get the data.

First of all, you need the required dependencies; these include valgrind and kcachegrind (a kdelibs based visualization tool). On a Debian based system you can install these via

<pre>
aptitude install valgrind kcachegrind
</pre>

To visualize the calltree for a particular scenario file you invoke ns-3 like this:

<pre>
./waf --run tcp-large-transfer --command-template="valgrind --tool=callgrind --trace-children=yes --collect-jumps=yes %s"
kcachegrind callgrind.out.*
</pre>

= References =

== Parallelization Approaches ==

* OMNeT++ [http://www.omnetpp.org/doc/parsim-api/main.html]
* The Georgia Tech Network Simulator [http://www.ece.gatech.edu/research/labs/MANIACS/GTNetS/]
* Dartmouth SSF [http://www.crhc.uiuc.edu/~jasonliu/projects/ssf/intro.html]

== Synchronization/Communication ==

* MPI
** OpenMPI [http://www.open-mpi.org/]
** MPI on Multicore, an OpenMP Alternative? [http://www.linux-mag.com/id/4608]

== Literature ==

* GloMoSim: A Library for Parallel Simulation of Large-scale Wireless Networks [http://www.qualnet.com/pdf/glomosim.pdf]
* Space-parallel network simulations using ghosts [http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/9119/28915/01301298.pdf?arnumber=1301298]
* Lock-free Scheduling of Logical Processes in Parallel Simulation [http://www.crhc.uiuc.edu/~jasonliu/projects/ssf/papers/PADS01-lockfree.ps]
* Learning Not to Share [http://www.crhc.uiuc.edu/~jasonliu/projects/ssf/papers/PADS01-noshare.ps]
* Towards Realistic Million-Node Internet Simulations [http://www.ssfnet.org/Papers/pdpta99.pdf]
* A Generic Framework for Parallelization of Network Simulations [http://citeseer.ist.psu.edu/rd/0%2C245012%2C1%2C0.25%2CDownload/http://citeseer.ist.psu.edu/cache/papers/cs/11632/http:zSzzSzwww.cc.gatech.eduzSzfaczSzMostafa.AmmarzSzpaperszSzmascots.pdf/riley99generic.pdf]
