Parallel Simulations

From Nsnam
Revision as of 03:06, 22 March 2008 by Tomh (Talk | contribs) (Mathieu's suggestions on parallel simulations)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Main Page - Current Development - Developer FAQ - Tools - Related Projects - Project Ideas - Summer Projects

Installation - Troubleshooting - User FAQ - HOWTOs - Samples - Models - Education - Contributed Code - Papers

Link back to Suggested Projects page

Suggestions from Mathieu Lacage

The first task is to decide what the scope of the project should be. There are really two possible options:

  • provide a parallel tool for clusters, and, more generally, for

physically distributed systems

  • provide a parallel tool for multicore systems (numa/smp)

Since I doubt very much that the same solution can be applied with useful performance to both options, I really would urge interested students to focus on one of the above use-cases.

Now, since I was asked for what what I would recommend, the following is really a summary of one of the many implementation options you could chose to pursue so, don't take this is as any kind of mandatory path. This is merely a hint as to what might be implementable within the requested timeframe and what might bring interesting and useful output to users.

1) The basic idea is to implement a conservative distributed time synchronization algorithm. The classic algorithm is described in "Parallel simulation: parallel and distributed simulation systems" by Richard M. Fujimoto. The fundamental assumption here is that the simulation models make no use of global variables and that simulated nodes are connected through channels which have a non-null minimum propagation delay. This allows you to parallelize two events e1 and e2 generated on two simulated nodes (n1 and n2) if e2 > e1 + min_delay (n1,n2) or if e1 > e2 + min_delay (n1,n2).

2) We could, of course, implement the classic algorithm in the classic way to allow a user to build very large scale simulations by distributing the simulated nodes on a large set of simulation machines. This kind of parallelization is useful but it has a number of drawbacks, the main one being that it requires the user to describe a static partitioning of the simulated nodes: the total set of nodes is split in a set of partitions, each of which contains a subset of the total set of nodes. Then, each partition is scheduled to run on a single machine. Once you have made this decision, you cannot undo it since you are not using a language like java which allows code migration. In practice, finding a partitioning which does not generate slowdowns is non-trivial.

3) So, from my perspective, a desirable solution would require that the user does not have to describe a partition set and it would work really hard to get automatically _some_ speedup from a multicore system.

A simple solution would be to re-use a classic synchronization algorithm with a very fine-grained partitioning (i.e., each partition contains exactly one node). Then, you could schedule each partition on one of the cpus of the multicore machine until the synchronization algorithm tells you that this partition has no events to schedule anymore. At this point, you can pick another available partition to schedule it on one of the available cpus.

Pluses:

  • no user-visible partitioning on a single multicore machine.
  • can be re-used if you have multiple multicore machines to schedule

events within each multicore machine. So, you can reuse this work if you want to provide a more generic distributed simulation solution.

Minuses:

  • have no idea if you will get any kind of measurable speedup and you

will need to work hard to optimize local cache thrashing, etc.

4) How do you implement this ?

The core synchronization algorithm should live in the 'simulator' module: src/simulator/simulator.cc should basically be replaced by a new version which performs the synchronization/partition scheduling. It is trivial to make this code dynamically replaceable by exporting something like SimulatorImpl defined src/simulator/simulator.cc so, I will skip this detail.

The biggest problem from an API point of view is that you need to be able to tell which partition an event belongs to. Traditionally, this is not a problem because there is a single partition per physical machine/address-space. However, here, we have multiple partitions within a single address-space (physical machine). Since every partition is made of a single Node, we just need to tell which node an event is coming from. There are many options, but here are two:

  • Simulator::ScheduleWithObject (Ptr<Object> node, Time delay,

function, ...);

  • Node::Schedule (Time delay, function, ...);

And, then, you need to make sure that all your models use these scheduling functions rather than the functions they are now using.

5) A small tricky detail to get right from an algorithmic perspective is how you will handle events which expire at the same time. There are two issues to deal with here:

  • you need to ensure that the parallel simulations provide a

deterministic ordering of these events.

  • you need to ensure that the deterministic ordering of these events

is the same in both non-parallel and parallel simulations to ensure that models are trivially portable from a non-parallel simulation to a parallel one.

I suspect that there are multiple solutions to this problem, but here is an ordering algorithm which should solve this problem:

  • all same-time events within a single node should be scheduled FIFO.
  • if you have two same-time events on node A and B, and if

A->GetNodeId () < B->GetNodeId (), then, the event on A should be scheduled first.

This means that you need to add to the event ordering key the nodeid as a secondary ordering key (in fact, there are already 2 keys in there, so, this would be a third ordering key)