|
|
Line 5: |
Line 5: |
| == Suggestions from Mathieu Lacage == | | == Suggestions from Mathieu Lacage == |
| | | |
− | The first task is to decide what the scope of the project should be.
| + | These were posted on the ns-developers mailing-list: |
− | There are really two possible options:
| + | http://mailman.isi.edu/pipermail/ns-developers/2008-March/003829.html |
− | * provide a parallel tool for clusters, and, more generally, for
| + | |
− | physically distributed systems
| + | |
− | * provide a parallel tool for multicore systems (numa/smp)
| + | |
− | | + | |
− | Since I doubt very much that the same solution can be applied with
| + | |
− | useful performance to both options, I really would urge interested
| + | |
− | students to focus on one of the above use-cases.
| + | |
− | | + | |
− | Now, since I was asked for what what I would recommend, the following is
| + | |
− | really a summary of one of the many implementation options you could
| + | |
− | chose to pursue so, don't take this is as any kind of mandatory path.
| + | |
− | This is merely a hint as to what might be implementable within the
| + | |
− | requested timeframe and what might bring interesting and useful output
| + | |
− | to users.
| + | |
− | | + | |
− | 1) The basic idea is to implement a conservative distributed time
| + | |
− | synchronization algorithm. The classic algorithm is described in
| + | |
− | "Parallel simulation: parallel and distributed simulation systems" by
| + | |
− | Richard M. Fujimoto. The fundamental assumption here is that the
| + | |
− | simulation models make no use of global variables and that simulated
| + | |
− | nodes are connected through channels which have a non-null minimum
| + | |
− | propagation delay. This allows you to parallelize two events e1 and e2
| + | |
− | generated on two simulated nodes (n1 and n2) if e2 > e1 + min_delay
| + | |
− | (n1,n2) or if e1 > e2 + min_delay (n1,n2).
| + | |
− | | + | |
− | 2) We could, of course, implement the classic algorithm in the classic
| + | |
− | way to allow a user to build very large scale simulations by
| + | |
− | distributing the simulated nodes on a large set of simulation machines.
| + | |
− | This kind of parallelization is useful but it has a number of drawbacks,
| + | |
− | the main one being that it requires the user to describe a static
| + | |
− | partitioning of the simulated nodes: the total set of nodes is split in
| + | |
− | a set of partitions, each of which contains a subset of the total set of
| + | |
− | nodes. Then, each partition is scheduled to run on a single machine.
| + | |
− | Once you have made this decision, you cannot undo it since you are not
| + | |
− | using a language like java which allows code migration. In practice,
| + | |
− | finding a partitioning which does not generate slowdowns is non-trivial.
| + | |
− | | + | |
− | 3) So, from my perspective, a desirable solution would require that the
| + | |
− | user does not have to describe a partition set and it would work really
| + | |
− | hard to get automatically _some_ speedup from a multicore system.
| + | |
− | | + | |
− | A simple solution would be to re-use a classic synchronization algorithm
| + | |
− | with a very fine-grained partitioning (i.e., each partition contains
| + | |
− | exactly one node). Then, you could schedule each partition on one of the
| + | |
− | cpus of the multicore machine until the synchronization algorithm tells
| + | |
− | you that this partition has no events to schedule anymore. At this
| + | |
− | point, you can pick another available partition to schedule it on one of
| + | |
− | the available cpus.
| + | |
− | | + | |
− | Pluses:
| + | |
− | * no user-visible partitioning on a single multicore machine.
| + | |
− | * can be re-used if you have multiple multicore machines to schedule
| + | |
− | events within each multicore machine. So, you can reuse this work if you
| + | |
− | want to provide a more generic distributed simulation solution.
| + | |
− | | + | |
− | Minuses:
| + | |
− | * have no idea if you will get any kind of measurable speedup and you
| + | |
− | will need to work hard to optimize local cache thrashing, etc.
| + | |
− | | + | |
− | 4) How do you implement this ?
| + | |
− | | + | |
− | The core synchronization algorithm should live in the 'simulator'
| + | |
− | module: src/simulator/simulator.cc should basically be replaced by a new
| + | |
− | version which performs the synchronization/partition scheduling. It is
| + | |
− | trivial to make this code dynamically replaceable by exporting something
| + | |
− | like SimulatorImpl defined src/simulator/simulator.cc so, I will skip
| + | |
− | this detail.
| + | |
− | | + | |
− | The biggest problem from an API point of view is that you need to be
| + | |
− | able to tell which partition an event belongs to. Traditionally, this is
| + | |
− | not a problem because there is a single partition per physical
| + | |
− | machine/address-space. However, here, we have multiple partitions within
| + | |
− | a single address-space (physical machine). Since every partition is made
| + | |
− | of a single Node, we just need to tell which node an event is coming
| + | |
− | from. There are many options, but here are two:
| + | |
− | * Simulator::ScheduleWithObject (Ptr<Object> node, Time delay,
| + | |
− | function, ...);
| + | |
− | * Node::Schedule (Time delay, function, ...);
| + | |
− | | + | |
− | And, then, you need to make sure that all your models use these
| + | |
− | scheduling functions rather than the functions they are now using.
| + | |
− | | + | |
− | 5) A small tricky detail to get right from an algorithmic perspective is
| + | |
− | how you will handle events which expire at the same time. There are two
| + | |
− | issues to deal with here:
| + | |
− | * you need to ensure that the parallel simulations provide a
| + | |
− | deterministic ordering of these events.
| + | |
− | * you need to ensure that the deterministic ordering of these events
| + | |
− | is the same in both non-parallel and parallel simulations to ensure that
| + | |
− | models are trivially portable from a non-parallel simulation to a
| + | |
− | parallel one.
| + | |
− | | + | |
− | I suspect that there are multiple solutions to this problem, but here is
| + | |
− | an ordering algorithm which should solve this problem:
| + | |
− | * all same-time events within a single node should be scheduled FIFO.
| + | |
− | * if you have two same-time events on node A and B, and if
| + | |
− | A->GetNodeId () < B->GetNodeId (), then, the event on A should be
| + | |
− | scheduled first.
| + | |
− | | + | |
− | This means that you need to add to the event ordering key the nodeid as
| + | |
− | a secondary ordering key (in fact, there are already 2 keys in there,
| + | |
− | so, this would be a third ordering key)
| + | |