TestLink

Verification, Validation and Testing

There is often much confusion regarding the meaning of the words Verification, Validation and Testing, and of other associated terminology. It is worthwhile to spend a little time establishing exactly what we mean when we use them.

A computer model is a mathematical or logical representation of something. It can represent a vehicle, a frog or a networking card. Models can also represent processes such as global warming, freeway traffic flow or a specification of a networking protocol. Models can be completely faithful representations of a logical process specification, but by their nature they can never completely simulate a physical object or process. In most cases, a number of simplifications are made to the model to make simulation computationally tractable.

Every model has a target system that it is attempting to simulate. The first step in creating a simulation model is to identify this target system and the level of detail and accuracy that the simulation is desired to reproduce. In the case of a logical process, the target system may be identified as TCP as defined by RFC 793. In this case, it will probably be desirable to create a model that completely and faithfully reproduces RFC 793. In the case of a physical process this will not be possible. If, for example, you would like to simulate a wireless networking card, you may determine that you need, "an accurate MAC-level implementation of the 802.11 specification and [...] a not-so-slow PHY-level model of the 802.11a specification."

Once this is done, one can develop an abstract model of the target system. This is typically an exercise in managing the tradeoffs between complexity, resource requirements and accuracy. The process of developing an abstract model has been called model qualification in the literature. In the case of a TCP protocol, this process results in a design for a collection of objects, interactions and behaviors that will fully implement RFC 793 in ns-3. In the case of the wireless card, this process results in a number of tradeoffs to allow the physical layer to be simulated and the design of a network device and channel for ns-3, along with the desired objects, interactions and behaviors.

This abstract model is then developed into an ns-3 model that implements the abstract model as a computer program. The process of getting the implementation to agree with the abstract model is called model verification in the literature.

The process so far is open loop. What remains is to determine that a given ns-3 model has some connection to some reality -- that a model is an accurate representation of a real system, whether a logical process or a physical entity. If you are going to use a simulation model to try to predict how some real system is going to behave, you must have some reason to believe your results -- i.e., can you trust that an inference made from the model translates into a correct prediction for the real system. The process of getting the ns-3 model behavior to agree with the desired target system behavior as defined by the model qualification process is called model validation in the literature. In the case of a TCP implementation, you may want to compare the behavior of your ns-3 TCP model to some reference implementation in order to validate your model. In the case of a wireless physical layer simulation, you may want to compare the behavior of your model to that of real hardware in a controlled setting.

Generally, the process is described as a closed loop with variations on the following theme:

 target-system <---------------> abstract-model <--------------> ns-3 model
       ^         qualification                    verification      ^
       |                                                            |
       +------------------------------------------------------------+
                               validation

The following are the definitions we will use:

  • Domain of applicability: Prescribed conditions for which the model has been tested, compared against reality to the extent possible, and judged suitable for use;
  • Qualification: The process of defining the accuracy of a model in order to make a simulation tractable;
  • Range of accuracy: Demonstrated agreement between the computerized model and reality within a domain of applicability;
  • Simulation: Modeling of systems and their operations using various means of representation;
  • Reality: An entity, situation, or system selected for analysis -- a target-system;
  • Validation: Substantiation that a model, within its domain of applicability, possesses a satisfactory range of accuracy consistent with the intended application of the model;
  • Verification: Substantiation that the implementation of an abstract model is correct and performs as intended.

Note that we have not used the term software testing at all in this discussion. The process of qualification, verification and validation is really a research and development activity. Many of the checks implemented in the verification phase are ultimately reused in a software test suite, leading to a blurring of the tasks. Conceptually, however, neither qualification, verification nor validation has anything to do with software testing in its commonly understood sense. The goal of model verification and validation is, as suggested by the definitions above, substantiation that a model does what is advertised.

You will find some of the same terms and concepts used in discussions of software testing, however. Software Testing is an investigation conducted to provide information about the quality of the product. This is more of a manufacturing process activity -- given a model that has been verified and validated, software testing ensures that the model can be reproduced accurately and used without unexpected errors. This is why software testing is sometimes called software quality control.

Without going too deeply into software test engineering, let's define some terms here as well:

  • Acceptance testing: Tests performed prior to introducing a model into the main build or testing process;
  • Integration testing: Tests for defects in the interfaces and interaction between units. Progressively larger groups of units may be integrated and tested;
  • Performance testing: Tests to verify that models can handle large quantities of data (sometimes referred to as Load Testing);
  • Regression testing: Tests performed to uncover functionality that previously worked correctly but has stopped working as intended;
  • System testing: Checks that a completely integrated system meets its requirements;
  • Unit testing: Tests minimal software components, or modules. Each unit is tested to verify that the detailed design for the unit has been correctly implemented;
  • Usability testing: Verifies that user interfaces are easy to use and understand;
  • Verification: A determination that the product has been built according to its specifications;
  • Validation: A determination that the system meets its intended needs and that the specifications were correct.

Note the reappearance of the terms Verification and Validation here with subtly changed meanings. These activities close the product development loop in the same way that Validation and Verification close the model development loop. These tasks are similar but not identical and are most often performed by people in entirely different roles. In many cases, it seems, regression testing is confused with verification or validation. These are actually wildly different activities with divergent goals.

That said, there is absolutely nothing wrong with code reuse. It is possible, and desirable, to reuse tests done for model validation and verification in the software test domain. For example, it would be very useful to automate the test suite used to verify and validate a given model and use those tests as verification, validation and regression tests in the software test sense.

The deliverables for ns-3 model verification and validation will be something like web or wiki pages detailing what behaviors have been validated. If a particular behavior is verified or validated, the final output of the validation or verification test should be something like CONFORMS or DOES NOT CONFORM. On the other hand, the deliverables for a software test suite will be something like a PASS or FAIL indication. The same code can be used in both cases. If a model validation or verification test is incorporated into a nightly regression test, the output CONFORMS is interpreted as CONTINUES TO CONFORM, and the output DOES NOT CONFORM is interpreted as REGRESSION ERROR.
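
As a minimal sketch of how one piece of code might serve both roles, consider the following Python fragment; the function name run_validation_test and the verdict strings are hypothetical placeholders, not part of any existing ns-3 tooling.

 # Hypothetical sketch: reuse a model validation check as a regression
 # test by reinterpreting its verdict.  run_validation_test() is an
 # assumed placeholder, not real ns-3 code.
 def run_validation_test():
     # A real check would exercise the model and compare its behavior
     # against the reference; here we simply pretend that it conformed.
     return "CONFORMS"
 
 def run_as_regression_test():
     verdict = run_validation_test()
     if verdict == "CONFORMS":
         print("CONTINUES TO CONFORM (PASS)")
         return True
     print("REGRESSION ERROR (FAIL)")
     return False
 
 if __name__ == "__main__":
     run_as_regression_test()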

The ns-3 verification, validation and testing project will produce tools and environments that make it as easy as possible to create the various kinds of tests used in the sundry domains we have described. The frameworks will try to make it easy to reuse code in model-related and software test-related cases. We will also provide a number of examples to show how we think the different testing tasks should be done.

Kinds of Validation

The process used to validate a model is conceptually quite simple. One compares the behavior of the ns-3 model to the behavior of the target system, and makes adjustments to the abstract model and/or the ns-3 model to improve the correlation. This part of the process is sometimes called calibrating the model.

As mentioned above, at the end of the validation process, it is desirable to have a number of repeatable results that can demonstrate to other users that a given ns-3 model is faithful to a given abstract model and that this abstract model is, in turn, a faithful representation of the target system based on the initial qualification process. We call this collection of results the Validation Suite. You can think of this as the collection of experiments that have been used to validate a model, which can be run collectively as a suite of tests on the model for use by other users. The suite can also be run as part of the software test strategy as mentioned above.

These validation suites can be composed of deterministic tests which are used to validate process-oriented models such as the TCP implementation of RFC 793 mentioned above, or stochastic tests which are used to validate physical processes. In both cases one wants to provide inputs to a model and observe that the outputs behave as expected. In the literature, Naylor and Finger call this piece of the puzzle, Validation of Input-Output Transformations.

Validating Models Using Stochastic Methods

In this case, the part of the target system to be validated is ultimately based on physical processes which are governed statistically. To validate a model of this nature, we will need to perform statistical comparisons between experiments done on the target system and simulations of experiments done on the ns-3 model. These techniques might be used to validate the "not-so-slow PHY-level model of the 802.11a specification" example given above.

The goal is to compare the behavior of the target system to the ns-3 model in some set of ways. We must then identify some behavior, or observable, to be validated and then design an experiment to determine whether or not the ns-3 model behaves in a way that is consistent with the target system in that respect. We want to propose tests that the ns-3 model would fail if it were not functioning consistently with the target system. What does that really mean?

In the stochastic case, we are talking about a random variable -- a quantity that has no definite value, but rather has an ensemble of values that it can assume. This ensemble of values varies according to some probability distribution. For some set of conditions then, the measurements of the random variable taken on the target system will have some distribution with some number of moments such as the expectation value (mean) and a variance.

If we run an experiment on the target system, measurements of the random variable will have a distribution as described above. This can be called a reference response. If we run the same experiment on the ns-3 model under identical simulated conditions, measurements of the random variable in that environment will also have some distribution or response. In order to validate the ns-3 model, we need to demonstrate that measurements of the ns-3 model observable are drawn from the same distribution as measurements of the target system observable to some level of statistical significance. In other words, we are looking for results that do not force us to reject the null hypothesis (H0) of identical distributions in favor of the alternate hypothesis (Ha). The chi-squared test for goodness-of-fit is commonly used in such situations.
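
As a concrete sketch of such a goodness-of-fit check, assuming SciPy is available and using made-up bin counts in place of real measurements, one might write:

 # Sketch of a chi-squared goodness-of-fit check, assuming SciPy.  The
 # observed counts stand in for binned measurements of the ns-3 model
 # observable; the expected counts stand in for the reference response
 # measured on the target system.  All numbers are illustrative only.
 from scipy.stats import chisquare
 
 observed = [ 96, 110, 103,  91, 100]   # binned ns-3 model measurements
 expected = [100, 100, 100, 100, 100]   # binned target-system reference
 
 statistic, p_value = chisquare(f_obs=observed, f_exp=expected)
 
 # If the p-value exceeds the chosen significance level we fail to
 # reject H0: the model measurements are consistent with the reference.
 alpha = 0.05
 print("CONFORMS" if p_value > alpha else "DOES NOT CONFORM")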

The description of the problem above leads us to conclude that the stochastic part of the ns-3 validation toolkit should really be a collection of tools used for statistical analysis of experimental data. There are a number of pieces to the puzzle:

  • For an experiment/test, how does one drive the tests and store the inputs or stimuli;
  • For an experiment/test, how does one collect and organize data collected from the target system and determine and specify the real distributions of the random variables;
  • For an experiment/test, how does one collect and organize data collected from the ns-3 model and determine and specify the simulated distributions of the random variables;
  • How does one actually perform the tests that determine whether the ns-3 model passes or fails (is consistent or not consistent with the target system);
  • How does one use the data to advertise compliance of some sort in a model;
  • How does one translate the validation tests into an ongoing software test suite.

We clearly don't want to get into the business of organizing experiments done on the target system, but we do need to figure out how to get information about the results of real experiments into the ns-3 validation framework as some form of reduced data description. We do need to be able to run simulations in ns-3 in order to collect data generated by the ns-3 models under validation. This implies some kind of statistics gathering framework, perhaps like Joe Kopena's framework. We need to be able to perform statistical analysis on the gathered data in order to reduce the data and we need to be able to perform various tests of statistical inference such as chi-square and least-squares fitting to do the null hypothesis testing.
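
A minimal sketch of the data-reduction step, assuming NumPy and using synthetic samples in place of data gathered from an ns-3 model, might look like this:

 # Sketch of reducing raw gathered samples to a "reduced data
 # description": a binned histogram plus the first two moments.  NumPy
 # is assumed, and the synthetic samples are purely illustrative.
 import numpy as np
 
 samples = np.random.default_rng(seed=1).exponential(scale=2.0, size=1000)
 
 counts, bin_edges = np.histogram(samples, bins=20)
 mean = samples.mean()
 variance = samples.var(ddof=1)
 
 # The reduced description (bin edges, counts, mean, variance) is what
 # would be stored and later compared against the corresponding reduced
 # description of the target-system measurements.
 print("mean =", mean, "variance =", variance)
 print("counts =", counts.tolist())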

We need to be able to construct such tests and be able to point to them easily to advertise the fact that our models have been validated. We also need to be able to use these tests to implement an ongoing software test strategy that has the feature of ensuring that our models continue to pass the validation tests over time.

It seems that, at a basic level, we are talking about:

  1. A data-gathering toolkit reminiscent of the stats framework that allows us to run sets of experiments that generate data from the ns-3 models under test;
  2. A data-reduction toolkit that allows us to take the generated data and reduce it to some distribution with associated moments;
  3. A statistical analysis toolkit that allows us to make comparisons between an expected distribution and a measured distribution;
  4. A toolkit that allows for proper display of statistical data for inclusion in the web site that is the deliverable of the validation process;
  5. A testing framework that allows us to drive all of this automatically so we can use the validation test suite in the software test environment.

Validating Models Using Deterministic Methods

In this case, the part of the target system to be validated is ultimately based on a logical specification. For this kind of validation, deterministic tests will be in order. These techniques might be used, for example, to validate the "TCP as defined by RFC 793" model described above.

As in the stochastic case, the goal is to compare the behavior of the target system to the ns-3 model in some set of ways. We must also identify some behavior, or observable, to be validated and then design an experiment to determine whether or not the ns-3 model behaves in a way that is consistent with the target system in that respect. We also want to propose tests that the ns-3 model would fail if it were not operating consistently with the target system. The only difference is that in each test case, there is a single deterministically repeatable correct response to a given stimulus. What does that really mean?

In the deterministic case, we are talking about a repeatable, definite response. The current approach in ns-3 is to exercise the system in some way using an example program and capture the expected response of the system in pcap trace files. The test to determine whether or not the model conforms is a diff of the captured pcap traces against a set of reference traces. Unfortunately, there is a lot wrong with this approach on several levels.
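
For concreteness, the current style of check amounts to something like the following sketch; the example name, trace file names and waf invocation are illustrative only.

 # Sketch of the current style of deterministic check: run an example
 # program, then compare the pcap trace it produces against a stored
 # reference trace.  The program name and file paths are hypothetical.
 import filecmp
 import subprocess
 
 subprocess.run(["./waf", "--run", "some-example"], check=True)
 
 if filecmp.cmp("some-example-0-0.pcap", "ref/some-example-0-0.pcap",
                shallow=False):
     print("PASS: trace matches reference")
 else:
     print("FAIL: trace differs from reference")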

First, example programs should be designed to explain clearly how the system works, not be overloaded by having to serve as regression tests. It is nice to have the ability to verify that the example programs continue to work as expected, but we should not rely on example programs to test our system.

Second, using pcap traces as the mechanism underlying the checks results in a hugely bloated reference trace repository. As of this writing, the cumulative sizes of the reference traces for ns-3-dev approach eighty-three megabytes. Much of this is redundant. For example, ARP exchanges are most likely checked hundreds of times in the examples as nodes ARP for addresses.

Third, using example programs does not result in directed tests of modules or models. Test coverage is a side-effect of example execution, and there is no explicit determination that what results is correct behavior. We simply ensure that existing behavior does not change. We rely on third-party tools such as tcpdump or WireShark to be able to parse our trace files as a smoke test for correctness. We rely on someone volunteering to look at these trace files and declare them reasonable.

Fourth, there is no documentation of what is tested and how. We really have no way of advertising, for example, that the congestion window and slow start threshold of our TCP implementation behave as they would in a real TCP implementation. We have examples that transmit large amounts of data, and someone may have looked at trace files to see that something reasonable happened, but there are no cwnd or ssthresh tests.

There are a number of basic pieces to the puzzle:

  • For an experiment/test, how does one drive the tests and store the inputs or stimuli;
  • For an experiment/test, how does one collect and organize data collected from a target system and determine and specify the expected responses;
  • For an experiment/test, how does one collect and organize data collected from the ns-3 model and determine and specify the actual responses;
  • How does one actually perform the tests that determine whether the ns-3 model passes or fails (is consistent or not consistent with the target system);
  • How does one use the data to advertise compliance of some sort in a model;
  • How does one translate the validation tests into an ongoing software test suite.

Kinds of Testing

We listed a number of different kinds of software testing above. Some kinds of tests, such as performance tests and acceptance tests, are outside the scope of this project. Verification and validation (in the software test engineering sense) can be accomplished by combinations of more "primitive" kinds of tests, so out of all of the kinds of software testing, we need to be able to address:

  • Unit testing: Tests minimal software components, or modules. Each unit is tested to verify that the detailed design for the unit has been correctly implemented;
  • Integration testing: Tests for defects in the interfaces and interaction between units. Progressively larger groups of units may be integrated and tested;
  • Regression testing: Tests performed to uncover functionality that previously worked correctly but has stopped working as intended;
  • System testing: Checks that a completely integrated system meets its requirements.

We already have a reasonable unit testing strategy in place -- it just needs to be expanded and coverage analysis performed. System testing is really the large-scale limit of integration testing, and so these can be treated as one facility. We clearly need to replace our regression testing process. So, of the different types of testing listed above, we need to provide new environments for:

  1. Integration testing;
  2. Regression testing.

Requirements Round Up

The document so far is a hodge-podge of use cases, existing problems and goals. We need to start translating that into at least an informal set of requirements, and then a prototype to see if this all makes sense.

A New Integration Testing Facility

There is currently no mechanism in ns-3 for integration testing. We need a way to verify that units work together correctly. For example, an integration test may ensure that a wifi-net-device and a yans-wifi-phy work together as expected; or that the different modules of the TCP protocol work together. This may include system-level tests to determine that TCP integrates with other modules in the system such as IP, path MTU discovery, etc.
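
The flavor of such a test, stripped of any actual ns-3 classes, is sketched below in plain Python; the Channel and Device classes are hypothetical stand-ins chosen only to show two units being exercised together rather than in isolation.

 # Minimal sketch of an integration test: two small components are wired
 # together and the test checks their interaction, not either component
 # in isolation.  The classes are hypothetical stand-ins, not ns-3 code.
 class Channel:
     """Delivers whatever a device sends to every other attached device."""
     def __init__(self):
         self.devices = []
     def attach(self, device):
         self.devices.append(device)
     def transmit(self, sender, packet):
         for device in self.devices:
             if device is not sender:
                 device.receive(packet)
 
 class Device:
     def __init__(self, channel):
         self.channel = channel
         self.received = []
         channel.attach(self)
     def send(self, packet):
         self.channel.transmit(self, packet)
     def receive(self, packet):
         self.received.append(packet)
 
 def test_device_and_channel_work_together():
     channel = Channel()
     a, b = Device(channel), Device(channel)
     a.send("ping")
     assert b.received == ["ping"] and a.received == []
 
 if __name__ == "__main__":
     test_device_and_channel_work_together()
     print("PASS")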

A New Regression Testing Facility

The current regression mechanism in ns-3 is poor. We should build on the results of the new integration testing facility to implement a new regression mechanism that does not rely on large numbers of raw trace files.

A New Verification and Validation Facility

There is clearly a need for environments to do stochastic and deterministic model validation and a clear migration path for using tests developed as part of the integration and regression testing suites.

A Prototype Stochastic Process Validation Framework

I have put together the beginnings of a framework to do validation of stochastic processes. You can find the code in my private repository at http://code.nsnam.org/craigdo/ns-3-valver if you want to take a look. I have a new top-level directory validation in which you will find a single subdirectory rng which does some stochastic tests for a subset of the ns-3 random number generators.

This uses chi-square tests for goodness of fit to validate that our random number generators do produce random numbers according to the distribution they advertise. The programs in this directory can be used as a regression suite to verify that the random number generators do not change over time and also can be used to produce graphics suitable for presentation in a web site describing the validation. I also have a prototype of web pages (wiki actually) to show what we have in mind for the presentation part. Take a look at StochasticModelValidation.

A Prototype Deterministic Process Verification Framework

I have put together the beginnings of a framework to do verification of deterministic processes. You can find the code in my private repository at http://code.nsnam.org/craigdo/ns-3-valver if you want to take a look. I have a new top-level directory verification in which you will find a single subdirectory tcp which does some deterministic tests for a subset of the ns-3 TCP model functionality.

The Hard Way, The Easy Way, and My Way

As it stands in ns-3, it is quite simple to come up with a test. A user can write a script, turn on tracing and write a tiny Python program, and she has a test. This really isn't necessarily a very good test, but we shouldn't be fascist about such things. We should admit the possibility that not everyone will be terribly interested in coming up with excruciatingly detailed validation, verification and integration tests. We should be pragmatic about it and accept simple, but not very high quality, tests done the easy way. This means we should actually retain something that works like the current regression tests.

We should also admit the possibility that someone might want to provide a very high fidelity simulation model and will want to spend the time to completely isolate their model from the rest of the system and carefully check many finely granular input-output transformations in great detail. Our environment should allow people to do this -- the hard way. We should have a test harness that allows people to drill down to extremely fine-grained testing.

We should also recognize that there is probably a continuum of test strategies between these. A user might want to verify part of their model in great detail, but leave parts which are less interesting to testing the easy way. We should admit the possibility that users will want to do it their own way and make the environment flexible enough to work along this continuum.

What Does it Look Like

Well, no code is written, but the back-of-the-envelope version for a TCP test done "the hard way" would look something like this:

 +----+
 |    |    +--------------------+  +-------------------+
 |    |    | Test Vector Source |  | Test Vector Sink  |
 |    |    +--------------------+  +-------------------+
 |    |               |                     ^
 |    |               v                     |
 |    |    +-------------------------------------------+    +-------------+
 |    |    |                TCP Under Test             | -> | Trace Sinks |
 |    |    +-------------------------------------------+    +-------------+
 |    |               |                     ^
 |    |               v                     |
 |    |    +-------------------+  +--------------------+
 |    |    | Test Vector Sink  |  | Test Vector Source |
 |    |    +-------------------+  +--------------------+
 |    |
 |    +------------------------------------------------+
 |                     Test Environment                |
 |                                                     |
 |    test "orchestrator," ns-3 core, simulator, etc.  |
 |-----------------------------------------------------+

You can probably imagine that the upper "Test Vector Source" might be a module making calls into the "TCP Under Test" and the upper "Test Vector Sink" might be methods hooked into the TCP callbacks. You could see that the lower "Test Vector Sink" might be a module operating in place of an IP (outbound) protocol and the lower "Test Vector Source" would operate as an IP (inbound) protocol. You basically isolate the TCP protocol and look at all of its inputs and outputs and determine that it did exactly what was expected.
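
A very rough sketch of this idea, written in plain Python rather than against the actual ns-3 test API, is shown below; all of the class and method names are hypothetical placeholders.

 # Sketch of the "hard way": the unit under test is wired between a
 # scripted test vector source and a recording test vector sink, and the
 # test asserts on exactly what came out.  Names are hypothetical.
 class TestVectorSink:
     """Records everything the unit under test emits downward."""
     def __init__(self):
         self.received = []
     def send(self, segment):
         self.received.append(segment)
 
 class TinyUnitUnderTest:
     """Stand-in for the real protocol object being isolated."""
     def __init__(self, lower_layer):
         self.lower_layer = lower_layer
     def write(self, data):
         # A real TCP model would segment data, add headers and manage
         # connection state; this stand-in just forwards a tagged tuple.
         self.lower_layer.send(("SEG", data))
 
 def test_write_produces_expected_segments():
     sink = TestVectorSink()                  # lower test vector sink
     uut = TinyUnitUnderTest(sink)
     for vector in ["hello", "world"]:        # upper test vector source
         uut.write(vector)
     assert sink.received == [("SEG", "hello"), ("SEG", "world")]
 
 if __name__ == "__main__":
     test_write_produces_expected_segments()
     print("CONFORMS")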

If I were to imagine a TCP test done "the easy way," I would replace some of the sources and sinks with more pieces of the system.

 +----+
 |    |    +-------------------------------------------+
 |    |    |      ns-3 scripted traffic source/sink    |
 |    |    +-------------------------------------------+
 |    |               |                     ^
 |    |               v                     |
 |    |    +-------------------------------------------+    +-------------+
 |    |    |                TCP Under Test             | -> | Trace Files |
 |    |    +-------------------------------------------+    +-------------+
 |    |           |      ^                     
 |    |           v      |              
 |    |    +-------------------+                            +-------------+
 |    |    |     Ip stack      | -------------------------> | Trace Files |
 |    |    +-------------------+                            +-------------+
 |    |           |      ^  
 |    |           v      |
 |    |    +-------------------+                            +-------------+
 |    |    | Simple Net Device | -------------------------> | Trace Files |
 |    |    +-------------------+                            +-------------+
 |    |           |      ^  
 |    |           v      |
 |    |    +-------------------------------------------+
 |    |    |     ns-3 scripted traffic source/sink     |
 |    |    +-------------------------------------------+
 |    |
 |    +------------------------------------------------+
 |                     Test Environment                |
 |                                                     |
 |    test "orchestrator," ns-3 core, simulator, etc.  |
 |-----------------------------------------------------+

This is basically what we have now: a script that exercises some part of the system and captures trace files.


Craigdo 20:40, 17 April 2009 (UTC)