Testing code with random output

Many real-world applications involve code whose output contains a certain degree of randomness. Consider, for example, code for the analysis of measurements. Measurements typically result in data randomly distributed according to some well-defined distribution. The task of the analysis code may then involve statistical analyses, the results of which are, again, randomly distributed according to some distribution.

The following simplified example illustrates the idea: a quantity is measured which grows linearly with time according to the law $y = f(t) = a + bt$. The task of the code is to determine the rate of the growth. The measurement process adds small, normally distributed random fluctuations to the measured quantity. An example of such a dataset is shown below.
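A minimal sketch of how such a dataset might be simulated in Python; the parameter values, noise level, and number of points are illustrative assumptions, not taken from the example above:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

a_true, b_true = 1.0, 0.5   # assumed true intercept and slope
sigma = 0.1                 # assumed standard deviation of the fluctuations

t = np.linspace(0.0, 10.0, 20)                                 # measurement times
y = a_true + b_true * t + rng.normal(0.0, sigma, size=t.size)  # noisy measurements
```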

Usually this kind of problem is solved by minimization of the $\chi^2$ function:

$$\chi^2 = \sum_{i=1}^{N} \frac{(y_i - f(t_i))^2}{\sigma^2}$$

The intercept $a$ and slope $b$ which minimize the $\chi^2$ expression are the best estimates of the parameters of the function $f(t)$ underlying the data.
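A sketch of such a fit using scipy.optimize.minimize, continuing from the simulated t, y, and sigma above; for a constant sigma this is equivalent to an ordinary least-squares fit:

```python
import numpy as np
from scipy.optimize import minimize

def chi2(params, t, y, sigma):
    """Chi-squared of the linear model y = a + b*t with constant uncertainty sigma."""
    a, b = params
    return np.sum((y - (a + b * t)) ** 2) / sigma ** 2

# t, y, and sigma are the arrays and value from the simulation sketch above
result = minimize(chi2, x0=[0.0, 0.0], args=(t, y, sigma))
a_fit, b_fit = result.x   # fitted intercept and slope
chi2_min = result.fun     # chi-squared at the minimum
```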

After having tested the basic functionality of the code (e.g., that it produces a result at all), you will want to test that the minimization routine really outputs a value close to the theoretical optimum. This cannot be proven in a single run of the code.

Normally in such situations, one evaluates the quality of the analysis using statistical tests. For example, if the fit procedure is repeated many times, the resulting $\chi^2$ values will be distributed according to the $\chi^2$ distribution with the number of degrees of freedom equal to the number of measurement points minus the number of optimization parameters. This can be tested, for example, by running the fitting code on multiple samples of simulated data, mimicking the distributions characteristic of the measured data.
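A sketch of such a check, reusing the simulation parameters and the chi2 function from the sketches above; the 5000 trials are an illustrative choice:

```python
import numpy as np
from scipy.optimize import minimize

n_trials = 5000
ndof = t.size - 2            # measurement points minus two fitted parameters

chi2_values = np.empty(n_trials)
for i in range(n_trials):
    # simulate a fresh dataset from the same model and noise level
    y_sim = a_true + b_true * t + rng.normal(0.0, sigma, size=t.size)
    fit = minimize(chi2, x0=[0.0, 0.0], args=(t, y_sim, sigma))
    chi2_values[i] = fit.fun
```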

In a first approach one may want to just check by eye whether the theoretical $\chi^2$ curve and the distribution from the simulation agree. The figure below shows the $\chi^2$ distribution from nearly 5 thousand rounds of the $\chi^2$ optimization with simulated data for our simple example of a linearly growing quantity. The theoretical $\chi^2$ curve is shown in red. While this quick check is useful to sort out major bugs in the code, it is still not a quantitative test and is not useful for automated testing.
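A plot of this kind could be produced along the following lines, reusing chi2_values and ndof from the sketch above (matplotlib and scipy assumed):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(0.0, 3.0 * ndof, 300)
plt.hist(chi2_values, bins=50, density=True, label="fits on simulated data")
plt.plot(x, stats.chi2.pdf(x, df=ndof), "r-", label=f"theoretical chi2, ndof={ndof}")
plt.xlabel("chi-squared")
plt.ylabel("probability density")
plt.legend()
plt.show()
```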

A statistical test of the agreement between the data and the theoretical $\chi^2$ curve in the figure above involves a quantitative comparison of well-defined features of both distributions. For example, one may compare the mean values of the distributions. The mean value of the theoretical $\chi^2$ distribution is equal to its number of degrees of freedom.

With a large number of trials, the mean of the measured distribution is expected to approach the theoretical value. But the nature of random values is such that there is always a non-zero probability for the measured mean to depart from the theoretical value by more than a predefined tolerance.
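For the mean, the scale of the expected departure is easy to quantify. A sketch, continuing with the chi2_values from the simulation loop above: the theoretical $\chi^2$ distribution has mean ndof and variance 2*ndof, so the standard error of the mean over n_trials fits is sqrt(2*ndof/n_trials).

```python
import numpy as np

mean_measured = chi2_values.mean()
std_error = np.sqrt(2.0 * ndof / n_trials)   # expected spread of the measured mean
print(f"measured mean: {mean_measured:.2f}, expected: {ndof}, standard error: {std_error:.3f}")
```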

Automated testing and statistics

An automated test determines whether the code output is correct using a criterion that can be evaluated as true or false. A major issue when using statistical tests for automated testing of code with random output is the small but non-zero probability that correct code fails the test.

In our example with the $\chi^2$ distribution, the test might assert whether the mean $\chi^2$ from the runs with simulated data agrees within a specified tolerance with the expected value from statistical theory. Other properties of the distribution can also be used for such quantitative tests of agreement.
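As a sketch, such an assertion could be written as a pytest-style test around the simulation loop above; the five-standard-error tolerance is an illustrative choice, not a recommendation from the text:

```python
import numpy as np

def test_chi2_mean_matches_theory():
    # chi2_values, ndof, and n_trials come from the simulation loop above;
    # a tolerance of five standard errors makes a false fail very unlikely
    tolerance = 5.0 * np.sqrt(2.0 * ndof / n_trials)
    assert abs(chi2_values.mean() - ndof) < tolerance
```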

However, there is always a non-zero probability that a correct analysis code fails the test. Such randomness can raise a false alarm and set you on a bug hunt unnecessarily.

In most real-life scenarios, the measurement process is simple enough that the data can be simulated reasonably fast and the distributions of the measured data can be reproduced well in the simulation. In such cases, the problem of the non-zero probability of a false fail can be solved by brute force, using a simulated data sample large enough that the probability of a false fail becomes negligible for all practical purposes.
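A rough sketch of the brute-force reasoning: for a fixed tolerance on the mean $\chi^2$, the false-fail probability for correct code shrinks rapidly with the number of simulated trials. The numbers below are illustrative and rely on the Gaussian approximation for the distribution of the mean.

```python
import numpy as np
from scipy import stats

ndof = 18          # illustrative: 20 measurement points, 2 fitted parameters
tolerance = 0.5    # illustrative fixed tolerance on the mean chi-squared

for n_trials in (1_000, 10_000, 100_000):
    std_error = np.sqrt(2.0 * ndof / n_trials)   # standard error of the mean
    k = tolerance / std_error                    # tolerance in units of standard errors
    p_false_fail = 2.0 * stats.norm.sf(k)        # two-sided Gaussian tail probability
    print(f"n_trials={n_trials:>7}: false-fail probability ~ {p_false_fail:.1e}")
```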

Occasionally the measurement process is complex enough that simulating the measurements takes considerable CPU time and the brute-force approach becomes too expensive. Such examples are often found in experiments with particle colliders and large detectors, such as those performed at CERN. More sophisticated approaches to code verification are needed in such situations.

Conclusions

Analysis procedures for data with random fluctuations can be tested using time-proven statistical tests, and so can the code implementing those procedures. However, statistical tests are of a very different nature from the deterministic unit tests commonly used for automated testing of code correctness.

The automation of statistical tests faces the difficulty that there is always a non-zero probability for correct code to fail a deterministic pass/fail criterion. Often this problem can be solved by using a sample of simulated data large enough that the probability of a false fail is negligible.

The simulated data will have to be generated from distributions that are as close as possible to the distributions of the real measured data. A prerequisite for this is detailed knowledge of the behavior of the measured quantities and of the response of the measurement instrumentation to them.