Comparing Actual Results to Expected Results with the Log Likelihood Ratio

One of the core principles of software testing is that to determine a pass/fail result for a test case, you must execute the test case and then compare the actual result or state of the system under test (SUT) with an expected result or state. For example, if you are testing the + functionality of a calculator, and the test case inputs are 3.0 and 4.0 (so the expected result is 7.0), then you exercise the SUT to get an actual result and compare that actual result with the expected result.

In statistics, the most common way to compare a set of observed values with a set of expected values is to use the well-known chi-square test. However, what is not so well known is that the chi-square statistic is actually an approximation to the log likelihood ratio statistic. The chi-square test was developed in the days before calculators, when computing logarithms was difficult. The point is that in software testing, if you want to measure how close a set of actual values is to a set of expected values, you should probably use the log likelihood g-test rather than the chi-square test.

The g statistic is given by g = 2 * sum-over-i(Oi * ln(Oi / Ei)), where Oi is an observed value and Ei is the corresponding expected value. For example, suppose you have some system which should emit the three values (4.0, 4.0, 4.0). These are the expected values. If the actual results are (3.0, 4.0, 6.0), then the g statistic is 2 * [(3.0 * ln(3.0/4.0)) + (4.0 * ln(4.0/4.0)) + (6.0 * ln(6.0/4.0))] = 3.139. The closer the g statistic is to 0, the closer the actual results are to the expected results; you can look up specific probabilities if necessary (the g statistic approximately follows a chi-square distribution).
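
To make the arithmetic concrete, here is a minimal Python sketch of the g statistic calculation. The function name g_statistic is my own choice for illustration, and treating a zero observed value as contributing nothing to the sum is an assumption (it follows the usual convention that 0 * ln(0) is taken as 0).

```python
import math

def g_statistic(observed, expected):
    """Return 2 * sum(O_i * ln(O_i / E_i)) over paired observed/expected values."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    total = 0.0
    for o, e in zip(observed, expected):
        if e <= 0:
            raise ValueError("expected values must be positive")
        # Assumption: a zero observed value contributes 0 (limit of x * ln(x) as x -> 0).
        if o > 0:
            total += o * math.log(o / e)
    return 2.0 * total

# The example from the text: expected (4.0, 4.0, 4.0), actual (3.0, 4.0, 6.0).
g = g_statistic([3.0, 4.0, 6.0], [4.0, 4.0, 4.0])
print(round(g, 3))  # 3.139
```

When the actual values match the expected values exactly, every logarithm is ln(1) = 0, so the function returns 0.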

2 Responses to Comparing Actual Results to Expected Results with the Log Likelihood Ratio

Unknown says:

July 14, 2009 at 6:10 am

Hi, can you elaborate on how you have applied this statistical method in software testing? Thanks, Bertrand

James says:

July 17, 2009 at 8:40 am

Recently I was asked to test some data mining software that automatically places SQL data into clusters of similar data. For numeric data this is not a problem. But for categorical data such as (Red, Large), testing how well the data has been clustered into groups is not so easy. Part of the approach I used was to look at the frequencies of the clustered data. If the clustering was random (and therefore not very effective), you'd expect a roughly equal number of SQL data column values in every cluster. By computing the g statistic for the actual clustering results compared to an even distribution (these are the implied expected results), I was able to compute a measure of quality for the system under test. So the g statistic can be used in any testing situation where the SUT generates a set of values, and you can determine a meaningful set of expected values.
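
As a rough sketch of that idea (not the actual test harness), the Python below compares how often one categorical value lands in each cluster against an even split. The function name g_against_uniform and the counts are hypothetical, made up purely for illustration.

```python
import math

def g_against_uniform(counts):
    """g statistic of per-cluster counts versus an even split across clusters."""
    total = sum(counts)
    expected = total / len(counts)  # implied expected count per cluster
    # Assumption: clusters with a zero count contribute 0 to the sum.
    return 2.0 * sum(c * math.log(c / expected) for c in counts if c > 0)

# Hypothetical counts of the value "Red" across three clusters.
even = g_against_uniform([10, 10, 10])   # looks like random placement
skewed = g_against_uniform([25, 4, 1])   # "Red" concentrated in one cluster
print(even, skewed)
```

A value near 0 means the counts are indistinguishable from an even spread, and a larger value means the clustering departs further from it.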
