Planning reliable and efficient statistical tests

Paul Boughton
For any job, you need a tool that offers the right amount of power for the task at hand. You would not use a telescope to examine a stamp collection, or a handheld magnifying glass to search for new galaxies, because neither would provide you with meaningful observations. To complicate matters, if detecting a galaxy really was your goal, the cost of gaining the necessary power might be more than you can afford.

Anyone using statistical tests faces the same issues. You must consider the precision you need to meet your goals - should your test detect subtle effects or massive shifts? You must balance that precision against the cost of sampling your population - are you testing light bulbs or jet engines? And you want confidence in your results that is appropriate for your situation - testing seat belts demands a greater degree of certainty than testing shampoo. We measure this certainty with statistical power: the probability that your test will detect an effect that truly exists.

Power and Sample Size tools, together with Power Curves, help you balance these competing demands on your limited resources.

Here are examples of how a quick Power and Sample Size analysis can help you save time and money while ensuring you get results you can trust. The following calculations were conducted using Minitab Statistical Software, although Power and Sample Size analysis can be performed in various statistical software packages.

Detecting changes

A ball bearing manufacturer wants to detect significant changes in the diameter of its bearings. To ensure the diameters are on target, 100 ball bearings are routinely sampled and measured. The measurements are then used in a 1-sample t-test to determine whether the bearing diameters are on target. But a sample of 100 is larger than necessary and makes the test overly sensitive.

To evaluate the power of the test, three pieces of information are needed:

- Sample size.

- Standard deviation.

- Difference.

From historical process data, engineers know that the standard deviation of the bearing diameters is typically 0.004 cm.

To choose a meaningful difference value, the engineers must determine the shift from the mean that will lead to unacceptable diameters.

For these bearings, the engineers consider a change of 0.002 cm important enough to warrant adjusting the equipment.

Using this information, the power of a 1-sample t-test computed using 100 bearings can easily be compared to a test that uses far fewer bearings. This Power Curve shows they are wasting resources on excessive precision (Fig 1).

A sample size of just 50, half the original sample size, will detect meaningful differences of 0.002 cm without detecting a shift from the target at every negligible bump in the process.
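The bearing comparison can be sketched directly from the noncentral t distribution. This is an illustrative Python sketch, not the Minitab computation itself: the 0.004 cm standard deviation and 0.002 cm difference come from the example, while the two-sided test and 0.05 significance level are assumptions.

```python
from scipy import stats

def t_test_power(diff, sd, n, alpha=0.05):
    """Power of a two-sided 1-sample t-test to detect a shift of `diff`."""
    delta = (diff / sd) * n ** 0.5            # noncentrality parameter
    df = n - 1
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-sided critical value
    # P(|T| > t_crit) when T follows the noncentral t alternative
    return (1 - stats.nct.cdf(t_crit, df, delta)
            + stats.nct.cdf(-t_crit, df, delta))

# Bearing example: sd = 0.004 cm, meaningful shift = 0.002 cm
print(round(t_test_power(0.002, 0.004, 100), 3))  # n = 100
print(round(t_test_power(0.002, 0.004, 50), 3))   # n = 50
```

With these numbers, both sample sizes comfortably exceed 80 per cent power, which is why the smaller sample suffices.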

With Power Curves, you can easily calculate the sample size you need, saving the time that would otherwise be wasted sampling and measuring parts your test doesn't require.

While larger sample sizes may be unnecessary, collecting too little data risks failing to detect a significant difference even when one exists. For example, a lumber company samples 10 beams to test whether their strength meets the target.

According to the Power Curve, this small sample size makes their test incapable of detecting important effects. Even if the strength of the lumber were not on target, the t-test could produce a p-value suggesting otherwise. To achieve a statistical power of at least 80 per cent, the company must sample 24 beams, not 10, to detect meaningful differences.
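Finding the minimum sample size is a simple search: increase n until power reaches the target. The sketch below assumes a two-sided 1-sample t-test at alpha 0.05, and the standardized effect of 0.6 (shift divided by standard deviation) is a hypothetical value chosen for illustration, since the article does not give the beams' standard deviation or target shift.

```python
from scipy import stats

def t_test_power(effect, n, alpha=0.05):
    # Two-sided 1-sample t-test power at standardized effect `effect`
    df = n - 1
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    delta = effect * n ** 0.5
    return (1 - stats.nct.cdf(t_crit, df, delta)
            + stats.nct.cdf(-t_crit, df, delta))

def min_sample_size(effect, target_power=0.80, alpha=0.05):
    """Smallest n whose power reaches `target_power`."""
    n = 2
    while t_test_power(effect, n, alpha) < target_power:
        n += 1
    return n

# Hypothetical standardized effect of 0.6, 80 per cent power target
print(min_sample_size(0.6))
```

The same loop, run with the real standard deviation and shift, is how a Power Curve's sample-size answer is obtained.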

See the big picture

A power analysis helps you weigh your resources against your demands, and quantifies a test's ability to answer your question. It can expose design problems, like the lumber company's insufficient sample size. It can also reveal design solutions you hadn't considered.

Take for instance the packaging plant of a food products company. The food is enclosed in plastic bags. While the bags ensure the food is fresh, customers complain that the plastic bags are sealed too tightly and too much force is required to open them. The company believes the glue is too strong, so researchers use a One-way Analysis of Variance (ANOVA) to compare their current glue with three potential replacements.

Differences in seal strength of less than 10 newtons are undetectable to most people, so their test only needs to detect a difference of 10. A power value of 80 per cent is acceptable, but 90 per cent is ideal. What sample size meets their needs?

Thirty samples of each glue ensure the test detects a difference of 10 newtons with 90 per cent power (Fig 2). Or, they could detect the same difference with 23 samples and 80 per cent power. If the researchers determine that 80 per cent power is sufficient, they may consider using the smaller sample size, sacrificing unneeded power to save on the cost of the experiment.
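One-way ANOVA power comes from the noncentral F distribution. The sketch below is illustrative only: the within-group standard deviation of 10 N and the least-favourable arrangement of means (two glues 10 N apart, the others at the grand mean) are assumptions - the article states neither - while the four glues and alpha of 0.05 follow the example.

```python
from scipy import stats

def anova_power(n, k, max_diff, sd, alpha=0.05):
    """Power of a one-way ANOVA with k groups of n observations, where two
    group means sit `max_diff` apart and the rest at the grand mean (the
    least-favourable configuration for a given maximum difference)."""
    # Sum of squared mean deviations: 2 * (max_diff / 2)**2 = max_diff**2 / 2
    lam = n * (max_diff ** 2 / 2) / sd ** 2      # noncentrality parameter
    dfn, dfd = k - 1, k * (n - 1)
    f_crit = stats.f.ppf(1 - alpha, dfn, dfd)
    return 1 - stats.ncf.cdf(f_crit, dfn, dfd, lam)

# Glue example: 4 glues, detect a 10 N difference; sd = 10 N is assumed
for n in (23, 30):
    print(n, round(anova_power(n, 4, 10.0, 10.0), 3))
```

Under these assumptions, 30 samples per glue land near 90 per cent power and 23 near 80 per cent, mirroring the trade-off the researchers face.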

The Power Curve illustrates this information, but it also charts every other combination of power and difference for a given sample size (see Fig 3).

The black line indicates researchers can attain 90 per cent power with just 23 samples if they are willing to seek a difference of 12 instead of 10. This might just be the ideal choice.
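Tracing power across a range of differences at a fixed sample size is exactly what a Power Curve charts. The loop below sketches that trace for 23 samples per glue; as in the example, the 10 N within-group standard deviation and the two-means-apart arrangement are illustrative assumptions, not values from the study.

```python
from scipy import stats

def anova_power(n, k, max_diff, sd, alpha=0.05):
    # One-way ANOVA power: two means `max_diff` apart, rest at the grand mean
    lam = n * (max_diff ** 2 / 2) / sd ** 2      # noncentrality parameter
    dfn, dfd = k - 1, k * (n - 1)
    f_crit = stats.f.ppf(1 - alpha, dfn, dfd)
    return 1 - stats.ncf.cdf(f_crit, dfn, dfd, lam)

# One power curve: n = 23 per group, 4 glues, assumed sd = 10 N
for diff in (6, 8, 10, 12, 14):
    print(diff, round(anova_power(23, 4, diff, 10.0), 3))
```

Reading along such a curve shows how relaxing the difference to detect (say, from 10 to 12) buys extra power at the same sample size.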

Putting power curves to use

Without knowing the power of your test, it is hard to know if you can trust your results: your test could be too weak to answer your question, or unreasonably strong for your needs.

Power Curves, available for many common statistical procedures, help you balance your resources against your goals and design a test you can trust that costs no more than necessary.

Power Curves graph the dynamic relationships that define power, revealing the big picture and ensuring no option escapes your consideration. And perhaps most importantly, they make power analysis an easier and more accessible part of every project.

Empower your test. Trust your results.

Tom Bowler is Statistical Technical Communication Specialist, Minitab Inc, State College PA, USA.