Balancing Confidence and Power for Decision Making
Introduction
Tests of consumer preference are performed to reduce risk to the manufacturer. A new product formulation may be compared to an existing product using this type of testing with the intent of "improving" the product line. Improvement may come from increased profits from existing share or from increased share. Inherent in any product change is the chance that the product modification is to the detriment of the manufacturer. The manufacturer risks loss in sales if the new product is worse than the current one. Conversely, there is the risk of losing the chance to increase profits by not producing a cheaper, yet equally preferable, product. Unfortunately, statistical analyses performed on product test data rarely take these risks into account. Research on Research Paper Number 27 touches on this aspect. This paper presents an example of a product test, relating monetary risks to levels of significance and power of the test of product preference.
Example
Consider a manufacturer that is planning to test consumer preference for two products: "A," currently on the market, and "B," a cheaper variation of "A." If management can produce product "B" without disrupting the $100 million franchise built on sales of product "A," $1 million per year in production costs will be saved. The objective of the preference test is to determine whether product "B" is at least as preferable as product "A"; management will produce product "B" if it is at least equal in preference to "A."
A test among consumers will be performed to assess this difference in preference. Consumers will taste both products and then be asked to state their preference for either product "A" or "B." (For simplicity of exposition, a response of "no preference" is not allowed.) The proportions of respondents preferring each product are then compared.
Outcomes
A comparison of the preference proportions yields three possible outcomes: "B" is worse than "A," "B" is better than "A," or the two products are equally preferred. However, the manufacturer has decided to introduce product "B" if the test indicates either that product "B" is more preferred than "A" or that there is no difference in preference. Therefore, future production for the manufacturer can be determined by combining these outcomes such that only two competing results need to be considered when performing statistical significance testing: 1) "B" is worse than "A" versus 2) "B" is equal to or better than "A." In statistical theory, these are referred to as the alternative hypothesis and null hypothesis, respectively. Further, the outcomes to be tested correspond to a one-tail test of significance. These hypotheses can be taken as conjectures about how the products truly compare among all members of the population to whom inference about the product test is to be made. These population situations are cross-referenced by possible sample outcomes in Table One. Note that sampling variability may lead the manufacturer to a conclusion different from the one that should be made.
Decision Errors
Since the manufacturer's decision of whether to replace product "A" with "B" will be based on data obtained from a sample, there are elements of risk, or error, associated with interpreting the test results. Specifically, there are two errors that can be made, Type I and Type II, as indicated in the table.
A Type I error is made if the test results indicate that product "A" is more preferred than "B," but in reality either product "B" is more preferred or both products are preferred equally. The cost to the manufacturer of making this type of error is the lost opportunity of saving $1 million per year in production costs.
Consider next that product "A" is truly better than "B" in the population. If the test results indicate either that product "B" is more preferred or that there is no difference in preference (i.e., product "B" is considered to be at least equally preferable to product "A"), then a Type II error has been committed. The cost of this type of error may be more serious. Here management will save $1 million per year in production costs, but sales may drop with the introduction of the less preferred product. For example, if the manufacturer experienced a 4% drop in sales, the loss would be approximately $4 million based on the $100 million franchise. In this situation, a Type II error is more costly than a Type I error.
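The cost comparison above can be laid out as simple arithmetic. The following is a minimal sketch that uses only the figures from the example. The variable names are illustrative, and netting the production savings against the sales loss is an assumption, since the text quotes only the approximate $4 million sales loss.

```python
# Figures taken from the example in the text.
FRANCHISE_SALES = 100_000_000   # annual sales of product "A", in dollars
ANNUAL_SAVINGS = 1_000_000      # production savings from switching to "B"
SALES_DROP_PERCENT = 4          # illustrative drop in sales used in the text

# Type I error: "B" is wrongly rejected, so the savings are never realized.
type_i_cost = ANNUAL_SAVINGS

# Type II error: "B" is wrongly introduced and sales fall.
type_ii_sales_loss = FRANCHISE_SALES * SALES_DROP_PERCENT // 100

# Assumption (not stated in the text): offset the loss by the savings.
type_ii_net_loss = type_ii_sales_loss - ANNUAL_SAVINGS

print(f"Type I cost:        ${type_i_cost:,}")
print(f"Type II sales loss: ${type_ii_sales_loss:,}")
print(f"Type II net loss:   ${type_ii_net_loss:,}")
```

Even after the savings are netted out, the Type II error remains the more expensive mistake under these figures.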
Unfortunately, many researchers do not consider the losses associated with Type I and Type II errors. Typically, a manufacturer will perform a product test and apply a "95% confidence level decision rule" for interpreting the test results. This high level of confidence limits the risk of Type I error to 5%. However, no consideration is given to the risk of making a Type II error, which may, as in this case, be more costly.
Example, Revisited
Suppose the manufacturer has a budget which allows for a sample of 200 consumers for the product test. Further, management is willing to consider "A" and "B" as being roughly "equal" in preference even if preference for "B" is 10 percentage points lower than for "A." A difference of this size corresponds to a 55%/45% split: if 55% of the consumers prefer product "A," then 45% must prefer product "B" (assuming a response of "no preference" is not allowed).
With a given sample size, there is a trade-off between the risks of committing Type I and Type II errors. Specifically, in order for Type II error risk to fall, the Type I error risk must rise, and vice versa. For a given difference of interest, the only way to reduce the risk of Type II error without affecting Type I error is to increase the sample size. However, in this situation (as well as in many research studies) limited research dollars prevent the manufacturer from affording this luxury.
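To give a sense of how much larger the sample would have to be, the sketch below applies the standard normal-approximation sample-size formula for a one-tail test of a single proportion. The function name and the 90% power target are hypothetical choices made for illustration; neither comes from the example itself.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_sample_size(p_null, p_alt, confidence, power):
    """Approximate n for a one-tail test of a single proportion."""
    z_alpha = NormalDist().inv_cdf(confidence)  # one-tail critical value
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(p_null * (1 - p_null))
                 + z_beta * sqrt(p_alt * (1 - p_alt)))
    return ceil((numerator / (p_alt - p_null)) ** 2)


# 0.50 = equal preference; 0.55 = "A" preferred by 10 points over "B".
# The 0.90 power target is an assumption chosen only for illustration.
n_needed = required_sample_size(0.50, 0.55, confidence=0.95, power=0.90)
print(n_needed)
```

Under these assumptions the requirement comes out at roughly 850 respondents, more than four times the 200 that the budget allows.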
Given these constraints, the power of the test can be statistically determined for various confidence levels through the use of power tables found in many statistical texts. (Power and other aspects of significance testing are addressed in detail in Research on Research Paper Number 36.) Returning to the example, if management chooses the typical confidence level of 95%, the power of the test is only 42%. That is, if the true population proportions for "A" and "B" are .55 and .45, respectively (for a difference of 10 percentage points), and 95% confidence is used, the difference would be detected and considered significant only 42% of the time in repeated sampling and testing with 200 consumers from the same population. In other words, the odds are stacked against management detecting a real difference of .10. Considering the large dollar risk associated with a Type II error, it is in management's interests to increase power. (Note that a difference between product preference proportions of .10, as conjectured here, is far larger than typically desired by most researchers. Yet power, the ability to detect a difference of this magnitude, is still poor.)
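The power figure quoted above can be approximated without tables. The sketch below uses a normal approximation for a one-tail test of the proportion preferring "A" against a null value of 0.50; the function name and the choice of approximation are assumptions, and the power tables behind the published figure may rest on a slightly different method.

```python
from math import sqrt
from statistics import NormalDist


def one_tail_power(n, p_alt, confidence, p_null=0.5):
    """Normal-approximation power for H0: p <= p_null vs. H1: p > p_null."""
    z_crit = NormalDist().inv_cdf(confidence)
    # Smallest sample proportion for "A" that is declared significant.
    cutoff = p_null + z_crit * sqrt(p_null * (1 - p_null) / n)
    # Probability of exceeding that cutoff when the true proportion is p_alt.
    se_alt = sqrt(p_alt * (1 - p_alt) / n)
    return 1 - NormalDist().cdf((cutoff - p_alt) / se_alt)


power_95 = one_tail_power(n=200, p_alt=0.55, confidence=0.95)
print(f"{power_95:.2f}")
```

This approximation gives a power of about 0.41, in line with the 42% read from power tables; a one-point gap of this kind is typical between approximation methods.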
Power
As previously mentioned, for a given difference and confidence level, power can be increased (risk of Type II error decreased) by increasing the sample size. However, in this instance the sample size is constrained by the budget. Consequently, the power of the test can only be increased by decreasing the confidence level. To illustrate this, Table Two lists the power associated with various confidence levels (given the constraints discussed above).
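The trade-off can also be tabulated directly. The following sketch recomputes power over a range of confidence levels with the sample size fixed at 200 and a true split of 55%/45%, using a normal approximation. The confidence levels listed are illustrative and are not copied from Table Two, so the values may differ slightly from the published table.

```python
from math import sqrt
from statistics import NormalDist

N = 200          # sample size allowed by the budget
P_NULL = 0.50    # equal preference
P_ALT = 0.55     # "A" preferred by 10 percentage points

STD_NORMAL = NormalDist()


def power_at(confidence):
    """One-tail power at the given confidence level (normal approximation)."""
    z_crit = STD_NORMAL.inv_cdf(confidence)
    cutoff = P_NULL + z_crit * sqrt(P_NULL * (1 - P_NULL) / N)
    return 1 - STD_NORMAL.cdf((cutoff - P_ALT) / sqrt(P_ALT * (1 - P_ALT) / N))


# Illustrative confidence levels (an assumption, not taken from Table Two).
levels = [0.95, 0.90, 0.80, 0.70, 0.60]
powers = {level: power_at(level) for level in levels}
for level, power in powers.items():
    print(f"confidence {level:.0%}  power {power:.0%}")
```

Each step down in confidence buys additional power, which is the pattern the table is meant to show.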
In reviewing this table, the trade-off between the two probabilities becomes evident: confidence is sacrificed as power increases. For management to increase the power of the test to 88%, a 60% confidence level must be used in testing the difference between preference proportions. This level of confidence is substantially lower than the typical "95% rule." However, the higher power associated with this confidence level will protect the manufacturer from risking sales by introducing an "equally or more preferable" product that is in fact less preferable. Specifically, with this choice of confidence and power, management would consider product "A" as preferable if the proportion preferring product "A" is greater than that for product "B" and the difference is significant with at least 60% confidence.
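As a rough illustration of what this decision rule means in respondent counts, the sketch below applies a one-tail z test to the number of respondents preferring "A" out of 200. The function name is hypothetical, and the test uses a normal approximation without continuity correction, which is an assumption rather than the method prescribed here.

```python
from math import sqrt
from statistics import NormalDist


def a_is_significantly_preferred(count_a, n, confidence):
    """One-tail z test of H0: p_A <= 0.5, without continuity correction."""
    z_stat = (count_a / n - 0.5) / sqrt(0.25 / n)
    return z_stat > NormalDist().inv_cdf(confidence)


# Smallest count for "A" out of 200 that keeps product "A" under each rule.
thresholds = {}
for confidence in (0.95, 0.60):
    thresholds[confidence] = next(
        k for k in range(201)
        if a_is_significantly_preferred(k, 200, confidence)
    )
    print(f"{confidence:.0%} confidence: "
          f"at least {thresholds[confidence]} of 200 prefer A")
```

Under these assumptions the 60% rule retains product "A" once 102 of the 200 respondents prefer it, whereas the 95% rule requires 112.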
