Correctly Selecting the Best Product

Introduction:

Product tests represent one of the cornerstones of marketing research. The tests involve many stages of preparation and planning, culminating in design, data collection and analysis. Experience suggests that many of the tests have as a goal the identification of one product, or a small subset of products, which is "best," in the sense of being most preferred or most likely to be purchased. However, a potential inconsistency exists between the goal of selecting the best product and the use of standard statistical tests to assess the significance of differences among products.


An Example:

A group of 146 consumers of rice were asked to taste and evaluate three rice products, where each consisted of white rice in a flavored sauce. After tasting all three products, respondents were asked to state a preference: which of the three products did they like best? Table One displays the results. Product "A" was preferred by 61 respondents, or approximately 42% of the sample. (All respondents stated a preference, precluding the existence of a fourth, "no preference," group.)


The key research question to be answered is whether product "A" is indeed best, in the sense of being most preferred. The statistical approach typically taken is a test of the equality of the proportion of respondents preferring product "A," the "best" product, and the proportion preferring product "C," the "second-best" product in the example. Given the forced choice nature of the data, the chi-squared test presented in Research on Research Paper Number 26 would be considered appropriate in this situation. Since one, and only one, comparison among the preference proportions is of interest, the chi-squared value could be referred to the chi-squared distribution with one degree of freedom, or the square root of the chi-squared statistic could be referred to the normal (Z) distribution, to assess the statistical significance. Calculation of this Z statistic for the difference between the proportions for product "A" and "C" yields a value of 1.25, significant with approximately 80% confidence. The level of confidence here may be considered sufficiently weak to suggest, in the absence of other marketing considerations, that no difference exists.
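As a rough illustration, the Z statistic for two preference proportions drawn from the same forced-choice sample can be sketched in Python. The count of 61 out of 146 for product "A" comes from the text. The count of 48 for product "C" is an assumption, chosen only because it gives a Z of about 1.25, and the function name is hypothetical. The standard error is the usual null-hypothesis form for two cells of one multinomial sample, which may differ in detail from the formula in Research on Research Paper Number 26.

```python
from math import sqrt
from statistics import NormalDist

def z_same_sample(count_1, count_2, n):
    """Z for the difference between two preference proportions from one sample."""
    p1, p2 = count_1 / n, count_2 / n
    # Under the null hypothesis of equal proportions, Var(p1 - p2) = (p1 + p2) / n.
    return (p1 - p2) / sqrt((p1 + p2) / n)

# 61 is from the text; 48 is an ASSUMED count for product "C".
z = z_same_sample(61, 48, 146)

# Two-tailed confidence level associated with the observed Z.
confidence = 1 - 2 * (1 - NormalDist().cdf(abs(z)))

print(f"Z = {z:.2f}, confidence = {confidence:.2f}")
```

Under these assumptions the confidence level comes out a little under 80%, in line with the figure quoted above.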

The statistical approach taken here understates the degrees of freedom that should be allocated for testing product differences. The chi-squared statistic, in fact, follows a chi-squared distribution with two degrees of freedom, and would be considered significant with only about 55% confidence. Alternatively, the Z distribution can be used with modification of the p-value of the z-test result to reflect the two degrees of freedom available for testing. (This modification can be achieved by use of the Bonferroni inequality.) The Z-test value of 1.25 would then be considered significant with approximately 62% confidence. The extra degree of freedom is necessary since most researchers cannot specify exactly which product comparison to test prior to data collection. However, most researchers would not in practice apply two degrees of freedom. Thus, a one degree of freedom test, although statistically incorrect, serves as a more realistic comparison to the methodology to be presented.
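The effect of the extra degree of freedom can be checked with a short Python sketch. It takes the chi-squared statistic to be the square of the reported Z of 1.25 and relies on two closed forms: the tail probability of a chi-squared variable with one degree of freedom equals the two-tailed normal probability, and with two degrees of freedom it equals exp(-x/2). The variable names are illustrative only.

```python
from math import erfc, exp, sqrt

chi_sq = 1.25 ** 2   # square of the Z value reported in the text

# One degree of freedom: identical to the two-tailed normal (Z) test.
p_one_df = erfc(sqrt(chi_sq / 2))

# Two degrees of freedom: the chi-squared tail probability is exp(-x / 2).
p_two_df = exp(-chi_sq / 2)

print(f"confidence with 1 df: {1 - p_one_df:.2f}")
print(f"confidence with 2 df: {1 - p_two_df:.2f}")
```

The two results are close to the "approximately 80%" and "about 55%" confidence levels cited in the text.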


Logic of the usual statistical test:

The research question addressed above can be restated in statistical terms by inquiring as to the probability of correct selection: assuming that product "A" is indeed best (assuming it has the largest preference proportion in the population of rice consumers), what is the probability that it will be correctly selected based upon the sample data as the best of the three rice products? The analysis above tested the equality of two proportions, the logic being that if the largest proportion differed significantly from the next largest with at least 95% confidence, then the product with the largest proportion was best. The level of confidence associated with this significance test refers to the likelihood that the observed difference between the proportions reflects a true population difference and is not due to random sample fluctuations, given that no difference between the population proportions is postulated (the null hypothesis). A large degree of confidence suggests the two proportions differ, no more.

To call one product best as a result of the test may be a logical next step and inferentially plausible, but the probability of correctly identifying that which is best is not directly assessed. Further, if it is assumed that one product can be identified as best, then differences among products are assumed to exist. This is inconsistent with the use of statistical procedures which test the null hypothesis. Thus, this testing logic only indirectly answers the research question and, as such, may not be the most sensitive method for assessing the probability of correctly selecting one product as best.

A one-tailed significance test is not the answer to a more sensitive test. The forced choice statistical test above was two-tailed because it was unlikely that the researcher, prior to data collection, could correctly state which product would indeed be best, or which product comparison to test to find that which is best. This knowledge is necessary for probabilistic advantages to accrue to the researcher for correct use of a one-tailed test. (Comments made in Research on Research Paper Number 36 are also relevant.)


The statistical estimation of the probability of correct selection:

A statistical methodology exists for estimating the probability of correct selection. The logic of the procedure is quite simple. A set of objects (rice products in the example) is ranked according to some criterion. Preference proportions were used above. The product with the largest criterion value is selected as best and the product with the second largest value as second best. The focus of the analysis is on the difference between these two products. All other values are irrelevant beyond supplying information on the number of objects in the analysis. The null hypothesis is implicitly rejected; the assumption is made prior to data collection that there are differences among the objects and that one is best. As a consequence, Type I errors, considering differences to be significant when they are not, are irrelevant. In the most extreme case, if the criterion values for all objects are equal, then selection of any one of them as best, even if done randomly, is acceptable.

The sample difference between the best and second best criterion values serves in the assessment of the power or sensitivity of the test: the greater the difference, the greater the likelihood that the object which is truly best in the population will be selected as best in the sample. The statistical measure of this power is the probability of correct selection. (Research on Research Paper Number 36 can be referred to for more information on the concept of power.)

There are limitations to the use of this selection procedure, however. Formulae and programs for implementation may be inaccessible to many researchers. Although tables exist for estimating sample sizes for the design of studies, using these tables "in reverse" to estimate probabilities of correct selection can require extensive interpolation.


Bootstrapping:

A simpler approach, both intuitively and practically, and one that gives a reasonable approximation to these probabilities, is bootstrapping. Originally discussed in Research on Research Paper Number 31, bootstrapping is an empirical or brute-force approach to obtaining statistics and their standard errors that are otherwise difficult, if not impossible, to calculate. The key idea underlying bootstrapping is the generation of a large number of samples, all drawn from the original data. This is done by using resampling, or sampling with replacement. The statistic of interest is obtained for each of these "bootstrap" samples and its variability is calculated across the bootstrapped estimates.

Obtaining the probability of correct selection requires the identification of that product with the largest preference proportion within each bootstrap sample. This involves estimation of the preference proportions within each bootstrap sample. Comparisons are made among the proportions to determine for which product the proportion is largest. The probability of correct selection for a given product is the proportion of times across the bootstrap samples the proportion preferring that product is largest. (The probability of correct selection, as estimated by bootstrapping, thus quantifies the "stability" with which a particular product would be found to be best across bootstrap samples. In a very strict sense, the probability should not be used with reference to the population from which the original sample was drawn, but rather to the universe of possible bootstrap samples which could be drawn. This suggests that the procedure should be used only with sizable samples that are representative of the population to which inferences are intended.)
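The procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the program used for the study. The count of 61 for product "A" is from the text, while the counts of 37 and 48 for products "B" and "C" are assumed values that merely sum to 146. The function name, the seed, and the rule of splitting credit equally when two products tie for the largest proportion are also assumptions.

```python
import random
from collections import Counter

def prob_best(counts, n_boot=1000, seed=0):
    """Share of bootstrap samples in which each product has the largest proportion."""
    rng = random.Random(seed)
    labels = list(counts)
    # Rebuild the original sample: one preference label per respondent.
    data = [lab for lab in labels for _ in range(counts[lab])]
    n = len(data)
    wins = dict.fromkeys(labels, 0.0)
    for _ in range(n_boot):
        # Resample respondents with replacement, keeping the original sample size.
        sample = Counter(rng.choices(data, k=n))
        top = max(sample.values())
        leaders = [lab for lab in labels if sample[lab] == top]
        # ASSUMPTION: ties for first place share the credit equally.
        for lab in leaders:
            wins[lab] += 1.0 / len(leaders)
    return {lab: wins[lab] / n_boot for lab in labels}

# 61 is from the text; 37 and 48 are ASSUMED counts for "B" and "C".
result = prob_best({"A": 61, "B": 37, "C": 48})
for product, share in result.items():
    print(product, round(share, 3))
```

Because the samples are drawn at random, the shares vary somewhat from run to run and with the number of bootstrap samples requested.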


The Example Revisited:

Table Two reports results from 1,000 bootstrap samples, each with 146 observations. The preference proportion for product "A" was largest in 87.6% of the bootstrap samples. Product "C" was best in 11.8% of the samples and product "B" in less than 1%. Note that the preference proportions vary over the bootstrap samples drawn. It was possible for product "B" to have the largest preference proportion in a small fraction of bootstrap samples due to the random nature with which the samples were drawn.


Product "A" is again seen as best with a probability of correct selection of .876. This provides a specific answer to the research question posed at the outset: what is the probability of correctly specifying product "A" as best?


Comparability of the probability of correct selection and confidence levels:

Although this probability of correct selection is not strictly comparable to the confidence level associated with the Z-test given earlier, there is an inclination to compare the two, since both can be viewed as "acceptance criteria" associated with their respective tests.

Taken at face value, the "correct selection" procedure yields greater sensitivity: product "A" can be considered best with a greater degree of confidence, .876 vs. .80. However, the probability of correct selection is a measure of power, not a measure of confidence, and should be compared to the power of the Z-test. For the example, the power of the Z-test is approximately 24% and 35% for Type I error rates of .05 and .10, respectively. Both power levels are quite poor when considered relative to the estimated probability of correct selection. The power would have been much worse if the correct statistical test, as discussed earlier, had been used.
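The power figures can be approximated with a short Python sketch. It rests on one assumption: the observed Z of 1.25 is treated as the true standardized difference between the two proportions. The function name is hypothetical.

```python
from statistics import NormalDist

def z_test_power(z_effect, alpha):
    """Approximate power of a two-tailed Z-test for a true standardized difference."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic falls beyond either critical value.
    return norm.cdf(z_effect - z_crit) + norm.cdf(-z_effect - z_crit)

# ASSUMPTION: the observed Z of 1.25 stands in for the true effect.
for alpha in (0.05, 0.10):
    print(f"Type I error rate {alpha:.2f}: power = {z_test_power(1.25, alpha):.2f}")
```

Under this assumption the results come close to the 24% and 35% quoted above.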

This comparison may not be fair either. As mentioned earlier, the null hypothesis is implicitly rejected and Type I errors are irrelevant when considering the probability of correct selection. This is tantamount to setting the Type I error rate to 1 for null hypothesis tests, so that any difference is considered significant. (Setting the Type I error rate to 1 is equivalent to not performing any significance test at all.) The power of the Z-test in this situation, whether modified for the two degrees of freedom available for testing or not, would be comparable to the power or probability of correct selection.


General Applicability:

The "correct selection" procedure is applicable to all studies where the research question concerns selection of that object which is best. The application cited above in the rice example was an analysis of data that had already been collected. This procedure potentially is of greatest utility, however, when used to estimate sample sizes needed to achieve desired precision and power. Tables exist to serve this purpose. Once data have been collected, the analysis can proceed using the testing logic described above.

Any type of object may be analyzed, e.g., concepts, product names or packages, or television commercials. Of particular relevance is advertising claim substantiation. If one brand is advertised as preferred over a second in nationwide taste tests, the probability of correct selection should be sufficiently large to warrant such a claim.

The testing logic is extendible to most, if not all, designs. Some modification of the test statistic may be required to better accommodate the design. For example, paired-comparison taste tests can easily be evaluated. Respondents would be exposed to two of a number of products and asked to state a preference for one product or the other. Using a three product test as an example, a sample of respondents would be randomly divided into three groups. Each group would be exposed to one of three possible pairs of products, each product appearing in two of the three pairs. A criterion for selecting the best product could be an average of the preference proportions for a product across the pairs in which it was tested. Comparison of these averages to find the "best" product for a large number of bootstrap samples would be used to estimate the probability of correct selection.
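A hedged sketch of the paired-comparison variant follows. All counts are hypothetical, invented purely to show the mechanics: three groups of 50 respondents, each tasting one pair. The function name and the equal split of credit for ties are likewise assumptions. Each group is resampled separately, and each product is scored by the average of its preference proportions over the pairs in which it appeared.

```python
import random

def paired_prob_best(pair_counts, n_boot=1000, seed=0):
    """pair_counts maps (x, y) to (votes for x, votes for y) within that group."""
    rng = random.Random(seed)
    products = sorted({p for pair in pair_counts for p in pair})
    wins = dict.fromkeys(products, 0.0)
    for _ in range(n_boot):
        shares = {p: [] for p in products}
        for (x, y), (votes_x, votes_y) in pair_counts.items():
            n = votes_x + votes_y
            # Resampling a group's respondents with replacement is the same as
            # n independent draws that favor x with probability votes_x / n.
            boot_x = sum(rng.random() < votes_x / n for _ in range(n))
            shares[x].append(boot_x / n)
            shares[y].append((n - boot_x) / n)
        # Criterion: average preference proportion across a product's pairs.
        score = {p: sum(v) / len(v) for p, v in shares.items()}
        top = max(score.values())
        leaders = [p for p in products if score[p] == top]
        for p in leaders:   # ASSUMPTION: ties share the credit equally
            wins[p] += 1.0 / len(leaders)
    return {p: wins[p] / n_boot for p in products}

# HYPOTHETICAL data: three groups of 50 respondents, one pair per group.
groups = {("A", "B"): (30, 20), ("A", "C"): (27, 23), ("B", "C"): (22, 28)}
paired_result = paired_prob_best(groups)
print({p: round(s, 3) for p, s in paired_result.items()})
```

Other criteria, or other ways of resampling within groups, could be substituted without changing the overall logic.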

Although allocating "no preference" proportions to product proportions or repercentaging the preference proportions after exclusion of "no preference" responses may not always be reasonable procedures, the processes can be accommodated. In addition to supplying the probability of correct selection, bootstrapping can also supply an estimate of the standard error of these modified proportions. An advisable approach is to leave the "no preference" cell as is. The preference proportions for each of the products are then much cleaner estimates of the population proportions and the conclusions are generalizable to the total population as intended, not just to those stating a preference. Note that there is a chance of selecting the "no preference" group as "best" (the population proportion that is largest). The chance increases with the size of the proportion in this group.

Although only proportions have been addressed here, the "correct selection" procedure can be used for scaled data as well. Criteria for selecting the best product can be based on means, medians, or even ratios of common statistics. Bootstrapping can supply estimates of correct selection and standard errors for the statistics used. (Of course, caution should be exercised if the data used are distributed in a grossly non-normal fashion.)

Lastly, the "correct selection" method can be applied to identify the worst object, or that with the smallest criterion value. Also, the logic of the methodology can be extended to identify subsets of objects which are best (or worst) according to some criterion. In this case, the probability of correct selection pertains to the probability that the best (or worst) object is contained in the subset selected.


Designing Product Tests — Sample Size Estimation:

To this point, the correct selection methodology has been presented as a direct statistical approach to answering the research question: Which product is best? Through its use, greater statistical sensitivity is obtained, leading to an increased chance of correctly identifying that product which is indeed best.

However, as mentioned, the greatest utility of this approach is in designing product tests, i.e., sample size estimation. Specifically, the sample size requirements for achieving reasonable probability levels of correct selection are far smaller than those associated with conventional significance tests. (Research on Research Paper Number 37 is devoted to the estimation of sample sizes for such conventional tests.)

As an example, suppose a researcher wishes to be able to detect a difference of .10, or 10 percentage points, in forced-choice preference between the best and second best products in a four product test. Preference proportions for the two products in question could be .35 and .25, respectively. Using the conventional approach with a reasonable, yet not overwhelming, level of power of 80%, two sample sizes could be estimated. The first, with 95% confidence, requires a sample of 326 respondents. The second estimate acknowledges that any of six pairs of products could be considered for testing if no prior relevant information is available to direct the researcher. Splitting the Type I error risk, .05, six ways to account for this uncertainty (to reduce the increased chance of finding the difference significant when it is not) suggests the use of 99.2% confidence with, again, 80% power. As such, 503 respondents are needed. By contrast, only 278 respondents would be needed to achieve a desired probability of correct selection of .95. Although only nominally associated with 95% confidence and 80% power, the first approach still results in a larger sample size estimate than required by the correct selection approach. Hence, use of the correct selection methodology results in a more economical and efficient product test and yields data which, when analyzed, directly answer the research question.
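The two conventional sample sizes can be approximated with the standard normal-approximation formula, sketched below in Python. The text does not state which formula was used, so the variance term (the sum of p times 1 minus p for the two proportions) is an assumption, adopted because it gives figures matching the 326 and 503 quoted. The function name is hypothetical. The figure of 278 for the correct selection approach comes from published tables and is not reproduced here.

```python
from math import ceil
from statistics import NormalDist

def n_for_difference(p1, p2, alpha, power):
    """Sample size for a two-tailed Z-test of p1 versus p2 (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # ASSUMPTION: variance of the difference taken as p1*(1 - p1) + p2*(1 - p2).
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

# Single comparison at 95% confidence and 80% power.
n_single = n_for_difference(0.35, 0.25, alpha=0.05, power=0.80)

# Type I error risk of .05 split six ways, one share per possible pair.
n_six_pairs = n_for_difference(0.35, 0.25, alpha=0.05 / 6, power=0.80)

print(n_single, n_six_pairs)
```

Splitting the Type I error risk raises the critical value of the test, which is why the second estimate is so much larger than the first.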

As noted previously, formulae and tables for sample size estimation exist, but are limited. References to existing tables and literature on this approach can be obtained by contacting members of Decision Systems in Chicago.