Use of a Bayesian Orientation in Product Testing
Introduction
This paper summarizes a Bayesian approach to statistical analysis in a simple product testing situation. Basically, prior knowledge or information about the products of interest is quantified and coupled with the results of data collected to yield what is referred to as a posterior estimate of whatever statistic is of interest, the consideration of which leads to some sort of decision or action.
The notion behind research (e.g., collecting data) is to learn or to gain knowledge. For instance, a researcher may wish to learn about the level of preference for a cost-reduced product. Hopefully, this research leads to a better understanding of the true "state of nature," the level of preference in the population of consumers. A Bayesian approach allows the researcher to incorporate into the research process prior knowledge or information which can be merged with new data (such as the results of a most recent preference test) to form a more complete understanding of this "state of nature."
Conversely, the approach typically encountered in statistical analyses in marketing research assumes a state of ignorance. Each new piece of information stands alone and is not considered relative to or accumulated with knowledge already available. Clearly, this is wasteful and counterintuitive considering the way people, researchers or otherwise, think. Imagine walking down the street and seeing an old acquaintance, someone not seen for several years. Upon reflection, it is recalled that this person was most unpleasant. Unable to hide from sight, a meeting occurs. After a short conversation, it appears that the old acquaintance has changed into a very pleasant person. Although skeptical, perhaps some reconsideration of the person's past demeanor is in order. This sequence of information acquisition and behavioral or attitudinal change is fairly typical.
There is a close parallel between a Bayesian orientation and this social encounter. Past behaviors constitute prior information, the belief the person was unpleasant. The meeting is a form of data collection from which more is learned about the level of unpleasantness. The prior belief is then merged, qualitatively at least, with the data to yield a new estimate, the posterior estimate, of pleasantness. How much the posterior estimate differs from the prior perception is greatly influenced by the strength or goodness of the data collection. Further, the divergence between the posterior estimate and the estimate obtained from the data speaks to the confidence with which the prior belief is held.
Moving back to the realm of marketing research, a Bayesian approach begins by incorporating available or prior information into the research process. This information may be formal, e.g., data already collected, or informal, e.g., suspicions or opinions about how the products to be tested should perform. This information is then turned into statistical estimates resembling those obtained from a sample of consumers and ultimately combined with estimates obtained from sample data. Further, an assessment of the risks of a wrong decision stemming from use of statistical tests (type I and II errors, as discussed in Research on Research Papers 35 and 43) can be supplied. In addition to this risk quantification, as an overlay to the whole decision making process, there is the notion of personal or subjective probability: each user of the data comes to terms with risk in a different way. So, varying levels or degrees of significance or confidence, statistical ways to quantify risk, could be interpreted differently depending on the risk tolerance of the decision maker.
The Bayesian approach is, then, an introspective one. Perhaps the greatest contribution from a Bayesian orientation appears to be the thought process brought to bear on the research problem before any data are collected: the researcher is forced, at least in some small sense, to quantify beliefs about how the ensuing research will turn out. Further, the researcher is required to consider his/her own definition of risk, to make it explicit in whatever action standard might be developed. Even if prior information is never coupled with the data, much can be gained just by thinking about the problem.
Example
Prior information is worth data: when coupled with the data actually collected, it has the effect of increasing the sample size and improving the stability of the estimates obtained. The more precise this prior information is, the greater the weight it receives when merged with the data actually collected.
Consider a preference test contrasting a current product to a cost-reduced alternative. It is quite reasonable to assume that prior beliefs about how well a cost-reduced product will fare are available. A good deal of information about the performance of the cost-reduced product typically exists from research done previously, both internally through sensory groups and from past consumer tests. The subjective evaluations formed by the product developers (or marketing people) also qualify as useful input. In essence, rarely will an untested formulation be sent to marketing research for extensive consumer testing. Given that the criterion measure of interest here is the percent of preference, prior information can be quantified on this scale.
For example, the R & D person responsible for a cost reduction project conjectures that the cost-reduced product will perform at parity (50% preference, with no "no preference" responses allowed) with the current product. Alternatively, the R & D person feels the true degree of preference is probably somewhere between 45% and 55%, so 50% serves as a midpoint of this preference interval. Unfortunately, when pressed for assurance, the R & D person isn't all that confident in this assertion: maybe the true population level of preference is in the 45% to 55% interval, and maybe it is not. Statistically, it's useful to equate this lack of certainty with a level of confidence, say 50% confidence or 1:1 odds that the true population preference percentage is encompassed by this interval. Further, the sample size that would be required to produce this interval with 50% confidence can easily be determined. Rearranging terms in the formula for the standard error of a percentage of 50 and taking account of the confidence multiplier of .6745 for 50% confidence yields an estimated sample size of 45. The prior belief is "worth" 45 respondents, not from data actually collected but from a quantification of what the prior information is worth in the "currency" of sample size.
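The sample size arithmetic described above can be sketched in a few lines of Python. This is only an illustration under a normal approximation to the percentage; the function name implied_sample_size and its arguments are hypothetical and do not come from the original paper.

```python
from statistics import NormalDist

def implied_sample_size(pct, half_width, confidence):
    """Sample size at which an interval of the given confidence around
    `pct` (in percentage points) has the stated half-width.
    Assumption: normal approximation, standard error sqrt(pct*(100-pct)/n)."""
    # Two-sided multiplier, e.g. about .6745 for 50% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # Rearranged from: half_width = z * sqrt(pct * (100 - pct) / n)
    return z ** 2 * pct * (100.0 - pct) / half_width ** 2

# Prior belief: 50% preference, plus or minus 5 points, held with 50% confidence
n_prior = implied_sample_size(50.0, 5.0, 0.50)
print(round(n_prior, 1))  # 45.5, reported in the text as 45
```

The text uses the whole number 45; the sketch simply shows where that figure comes from.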
Next, data have been collected from 250 consumers, yielding 48% preference for the cost-reduced product. The Bayesian approach suggests forming a weighted combination of the prior estimate, 50%, and the data estimate, 48%, with sample size, believed or real, as the basis for weighting. The resulting preference percentage is then (45 × 50% + 250 × 48%) / (45 + 250) = 48.3% based on an effective sample size of 295. The 48.3% preference is considered the posterior estimate, from which inferences about the true level of population preference can be made. Again, the prior information contributes to the estimation of the preference percentage, increasing the effective sample size and hence the stability upon which statistical decisions will be based.
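The weighted combination just described can be written out as a minimal sketch. The function name posterior_percentage is illustrative only; the inputs are the figures quoted in the example.

```python
def posterior_percentage(prior_pct, prior_n, data_pct, data_n):
    """Sample-size-weighted average of the prior and data percentages.
    Returns the posterior percentage and the effective sample size."""
    effective_n = prior_n + data_n
    posterior = (prior_n * prior_pct + data_n * data_pct) / effective_n
    return posterior, effective_n

# Prior "worth" 45 respondents at 50%; data from 250 consumers at 48%
posterior, effective_n = posterior_percentage(50.0, 45, 48.0, 250)
print(round(posterior, 1), effective_n)  # 48.3 295
```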
Perhaps the most useful summary here is an interval, referred to as a credibility interval or highest density region and comparable in construction to a confidence interval, drawn around the posterior estimate. A 95% credibility interval around 48.3% extends from 42.6% to 54%. The interpretation is that there is a 95% chance that the true population preference percentage is anywhere from well below parity (42.6%) to moderately above parity (54%). (Note that the probabilistic interpretation used here differs from that associated with a confidence interval when the confidence interval is interpreted correctly. Research on Research Paper 34 can be consulted for details. For a purely numeric comparison, a confidence interval drawn with 95% confidence around the sample estimate of 48% based on a sample of 250 respondents extends from 41.8% to 54.2%. The confidence interval is about one percentage point wider than the credibility interval. This is due to the effect of the increased sample size on the standard error from which the Bayesian credibility interval is constructed.)
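Both intervals quoted above can be reproduced with a short sketch, assuming the usual normal approximation to a percentage. The helper name normal_interval is hypothetical; the only difference between the two calls is the estimate and the sample size behind it.

```python
from math import sqrt
from statistics import NormalDist

def normal_interval(pct, n, level=0.95):
    """Symmetric interval around a percentage under a normal approximation.
    Assumption: standard error is sqrt(pct * (100 - pct) / n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    se = sqrt(pct * (100.0 - pct) / n)
    return pct - z * se, pct + z * se

# Credibility interval: posterior estimate, effective sample size of 295
cred_low, cred_high = normal_interval(48.3, 295)
# Confidence interval: sample estimate alone, 250 respondents
conf_low, conf_high = normal_interval(48.0, 250)

print(round(cred_low, 1), round(cred_high, 1))  # 42.6 54.0
print(round(conf_low, 1), round(conf_high, 1))  # 41.8 54.2
```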
A more specific statistical test assesses the conjecture that preference in the population of consumers is at parity. The test contrasts the posterior percentage to 50%. Bayesian analysis considers population values as probabilistic and sample data as established and hence fixed. The Bayesian statistical orientation is one of asking questions about a population value, which can be any quantity, given the data available. One then makes conjectures about the likely population value and estimates a probability corresponding to the chance that the conjecture is right. (Conversely, classical statistics takes the data as random but the population value as fixed, albeit unknown. Different samplings from the same population may yield different estimates, but the population value stays the same, though still unknown. Bayesians would argue that their approach is more intuitively pleasing.) The research question is then concerned with estimating the chance or probability that the true population value could have been parity, 50%, given the established information on hand, a posterior percentage of 48.3. Using a slight variation of the usual z-test, dividing the difference (50 - 48.3) by the standard error for the percentage 48.3, yields a p-value of .28: there is a 28% chance that the posterior percentage is greater than or equal to 50% or, equivalently, a 72% chance that the posterior preference is less than parity. The odds are 72:28 or about 2.5 to 1 that the posterior percentage is less than parity. Alternatively, a useful test contrasts the posterior percentage to a population value conjectured to be some minimally acceptable level of preference, say 45%. Running through the same statistical procedure yields odds of almost 7 to 1 - roughly an 87% chance - that 48.3% is greater than this minimally acceptable preference level. (Numerically there is little difference between the results here and those found using a traditional or classical statistical approach.)
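The two probabilities above can be checked with a brief sketch. It assumes a normal approximation with the standard error computed at the posterior percentage and the effective sample size of 295; the function name chance_at_or_above is illustrative, not the paper's.

```python
from math import sqrt
from statistics import NormalDist

def chance_at_or_above(threshold_pct, posterior_pct, effective_n):
    """Approximate chance that the true percentage is at or above a threshold.
    Assumption: normal approximation centered on the posterior percentage."""
    se = sqrt(posterior_pct * (100.0 - posterior_pct) / effective_n)
    z = (threshold_pct - posterior_pct) / se
    return 1.0 - NormalDist().cdf(z)

p_parity = chance_at_or_above(50.0, 48.3, 295)   # chance of parity or better
p_minimum = chance_at_or_above(45.0, 48.3, 295)  # chance of at least 45%

print(round(p_parity, 2))   # 0.28
print(round(p_minimum, 2))  # 0.87
print(round(p_minimum / (1.0 - p_minimum), 1))  # 6.8, i.e. almost 7 to 1
```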
Interpretation
How a decision is made now depends on the relationship between the result of the statistical test and the researcher's assessment of risk, and this is applicable whether prior information has been used or not. Taking a step back, an action standard can be formulated: if the cost-reduced product is far worse than parity (where "far worse" should be very precisely defined, say, five percentage points worse than parity or 45% preference) then stay with the current product; if the cost-reduced product achieves a 45% preference level or greater then proceed with the cost-reduced product. A rough assessment of risks suggests that an incorrect decision to produce the cost-reduced product may be far more damaging, in terms of reduced revenue from share loss, than missing the chance to reduce production costs if the current product is incorrectly retained. Applying a typical non-Bayesian (marketing research) approach to the above example, adopting, say, 95% confidence (but without regard to the power of the test), suggests that there is no significant difference between parity and the obtained preference percentage for the cost-reduced product. The decision is made to proceed with the cost-reduced product. However, the Bayesian oriented researcher may wonder whether 95% confidence is appropriate. The use of 95% confidence requires preference for the cost-reduced product to be considerably lower than parity (roughly 44% based on a sample size of 250) before being rejected. Essentially, this action standard criterion may have a bias toward going with the cost-reduced product even when that is an incorrect decision. (In classical statistical hypothesis testing, this could be construed as a lack of power leading the researcher to commit a type II error.) Given the greater losses with this type of mistake (share loss will almost always exceed potential cost savings), some criterion lower than 95% seems quite reasonable.
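The rough 44% cutoff mentioned above can be checked with a short sketch. It assumes a two-tailed test against parity with the standard error evaluated at 50%; the paper does not spell out the calculation, so this is one plausible reading, and the helper name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def lower_rejection_cutoff(null_pct, n, confidence=0.95):
    """Lowest sample percentage still judged 'not significantly different'
    from null_pct. Assumption: two-tailed z-test, standard error at null_pct."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    se = sqrt(null_pct * (100.0 - null_pct) / n)
    return null_pct - z * se

cutoff = lower_rejection_cutoff(50.0, 250)
print(round(cutoff, 1))  # 43.8, i.e. roughly 44%
```

Lowering the confidence argument moves the cutoff closer to parity, which is the direction the paragraph argues for.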
Statistical decision theory (and classical hypothesis testing) allows for the explicit incorporation of information about specific dollar amounts associated with the types of risks mentioned here to help set the action standard criterion. (Research on Research Paper number 43 addresses this issue.) However, if this information isn't available, and quite often it is not or what is available is not specific enough to be usable, the researcher could rely on a personal point of view about risk. For example, the risk-averse researcher wishing to lessen what is felt to be the greater risk due to share loss from producing the truly less preferred cost-reduced product might feel that there should be no more than a 65% chance that the cost-reduced percentage is below parity before suggesting that product as a replacement for what is currently on the market. Recall that there is a 72% chance of such an occurrence. Given the risk of franchise loss, the difference of 48.3% from parity would be considered too great.
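The personal action standard in this paragraph amounts to comparing one probability against one threshold. A minimal sketch follows, assuming the same normal approximation used for the posterior test; the function names and the max_chance_below parameter are hypothetical labels for the 65% limit.

```python
from math import sqrt
from statistics import NormalDist

def chance_below(threshold_pct, posterior_pct, effective_n):
    """Approximate chance that the true percentage is below a threshold.
    Assumption: normal approximation centered on the posterior percentage."""
    se = sqrt(posterior_pct * (100.0 - posterior_pct) / effective_n)
    return NormalDist().cdf((threshold_pct - posterior_pct) / se)

def recommend_cost_reduced(posterior_pct, effective_n,
                           parity_pct=50.0, max_chance_below=0.65):
    """Recommend the cost-reduced product only if the chance of being
    below parity does not exceed the researcher's personal limit."""
    return chance_below(parity_pct, posterior_pct, effective_n) <= max_chance_below

chance = chance_below(50.0, 48.3, 295)
decision = recommend_cost_reduced(48.3, 295)
print(round(chance, 2), decision)  # 0.72 False
```

A more risk-tolerant researcher could pass a larger max_chance_below, say 0.75, and would then reach the opposite recommendation from the same data.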
Implicit here is the process of setting a personal level of risk which translates to some level of confidence required to make a decision. The process is introspective, with researchers setting risk levels consistent with their own feelings and personalities. This is as opposed to routinely adopting traditional levels of confidence which may be quite insensitive to the current circumstances. From another perspective, interpretation of statistical test results demands the setting of a confidence or significance level. This level may be set by default to some "usual" level or obtained through some more insightful thought process.
A Note on Introspection
Consider a conversation held with two marketing researchers on the way to a race track. The topic of risk aversion arose and each was asked to state the degree of risk each was willing to take when betting. The first researcher bet "sure things," horses running at 3:1 or at most 4:1 odds. This person was risk-averse, unwilling to place bets on horses considered unlikely to win. The second researcher was by far the gambler, willing to bet on long-shots at odds of 15:1 to 20:1. These odds were then related to statistical confidence where risk (100 - confidence level) relates to a type I error, incorrectly citing a finding (e.g., a difference) as significant when, in fact, it is not. Risk aversion leads to minimizing the chance of this type of error. (Money will not be risked if the horse isn't very likely to win.) If losing a bet corresponds to a type I error then 4:1 odds translates to 20% confidence (the horse upon which the bet is placed will win 20% of the time) while 15:1 to 20:1 corresponds to about 5% confidence. The first researcher requires four times as much confidence to place a bet. Yet, when back in a research setting, both people were quite willing to routinely use the 5% risk of a type I error associated with 95% confidence.
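The translation from betting odds to a chance of winning can be sketched as follows, assuming odds are quoted "against" the horse (4:1 meaning four losses expected for every win). The function name is illustrative.

```python
def win_probability(odds_against):
    """Chance of winning implied by odds of `odds_against` to 1 against."""
    return 1.0 / (odds_against + 1.0)

# Odds favored by the first researcher (3:1, 4:1) and the second (15:1, 20:1)
for odds in (3, 4, 15, 20):
    print(f"{odds}:1 -> {win_probability(odds):.4f}")
```

Under this reading, 4:1 gives exactly 0.20, while 15:1 and 20:1 bracket the "about 5%" figure used in the text.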
For the first researcher, 95% confidence may prove to be too liberal, finding too many differences significant which really weren't (betting on too many horses which lose) and so committing too many type I errors. At best, a 95% level may be a lower bound on the level of confidence with which this researcher may feel most comfortable. In fact, to remain consistent with the second researcher by using the 4 to 1 ratio mentioned in the last paragraph, a 99.98% confidence level would be required. (A z- or t-value of 3.72 would be needed before a difference is declared significant, using a two-tailed test. This value is nearly twice that used for 95% confidence.) Conversely, 95% confidence may represent an upper bound for the second researcher, where higher levels would fail to signal as significant findings which this researcher may deem interesting. In a sense, this researcher is willing to risk an increase in type I errors as long as more differences can be granted significance. In keeping with the 4 to 1 ratio of confidence percentages cited above, a lower level of confidence, say 80%, may be more consistent. Some introspection on the part of the researchers, coupled with an evaluation of the risk orientation of the company for which they work, would help eliminate inconsistency between an inherent risk tolerance and the level of confidence used.
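The critical values that go with these confidence levels can be looked up with a small sketch using the standard normal distribution. It covers the z-value case only (a t-value would also depend on the sample size), and the function name is hypothetical.

```python
from statistics import NormalDist

def two_tailed_critical_z(confidence):
    """z-value a difference must exceed to be declared significant
    in a two-tailed test at the given confidence level."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)

z_9998 = two_tailed_critical_z(0.9998)
z_95 = two_tailed_critical_z(0.95)
z_80 = two_tailed_critical_z(0.80)

print(round(z_9998, 2))  # 3.72
print(round(z_95, 2))    # 1.96
print(round(z_80, 2))    # 1.28
```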
Heavily Weighted Prior Information
A useful tangent here is to consider the consequences of greater confidence in prior beliefs and information. In the example, this would be greater assurance on the part of the R & D person concerning prior beliefs about the performance of the cost-reduced product. Rather than being only 50% sure that the true population preference falls in the interval from 45% to 55%, the assertion is held with, say, 95% confidence. In Bayesian terms, there is a 95% chance the true population percentage is in this interval. The implicit sample size now is 384, which will be substantially larger than samples typically obtained through data collection. One might ask whether it is reasonable to value prior information so highly. Given the amount of development work done before a product is consumer tested and the amount of experience an R & D person might have with that and other products in the category, the answer could well be "yes." The role of consumer testing would then be one of confirmation of the prior belief: collect just enough data to substantiate what is believed. Data will upset the prior belief only if the estimated percentage differs greatly from the prior. As such, the orientation of consumer research may change depending on the amount of prior information available.
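The effect of a heavily weighted prior can be illustrated with a short sketch. The implied sample size of 384 follows from the figures above; combining it with the earlier test result (48% from 250 consumers) is an assumption made here purely for illustration, since the paper does not report that combination. Function and variable names are hypothetical.

```python
from statistics import NormalDist

def implied_sample_size(pct, half_width, confidence):
    """Sample size at which an interval of the given confidence around
    `pct` has the stated half-width (normal approximation assumed)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z ** 2 * pct * (100.0 - pct) / half_width ** 2

# Same 45% to 55% interval, now held with 95% rather than 50% confidence
prior_n = int(implied_sample_size(50.0, 5.0, 0.95))
print(prior_n)  # 384

# Illustrative only: merge this prior with 48% preference from 250 consumers
data_n, data_pct = 250, 48.0
posterior = (prior_n * 50.0 + data_n * data_pct) / (prior_n + data_n)
prior_weight = prior_n / (prior_n + data_n)
print(round(posterior, 1), round(prior_weight, 2))  # 49.2 0.61
```

With the prior carrying about 61% of the weight in this hypothetical merge, the posterior stays close to parity, which is consistent with the remark that data upset a strong prior only when they differ greatly from it.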
In Closing
A Bayesian approach does not make research life easier; indeed, it complicates matters considerably. In contrast to the way statistical analysis is typically used in marketing research, collecting data and comparing the result of a statistical test to some arbitrarily set level of confidence (e.g., 95%), some hard thinking is necessary to take account of beliefs and other information available about product performance. Further, a researcher is required to come to terms with his/her own view of risk, quantifying it and using it to interpret the results of significance tests.
