Some Methodological Issues in Product Testing

Introduction


The development of food products follows a lengthy path culminating in consumer acceptance testing. The product tests are very complex, with numerous issues to consider to obtain clear, unbiased measurements of consumer preference. These issues cover areas of product technology, respondents' sensory abilities and measurement of reactions to products. This paper is strictly concerned with the third issue. The objective is to discuss four interrelated statistical / methodological facets of food product testing. Consideration of these facets can help strengthen the interpretability of the test results.


Treatment Structure

"Treatment Structure" refers to the nature and number of products to be tested. In terms of an experiment, the products are composed of one or more ingredients which may be modified or systematically varied. Each variation is called a treatment. Information is obtained for all treatments chosen for study, typically in the form of scale measurements or ratings of reactions elicited from respondents. The influence of such changes in ingredients on product preference needs to be assessed: a relationship between change in product composition and change in preference is sought. Product tests that have been shown to be quite effective in identifying these relationships have a treatment structure such that key ingredients are systematically varied and, therefore, their influence on preference can be clearly measured. For example, a candy bar may be considered as being composed of four ingredients: nuts, chocolate, caramel and creme filling. These ingredients, which in experimental terminology are called factors, may be systematically changed, with each change being a different amount of that ingredient included in the candy bar. Each particular amount is termed a level of that factor. The levels of one factor are combined with all possible levels of the other factors and each combination essentially creates a different candy bar treatment.

The factors and the amounts (levels) at which each is tested constitute the underlying structure. The objective in varying the factors in a systematic fashion is to identify that ingredient mix, that specific candy bar, which elicited the most positive response from respondents. In other words, the purpose is to find that candy bar for which preference is optimized. It is quite possible that the "optimal" product is not one of those actually produced, but one found by mathematical interpolation. Further, it is especially important to identify those ingredients which most influence preference, those for which changes in preference are most closely associated with changes in the amounts of the ingredients themselves. Techniques are available through which this optimum can be identified (or estimated). Statistical techniques dealing with factorial experiments, with specific reference to response surface methods, are most frequently used.
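The idea of locating an optimum by interpolation can be sketched for a single ingredient tested at three levels. The sketch below assumes a mean rating is available at each level and fits a parabola exactly through the three points; the function name, the ingredient levels and the mean ratings are all invented for illustration, and a real response surface analysis would fit all factors jointly with a statistical model.

```python
def quadratic_optimum(levels, means):
    """Fit y = a*x^2 + b*x + c exactly through three (level, mean) points
    and return the level at which the fitted curve peaks."""
    (x0, x1, x2), (y0, y1, y2) = levels, means
    # Divided differences give the coefficients of the interpolating parabola.
    d01 = (y1 - y0) / (x1 - x0)
    d12 = (y2 - y1) / (x2 - x1)
    a = (d12 - d01) / (x2 - x0)
    b = d01 - a * (x0 + x1)
    if a >= 0:
        raise ValueError("no interior maximum: fitted curve is not concave")
    return -b / (2 * a)


# Hypothetical ingredient percentages and mean liking ratings (made up).
levels = (10, 15, 20)
means = (6.1, 6.9, 6.5)
best_level = quadratic_optimum(levels, means)
print(round(best_level, 2))
```

With these made-up numbers the estimated optimum falls between the middle and highest levels tested, a product that was never actually produced.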

Product tests may be categorized into one of four types, based on the number of products tested and the presence or absence of a systematic treatment structure. The four types are: (1) one product, (2) two products, (3) three or more products with no underlying structure (the products differ but not in any systematic way) and (4) three or more products "created" by the systematic variation of one or more product characteristics (factors).

"One-product tests" can be considered as part of a monadic design, to be discussed below, although the test is done in isolation. No other products are tested at that point in time. Information supplied from such a test can be compared to "norms" contained in data banks.

When testing two or more products, it is important to be able to measure the influence of changes in ingredients on preference. From this, the researcher begins to understand the optimization process. For scaled data, the results of which are summarized by means or proportions, an analysis of differences among these summary statistics follows to assess the ingredient influence, or in general, product differences. When no underlying structure is present, a lone factor represents the products, as levels, and the intent of statistical analysis is to identify differences among them.

Product tests with an underlying systematically varied ingredient structure may require more complex design and analysis. With specific reference to treatment structure, products with two or more ingredients (factors) to be varied can be arrayed in what is termed a factorial design. The maximum number of treatments (e.g., candy bars) in such an array is the result of multiplying together the number of levels of all factors. As an example, consider again candy bars. Nuts could be peanuts, cashews or almonds. Chocolate surrounding the bar could be composed of 20, 30, or 40% chocolate liqueur. The bars' interior could be made up of 10, 15, or 20% caramel in conjunction with 60, 65, or 70% creme filling. The rest of the bar is composed of nuts, the percentage of which necessarily changes as the levels of caramel and creme filling vary. (The percentage of nuts also could be considered as a factor, but for simplicity of exposition, is not.) Further, the relationship among these ingredients inside the bar is constrained by the fact that as one ingredient increases in volume the others must be reduced. (This is an experiment with mixtures of ingredients and requires particular care in design and analysis, the exposition of which is beyond the scope of this paper.) With four factors at three levels each, the number of treatments or cells in the factorial array is 3 (nut types) x 3 (chocolate liqueur levels) x 3 (caramel levels) x 3 (creme filling levels), or 3^4 = 81. In experimental design terminology, this would be referred to as a 3^4 factorial design. Most likely, respondents would be exposed to and asked to evaluate a subset of these 81 treatments. (A discussion of the subsetting is left for the "Design Structure" section which follows.) The evaluation may take the form of a response to a five-point likelihood to purchase question and/or a nine-point hedonic rating scale, along with reactions to a number of product characteristics recorded on similar scales.

A statistical analysis performed on data obtained from such a design would identify which treatment or ingredient combination was best, in the sense of having the largest average rating on some measure of overall liking. In addition, the effect on preference of changing the levels of each ingredient can be assessed both singly (the "main effect" of that ingredient when all other ingredients are held at some constant level) and in combination with other ingredients (the "interaction" of ingredients, an assessment of the synergistic effects of combinations of ingredients).
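The full factorial array for the candy bar example can be enumerated directly. The sketch below simply crosses the levels listed in the text; the dictionary keys are illustrative names chosen here, and the mixture constraint mentioned above is ignored.

```python
from itertools import product

# Factors and levels taken from the candy bar example in the text.
factors = {
    "nut": ["peanut", "cashew", "almond"],
    "chocolate_liqueur_pct": [20, 30, 40],
    "caramel_pct": [10, 15, 20],
    "creme_pct": [60, 65, 70],
}

# Every combination of one level per factor is one treatment (one candy bar).
treatments = [dict(zip(factors, combo)) for combo in product(*factors.values())]

print(len(treatments))  # 81
print(treatments[0])
```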

Practical limitations in product testing are usually reached quickly. For example, few, if any, candy companies could produce all 81 products necessary to accommodate the factorial design above. A reduced design is necessary and some subset of the 81 treatments is tested. But first, a trade-off must be considered: the practical constraint of being unable to produce all experimental products must be weighed against the potential loss of information from not measuring the preference for products excluded from the study. This loss increases as the complexity of the product increases, that is, with the extent to which key ingredients interact in a complex fashion in the formation of an overall product perception. For very complex products, deletion of some treatments would not allow the complexity, as indicated by statistical interaction, to be measured. That treatment which is best may not be included for study, nor would interpolation of existing treatments allow that best treatment to be identified. However, if there is good reason to believe that product ingredients do not interact in a complex fashion, then the number of treatments can be reduced by "fractionating" the factorial design. As the term implies, only a fraction of the treatments would then be produced and tested.

Typically, a fractional factorial design allows for the estimation of the main effects for each ingredient and perhaps a few interactions among ingredients, and no more. The researcher must assume that this is sufficient for finding the optimal ingredient mix, even if the optimal product is not among those tested. Although the "best" treatment may not specifically be tested, interpolation can suggest its approximate position via the "optimal" levels of each of the separate ingredients. Some, albeit limited, interaction information may be available to assist in the interpolation. (The "optimal" level for each ingredient is that amount of ingredient for which preference is greatest, holding all other ingredients at some constant level. A well-constructed treatment structure would attempt to bracket the assumed optimal level of each ingredient. The "optimal" level would not be the highest or lowest amount tested for each ingredient, but rather near the middle.) As with the full factorial design, the effect of each ingredient can be assessed to find those which most influence preference. Returning to the example, the 3^4 design can be fractionated to yield 9, 16 or 18 treatments which, when tested, supply information for estimating main effects, and little else. No clean estimation of interactions is possible. (A nine treatment fractionated design is minimal for estimating main effects. Designs with 16 or 18 treatments also allow for some statistical assessment of how well the main effects describe or summarize product differences.) As such, 9, 16, or 18 types of candy bars need to be produced. If 9 variations are still too many, some reduction in the number of factors and levels within factors must take place.
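A nine-treatment fraction for four three-level factors can be written down with modular arithmetic. The sketch below shows one common textbook construction, not necessarily the fraction the author had in mind: levels are coded 0, 1, 2, two factors are crossed in full, and the other two are generated from them. The function names are illustrative.

```python
from itertools import combinations, product


def nine_run_fraction():
    """Nine runs for four three-level factors, levels coded 0, 1, 2.
    Factors 3 and 4 are generated from the first two (arithmetic mod 3)."""
    return [(a, b, (a + b) % 3, (a + 2 * b) % 3)
            for a, b in product(range(3), repeat=2)]


def is_orthogonal(runs):
    """True if every pair of factors shows each of the 9 level pairs once."""
    n_factors = len(runs[0])
    for i, j in combinations(range(n_factors), 2):
        if len({(r[i], r[j]) for r in runs}) != 9:
            return False
    return True


runs = nine_run_fraction()
for run in runs:
    print(run)
print(is_orthogonal(runs))
```

Because every pair of factors is balanced in this way, each main effect can be estimated without being distorted by the other main effects. Interactions, however, are mixed in with the main effects, which is why such a fraction is reasonable only when interactions are believed to be small.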


Design Structure

"Design Structure" concerns the arrangement or subsets of products presented to respondents. Specifically, it is the structure imposed upon the products which instructs the researcher as to which products, in what sequence, will be evaluated by each respondent. Three general types of structures are considered, which include all possible arrangements of products.

The first is a "monadic" design, where each respondent is presented with one and only one product for evaluation. The second design, at the opposite extreme, exposes each respondent to all products. Lastly, there are designs in which respondents evaluate a selected subset of the products. The researcher may, in fact, want all products to be seen by all respondents, but practical constraints like respondent fatigue or product cost pose limitations.

The most useful design structures within this third classification are those which ensure a "balance" among product presentations across respondents. In general, a design is "balanced" if the same number of respondents evaluate each product and all pairs of products are evaluated an equal number of times. As an example, consider a situation in which each respondent must evaluate two of three products, labeled A, B and C. Three subsets, called "blocks," of two products are formed: AB, AC and BC, with, say, 100 respondents assigned to each block. (Blocking also corresponds to versioning of questionnaires to guide interviewers as to which products are to be evaluated by each respondent.) The blocks are balanced so that across the entire sample of 300 respondents, each product is rated an equal number of times (200). Further, all pairs of products are seen equally often. This type of design is called a balanced incomplete block design, or BIBD.
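The three-product example can be checked with a few lines of code. The sketch below forms every possible block of the requested size (which always gives a balanced design, though not always the smallest one) and counts how often each product and each pair is rated; the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def all_subsets_design(products, block_size):
    """Use every subset of the given size as a block (always balanced)."""
    return [tuple(c) for c in combinations(products, block_size)]


def balance_counts(blocks, respondents_per_block):
    """Count ratings per product and co-occurrences per pair of products."""
    ratings = Counter()
    pairs = Counter()
    for block in blocks:
        for p in block:
            ratings[p] += respondents_per_block
        for pair in combinations(sorted(block), 2):
            pairs[pair] += respondents_per_block
    return ratings, pairs


blocks = all_subsets_design("ABC", 2)
ratings, pairs = balance_counts(blocks, respondents_per_block=100)
print(blocks)         # [('A', 'B'), ('A', 'C'), ('B', 'C')]
print(dict(ratings))  # each product rated 200 times
print(dict(pairs))    # each pair seen together by 100 respondents
```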

Quite often balance cannot be achieved, given the total number of products and the subset required to be rated by each respondent. Partial balance may be possible, though. Consider a situation in which four products need to be tested, yet respondents are capable of evaluating only two. A balanced design yields six blocks (AB, AC, AD, BC, BD and CD) which may be too many for some practical circumstances. A "partial" alternative uses only four blocks: AB, CD, AD and BC. Each product is evaluated equally often. The pairs of products that occur do so equally often, but not all pairs are presented. The AC and BD blocks never appear.
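The difference between the six-block and four-block arrangements can be made concrete by counting how often each product appears and how often each pair of products appears together. The helper below is an illustrative sketch; its name is invented here.

```python
from collections import Counter
from itertools import combinations


def pair_concurrences(products, blocks):
    """Return how often each product and each pair of products appears."""
    replications = Counter(p for block in blocks for p in block)
    together = Counter(
        pair for block in blocks for pair in combinations(sorted(block), 2)
    )
    # Include pairs that never meet, with a count of zero.
    all_pairs = {pair: together.get(pair, 0)
                 for pair in combinations(sorted(products), 2)}
    return dict(replications), all_pairs


products = "ABCD"
balanced = ["AB", "AC", "AD", "BC", "BD", "CD"]
partial = ["AB", "CD", "AD", "BC"]

for name, blocks in [("balanced", balanced), ("partial", partial)]:
    reps, pairs = pair_concurrences(products, blocks)
    missing = [a + b for (a, b), n in pairs.items() if n == 0]
    print(name, reps, "pairs never seen together:", missing)
```

In the four-block arrangement every product still appears in two blocks, but the AC and BD pairs have a count of zero.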

Partially balanced incomplete block designs, or PBIBDs, are somewhat weaker inferentially and statistically due to the lack of balance. However, both of these types of designs are preferable to random or totally unbalanced approaches frequently encountered. It is not unusual to find a product test design for, say, four products with only the AB and CD subsets tested. This design is potentially fraught with considerable error. If exposure to one product is assumed to affect perceptions of the other product, then the reactions to product A depend on whether or not product B (or C or D) is seen. The AB, CD design does not allow for an assessment of the effects of the presence or absence of other products. The effect of the presence or absence of the second product cannot be statistically separated from the rating itself. Balanced and partially balanced designs are easily constructed (a number of texts exist which lay out many of these designs) so there is rarely a reason to use anything but them when respondents are to evaluate a subset of products. When combined with simple rules of randomization, both within blocks and across blocks, these designs are an effective tool for ordering the display of products.

Coupling balanced or partially balanced design structures with a treatment structure must proceed with caution, though. BIBDs and PBIBDs work well for unstructured product tests, where there is no treatment structure. However, they do not take a factorial design into account when allocating products to respondents. Although BIBDs and PBIBDs can be used in conjunction with factorial designs (and may, in fact, be best for assessing context effects, as discussed later) other approaches are possible. For example, a basis for fractionating a factorial design can be effective in forming blocks or subsets of products. This approach is called confounding. The balance characteristic of BIBDs and PBIBDs cannot be guaranteed, though, using this approach, nor will all treatment interactions be assessed with equal precision.

The effects on product evaluations of one or more factors may be deemed less important than others. These factors can be statistically subordinated to those of greater importance through the use of a split-plot design. The effects of these less important factors are still estimable, yet with less statistical precision.

Returning to the design structure classifications, the three categories can be subsumed under a broader dichotomy: one product evaluated or more than one product evaluated. For those product tests where the respondent is to be exposed to more than one product, a primary question to consider is the number of products to be rated. What is the number of evaluations a respondent can give without the responses being biased, say due to fatigue or the presence of other products? The products themselves may supply the answer. Satiation levels may be reached quickly or the product may be oily, suggesting that the respondent's palate may not be cleansed sufficiently for multiple product evaluations (or it may take too much time to adequately clear the palate before further tests are made). The evaluation of the first product would then distort reactions to or ratings of the next.

However, the choice of the number of product exposures may be more personal. Some researchers believe the monadic approach to be best since it is more "natural," yielding a cleaner estimate of preference unaffected by the presence of other products. Others prefer multiple evaluations for cost efficiency or statistical precision reasons. (Statistical analyses of data obtained from multiple evaluation type studies may result in smaller standard errors of differences.) It could be argued that multiple-product tests best reflect the situation of a consumer facing an array of products on a store shelf.

Perhaps the key to selecting a design structure is an appreciation of context effects: the effects other stimuli (be they other products being tested or past product usage experience) have on the product evaluations. Underlying the concept of context effects is the logic of comparative judgments made by respondents. Specifically, all product evaluations provided are comparative, never absolute. When confronted with a product, respondents mentally compare it to other previously experienced products perceived to be similar, both within the same product test and from regular consumption outside the test. A product is rated well or poorly because it is better or worse, relative to other experiences. Again, respondents are not making absolute judgments, regardless of the type of scale supplied to record the reaction. Responses only appear as if they were absolute. Respondents supply their own frame of reference within which products are evaluated. Without understanding this context or frame of reference it is difficult to fully appreciate the ratings forthcoming. This is especially true with monadic designs and the ratings of the first product encountered in a multiple product design. Perhaps context here can be best estimated by knowing prior usage within the specific product category. Information like brand used most can be gathered and incorporated into subsequent analyses of the product test data.

For multiple product evaluations, understanding context becomes especially tricky since it may change as new products are introduced for evaluation. Context for the first product in a test may be past experience. Context for the second product may be the first product and prior experience, and so on. Further, the rating task associated with evaluation of the first product may serve as a learning task, especially for respondents who have never participated in a product test before. The first set of ratings may be affected by lack of familiarity with the task. Further, the respondent may become practiced enough for the second and subsequent products so as to affect these ratings. Statistically and inferentially, context effects can never be eliminated. They are inherent in the ratings given. But the magnitude of the effects can be measured and analyzed. BIBDs and PBIBDs provide a useful structure from which to begin estimating context effects when only a subset of products are to be evaluated. They ensure the use of many contexts or sets of products, given the constraint that a fixed number of products is to be rated. However, a balanced or partially balanced design isn't enough to estimate the full extent of a context effect.

The order or sequence in which products are presented may influence product perceptions, so the presentation of products should be rotated as well. Rotations can be as simple as having each product in each possible position, or more complex, having each product follow each other product an equal number of times. The latter designs are called cross-over designs, and are used to detect carry-over effects of preceding products. The design in which all respondents see all products can provide for all possible contexts, especially if the order of presentation is controlled. For example, a complete Latin square arrangement can be imposed so that each product follows each other product equally often in the testing sequence. When respondents are exposed to only a subset of all products, cross-over type arrangements can be obtained by the use of Youden square designs (when one fewer than the full set of products is to be evaluated) or simply by displaying products in all possible permutations within blocks across the sample of respondents.
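A set of presentation orders in which each product occupies each position once and follows every other product exactly once can be generated with a simple cyclic construction. The sketch below uses what the statistical literature usually calls a Williams design, a name that does not appear in this paper; this simple version assumes an even number of products, and the function name is illustrative.

```python
def williams_square(products):
    """Presentation orders for an even number of products in which each
    product appears once in each position and immediately follows every
    other product exactly once."""
    n = len(products)
    if n % 2:
        raise ValueError("this simple construction needs an even number of products")
    # First sequence of indices: 0, 1, n-1, 2, n-2, ...
    first = [0]
    low, high = 1, n - 1
    while len(first) < n:
        first.append(low)
        low += 1
        if len(first) < n:
            first.append(high)
            high -= 1
    # Remaining sequences shift every index by one (mod n).
    return [[products[(i + shift) % n] for i in first] for shift in range(n)]


orders = williams_square(["A", "B", "C", "D"])
for order in orders:
    print(" ".join(order))
```

Respondents would then be assigned to the resulting orders in equal numbers, preferably at random.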

Subsequent analyses must take the blocking and ordering within blocks into account to measure context effects. (Again, balance and rotation do not reduce a context effect, they simply allow for its statistical assessment.) A context effect is exhibited as a statistically significant block-by-product, order-by-product or block-by-order-by-product interaction: the rating of one product relative to others changes from block to block or from order to order. The preference for a product is then contingent on other products seen in some specific block or order. If such interactions are large, generalizations about product performance become difficult to make. That product which is best in one context (block or order) may not be best elsewhere.
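The size of an order-by-product interaction can be described from a table of mean ratings before any formal test is run. The sketch below computes, for each cell, the part of the mean not explained by the order average or the product average; the ratings are invented for illustration, and judging statistical significance would require respondent-level data and a linear model that reflects the design.

```python
def interaction_effects(cell_means):
    """cell_means[position][product] -> mean rating.
    Returns cell mean - position mean - product mean + grand mean."""
    positions = list(cell_means)
    products = list(next(iter(cell_means.values())))
    n_cells = len(positions) * len(products)
    grand = sum(cell_means[o][p] for o in positions for p in products) / n_cells
    pos_mean = {o: sum(cell_means[o].values()) / len(products) for o in positions}
    prod_mean = {p: sum(cell_means[o][p] for o in positions) / len(positions)
                 for p in products}
    return {(o, p): cell_means[o][p] - pos_mean[o] - prod_mean[p] + grand
            for o in positions for p in products}


# Hypothetical mean ratings on a nine-point scale (made up for illustration).
cell_means = {
    "first":  {"A": 6.8, "B": 6.0},
    "second": {"A": 6.2, "B": 6.4},
}
effects = interaction_effects(cell_means)
for cell, value in effects.items():
    print(cell, round(value, 2))
```

In this made-up table product A is rated higher than B when tasted first but lower when tasted second, so the interaction terms are not zero and the ranking of the products depends on the order of presentation.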


Measurement Procedures

Closely associated with the design structure is the measurement procedure. The issues here concern how the respondent is to interact with or respond to the product(s) presented, and the measurement tools (scales) used to record the response.

The measurement process can be split into two possible approaches: "absolute" measurement and "relative" or "comparative" measurement. "Absolute" scales are best characterized as those which ask for an "absolute" judgment without explicit reference to any other stimuli for comparison purposes. Examples are a five-point likelihood of purchase scale or a nine-point hedonic scale, both frequently encountered. Other examples are: "Rate this product on a five-point scale where '1' means 'not at all sweet' and '5' means 'very sweet'"; and a directional scale where, again, a five-point scale is used but "1" might be "not sweet enough," "3," the middle point, is "just right" and "5" is "much too sweet."

Although called "absolute" scales because of lack of reference to external context, information obtained from such a scale is anything but absolute. The frame of reference or context is supplied by the respondent, and almost certainly differs across respondents.

"Comparative" scales are those which force, or at least direct, the underlying comparative judgment taking place. A very simple way to allow the respondent to express or scale this judgment for a two product test is with a "head-on" scale, which is an elaboration of a paired comparison evaluation. Two products may be compared on level of sugar using a five-point scale: "1" is "product A much sweeter than product B," through "3," "A and B are equally sweet," to "5," "product B much sweeter than A." Direction and intensity of direction are measured. This type of scale is also useful for obtaining preference, again yielding direction and intensity of preference. With more than two products, ranking provides a simple way of obtaining comparative information since each product must be compared to all others to achieve the ranking. (If several products are to be ranked, however, it should be noted that respondents can more reliably identify the best and worst products than rank those in the middle.) Single product tests present a situation where outside or external (to the product test) contexts must be referenced explicitly: "compared to the brand you use most often" or "compared to your ideal product:" The accompanying scale must be graduated accordingly, where scale labels could read from "not at all like . . ." to "exactly like . . ."*

Another distinction between "absolute" and "comparative" scales is indirect versus direct measurement. "Comparative" measurements are direct, using scaling procedures which explicitly consider and record the comparative judgment taking place. One response is all that is needed to make the necessary comparison. "Absolute" scalings are indirect, yielding as many ratings as there are products, which are then compared. Information for the comparison of two products is obtained from two sources or ratings where it may not be obvious to the respondent that a comparison is desired by the researcher. Again, no explicit contextual reference is supplied as a guide to rating. The researcher must hope that the existing context is common to all ratings obtained from a respondent, and may be left with a weaker sense of inference concerning product differences.


*With strict interpretation, it is assumed that the scale points are consistently and unambiguously interpreted the same way by all respondents. Further, measurement is supposed to proceed along at least an interval scale. This is never the case, either for "absolute" or "comparative" scales. However, all that is required of the measurement is an indication of product performance, rather than an attempt at precise estimation of the product position on the scale. Beyond this, no attempt will be made here to address the controversy of whether scales, as used in marketing research, can be treated as interval scales.

The comments on comparative judgment and context effects found in the "Design Structure" section are fully applicable here. However, additional arguments favor the use of "absolute" scales. First, information concerning the "absolute" level of product performance may be needed. It may be extremely important to determine if a product is performing at a satisfactory level, say at least a "6" on a nine-point hedonic liking scale. Although "comparative" evaluations may suggest the superiority of one product over another, both products may, in fact, perform quite poorly.

One approach to obtaining useful comparative data on an absolute scale might be to introduce two control products to be evaluated first. The two products would define, based on past performance, opposite extremes of expected product properties. For example, in the evaluation of beer, one control product might be a very light lager, the other a very dark, heavy stout. These initial product ratings provide the context or boundaries for subsequent beer evaluations. (This use of control products is discussed further in the section on standardization of tests.)

Tangentially, absolute ratings are often suggested for tests where there is a concern for comparability of ratings over time. There may be some interest in comparing ratings of a product obtained at two different points in time. Given the influence of context, "absolute" ratings still yield comparative or relative ratings. The conditions under which the product was tested, specifically the other products that accompanied it in testing, should be duplicated in future tests to ensure comparability.

A second argument favoring "absolute" scales concerns the ability to track perceived changes in the physical composition of the products, as per those modifications made for the product test. The reliability and validity of any scale used in product testing must be judged by its ability to measure changes in respondents' reactions to products that mirror actual physical changes in products themselves. Ideally, it should be possible to calibrate scales so that changes of given magnitudes along the scale reflect specific changes in product composition. Classical psychophysical measurement procedures, like paired comparisons, have been used in efforts at calibration, yet this approach may require an inordinate amount of scaling work by respondents. "Absolute" scalings offer a simpler approach, yielding information which can approximate the necessary calibration. Unfortunately, very little work has been published in this area. The ability to calibrate attitude scales is still largely unexplored in marketing research.

Clearly, if "absolute" scales are adopted, the researcher must exert substantial control over the context. Control requires understanding each respondent's product experience or frame of reference. Inventorying brand usage information is critical. Brands used most within the category of interest can be rated on the same product characteristics or attributes used for the test products. The intention is not to compare these ratings since brand image would completely distort any physical product differences, but to use the external reference information as a basis for statistical control or standardization of ratings of the test products. Ratings of brand used most may provide a reasonable idea of how "high" a test product can realistically be expected to be rated (averaged across respondents). Control is also exercised by the use of appropriate design and treatment structures, followed by correct statistical analysis taking these structures into account. Only then can context effects be estimated and correct inferences about product differences be made.


Test Objectives

A key objective pursued in product testing concerns the study of relationships between changes in product composition or ingredients and changes in preference. Information on preference is usually accompanied by measurement of reactions to other product characteristics. The methodology and statistical comments which follow are applicable regardless of the scaling used, although "absolute" scaling is most often encountered, especially when measuring specific product characteristics.

Data collected here are typically analyzed using some form of linear model. The linear model encompasses analysis of variance and regression analysis and all modifications of the analysis to reflect the design structure used in data collection. The distinctions between analysis of variance and regression are beyond the scope of this paper. In either case, characteristics of the treatment structure (the factors and levels within those factors) and the design structure must be incorporated.

The purpose of such analyses is two-fold. The first is the estimation of that ingredient mix, operationalized by a combination of levels (one from each factor) for which product preference is greatest. The second, more general, purpose is the estimation of the effects of each ingredient, or factor, on preference. The attempt here is to identify those ingredients which most influence or "drive" preference, finding those product characteristics which, if changed, improve product preference. In a well-constructed product test, key ingredients are varied, allowing relatively simple and straightforward examination of the effects on preference. Note that inferences to be drawn here are at the product level; the researcher considers changes in products. As such, the analysis and interpretation of these data must also be at the product level. Analysis of variance, even as specific as t-tests, and response surface analyses allow such inference since their basis is the assessment of change among mean product ratings as product characteristics change.

However, regression analysis performed within products is of no practical utility. This is an attempt at predicting preference ratings given by respondents for one product, using ingredient ratings for that product as predictor ("independent") variables. The inference here is at the respondent level where, for the specific analysis, no product changes have taken place or are measured. The regression coefficients, used with the intent of identifying influential ingredients, are determined purely by respondent-to-respondent differences in how that one product was perceived. The coefficients do not measure or indicate change in products necessary to change preference, but rather only change in respondent-to-respondent perceptions of that one product. Without understanding why perceptions differ (which is like asking why respondents differ), and within a product it is clearly not due to product-to-product differences, the coefficients do not give the intended interpretation. Add to this the statistical instability inherent in regression when the predictors are correlated (the price paid for statistically controlling or "holding constant" variables which, by design, have not been controlled), and the coefficients may not be interpretable at all. It is interesting to note that the variability which underlies this analysis is considered as error in the analysis of variance approach.

Another objective is discrimination. Consider a situation where a new product formulation is developed, made up of less expensive ingredients. If potential consumers of the product cannot distinguish the cheaper product from the standard, currently available product, then little perceivable risk is incurred in switching to the new formulation. The research objective at this point is to assess respondents' ability to correctly, and potentially consistently, distinguish between the old and new formulation. Lack of discrimination suggests that the new product may be a cost-effective entry which has little chance of eroding the current product's franchise. Issues in discrimination testing are discussed at length in Research on Research Paper Number 33.
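One way to judge whether respondents can tell two formulations apart is to compare the number of correct identifications with what guessing alone would produce. The sketch below assumes a triangle-style test in which the chance of a correct guess is one in three; that test format, the sample size and the count of correct answers are assumptions made here for illustration and do not come from this paper.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Hypothetical result: 60 respondents, 28 correct, chance of guessing = 1/3.
n_respondents = 60
n_correct = 28
p_value = binomial_tail(n_respondents, n_correct, 1 / 3)
print(round(p_value, 4))
```

A small tail probability suggests that respondents are discriminating between the formulations. A large one does not prove the products are indistinguishable; it may simply reflect a sample too small to detect the difference.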


Standardization of Tests

The quality of information obtained from a product test can be greatly enhanced if some standardization takes place to control or direct the measurement process. This standardization can begin with the introduction of a control product, to which all respondents are exposed first. As such, all respondents start the product test in a comparable fashion. This is of special utility for taste tests conducted in malls where respondents may approach the tasting task with different residual tastes in their mouths. In addition to serving the "cracker or water to cleanse the palate" purpose, the respondent is allowed to practice the rating task before moving to the products of interest. Further, prior exposure to the characteristics or product attributes to be rated will help familiarize respondents with the rating task, sensitizing them to physical characteristics to which they may otherwise have paid little attention. In general, there will be improved reliability and validity of diagnostic information.

The logic of comparative judgment suggests that this control product will serve as a context for subsequent ratings. Perhaps, then, the best candidates for control products are current products, those already on the market. To extend or increase the inferential validity of the taste test, a few replications of the taste test could be constructed, each with a different control product.