Sample size tables for significance tests

Introduction

At some point in any research project, a decision must be made concerning the sample size needed to achieve the study objectives. Among the factors that might be considered in this decision are the purpose of the research, the method(s) and costs of data collection, availability and costs of products or other materials, sizes of subgroups of interest for analysis, as well as statistical issues, such as levels of confidence with which to address means, percentages, and differences among them, and sensitivity (power) to detect differences between groups.

Although statistical confidence and power are not the only considerations, in many cases they are important ones because they allow the researcher to control the risks of drawing erroneous conclusions from sample results: either concluding that "real" (population) differences exist when they do not ("Type I errors"), or that population differences do not exist when, in fact, they do ("Type II errors"). These risks are discussed in detail in Research on Research Paper Numbers 27, 36 and 43. Consideration of such risks is important, for example, in product or concept tests, where the future of a product or concept may depend, at least in part, on the outcome of one or more statistical significance tests.

Tables are provided in this report to simplify the task of estimating sample sizes required for significance tests concerning means and proportions (or percentages) obtained from a single sample or from two independent samples. A detailed description of the logic of statistical significance tests is presented in Research on Research Paper Number 36, and procedures for estimating sample sizes are described and illustrated in Research on Research Paper Number 37. The tables presented here were developed by application of formulas contained in the latter paper.


Using the Sample Size Tables

In order to make use of these tables, a researcher must specify the following:

  1. The degree of risk one is willing to tolerate of declaring a spurious sample difference to be "significant." This is the "Type I error" risk, and is usually set relatively low, e.g., 5% or 10%. The complement of this risk is the confidence level: confidence = 100% - Type I error risk. The tables include entries for confidence levels of 90% and 95%, corresponding to risk levels of 10% and 5%.

  2. The risk of overlooking a "real" (population) difference of a given size. This is the "Type II error" risk; this type of error occurs when there is a difference in the population but the sample does not yield a significant difference. The complement of this risk is "power" — the likelihood of obtaining a significant sample difference when there is a real difference of a given size in the population. The tables include entries for powers of 50, 70, 80 and 90%.


  3. The variability of the variable of interest. In general, the greater the variation among responses, the larger the sample size required to detect a difference of a given size as "significant." A great deal of market research data is obtained using rating scales, and the variance among responses depends in large part on the number of scale points used. For example, a difference of .5 between two population means is relatively larger on a 5-point scale than on a 10-point scale because the variation among responses generally would be greater in the latter case. Accordingly, the sample size required to detect a .5 difference as "significant" would be greater with 10 scale points than with 5. Suggested variance estimates for various numbers of scale points are included in Table 3.

     For tests involving proportions (or percentages), the variance depends on the proportion itself: variance = P(1 - P), where "P" is the population proportion (or average of two proportions in the two-sample case). This variance is largest when P = .5 and declines as P approaches 0 or 1. Unless there is good reason to believe that the proportion(s) of interest are not near .5, one should use P = .5 in estimating the variance (which will then be .25). This ensures that the resulting sample size is at least large enough to achieve the desired power, regardless of the true population proportion(s).

  4. The size of the smallest difference that is "useful" or "meaningful" or important to detect. For example, a "true" difference between two population means that is less than 1/10 of a point on a 6-point scale might not be considered large enough to be of practical importance. In general, the smaller the difference of interest, the larger the sample size required to detect it with a given degree of power. The size of the difference of interest must, of course, take into account characteristics of the "measurements" employed (e.g., measurement error in general, and the number of scale points when rating scales are used).
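
For the variance of a proportion discussed above, a short sketch may help show why P = .5 is the conservative choice. The function name proportion_variance is hypothetical and used only for illustration; the formula itself, variance = P(1 - P), is the one given in the text.

```python
def proportion_variance(p):
    """Variance of a yes/no response when the population proportion is p."""
    return p * (1 - p)


# The variance peaks at P = .5 and shrinks symmetrically toward 0 and 1,
# so planning with P = .5 never understates the variance.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"P = {p:.1f}  variance = {proportion_variance(p):.2f}")
```

Because the required sample size is proportional to the variance, a sample planned with variance .25 is at least as large as one planned with any other value of P.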

Once these have been specified, use the tables as follows:

  1. Find the column corresponding to the desired levels of confidence (Table 1 for 90%, Table 2 for 95%) and power.*

  2. Locate the row corresponding to the difference of interest.

  3. Multiply the tabled value by the estimated variance. (For data obtained using rating scales, see Table 3.) Then,

    • For tests of differences between two independent samples, the result is the required sample size in each group.
    • For one-sample tests (e.g., does a particular mean or proportion differ significantly from some assumed value?), divide the result by 2 to obtain the appropriate sample size.
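
The three steps above can be sketched in code. This is only an illustration under stated assumptions: the formulas behind the tables are given in Paper Number 37, not here, so the sketch uses the standard normal-approximation formula, tabled value = 2(z_confidence + z_power)^2 / difference^2, which reproduces the two tabled values quoted in the examples later in this report (198 and 8,406). The function names and the rounding to the nearest whole number are hypothetical choices, not taken from the tables.

```python
from statistics import NormalDist


def tabled_value(confidence, power, difference):
    """Per-group multiplier for a two-tailed test (assumed formula)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - (1 - confidence) / 2)  # two-tailed critical value
    z_beta = z(power)                      # normal quantile for the desired power
    return round(2 * (z_alpha + z_beta) ** 2 / difference ** 2)


def sample_size(confidence, power, difference, variance, one_sample=False):
    """Step 3: multiply by the variance, then halve for one-sample tests."""
    n = tabled_value(confidence, power, difference) * variance
    if one_sample:
        n /= 2
    return round(n)


print(tabled_value(0.90, 0.80, 0.25))  # 198
print(tabled_value(0.95, 0.90, 0.05))  # 8406
```

With a variance of 1.8, sample_size(0.90, 0.80, 0.25, 1.8) gives 356 per group. With a variance of .25 and one_sample=True, sample_size(0.95, 0.90, 0.05, 0.25, one_sample=True) gives 1,051.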


Example 1: Test of a Difference Between Two Independent Sample Means

A company is interested in comparing reactions to two advertisements for one of its products. Respondents will be shown one of the two ads and then asked to indicate their likelihood of purchasing the product, using a 5-point rating scale where 5 = "I definitely would buy it" and 1 = "I definitely would not buy it." A confidence level of 90% is adopted, which limits the risk of finding a "significant" difference between the ads to at most 10% if there is no difference in the population. Further, the company wants to have at least an 80% chance of obtaining a significant difference in the sample if there is a real difference of at least 1/4 point in the population. How many people should be exposed to each ad?

In this case, the confidence level is 90%, the difference of interest is .25, and the desired power to detect a difference of this magnitude is 80%. With a 5-point rating scale the variance is expected to be at most 1.8 (from Table 3). Using Table 1, the entry corresponding to 80% power and a difference of .25 is 198, which when multiplied by the estimated variance yields a sample size requirement of 198 x 1.8 = 356.4, or about 356, for each ad.



* The confidence levels here assume that a two-tailed, or non-directional, test is appropriate. For one-tailed tests with the direction of interest specified in advance, Table 1 corresponds to 95% confidence (a 5% Type I error risk) and Table 2 to 97.5% confidence (a 2.5% Type I error risk). Refer to Research on Research Paper Numbers 36 and 37 for discussions of this issue.
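
The correspondence between one-tailed and two-tailed confidence levels in the note above can be checked with a brief sketch. It assumes the tests rely on normal critical values, which this report does not state explicitly, and the function name critical_z is hypothetical.

```python
from statistics import NormalDist


def critical_z(confidence, tails):
    """Normal critical value for a test with 1 or 2 tails (assumed normal test)."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / tails)


# A two-tailed test at 90% confidence and a one-tailed test at 95% confidence
# place the same 5% in the tail of interest, so they share a critical value.
print(round(critical_z(0.90, tails=2), 3), round(critical_z(0.95, tails=1), 3))
print(round(critical_z(0.95, tails=2), 3), round(critical_z(0.975, tails=1), 3))
```

Each printed pair is identical, which is why a single table can serve both kinds of test.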



Example 2: A One-Sample Test for a Proportion

A soft drink company is interested in whether people prefer the current formulation of a soft drink or an experimental reformulation. Respondents will taste both in a "blinded" test and then be asked which one they prefer, with responses of "no preference" not allowed. (Although two products are involved here, this is actually a one-sample situation, since a test of whether there is a difference in preference for the two products is the same as a test of whether the proportion preferring either one of them differs from .5.) The significance level chosen for the test is 5% (corresponding to a 95% confidence level) and the company wants to have at least a 90% chance of detecting a significant difference if the true preference proportion for either product is at least .55 (which implies that the true proportion for the other product is at most .45).

If the two products are, in fact, equally preferred in the population, then the proportion for each is P = .5, and the variance is .5(.5) = .25. For a confidence level of 95% and a power of 90% to detect a difference of .55 - .50 = .05, the tabled value (from Table 2) is 8,406. Multiplying this by the variance, .25, yields 2,102. Since this is a one-sample test, the appropriate sample size is half of 2,102, or 1,051.
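
As a cross-check on Example 2, the one-sample size can also be computed directly rather than by halving a two-sample result. This sketch assumes the usual normal-approximation formula for a one-sample test, n = (z_confidence + z_power)^2 x variance / difference^2, which is not quoted in this report. The function name one_sample_size and the choice to round up are likewise assumptions.

```python
import math
from statistics import NormalDist


def one_sample_size(confidence, power, difference, variance):
    """Sample size for a two-tailed one-sample test (assumed formula)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - (1 - confidence) / 2)  # two-tailed critical value
    z_beta = z(power)                      # normal quantile for the desired power
    return math.ceil((z_alpha + z_beta) ** 2 * variance / difference ** 2)


# Example 2: 95% confidence, 90% power, difference .05, variance .5(.5) = .25
print(one_sample_size(0.95, 0.90, 0.05, 0.25))  # 1051
```

The direct calculation agrees with the 1,051 obtained from Table 2, which suggests that halving the two-sample result is equivalent to using the one-sample formula.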



Another Perspective

The best time to consider power is before the size of the sample has been determined, when the researcher still has the opportunity to control both types of risk for population differences large enough to be deemed important. However, if the sample size has already been specified, the question becomes, "What size population difference can be detected with given levels of confidence and power?"

Tables 4 and 5 are designed to answer this question. Table entries are population difference multipliers for sample sizes ranging from 100 to 5,000. To obtain the detectable population difference, multiply the table value by the estimated standard deviation (the square root of the variance estimate from Table 3). These tables assume the difference is between two independent samples; for one-sample tests, divide the result by the square root of 2 (about 1.414).

Returning to Example 1, if the sample size had already been set at 200 per ad, then with 90% confidence and an estimated standard deviation of 1.34 (the square root of 1.8), one would have a 70% chance of detecting a population difference between ads of 1.34(.217) = .29, 80% power to detect a difference of 1.34(.249) = .33, and 90% power to detect a difference of 1.34(.293) = .39.



If the preference test described in Example 2 was carried out with 1,000 respondents, then assuming a standard deviation of .5 (the square root of .25) and a confidence level of 95%, one would have 90% power to detect a population difference of .5(.145) / 1.414 = .051.
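
The detectable differences in this section can be sketched in the same way. The multiplier formula used here, (z_confidence + z_power) x sqrt(2 / n), is an assumption; it is not given in this report, though it reproduces the quoted multipliers .217, .249, .293 and .145. The one-sample divisor is taken to be the square root of 2, consistent with halving the sample size for one-sample tests. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def difference_multiplier(confidence, power, n_per_group):
    """Multiplier for two independent samples (assumed formula)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - (1 - confidence) / 2)  # two-tailed critical value
    z_beta = z(power)                      # normal quantile for the desired power
    return (z_alpha + z_beta) * sqrt(2 / n_per_group)


def detectable_difference(confidence, power, n, std_dev, one_sample=False):
    """Multiply by the standard deviation; divide by sqrt(2) for one sample."""
    d = difference_multiplier(confidence, power, n) * std_dev
    return d / sqrt(2) if one_sample else d


# Example 1 with 200 respondents per ad and variance 1.8
for power in (0.70, 0.80, 0.90):
    print(power, round(detectable_difference(0.90, power, 200, sqrt(1.8)), 2))

# Example 2 with 1,000 respondents and variance .25
print(round(detectable_difference(0.95, 0.90, 1000, 0.5, one_sample=True), 3))
```

Under these assumptions the sketch gives .29, .33 and .39 for Example 1 and about .051 for Example 2.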