Sample Sizes for Analyses of Means and Proportions
Introduction:
An issue that must be resolved early in any research project concerns the sample size required to satisfactorily address the research objectives. In many cases, a researcher's questions can be answered via statistical analyses of sample means and / or proportions. This report describes the statistical issues involved in sample size estimation and presents formulas for determining — from a statistical point of view — appropriate sample sizes for analyses of means and proportions from one sample or from two independent samples.
Many questions concerning means and proportions require either (a) estimating a population mean, proportion, or difference with a certain degree of accuracy; or (b) testing the statistical significance of a difference, either between a sample mean or proportion and some hypothesized value, or between means or proportions from two different samples. The sample size formulas presented in this paper yield sample sizes that are "appropriate" in the sense that they provide the desired degree of accuracy when estimating population means, proportions, or differences and they allow the researcher to control the risks of reaching erroneous conclusions when carrying out tests of statistical significance. Since the issues involved in estimating population means, proportions, or differences differ from those involved in statistical significance testing, these topics will be discussed separately. (A detailed description of the logic involved in significance tests is presented in Research on Research No. 36.)
Factors Affecting Sample Size Requirements for Estimating Population Means and Proportions:
In the long run, estimates of means and proportions obtained from larger samples are subject to less sampling error (sample-to-sample variation) than estimates obtained from smaller samples. (This sampling error is measured by the standard error of the sample mean or proportion; the larger the sample, the smaller the standard error.) As a result, large-sample estimates of means and proportions tend to be more accurate (nearer to the population value) than estimates obtained from smaller samples.
Therefore, one factor that affects sample size requirements is the degree of accuracy desired: what is the maximum tolerable difference between the population value to be estimated and the sample estimate? For example, a researcher interested in estimating the mean likelihood of purchase for a potential new product, based on ratings on a 5-point scale, might want to be reasonably confident that the sample mean lies within two-tenths of a point of the population mean.
It should be noted that some non-random sampling plans can yield biased estimates of population characteristics. A biased estimate is one that converges to a value other than the true population value as the sample size increases. For example, if a quota sample contains equal numbers of owners of various brands when, in fact, the true brand shares differ widely, then total-sample estimates of population means, proportions, etc., are biased estimates of the true values. Increasing the sample size in this situation does not remove the bias; it simply provides increasingly precise estimates of the incorrect (biased) values. (Weighting the samples may make the weighted total-sample estimates appear more "reasonable," but one can't be sure that similar results would have been obtained from a more representative sample, one unconstrained by quotas.)
For any given degree of accuracy, the sample size also depends on the variability of the variable of interest. Variability is typically measured by the variance or standard deviation. In general, the smaller the variance, the smaller the number of respondents required to achieve the desired degree of accuracy. When no information about the variance is available, it is best to be conservative (i.e., to err on the "high" side) in its estimation. This will yield an estimated sample size with at least the desired accuracy. Guidelines for estimating variances for various rating scales are provided in Appendix I.
In the case of proportions, if the population proportion is P, then the variance is P(1 - P). This variance is greatest when P = .5 and decreases as P approaches zero or one. For example, if a researcher is interested in estimating the proportion of people who prefer a "new improved" product over an existing product, a larger sample would be required if the products are very similar (so P is likely to be near .5) than if they are very different (where P may differ greatly from .5).
A third factor that affects the sample size needed to estimate a mean or proportion is the desired degree of confidence that the specified accuracy will actually be achieved. Except in trivial cases*, it is mathematically impossible to guarantee that a mean or proportion from a sample of any size will fall within a given range around the true value. In other words, there is no sample size short of a census that will allow the researcher to be certain (or "100% confident") of achieving the desired accuracy. Some uncertainty will always exist when an estimate of some unknown population value is sought. Therefore, the researcher must specify some maximum probability, or risk, of obtaining a sample value that is less accurate (further from the population value) than desired. This risk (denoted by α) is usually set at a low value, such as .10 or less. A risk of .05 (or 5%), for instance, implies that the researcher will tolerate a "one in twenty" chance of obtaining a sample estimate that is less accurate than desired. The level of confidence (the converse of this risk) is the likelihood of obtaining a sample estimate with the desired accuracy, and is usually expressed as a percentage: confidence = 100(1 - risk). For example, a risk level of .05 corresponds to a confidence level of 95%.
* The trivial cases are those where the values of the variable of interest are constrained to lie within a fixed range, and the degree of accuracy is so low that it encompasses the entire range. For example, one can always be certain that sample and population proportions differ by less than 1.0, although this situation is trivial.
To summarize this section, three factors affect the sample size needed to estimate a population mean or proportion. The required sample size increases as:
- (a) the desired accuracy increases — i.e., as the maximum tolerable error of estimate (or difference between the sample and population values) decreases,
- (b) the variance increases, and
- (c) the desired degree of confidence increases (or risk decreases).
Sample Size Requirements for Estimating Population Means:
To estimate a population mean accurate to within ± d with a risk level of α, i.e., with 100(1 - α)% confidence, the required sample size is
n = (z α/2)² (σ²) / (d²), (1)
where σ² is the variance of the variable of interest and z α/2 is the value from a standard normal ("z") distribution that is exceeded with probability α/2. (Values of z for various levels of confidence are given in Appendix II.)
Example:
A 5-point rating scale will be used to obtain consumers' likelihood of purchase for a new product. If the variance is expected to be at most 2.0, then the sample size needed to estimate the population mean accurate to within ±.2 with 95% confidence is
n = (1.96²) (2.0) / (.2²)
= 192.1, or 193.
(Fractional results are always rounded upwards.)
Note that this sample size estimate is conservative (i.e., on the "high" side) because a relatively large variance estimate was employed. (See Appendix I.) If the true variance is less than 2.0, more than 95% of all possible random samples of 193 respondents will have means within .2 points of the population mean. Thus, the degree of confidence with which the researcher could state that the sample mean is within .2 points of the population mean would be at least 95%.
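The calculation in formula (1) can be sketched in a few lines of Python. This is only an illustration: the function name sample_size_for_mean and its parameter names are hypothetical (they do not appear in the report), and the z value is passed in directly, as it would be read from Appendix II.

```python
import math


def sample_size_for_mean(z_half_alpha: float, variance: float, d: float) -> int:
    """Formula (1): n = z^2 * sigma^2 / d^2, with fractional results rounded up.

    z_half_alpha -- z value exceeded with probability alpha/2
    variance     -- (conservative) estimate of the population variance
    d            -- maximum tolerable error of estimate
    """
    n = (z_half_alpha ** 2) * variance / (d ** 2)
    return math.ceil(n)


# Likelihood-of-purchase example: variance 2.0, accuracy +/- .2, 95% confidence.
n_conservative = sample_size_for_mean(z_half_alpha=1.96, variance=2.0, d=0.2)
print(n_conservative)  # 193

# A smaller (hypothetical) variance of 1.0 would need a smaller sample.
n_smaller_variance = sample_size_for_mean(z_half_alpha=1.96, variance=1.0, d=0.2)
print(n_smaller_variance)  # 97
```

The second call uses an assumed variance of 1.0 purely to show why the larger variance estimate is the conservative choice.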
Sample Size Requirements for Estimating Population Proportions:
To calculate the sample size needed to estimate a population proportion, P, one can simply substitute P(1 - P) for σ² in formula (1), i.e.,
n = (z α/2)² P(1 - P) / (d²). (2)
Note that the variance, P(1 - P), depends on P, which is the unknown population proportion to be estimated. When nothing is known about the probable value of P, it is best to use P = .5, which will yield the largest (most conservative) sample size. In some instances, the value of P can be reasonably expected to lie closer to zero or one. For example, if one is interested in estimating the proportion of defective transistors in a large lot, it may be reasonable to assume, based on prior data, that P is at most .15. In this case, P = .15 would be used in formula (2). (When P is very close to zero or one, e.g., less than .1 or greater than .9, other more appropriate methods are available for estimating sample size requirements.)
Example:
A researcher wants to estimate the proportion of people who prefer a test cigarette over Marlboro in a blind taste test, where respondents will be forced to state a preference. In this case, nothing is known about P, so P = .5 will be used. The (conservative) sample size required to estimate the true value of P, accurate to within ± .02 with 90% confidence, is
n = (1.645²) .5(.5) / (.02²)
= 1,692.
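As a rough sketch of formula (2), the Python below reproduces the taste-test figure and shows how the requirement shrinks when P can be assumed to lie away from .5 (using the P = .15 value from the transistor illustration). The function name sample_size_for_proportion is hypothetical, and the z value is again supplied from Appendix II rather than computed.

```python
import math


def sample_size_for_proportion(z_half_alpha: float, p: float, d: float) -> int:
    """Formula (2): n = z^2 * P(1 - P) / d^2, rounded up."""
    n = (z_half_alpha ** 2) * p * (1.0 - p) / (d ** 2)
    return math.ceil(n)


# Blind taste test: P = .5 (nothing known), accuracy +/- .02, 90% confidence.
n_worst_case = sample_size_for_proportion(z_half_alpha=1.645, p=0.5, d=0.02)
print(n_worst_case)  # 1692

# Same accuracy and confidence, but assuming P is at most .15.
n_p15 = sample_size_for_proportion(z_half_alpha=1.645, p=0.15, d=0.02)
print(n_p15)  # 863

# P = .5 gives the largest requirement over a grid of candidate values of P.
grid = [i / 100 for i in range(1, 100)]
largest = max(grid, key=lambda p: p * (1.0 - p))
print(largest)  # 0.5
```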
Factors Affecting Sample Size Requirements for Significance Tests Concerning Means and Proportions:
A statistical significance test allows the user to assess the risks of drawing erroneous conclusions about characteristics of the population based on data obtained from a random sample. For instance, a researcher may want to determine whether the proportion of people in the population who prefer "Brand A" over "Brand B" differs from .5, or whether there is a true (population) difference between overall ratings of two brands. Implicit in these two examples are two different testing situations. The first is when a sample value (e.g., mean or proportion) is to be compared to some hypothesized population value. The second pertains to a comparison of two sample values when the hypothesized difference between the corresponding population values is zero.
Note that all significance tests pertain to differences: based on results obtained from a sample, the researcher attempts to infer whether a population mean, proportion, difference between means, etc., differs from some hypothesized value. In the brand preference example, the hypothesized difference between the population proportion (of Brand A preferrers) and .5 is zero. In the brand rating example, the hypothesized difference between the population brand means would be zero.
The greater the difference between the observed sample value and the hypothesized population value, the more confident the researcher becomes that the hypothesized value is "wrong" and should be rejected. Whether the hypothesized value should, in fact, be rejected depends on the risks the researcher is willing to tolerate of reaching the wrong conclusion — either rejecting the hypothesized value when it is, in fact, correct, or failing to reject it if it is wrong. These risks are discussed further below (and in greater detail in Research on Research No. 36).
In any given situation, the appropriate sample size for a statistical significance test depends on three factors: (1) the size of the smallest population difference considered important to detect, (2) the population variance, and (3) the risks of reaching a wrong conclusion.
As one would expect, the smaller the population difference of interest, the larger the sample size required to detect the difference as "significant" with a specified level of confidence. In a political poll, for example, a larger sample size is needed to project a "winner" with reasonably high confidence in a close race than in a race in which a large majority of people favor one candidate. Similarly, when comparing mean ratings of different brands, larger size samples are required in order to detect small population differences.
Therefore, it is important to specify the smallest difference that is meaningful, or important to detect as "significant." Samples that are too small will cause relatively large differences to be deemed non-significant, whereas samples that are too large may allow trivially small differences (i.e., differences of no practical importance) to be considered significant with a high degree of confidence.
Of course, the smallest difference of practical importance depends on the variability of the variable of interest. In general, the greater the variance, the greater the sample size requirement. For example, a larger sample would be needed to declare a difference between means of .4 as statistically significant if ratings are obtained using a 10-point scale than if a 5-point scale is used. (Guidelines for estimating variances for various rating scales are given in Appendix I.)
The third factor affecting sample size requirements for significance tests is the risk the researcher is willing to tolerate of reaching the wrong conclusion. Actually, there are two different risks that must be considered. One is the risk of obtaining a large difference from sample data when no real difference exists in the population. This leads the researcher to the spurious conclusion that a true difference exists when, in fact, it does not, which is called a Type I error. The risk of committing this type of error is called the "significance level" of the test, and an upper limit for this risk is set by the researcher. Usually, this risk (denoted by α) is set relatively low, e.g., .10 or less. The confidence level associated with the test is the converse of this risk; a significance level criterion of .05 corresponds to a confidence level of 95%, a Type I error risk of .10 to a confidence level of 90%, etc. The confidence level describes the degree of "assurance" a researcher can have that a difference observed in a sample reflects a true population difference. The larger the sample difference, the more confident the researcher can be that a real difference exists in the population.
The second type of risk that must be considered is the risk of not finding a difference when a true difference exists. This is called a Type II error. This kind of error occurs when the statistical test is not sensitive enough to detect the observed sample difference as "significant". Type II errors may be caused either by an insufficient sample size or by chance factors that result in a very small observed sample difference. The true probability (or risk) of committing a Type II error depends on the size of the true (population) difference, and, as such, is generally unknown. However, the risk of a Type II error decreases as the sample size increases (for a fixed significance level).
The converse of the risk of a Type II error is called the power of the test: it is the probability of detecting a difference when a real (population) difference exists. A test with high power is one that has a reasonably good chance of detecting a sample difference as significant when there is a true population difference of at least a given size. When a researcher can specify the size of an "important" difference, it is worthwhile estimating the sample size needed to detect a difference of at least that size with reasonable power, i.e., with a small risk of a Type II error. Failure to consider power when planning sample sizes can lead to samples that are too small to detect differences large enough to be deemed "important" or samples that are larger (and more expensive) than needed to detect differences of interest.
Finally, it should be noted that a significance test can be either "directional" (or "one-tailed") or "non-directional" ("two-tailed"). A non-directional test allows the researcher to assess the statistical significance of a difference regardless of its direction (or sign). A directional test, on the other hand, requires the researcher to specify the direction of the difference of interest before looking at the sample data; differences in the other direction, regardless of magnitude, are ignored. In most marketing research applications, non-directional tests are most appropriate. This issue is discussed in Research on Research No. 36.
Sample size requirements for testing whether a mean differs from some hypothesized value:
To test whether a population mean differs from some hypothesized value, let
- d = the smallest difference of interest,
- σ² = the population variance,
- α = the significance level of the test, i.e., the maximum risk of a Type I error (confidence = 100(1 - α)),
- β = the Type II error risk, i.e., the risk of failing to detect a true difference of size d or greater as significant (power = 1 - β),
- z α/2 = the value from a standard normal ("z") distribution that is exceeded with probability α/2 (from Appendix II),
- zβ = the value from a standard normal ("z") distribution that is exceeded with probability β (from Appendix II).
(Note: for one-tailed tests, zα would be used instead of z α/2.)
Then, the sample size required (for a non-directional test) is
n = (z α/2 + zβ)² (σ²) / (d²). (3)
It is worth noting that the only difference between this expression and formula (1), which yields the sample size needed to estimate a mean with a given degree of accuracy, is the presence of zβ, the z value corresponding to the Type II error risk. Use of formula (1) for estimating the sample size required for a significance test yields a sample size with a Type II error risk of .5 (for zβ = 0); i.e., if the true difference is "d," then one would run a 50% risk of obtaining a sample mean that does not differ significantly from the hypothesized value, leading to a Type II error.
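The 50% figure can be checked numerically. The sketch below is an approximation under stated assumptions: it uses the usual normal approximation for the power of a non-directional test and ignores the negligible chance of a significant result in the wrong direction. The function name approx_power and the inputs (σ = 1, d = .25, α = .05, β = .20) are illustrative choices, not values from the report.

```python
import math
from statistics import NormalDist


def approx_power(n: float, d: float, sigma: float, z_half_alpha: float) -> float:
    """Approximate power to detect a true difference d with sample size n.

    Normal approximation that ignores the far (wrong-direction) tail:
    power = Phi(d * sqrt(n) / sigma - z_half_alpha).
    """
    shift = d * math.sqrt(n) / sigma
    return NormalDist().cdf(shift - z_half_alpha)


# Illustrative inputs (assumptions for this sketch only).
sigma, d = 1.0, 0.25
z_half_alpha, z_beta = 1.96, 0.842  # alpha = .05 two-tailed, beta = .20

# Unrounded sample sizes from formula (1) and formula (3).
n_formula_1 = (z_half_alpha ** 2) * sigma ** 2 / d ** 2
n_formula_3 = ((z_half_alpha + z_beta) ** 2) * sigma ** 2 / d ** 2

power_1 = approx_power(n_formula_1, d, sigma, z_half_alpha)
power_3 = approx_power(n_formula_3, d, sigma, z_half_alpha)

print(round(power_1, 3))  # 0.5: formula (1) leaves a 50% Type II error risk
print(round(power_3, 3))  # 0.8: formula (3) delivers the requested power
```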
Example:
A study is planned to ascertain whether the true average amount of time that cars remain parked in the short term parking area at an airport on weekdays (Monday through Friday) differs from 40 minutes, the time found two years ago when a similar study was conducted. Data are obtained by drawing a random sample of time slips for cars parked in the lot between 6 AM Monday and 6 PM Friday. A true difference of at least 5 minutes in either direction is of interest; i.e., the researcher is interested in determining if the true mean is now 35 minutes or less, or at least 45 minutes. The standard deviation of parking times is estimated from previous data to be 15 minutes, so the estimated variance is 15² = 225. The confidence level for the test is set at 99%, corresponding to a Type I error risk of .01, so zα/2 = 2.576. Also, the sample is to be large enough to have a probability (power) of .90 to detect a difference of at least 5 minutes as significant. This corresponds to a Type II error risk of 1 - .90 = .10, so zβ = 1.282. Given these specifications, the required sample size is
n = (2.576 + 1.282)² (225) / (5²)
= 134 time slips.
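A minimal Python sketch of formula (3), applied to the parking example, is given here. The function name sample_size_for_mean_test is hypothetical. The one-tailed variant at the end is an added illustration: it assumes the standard normal value 2.326 for a one-tailed risk of .01, which is not quoted in the example itself.

```python
import math


def sample_size_for_mean_test(z_alpha: float, z_beta: float,
                              variance: float, d: float) -> int:
    """Formula (3): n = (z_alpha + z_beta)^2 * sigma^2 / d^2, rounded up.

    Pass z for alpha/2 for a non-directional (two-tailed) test,
    or z for alpha for a directional (one-tailed) test.
    """
    n = ((z_alpha + z_beta) ** 2) * variance / (d ** 2)
    return math.ceil(n)


# Parking example: sigma = 15 minutes, d = 5 minutes,
# 99% confidence (z = 2.576) and power .90 (z = 1.282).
n_two_tailed = sample_size_for_mean_test(2.576, 1.282, variance=15 ** 2, d=5)
print(n_two_tailed)  # 134

# Same specifications as a one-tailed test (assumed z for alpha = .01: 2.326).
n_one_tailed = sample_size_for_mean_test(2.326, 1.282, variance=15 ** 2, d=5)
print(n_one_tailed)  # 118
```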
Sample size requirements for testing whether a proportion differs from some hypothesized value:
In order to estimate the sample size needed to determine if a population proportion differs from some hypothesized value, P, formula (3) can be modified by substituting P(1 - P) for σ²:
n = (z α/2 + zβ)² P(1 - P) / (d²). (4)
(Note: for one-tailed tests, use zα instead of z α/2.)
Example:
A study is conducted to determine consumer preferences for two potential products. Respondents will try both products and be forced to state a preference. If the products are equally preferred in the population, then the proportion choosing each product would be expected to be .5. Thus, in this situation, a test of whether one product is preferred over the other is a test of whether the sample proportion, P, for either product differs from .5. The researcher wants the sample size to be large enough to ensure that if the true (population) proportion preferring one product is at least .52, he will have at least an 80% chance of obtaining a sample proportion that differs significantly from .5 with at least 90% confidence. Thus, d = .52 - .5 = .02, the Type I error risk (α) = 1 - .90 = .10 (so z α/2 = 1.645), and the Type II error risk (β) = 1 - .80 = .20 (so zβ = .842). Hence, the required number of respondents is
n = (1.645 + .842)² .5(1 - .5) / (.02²)
= 3,866.
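Formula (4) can be sketched in Python as follows. The function name sample_size_for_proportion_test is hypothetical, and the second call (d = .05) is an assumed variation added only to show how sharply the requirement falls when a larger difference is the smallest one of interest.

```python
import math


def sample_size_for_proportion_test(z_alpha: float, z_beta: float,
                                    p: float, d: float) -> int:
    """Formula (4): n = (z_alpha + z_beta)^2 * P(1 - P) / d^2, rounded up."""
    n = ((z_alpha + z_beta) ** 2) * p * (1.0 - p) / (d ** 2)
    return math.ceil(n)


# Preference example: hypothesized P = .5, d = .02,
# 90% confidence (z = 1.645) and 80% power (z = .842).
n_d02 = sample_size_for_proportion_test(1.645, 0.842, p=0.5, d=0.02)
print(n_d02)  # 3866

# Assumed variation: detecting .55 vs .5 (d = .05) under the same risks.
n_d05 = sample_size_for_proportion_test(1.645, 0.842, p=0.5, d=0.05)
print(n_d05)  # 619
```

Because d enters the formula squared, multiplying the smallest difference of interest by 2.5 cuts the requirement by a factor of about 6.25.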
Sample size requirements for testing differences between population means estimated from two independent samples:
Probably the most common applications of statistical significance tests in marketing research involve comparisons of means or proportions obtained from two or more independent samples. The two-sample case is discussed here. Multiple comparisons among means from three or more samples are beyond the scope of this report, but are addressed in Research on Research No. 14 (on analysis of variance) and No. 30 (on multiple comparisons).
In the two-sample case, the issue is whether the researcher can infer that two population means are different, based on an observed difference between two sample means. The purpose of the significance test is to assess the risks of drawing an incorrect inference — either concluding that a true (population) difference exists when it does not (a Type I error) or concluding that no true difference exists when it does (a Type II error).
For the purpose of this discussion, it will be assumed that the variances of the populations of interest are equal. (It is possible to estimate sample size requirements when the variances are unequal and known, although this circumstance rarely arises in practice.) Following the notation developed earlier, let d = the smallest difference between means that is of interest to detect, σ² = the common population variance, α = the maximum risk of a Type I error (or significance level for the test) and β = the risk of a Type II error (the risk of obtaining a non-significant sample difference if the population difference is at least d). Then, the confidence level associated with a statistically significant difference is at least 100(1 - α)% and the power of the test to detect a difference of size d or greater as significant is at least 1 - β. If the two samples are to be the same size, the size of each sample should be
n = 2 (z α/2 + z β)² (σ²) / (d²). (5)
(Note: for one-tailed tests, use z α instead of z α/2.)
Example:
Respondents are given one of two experimental varieties of dog food to be fed to their dog(s) over a two-week period, after which they are asked to rate the product tried. Product ratings are to be obtained using 7-point scales, and the variance is expected to be at most 4.0. If the researcher wants to have a power of at least .8 to detect a true difference of at least ½ of a point, using a significance level of .05 (or confidence level of 95%) for the test, then d = .5, α = .05, β = 1 - .80 = .20, z α/2 = 1.96, and z β = .842. Therefore, the sample size required for each variety is
n = 2 (1.96 + .842)² (4.0) / (.5²)
= 252.
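For the equal-sized two-sample case, formula (5) might be coded as in the sketch below. The function name sample_size_per_group is hypothetical, and the z values are entered directly as they would be read from Appendix II.

```python
import math


def sample_size_per_group(z_alpha: float, z_beta: float,
                          variance: float, d: float) -> int:
    """Formula (5): n = 2 * (z_alpha + z_beta)^2 * sigma^2 / d^2 per sample.

    Assumes two independent samples of equal size and a common variance.
    """
    n = 2.0 * ((z_alpha + z_beta) ** 2) * variance / (d ** 2)
    return math.ceil(n)


# Dog food example: variance 4.0, d = .5, alpha = .05 (z = 1.96), power .80 (z = .842).
n_each = sample_size_per_group(1.96, 0.842, variance=4.0, d=0.5)
print(n_each)      # 252 respondents per variety
print(2 * n_each)  # 504 respondents in total
```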
In some situations, it may be desirable (for reasons of cost, etc.) to draw samples that are not the same size. This is likely to be the case when the means to be compared do not correspond to experimental "treatments" but rather to groups of respondents classified according to some demographic characteristic (gender, region, income, etc.). For example, if a brewer estimates that 75% of the drinkers of his brand are men, the total sample may be drawn so that there are approximately 3 times as many men as women. This would allow "total sample" statistics to reflect the appropriate gender mix (without the necessity of weighting the data), while still permitting comparisons between men and women.
Formula (5) can be modified to yield appropriate sample sizes when the researcher can specify the ratio of the sample sizes. Let R = the ratio of the second sample size to the first, i.e., R = n2 / n1. Then, the sample sizes required for the two samples are
n1 = (z α/2 + zβ)² (σ²) [(R + 1)/R] / (d²) (6)
and n2 = R n1.
Example:
To continue with the beer example, say the researcher is interested in comparing product ratings obtained from men and women and that the total sample should contain about 75% men, so R = 3. Product evaluations are to be obtained using 6-point rating scales, with an expected variance of at most 3.0. True differences between men and women of at least ¼ of a point are of interest, and the researcher wants to have a probability of at least .9 of detecting a real difference at least this large, using a significance level of α = .10 for the tests. In this case, z α/2 = 1.645 and zβ = 1.282, and the required sample sizes are
n1 = (1.645 + 1.282)² (3.0) [(3 + 1) / 3] / (.25²)
= 548.3, or 549 women, and
n2 = 3 (548.3) = 1,645 men.
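The unequal-size calculation in formula (6) can be sketched as follows. The function name sample_sizes_with_ratio is hypothetical. As in the worked example, n2 is obtained from the unrounded n1 before rounding up; that detail is a choice made here to match the figures above.

```python
import math


def sample_sizes_with_ratio(z_alpha: float, z_beta: float, variance: float,
                            d: float, ratio: float) -> tuple[int, int]:
    """Formula (6): n1 = (z_alpha + z_beta)^2 * sigma^2 * ((R + 1) / R) / d^2,
    and n2 = R * n1, where R = n2 / n1. Both results are rounded up.
    """
    n1_exact = (((z_alpha + z_beta) ** 2) * variance
                * ((ratio + 1.0) / ratio) / (d ** 2))
    n2_exact = ratio * n1_exact
    return math.ceil(n1_exact), math.ceil(n2_exact)


# Beer example: variance 3.0, d = .25, alpha = .10 (z = 1.645),
# power .90 (z = 1.282), three men for every woman (R = 3).
women, men = sample_sizes_with_ratio(1.645, 1.282, variance=3.0, d=0.25, ratio=3.0)
print(women, men)  # 549 1645

# With R = 1 the result reduces to formula (5): equal samples of 823 each.
n1_equal, n2_equal = sample_sizes_with_ratio(1.645, 1.282, 3.0, 0.25, ratio=1.0)
print(n1_equal, n2_equal)  # 823 823
```

Comparing the two calls shows the cost of the unbalanced design: 2,194 respondents in total against 1,646 for equal samples under the same specifications.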
Sample size requirements for testing differences between proportions estimated from two independent samples:
Formulas (5) and (6) can be adapted to estimate sample sizes needed for testing the difference between proportions from two independent samples. To do this, substitute P(1 - P) for σ² where P is the expected average proportion. Again, if nothing is known about the proportions involved, P = .5 should be used.
Example:
A poll is to be conducted to gather information concerning voter preferences in an upcoming mayoral election. The sponsor of the research is interested in whether preference for a particular candidate is the same among Caucasian and non-Caucasian voters. The race involves two candidates and is believed to be very "close," so the average proportion favoring the candidate in question is set at .5. The test is to employ a significance level criterion of α = .05 and must be sensitive enough to have at least a 90% chance (power) to detect a difference between Caucasians and non-Caucasians of 2% or greater, i.e., a difference of at least 51% vs. 49%. Therefore, d = .02, z α/2 = 1.96 and zβ = 1.282. If equal numbers of Caucasians and non-Caucasians are to be polled, the required number of each is
n = 2 (1.96 + 1.282)² .5 (1 - .5) / (.02²) = 13,139.
This sample size requirement is very large because the specifications call for a high power to detect a small difference. The required sample size could be reduced by relaxing the risk of a Type I and/or Type II error.
If 60% of the registered voters are Caucasians, it may be desirable to draw the total sample accordingly. In this case, the ratio of Caucasians to non-Caucasians would be 60 / 40 = 1.5. If the rest of the specifications remain the same, the required sample sizes would be
n1 = (1.96 + 1.282)² .5(1 - .5) [(1.5 + 1) / 1.5] / (.02²)
= 10,948.5, or 10,949 non-Caucasians, and
n2 = 1.5 (10,948.5) = 16,423 Caucasians.
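Both polling calculations can be reproduced with one small Python function that substitutes P(1 - P) for the variance in formula (6); setting the ratio to 1 gives the equal-size case of formula (5). The function name two_proportion_sample_sizes is hypothetical, and the larger sample is computed from the unrounded smaller one before rounding up.

```python
import math


def two_proportion_sample_sizes(z_alpha: float, z_beta: float, p_avg: float,
                                d: float, ratio: float = 1.0) -> tuple[int, int]:
    """Formulas (5)/(6) with P(1 - P) in place of sigma^2.

    p_avg -- expected average proportion (.5 when nothing is known)
    ratio -- R = n2 / n1; R = 1 gives two samples of equal size
    """
    variance = p_avg * (1.0 - p_avg)
    n1_exact = (((z_alpha + z_beta) ** 2) * variance
                * ((ratio + 1.0) / ratio) / (d ** 2))
    return math.ceil(n1_exact), math.ceil(ratio * n1_exact)


# Equal numbers of each group: alpha = .05 (z = 1.96), power .90 (z = 1.282), d = .02.
equal_sizes = two_proportion_sample_sizes(1.96, 1.282, p_avg=0.5, d=0.02)
print(equal_sizes)  # (13139, 13139)

# Sample drawn 60/40, so R = 1.5 (n1 = non-Caucasians, n2 = Caucasians).
split_sizes = two_proportion_sample_sizes(1.96, 1.282, p_avg=0.5, d=0.02, ratio=1.5)
print(split_sizes)  # (10949, 16423)
```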
Dealing with multiple research objectives:
It is rarely the case that a research project (especially in marketing research) has only a single objective. Typically, there are a number of research questions, each requiring a separate analysis. Further, the appropriate sample size (using the formulas presented here) would most likely be different for each objective.
In such cases, there is no simple answer to the question of the appropriate sample size. However, there are a number of ways to address the issue. One way is to determine the sample size needed to address each of the objectives (or the major ones) and then use the maximum of these sample size requirements. Another option is to identify the single most "important" objective and use the sample size required to adequately address that objective. For example, if respondents are asked to indicate their likelihood of purchase of products (using, say, a 5-point scale) and also to rate the products on a collection of product-related attributes (perhaps using different rating scales), many researchers would view purchase intent as being the single most "important" rating.
One should give some consideration to the multiplicity of significance tests to be carried out. Even when there are only two products (or, in general, "groups") to be compared, comparisons may be carried out on a large number of variables. For each test there are risks, and the overall risk of reaching at least one wrong conclusion (or inference) climbs as the number of comparisons increases. This problem can be handled by specifying some acceptable level of "overall risk" across the several tests, and then dividing this risk by the number of tests. For details on this approach, see Research on Research No. 30.
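The "divide the overall risk by the number of tests" approach (often called a Bonferroni-style adjustment) can be sketched as below. The function names and the inputs (an overall risk of .10 spread over 10 tests) are illustrative assumptions; the z values are computed with Python's standard-library NormalDist rather than read from a table.

```python
from statistics import NormalDist


def per_test_risk(overall_risk: float, num_tests: int) -> float:
    """Split an acceptable overall Type I error risk evenly across the tests."""
    return overall_risk / num_tests


def z_two_tailed(alpha: float) -> float:
    """z value exceeded with probability alpha/2 (non-directional test)."""
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


overall_risk = 0.10   # assumed acceptable overall risk
num_tests = 10        # assumed number of comparisons

alpha_each = per_test_risk(overall_risk, num_tests)
print(round(alpha_each, 4))                  # 0.01
print(round(z_two_tailed(overall_risk), 3))  # 1.645 for a single test at .10
print(round(z_two_tailed(alpha_each), 3))    # 2.576 for each of the 10 tests
```

The larger z value then enters the sample size formulas in place of z α/2, so controlling the overall risk across many tests raises the sample size needed for each one.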
Finally, it is acknowledged that in many cases the required sample size is determined by more "basic" issues, such as costs or the smallest cell size likely to be encountered in planned tabulations, or simply by the "comfort" a researcher has in drawing inferences from a sample of a given size. Sample size estimates based solely on the statistical considerations discussed in this paper will sometimes be either too large to be affordable or too small to allow for detailed crosstabs. Nevertheless, such estimates do provide an alternate way to estimate appropriate sample sizes and can be used to get a better "feel" for the statistical properties (precision, power, etc.) associated with whatever sample size is ultimately used.
APPENDIX I
Guidelines for estimating variances for data obtained using rating scales
Rating scales are "doubly-bounded": on a 5-point scale, for instance, responses cannot be less than 1 or greater than 5. This constraint leads to a relationship between the mean and the variance. For example, if a sample mean is 4.6 on a 5-point scale, then there must be a large proportion of responses of "5" and it follows that the variance must be relatively small. On the other hand, if the mean is near 3.0, the variance can be potentially much greater. The nature of the relationship between the mean and the variance depends on the number of scale points and on the "shape" of the distribution of responses (e.g., approximately normal or symmetrically "clustered" around some central scale value, or skewed, or uniformly spread among the scale values). By considering the types of distribution shapes typically encountered in practice, it is possible to estimate variances for use in calculating sample size requirements for a given number of scale points.
Table 1 lists ranges of variances likely to be encountered for various numbers of scale points. The low end of the range is the approximate variance when data values tend to be concentrated around some middle point of the scale, as in a normal distribution. The high end of the range is the variance that would be obtained if responses were uniformly spread across the scale points. Although it is possible to encounter distributions with larger variances than those listed (such as distributions with modes at both ends of the scale), such data are rare.
In most cases, data obtained using rating scales tend to be more uniformly spread out than in a normal distribution. Hence, in order to arrive at conservative sample size estimates — i.e., sample sizes that are at least large enough to accomplish the stated objectives — it is advisable to use a variance estimate at or near the high end of the range listed.
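The high end of the range, the variance when responses are uniformly spread across the scale points, can be computed directly. The sketch below assumes scale points coded 1 through k and uses the standard result that a uniform distribution over those k values has variance (k² - 1) / 12; the function names are hypothetical.

```python
def uniform_scale_variance(num_points: int) -> float:
    """Variance of responses spread uniformly over the scale points 1..k."""
    return (num_points ** 2 - 1) / 12.0


def variance_from_definition(num_points: int) -> float:
    """Same quantity computed from first principles, as a cross-check."""
    values = range(1, num_points + 1)
    mean = sum(values) / num_points
    return sum((v - mean) ** 2 for v in values) / num_points


for k in (5, 6, 7, 10):
    print(k, round(uniform_scale_variance(k), 2))
# 5 2.0
# 6 2.92
# 7 4.0
# 10 8.25
```

These values are in line with the conservative variances used in the examples: 2.0 for the 5-point scale, about 3.0 for the 6-point scale, and 4.0 for the 7-point scale.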
APPENDIX II
Values of Z for various levels of risk
Table 2 lists z values associated with various risk levels. For significance tests, the Type I error risk (α) is the "significance level" used to decide whether a result is statistically significant. It is the maximum allowable risk of concluding that a population difference exists when it does not. The Type II error risk (β) is the probability of not detecting a true population difference. The power of the test is 1 - β.
To use the table for problems requiring estimation of means or proportions, find the desired confidence level in the first column and use the corresponding value of z. For significance tests, the second column gives values of α/2 for non-directional (two-tailed) tests, values of α for directional (one-tailed) tests, and values of β (the Type II error risk) in either case. The table must be used twice to obtain z values for both Type I and Type II error risks.
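When a printed table is not at hand, the same z values can be computed. The sketch below uses NormalDist from Python's standard library (available from Python 3.8); the function name z_exceeded_with_probability is hypothetical, and the risk levels shown are simply those used in the examples in this report.

```python
from statistics import NormalDist


def z_exceeded_with_probability(risk: float) -> float:
    """z value from a standard normal distribution exceeded with probability `risk`."""
    return NormalDist().inv_cdf(1.0 - risk)


# Estimation or two-tailed tests: look up alpha/2.
for confidence in (0.90, 0.95, 0.99):
    alpha = 1.0 - confidence
    z = z_exceeded_with_probability(alpha / 2.0)
    print(f"confidence {confidence:.2f}: z = {z:.3f}")
# confidence 0.90: z = 1.645
# confidence 0.95: z = 1.960
# confidence 0.99: z = 2.576

# Type II error risk: look up beta itself.
for power in (0.80, 0.90):
    beta = 1.0 - power
    print(f"power {power:.2f}: z = {z_exceeded_with_probability(beta):.3f}")
# power 0.80: z = 0.842
# power 0.90: z = 1.282
```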
