Measures of Relationship for Binary Data
Binary ("yes/no") data arise frequently in marketing research, especially in the form of multiple-response questions. Resolution of important or interesting marketing issues often requires going beyond basic tabulations of marginal response frequencies — to exploration of relationships among respondents' answers to the various questions or response alternatives. This paper describes various statistics that can be used to quantify the degree of "similarity" or "relationship" among binary items. The statistics differ with respect to how the underlying concept of "similarity" is defined, and some of them can be used as a basis for cluster analyses of the items.
Introduction
Survey respondents often are presented with questions allowing multiple responses, usually in the form of a checklist. Examples include such questions as "Which of the following brands have you used in the past year?" and "Place an 'X' next to the attributes you think describe this product." Data obtained in this manner are binary in nature: each item (brand, attribute, etc.) is either selected (a "yes" answer) or not selected (a "no").
Binary data also arise when responses are obtained via rating scales, but dichotomized for analysis purposes. For example, purchase intent ratings for various products or concepts are typically obtained using a 5-point scale, where 5 indicates "definitely would buy" and 1 indicates "definitely would not buy." Comparisons among products, however, may be based on top box or top-two box percentages or proportions, in addition to or instead of comparisons based on mean ratings.
In such instances, the responses are treated as binary: either top (top-two) box or not top (top-two) box.
Analyses of such data are often limited to determining the number or percentage of respondents who selected (answered "yes" to) each item - i.e., the focus is on the marginal distributions of the items. However, this overlooks the potentially useful information that can be obtained by examining relationships among responses to the items in the list. For example, if respondents are asked to indicate the makes of cars they would consider buying, a researcher might be interested in answers to such marketing questions as:
- Do certain makes tend to be selected (considered) together? If so, which ones?
- Among those respondents who would consider an import, what proportion would also consider a domestic car, and vice versa?
- What percentage of respondents gave the same answer for both Ford and Chevrolet, either selecting both or selecting neither?
- Among those who would consider a Ford or a Chevy, what proportion would consider both?
- How accurately can responses regarding Ford be predicted from responses regarding Chevrolet, and vice versa?
Some research questions regarding relationships among items can be addressed by analyzing the frequencies of occurrence of specific "patterns" of answers across the items. A "pattern analysis" yields groups of respondents such that all respondents within a group have the same pattern of answers. (Research on Research Number 35 illustrated a pattern analysis of credit card ownership data among 75,000 households as a way to summarize the incidence with which various cards were held, alone and in combinations.) A disadvantage of such an analysis, however, is that the number of possible response patterns (or groups) increases rapidly as the number of items increases. (For k binary items, there are 2^k possible patterns.) Consequently, individual patterns may occur with small frequencies, thus limiting their utility.
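As a rough illustration of the counting involved, the sketch below tabulates response patterns with Python's standard library. The `responses` list is hypothetical toy data (three items, six respondents) invented for this example; it is not the credit card file mentioned above.

```python
from collections import Counter

# Hypothetical answers: one tuple per respondent, one 0/1 entry per item.
responses = [
    (1, 0, 1),
    (1, 0, 1),
    (0, 0, 0),
    (1, 1, 0),
    (0, 0, 0),
    (1, 0, 1),
]

k = len(responses[0])                 # number of binary items
possible_patterns = 2 ** k            # 2^k possible response patterns
pattern_counts = Counter(responses)   # frequency of each observed pattern

print("possible patterns:", possible_patterns)
for pattern, count in pattern_counts.most_common():
    print(pattern, count)
```

Even in this tiny example only three of the eight possible patterns occur, which hints at how thinly respondents can be spread once k grows.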
The individual response patterns preserve all of the information concerning relationships among items present in the data. But the objective of a pattern analysis is to form groups of respondents, not to summarize relationships among items. The patterns tie together responses to all items simultaneously; relationships between individual items are presented only indirectly. When there are several items, considerable time and effort may be required to identify such relationships, so other methods of analysis may be better suited for this purpose. Even when pattern analysis yields useful results, analyses aimed at describing relationships among items can provide a useful, complementary summary of the data.
This report describes a number of measures of "relationship" or "similarity" among binary items. As the preceding comments suggest, the concept of "relationship" can take on a variety of meanings, depending on the nature of the question(s). Consequently, several statistics can be considered, each addressing a different facet. The statistics included here are not an exhaustive list; they were chosen on the basis of ease of interpretation, popularity, and relevance for marketing applications.
Note that the items do not have to be in a checklist format in order to make use of these measures. It is only necessary that the items be dichotomous (or be treated as such) and that selecting (answering "yes" to) one of them does not preclude the respondent from selecting others. In what follows, item "selection" should be generalized to include possession or any other behavior or attitude resulting in a positive response.
Preliminaries
Since there are only two possible responses for each item, the data for any pair of items can be displayed in a 2 x 2 table of frequencies, as shown in Table 1 (with marginal totals).

Table 1. Frequencies for a pair of binary items

                    Item 2: Yes    Item 2: No    Total
Item 1: Yes              A              B        A + B
Item 1: No               C              D        C + D
Total                  A + C          B + D        N
Here, "A" is the number of respondents who selected both items, "D" is the number selecting neither, "B" the number selecting Item 1 but not Item 2, and "C" the number selecting Item 2 but not Item 1. Thus, the total sample size for the pair of items is N = A + B + C + D. Cells A and D represent "matching" responses (i.e., responses that are the same for both items) and cells B and C represent "non-matching" responses. Every measure of similarity or relationship can be obtained from these four frequencies.
A few remarks concerning the handling of non-responses are relevant here. The sample size, N, in the table includes only respondents who answered both items. When each item requires an explicit "yes" or "no" answer (or a rating), non-responses are easy to detect. When a checklist is used, though, care should be taken to ensure that a valid answer of "no" (indicated by the absence of a check mark) can be distinguished from a failure to respond. This can be accomplished by including "none of the above" and/or "don't know" options for each question.
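One way to obtain the four cell frequencies from raw answers is sketched below. The function name `two_by_two` and the use of `None` to mark a non-response are assumptions made for illustration. Respondents missing either answer are dropped, so that N counts only those who answered both items.

```python
def two_by_two(item1, item2):
    """Return the cell frequencies (A, B, C, D) for two lists of answers.

    Each answer is 1 (selected), 0 (not selected), or None (no response).
    """
    if len(item1) != len(item2):
        raise ValueError("both items need one answer per respondent")
    a = b = c = d = 0
    for x, y in zip(item1, item2):
        if x is None or y is None:
            continue          # exclude respondents who skipped either item
        if x and y:
            a += 1            # selected both items
        elif x:
            b += 1            # Item 1 only
        elif y:
            c += 1            # Item 2 only
        else:
            d += 1            # selected neither
    return a, b, c, d


# Hypothetical answers for six respondents; two have a missing answer.
A, B, C, D = two_by_two([1, 1, 0, 0, None, 1], [1, 0, 1, 0, 1, None])
print(A, B, C, D, "N =", A + B + C + D)
```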
In any particular application, some attention should be paid to the meaning of a response in cell D. The various statistics described in this paper differ in their treatment of the D cell. For some, all four cells are weighted equally; for others, the D cell is ignored entirely. Still other statistics (not described here) weight the D cell disproportionately — either more or less than the other cells.
The issue here is whether (or how) non-selection of both items should contribute to a measure of similarity. In some cases, a reasonable argument can be made that it should not. One case is when respondents are restricted in the number of items they may select; e.g., "check the three brands you use most often" or "indicate which five attributes are most important to you in deciding which brand to buy." Here, the D cell tends to be large simply as an artifact of the instructions.
A related situation is when the number of items in the list is large relative to the typical number selected by respondents. For example, if the list consists of 100 makes and models of cars from which the respondent is to select those he would consider buying, large frequencies in the D cell are likely to occur for all or nearly all pairs of cars. As a result, indices of similarity that include the D cell will tend to indicate strong degrees of "similarity," even for cars of very different types. In fact, some statistics will indicate "perfect" agreement or similarity when no respondent selects either item, i.e., when cells A, B, and C are empty.
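The effect of a large D cell can be seen in a small numerical sketch. The counts below are hypothetical, chosen only to mimic two rarely selected items on a long checklist. The two helper functions use the matching coefficient, (A + D) / N, and the Jaccard statistic, A / (A + B + C), both defined later in this paper.

```python
def matching(a, b, c, d):
    """Proportion of matching responses; includes the D cell."""
    return (a + d) / (a + b + c + d)


def jaccard(a, b, c):
    """Joint selection among those selecting either item; ignores D."""
    return a / (a + b + c)


# Hypothetical cells: few respondents select either item, fewer select both.
A, B, C = 5, 20, 25

# Only the D cell changes; watch how each statistic responds.
for D in (50, 950, 9950):
    print(D, matching(A, B, C, D), jaccard(A, B, C))
```

As D grows, the statistic that includes D climbs toward 1 while the one that ignores D stays at .10, although the joint behavior of those who selected either item has not changed.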
There are situations, however, in which the D cell should be included in assessing similarity or relationship. An example is when respondents are asked (without restrictions) to indicate which attributes or characteristics describe particular products. Two products could be considered "similar" if respondents perceive both of them as possessing some characteristics (cell "A" responses) and not possessing other characteristics (cell "D" responses). For instance, the larger the percentage of respondents who consider neither of two cars to have good gas mileage, the more "similar" the two cars could be regarded.
Whether or not the D cell is included in the calculation of a statistic has important implications for inferences that can be drawn based on that statistic. A statistic based on all four cells can be generalized to the total population from which the sample was drawn, assuming random sample selection. However, a statistic based on a subset of the sample (which includes all statistics that ignore the D cell, among others) is generalizable only to the corresponding subset of the population. (Note that any statistic that includes N in the calculation implicitly includes D, even though D may not appear explicitly in the formula.)
Measures of Similarity or Relationship
Each of the statistics to be described will be illustrated using the credit card ownership data shown in Table 2, which were drawn from Market Facts' financial data file. Of the 4,120 respondents, 1,967 (47.7%) have a Visa card, 1,577 (38.3%) have a MasterCard, 1,070 (26%) have both, and 1,646 (40%) have neither.

Table 2. Credit card ownership (number of respondents)

                    MasterCard: Yes    MasterCard: No    Total
Visa: Yes                1,070               897         1,967
Visa: No                   507             1,646         2,153
Total                    1,577             2,543         4,120
The Matching Coefficient
The matching coefficient is simply the proportion of matching responses, (A + D) / N. The range is from 0 (no matching responses) to 1 (all matching responses). For the credit card data, the matching coefficient is (1,070 + 1,646) / 4,120 = .66, indicating that about 2/3 of the respondents are consistent in their ownership — either having both cards or neither of them.
A statistic similar to the matching coefficient but excluding the D cell in the numerator is the proportion of respondents who selected both items: A / N. For the example, this is the proportion who have both cards, or 1,070/4,120 = .26.
The base (denominator) for both of these proportions is the total number of respondents in the sample, N. Hence, these proportions may be generalized to the total population from which the sample was drawn, assuming random sample selection.
The Jaccard Statistic
The Jaccard statistic is also similar to the matching coefficient, except that the D cell is excluded from both the numerator and the denominator: Jaccard = A / (A + B + C). This is the probability that both items are selected, given that either of them is selected. The Jaccard statistic ranges from 0 (when no one selects both items) to 1 (when everyone who selects one item selects both). For the example, Jaccard = 1,070 / (1,070 + 897 + 507) = .43, so 43% of those who have either card have both of them.
Since the D cell is ignored in the calculation, the Jaccard statistic is generalizable only to the subset of the population having at least one of the cards. As such, it is not directly comparable to the matching coefficients. Which coefficient is of greater utility depends on the nature of the research question (which may, in turn, depend on the nature of the items themselves, as discussed earlier).
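The three proportions discussed so far can be reproduced directly from the four cell frequencies of the credit card example. This is a minimal sketch; the variable names are illustrative only.

```python
# Cell frequencies from the credit card example (Visa = Item 1, MasterCard = Item 2).
A, B, C, D = 1070, 897, 507, 1646
N = A + B + C + D

matching_coefficient = (A + D) / N   # both cards or neither, base = everyone
both_selected = A / N                # both cards, base = everyone
jaccard = A / (A + B + C)            # both cards, base = owners of either card

print(round(matching_coefficient, 2))
print(round(both_selected, 2))
print(round(jaccard, 2))
```

The first two share the base N; the third uses the smaller base A + B + C, which is why it is not directly comparable to the others.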
Conditional Probabilities
In general terms, a conditional probability is a proportion calculated on a subset of respondents, where the persons included in the subset are those who satisfy some condition or set of conditions. Thus, the Jaccard statistic is a type of conditional probability, the condition for inclusion being that the respondent must select at least one of the items.
Various other conditional probabilities can be obtained from data for a pair of binary items. As proportions, all conditional probabilities range from 0 to 1. Also, since a conditional probability is based on a subset of the sample, it can be generalized only to the corresponding subset of the population.
Of particular interest are the conditional probabilities of selecting one item given that another item is selected. In the example, these correspond to the proportion of Visa owners who have a MasterCard, or A / (A + B) = 1,070 / 1,967 = .54, and the proportion of MasterCard owners who have a Visa, or A / (A + C) = 1,070 / 1,577 = .68. Thus, 68% of MasterCard owners have a Visa, but only 54% of Visa owners have a MasterCard.
The potential asymmetry between these two conditional probabilities differentiates these statistics from other measures of relationship. Most other statistics are "symmetric" in nature; i.e., they yield a single value for any given pair of items. If the conditional probabilities are very disparate, then reliance on "symmetric" statistics to quantify the degree of relationship between items may be misleading. This asymmetry makes conditional probabilities especially useful for analyses of brand-switching behavior.
The conditional probabilities described above are probabilities of selecting one item given that the other item is selected. It is sometimes worthwhile, for comparative purposes, to examine probabilities of selecting an item given that the other item is not selected. In the example, these probabilities pertain to MasterCard ownership among non-owners of Visa, or C / (C + D) = 507/2,153 = .24, and Visa ownership among non-owners of MasterCard, or B / (B + D) = 897/2,543 = .35. These conditional probabilities are substantially lower than those previously calculated among Visa and MasterCard owner groups, so the probability of a respondent having one of the cards clearly depends on whether or not the other card is held.
Conditional probabilities play a key role in determining whether responses to two items are statistically independent (unrelated). Responses to two binary items are independent if the probability of selecting one of them is the same whether or not the other item is selected. In the example, MasterCard and Visa ownership would be independent if the percentage of MasterCard owners was the same for both owners and non-owners of Visa. This is clearly not the case; 54% of the Visa owners but only 24% of the non-owners have a MasterCard. The same conclusion would be reached if one compared Visa ownership among owners and non-owners of MasterCard (68% vs. 35%).
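The four conditional probabilities, and the informal independence check just described, can be sketched as follows. The variable names are illustrative; the counts are those of the credit card example.

```python
# Cell frequencies from the credit card example (Visa = Item 1, MasterCard = Item 2).
A, B, C, D = 1070, 897, 507, 1646

p_mc_given_visa = A / (A + B)       # MasterCard ownership among Visa owners
p_visa_given_mc = A / (A + C)       # Visa ownership among MasterCard owners
p_mc_given_no_visa = C / (C + D)    # MasterCard ownership among Visa non-owners
p_visa_given_no_mc = B / (B + D)    # Visa ownership among MasterCard non-owners

# Under independence each pair of differences below would be near zero.
gap_mastercard = p_mc_given_visa - p_mc_given_no_visa
gap_visa = p_visa_given_mc - p_visa_given_no_mc

print(round(p_mc_given_visa, 2), round(p_mc_given_no_visa, 2), round(gap_mastercard, 2))
print(round(p_visa_given_mc, 2), round(p_visa_given_no_mc, 2), round(gap_visa, 2))
```

This comparison is descriptive only; it shows the size of the gap but is not a significance test.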
The Cosine
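The cosine is the geometric mean of the conditional probabilities of selecting each item given that the other item was selected, i.e.,

$$\text{cosine} = \sqrt{\frac{A}{A+B}\cdot\frac{A}{A+C}} = \frac{A}{\sqrt{(A+B)(A+C)}}$$

(A geometric mean is the k-th root of the product of k quantities; in this case, k = 2, the two conditional probabilities under the radical.) For the credit card data,

$$\text{cosine} = \sqrt{(.54)(.68)} = \frac{1{,}070}{\sqrt{(1{,}967)(1{,}577)}} = .61$$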
Since the D cell is ignored in the calculation, the cosine is generalizable to the same portion of the population as the Jaccard statistic — namely, to those having at least one of the cards.
As its name suggests, the cosine has a geometric interpretation. Each item can be regarded as a "vector" in a space with as many dimensions as there are respondents. If everyone who selects one item selects both (so responses to the two items are identical), then the vectors for the two items coincide: the angle between the vectors is 0 degrees and the cosine is 1, indicating "perfect" similarity. On the other hand, if no person selects both items (so cell A is empty), then the vectors are perpendicular (90 degree angle) and the cosine is 0, indicating complete dissimilarity. (The angle between vectors is, itself, a measure of similarity or relationship, and may be easier to visualize. The angle may range from 0 to 90 degrees, with smaller angles denoting greater degrees of similarity. For the credit card data, the cosine of .61 corresponds to an angle of 53 degrees.)
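A short sketch of the cosine and its angle for the credit card data follows. It computes the cosine both as a geometric mean of the two conditional probabilities and directly from the cell frequencies; the variable names are illustrative.

```python
import math

# Cell frequencies from the credit card example (Visa = Item 1, MasterCard = Item 2).
A, B, C, D = 1070, 897, 507, 1646

p_item2_given_item1 = A / (A + B)
p_item1_given_item2 = A / (A + C)

# Geometric mean of the two conditional probabilities.
cosine = math.sqrt(p_item2_given_item1 * p_item1_given_item2)

# Algebraically identical form using the cell frequencies only.
cosine_direct = A / math.sqrt((A + B) * (A + C))

# Angle between the two item vectors, in degrees.
angle_degrees = math.degrees(math.acos(cosine))

print(round(cosine, 2), round(cosine_direct, 2), round(angle_degrees))
```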
The Pearson Correlation Coefficient
The correlation coefficient is one of the most popular measures of relationship. The correlation between two binary items (also known as a "phi coefficient") can range from -1 (when all responses are in cells B and C) to +1 (when all responses are in cells A and D).
Several formulas can be used to calculate a correlation coefficient. The expression best suited for understanding the nature of the correlation makes use of the marginal proportions and the proportion of the sample selecting both items. Let P1 and P2 indicate the marginal proportions selecting the items and let Q1 and Q2 indicate the marginal proportions not selecting the items. Also, let P12 be the proportion of the sample selecting both items (i.e., the proportion in the A cell, or A / N). In the example, P1 = 1,967 / 4,120 = .477, Q1 = .523, P2 = 1,577 / 4,120 = .383, Q2 = .617, and P12 = 1,070 / 4,120 = .260.
(Note that Q1 = 1 - P1 and Q2 = 1 - P2 since, for any item (card), selection (ownership) and non-selection (non-ownership) are mutually exclusive and exhaustive.) The correlation can be calculated as

$$r = \frac{P_{12} - P_1 P_2}{\sqrt{P_1 Q_1 P_2 Q_2}} = \frac{.260 - (.477)(.383)}{\sqrt{(.477)(.523)(.383)(.617)}} = .32$$
The product of P1 and P2 in the numerator, .477 x .383 = .183 in the example, is the proportion of the sample that would be expected to select both items (own both cards) if the two items were independent, i.e., if ownership of one card is unrelated to ownership of the second. Hence, the numerator of the correlation coefficient reflects the extent to which responses to one item depend on responses to the other. A positive correlation indicates a greater degree of joint selection (i.e., a larger value of P12) than would be the case if the items were independent; a negative correlation indicates a lesser degree of joint ownership than if the items were independent (i.e., ownership of one card reduces the likelihood of owning the other). The denominator of the correlation, which involves the product of the four marginal proportions, "standardizes" the correlation so that the absolute value cannot exceed 1.
For computational purposes, the easiest formula (eliminating calculation of any proportions) is

$$r = \frac{AD - BC}{\sqrt{(A+B)(C+D)(A+C)(B+D)}} = \frac{(1{,}070)(1{,}646) - (897)(507)}{\sqrt{(1{,}967)(2{,}153)(1{,}577)(2{,}543)}} = .32,$$

as before. The numerator in this expression is the product of frequencies in the matching cells (A and D) minus the product of frequencies in the nonmatching cells (B and C). (The quantities AD and BC are called "cross-products.") The term under the radical is the product of the row and column marginal totals. Since all four cells are included in the calculation, the correlation is generalizable to the total population from which the sample was drawn.
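The agreement between the proportion-based and frequency-based forms of the correlation can be checked numerically. The sketch below uses the credit card counts; the variable names are illustrative.

```python
import math

# Cell frequencies from the credit card example (Visa = Item 1, MasterCard = Item 2).
A, B, C, D = 1070, 897, 507, 1646
N = A + B + C + D

# Marginal and joint proportions.
p1, p2, p12 = (A + B) / N, (A + C) / N, A / N
q1, q2 = 1 - p1, 1 - p2

# Proportion of joint selection expected if the items were independent.
expected_joint = p1 * p2

# Correlation from proportions, then from cross-products and marginal totals.
phi_from_proportions = (p12 - expected_joint) / math.sqrt(p1 * q1 * p2 * q2)
phi_from_counts = (A * D - B * C) / math.sqrt((A + B) * (C + D) * (A + C) * (B + D))

print(round(expected_joint, 3))
print(round(phi_from_proportions, 3), round(phi_from_counts, 3))
```

The two forms differ only by a factor of N squared that cancels between numerator and denominator, so they agree apart from floating-point rounding.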
The statistical significance of a correlation coefficient can be assessed by converting the correlation to a Z-statistic. A statistically significant result suggests that responses to the two items are not independent; i.e., by considering respondents' answers to one item (e.g., Visa), responses to the other item (MasterCard) can be predicted with better than chance accuracy. This does not imply, however, that the relationship is "meaningful." Although the correlation in the example is relatively small, the corresponding Z value is 20.3, highly significant due to the large sample size.
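The conversion formula itself does not appear in this text. The sketch below assumes Z = r x sqrt(N), which is the square root of the Pearson chi-square statistic for a 2 x 2 table (chi-square = N x r squared) and which reproduces the reported value of 20.3. Treat that choice as an assumption rather than as the formula originally used; the variable names are illustrative.

```python
import math

# Cell frequencies from the credit card example (Visa = Item 1, MasterCard = Item 2).
A, B, C, D = 1070, 897, 507, 1646
N = A + B + C + D

# Correlation (phi coefficient) from cross-products and marginal totals.
phi = (A * D - B * C) / math.sqrt((A + B) * (C + D) * (A + C) * (B + D))

# ASSUMPTION: Z = phi * sqrt(N); the text does not state the conversion used.
z = phi * math.sqrt(N)

# Equivalent Pearson chi-square statistic (1 degree of freedom).
chi_square = N * phi ** 2

print(round(phi, 3), round(z, 1), round(chi_square, 1))
```

Because Z grows with the square root of N for a fixed correlation, a correlation of this size would be far less striking in a small sample.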
Statistics based on Odds
The concept of "odds" should be familiar to anyone who participates in lotteries or other forms of betting. The odds of selecting an item are defined as the probability that the item is selected divided by the probability that it is not selected. For Item 1, this is (A + B) / (C + D); for Item 2, it is (A + C) / (B + D). In the example, the odds of owning a Visa are 1,967 / 2,153 = .91 (or ".91 to 1"), indicating that there are about 9/10 as many Visa owners as non-owners. Similarly, the odds of owning a MasterCard are 1,577 / 2,543 = .62; there are about 6/10 as many owners as non-owners.
If two items are unrelated, one would expect the odds of selecting one item to be the same regardless of whether the other item is selected; i.e., the expectation is that A/B = C/D and A/C = B/D. The odds ratio, as its name implies, is a ratio of two odds: either
 Odds of Selecting Item 2 Given that Item 1 is Selected         A / B
------------------------------------------------------------ = -------
 Odds of Selecting Item 2 Given that Item 1 is not Selected     C / D

or

 Odds of Selecting Item 1 Given that Item 2 is Selected         A / C
------------------------------------------------------------ = -------
 Odds of Selecting Item 1 Given that Item 2 is not Selected     B / D

both of which are equal to AD/BC. (For this reason, the odds ratio is also known as the "cross-product ratio.") The odds ratio ranges from 0 to infinity; it is 0 when either of the matching cells (A and D) is empty, it is infinite when either of the non-matching cells (B and C) is empty, and it is 1 when the items are independent. Since all four cells are used in the calculation, the odds ratio (like all other statistics based on odds) is generalizable to the total population from which the sample was drawn.
For the credit card data,
Odds Ratio = (1,070)(1,646) / [(897)(507)] = 3.87,
so the odds of owning one card are almost four times as great if the other card is owned as if it is not. Among Visa owners, the odds of having a MasterCard = 1,070 / 897 = 1.2 (the number owning a MasterCard is 20% greater than the number not owning one). Among non-owners of Visa, the odds of having a MasterCard = 507 / 1,646 = .31 (the number of MasterCard owners is about 1/3 the number of non-owners). The odds ratio is the ratio of these odds, or 1.2 / .31 = 3.87. The same value would be obtained if one used the odds of Visa ownership among owners and non-owners of MasterCard (2.11 and .545, respectively; 2.11 / .545 = 3.87).
Every statistic described so far has a lower and/or upper bound. For some analyses (particularly "modeling" of multi-way frequency tables), these bounds represent constraints, and are difficult to handle in linear models. For this reason, the logarithm of the odds ratio (the "log odds") is commonly used in such analyses; it has no upper or lower bound and always has the same sign as the correlation coefficient. (Recall that the correlation is also a function of the cross-products.) In the example, the log odds is 1.35. The log odds can also be calculated as log(A) + log(D) - log(B) - log(C), i.e., the sum of the logs of the frequencies in the matching cells minus the sum of the logs of the frequencies in the non-matching cells. Thus, the log odds converts a multiplicative relationship among the cells (in the form of the odds ratio) to an additive one, which allows the use of linear models to explore differences among levels of classification variables in multi-way frequency tables.
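The odds, the odds ratio, and the log odds for the credit card data can be reproduced with the sketch below. Natural logarithms are used, which is what yields the value of 1.35 quoted above; the variable names are illustrative.

```python
import math

# Cell frequencies from the credit card example (Visa = Item 1, MasterCard = Item 2).
A, B, C, D = 1070, 897, 507, 1646

# Marginal odds of owning each card.
odds_visa = (A + B) / (C + D)
odds_mastercard = (A + C) / (B + D)

# Odds ratio two ways: ratio of conditional odds, and cross-product ratio.
odds_ratio = (A / B) / (C / D)
cross_product_ratio = (A * D) / (B * C)

# Log odds two ways: log of the ratio, and sum/difference of cell logs.
log_odds = math.log(odds_ratio)
log_odds_additive = math.log(A) + math.log(D) - math.log(B) - math.log(C)

print(round(odds_visa, 2), round(odds_mastercard, 2))
print(round(odds_ratio, 2), round(log_odds, 2))
```

If any cell is empty, the odds ratio is 0 or infinite and the logarithm is undefined, so this sketch applies only when all four frequencies are positive.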
Another statistic related to the odds ratio, but with both a lower and an upper bound, is gamma (also known as Yule's Q). Gamma ranges from -1 (corresponding to an odds ratio of 0) to +1 (corresponding to an infinite odds ratio). Gamma is 0 when the odds ratio is 1, indicating that responses to the two items are independent. Gamma also has an additional interpretation: it indicates how much more probable (if positive) or less probable (if negative) it is to obtain a response in each of the matching cells (A and D) than a response in each of the non-matching cells (B and C) when two individuals are selected at random.
Gamma can be computed as
(AD - BC) / (AD + BC) or as
(odds ratio - 1) / (odds ratio + 1).
For the example data, gamma = (3.87 - 1) / (3.87 + 1) = .59.
This indicates that if two respondents were randomly selected from the population, the probability of obtaining one with both cards and one with neither is about 6/10 greater than the probability of obtaining one respondent with a Visa only and another with a MasterCard only.
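Both expressions for gamma can be checked against the credit card data with the following sketch; the variable names are illustrative.

```python
# Cell frequencies from the credit card example (Visa = Item 1, MasterCard = Item 2).
A, B, C, D = 1070, 897, 507, 1646

# Gamma (Yule's Q) from the cross-products.
gamma_from_cells = (A * D - B * C) / (A * D + B * C)

# The same quantity expressed through the odds ratio.
odds_ratio = (A * D) / (B * C)
gamma_from_odds_ratio = (odds_ratio - 1) / (odds_ratio + 1)

print(round(gamma_from_cells, 2), round(gamma_from_odds_ratio, 2))
```

The cross-product form is the safer one to compute with, since it remains defined when cell B or C is empty and the odds ratio is infinite.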
Beyond Quantifying "Similarity" — Clustering Items
A number of statistics have been described which can be used to quantify the degree to which responses to two dichotomous items are "similar" or "related." Since the statistics described address different facets of the concept of "similarity," some or all of them may be useful in any particular application.
Some statistics can be transformed to yield distances between items — the matching coefficient, Jaccard statistic, cosine and correlation, among those described here. These statistics can be used as a basis for clustering the items. When the number of items is large, a cluster analysis may be not only desirable, but necessary, to reduce the volume of information to a manageable size.
Clusters are usually (but not necessarily) formed sequentially, or hierarchically, beginning with each item in a separate cluster and ending with all items in a single cluster. The hope, of course, is that one can identify one or more "reasonable" (useful and intuitively appealing) groups of items somewhere between these extremes. At each step in the process, the two items or existing clusters which are "closest" (between which the distance is smallest) are merged. Naturally, different clusters may emerge when different distance measures (corresponding to different definitions of "similarity") are used.
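The merging process described above can be sketched in a few lines. Several choices here are assumptions made for illustration and are not prescribed by this paper: distance is taken as 1 minus the Jaccard statistic, the distance between clusters is the smallest item-to-item distance ("single linkage"), and the four car makes with their respondent sets are hypothetical toy data.

```python
from itertools import combinations

# Hypothetical checklist data: item -> set of respondent ids who selected it.
selected_by = {
    "Ford": {1, 2, 3, 4},
    "Chevrolet": {1, 2, 3, 5},
    "Toyota": {6, 7, 8},
    "Honda": {6, 7, 9},
}


def jaccard_distance(x, y):
    """1 - A / (A + B + C), where x and y are sets of selecting respondents."""
    union = len(x | y)
    return 1.0 - len(x & y) / union if union else 0.0


def single_linkage(c1, c2):
    """Assumed cluster distance: the smallest distance between member items."""
    return min(jaccard_distance(selected_by[i], selected_by[j])
               for i in c1 for j in c2)


# Start with every item in its own cluster, then merge the closest pair
# repeatedly until a single cluster remains.
clusters = [frozenset([item]) for item in selected_by]
merges = []
while len(clusters) > 1:
    c1, c2 = min(combinations(clusters, 2), key=lambda pair: single_linkage(*pair))
    merges.append((sorted(c1), sorted(c2), round(single_linkage(c1, c2), 2)))
    clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]

for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

In this toy example the two domestic makes merge first and the two imports second, and the final merge occurs at the maximum distance of 1 because no respondent selected a make from both groups. A different distance measure or linkage rule could produce a different merge order.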
