Graphic displays of data: Box and Whisker plots

Box and whisker plots provide a valuable graphic means of summarizing and displaying data. They are particularly useful for comparing the central tendency, variability, and shape of distributions of responses from several groups of individuals or on several variables. The interpretation and uses of box and whisker plots are described and their strengths and weaknesses compared to other types of data summaries are outlined.


Introduction:

It is essential to have efficient methods of summarizing and displaying large quantities of data. The most common types of data summaries are cross-tabulations and various descriptive statistics, such as means, medians, and standard deviations. However, examination of frequencies and descriptive statistics can be very cumbersome and time-consuming if one is interested in comparing responses across several variables and/or groups of respondents. In such cases, graphic summaries of data can be extremely helpful.

A "box and whisker plot" is one of the most useful graphic displays. Such a plot conveys a great deal of information about the central tendency, variability, and shape of a distribution of responses. When placed side-by-side, box and whisker plots can be used effectively to compare responses across several variables and/or groups.


Components of a Box and Whisker Plot:

The elements of a box and whisker plot are illustrated with an example. A frequency distribution of 100 responses obtained using a 10-point scale is shown in Table One, along with some summary statistics. A box and whisker plot is shown in Figure One.

The vertical axis identifies the scale of the plot. In the plot itself, the box extends from the 25th percentile to the 75th percentile and thus includes the middle 50% of the responses. The length of the box is equal to a measure of variability known as the interquartile range. For normally distributed data, the interquartile range is approximately 1 1/3 times the standard deviation (more precisely, about 1.35 times). In this example, the box extends from 4 through 6, indicating that the middle 50% of the responses fall in a two-point range.
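The relationship between the interquartile range and the standard deviation for normal data can be checked directly. The following is a minimal sketch using Python's standard library; the function name is chosen here for illustration only.

```python
from statistics import NormalDist


def normal_iqr_in_sd_units() -> float:
    """Interquartile range of a normal distribution, in standard deviations."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    # Distance between the 75th and 25th percentiles.
    return z.inv_cdf(0.75) - z.inv_cdf(0.25)


ratio = normal_iqr_in_sd_units()
print(f"IQR / SD for normally distributed data: {ratio:.3f}")
```

The ratio is close to, though slightly larger than, 1 1/3.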



Within the box, the "+" indicates the mean (5.4 in this case) and the line which divides the box identifies the median, or 50th percentile (5 in this case). The vertical lines extending from the box are called "whiskers." Each whisker extends either the length of the box (2 scale points in this case) or to the most extreme observation in that direction, whichever distance is less.
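The elements described so far can be computed as in the sketch below. It follows the whisker rule stated in this article (one box length, or the most extreme observation if that is closer); many statistical packages use a different multiple of the box length by default. The 16 responses are hypothetical, chosen only so that the result resembles the example, and the percentile convention (the smallest value whose cumulative frequency reaches the stated percentage) is likewise an assumption.

```python
from statistics import mean


def percentile(values, pct):
    """Smallest observed value whose cumulative frequency reaches pct percent."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def box_and_whiskers(values):
    """Elements of the plot, using this article's one-box-length whisker rule."""
    q1, median, q3 = (percentile(values, p) for p in (25, 50, 75))
    box_length = q3 - q1  # the interquartile range
    return {
        "mean": mean(values),
        "median": median,
        "box": (q1, q3),
        "box_length": box_length,
        # Each whisker stops at one box length or at the extreme observation,
        # whichever is closer to the box.
        "lower_whisker_end": max(q1 - box_length, min(values)),
        "upper_whisker_end": min(q3 + box_length, max(values)),
    }


# Hypothetical responses on a 10-point scale (not the data of Table One).
responses = [1, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 9, 10]
summary = box_and_whiskers(responses)
print(summary)
```

With these hypothetical responses the box runs from 4 to 6, the median is 5, and the whiskers end at 2 and 8, so the responses of 1, 9, and 10 fall beyond the whiskers.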

Data points denoted by "0" or "*" are relatively extreme responses, given the amount of variability in the data. These responses are often referred to as "outliers." For data that are normally distributed, about 1 in 20 observations would be classified as outliers (either "0" or "*"). A response denoted by an asterisk would occur about once in 200 observations if the data were normally distributed. In this example, responses of 1 and 9 are represented by a "0" and a response of 10 by an asterisk. Note that the plot reveals the location of these outliers but not the number of responses at each of these values.
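The "about 1 in 20" figure can be checked against the whisker rule given above. The sketch below assumes that an outlier is any response lying more than one box length beyond the end of the box. The separate cutoff for an asterisk is not stated in the text, so it is not modeled here.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution

q3 = z.inv_cdf(0.75)          # upper end of the box
box_length = 2 * q3           # the box is symmetric about zero
whisker_end = q3 + box_length  # one box length beyond the box

# Share of normally distributed responses beyond either whisker.
share_outside = 2 * (1 - z.cdf(whisker_end))
print(f"share beyond the whiskers: {share_outside:.3f}")
```

Under this assumption the share is a little over 4 percent, which is in line with the rough figure of 1 in 20.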

Thus, a box and whisker plot conveys information concerning central tendency (the mean and the median) and variability (the range and the interquartile range). One can also assess the degree of skewness (or asymmetry) in the data by examining the relative position of the mean and median, by comparing the lengths of the whiskers, and by noting the location and number of outliers at each end of the scale.


Interpreting Unusual Box and Whisker Plots:

Departures from a normal distribution alter the appearance of the plot. To illustrate this, frequency distributions and descriptive statistics are shown in Table Two, histograms in Figure Two, and box and whisker plots in Figure Three for examples of four types of nonnormal distributions: (A) skewed, (B) peaked, (C) flat, and (D) bimodal. Each of the data sets includes 100 responses on a 5-point scale.
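The descriptive statistics in Table Two can be reproduced from the frequencies, as in the sketch below. Two assumptions are made here: the second frequency for example A is taken as 8, the value under which the column totals the stated 100 responses, and a percentile is taken to be the smallest response whose cumulative frequency reaches the stated percentage. The standard deviation is the sample (n - 1) form.

```python
from statistics import mean, stdev

SCALE = (1, 2, 3, 4, 5)

# Frequencies for responses 1 through 5; the 8 in example A is an assumption.
FREQUENCIES = {
    "A skewed": (2, 8, 17, 20, 53),
    "B peaked": (4, 15, 60, 12, 9),
    "C flat": (20, 24, 21, 19, 16),
    "D bimodal": (32, 17, 10, 14, 27),
}


def expand(freqs):
    """Turn a frequency distribution into a sorted list of responses."""
    return [value for value, count in zip(SCALE, freqs) for _ in range(count)]


def percentile(ordered, pct):
    """Smallest value whose cumulative frequency reaches pct percent."""
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def summarize(freqs):
    data = expand(freqs)
    return {
        "n": len(data),
        "mean": round(mean(data), 2),
        "p25": percentile(data, 25),
        "median": percentile(data, 50),
        "p75": percentile(data, 75),
        "sd": round(stdev(data), 2),  # sample standard deviation
    }


summaries = {name: summarize(freqs) for name, freqs in FREQUENCIES.items()}
for name, stats in summaries.items():
    print(name, stats)
```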

When the data are skewed (example A), the median may fall at or near one end of the box and the whiskers may be unequal in length. In the example shown, the skewness is so marked that the median is the highest value (5, at the top of the box), causing the upper whisker to vanish. Note that when the median is equal to the value at the 25th or 75th percentile (i.e., when the median is at the end of the box), the symbol for the median is printed.

When the data are peaked (B), the box and whiskers are relatively short and in extreme cases may disappear entirely. In the example shown, the middle 50% of the responses all have the same value (3), so the length of the box and whiskers is zero.

On the other hand, a flat (or uniform) distribution is characterized by a relatively long box, as in (C). In many such cases, including this example, the whiskers will be shorter than the box.

The bimodal distribution in (D) also results in a long box. In this example, the middle 50% of the responses range from 1 to 5 (i.e., the entire range of the scale), so the whiskers vanish. Distributions of this type can result when responses of two sub-groups of individuals are combined.
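The box and whisker lengths described for the four examples follow from the 25th and 75th percentiles in Table Two. The sketch below applies the whisker rule given earlier (one box length, or the distance to the most extreme response if that is less), and assumes that responses occur at both ends of the 5-point scale in every example, as the frequencies indicate.

```python
# (25th percentile, 75th percentile) for each example, from Table Two.
BOXES = {
    "A skewed": (3, 5),
    "B peaked": (3, 3),
    "C flat": (2, 4),
    "D bimodal": (1, 5),
}
LOWEST, HIGHEST = 1, 5  # extreme responses present in every example


def lengths(q1, q3, lowest=LOWEST, highest=HIGHEST):
    """Return (box length, lower whisker length, upper whisker length)."""
    box = q3 - q1
    lower = min(box, q1 - lowest)    # stop at the extreme response if closer
    upper = min(box, highest - q3)
    return box, lower, upper


geometry = {name: lengths(q1, q3) for name, (q1, q3) in BOXES.items()}
for name, (box, lower, upper) in geometry.items():
    print(f"{name}: box {box}, lower whisker {lower}, upper whisker {upper}")
```

The results agree with the descriptions above: the upper whisker vanishes in A, the box and both whiskers vanish in B, the whiskers are shorter than the box in C, and both whiskers vanish in D.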

Most types of nonnormality have predictable effects on box and whisker plots. However, it is possible for distributions with different shapes to have similar plots. For example, if the modes in example D had been at 2 and 4, the plot would have appeared like that for example C.
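This point can be illustrated with a hypothetical bimodal distribution whose modes are at 2 and 4. The frequencies for the hypothetical case are invented for this sketch and do not come from the tables. The flat distribution is example C, and the percentile convention is the same assumption used elsewhere (the smallest response whose cumulative frequency reaches the stated percentage).

```python
def percentile(ordered, pct):
    """Smallest value whose cumulative frequency reaches pct percent."""
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def plot_outline(freqs, scale=(1, 2, 3, 4, 5)):
    """Return (lower whisker end, 25th pct, median, 75th pct, upper whisker end)."""
    data = [value for value, count in zip(scale, freqs) for _ in range(count)]
    q1, median, q3 = (percentile(data, p) for p in (25, 50, 75))
    box = q3 - q1
    # Whiskers: one box length, or the extreme response if that is closer.
    return (max(q1 - box, data[0]), q1, median, q3, min(q3 + box, data[-1]))


flat_c = plot_outline((20, 24, 21, 19, 16))        # example C
bimodal_2_4 = plot_outline((10, 32, 16, 32, 10))   # hypothetical, modes at 2 and 4

print("flat:   ", flat_c)
print("bimodal:", bimodal_2_4)
```

Both distributions produce the same box and the same whiskers, although their shapes differ considerably.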


Uses of Box and Whisker Plots:

Since they contain so much information, box and whisker plots have several applications. They can be used to graphically present descriptive information about the data: means, medians, measures of variability (the range and interquartile range), and skewness. They can also aid in assessing the normality of the data, although other graphic devices (such as probability plots and FUNOP plots) are better suited for this purpose.

Side-by-side box and whisker plots are excellent tools for comparing responses to various questions or responses of several groups of respondents. One can efficiently compare not only the average response, but also variability from one variable or group of respondents to another.

As an example, Figure Four depicts side-by-side plots of ratings of prices of five products. Brand E is clearly rated highest, followed in order by C, A, D, and B. Also, there is less variability in the Brand E data than in the data for the other brands.

Side-by-side plots can also be used to examine relationships between two quantitative variables. For example, respondents could be asked to rate the overall quality of a product and their likelihood of purchase. Separate plots of likelihood of purchase could be constructed for respondents at each level of quality and placed side by side. This would reveal how likelihood of purchase changes as the rating of quality increases as well as the variability in likelihood of purchase when perceived quality is held constant.
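The grouping step behind such side-by-side plots might be sketched as follows. The ratings are hypothetical and the names are chosen for illustration. Each group's 25th percentile, median, and 75th percentile would define one box in the display. The percentile convention is the same assumption as before.

```python
from collections import defaultdict

# Hypothetical (quality rating, likelihood of purchase) pairs.
RESPONSES = [
    (1, 2), (3, 4), (2, 3), (1, 1), (3, 5), (2, 1), (1, 4), (2, 5),
    (3, 3), (1, 2), (2, 3), (3, 4), (1, 1), (2, 2), (3, 5),
]


def percentile(values, pct):
    """Smallest value whose cumulative frequency reaches pct percent."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def quartiles_by_group(pairs):
    """Map each quality level to (25th pct, median, 75th pct) of likelihood."""
    groups = defaultdict(list)
    for quality, likelihood in pairs:
        groups[quality].append(likelihood)
    return {
        quality: tuple(percentile(values, p) for p in (25, 50, 75))
        for quality, values in sorted(groups.items())
    }


by_quality = quartiles_by_group(RESPONSES)
for quality, (q1, median, q3) in by_quality.items():
    print(f"quality {quality}: box {q1} to {q3}, median {median}")
```

In these hypothetical data the median likelihood of purchase rises with each level of perceived quality, which is the kind of pattern the side-by-side display is intended to reveal.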


Box and Whisker Plots Vs. Other Types of Summaries:

Box and whisker plots are effective supplements to cross-tabulations and descriptive statistics. As illustrated in the preceding section, they are particularly useful for comparing distributions of responses. They provide more information than frequency tables and histograms (bar charts), since the mean, median (50th percentile), and 25th and 75th percentiles are clearly identified. Further, unlike descriptive statistics, box and whisker plots reveal the values of relatively extreme data points (outliers).

Box and whisker plots are not without limitations. First, distributions with different shapes can have similar plots. Second, the number of outliers is not shown. Both of these limitations can be overcome by examining frequency distributions associated with unusual plots or by utilizing other graphic methods more sensitive to distributional assumptions (such as probability plots).


Table Two
Examples of nonnormal distributions

                                    Frequencies
Response Alternatives    A Skewed    B Peaked    C Flat    D Bimodal
1                            2           4         20         32
2                            8          15         24         17
3                           17          60         21         10
4                           20          12         19         14
5                           53           9         16         27

Mean                      4.14        3.07       2.87       2.87
25th Percentile              3           3          2          1
Median                       5           3          3          3
75th Percentile              5           3          4          5
Standard Deviation        1.09        0.89       1.37       1.64




In summary, box and whisker plots are quite effective tools for displaying data in an efficient format. Such plots can greatly ease the burden of examining large quantities of data and of comparing responses across several variables or groups of respondents.