Chapter 13 Advice / Best Practices

This section contains some thoughts on what makes a good data visualization. Most come from books and posts that others have written, but I’ll take responsibility for putting them here.

13.1 Labeling

Everything on your graph should be labeled, including the following:

  • title - a clear short title letting the reader know what they’re looking at
    • Relationship between experience and wages by gender
  • subtitle - an optional second (smaller font) title giving additional information
    • Years 2016-2018
  • caption - source attribution for the data
    • source: US Department of Labor - www.bls.gov/bls/blswage.htm
  • axis labels - clear labels for the x and y axes
    • short but descriptive
    • include units of measurement
      • Engine displacement (cu. in.)
      • Survival time (days)
      • Patient age (years)
  • legend - short informative title and labels
    • Male and Female - not 0 and 1 !!
  • lines and bars - label any trend lines, annotation lines, and error bars

Basically, the reader should be able to understand your graph without having to wade through paragraphs of text. When in doubt, show your data visualization to someone who has not read your article or poster and ask them if anything is unclear.
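As a rough sketch of how the labels listed above map onto ggplot2, the example below uses the built-in mtcars data rather than any dataset from this chapter. The title and subtitle wording and the object names (cars, p_labeled) are illustrative assumptions, not a prescribed template.

```r
library(ggplot2)

# recode the 0/1 transmission indicator so the legend reads in words
# (in mtcars, am = 0 is automatic and am = 1 is manual)
cars <- mtcars
cars$transmission <- factor(cars$am,
                            levels = c(0, 1),
                            labels = c("Automatic", "Manual"))

# every element gets a label: title, subtitle, caption, axes, legend
p_labeled <- ggplot(cars,
                    aes(x = disp, y = mpg, color = transmission)) +
  geom_point(size = 2) +
  labs(title = "Fuel efficiency by engine displacement",
       subtitle = "32 cars, 1973-74 models",
       caption = "source: 1974 Motor Trend US magazine (mtcars dataset)",
       x = "Engine displacement (cu. in.)",
       y = "Fuel efficiency (miles per gallon)",
       color = "Transmission")

p_labeled
```

Recoding the indicator before plotting means the legend shows Automatic and Manual instead of 0 and 1, and the units sit in the axis labels where the reader will look for them.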

13.2 Signal to noise ratio

In data science, the goal of data visualization is to communicate information. Anything that doesn’t support this goal should be reduced or eliminated.

Chart Junk - visual elements of charts that aren’t necessary to comprehend the information represented by the chart or that distract from this information. (Wikipedia)

Consider the following graph. The goal is to compare the calories in bacon to the other four foods.

(Disclaimer: I got this graph from somewhere, but I can’t remember where. If you know, please tell me, so that I can make a proper attribution. Also bacon always wins.)

Graph with chart junk

If the goal is to compare the calories in bacon to other foods, much of this visualization is unnecessary and distracts from the task.

Think of all the things that are superfluous:

  • the tan background border
  • the grey background color
  • the 3-D effect on the bars
  • the legend (it doesn’t add anything, the bars are already labeled)
  • the colors of bars (they don’t signify anything)

Here is an alternative.

Graph with chart junk removed

The chart junk has been removed. In addition:

  • the x-axis label isn’t needed - these are obviously foods
  • the y-axis is given a better label
  • the title has been simplified (the word different is redundant)
  • the bacon bar is the only colored bar - it makes comparisons easier
  • the grid lines have been made lighter (gray rather than black) so they don’t distract

I may have gone a bit far leaving out the x-axis label. It’s a fine line, knowing when to stop simplifying.

In general, you want to reduce chart junk to a minimum. In other words, more signal, less noise.
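The same decluttering ideas can be sketched in ggplot2. This is only one possible approach, and it uses the mpg data that ships with ggplot2 instead of the calorie data behind the graphs above (which is not available here). The choice of "compact" as the highlighted bar, the colors, and the names plotdata and p_clean are arbitrary assumptions.

```r
library(dplyr)
library(ggplot2)

# mean highway mileage by vehicle class, flagging the one bar to emphasize
plotdata <- mpg %>%
  group_by(class) %>%
  summarize(mean_hwy = mean(hwy)) %>%
  mutate(highlight = class == "compact")

p_clean <- ggplot(plotdata,
                  aes(x = reorder(class, mean_hwy),
                      y = mean_hwy,
                      fill = highlight)) +
  geom_col() +
  # one colored bar, the rest grey; no legend, since the bars are labeled
  scale_fill_manual(values = c("TRUE" = "steelblue", "FALSE" = "grey70"),
                    guide = "none") +
  labs(title = "Highway mileage by vehicle class",
       x = NULL,
       y = "Mean highway mileage (miles per gallon)") +
  # plain background with light horizontal grid lines only
  theme_minimal() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_line(color = "grey90"))

p_clean
```

Setting x = NULL drops the x-axis label, which is the same judgment call discussed above; restore it if your audience might not recognize the categories.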

13.3 Color choice

Color choice is about more than aesthetics. Choose colors that help convey the information contained in the plot.

The article How to Pick the Perfect Color Combination for Your Data Visualization is a great place to start.

Basically, think about selecting among sequential, diverging, and qualitative color schemes:

  • sequential - for plotting a quantitative variable that goes from low to high
  • diverging - for contrasting the low and high extremes of a quantitative variable on either side of a meaningful midpoint
  • qualitative - for distinguishing among the levels of a categorical variable

The article above can help you to choose among these schemes. Additionally, the RColorBrewer package provides palettes categorized in this way.
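A minimal sketch of how the three scheme types look in RColorBrewer follows. The particular palettes (Blues, RdBu, Set2) are just examples picked for illustration, and the plot uses the mpg data from ggplot2 rather than data from this chapter.

```r
library(RColorBrewer)
library(ggplot2)

# one palette of each type, five colors apiece
seq_pal  <- brewer.pal(5, "Blues")   # sequential
div_pal  <- brewer.pal(5, "RdBu")    # diverging
qual_pal <- brewer.pal(5, "Set2")    # qualitative

# brewer.pal.info records each palette's category ("seq", "div", "qual")
brewer.pal.info[c("Blues", "RdBu", "Set2"), ]

# a qualitative palette applied to a categorical variable
p_color <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_brewer(palette = "Set2") +
  labs(x = "Engine displacement (liters)",
       y = "Highway mileage (miles per gallon)",
       color = "Drive train")

p_color
```

The brewer.pal.info table also has a colorblind column, which can help when screening palettes for accessibility.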

Other things to keep in mind:

  • Make sure that text is legible - avoid dark text on dark backgrounds, light text on light backgrounds, and colors that clash in a discordant fashion (i.e. they hurt to look at!)
  • Avoid combinations of red and green - it can be difficult for a colorblind audience to distinguish these colors

Other helpful resources are Practical Rules for Using Color in Charts and Expert Color Choices for Presenting Data.

13.4 y-Axis scaling

OK, this is a big one. You can make an effect seem massive or insignificant depending on how you scale a numeric y-axis.

Consider the following example comparing the 9-month salaries of male and female assistant professors. The data come from the Academic Salaries dataset.

# load data
data(Salaries, package="carData")

# get means, standard deviations, and
# 95% confidence intervals for
# assistant professor salary by sex 
library(dplyr)
df <- Salaries %>%
  filter(rank == "AsstProf") %>%
  group_by(sex) %>%
  summarize(n = n(),
            mean = mean(salary), 
            sd = sd(salary),
            se = sd / sqrt(n),
            ci = qt(0.975, df = n - 1) * se)

df
## # A tibble: 2 x 6
##   sex        n   mean    sd    se    ci
##   <fct>  <int>  <dbl> <dbl> <dbl> <dbl>
## 1 Female    11 78050. 9372. 2826. 6296.
## 2 Male      56 81311. 7901. 1056. 2116.

# create and save the plot
# (y-axis limits are set separately in each plot below)
library(ggplot2)
p <- ggplot(df, 
            aes(x = sex, y = mean, group = 1)) +
  geom_point(size = 4) +
  geom_line() +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "Mean salary differences by gender",
       subtitle = "9-mo academic salary in 2007-2008",
       caption = paste("source: Fox J. and Weisberg, S. (2011)",
                       "An R Companion to Applied Regression,", 
                       "Second Edition Sage"),
       x = "Gender",
       y = "Salary")

First, let’s plot this with a y-axis going from 77,000 to 82,000.

# plot in a narrow range of y
# (this scale replaces the one in p, so restate the dollar labels)
p + scale_y_continuous(limits = c(77000, 82000), labels = scales::dollar)

Figure 13.1: Plot with limited range of Y

There appears to be a very large gender difference.

Next, let’s plot the same data with the y-axis going from 0 to 125,000.

# plot in a wide range of y
# (this scale replaces the one in p, so restate the dollar labels)
p + scale_y_continuous(limits = c(0, 125000), labels = scales::dollar)

Figure 13.2: Plot with wide range of Y

There doesn’t appear to be any gender difference!

The goal of ethical data visualization is to represent findings with as little distortion as possible. This means choosing an appropriate range for the y-axis. Bar charts should almost always start at y = 0. For other charts, the limits depend on subject-matter knowledge of the expected range of values.
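To see why the two plots above leave such different impressions, it can help to work out how much of the axis the gap actually occupies. The sketch below copies the rounded group means from the summary table above; the helper share_of_axis is a hypothetical name introduced here for illustration.

```r
# group means copied (rounded) from the summary table above
means <- c(Female = 78050, Male = 81311)

# difference in dollars, and relative to the female mean
gap <- unname(means["Male"] - means["Female"])
pct_gap <- 100 * gap / means[["Female"]]

# percentage of the axis height taken up by the gap for given limits
share_of_axis <- function(gap, limits) 100 * gap / diff(limits)

narrow <- share_of_axis(gap, c(77000, 82000))
wide   <- share_of_axis(gap, c(0, 125000))

round(c(gap = gap, pct_gap = pct_gap, narrow = narrow, wide = wide), 1)
```

The gap is $3,261, or about 4% of the female mean. On the narrow axis it fills roughly 65% of the plot height, while on the wide axis it fills under 3%. Neither picture changes the data, only the impression.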

We can also improve the graph by adding in an indicator of the uncertainty (see the section on Mean/SE plots).

# plot with confidence limits
p +  geom_errorbar(aes(ymin = mean - ci, 
                       ymax = mean + ci), 
                       width = .1) +
  ggplot2::annotate("text", 
           label = "I-bars are 95% \nconfidence intervals", 
           x=2, 
           y=73500,
           fontface = "italic",
           size = 3)

Figure 13.3: Plot with error bars

The difference doesn’t appear to exceed chance variation.
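The visual impression can be checked against the numbers. The sketch below rebuilds the two intervals from the rounded values in the summary table above and tests whether they overlap. Overlapping intervals are only a rough guide, not a formal test, so an optional Welch t-test is included as one possible follow-up (an addition here, not part of the original analysis); it runs only if the carData package is installed.

```r
# means and 95% CI half-widths copied (rounded) from the summary table above
est <- data.frame(sex  = c("Female", "Male"),
                  mean = c(78050, 81311),
                  ci   = c(6296, 2116))
est$lower <- est$mean - est$ci
est$upper <- est$mean + est$ci
est

# two intervals overlap when each one starts before the other ends
overlap <- est$lower[1] <= est$upper[2] && est$lower[2] <= est$upper[1]
overlap

# optional, more formal check on the raw data
if (requireNamespace("carData", quietly = TRUE)) {
  data(Salaries, package = "carData")
  asst <- subset(Salaries, rank == "AsstProf")
  print(t.test(salary ~ sex, data = asst))  # Welch two-sample t-test
}
```

Here the male mean falls inside the female interval (71,754 to 84,346), which matches what the error bars show.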

13.5 Attribution

Unless it’s your data, each graphic should come with an attribution - a note directing the reader to the source of the data. This will usually appear in the caption for the graph.

13.6 Going further

If you would like to learn more about ggplot2 there are several good sources, including

If you would like to learn more about data visualization in general, here are some useful resources.

13.7 Final Note

With the growth (or should I say deluge?) of readily available data, the field of data visualization is exploding. This explosion is supported by the availability of exciting new graphical tools. It’s a great time to learn and explore. Enjoy!