Chapter 13 Advice / Best Practices
This section contains some thoughts on what makes a good data visualization. Most come from books and posts that others have written, but I’ll take responsibility for putting them here.
13.1 Labeling
Everything on your graph should be labeled, including the
- title - a clear short title letting the reader know what they’re looking at
  - Relationship between experience and wages by gender
- subtitle - an optional second (smaller font) title giving additional information
  - Years 2016-2018
- caption - source attribution for the data
  - source: US Department of Labor - www.bls.gov/bls/blswage.htm
- axis labels - clear labels for the x and y axes
  - short but descriptive
  - include units of measurement
    - Engine displacement (cu. in.)
    - Survival time (days)
    - Patient age (years)
- legend - short informative title and labels
  - Male and Female - not 0 and 1 !!
- lines and bars - label any trend lines, annotation lines, and error bars
Basically, the reader should be able to understand your graph without having to wade through paragraphs of text. When in doubt, show your data visualization to someone who has not read your article or poster and ask them if anything is unclear.
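Here is a minimal sketch of these labeling ideas in ggplot2. It is only an illustration: it uses the mtcars dataset that ships with R (my choice, not data from this chapter), and the object names car_data and labeled are made up for the example.

```r
library(ggplot2)

# mtcars ships with R; recode am so the legend shows words, not 0 and 1
car_data <- mtcars
car_data$transmission <- factor(car_data$am,
                                levels = c(0, 1),
                                labels = c("Automatic", "Manual"))

# every element gets a label: title, subtitle, caption, axes (with units), legend
labeled <- ggplot(car_data, aes(x = disp, y = mpg, color = transmission)) +
  geom_point(size = 2) +
  labs(title = "Relationship between engine size and fuel efficiency",
       subtitle = "32 car models, 1973-74",
       caption = "source: 1974 Motor Trend US magazine (mtcars dataset in R)",
       x = "Engine displacement (cu. in.)",
       y = "Fuel efficiency (miles per gallon)",
       color = "Transmission")
labeled
```

Recoding the 0/1 variable as a factor before plotting is one simple way to get readable legend labels; setting labels on the color scale would work too.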
13.2 Signal to noise ratio
In data science, the goal of data visualization is to communicate information. Anything that doesn’t support this goal should be reduced or eliminated.
Chart Junk - visual elements of charts that aren’t necessary to comprehend the information represented by the chart or that distract from this information. (Wikipedia)
Consider the following graph. The goal is to compare the calories in bacon to the other four foods.
(Disclaimer: I got this graph from somewhere, but I can’t remember where. If you know, please tell me, so that I can make a proper attribution. Also bacon always wins.)
If the goal is to compare the calories in bacon to other foods, much of this visualization is unnecessary and distracts from the task.
Think of all the things that are superfluous:
- the tan background border
- the grey background color
- the 3-D effect on the bars
- the legend (it doesn’t add anything, the bars are already labeled)
- the colors of bars (they don’t signify anything)
Here is an alternative.
The chart junk has been removed. In addition:
- the x-axis label isn’t needed - these are obviously foods
- the y-axis is given a better label
- the title has been simplified (the word different is redundant)
- the bacon bar is the only colored bar - it makes comparisons easier
- the grid lines have been made lighter (gray rather than black) so they don’t distract
I may have gone a bit far leaving out the x-axis label. It’s a fine line, knowing when to stop simplifying.
In general, you want to reduce chart junk to a minimum. In other words, more signal, less noise.
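As a rough sketch of what a decluttered bar chart can look like in ggplot2, the code below uses the mpg dataset that ships with ggplot2 rather than the food data (which I don't have). The choice to highlight the compact class is arbitrary, and the names by_class and clean_bars are made up for the example.

```r
library(ggplot2)

# mean highway mileage per vehicle class (mpg ships with ggplot2)
by_class <- aggregate(hwy ~ class, data = mpg, FUN = mean)

# flag the one bar we want the reader to compare against the others
by_class$highlight <- by_class$class == "compact"

clean_bars <- ggplot(by_class,
                     aes(x = reorder(class, hwy), y = hwy, fill = highlight)) +
  geom_col(show.legend = FALSE) +                      # no legend: bars are labeled
  scale_fill_manual(values = c("FALSE" = "grey70",     # neutral color for the rest
                               "TRUE" = "steelblue")) + # color only the focal bar
  labs(title = "Highway mileage by vehicle class",
       caption = "source: mpg dataset in the ggplot2 package",
       x = NULL,
       y = "Mean highway mileage (miles per gallon)") +
  theme_minimal() +                                    # no grey background
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_line(color = "grey90"))
clean_bars
```

Whether to drop the x-axis label, as done here with x = NULL, is the same judgment call discussed above.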
13.3 Color choice
Color choice is about more than aesthetics. Choose colors that help convey the information contained in the plot.
The article How to Pick the Perfect Color Combination for Your Data Visualization is a great place to start.
Basically, think about selecting among sequential, diverging, and qualitative color schemes:
- sequential - for plotting a quantitative variable that goes from low to high
- diverging - for contrasting the extremes (low, medium, and high) of a quantitative variable
- qualitative - for distinguishing among the levels of a categorical variable
The article above can help you to choose among these schemes. Additionally, the RColorBrewer package provides palettes categorized in this way.
Other things to keep in mind:
- Make sure that text is legible - avoid dark text on dark backgrounds, light text on light backgrounds, and colors that clash in a discordant fashion (i.e. they hurt to look at!)
- Avoid combinations of red and green - it can be difficult for a colorblind audience to distinguish these colors
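The sketch below shows one way to pull a palette of each type from RColorBrewer and to list the palettes the package flags as colorblind friendly. The particular palettes (Blues, RdBu, Dark2) and the mpg example plot are my own picks for illustration, not recommendations from the article above.

```r
library(RColorBrewer)
library(ggplot2)

# one palette of each type
seq_colors  <- brewer.pal(n = 5, name = "Blues")  # sequential: low to high
div_colors  <- brewer.pal(n = 7, name = "RdBu")   # diverging: two extremes
qual_colors <- brewer.pal(n = 4, name = "Dark2")  # qualitative: categories

# palettes that RColorBrewer flags as colorblind friendly
cb_safe <- rownames(brewer.pal.info)[brewer.pal.info$colorblind]

# use a qualitative palette for a categorical variable
drive_data <- mpg
drive_data$drive <- factor(drive_data$drv,
                           levels = c("4", "f", "r"),
                           labels = c("Four-wheel", "Front-wheel", "Rear-wheel"))

color_plot <- ggplot(drive_data, aes(x = displ, y = hwy, color = drive)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2") +
  labs(x = "Engine displacement (liters)",
       y = "Highway mileage (miles per gallon)",
       color = "Drive train")
color_plot
```

Running display.brewer.all(colorblindFriendly = TRUE) shows the same colorblind-friendly palettes visually.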
13.4 y-Axis scaling
OK, this is a big one. You can make an effect seem massive or insignificant depending on how you scale a numeric y-axis.
Consider the following example comparing the 9-month salaries of male and female assistant professors. The data come from the Academic Salaries dataset.
# load data
data(Salaries, package = "carData")

# get means, standard deviations, and
# 95% confidence intervals for
# assistant professor salary by sex
library(dplyr)
df <- Salaries %>%
  filter(rank == "AsstProf") %>%
  group_by(sex) %>%
  summarize(n = n(),
            mean = mean(salary),
            sd = sd(salary),
            se = sd / sqrt(n),
            ci = qt(0.975, df = n - 1) * se)
df
## # A tibble: 2 x 6
##   sex        n   mean    sd    se    ci
##   <fct>  <int>  <dbl> <dbl> <dbl> <dbl>
## 1 Female    11 78050. 9372. 2826. 6296.
## 2 Male      56 81311. 7901. 1056. 2116.
# create and save the plot
# (only one y scale is added here; the y-axis limits are
# set separately each time the plot is drawn)
library(ggplot2)
p <- ggplot(df, aes(x = sex, y = mean, group = 1)) +
  geom_point(size = 4) +
  geom_line() +
  labs(title = "Mean salary differences by gender",
       subtitle = "9-mo academic salary in 2007-2008",
       caption = paste("source: Fox J. and Weisberg, S. (2011)",
                       "An R Companion to Applied Regression,",
                       "Second Edition Sage"),
       x = "Gender",
       y = "Salary") +
  scale_y_continuous(labels = scales::dollar)
First, let’s plot this with a y-axis going from 77,000 to 82,000.
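# plot in a narrow range of y
# (this replaces the y scale stored in p, so the dollar labels are repeated)
p + scale_y_continuous(limits = c(77000, 82000),
                       labels = scales::dollar)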
There appears to be a very large gender difference.
Next, let’s plot the same data with the y-axis going from 0 to 125,000.
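# plot in a wide range of y
# (this replaces the y scale stored in p, so the dollar labels are repeated)
p + scale_y_continuous(limits = c(0, 125000),
                       labels = scales::dollar)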
There doesn’t appear to be any gender difference!
The goal of ethical data visualization is to represent findings with as little distortion as possible. This means choosing an appropriate range for the y-axis. Bar charts should almost always start at y = 0. For other charts, the limits depend on subject matter knowledge of the expected range of values.
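To see why the zero baseline matters for bars, here is a small sketch. It assumes the two group means copied (rounded) from the summary table printed earlier, so it runs without the carData package. The names means, bars, zoomed, and rel_diff are made up for the example.

```r
library(ggplot2)

# group means copied (rounded) from the summary table printed earlier
means <- data.frame(sex = c("Female", "Male"),
                    mean = c(78050, 81311))

# default bar chart: geom_col() draws each bar from y = 0,
# so bar heights are proportional to the values
bars <- ggplot(means, aes(x = sex, y = mean)) +
  geom_col(fill = "grey60") +
  scale_y_continuous(labels = scales::dollar) +
  labs(x = "Gender", y = "Mean 9-month salary")

# zooming with coord_cartesian() keeps the data but cuts off the bottom
# of the bars, which visually exaggerates the gap
zoomed <- bars + coord_cartesian(ylim = c(77000, 82000))

# the relative difference the bars should convey
rel_diff <- diff(means$mean) / means$mean[1]

bars
zoomed
rel_diff
```

The relative difference works out to a little over 4%, which is what the zero-based bars show and what the zoomed version hides.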
We can also improve the graph by adding in an indicator of the uncertainty (see the section on Mean/SE plots).
# plot with confidence limits
p + geom_errorbar(aes(ymin = mean - ci,
                      ymax = mean + ci),
                  width = .1) +
  ggplot2::annotate("text",
                    label = "I-bars are 95% \nconfidence intervals",
                    x = 2,
                    y = 73500,
                    fontface = "italic",
                    size = 3)
The difference doesn’t appear to exceed chance variation.
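One rough way to back up this visual impression with numbers is a confidence interval for the difference in means. The sketch below assumes the rounded summary statistics from the table printed earlier and uses Welch's approximation for unequal variances. That is my choice for illustration, not an analysis taken from the source, and the variable names are made up.

```r
# summary statistics copied (rounded) from the table printed earlier
n_f <- 11; mean_f <- 78050; se_f <- 2826   # female assistant professors
n_m <- 56; mean_m <- 81311; se_m <- 1056   # male assistant professors

# difference in means and its standard error
diff_means <- mean_m - mean_f
se_diff <- sqrt(se_f^2 + se_m^2)

# t statistic for the difference
t_stat <- diff_means / se_diff

# Welch-Satterthwaite approximate degrees of freedom
df_welch <- (se_f^2 + se_m^2)^2 /
  (se_f^4 / (n_f - 1) + se_m^4 / (n_m - 1))

# approximate 95% confidence interval for the difference
ci_diff <- diff_means + c(-1, 1) * qt(0.975, df_welch) * se_diff

t_stat
ci_diff
```

If the interval includes zero, as it does with these numbers, the data are consistent with no difference in mean salary, which matches what the overlapping I-bars suggest.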
13.5 Attribution
Unless it’s your data, each graphic should come with an attribution - a note directing the reader to the source of the data. This will usually appear in the caption for the graph.
13.6 Going further
If you would like to learn more about ggplot2, there are several good sources, including
- the book ggplot2: Elegant Graphics for Data Analysis (be sure to get the second edition)
- the eBook R for Data Science - the data visualization chapter
If you would like to learn more about data visualization in general, here are some useful resources.
- Harvard Business Review - Visualizations That Really Work
- the website Information is Beautiful
- the book Beautiful Data: The Stories Behind Elegant Data Solutions
- the Wall Street Journal’s Guide to Information Graphics
- the book The Truthful Art
13.7 Final Note
With the growth (or should I say deluge?) of readily available data, the field of data visualization is exploding. This explosion is supported by the availability of exciting new graphical tools. It’s a great time to learn and explore. Enjoy!