A Datasets

This appendix describes the datasets used in this book.

A.1 Academic salaries

The Salaries for Professors dataset comes from the carData package. It describes the 9-month academic salaries of 397 college professors at a single institution in 2008-2009. The data were collected as part of the administration’s monitoring of gender differences in salary.

The dataset can be accessed using

data(Salaries, package="carData")

It is also provided in other formats, so that you can practice importing data.

Format                  File
Comma delimited text    Salaries.csv
Tab delimited text      Salaries.txt
Excel spreadsheet       Salaries.xlsx
SAS file                Salaries.sas7bdat
Stata file              Salaries.dta
SPSS file               Salaries.sav
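
As a minimal sketch of how these files might be imported, the helper below picks a reader function based on the file extension. The function name import_salaries is hypothetical and not part of the book; the sketch assumes the readr, readxl, and haven packages are installed and that the files sit in the working directory.

```r
# Hypothetical helper (not from the book): choose a reader function
# based on the file extension of the Salaries file.
# Assumes the readr, readxl, and haven packages are installed.
import_salaries <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv      = readr::read_csv(path),      # comma delimited text
    txt      = readr::read_tsv(path),      # tab delimited text
    xlsx     = readxl::read_excel(path),   # Excel spreadsheet (first sheet)
    sas7bdat = haven::read_sas(path),      # SAS file
    dta      = haven::read_stata(path),    # Stata file
    sav      = haven::read_sav(path),      # SPSS file
    stop("Unsupported file extension: ", ext)
  )
}

# Example call (requires Salaries.csv in the working directory):
# Salaries <- import_salaries("Salaries.csv")
```

Each branch returns a tibble, so the imported data can be used the same way regardless of the source format.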

A.2 Starwars

The starwars dataset comes from the dplyr package. It describes 13 characteristics of 87 characters from the Star Wars universe. The data are extracted from the Star Wars API.
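
Following the pattern used for the other packaged datasets in this appendix, the dataset can presumably be accessed as sketched below. This assumes the dplyr package is installed.

```r
# Load the starwars data frame from the dplyr package
data(starwars, package = "dplyr")

# Quick check of the number of rows (characters) and columns
dim(starwars)
```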

A.3 Mammal sleep

The msleep dataset comes from the ggplot2 package. It is an updated and expanded version of a dataset by Savage and West, describing the sleeping characteristics of 83 mammals.

The dataset can be accessed using

data(msleep, package="ggplot2")

A.4 Medical insurance costs

The insurance dataset is described in the book Machine Learning with R by Brett Lantz. A cleaned version of the dataset is also available on Kaggle. The dataset describes medical information and costs billed by health insurance companies in 2013, as compiled by the United States Census Bureau. Variables include age, sex, body mass index, number of children covered by health insurance, smoker status, US region, and individual medical costs billed by health insurance. The dataset covers 1,338 individuals.

A.5 Marriage records

The Marriage dataset comes from the mosaicData package. It contains the marriage records of 98 individuals collected from a probate court in Mobile County, Alabama.

The dataset can be accessed using

data(Marriage, package="mosaicData")

A.6 Fuel economy data

The mpg dataset from the ggplot2 package contains fuel economy data for 38 popular car models, for the years 1999 and 2008.

The dataset can be accessed using

data(mpg, package="ggplot2")

A.7 Literacy rates

This dataset provides the literacy rates (percent of the population that can both read and write) for each US state in 2023. The data were obtained from the World Population Review (https://worldpopulationreview.com/state-rankings/us-literacy-rates-by-state).

The dataset can be accessed using

library(readr)
litRates <- read_csv("USLitRates.csv")

A.8 Gapminder data

The gapminder dataset from the gapminder package contains longitudinal data (1952-2007) on life expectancy, GDP per capita, and population for 142 countries.

The dataset can be accessed using

data(gapminder, package="gapminder")

A.9 Current Population Survey (1985)

The CPS85 dataset from the mosaicData package contains 1985 data on wages and other characteristics of workers.

The dataset can be accessed using

data(CPS85, package="mosaicData")

A.10 Houston crime data

The crime dataset from the ggmap package contains the time, date, and location of six types of crimes in Houston, Texas, between January 2010 and August 2010.

The dataset can be accessed using

data(crime, package="ggmap")

A.11 Hispanic and Latino population

The Hispanic and Latino Population data is a raw comma delimited text file containing the percentage of Hispanic and Latino residents by US state from the 2010 Census. The data were obtained from Wikipedia (https://en.wikipedia.org/wiki/List_of_U.S._states_by_Hispanic_and_Latino_population).

The data can be accessed using

library(readr)
text <- read_csv("hisplat.csv")

A.12 US economic time series

The economics dataset from the ggplot2 package contains monthly US economic data gathered from Jan 1967 to Jan 2015.

The dataset can be accessed using

data(economics, package="ggplot2")

A.13 US population by age and year

The uspopage dataset from the gcookbook package describes the age distribution of the US population from 1900 to 2002.

The dataset can be accessed using

data(uspopage, package="gcookbook")

A.14 Saratoga housing data

The Saratoga housing dataset contains information on 1,728 houses in Saratoga County, NY, USA in 2006. Variables include price (in thousands of US dollars) and 15 property characteristics (lot size, living area, age, number of bathrooms, etc.).

The dataset can be accessed using

data(SaratogaHouses, package="mosaicData")

A.15 NCCTG lung cancer data

The lung dataset describes the survival time of 228 patients with advanced lung cancer from the North Central Cancer Treatment Group.

The dataset can be accessed using

data(lung, package="survival")

A.16 Titanic data

The Titanic dataset provides information on the fate of Titanic passengers, based on class, sex, and age. The dataset comes in table form with base R. It is provided here as a data frame.

The dataset can be accessed using

library(readr)
titanic <- read_csv("titanic.csv")
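
For readers who do not have titanic.csv at hand, the sketch below shows one way the base R table could be turned into a data frame. This is an illustration only; the layout of titanic.csv may differ, and the object names titanic_counts and titanic_people are assumptions made for this example.

```r
# Titanic is a 4-dimensional table (Class x Sex x Age x Survived) in base R
titanic_counts <- as.data.frame(Titanic)

# One row per combination of categories, with the count in Freq
head(titanic_counts)

# Optionally expand to one row per person by repeating each row Freq times
titanic_people <- titanic_counts[
  rep(seq_len(nrow(titanic_counts)), titanic_counts$Freq),
  c("Class", "Sex", "Age", "Survived")
]
nrow(titanic_people)
```

The first form is convenient for plots that take counts directly, while the expanded form suits functions that expect one observation per row.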

A.17 JFK Cuban Missile speech

The John F. Kennedy Address is a raw text file containing the president’s October 22, 1962 speech on the Cuban Missile Crisis. The text was obtained from the JFK Presidential Library and Museum.

The text can be accessed using

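library(readr)
# read_lines() returns one string per line of the file;
# read_csv() would split the speech at commas and treat the first line as a header
text <- read_lines("JFKspeech.txt")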

A.18 UK Energy forecast data

The UK energy forecast dataset contains forecasts of energy production and consumption in 2050. The data are in an RData file that contains two data frames.

  • The node data frame contains the names of the nodes (production and consumption types).
  • The links data frame contains the source (originating node), target (target node), and value (flow amount between the nodes).

The data come from Mike Bostock’s Sankey Diagrams page and the networkD3 homepage and can be accessed with the statement

load("Energy.RData")
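
Because load() restores objects under the names they were saved with, it can help to check what an RData file contains before using it. The sketch below does this with a hypothetical helper, describe_rdata, which is not part of the book; it loads the file into a temporary environment so that the global workspace is left untouched.

```r
# Hypothetical helper (not from the book): report the objects stored
# in an .RData file without adding them to the global workspace.
describe_rdata <- function(path) {
  env <- new.env()
  obj_names <- load(path, envir = env)   # load() returns the object names
  # Return the first class of each stored object, named by object
  sapply(obj_names, function(nm) class(get(nm, envir = env))[1])
}

# Example call (requires Energy.RData in the working directory):
# describe_rdata("Energy.RData")
```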