A Datasets
This appendix describes the datasets used in this book.
A.1 Academic salaries
The Salaries for Professors dataset comes from the carData package. It describes the nine-month academic salaries of 397 college professors at a single institution in 2008-2009. The data were collected as part of the administration’s monitoring of gender differences in salary.
The dataset can be accessed by loading Salaries from the carData package with the data() function.
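For example, assuming the carData package is installed, a call to data() with a package argument loads the data frame into the workspace without attaching the package. This is a minimal sketch rather than the only way to load it:

```r
# Load the Salaries data frame from carData without attaching the package
data(Salaries, package = "carData")

# Inspect the variables and their types (one row per professor)
str(Salaries)
```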
It is also provided in other formats, so that you can practice importing data.
Format | File |
---|---|
Comma delimited text | Salaries.csv |
Tab delimited text | Salaries.txt |
Excel spreadsheet | Salaries.xlsx |
SAS file | Salaries.sas7bdat |
Stata file | Salaries.dta |
SPSS file | Salaries.sav |
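As a rough sketch of the import practice mentioned above, the example below writes the carData copy of the data to a temporary CSV file and reads it back with readr. The temporary path stands in for a downloaded Salaries.csv and is an assumption, as is the choice of the readr, readxl, and haven packages for the other formats listed in the comments.

```r
library(readr)

# Stand-in for a downloaded file: write the carData copy to a temporary CSV
data(Salaries, package = "carData")
csv_path <- file.path(tempdir(), "Salaries.csv")
write_csv(Salaries, csv_path)

# Import the comma delimited file; factors come back as character columns
salaries_csv <- read_csv(csv_path, show_col_types = FALSE)
dim(salaries_csv)

# Readers for the other formats (paths are hypothetical local copies):
#   readr::read_tsv("Salaries.txt")        # tab delimited text
#   readxl::read_excel("Salaries.xlsx")    # Excel spreadsheet
#   haven::read_sas("Salaries.sas7bdat")   # SAS file
#   haven::read_dta("Salaries.dta")        # Stata file
#   haven::read_sav("Salaries.sav")        # SPSS file
```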
A.2 Star Wars
The starwars dataset comes from the dplyr package. It describes 13 characteristics of 87 characters from the Star Wars universe. The data are extracted from the Star Wars API.
A.3 Mammal sleep
The msleep dataset comes from the ggplot2 package. It is an updated and expanded version of a dataset by Savage and West, describing the sleeping characteristics of 83 mammals.
The dataset can be accessed by loading msleep from the ggplot2 package with the data() function.
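A short sketch, assuming ggplot2 is installed:

```r
# Load the mammal sleep data shipped with ggplot2
data(msleep, package = "ggplot2")

# Show the first few mammals and their sleep measurements
head(msleep)
```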
A.4 Medical insurance costs
The insurance dataset is described in the book Machine Learning with R by Brett Lantz. A cleaned version of the dataset is also available on Kaggle. The dataset describes medical information and costs billed by health insurance companies in 2013, as compiled by the United States Census Bureau. Variables include age, sex, body mass index, number of children covered by health insurance, smoker status, US region, and individual medical costs billed by health insurance for 1338 individuals.
A.5 Marriage records
The Marriage dataset comes from the mosaicData package. It contains the marriage records of 98 individuals collected from a probate court in Mobile County, Alabama.
The dataset can be accessed by loading Marriage from the mosaicData package with the data() function.
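A short sketch, assuming mosaicData is installed:

```r
# Load the marriage records shipped with mosaicData
data(Marriage, package = "mosaicData")

# Report the number of records and variables
dim(Marriage)
```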
A.6 Fuel economy data
The mpg dataset from the ggplot2 package contains fuel economy data for 38 popular models of car, for the years 1999 and 2008.
The dataset can be accessed by loading mpg from the ggplot2 package with the data() function.
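A short sketch, assuming ggplot2 is installed. The tabulation simply confirms which model years are present:

```r
# Load the fuel economy data shipped with ggplot2
data(mpg, package = "ggplot2")

# Count the vehicles recorded for each model year
table(mpg$year)
```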
A.7 Literacy Rates
This dataset provides the literacy rates (percent of the population that can both read and write) for each US State in 2023. The data were obtained from the World Population Review (https://worldpopulationreview.com/state-rankings/us-literacy-rates-by-state).
The dataset can be accessed using
A.8 Gapminder data
The gapminder dataset from the gapminder package contains longitudinal data (1952-2007) on life expectancy, GDP per capita, and population for 142 countries.
The dataset can be accessed by loading gapminder from the gapminder package with the data() function.
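A short sketch, assuming the gapminder package is installed. The two checks mirror the year range and country count given in the description:

```r
# Load the gapminder data frame from the package of the same name
data(gapminder, package = "gapminder")

# First and last survey years, and the number of distinct countries
range(gapminder$year)
length(unique(gapminder$country))
```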
A.9 Current Population Survey (1985)
The CPS85 dataset from the mosaicData package contains 1985 data on wages and other characteristics of workers.
The dataset can be accessed by loading CPS85 from the mosaicData package with the data() function.
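A short sketch, assuming mosaicData is installed:

```r
# Load the 1985 Current Population Survey sample shipped with mosaicData
data(CPS85, package = "mosaicData")

# Summarize the wage variable
summary(CPS85$wage)
```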
A.10 Houston crime data
The crime dataset from the ggmap package contains the time, date, and location of six types of crimes in Houston, Texas between January 2010 and August 2010.
The dataset can be accessed by loading crime from the ggmap package with the data() function.
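A short sketch, assuming ggmap is installed. The offense column name is taken from the ggmap data as I understand it and should be checked with names(crime):

```r
# Load the Houston crime data shipped with ggmap
data(crime, package = "ggmap")

# Count the incidents recorded for each type of offense
table(crime$offense)
```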
A.11 Hispanic and Latino Population
The Hispanic and Latino Population data is a raw tab delimited text file containing the percentage of Hispanic and Latino residents by US state from the 2010 Census. The actual dataset was obtained from Wikipedia (https://en.wikipedia.org/wiki/List_of_U.S._states_by_Hispanic_and_Latino_population).
The data can be accessed using
A.12 US economic timeseries
The economics dataset from the ggplot2 package contains monthly US economic data from July 1967 to April 2015.
The dataset can be accessed by loading economics from the ggplot2 package with the data() function.
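A short sketch, assuming ggplot2 is installed. Printing the date range is a quick way to confirm the period the series covers:

```r
# Load the monthly US economic time series shipped with ggplot2
data(economics, package = "ggplot2")

# First and last months in the series
range(economics$date)
```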
A.13 US population by age and year
The uspopage dataset describes the age distribution of the US population from 1900 to 2002.
The dataset can be accessed using
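The text above does not name a source package. A data frame called uspopage ships with the gcookbook package, so one possible way to load it, assuming that package is the intended source and is installed, is:

```r
# Assumption: uspopage is taken from the gcookbook package
data(uspopage, package = "gcookbook")

# First and last years covered
range(uspopage$Year)
```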
A.14 Saratoga housing data
The Saratoga housing dataset contains information on 1,728 houses in Saratoga County, NY, USA in 2006. Variables include price (in thousands of US dollars) and 15 property characteristics (lot size, living area, age, number of bathrooms, etc.).
The dataset can be accessed using
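The text above does not name a source package. The mosaicData package includes a SaratogaHouses data frame with 1,728 rows, so one possible way to load it, assuming that is the intended source and mosaicData is installed, is:

```r
# Assumption: the data correspond to SaratogaHouses in mosaicData
data(SaratogaHouses, package = "mosaicData")

# Number of houses and variables
dim(SaratogaHouses)
```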
A.15 NCCTG lung cancer data
The lung dataset describes the survival time of 228 patients with advanced lung cancer from the North Central Cancer Treatment Group.
The dataset can be accessed using
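The text above does not name a source package. A lung data frame with 228 patients ships with the survival package, so one possible way to obtain it, assuming that is the intended source and survival is installed, is:

```r
# Assumption: lung is taken from the survival package (lazy-loaded data)
lung <- survival::lung

# Number of patients and variables
dim(lung)
```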
A.16 Titanic data
The Titanic dataset provides information on the fate of Titanic passengers, based on class, sex, and age. The dataset comes in table form with base R. It is provided here as a data frame.
The dataset can be accessed using
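One way to get from the base R table to a data frame is sketched below. It is not necessarily how the data frame supplied with the book was built, and the object names titanic_counts and titanic_df are illustrative.

```r
# Titanic is a 4-way table (Class x Sex x Age x Survived) in base R
titanic_counts <- as.data.frame(Titanic)  # one row per cell, plus a Freq count

# Expand the counts so that each row represents one person
rows <- rep(seq_len(nrow(titanic_counts)), titanic_counts$Freq)
titanic_df <- titanic_counts[rows, c("Class", "Sex", "Age", "Survived")]
rownames(titanic_df) <- NULL

head(titanic_df)
```

The expanded form has one row per person, which is often more convenient for plotting than the table of counts.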
A.17 JFK Cuban Missile speech
The John F. Kennedy Address is a raw text file containing the president’s October 22, 1962 speech on the Cuban Missile Crisis. The text was obtained from the JFK Presidential Library and Museum.
The text can be accessed using
A.18 UK Energy forecast data
The UK energy forecast dataset contains data forecasts for energy production and consumption in 2050. The data are in an RData file that contains two data frames.
- The node data frame contains the names of the nodes (production and consumption types).
- The links data frame contains the source (originating node), target (target node), and value (flow amount between the nodes).
The data come from Mike Bostock’s Sankey Diagrams page and the networkD3 homepage and can be accessed with the statement