The qacr package contains functions and data sets designed to simplify data analyses and aid in the instruction of data science courses. The primary functions and data sets are described below.

Preparing Data

Function Description
import import() can read data from an Excel spreadsheet; a SAS, SPSS, or Stata data file; or a delimited text file (e.g., CSV) and save it as a data frame. For a delimited text file, the structure and delimiters are determined from the data.
recodes recodes() provides a simple way to recode the values of numeric, character, or factor variables. See the vignette for examples.
standardize standardize() transforms all the numeric variables in a data frame to the same mean and standard deviation (mean = 0, sd = 1 by default), without modifying character, factor, or dummy-coded variables.
normalize normalize() transforms all the numeric variables in a data frame to the same range of values ([0, 1] by default). Again, character and factor variables are left unchanged.
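
The two rescaling transformations described above can be illustrated in base R. This is only a sketch of the arithmetic, not the qacr source code: the helper names (is_scalable, standardize_sketch, normalize_sketch) and their arguments are hypothetical, and treating any numeric column containing only 0 and 1 as dummy coded is an assumption made here for illustration.

```r
# Hypothetical helpers, not the qacr implementations.

# TRUE for numeric columns that are not 0/1 dummy codes (assumed rule)
is_scalable <- function(x) {
  is.numeric(x) && !all(x[!is.na(x)] %in% c(0, 1))
}

# Rescale each qualifying column to the requested mean and sd
standardize_sketch <- function(df, new_mean = 0, new_sd = 1) {
  cols <- vapply(df, is_scalable, logical(1))
  df[cols] <- lapply(df[cols], function(x) {
    z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
    z * new_sd + new_mean
  })
  df
}

# Rescale each qualifying column to the range [new_min, new_max]
normalize_sketch <- function(df, new_min = 0, new_max = 1) {
  cols <- vapply(df, is_scalable, logical(1))
  df[cols] <- lapply(df[cols], function(x) {
    rng <- range(x, na.rm = TRUE)
    (x - rng[1]) / (rng[2] - rng[1]) * (new_max - new_min) + new_min
  })
  df
}

# Small made-up data frame: one numeric, one dummy-coded, one character column
df <- data.frame(
  height = c(150, 160, 170, 180),
  smoker = c(0, 1, 0, 1),
  name   = c("a", "b", "c", "d")
)

std <- standardize_sketch(df)  # height now has mean 0 and sd 1
nrm <- normalize_sketch(df)    # height now runs from 0 to 1
```

In both results the smoker and name columns are returned untouched, which mirrors the behavior described for character, factor, and dummy-coded variables.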

Describing a data set

Function Description
contents contents() provides a comprehensive description of a data frame. The output is much more detailed than that provided by the base summary.data.frame() function, and is easier to read and understand. This function should be your first stop when looking at a new dataset.
df_plot df_plot() helps you visualize a data frame. Variables are grouped by type (numeric, integer, character, factor, date) and color coded. The percent of missing data for each variable is also displayed, along with the total number of variables and cases.
barcharts barcharts() provides bar charts of all the character or factor variables in a data frame, within a single graph.
histograms histograms() provides histograms of all the quantitative variables in a data frame, within a single graph.
densities densities() provides density charts of all the quantitative variables in a data frame, within a single graph.

Exploratory data analysis

Numeric variables

Function Description
qstats qstats() allows you to easily calculate any number of descriptive statistics (e.g., n, mean, sd) for a quantitative variable. The results can be broken down by the levels of one or more categorical variables (groups). Any function that produces a single number can be used. See the vignette for examples.
univariate_plot univariate_plot() provides a detailed visualization of the distribution of values in a quantitative variable. The graph contains a histogram, jittered dot plot, density curve, and boxplot. Annotations provide statistics such as n, mean, sd, median, min, max, skew, and outliers.
scatter scatter() generates a scatter plot and line of best fit with a 95% confidence interval, displaying the relationship between two quantitative variables. Annotations include the slope, correlation coefficient (r), r-squared, and p-value. Outliers (determined by studentized residuals) are flagged. Optionally, marginal distributions (histograms, boxplots, density curves, violin plots) can be added to the margins of the plot.
cor_plot cor_plot() plots the correlations among numeric variables in a data frame. Variables can be sorted to place variables with similar correlation patterns together.
groupdiff groupdiff() compares groups on a quantitative outcome using either a parametric (ANOVA) or nonparametric (Kruskal-Wallis) test. Summary statistics, pair-wise group differences (post-hoc comparisons), and plots are provided.
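
For orientation, the kinds of computation described for qstats() and groupdiff() can be approximated with base R alone. The sketch below uses the built-in mtcars data and standard stats functions. It does not call qacr, and whether qacr relies on these same functions internally is not stated here.

```r
# Base R sketch using the built-in mtcars data (not qacr code)
cars <- transform(mtcars, cyl = factor(cyl))

# Descriptive statistics for mpg, broken down by number of cylinders
by_cyl <- aggregate(
  mpg ~ cyl, data = cars,
  FUN = function(x) c(n = length(x), mean = mean(x), sd = sd(x))
)
print(by_cyl)

# Parametric comparison: one-way ANOVA with Tukey post-hoc comparisons
anova_fit <- aov(mpg ~ cyl, data = cars)
print(summary(anova_fit))
print(TukeyHSD(anova_fit))

# Nonparametric comparison: Kruskal-Wallis rank sum test
kw <- kruskal.test(mpg ~ cyl, data = cars)
print(kw)
```

Any function that returns a single number (median, IQR, and so on) could be added to the FUN argument in the same way.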

Categorical Variables

Function Description
tab tab() generates a frequency table and bar chart for a categorical variable. There are many options including sorting categories by frequency, adding cumulative frequencies and percents, and combining infrequent categories into an ‘Other’ category. See the vignette for examples.
crosstab crosstab() generates a two-way frequency table from two categorical variables. There are many options including cell, row, and column percents, plotting options, and a chi-square test of independence. See the vignette for examples.
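
As a point of comparison, a frequency table, a two-way table with row percents, and a chi-square test of independence can be produced in base R as sketched below. The example uses the built-in HairEyeColor data and does not call qacr. The object names are chosen here for illustration only.

```r
# Base R sketch using the built-in HairEyeColor table (not qacr code)

# Collapse the 3-way table (Hair x Eye x Sex) to a Hair x Eye table
hair_eye <- margin.table(HairEyeColor, c(1, 2))

# One-way frequencies for hair color, sorted by frequency, with percents
hair_counts <- sort(margin.table(hair_eye, 1), decreasing = TRUE)
hair_pct <- round(100 * prop.table(hair_counts), 1)
print(rbind(n = hair_counts, percent = hair_pct))

# Two-way table with row percents
row_pct <- prop.table(hair_eye, margin = 1)
print(round(100 * row_pct, 1))

# Chi-square test of independence between hair and eye color
chi <- chisq.test(hair_eye)
print(chi)
```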

Machine learning

Cluster Analysis

Function Description
profile_plot Plot and compare mean cluster profiles.
wss_plot Create a within-groups sums-of-squares plot for determining the number of clusters in a numeric dataset.
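
The idea behind a within-groups sums-of-squares plot can be sketched with base R's kmeans(). This illustrates the technique only and is not the wss_plot implementation: the data set (iris), the range of cluster counts, and the seed are assumptions chosen for the example.

```r
# Base R sketch of a within-groups sums-of-squares plot (not qacr code)
set.seed(1234)                      # k-means uses random starting centers
x <- scale(iris[, 1:4])             # standardize the four numeric columns
max_k <- 8                          # largest number of clusters to try

# Total within-cluster sum of squares for k = 1, ..., max_k
wss <- vapply(seq_len(max_k), function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
}, numeric(1))

# Look for the "elbow" where adding clusters stops helping much
plot(seq_len(max_k), wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Within-groups sum of squares")
```

A bend in the curve suggests a reasonable number of clusters. With one cluster, the value is simply the total sum of squares of the standardized data.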

Dimension Reduction

Function Description
biPlot Creates a principal components biplot.
FA Performs common factor analysis (principal axis, maximum likelihood) with options.
PCA Performs a principal components analysis, with options.
scree_plot Performs a parallel analysis for determining the number of factors or components in a numeric dataset.
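
For readers new to these techniques, a principal components analysis, scree plot, and biplot can be sketched with base R's stats functions. The example uses the built-in USArrests data and does not call the qacr functions listed above, whose arguments are not described in this overview.

```r
# Base R sketch of PCA, a scree plot, and a biplot (not qacr code)
pca <- prcomp(USArrests, scale. = TRUE)   # PCA on standardized variables

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Scree plot: component variances in decreasing order
screeplot(pca, type = "lines", main = "Scree plot")

# Biplot: observations and variable loadings on the first two components
biplot(pca)
```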

Classification

Function Description
lift_plot Generate gain and lift charts.
roc_plot Produce a Receiver Operating Characteristic (ROC) curve for a binary prediction problem. Cut-point values and AUC are also displayed.
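
To make the ROC and AUC terminology concrete, the sketch below computes both by hand in base R. The function roc_sketch and the scores and labels are hypothetical, made up for this illustration. The sketch assumes there are no tied scores and is not the roc_plot implementation.

```r
# Hypothetical helper (not qacr code): ROC points and AUC for a binary outcome.
# Assumes labels are coded 1 (positive) / 0 (negative) and scores have no ties.
roc_sketch <- function(scores, labels) {
  ord <- order(scores, decreasing = TRUE)   # lower the cut point step by step
  labels <- labels[ord]
  tpr <- c(0, cumsum(labels == 1) / sum(labels == 1))  # true positive rate
  fpr <- c(0, cumsum(labels == 0) / sum(labels == 0))  # false positive rate
  # Area under the curve by the trapezoidal rule
  auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
  list(fpr = fpr, tpr = tpr, auc = auc)
}

# Made-up predicted probabilities and true classes
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,    1,   0,   0)

roc <- roc_sketch(scores, labels)
print(roc$auc)  # 0.8125

plot(roc$fpr, roc$tpr, type = "s",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line for a random classifier
```

The AUC equals the share of positive/negative pairs in which the positive case receives the higher score, here 13 of 16 pairs.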

Example Data sets

The qacr package contains over 25 datasets chosen to illustrate exploratory data analysis, data visualization, and machine learning. They are large enough to be interesting, while small enough to run easily on a laptop.

While each dataset can be used for data visualization, here are some pointers to get you started with specialized applications.

Ideas Datasets
Mapping Amazon forest fires, US border crossings, CIA World FactBook, Maryland crash data, US farmer’s markets, Japanese hostels, US hate crimes, California housing data, Major sports venues
Regression MLB batting statistics, Boston housing data, Automobile dataset (large), Automobile dataset (small), Coffee ratings, Google play apps, Medical costs, Student grades, Time spent watching TV
Classification Missed medical appointments, Breast cancer, Contraceptive use, Heart disease
Text Mining British movie plots, Wine reviews
Dimension Reduction Big 5 personality factors, Holland Occupational Themes