Jenny Wagner, PhD, MPH
Department of Public Health
California State University, Sacramento
Exploratory Data Analysis (EDA) is the process of examining and summarizing a dataset to identify patterns, relationships, and potential issues before performing more complex analyses. It involves techniques like calculating summary statistics, visualizing data distributions, and identifying outliers and missing values. EDA is important because it helps you become familiar with your data, reveals potential problems early, and guides decisions about subsequent analyses.
After completing Lab 2, you will be able to:

- Generate numerical summaries for numerical and categorical variables
- Create descriptive plots, including histograms, boxplots, bar charts, and scatterplots
- Explore relationships between variables and identify outliers
- Identify and address missing data
Suppose you are a public health researcher interested in studying the relationship between lifestyle factors and cardiovascular health. Using data from the Framingham Heart Study, you aim to explore key risk factors - such as sex, age, body mass index (BMI), and smoking behavior - in the baseline sample to identify patterns and relationships that could inform prevention strategies. Before beginning any in-depth analyses, you plan to conduct an exploratory data analysis (EDA) to summarize the data, visualize distributions and relationships, and identify potential issues. The EDA process will help you become familiar with the data and guide future analyses.
When you are ready to start Lab 2, create a new .Rmd file in a folder dedicated to Lab 2 materials.

As you begin Lab 2, notice that the first few tasks closely mirror those of Lab 1. Remember that, in general, you will always start a new assignment in this way - by loading packages, setting your working directory, and importing the dataset(s) you need.
In Lab 2, we will mostly be using functions contained within packages you have already installed (e.g., tidyverse). We will use one new package called naniar, which contains many functions for investigating missing data. Start by entering the following into your Console (bottom-left window) to install this package:
install.packages("naniar")
Remember that packages only need to be installed once;
however, they must be loaded every time you start a new R
session using the library()
function as follows:
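# load the packages needed for Lab 2 at the start of each R session
library(tidyverse)
library(naniar)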
Set your working directory using the setwd()
function -
but remember to paste your own file path between the parentheses (return
to Lab 1 and/or the RStudio Video Tutorial on Canvas if you need
detailed instructions).
# set working directory
setwd("/Volumes/GoogleDrive/My Drive/1_CSUS/2024-2025/Spring 2025/PUBH 207B/Website/PUBH-207B-S25")
Use the read.csv()
function to import the
modified dataset you created in Lab 1, called
“frmgham2_p1.csv”.
Recall that you exported the “frmgham2_p1.csv” dataset from RStudio in the final task of Lab 1. If you carried out this task correctly, the dataset will be located in your Lab 1 working directory (i.e., the folder containing your Lab 1 materials). To utilize this dataset for Lab 2, you will need to move or copy the file over to the folder you set as your working directory in the above step. Consider this an exercise in locating and organizing files on your computer.
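# import the modified Lab 1 dataset (saved here as an object called data,
# which is the name used in the code throughout this Lab)
data <- read.csv("frmgham2_p1.csv")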
For Lab 2, we will focus on a handful of variables of interest. Be sure to refer back to the data documentation (linked on the Lab 2 assignment page on Canvas) for more information about these variables:
- RANDID: participant identification number
- SEX: sex
- AGE: age
- BMI: body mass index
- TOTCHOL: serum total cholesterol
- EDUC: attained education (renamed in Lab 1)
- OBESE: obesity status, where obese means BMI>30 (created in Lab 1)

Let’s start by creating a new object called data_sub containing only these variables. To do this, we will use the select() function, which is used to select specific columns (variables) from a dataset. It allows you to keep only the columns you need for analysis (or drop those you don’t need), which can help make your dataset more manageable and focused:
# select variables of interest and save in a new data frame
data_sub <- data %>% select(RANDID, SEX, AGE, BMI, TOTCHOL, EDUC, OBESE)
This code creates a smaller, more focused dataset (called data_sub) containing only the variables of interest, listed above. Let’s break down this line of code:

- select(RANDID, SEX, AGE, BMI, TOTCHOL, EDUC, OBESE): Selects the specified columns from the original dataset. Only these columns will be included in the new dataset (data_sub).
- data %>%: Passes the object called data (our original dataset) into the select() function using the pipe operator (%>%). Remember, this is like telling R, ‘use this dataset and then select only these variables.’
- data_sub <-: Creates a new object called data_sub to store the result of the operation. By saving the subsetted data with a new name, the original dataset (called data) remains unchanged and available to us if we need it.

After running the code chunk above, you should see a new object in your Environment (top-right window) called data_sub (when running your own analyses, you can use whatever names make sense to you) with 4434 observations and 7 variables. Click on the data_sub object to view it in a new tab. You should see a spreadsheet open containing only the 7 variables we selected.
Like we did in Lab 1, we should identify variable types using the class() function and modify them as needed. Let’s check the variable type for each variable in data_sub. Below, you will see the code used to check the variable type and output for each variable:
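# check variable types - a sketch of those checks, assuming one class() call
# per variable (RANDID appears to have been skipped, matching the six outputs below)
class(data_sub$SEX)
class(data_sub$AGE)
class(data_sub$BMI)
class(data_sub$TOTCHOL)
class(data_sub$EDUC)
class(data_sub$OBESE)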
## [1] "integer"
## [1] "integer"
## [1] "numeric"
## [1] "integer"
## [1] "integer"
## [1] "integer"
We see that all variables are coded as “integer” variables aside from
BMI
, which is coded as “numeric”. As discussed in Lab 1, we
should modify variable types when needed to ensure variables are treated
appropriately - as numerical or categorical - in our analyses. For
example, we know the values for sex (SEX
), attained
education (EDUC
), and obesity status (OBESE
)
represent categories rather than numerical quantities. As such, we
should modify these variables accordingly using the
factor()
command (see Lab 1 for a detailed explanation of
the code):
# convert sex to unordered factor variable
data_sub$SEX <- factor(data_sub$SEX,
levels=c("1", "2"),
ordered=FALSE)
# convert attained education to ordered factor variable
data_sub$EDUC <- factor(data_sub$EDUC,
levels=c("1", "2", "3", "4"),
ordered = TRUE)
# convert obesity status to unordered factor variable
data_sub$OBESE <- factor(data_sub$OBESE,
levels=c("0", "1"),
ordered=FALSE)
You might be thinking, “wait, didn’t we already convert sex and attained education to factor variables in Lab 1? Why are they integer variables once again?” Note that when you import a CSV file, R will automatically interpret variables with numerical values as either “integer” or “numeric” - R does not know that the values represent categories until we specify this. So, any time you bring in a new CSV file, you will need to go through the process of identifying and modifying variable types as needed.
Exploratory data analysis (EDA) typically involves a combination of: (1) numerical summaries, and (2) descriptive plots or visuals. We will explore both in this Lab, beginning here with numerical summaries. This means generating summary statistics for numerical variables and frequency distributions for categorical variables.
Our data_sub
object contains three numerical variables
(excluding participant ID number) - age (AGE
), body mass
index (BMI
), and total cholesterol level
(TOTCHOL
) - and three categorical variables - sex
(SEX
), attained education (EDUC
), and obesity
status (OBESE
).
To understand the distributions of these variables, the
summary()
function is a great place to start:
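# numerical summary of every variable in data_sub
summary(data_sub)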
## RANDID SEX AGE BMI TOTCHOL
## Min. : 2448 1:1944 Min. :32.00 Min. :15.54 Min. :107
## 1st Qu.:2440336 2:2490 1st Qu.:42.00 1st Qu.:23.09 1st Qu.:206
## Median :4972848 Median :49.00 Median :25.45 Median :234
## Mean :4987278 Mean :49.93 Mean :25.85 Mean :237
## 3rd Qu.:7463577 3rd Qu.:57.00 3rd Qu.:28.09 3rd Qu.:264
## Max. :9999312 Max. :70.00 Max. :56.80 Max. :696
## NA's :19 NA's :52
## EDUC OBESE
## 1 :1822 0 :3841
## 2 :1281 1 : 574
## 3 : 716 NA's: 19
## 4 : 502
## NA's: 113
##
##
The summary() function in R provides a quick overview of the dataset by summarizing each variable in the specified dataset (in this case, data_sub). For numerical variables, it calculates key descriptive statistics, and for categorical variables, it provides frequency counts for each category. The output also provides the number of missing values for each variable (more on missing data later). After running the above code chunk, you will see the output displayed above.
For example, we can see that the baseline sample contains 1944 males,
where SEX
=1, and 2490 females, where SEX
=2
(How do we know which category corresponds to each value? See the data
documentation). We can also see that the minimum and maximum ages in the
sample are 32 and 70, respectively, with a mean of 49.93 years and
median of 49 years. There are no missing values for sex or age.
Examine the output and answer the following:
As stated previously, the summary()
function is a great
starting point, since it provides a summary for each variable in the
dataset. However, the summary()
function does not
necessarily provide all of the information you might need. There are
several functions you might use to generate summary statistics and
frequency distributions.
For example, suppose you just want the mean and standard deviation
for BMI. You can obtain these values using the mean()
and
sd()
functions as follows:
# obtain mean and standard deviation for BMI in the baseline sample
mean(data_sub$BMI, na.rm = TRUE)
## [1] 25.84616
sd(data_sub$BMI, na.rm = TRUE)
## [1] 4.101821
When you run the above code chunk, you will see the mean and standard
deviation of BMI in the Console (bottom-left window). Notice that, when
using the mean()
and sd()
functions, you need
to specify both the dataset you want R to use (in this case,
data_sub
), and the specific variable of interest (in this
case, BMI
), separated by the $
operator.
What does na.rm = TRUE mean? This argument is used to remove missing values (NAs) before performing a calculation. Some functions in R will automatically ignore missing values, while others require you to specify this explicitly. I know, kind of annoying. The mean() and sd() functions are two examples in which R will return an “NA” result in the presence of missing values - unless you explicitly tell R to ignore them. More on missing values later in this Lab.
Suppose you want to generate the frequency distribution for a single
categorical variable - say, obesity status. You can obtain the frequency
distribution using the table()
command as follows:
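# frequency distribution for obesity status
table(data_sub$OBESE)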
##
## 0 1
## 3841 574
When you run the above code chunk, you will see the frequencies for
each category (0=not obese; 1=obese) in the Console (bottom-left
window). As above, when using the table()
command, you need
to specify both the dataset you want R to use (in this case,
data_sub
), and the specific variable(s) of interest (in
this case, OBESE
), separated by the $
operator.
Exploring the distribution of variables using descriptive plots complements summary statistics and frequencies by providing a visual representation of the data. While descriptive statistics and frequencies offer numerical summaries, plots can help us understand the overall shape of a distribution and reveal patterns, trends, and anomalies that numbers alone may not capture. For example, histograms or boxplots of numerical variables can highlight skewness, outliers, or multimodal distributions, while bar charts of categorical variables can make the relative prevalence of categories easier to interpret. Visualizing data can also help identify issues like data entry errors or unusual variability in addition to making it easier to communicate findings to others.
Histograms are useful visuals for numerical variables. A histogram shows the frequency of observations across “bins” of values and is best for assessing the shape and modality of a variable. For example, if a dataset includes ages ranging from 18 to 90, using bins of 5 years (e.g., 18–22, 23–27, etc.) might make sense to align with common age groupings in research. This would require about 15 bins, where the frequencies generated would be the number of people with ages in each bin.
Let’s generate a histogram for each numerical variable of interest,
beginning with age, using the hist()
function:
# histogram for age
hist(data_sub$AGE,
breaks = 15,
main = "Histogram of Baseline Age",
xlab = "Age",
ylab = "Frequency")
Let’s break down each part of the code above:

- hist(data_sub$AGE...): Creates a histogram of the AGE variable from the data_sub object, showing the frequency of different age ranges.
- breaks = 15: Specifies the number of bins (or intervals) for grouping the ages. Setting breaks = 15 creates 15 bins.
- main = "Histogram of Baseline Age": Sets the title of the histogram to “Histogram of Baseline Age.”
- xlab = "Age": Labels the x-axis as “Age,” so we know what the bins represent.
- ylab = "Frequency": Labels the y-axis as “Frequency,” which shows the number of observations in each bin.

After running the code chunk above, you should see a histogram of age appear in the plots window (bottom-right). The distribution of age appears to be generally mound-shaped with little skewness.
Two additional examples, for BMI and total cholesterol:
# histogram for BMI
hist(data_sub$BMI,
breaks = 30,
main = "Histogram of Baseline BMI",
xlab = "BMI",
ylab = "Frequency")
# histogram for total cholesterol level
hist(data_sub$TOTCHOL,
breaks = 30,
main = "Histogram of Baseline Total Cholesterol Level",
xlab = "Total Cholesterol Level",
ylab = "Frequency")
Notice that the only changes we needed to make to the code to create these additional histograms were to the specific variable of interest (specified in the first line of code in each chunk) and the labels (so that our x-axis and chart titles accurately reflect the variable at hand).
From the histograms, we see that both BMI and total cholesterol level are slightly right-skewed, meaning the peak of the data is on the left, and the values taper off toward the right. In both cases, most of the data falls within a bell-shaped curve, but there are some extreme values to the right. Note that in a right-skewed distribution, the mean is greater than the median. This happens because the long tail on the right pulls the mean higher while the median is relatively unaffected.
Like histograms, boxplots are useful visuals for numerical variables. A boxplot provides a visual summary of key statistics (e.g., median, quartiles, range) and is ideal for identifying outliers and comparing the distribution of a numerical variable across groups (more on this later).
Let’s generate a boxplot for each numerical variable of interest,
beginning with age, using the boxplot()
function:
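# boxplot for age
boxplot(data_sub$AGE,
        main = "Boxplot of Baseline Age",
        ylab = "Age")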
Let’s break down each part of the code above:

- boxplot(data_sub$AGE...): Creates a boxplot of the AGE variable from the data_sub dataset. The boxplot summarizes the distribution of ages, showing the median, interquartile range (IQR), and min/max values. Boxplots will also typically reveal any potential outliers, which will be represented by open circles, but we don’t see any in this case.
- main = "Boxplot of Baseline Age": Sets the title of the plot to “Boxplot of Baseline Age” so that we have context for what the plot represents.
- ylab = "Age": Labels the y-axis as “Age,” indicating that the values displayed on the axis correspond to participant ages.

After running the code chunk above, you should see a boxplot of age appear in the plots window (bottom-right). The distribution of age appears to be generally symmetrical (with mean approximately equal to median) with no outliers.
Two additional examples, for BMI and total cholesterol level:
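# boxplot for BMI
boxplot(data_sub$BMI,
        main = "Boxplot of Baseline BMI",
        ylab = "BMI")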
# boxplot for total cholesterol level
boxplot(data_sub$TOTCHOL,
main = "Boxplot of Baseline Total Cholesterol Level",
ylab = "Total Cholesterol Level")
Again, notice the changes in the code. To create a new boxplot for a different variable, we only need to change the specific variable indicated on the first line of each code chunk, as well as the corresponding title and axis labels.
As expected given our examination of their histograms, both BMI and total cholesterol level are right-skewed. We can see this reflected in their respective boxplots because there are many extreme values above the median. Every observation shown by an open circle on the boxplot figures represents an outlier. If you’re doing this by hand, there are a couple of steps involved in identifying which points are outliers (while we won’t do anything “by hand” in this course, it is important to understand the process):

1. Calculate the first quartile (Q1), the third quartile (Q3), and the interquartile range (IQR = Q3 - Q1).
2. Compute the “fences” at Q1 - 1.5 x IQR and Q3 + 1.5 x IQR; any value falling outside these fences is flagged as an outlier.
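As a rough sketch, here is how those steps could be carried out in R for BMI (this mirrors the default 1.5 x IQR rule used by boxplot(); the object names are just for illustration):

# quartiles and IQR for BMI, ignoring missing values
q1 <- quantile(data_sub$BMI, 0.25, na.rm = TRUE)
q3 <- quantile(data_sub$BMI, 0.75, na.rm = TRUE)
iqr <- q3 - q1

# fences: values outside these limits are flagged as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# count the observations falling outside the fences
sum(data_sub$BMI < lower_fence | data_sub$BMI > upper_fence, na.rm = TRUE)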
Identifying outliers is important because they can indicate possible data errors or unusual observations, which can significantly influence statistical analyses if left unaddressed. Boxplots provide a quick visual summary to detect these outliers.
Barcharts are an effective way to visualize the frequency distribution of categorical variables. They display the number (or proportion) of observations in each category as rectangular bars, allowing us to easily compare the prevalence of different groups. On a bar chart, each unique category is represented on the x-axis, and the height of each bar corresponds to the frequency (or relative frequency) of observations in that category. Barcharts can help us quickly identify the most and least frequent categories, easily visualize differences in the size of categories, and communicate results to a wide audience.
Let’s generate a bar chart for each categorical variable of interest,
beginning with sex, by nesting the table()
function within
the barplot()
function:
# generate a simple bar chart for sex
barplot(table(data_sub$SEX),
main = "Barchart of Sex Distribution",
xlab = "Sex",
ylab = "Frequency",
col = "lightblue",
names.arg = c("Male", "Female"))
Let’s break down each part of this code chunk:

- barplot(table(data_sub$SEX...)):
  - table(data_sub$SEX): Creates a frequency table for the SEX variable, counting how many observations fall into each category (e.g., male, female).
  - barplot(): Generates a bar chart based on the frequency table.
- main = "Barchart of Sex Distribution": Sets the title of the chart to “Barchart of Sex Distribution.”
- xlab = "Sex": Labels the x-axis as “Sex,” indicating what the categories on the x-axis represent.
- ylab = "Frequency": Labels the y-axis as “Frequency,” indicating that the height of each bar represents the count of observations in each category.
- col = "lightblue": Sets the color of the bars to light blue, making the chart visually appealing.
- names.arg = c("Male", "Female"): Replaces the default x-axis labels (e.g., coded values 1 and 2) with descriptive labels (“Male” and “Female”), making the chart easier to interpret.

After running the code chunk above, you should see a bar chart of sex appear in the plots window (bottom-right). Based on this chart, we can see that females make up a larger portion of the sample than males.
Two additional examples, for attained education and obesity status:
# generate a simple bar chart for attained education
barplot(table(data_sub$EDUC),
main = "Barchart of Attained Education",
xlab = "Attained Education Level",
ylab = "Frequency",
col = "pink",
names.arg = c("0-11 years", "High School", "Some College", "College degree or more"))
Note: If some of the labels on the x-axis don’t show up on your bar chart, you may just need to expand the plots window. Click and drag the border of the plots window to make it larger. This can be done by hovering your mouse over the edge of the window until the cursor changes to a resizing icon, then dragging to resize. Better yet, if resizing the window isn’t enough, click the Zoom button in the plots pane (represented by a magnifying glass icon). This opens the plot in a larger, separate window.
# generate a simple bar chart for obesity status
barplot(table(data_sub$OBESE),
main = "Barchart of Obesity Status",
xlab = "Obesity Status",
ylab = "Frequency",
col = "purple",
names.arg = c("Not obese", "Obese (BMI>30)"))
From the barcharts above, we can get a good sense of the distributions of attained education and obesity within the baseline sample. For example, the bar chart shows a decreasing trend in the number of individuals as the level of attained education increases. This suggests that higher education levels are less common in the sample, with fewer participants achieving college degrees compared to lower levels of education. We also see that a relatively small proportion of individuals in the baseline sample are considered obese (BMI>30) compared to those who are not obese.
Now that we have examined each variable of interest individually and gained an understanding of their distributions, we can move on to exploring relationships between variables. This next step allows us to identify patterns, trends, or associations that may guide future analyses.
Just like when we examined individual variables, we can use descriptive plots/visuals to explore relationships between variables. Exploring relationships between variables depends on the types of variables involved, as different approaches are appropriate for different combinations of variable types. Here’s a brief overview with examples from our data…
In general, we will use scatterplots to visualize the relationship between two numerical variables. For example, you might be interested in the relationship between BMI and total cholesterol level. On a scatterplot, one variable is plotted on the x-axis and the other on the y-axis. Scatterplots allow us to identify trends (e.g., positive, negative, or no correlation) and assess the linearity of the relationship.
# Basic scatterplot of BMI and total cholesterol level
plot(data_sub$BMI, data_sub$TOTCHOL,
main = "Scatterplot of BMI vs. total cholesterol Level",
xlab = "BMI",
ylab = "Total Cholesterol Level")
Let’s break down this code:

- plot(data_sub$BMI, data_sub$TOTCHOL...): Creates a scatterplot with BMI on the x-axis and total cholesterol level on the y-axis. Notice how for both BMI and total cholesterol level we are identifying the dataset for R to use (data_sub) and the specific variables of interest (BMI and TOTCHOL), separated by the $ operator.
- main = "Scatterplot of BMI vs. Total Cholesterol Level": Adds a title to the scatterplot.
- xlab = "BMI" and ylab = "Total Cholesterol Level": Labels the x-axis and y-axis, making the plot easier to interpret.

We can use boxplots to compare the distribution of a numerical variable across categories. For example, you might be interested in the distribution of BMI across attained education levels. Stratifying boxplots in this way allows us to assess whether the numerical variable differs across categories or shows patterns related to the categorical variable.
# Boxplot of BMI across attained education levels
boxplot(data_sub$BMI ~ data_sub$EDUC,
main = "Boxplot of BMI by Education Level",
xlab = "Education Level",
ylab = "BMI",
col = "lightblue",
names = c("0-11 years", "High School", "Some College", "College degree or more"))
This code generates a boxplot where each box represents the distribution of BMI for a specific education level. The plot visually compares medians, variability, and potential outliers of BMI across levels of attained education. Let’s break down the code:
- boxplot(data_sub$BMI ~ data_sub$EDUC…): Specifies the formula for the boxplot, where BMI is the numerical variable to be plotted, and EDUC is the categorical variable used to group the data.
- main = "Boxplot of BMI by Education Level": Sets the title of the boxplot.
- xlab = "Education Level" and ylab = "BMI": Labels the x-axis and y-axis.
- col = "lightblue": Colors the boxplots light blue.
- names = c(...): Replaces the default category labels (1, 2, 3, and 4) with descriptive labels for each education level.

As above, expand the plots window or use the Zoom button in the plots pane to open the plot in a larger, separate window.
To explore the relationship between two categorical variables, we can create a contingency table to show the frequencies of combinations of the categories. We can then use a barchart to visualize the frequencies or proportions within each category. For example, you might be interested in the relationship between sex and obesity status. Contingency tables and bar charts can help us determine whether certain categories are more or less likely to occur together. Here’s an example:
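# contingency table of sex (rows) by obesity status (columns)
table(data_sub$SEX, data_sub$OBESE)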
##
## 0 1
## 1 1707 232
## 2 2134 342
Recall that we used the table()
command up above to
tabulate the frequencies in each category for a single
categorical variable. Here, we see that we can also use the
table()
function to generate a contingency table, which
displays the frequency (or count) of observations for each combination
of categories from the two variables. In this case, the table shows the
number of males and females who are obese or not obese. We can then plot
these values on a bar chart by nesting the table()
function
within the barplot()
function as follows:
# Barchart for sex by obesity status
barplot(table(data_sub$SEX, data_sub$OBESE),
beside = TRUE,
main = "Barchart of Sex by Obesity Status",
xlab = "Obesity Status",
ylab = "Frequency",
col = c("lightblue", "lightgreen"),
legend = c("Male", "Female"),
names.arg = c("Not Obese", "Obese"))
Let’s break down this code:

- table(data_sub$SEX, data_sub$OBESE...): Creates a contingency table with sex (SEX) as rows and obesity status (OBESE) as columns.
- barplot(): Generates a barchart from the contingency table.
- beside = TRUE: Creates separate bars for each category (instead of stacking them), which might make it a bit easier to compare groups.
- main, xlab, ylab: Add a title and axis labels to describe the chart.
- col = c("lightblue", "lightgreen"): Assigns colors to the bars for the two groups (e.g., “Male” and “Female”).
- legend = c("Male", "Female"): Adds a legend to identify which bars correspond to males and females.
- names.arg = c("Not Obese", "Obese"): Replaces the default x-axis labels (e.g., 0 and 1) with more descriptive category names.

When comparing frequencies or counts across groups, be aware that the sizes of the groups may differ significantly, and this can distort your interpretation. For example, if there are more females than males in the sample, it might appear that females are more frequently obese simply because there are more females overall, not because they are at greater risk. To account for these differences, it’s important to examine relative frequencies or proportions, which standardize the data by dividing the count in each category by the total for that group. This approach allows for fair comparisons across groups of different sizes and provides a more accurate picture of the relationships in the data.
We can plot relative frequencies in a few steps as follows:
# Create the contingency table
contingency_table <- table(data_sub$SEX, data_sub$OBESE)
# Calculate relative frequencies using the `prop.table()` command
relative_frequencies <- prop.table(contingency_table, margin = 1)
# Print relative frequencies
print(relative_frequencies)
##
## 0 1
## 1 0.8803507 0.1196493
## 2 0.8618740 0.1381260
# Barchart for relative frequencies
barplot(relative_frequencies,
beside = TRUE,
main = "Relative Frequencies of Sex by Obesity Status",
xlab = "Obesity Status",
ylab = "Proportion",
col = c("lightblue", "lightgreen"),
legend = c("Male", "Female"),
names.arg = c("Not Obese", "Obese"))
A brief breakdown of this code…

- contingency_table <- table(data_sub$SEX, data_sub$OBESE):
  - table(data_sub$SEX, data_sub$OBESE): Creates the contingency table for sex and obesity status.
  - contingency_table <-: Saves the resulting contingency table as an object called “contingency_table”, allowing us to input this object directly into the subsequent function.
- relative_frequencies <- prop.table(contingency_table, margin = 1):
  - prop.table(contingency_table, margin = 1): Calculates relative frequencies for each group. The margin = 1 argument ensures that the proportions are calculated row-wise (i.e., within each sex group).
  - relative_frequencies <-: Saves the resulting relative frequencies as an object called “relative_frequencies”, allowing us to input this object directly into the subsequent function.
- print(relative_frequencies): Displays the relative frequencies in the Console (bottom-left window). This step might be helpful because it provides the exact proportion of participants in each obesity category for males and females, but it is not actually necessary to create the bar chart.
- barplot(relative_frequencies, ...): Creates a barchart to visualize the relative frequencies, similar to the one for counts but scaled to show proportions (values between 0 and 1). See the previous bar chart above (of counts) for an explanation of each argument nested in the barplot() function here.

From the barchart of relative frequencies, we can see that females in the sample are slightly more likely to be obese than males. In the coming weeks in this course, we will learn how to test whether group differences like this are statistically significant, meaning there is enough evidence in the data to conclude that a real difference exists (in the population) - or in other words, that the difference is unlikely to have happened by chance.
A key point… Understanding how variables relate to each other is a big part of public health work. It helps us figure out how health outcomes, behaviors, and risk factors connect, which can point us toward effective interventions and policies. For example, looking at how smoking affects lung health, how BMI relates to blood pressure, or how education level ties to access to care can reveal important patterns. Right now, we’re just getting started with exploring these relationships through EDA. Later in the course, we’ll learn more advanced methods to answer questions, like identifying risk factors for disease or uncovering health disparities.
Identifying and addressing missing data is important to ensure the accuracy and reliability of any analysis. Missing data can introduce bias, especially if certain groups or patterns are more affected, and lead to results that don’t reflect the true population. Missing data may also reduce the power of statistical tests and limit the amount of usable data, which might make it more difficult to draw meaningful conclusions. By addressing missing data properly — whether by removing incomplete cases or using techniques like imputation — we can maintain the integrity of the analysis.
Recall from above that we used the summary()
command to
generate a numerical summary for each variable in our dataset. In
addition to summary statistics, this function provided the number of
missing data points for each variable. Let’s take another look:
## RANDID SEX AGE BMI TOTCHOL
## Min. : 2448 1:1944 Min. :32.00 Min. :15.54 Min. :107
## 1st Qu.:2440336 2:2490 1st Qu.:42.00 1st Qu.:23.09 1st Qu.:206
## Median :4972848 Median :49.00 Median :25.45 Median :234
## Mean :4987278 Mean :49.93 Mean :25.85 Mean :237
## 3rd Qu.:7463577 3rd Qu.:57.00 3rd Qu.:28.09 3rd Qu.:264
## Max. :9999312 Max. :70.00 Max. :56.80 Max. :696
## NA's :19 NA's :52
## EDUC OBESE
## 1 :1822 0 :3841
## 2 :1281 1 : 574
## 3 : 716 NA's: 19
## 4 : 502
## NA's: 113
##
##
For some variables, we see that the list of summary statistics
includes the number of NA
values. NA means “not available”
and indicates a missing data point. For example, of our 4434 total
observations, there are 19 missing values for BMI (and for obesity
status, given this variable was derived from BMI), 52 missing values for
total cholesterol level, and 113 missing values for attained
education.
There are a few other ways you can determine the number of missing
values. For example, suppose you want to obtain the number of missing
values for a single variable, say, attained education
(EDUC
). We can add up the number of ‘NA’ values by using
the is.na()
function nested within the sum()
function as follows:
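# total number of NA values for attained education
sum(is.na(data_sub$EDUC))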
## [1] 113
Or, suppose you want to generate a count of missing values for all
variables but without printing the full numerical summary (as we did
with the summary()
function). We can add up the number of
‘NA’ values in each column by using the is.na()
function
nested within the colSums()
function as follows:
## RANDID SEX AGE BMI TOTCHOL EDUC OBESE
## 0 0 0 19 52 113 19
Suppose you want to know what percentage of data points are missing for each variable. We can use the following to calculate the percentage of missing data, which can help us assess whether the missingness is substantial:
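# percentage of missing values for each variable (one common approach)
colMeans(is.na(data_sub)) * 100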
## RANDID SEX AGE BMI TOTCHOL EDUC OBESE
## 0.000000 0.000000 0.000000 0.428507 1.172756 2.548489 0.428507
From the output we can see that approximately 0.43% of values are missing for BMI, about 1.17% are missing for TOTCHOL level, 2.55% for attained education, and 0.43% for obesity status.
Finally, visualizing missingness can be a helpful way to identify patterns in missing data. It allows us to see whether missing values are concentrated in specific variables, occur together across multiple variables, or appear randomly. Understanding these patterns is important because they can help us decide how to handle missing data appropriately.
An “upset chart” is one tool for visualizing missingness. It shows patterns of missing data across multiple variables. Each column in the chart represents a specific pattern of missingness (e.g., which variables are missing together), while the height of the bar above each column indicates how many observations fall into that pattern. Rows of dots below the bars show which variables are missing in each pattern. For example:
If you notice a lot of overlap (e.g., the same variables missing together frequently) or if entire groups of variables have high missingness, it could indicate a problem that requires further investigation. Random missingness, on the other hand, often shows as small bars spread across different patterns.
We can use the gg_miss_upset()
function from the
naniar
package to create an upset chart as follows:
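# upset chart of missing data patterns across variables
gg_miss_upset(data_sub)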
From the chart, we see that most of the missingness is isolated to
individual variables. For example, there are 111 observations for which
EDUC
is the only missing variable, and 50
observations for which TOTCHOL
is the only missing
variable. Observations rarely have co-occurring missing values.
In real life, our investigation of missing data would go a bit
further - we would want to know, for example, whether there are patterns
of missingness related to the actual values of other variables (e.g., do
missing values of TOTCHOL
tend to cluster within
participants with high BMI?). While we won’t dive into this level of
detail in this course, it’s important to understand that examining
missing data can be a complex and time-consuming process, especially
when working with large datasets.
Now that we know how to quantify missing data, let’s walk through a couple of approaches for addressing it. There are several options for dealing with missing data, but we will focus on the two simplest methods: (1) complete case analysis, and (2) simple imputation.

Under the first approach, complete case analysis, rows in which data are missing for any variables
involved in the analysis are excluded. In other words, if a participant
is missing a value for any variable used in our analysis, that
participant’s data is removed from the analytic sample. This approach is
straightforward but may reduce the sample size significantly, especially
if many variables have missing values. This approach is most appropriate
when the missingness is minimal and occurs randomly. We can omit rows
with missing data using the na.omit() function as follows:
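# keep only rows with no missing values
data_complete <- na.omit(data_sub)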
The na.omit()
function tells R to remove all rows which
contain any missing values. It is a good idea to save the modified
dataset as a new object in our Environment, as we have done here
(data_complete
), so that we retain the original dataset for
further use. Notice that, after running the above code chunk, we have a
new object in our Environment called data_complete
which
contains 4253 observations and 7 variables (the same 7 variables we had
in data_sub
). By eliminating rows with missing data, we
have eliminated 181 observations and retained approximately 96% of our
original sample. Not too bad. However, if you, as the researcher, decide
you do not want to eliminate these observations completely, simple
imputation (or a more advanced method) may be a more appropriate
strategy for handling missing data. Let’s take a look at this approach
next…
Simple imputation entails replacing missing values with a single
value, such as the mean, median, or mode of the variable. This method is
simple and retains the sample size, but it may underestimate variability
in the data. It’s best for variables with few missing values and no
strong patterns of missingness. Let’s consider when it might be
appropriate to apply simple imputation using the mean, median, or mode
as the replacement value. But first, let’s duplicate our dataset with a
new name (data_impute
) so that we can retain the original
if needed…
# duplicate `data_sub` and call it `data_impute` to use for simple imputation
data_impute <- data_sub
# replace missing values of BMI with its mean
data_impute$BMI[is.na(data_impute$BMI)] <- mean(data_impute$BMI, na.rm = TRUE)
Let’s break this down…

- data_impute$BMI: Refers to the BMI column in the data_impute object. The $ operator is used to access this specific column.
- is.na(data_impute$BMI): Identifies which values in the BMI column are missing (NA). Returns a logical vector where TRUE indicates missing values and FALSE indicates non-missing values.
- data_impute$BMI[is.na(data_impute$BMI)]: Selects only the elements in the BMI column that are missing (NA), based on the logical vector created by is.na().
- mean(data_impute$BMI, na.rm = TRUE): Calculates the mean (average) of the BMI column, excluding missing values. The na.rm = TRUE argument tells R to ignore NA’s when calculating the mean.
- <-: Assigns the computed mean value to the missing (NA) positions in the BMI column.

# view the updated summary after mean imputation
summary(data_impute)

## RANDID SEX AGE BMI TOTCHOL
## Min. : 2448 1:1944 Min. :32.00 Min. :15.54 Min. :107
## 1st Qu.:2440336 2:2490 1st Qu.:42.00 1st Qu.:23.09 1st Qu.:206
## Median :4972848 Median :49.00 Median :25.46 Median :234
## Mean :4987278 Mean :49.93 Mean :25.85 Mean :237
## 3rd Qu.:7463577 3rd Qu.:57.00 3rd Qu.:28.06 3rd Qu.:264
## Max. :9999312 Max. :70.00 Max. :56.80 Max. :696
## NA's :52
## EDUC OBESE
## 1 :1822 0 :3841
## 2 :1281 1 : 574
## 3 : 716 NA's: 19
## 4 : 502
## NA's: 113
##
##
When we run the summary()
function above after using
mean imputation for BMI, we see that the BMI
variable now
shows no missing values. Yay!
# replace missing values of TOTCHOL with its median
data_impute$TOTCHOL[is.na(data_impute$TOTCHOL)] <- median(data_impute$TOTCHOL, na.rm = TRUE)
The line of code above can be broken down in the same way as the line
we used for mean imputation of BMI previously. The only differences are
that: (1) We have specified the variable of interest as
TOTCHOL
; and (2) We are using the median()
function to impute missing values using the median, rather than the
mean.
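# view the updated summary after median imputation
summary(data_impute)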
## RANDID SEX AGE BMI TOTCHOL
## Min. : 2448 1:1944 Min. :32.00 Min. :15.54 Min. :107.0
## 1st Qu.:2440336 2:2490 1st Qu.:42.00 1st Qu.:23.09 1st Qu.:206.0
## Median :4972848 Median :49.00 Median :25.46 Median :234.0
## Mean :4987278 Mean :49.93 Mean :25.85 Mean :236.9
## 3rd Qu.:7463577 3rd Qu.:57.00 3rd Qu.:28.06 3rd Qu.:263.0
## Max. :9999312 Max. :70.00 Max. :56.80 Max. :696.0
## EDUC OBESE
## 1 :1822 0 :3841
## 2 :1281 1 : 574
## 3 : 716 NA's: 19
## 4 : 502
## NA's: 113
##
When we run the summary()
function above after using
median imputation for total cholesterol level, we see that the
TOTCHOL
variable now shows no missing values. Yay!
# replace missing values of EDUC with its mode
mode_value <- names(sort(table(data_impute$EDUC), decreasing = TRUE))[1]
data_impute$EDUC[is.na(data_impute$EDUC)] <- mode_value
Mode imputation is carried out in two steps, each with its own line of code as shown above. A brief breakdown…

1. The first line of code identifies the mode (the most frequent category) of the EDUC variable and assigns it to the object mode_value. More specifically…
   - table(data_impute$EDUC): Creates a frequency table for the EDUC column, showing the counts of each unique value (category).
   - sort(..., decreasing = TRUE): Sorts the frequency table in descending order, so the most frequently occurring category is at the top.
   - names(...): Extracts the names (categories) of the sorted table, ignoring the counts.
   - [1]: Selects the first name in the sorted table, which corresponds to the most frequent category (the mode).
   - mode_value <-: Assigns the mode (e.g., “0-11 years”) to the object mode_value (see Environment).
2. The second line of code replaces the missing values (NA) in the EDUC column with the mode calculated in the first line. More specifically…
   - is.na(data_impute$EDUC): Identifies which values in the EDUC column are missing (NA), returning a logical vector (TRUE for missing, FALSE for non-missing).
   - data_impute$EDUC[is.na(data_impute$EDUC)]: Selects only the rows in the EDUC column where the value is missing.
   - mode_value: The previously calculated mode of the EDUC column, which will replace the missing values.
   - <-: Assigns the mode value to the missing entries in the EDUC column.

# view the updated summary after mode imputation
summary(data_impute)

## RANDID SEX AGE BMI TOTCHOL
## Min. : 2448 1:1944 Min. :32.00 Min. :15.54 Min. :107.0
## 1st Qu.:2440336 2:2490 1st Qu.:42.00 1st Qu.:23.09 1st Qu.:206.0
## Median :4972848 Median :49.00 Median :25.46 Median :234.0
## Mean :4987278 Mean :49.93 Mean :25.85 Mean :236.9
## 3rd Qu.:7463577 3rd Qu.:57.00 3rd Qu.:28.06 3rd Qu.:263.0
## Max. :9999312 Max. :70.00 Max. :56.80 Max. :696.0
## EDUC OBESE
## 1:1935 0 :3841
## 2:1281 1 : 574
## 3: 716 NA's: 19
## 4: 502
##
##
When we run the summary()
function above after using
mode imputation for attained education, we see that the
EDUC
variable now shows no missing values. Yay!
Notice that the only variable with remaining missing values is
OBESE
. Recall that this is a variable we created in Lab 1
based on values of BMI, where we defined ‘obese’ as BMI greater than 30.
Now that we have addressed missing values for the variable
BMI
, we can address missing values of the variable
OBESE
simply by recreating the variable within this updated
dataset. We can do this as follows:
# create a new variable for obese status
data_impute <- data_impute %>% mutate(OBESE = ifelse(BMI>30, 1, 0))
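# view the updated summary after recreating OBESE
summary(data_impute)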
## RANDID SEX AGE BMI TOTCHOL
## Min. : 2448 1:1944 Min. :32.00 Min. :15.54 Min. :107.0
## 1st Qu.:2440336 2:2490 1st Qu.:42.00 1st Qu.:23.09 1st Qu.:206.0
## Median :4972848 Median :49.00 Median :25.46 Median :234.0
## Mean :4987278 Mean :49.93 Mean :25.85 Mean :236.9
## 3rd Qu.:7463577 3rd Qu.:57.00 3rd Qu.:28.06 3rd Qu.:263.0
## Max. :9999312 Max. :70.00 Max. :56.80 Max. :696.0
## EDUC OBESE
## 1:1935 Min. :0.0000
## 2:1281 1st Qu.:0.0000
## 3: 716 Median :0.0000
## 4: 502 Mean :0.1295
## 3rd Qu.:0.0000
## Max. :1.0000
Now when we view a summary of the data, we see there are no longer
any missing values for OBESE
. This is because, when we
recreate the OBESE
variable and give it the same
name, the previous version of the variable is overwritten.
A key point… There is no single “correct” way to handle missing data; the best approach depends on the specific context, the data, and the analysis being conducted. What’s most important is to carefully document and report how missing data were addressed to ensure transparency and reproducibility. This includes clearly explaining the methods used and why they were chosen. While different approaches may be valid, decisions should be justified based on their appropriateness for the data and the research questions being addressed. This helps maintain the integrity of the analysis and allows others to evaluate and replicate your work.
Note: Advanced methods for addressing missing data, such as multiple imputation or modeling missingness, are available for more complex datasets and analyses but are beyond the scope of this course.
Now, since we have done the hard work of addressing missing data (at least, for our selected variables of interest), let’s export the modified dataset for later use. This time, we will call the resulting CSV file “frmgham2_p1_impute.csv” to indicate that this dataset contains only the baseline sample (Period 1), and missing data have been addressed using simple imputation.
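# export the imputed dataset as a CSV file
# (row.names = FALSE prevents R from adding a column of row numbers)
write.csv(data_impute, "frmgham2_p1_impute.csv", row.names = FALSE)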
In Lab 2, we’ve built a strong foundation in exploring and understanding data. You’ve learned how to summarize variables, create visualizations, identify outliers, explore relationships between variables, and address missing data. These are essential first steps in any analysis. Moving forward, we’ll dive into topics like estimation, hypothesis testing, and regression analysis, which will allow us to make inferences about populations and examine relationships in greater depth.
When you are ready, please submit the following to the Lab 2 assignment page on Canvas:
- Your R Markdown file, with the .Rmd extension
- The knitted .html file

Please reach out to me at jenny.wagner@csus.edu if you have any questions. See you in class!