Visualizing uncertainty
Overview
In this module, we will deal with some ways to visualize variability and uncertainty in data or models. In most cases, we will visualize descriptive statistics like the mean and standard error as well as confidence intervals. If you like using formulas for plotting, you should check out {ggformula}.
Uncertainty can be communicated in data visualization using lines, shading and colors, bars, distributions, rugs, etc. The goal is may be communicate variability on its own or to complement measures like means or medians, which represent best guesses (point estimates) about centers of distributions. Central tendency measures by their very nature fail to account for the differences among events in a distribution. Models and data, however, typically have associated uncertainty, which should be communicated in some way.
To Do
Watch corresponding Canvas content.
Optional Readings
External Functions
Provided in class:
view_html()
: for viewing data frames in html format, from /r/my_functions.R
You can use this in your own work space but I am having a challenge rendering this of the website, so I’ll default to print()
on occasion.
source(here::here("src", "my_functions.R"))
Custom Functions
We will use some custom functions to handle some tasks this module.
Libraries
- {dplyr} 1.1.4: for selecting, filtering, and mutating
- {ggplot2} 3.5.1: for plotting
- {ggdist} 3.3.2: for plotting distributions
- {distributional} 0.4.0: for plotting distributional information
- {tidyr} 1.3.1: for tidying up models
- {broom} 1.0.6: for cleaning up models
Load libraries
::p_load(dplyr, ggplot2, ggdist, distributional, tidyr, broom) pacman
Loading Data
We will use some swimming event times which can be accessed from:
https://raw.githubusercontent.com/slicesofdata/dataviz24/main/data/processed/cleaned-2023-cms-invite.csv
To access the data, either read the file directly from the url using read.csv()
and assign the data frame a name like SWIM
:
read.csv("https://raw.githubusercontent.com/slicesofdata/dataviz24/main/data/processed/cleaned-2023-cms-invite.csv")
Or download it and save to the /data/processed
directory and read from there.
<- read.csv(here::here("data", "processed", "cleaned-2023-cms-invite.csv")) SWIM
Measurement
Measurement is the systematic assignment of numbers to an event, or object, based on the properties inherent in that event, or object. For example, the properties of an apple may be its color, size, shape, sweetness, acidity, texture, etc. The properties of an individual may be ones height, weight, BMI, genetics, hair color, eye color, graphical literacy, decision-making ability, working-memory capacity, etc.
Statistical research is the pursuit of understanding the systematic relationships between events that vary. A key goal of statistics is to understand the systematic relationship between predictor and outcome variables. If fact, we don’t really need statistics at all if there is no variability in measurement from person to person, event to event, or sample to sample. Variability in measurement, and capturing that variability is the essence of statsics.
Variability and Uncertainty in Data
When we deal with variables, rather than constants, we have variability. By definition, variables represent events that can be counted or measured and may take on different values. Variables are not constants. Whereas constants introduce no ambiguity, variability, or uncertainty, variables capture differences in measurement of event properties. These differences, or changes, in measurement value may occur from context to context, person to person, time to time, etc. When all individuals perform the same on some measurable event (e.g., math accuracy, long jump, money in pocket, etc.) there is no variability. By contrast, differences in measurement across individuals can reflect uncertainty about performance.
To the extent that an individual differs in number of millimeters jumped in 10 attempts of the standing broad jump, there is variability and uncertainty in this ability. Similarly, differences in the number of millimeters jumped by 10 individuals reveals differences or variability among individuals (plus or minus other influences affecting a single jumping instance). To the extent that differences among individuals are not the result of random influence, they must be due to other measurable properties (e.g., age, height, weight, strength, balance, motor coordination, angle of knee flexion, angle of hip flexion, take-off velocity, preparation, push-off, and float, etc.).
Of course, all jumping events can be aggregated in an effort to summarize and estimate the central tendency of jumping ability (e.g., mean, median, mode) but this aggregation process does not remove the change in measurement from jump to jump called variability.
Point Estimates
We can group the data and summarize by groups using means in order to plot those mean point estimates are bars. This is the traditional, though not personally recommended, approach to plotting data in various disciplines. This will serve as a starting point for plots. Let’s leverage stat_summary()
rather than compute manually with summarize()
.
|>
SWIM ggplot(mapping = aes(x = Team, y = Time)) +
stat_summary(fun = mean, geom = "bar") +
labs(title = "Mean Freestyle Swim Time", y = "Seconds")
Compared with the count data, the means are simply point estimates of the central tendency for sample data which are often used in service of estimating unknown population parameters. This is why a mean is called a point estimate as it estimate a single value, or point, of a distribution. Plotting means does not provide information about variability in the sample data or in the uncertainty around the mean’s ability to estimate the sample’s center generally and more specifically its corresponding population parameter.
Visualizing Variability
Whereas point estimates provide statistical information about central tendency and are represented visually using points, bars, or boxes, interval estimates and dispersion measures like standard deviations, standard errors of the mean, confidence intervals, etc. are represented using vertical lines, crossbars, or error bars. {ggplot2} has four geoms for visualizing statistical measures of variability: geom_crossbar()
, geom_errorbar()
, geom_linerange()
, geom_pointrange()
. Note: Legends may be hidden for illustrative purposes.
A descriptives()
function
In order to create some visualizations, we will first summarize the data and then create a base plot to overlay different metrics for examples. We could use group_by()
, summarize()
, and ungroup()
but you can always create your own functions to perform these redundant operations.
In order to summarize the data, we will use a custom function I wrote and named descriptives()
, which serves as a “Offiziersmesser” (“officer’s knife”, aka Swiss Army Knife), type function for descriptive statistics. This function takes a vector or a data frame and returns the n, mean, trimmed mean, median, standard deviation, standard error, skewness, kurtosis, sum, minimum, maximum, range, interquartile range, median absolute deviation, and confidence interval values for a numeric vector. Both trim
and conf
can be adjusted as needed.
We will also calculate the range of years in order to facilitate the axis scale. And we will also make a simple function to cycle though some colors for making the points in the adjacent years easier to distinguish. We only need to colors for contrasting adjacent years,
How does descriptives()
work? There are there parts.
data
: the data object, which can be a vector or a data frame/tibblegroupby
: the grouping parameter leveraginggroup_by()
var
: the outcome variable vector for describing statistically
You can source the function this way:
source("https://raw.githubusercontent.com/slicesofdata/dataviz24/main/src/functions/describe.R")
Done Defining describe.R
Let’s get the descriptives()
for the SWIM
data, grouped by Team
and Event
for Time
.
descriptives(data = SWIM,
groupby = c(Team, Event),
var = Time
)
# A tibble: 12 × 25
Team Event n mean mean.trim mdn smallest_mode modefreq modeprop
<chr> <chr> <int> <dbl> <dbl> <dbl> <chr> <int> <dbl>
1 Men Backstroke 7 90.6 90.6 110. 53.95 1 0.143
2 Men Breaststro… 11 84.2 82.2 59.4 55.17 1 0.0909
3 Men Butterfly 15 65.6 62.3 52.4 49.66 1 0.0667
4 Men Freestyle 40 81.2 68.3 49.8 21.34 1 0.025
5 Men IM 10 131. 119. 120. 112.33 1 0.1
6 Mixed Freestyle 15 94.0 93.9 93.5 88.14 1 0.0667
7 Mixed Medley 15 104. 104. 104. 97.74 1 0.0667
8 Women Backstroke 8 106. 106. 129. 62.81 1 0.125
9 Women Breaststro… 8 108. 108. 109. 62.44 1 0.125
10 Women Butterfly 12 89.3 87.6 62.3 57.29 1 0.0833
11 Women Freestyle 47 142. 92.9 113. 24.16 1 0.0213
12 Women IM 13 155. 146. 133. 127.94 1 0.0769
# ℹ 16 more variables: sd <dbl>, se <dbl>, skew <dbl>, kurt <dbl>, min <dbl>,
# max <dbl>, range <dbl>, iqr <dbl>, q25 <dbl>, q75 <dbl>, mad <dbl>,
# sum <dbl>, ci.95l <dbl>, ci.95u <dbl>, ci.99l <dbl>, ci.99u <dbl>
All metrics are lowercase, so if passing the returned object to ggplot()
, we can plot the data by mapping mapping = aes(x = Team, y = mean.trim)
. If we want only freestyle data, we can filter for it. Otherwise, you will need to
descriptives(SWIM, groupby = c(Team, Event), var = Time) |>
ggplot(mapping = aes(x = Team, y = mean.trim, color = Event)) +
geom_point(position = position_dodge2(width = 1)) +
labs(title = "Mean Freestyle Swim Time", y = "Seconds")
If you do not need all of the variables extracted with descriptives()
, you can use stat_summary()
with some metrics already built into {ggplot2}, or you can create your own functions to pass to fun
or fun.data
.
geom_pointrange()
:
A geom_pointrange()
is quite simply a combination of a geom_point()
and a geom_line()
. It also provides more detail than a geom_linerange()
, which is just a line connecting two points. For both geoms, you will need to specify where the line along the y axis starts and where it ends by mapping variables to ymin
and ymax
. For these examples, the values will be obtained using descriptives()
but you could use {{dplyr}} to subset and summarize the data too. If you want to map other aes()
thetics, for example, shape or color, you can do that as well.
geom_pointrange(
mapping = NULL,
data = NULL,
stat = "identity",
position = "identity",
...,fatten = 4,
na.rm = FALSE,
orientation = NA,
show.legend = NA,
inherit.aes = TRUE
)
stat_summary()
to create a geom_pointrange()
with +/- 1 standard deviation
|>
SWIM ggplot(mapping = aes(x = Event, y = Time, color = Team)) +
stat_summary(fun.data = mean_sdl,
geom = "pointrange")
Well, that’s not quite right. Remember that geom_pointrange()
will require a y
, ymin
, and ymax
for the mean and standard deviations, respectively. We need to use fun.data = mean_sd1
rather than fun = mean
.
|>
SWIM ggplot(mapping = aes(x = Event, y = Time,
color = Team)) +
stat_summary(fun.data = mean_sdl,
geom = "pointrange")
OK, closer but not quite right. We need to change the spatial positioning. One positioning issue that you will experience with geom_point()
or geom_pointrange()
relates to mapping a third variable to color
, fill
, or shape
which could all be mapped to Team
. You have seen points plotted with these aesthetics before and addressed overplotting before by jittering them. When using geom_pointrange()
, you can immediately notice a similar challenge; the points corresponding to the same x-axis position may have overlapping error variability lines. Because the lines are so thin, adjusting opacity will not really fix the problem.
Adjusting positioning using position = position_jitter()
will move the change point position but will do so that will be inconsistent across the x variable, whether categorical or numeric creating asymmetrical positioning. We can also map shape
as well for some clarity to see if that helps.
|>
SWIM ggplot(mapping = aes(x = Event, y = Time,
color = Team, shape = Team)) +
stat_summary(fun.data = mean_sdl,
geom = "pointrange",
position = position_jitter())
The problem is that jitter functions apply a random jittering for each level of x here. Setting the seed
for the process will ensure consistency every function call but will not ensure consistency across each level of the x variable as seen in the plot.
When you really just want the points be get out of each others way, you can use position = position_dodge2()
to make the points dodge side-to-side from the central positioning in a symmetrical manner. position_dodge2()
relative to position_dodge()
also does not require a group
to be defined in the geom_*()
or the global ggplot()
object. However, you will likely need to set width
to a value of 1 or less when your x variable is categorical in order to avoid something unappealing. Here are some examples.
|>
SWIM ggplot(mapping = aes(x = Event, y = Time,
color = Team, shape = Team)) +
stat_summary(fun.data = mean_sdl,
geom = "pointrange",
position = position_dodge2(width = 1)
)
<-
dodge_1 |>
SWIM ggplot(mapping = aes(x = Event, y = Time,
color = Team, shape = Team)) +
stat_summary(fun.data = mean_sdl,
geom = "pointrange",
position = position_dodge2(width = 1)
+
) labs(title = "position_dodge2(width = 1)",
tag = "A") + coord_flip()
<-
dodge_2 |>
SWIM ggplot(mapping = aes(x = Event, y = Time,
color = Team, shape = Team)) +
stat_summary(fun.data = mean_sdl,
geom = "pointrange",
position = position_dodge2(width = 2)
+
) labs(title = "position_dodge2(width = 2)",
tag = "B") + coord_flip()
<-
dodge_5 |>
SWIM ggplot(mapping = aes(x = Event, y = Time,
color = Team, shape = Team)) +
stat_summary(fun.data = mean_sdl,
geom = "pointrange",
position = position_dodge2(width = .5)
+
) labs(title = "position_dodge2(width = .5)",
tag = "C") + coord_flip()
<-
dodge_25 |>
SWIM ggplot(mapping = aes(x = Event, y = Time,
color = Team, shape = Team)) +
stat_summary(fun.data = mean_sdl,
geom = "pointrange",
position = position_dodge2(width = .25)
+
) labs(title = "position_dodge2(width = .25)",
tag = "D") + coord_flip()
suppressMessages(
plot(
::arrangeGrob(dodge_1, dodge_2, dodge_5, dodge_25,
gridExtrancol = 2
)))
Plot A will not achieve the dodge you desire and something something too small may not lead to enough change in position. Plot Β solves the issue but is challenged by the Gestalt perceptional grouping principle of proximity. If you are curious about these principles in UX design Nielsen Norman Group also has a post on this issue. The dodging associated with width = 1
does not facilitate the grouping of Team
at each level of Event
because the spacing between Team
s for each event is the same as the spacing of Team
s across Event
s. Reduce the dodge so that proximity grouping facilitates plot perception and interpretation. Keep in mind that plot aspect ratios (see here) can also affect positioning and proximity in some cases.
stat_summary()
to create a geom_pointrange()
with +/- 1 standard error, use fun.data = mean_se
stat_summary()
to create a geom_pointrange()
with confidence intervals, use fun.data = mean_cl_normal
or fun.data = mean_cl_boot
for a bootstrapped version.
stat_summary()
to create a geom_linerange()
A geom_linerange()
simply visualizes a line plot that starts at one value and ends at another value.
geom_linerange(
mapping = NULL,
data = NULL,
stat = "identity",
position = "identity",
...,na.rm = FALSE,
orientation = NA,
show.legend = NA,
inherit.aes = TRUE
)
We can approach line range plots with the same fun.data
arguments or pass two values.
|>
SWIM ggplot(mapping = aes(x = Event, y = Time,
color = Team, shape = Team)) +
stat_summary(fun.min = min,
fun.max = max,
geom = "linerange",
linewidth = 3,
position = position_dodge2(width = 1)
+ coord_flip() +
) labs(title = "Mean Swim times for Stags and Athenas by Event",
caption = "lines represent ranges from min to max times")
|>
SWIM ggplot(mapping = aes(x = Event, y = Time,
color = Team, shape = Team)) +
stat_summary(fun.data = mean_se,
geom = "linerange",
position = position_dodge2(width = 1)
+ coord_flip() +
) labs(title = "Mean Swim times for Stags and Athenas by Events",
caption = "lines represent /- 1 bootstrapped standard error of the mean")
Such a visualization shows clearly where the data start and stop and allow for comparisons. Compared with geom_pointrange()
, geom_linerange()
only creates a perceptual grouping based on color. Because there are no points to plot, you cannot also change the shape
of the points in order to make the grouping of Team
redundant with color and point shape. Redundant encoding is something we will address in another module on designing perceptually-efficient visualizations). If you wish to achieve this redundancy, you will need to vary the linetype
. You can map the aesthetic to Team
, set it specifically with scale_linetype_manual()
, or code the line type into the data frame and use scale_linetype_identity()
. You can specify linetype
by name or by number: 0 (“blank”), 1 (“solid”), 2 (“dashed”), 3, 4, 5, 6, etc. When passing values
in scale_linetype_manual()
, keep in mind this is a vector and vectors can be numeric or character but not both so you cannot mix numbers and strings for line types.
For more on line type, read here
|>
SWIM ggplot(mapping = aes(x = Event, y = Time,
color = Team, shape = Team,
linetype = Team
+
)) stat_summary(fun.data = mean_se,
geom = "linerange",
linewidth = 1,
position = position_dodge2(width = 1)
+
) coord_flip() +
scale_linetype_manual(values = c(Men = "dotted", Women = "longdash", Mixed = "solid")) +
labs(title = "Mean Swim times for Stags and Athenas by Events",
caption = "lines represent /- 1 bootstrapped standard error of the mean")
geom_errorbar()
:
Error bars are likely the most familiar visual form of of uncertainty you see in data visualization. Error bars represent the measurable error associated with data cases deviating from the distribution’s mean and is most typically the standard error of the mean. Without delving too deeply into concepts of statistics, the standard error of the mean is calculated as standard deviation / square root of the sample size
. Although there are libraries like {plotrix} containing functions for it, its calculation is so simple you don’t need to bother with external libraries.
The describe()
function calculates the standard error of the mean as se
.
geom_errorbar()
will require setting the ymin
and ymax
values for the error bars. Because the se
reflects error around the mean
, we will need to add and subtract the se
to and from the mean
in order to determine its upper and lower limits.
geom_errorbar()
with standard error
+
swim_base_plot geom_errorbar(mapping = aes(ymin = mean - se,
ymax = mean + se
) )
Out of the box, the bars will overlap and they will can been quite large, thus requiring some adjustment. We will position_dodge2()
the bars to prevent overlapping, change the linewidth
to be more prominent.
+
swim_base_plot geom_errorbar(mapping = aes(ymin = mean - se,
ymax = mean + se
),position = position_dodge2(),
linewidth = .7,
width = .3 # make the horizontal bars shorter
)
geom_errorbar()
with confidence intervals
+
swim_base_plot geom_errorbar(mapping = aes(ymin = ci.99l,
ymax = ci.99u
),position = position_dodge2(),
linewidth = .7,
width = .3 # make the horizontal bars shorter
)
Before even adding error bars, this geom_col()
represents an excellent example of a pesky problem with out-of-the-box plots containing missing data. In any given data set, you might not have perfectly tidy data frames with data for all variable x variable combinations. For example, you might have data on the number of steps you walk during the morning and the afternoon (two levels of a time factor) for every day of the week (7 measures of another time variable) and you would have 2 x 7 = 14 bars to present in a plot. But if on Saturdays you sleep in past noon, you never have any data for the morning on Saturdays and you will have only 13 bars for your plot.
The above plot illustrates what geom_col()
does when you have this data imbalance. When both bars are missing, you will see an empty space on the plot. You see that for 2021. But when only half the data are present, a single bar usurps the space of two bars.
*Making bars the same width**
When you read the docs for position_dodge()
or position_dodge2()
, you see that you can set a preserve
argument (e.g., “should dodging preserve the”total” width of all elements at a position, or the width of a “single” element?“). Clearly, we want to fix the width using preserve = "single"
. The way that position_dodge()
and position_dodge2()
handle this aesthetically differs so we can use both for comparison. You can decide what looks better for your own plots.
suppressMessages(
plot(
::arrangeGrob(
gridExtra+
swim_base_plot geom_col(position = position_dodge(preserve = "single")) +
labs(title = 'position_dodge(preserve = "single"))',
tag = "A"
),+
swim_base_plot geom_col(position = position_dodge2(preserve = "single")) +
labs(title = 'position_dodge2(preserve = "single"))',
tag = "B"
),ncol = 1
)))
The bars are now all the same width. For position_dodge2()
, the single bar is center-aligned whereas position_dodge()
aligns it to the left. position_dodge2()
seems like a better option.
Add the error bars using geom_errorbar()
:
We now will add the error bars to the plot. Just as we did for geom_linerange()
, we will map the ymin
and ymax
to the for the line to terminate.
Bars with Standard Errors
+
swim_base_plot geom_col(position = position_dodge2(preserve = "single")) +
scale_fill_manual(values = c(Women = "firebrick",
Men = "cornflowerblue",
Medley = "grey60"
)+
) geom_errorbar(mapping = aes(ymin = mean - se,
ymax = mean + se
) )
OK, This is hideous! The error bars are not positioned with the bars and they are rather wide. To address the positioning, remember that we dodged the bars/columns, specifically using position_dodge2(preserve = "single")
, so we need to similarly adjust the positioning for the error bars.
+
swim_base_plot geom_col(position = position_dodge2(preserve = "single"),
alpha = .9,
color = "grey50" # make the bar outline color the same and
+
) scale_fill_manual(values = c(Women = "firebrick",
Men = "cornflowerblue",
Medley = "grey60"
)+
) geom_errorbar(mapping = aes(ymin = mean - se,
ymax = mean + se
),position = position_dodge2(preserve = "single"),
color = "black", # set them to all be the same color
linewidth = .6,
)
Just remember that with multi-layered plots, the layers are added on top of existing ones. Starting with geom_errorbar()
and then adding geom_col()
will result in the lower portion of the error bars behind masked by the columns, especially if alpha = 1
.
+
swim_base_plot geom_col(position = position_dodge2(preserve = "single"),
alpha = 1,
color = "grey50" # make the bar outline color the same and
+
) geom_errorbar(mapping = aes(ymin = mean - se,
ymax = mean + se
),position = position_dodge2(preserve = "single", width = 1),
color = "black", # set them to all be the same color
linewidth = .6
+
) scale_fill_manual(values = c(Women = "firebrick",
Men = "cornflowerblue",
Medley = "grey60"
) )
Confidence Intervals
We will create the same plot using confidence intervals.
+
swim_base_plot geom_col(position = position_dodge2(preserve = "single"),
alpha = .8,
color = "grey50" # make the bar outline color the same and
+
) geom_errorbar(mapping = aes(ymin = ci.99l,
ymax = ci.99u
),position = position_dodge2(preserve = "single"),
color = "black", # set them to all be the same color
linewidth = .6
+
) scale_fill_manual(values = c(Women = "firebrick",
Men = "cornflowerblue",
Mixed = "grey60"
)+
) labs(title = "Mean Time for Events in 2023",
tag = "",
x = NULL, y = "Seconds",
caption = "M = blue, F = red, Medley = grey\nbars = 99% CI"
)
The geom_pointrange()
may likely be a better visualization of the data than the geom_errobar()
paired with geom_col()
.
Models
Statistical models also have associated uncertainty because they are built on the data that have natural variability. We have seems some of this variability and uncertainty of models in the module on visualizing associations.
Let’s build a model, however, and then see it’s associated uncertainty.
<-
FREE_100 |>
SWIM filter(Event == "Freestyle") |>
filter(Team != "Mixed") |>
filter(Distance == 100) |>
filter(Time < 500)
Let’s not worry about whether a linear or nonlinear model is a better fit of the data and let’s not worry about the differences in variable relationship differs for Team
s. Let’s just build a linear model to predict Time
from Split50
to illustrate a point.
<- lm(Time ~ Split50, data = FREE_100)
fit
|>
fit ::tidy() |>
broom::kable(format = "markdown") knitr
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.0290832 | 1.6497412 | -0.017629 | 0.9860556 |
Split50 | 2.0817834 | 0.0659113 | 31.584599 | 0.0000000 |
Standardizing model coefficients. Interpreting statistics is not the focus of this set of modules. To understand more about interpreting model coefficients and standardized coefficients, see this tutorial or search the web for others.
Model Uncertainty Using a Table
::lm.beta(fit) |>
lm.beta::tidy() |>
broom::kable(format = "markdown") knitr
term | estimate | std_estimate | std.error | statistic | p.value |
---|---|---|---|---|---|
(Intercept) | -0.0290832 | NA | 1.6497412 | -0.017629 | 0.9860556 |
Split50 | 2.0817834 | 0.9857743 | 0.0659113 | 31.584599 | 0.0000000 |
The tables contain the model coefficients to quantify elements like the linear relationship between the variables, the error or uncertainty in the model fit, etc. We can see that Split50
predicts the Time
for the 100 Freestyle. The association is not perfect, however. There is some error, or uncertainty, in the model as indicated by the model std.error
.
Confidence Intervals for Model Fit
confint(fit) |>
::kable(format = "markdown") knitr
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | -3.403183 | 3.345016 |
Split50 | 1.946980 | 2.216587 |
::lm.beta(fit) |>
lm.betaconfint() |>
::kable(format = "markdown") knitr
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | NA | NA |
Split50 | 0.8509705 | 1.120578 |
Model Uncertainty Using a Plot
Let’s apply geom_smoth()
to add a fit line.
<-
plot_lm |>
FREE_100 ggplot(mapping = aes(
x = Split50,
y = Time
+
)) geom_point(alpha = .5) +
geom_smooth(method = "lm",
fullrange = TRUE,
color = "black",
fill = "firebrick"
+
) theme_light()
plot_lm
`geom_smooth()` using formula = 'y ~ x'
We can see the linear model fit as a line and the shaded area around that fit line indicates uncertainty of the model parameters. Because the model is not perfect, there is uncertainty about the true fit. Looking at ?geom_smooth
, you will see the argument for level = 0.95
which helps define the uncertainty to visualize. Specifically, it defines the width of the shaded bands around the linear regression line in the plot. The bands represent the range within which the true regression line should lie given some degree of confidence. Thus, with level = 0.95
, the confidence interval of 95%. We can see a different version by changing the level = .99
for a 99% confidence interval.
<-
plot_lm99 |>
FREE_100 ggplot(mapping = aes(
x = Split50,
y = Time
+
)) geom_point(alpha = .5) +
geom_smooth(method = "lm",
fullrange = TRUE,
color = "black",
fill = "firebrick",
level = .99
+
) theme_light()
plot_lm99
`geom_smooth()` using formula = 'y ~ x'
The bands are wider now because they are more likely to capture the true population parameter predicted by the model. The bands can vary in width based on the number of data points contributing to the prediction of y at any given x value but the bands will be most narrow at the model centroid (the point corresponding to the mean of x and the mean of y). Mapping aesthetics to a new point will illustrate this. The model needs to pass through this point.
= aes(x = mean(FREE_100$Split50),
mapping y = mean(FREE_100$Time))
<- FREE_100 |>
plot_lm ggplot(mapping = aes(
x = Split50,
y = Time
+
)) geom_point(alpha = .5) +
geom_smooth(method = "lm",
fullrange = TRUE,
color = "black",
fill = "firebrick"
+
) theme_light() +
geom_point(mapping = aes(x = mean(FREE_100$Split50),
y = mean(FREE_100$Time)
),size = 10,
shape = "*",
color = "blue"
)
plot_lm
We can remove the model error from the plot using se = FALSE
but but doing so is not very honest communication.
|>
FREE_100 ggplot(mapping = aes(
x = Split50,
y = Time
+
)) geom_point() +
geom_smooth(method = "lm",
fullrange = TRUE,
se = FALSE,
color = "firebrick"
+
) theme_light()
But even if a linear model did fit the data perfectly, the set of coefficients obtained were from a single model and that single model is based on the athletes who participated in events. What would the model look like if it did not include just those athletes but instead includes athletes who were sick and sat the sidelines or those who could have been disqualified for some reason?
Bootstrap Models
rsample::bootstraps()
will allow us to take samples from the full data set and run multiple models using various subsets of that full data set. Doing so will provide models that do not include best athletes, do not include worst athletes, include various mixtures, etc. The goal is not to teach bootstrapping methods but to help you understand how models are fit and how they differ, thus illuminating uncertainty in a different way than with geom_smooth()
. The code is not provided as this just illustrates a plot of bootstrapped models.
# Bootstrap sampling with apparent sample
# A tibble: 1,001 × 3
splits id model
<list> <chr> <list>
1 <split [31/10]> Bootstrap0001 <lm>
2 <split [31/12]> Bootstrap0002 <lm>
3 <split [31/12]> Bootstrap0003 <lm>
4 <split [31/10]> Bootstrap0004 <lm>
5 <split [31/11]> Bootstrap0005 <lm>
6 <split [31/10]> Bootstrap0006 <lm>
7 <split [31/13]> Bootstrap0007 <lm>
8 <split [31/8]> Bootstrap0008 <lm>
9 <split [31/13]> Bootstrap0009 <lm>
10 <split [31/10]> Bootstrap0010 <lm>
# ℹ 991 more rows
You can see some bootstrapped mode coefficients here.
# A tibble: 6 × 8
splits id model term estimate std.error statistic p.value
<list> <chr> <lis> <chr> <dbl> <dbl> <dbl> <dbl>
1 <split [31/10]> Bootstrap00… <lm> (Int… 0.282 0.929 0.304 7.63e- 1
2 <split [31/10]> Bootstrap00… <lm> Spli… 2.06 0.0374 55.0 6.97e-31
3 <split [31/12]> Bootstrap00… <lm> (Int… -0.644 1.32 -0.487 6.30e- 1
4 <split [31/12]> Bootstrap00… <lm> Spli… 2.10 0.0521 40.3 5.10e-27
5 <split [31/12]> Bootstrap00… <lm> (Int… -0.142 1.33 -0.107 9.16e- 1
6 <split [31/12]> Bootstrap00… <lm> Spli… 2.08 0.0543 38.3 2.14e-26
Because there are more than one model, we can visualize distributions of the coefficients as a histogram.
The mean and the standard deviation of the bootstrapped models:
term | estimate_mean | estimate_sd |
---|---|---|
(Intercept) | -0.11 | 1.39 |
Split50 | 2.08 | 0.06 |
Compare the mean with the coefficient from the single model:
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.03 | 1.65 | -0.02 | 0.99 |
Split50 | 2.08 | 0.07 | 31.58 | 0.00 |
Plotting Bootstrapped Model Fits (Variants)
Each model fit is plotted as a very this light red line in the plot. In fact, there are 1000 different models fit through the points. Because each model includes a difference subset of athletes, the mean of the variables will differ based on the data used for each model. Thus, each model has its own centroid so there is no single point through which all models must pass. Nevertheless, you can see the most narrow part and darkest coloring (indicating more lines overlapping) of the band is located near the location of the original centroid. Also, upper right part of the plot is lighter than the lower left because there are fewer points in the upper right and thus there is corresponding uncertainty to visualize.
Plotting Model Error Bars
Using {tidyr} we can create nested subsets of data using tidyr::nest()
and then we can run models an each subset. We can group using .by
and pass a vector of variable names for grouping. Make sure that you don’t have NA
s in your data frames.
<- SWIM |>
nested filter(!is.na(Time)) |>
filter(!is.na(Split50)) |>
filter(Distance < 500) |>
filter(Event != "IM") |>
::nest(.by = c(Event, Distance)) tidyr
The first instance is in data
.
$data[[1]] nested
# A tibble: 14 × 8
Year School Team Relay Name Age Time Split50
<int> <chr> <chr> <chr> <chr> <int> <dbl> <dbl>
1 2023 Pomona-Pitzer-CA Mixed Relay <NA> NA 97.7 26.4
2 2023 Claremont-Mudd-Scripps-CA Mixed Relay <NA> NA 101. 24.4
3 2023 Claremont-Mudd-Scripps-CA Mixed Relay <NA> NA 102. 24.1
4 2023 Pomona-Pitzer-CA Mixed Relay <NA> NA 102. 25.0
5 2023 Claremont-Mudd-Scripps-CA Mixed Relay <NA> NA 103. 24.4
6 2023 Pomona-Pitzer-CA Mixed Relay <NA> NA 103. 27.5
7 2023 Claremont-Mudd-Scripps-CA Mixed Relay <NA> NA 104. 28.5
8 2023 Pomona-Pitzer-CA Mixed Relay <NA> NA 104. 26.8
9 2023 Pomona-Pitzer-CA Mixed Relay <NA> NA 104. 25.8
10 2023 Claremont-Mudd-Scripps-CA Mixed Relay <NA> NA 104. 24.8
11 2023 Pomona-Pitzer-CA Mixed Relay <NA> NA 105. 25.4
12 2023 Claremont-Mudd-Scripps-CA Mixed Relay <NA> NA 106. 29.9
13 2023 Claremont-Mudd-Scripps-CA Mixed Relay <NA> NA 107. 30.4
14 2023 Pomona-Pitzer-CA Mixed Relay <NA> NA 113. 30.9
You see we have a tibble that contains nested subsets of data. There are not much data for some events but the goal is only to show how to visualize model error. We will now us Base R
lapply()
to apply a function to a list. For each nested data frame, the data will be .x
. The model fit is returned and
|>
SWIM filter(!is.na(Time)) |>
filter(!is.na(Split50)) |>
filter(Distance < 500) |>
filter(Event != "IM") |>
::nest(.by = c(Event, Distance)) |>
tidyr::mutate(models = lapply(X = data,
dplyrFUN = function(x) lm(Time ~ Split50, data = x)
) )
# A tibble: 9 × 4
Event Distance data models
<chr> <int> <list> <list>
1 Medley 200 <tibble [14 × 8]> <lm>
2 Butterfly 100 <tibble [19 × 8]> <lm>
3 Freestyle 200 <tibble [49 × 8]> <lm>
4 Breaststroke 100 <tibble [11 × 8]> <lm>
5 Backstroke 100 <tibble [6 × 8]> <lm>
6 Backstroke 200 <tibble [9 × 8]> <lm>
7 Freestyle 100 <tibble [31 × 8]> <lm>
8 Breaststroke 200 <tibble [8 × 8]> <lm>
9 Butterfly 200 <tibble [8 × 8]> <lm>
Great! We have a tibble of linear models for each Event
and Distance
pair. Using the {broom} library, perform some model cleaning using broom::tidy()
to return a cleaned model and assign it as a column in the tibble. Using lapply
, apply the broom::tidy()
function on each model in the list. Finally, because the models are nested, tidyr::unest()
them.
<-
nested |>
SWIM filter(!is.na(Time)) |>
filter(!is.na(Split50)) |>
filter(Distance < 500) |>
filter(Event != "IM") |>
::nest(.by = c(Event, Distance)) |>
tidyr::mutate(models = lapply(X = data,
dplyrFUN = function(x) lm(Time ~ Split50, data = x)
),tidy_mods = lapply(X = models, FUN = broom::tidy)
|>
) ::unnest(cols = tidy_mods) tidyr
The tibble is messy, so let’s clean it up a bit by removing the intercept term. Also, we don’t need columns like models
or data
.
<-
nested |>
SWIM filter(!is.na(Time)) |>
filter(!is.na(Split50)) |>
filter(Distance < 500) |>
filter(Event != "IM") |>
::nest(.by = c(Event, Distance)) |>
tidyr::mutate(models = lapply(X = data,
dplyrFUN = function(x) lm(Time ~ Split50, data = x)
),tidy_mods = map(models, broom::tidy)
|>
) ::unnest(cols = tidy_mods) |>
tidyrfilter(term != "(Intercept)") |>
select(-c(models, data))
#nested
So we now have a tibble with nested model coefficients. We can visualize some of the models and their errors. In the tibble, estimate
is the estimate and std.error
is the error. We can create a 95% confidence interval with lower and upper bounds by subtracting and adding 1.96*std.error
(use 1.645 for 90% CI or 2.576 for a 99% CI). Map the color to the Distance
column.
|>
nested mutate(Distance = as.character(Distance)) |>
ggplot(mapping = aes(
x = Event, y = estimate,
ymin = estimate - 1.96*std.error,
ymax = estimate + 1.96*std.error,
color = Distance
+
)) geom_pointrange(position = position_dodge2(width = .5)
+
) scale_y_continuous(n.breaks = 20) +
theme(legend.position = "top")
Using the {ggdist} and {distributional} libraries, we can plot the distributions of errors as well.
|>
nested mutate(Distance = as.character(Distance)) |>
ggplot(mapping = aes(x = estimate, y = Event, color = Distance)) +
::stat_dist_halfeye(
ggdistmapping = aes(dist = distributional::dist_normal(
mu = estimate,
sigma = std.error)
),point_size = 3
)
Box Plots
Distributions can also be visualized with box plots as we have seen before. As with geom_col()
used with geom_errorbar()
, we will need to change the boxplot spatial positioning using position = position_dodge2(preserve = "single")
so that there are boxes that are twice as wide as others, unless you like that aesthetic.
Let’s assign some color to col_values
to use in the plot.
Using stat_summary()
The above plots can also be created with stat_summary()
as long as you specify the fun
for calculating the statistics and the geom
for the plot type. Because stat_summary()
calculates the statistics, you can use the original data frame. However, in some instances, the function arguments are likely not as intuitive.
pointrange
:
A “pointrange” is just a point plot without other values like fun.min
and fun.max
.
|>
SWIM ggplot(mapping = aes(x = Event, y = Time, color = Team)) +
stat_summary(fun = mean,
geom = "pointrange",
position = position_dodge2(width = 1)
)
Warning: Removed 12 rows containing missing values or values outside the scale range
(`geom_segment()`).
But ranges are very crude and do not represent average dispersion in the data.
|>
SWIM ggplot(mapping = aes(x = Event, y = Time, color = Team)) +
stat_summary(fun = mean,
geom = "pointrange",
fun.min = min,
fun.max = max,
position = position_dodge2(width = 1)
)
Rather, for fun.max
and fun.min
, we can pass a function(x) mean(x) + sd(x) / sqrt(length(x))
for the standard error.
|>
SWIM ggplot(mapping = aes(x = Event, y = Time, color = Team)) +
stat_summary(fun = mean,
geom = "pointrange",
fun.max = function(x) mean(x) + sd(x) / sqrt(length(x)), # se
fun.min = function(x) mean(x) - sd(x) / sqrt(length(x)), # se
position = position_dodge2(width = 1)
)
errorbar
:
For an errorbar
, just change the geom
.
|>
SWIM ggplot(mapping = aes(x = Event, y = Time, color = Team)) +
stat_summary(fun = mean,
geom = "errorbar",
fun.max = function(x) mean(x) + sd(x) / sqrt(length(x)), # se
fun.min = function(x) mean(x) - sd(x) / sqrt(length(x)), # se
position = position_dodge2(width = 1),
width = .5,
linewidth = 1
)
But if you wanted to add a layer, for example, the points, they will likely not align with the error bars. You also cannot set a seed.
|>
SWIM ggplot(mapping = aes(x = Event, y = Time, color = Team)) +
stat_summary(fun = mean,
geom = "errorbar",
fun.max = function(x) mean(x) + sd(x) / sqrt(length(x)), # se
fun.min = function(x) mean(x) - sd(x) / sqrt(length(x)), # se
position = position_dodge2(width = 1),
width = .5,
linewidth = 1
+
) stat_summary(fun = mean,
geom = "point",
position = position_dodge2(width = 1),
)
Session Info
sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] yardstick_1.3.1 workflowsets_1.1.0 workflows_1.1.4
[4] tune_1.2.1 rsample_1.2.1 recipes_1.1.0
[7] parsnip_1.2.1 modeldata_1.4.0 infer_1.0.7
[10] dials_1.3.0 scales_1.3.0 tidymodels_1.2.0
[13] kableExtra_1.4.0 broom_1.0.6 distributional_0.4.0
[16] ggdist_3.3.2 htmltools_0.5.8.1 DT_0.33
[19] vroom_1.6.5 lubridate_1.9.3 forcats_1.0.0
[22] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[25] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[28] ggplot2_3.5.1 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] splines_4.4.1 bitops_1.0-7 R.oo_1.26.0
[4] cellranger_1.1.0 datawizard_0.11.0 hardhat_1.4.0
[7] rpart_4.1.23 lifecycle_1.0.4 lm.beta_1.7-2
[10] gtsummary_1.7.2 pbmcapply_1.5.1 doParallel_1.0.17
[13] rprojroot_2.0.4 globals_0.16.3 lattice_0.22-6
[16] MASS_7.3-60.2 insight_0.20.1 backports_1.5.0
[19] magrittr_2.0.3 Hmisc_5.1-3 sass_0.4.9
[22] rmarkdown_2.27 yaml_2.3.10 qqplotr_0.0.6
[25] qqconf_1.3.2 gld_2.6.6 cowplot_1.1.3
[28] expm_0.999-9 quadprog_1.5-8 R.utils_2.12.3
[31] nnet_7.3-19 pracma_2.4.4 ipred_0.9-14
[34] labelled_2.13.0 lava_1.8.0 ggrepel_0.9.5
[37] listenv_0.9.1 moments_0.14.1 performance_0.12.0
[40] parallelly_1.37.1 svglite_2.1.3 commonmark_1.9.1
[43] codetools_0.2-20 xml2_1.3.6 tidyselect_1.2.1
[46] farver_2.1.2 base64enc_0.1-3 broom.helpers_1.15.0
[49] jsonlite_1.8.8 opdisDownsampling_1.0.1 e1071_1.7-14
[52] Formula_1.2-5 survival_3.6-4 iterators_1.0.14
[55] systemfonts_1.1.0 foreach_1.5.2 tools_4.4.1
[58] twosamples_2.0.1 DescTools_0.99.54 Rcpp_1.0.12
[61] glue_1.7.0 prodlim_2024.06.25 gridExtra_2.3
[64] xfun_0.45 here_1.0.1 mgcv_1.9-1
[67] withr_3.0.1 fastmap_1.2.0 boot_1.3-30
[70] fansi_1.0.6 caTools_1.18.2 digest_0.6.36
[73] timechange_0.3.0 R6_2.5.1 colorspace_2.1-0
[76] markdown_1.13 R.methodsS3_1.8.2 see_0.8.4
[79] utf8_1.2.4 generics_0.1.3 data.table_1.15.4
[82] robustbase_0.99-3 class_7.3-22 httr_1.4.7
[85] htmlwidgets_1.6.4 pkgconfig_2.0.3 gtable_0.3.5
[88] Exact_3.2 timeDate_4032.109 GPfit_1.0-8
[91] furrr_0.3.1 lmom_3.0 gower_1.0.1
[94] knitr_1.47 rstudioapi_0.16.0 tzdb_0.4.0
[97] checkmate_2.3.1 nlme_3.1-164 proxy_0.4-27
[100] rootSolve_1.8.2.4 parallel_4.4.1 foreign_0.8-87
[103] pillar_1.9.0 grid_4.4.1 vctrs_0.6.5
[106] xtable_1.8-4 lhs_1.2.0 cluster_2.1.6
[109] htmlTable_2.4.2 evaluate_0.24.0 magick_2.8.3
[112] mvtnorm_1.2-5 cli_3.6.3 compiler_4.4.1
[115] rlang_1.1.4 crayon_1.5.3 future.apply_1.11.2
[118] labeling_0.4.3 stringi_1.8.4 viridisLite_0.4.2
[121] munsell_0.5.1 bayestestR_0.13.2 pacman_0.5.1
[124] Matrix_1.7-0 patchwork_1.2.0 hms_1.1.3
[127] bit64_4.0.5 future_1.33.2 haven_2.5.4
[130] gt_0.10.1 DEoptimR_1.1-3 bit_4.0.5
[133] readxl_1.4.3 DiceDesign_1.10