::sourceDirectory(here::here("src", "functions")) R.utils
Histograms and density plots
Overview
This module focuses on the Cartesian Coordinate system and visualizing data distributions using {ggplot2} geom_histogram()
and geom_density()
. This module demonstrates some ways to visualize the distribution of a single variable along either the x or y axis. When adding another aesthetic (e.g., fill
), this distribution could be visualized along the x axis but for different groupings for a second, fill
variable. Visualizations containing multiple levels of a second variable create problems for geom_histogram()
out-of-the-box which can be addressed with geom_density()
but this approach too can create perceptual problems for more complicated plots. We will also address ways to solve the limitations of visualizing more complex data arrangement in a later section and introduce the {ggridges} library to produce geom_ridgeline()
plots.
To Do
External Functions
Provided in class:
view_html()
: for viewing data frames in html format, from /src/my_functions.R
You can use this in your own work space but I am having a challenge rendering this of the website, so I’ll default to print()
on occasion.
Libraries
- {dplyr} 1.1.4: for selecting, filtering, and mutating
- {ggplot2} 3.5.1: for plotting
Load libraries
library(dplyr)
library(ggplot2)
Loading Data
To examine some associations, we will use some swimming event times which can be accessed from:
https://raw.githubusercontent.com/slicesofdata/dataviz24/main/data/processed/cleaned-2023-cms-invite.csv
To access the data, either read the file directly from the url using read.csv()
and assign the data frame a name like SWIM
:
read.csv("https://raw.githubusercontent.com/slicesofdata/dataviz24/main/data/processed/cleaned-2023-cms-invite.csv")
Or download it and save to the /data/processed
directory and read from there.
<- read.csv(here::here("data", "processed", "cleaned-2023-cms-invite.csv")) SWIM
Histograms
geom_histogram()
The histogram will plot the counts of common instances within a vector. Plotting a geom_histogram()
will require mapping a variable to either x
or y
in aes()
. The only difference will be the distribution orientation, which could have course be changed also with coord_flip()
.
If x = Time
:
|>
SWIM ggplot(mapping = aes(x = Time)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
If y = Time
:
|>
SWIM ggplot(mapping = aes(y = Time)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
If you have both x and y mapped, you will get an error.
SWIM |>
ggplot(mapping = aes(x = Distance, y = Time)) +
geom_histogram()
Error in `geom_histogram()`:
! Problem while computing stat.
ℹ Error occurred in the 1st layer.
Caused by error in `setup_params()`:
! `stat_bin()` must only have an x or y aesthetic.
Histograms depend on the chosen bin width
Histograms are data representations typically use to display counts, frequencies, or instances of events falling within some upper and lower bound range for numeric or continuous variables. For example, counts of salaries for every $5k, weights for every 5 pounds, etc. Histogram functions will typically determine these ranges on their own, which may or may not be suitable for your visualization. All of these elements can be customized.
Setting bins
Number
The bins
parameter in geom_histogram()
controls number of bins, defaults to 30, and is overridden by the binwidth
argument. Depending on your data, this may or may not provide the best representation of data. You can also specify the number of bins. This will force geom_histogram()
to adjust the bin width based on the number of bind you specify. Let’s create two examples.
<- SWIM |>
plot1 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
geom_histogram(bins = 20) +
labs(title = "bins = 20", tag = "A")
<- SWIM |>
plot2 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
geom_histogram(bins = 60) +
labs(title = "bins = 60", tag = "B")
Because the total number of bins
will capture all frequency counts in the data, setting fewer bins will result in plots that have more counts per bin than will plots with many bins. Examine the counts along the y-axis to see this.
plot(gridExtra::arrangeGrob(plot1, plot2, ncol = 2))
Not only is the narrowness of the bars associated with more bins but the this is the result of fewer counts, which ultimately affects your range along the y axis. This means that when you plot two bars side-by-side (to facilitate comparison of bars by their heights), the height will be confounded by the scale; you are not comparing bars on aligned scales (See Cleveland & McGill (1984)). This results in perceptual problems.
Setting binwidth
When you plot a geom_histogram()
, R reminds you to consider the binwidth by providing the error: stat_bin() using bins = 30. Pick better value with binwidth
. The binwidth
argument controls the bin width along the X-axis and this argument overrides the bin
argument.
You could make your bin width narrow relative to the data being plotted. For example, 1/10 of a second (e.g., binwidth = .1
) for swim events. Similarly, if your vector contains discrete values, binwidth = 1
would plot the frequency count for all instances.
<- SWIM |>
plot1 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
geom_histogram(binwidth = .1) +
labs(title = "binwidth = .1", tag = "A")
But this produces a very granular plot that is not very useful. The counts are so few and the distribution is not very apparent visually. By contrast, if you set it too wide (e.g., binwidth = 30
), then you might loose a lot of detail. Notice also that the x axis does not provide tick marks that help with the binning, whether narrow or wide.
<- SWIM |>
plot2 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
geom_histogram(binwidth = 30) +
labs(title = "binwidth = 30", tag = "B")
plot(gridExtra::arrangeGrob(plot1, plot2, ncol = 2))
Is this plot better?
<- SWIM |>
plot3 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
geom_histogram(binwidth = 5) +
labs(title = "binwidth = 5", tag = "C")
plot(gridExtra::arrangeGrob(plot1, plot2, ncol = 1))
Centering the bin
Once you are happy with the bin width, you can think about centering them. center
will be used to specify the center of the bin containing the bin value. For now, there is some adjustment, though not perfect, to the scale to see the differences.
<- SWIM |>
plot1 filter(Time < 200) |>
ggplot(mapping = aes(Time)) +
geom_histogram(binwidth = 10) +
ggtitle("binwidth = 10") +
# add some detailing to see better
scale_x_continuous(breaks = seq(0, 350, 10)) +
labs(title = "default center", tag = "A")
<- SWIM |>
plot2 filter(Time < 200) |>
ggplot(mapping = aes(Time)) +
geom_histogram(binwidth = 10,
center = 1) +
# add some detailing to see better
scale_x_continuous(breaks = seq(0, 350, 10)) +
labs(title = "center = 1", tag = "B")
<- SWIM |>
plot3 filter(Time < 200) |>
ggplot(mapping = aes(Time)) +
geom_histogram(binwidth = 10,
center = 0) +
# add some detailing to see better
scale_x_continuous(breaks = seq(0, 350, 10)) +
labs(title = "center = 0", tag = "C")
<- SWIM |>
plot4 filter(Time < 200) |>
ggplot(mapping = aes(Time)) +
geom_histogram(binwidth = 10,
center = 5) +
# add some detailing to see better
scale_x_continuous(breaks = seq(0, 350, 10)) +
labs(title = "center = 5", tag = "D")
plot(gridExtra::arrangeGrob(plot1, plot2, plot3, plot4, ncol = 2))
For more details on center
, see ?geom_histogram
.
Color and other Aesthetics
And can also change the color of the bars in the histogram or make them transparent (e.g., fill = "transparent"
).
<- SWIM |>
plot1 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
geom_histogram(
fill = "tomato2",
color = "black",
linetype = "dashed"
+
) labs(title = 'fill = "tomato2"\nlinetype = "dashed"',
tag = "A"
)
<- SWIM |>
plot2 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
geom_histogram(
fill = "transparent",
color = "tomato2",
linetype = "solid"
+
) labs(title = 'color = "tomato2"\nlinetype = "solid"',
tag = "B"
)
Two arrangements of the plots illustrate their differences.
::plot_layout(plot1 + plot2) patchwork
$ncol
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
$nrow
NULL
$byrow
NULL
$widths
NULL
$heights
NULL
$guides
NULL
$tag_level
NULL
$axes
NULL
$axis_titles
NULL
$design
NULL
attr(,"class")
[1] "plot_layout"
::plot_layout(plot1 / plot2) patchwork
$ncol
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
$nrow
NULL
$byrow
NULL
$widths
NULL
$heights
NULL
$guides
NULL
$tag_level
NULL
$axes
NULL
$axis_titles
NULL
$design
NULL
attr(,"class")
[1] "plot_layout"
Stacked histograms
You can stack bars in histograms just as you did with geom_bar()
. Mapping a variable to fill
will introduce the distribution for levels of the fill
variable. However, if you think stacking a histogram will be useful, it really isn’t.
|>
SWIM ggplot(mapping = aes(x = Time, fill = Team)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Even if the bars are not opaque, they are not much better. In general, histograms with overlapped bars a rarely useful.
|>
SWIM ggplot(mapping = aes(x = Time, fill = Team)) +
geom_histogram(alpha = .5)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There are certainly things that you can do to the plot but there are better options.
Histograms with y as Density
The default operation of geom_histogram()
is to plot the frequency counts along the y axis corresponding to the x variable mapped in aes()
. The statistical transformation adjusted using after_stat()
. You will see that the default mapping to the y variable is after_stat(count)
. We can add this explicitly either in geom_histogram()
to across ggplot()
and geom_histogram()
For example:
ggplot() +
geom_histogram(mapping = aes(x = Time,
y = after_stat(count)
)
)
ggplot(mapping = aes(x = Time)) +
geom_histogram(mapping = aes(y = after_stat(count)))
after_stat(count)
<- SWIM |>
(plot1 filter(Time < 500) |>
ggplot() +
geom_histogram(mapping = aes(x = Time,
y = after_stat(count)
)+
) labs(title = 'aes(y = after_stat(count))',
tag = "A")
)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The plot looks the same as we have seen before because this is the default operation. We can, however, change the statistical transformation using after_stat()
to create a density plot.
after_stat(density)
<- SWIM |>
(plot2 filter(Time < 500) |>
ggplot() +
geom_histogram(mapping = aes(x = Time,
y = after_stat(density)
)+
) labs(title = 'aes(y = after_stat(density))',
tag = "B"
) )
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Using {patchwork} to create a grob, we can divide the grid space using /
so that the plots are on top of each other.
$ncol
$nrow
NULL
$byrow
NULL
$widths
NULL
$heights
NULL
$guides
NULL
$tag_level
NULL
$axes
NULL
$axis_titles
NULL
$design
NULL
attr(,"class")
[1] "plot_layout"
Notice, however, that the visualization of the binned bars along with the density function does not provide provide what you might expect to see. That’s because a density function is continuous whereas histograms are not.
If you add a density layer, you can see them both together:
+ geom_density(mapping = aes(x = Time)) plot2
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
More on geom_density()
later.
Histogram as Proportion of Counts
Not everyone is adept at reading histogram distributions representing frequency counts or probability densities. Sometimes you just want to plot the data such that your y-axis represent percentages. With new versions of {ggplot2}, you will accomplish this using the after_stat()
function.
Just to illustrate what’s going on, we can use the after_stat()
default with geom_histogram()
to see what it does and then we can edit it.
<- SWIM |>
(plot1 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
geom_histogram() +
labs(title = "default",
tag = "A")
)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Because geom_histogram()
is applying the statistical transformation, stat
as the frequency count and plotting along the y-axis, we can add after_stat(count)
to y inside the aes()
and see what happens to the plot.
<- SWIM |>
(plot2 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
geom_histogram(mapping = aes(y = after_stat(count))) +
labs(title = "aes(y = after_stat(count))",
tag = "B"
) )
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
But because a proportion of counts in a bin is calculated as counts in bin/ sum of all counts
, we can specify the math aes(y = after_stat(count)/sum(after_stat(count)))
.
<- SWIM |>
plot3 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
geom_histogram(mapping = aes(y = after_stat(count)/sum(after_stat(count)))) +
labs(title = "count/sum(counts)",
tag = "C")
Looking at the plots together:
plot(gridExtra::arrangeGrob(plot1, plot2, plot3, ncol = 1))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
And although adjusting scales is addressed in a different module, we can clean up the scale here just as an example of how scale functions can clean up axes. We would need to do something like adjust the tick labels on the y axis as percents using scale_y_continuous(labels = scales::percent)
.
Creating geom_*()
s with stat_bin
Histograms using stat_bin
Although you are familiar with the usage of geom_()
s that include statistical transformations (remember all plots have a stat_()
), statistical transformation functions with naming convention stat_*()
allow for geom
s. For stat_bin
, which create bins, the default geom = "bar"
and position = "stack"
, which you saw with the stacking of the colored bars using geom_histogram()
.
stat_bin(
mapping = NULL,
data = NULL,
geom = "bar",
position = "stack",
...,binwidth = NULL,
bins = NULL,
center = NULL,
boundary = NULL,
breaks = NULL,
closed = c("right", "left"),
pad = FALSE,
na.rm = FALSE,
orientation = NA,
show.legend = NA,
inherit.aes = TRUE
)
geom = "bar"
|>
SWIM filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
stat_bin(
geom = "bar"
)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Densities or Area Plots using stat_bin
You can modify the geom
to "density"
or "area"
by passing arguments.
stat_bin(geom = "density")
stat_bin(geom = "area"
Or even modify stat
to a “density” plot:
stat_bin(stat = "density")
Starting with geom = "density"
:
<- SWIM |>
plot1 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
stat_bin(
geom = "density",
fill = "gold",
col = "black"
+
) labs(title = 'geom = "density"',
tag = "A"
)
And then geom = "area"
:
<- SWIM |>
plot2 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
stat_bin(
geom = "area",
fill = "gold",
col = "black"
+
) labs(title = 'geom = "area"',
tag = "B"
)
plot(gridExtra::arrangeGrob(plot1, plot2, ncol = 1))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As you can see, the geom settings produce the same plots.
Kernel Density Estimate (KDE) Plots
Whereas histograms display raw counts, KDEs display proportions. (Total area = 1)
geom_density()
geom_density(
mapping = NULL,
data = NULL,
stat = "density",
position = "identity",
...,na.rm = FALSE,
orientation = NA,
show.legend = NA,
inherit.aes = TRUE,
outline.type = "upper"
)
We can also change aesthetics using:
- color: line color
- size: line thickness
- linetype: line type
- fill: fill color of the area
- alpha: opacity/transparency
Let’s create some plots.
<- SWIM |>
plot1 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
geom_density() +
labs(title = "no fill", tag = "A")
<- SWIM |>
plot2 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
geom_density(fill = "gold") +
labs(title = 'fill = "gold"', tag = "B")
<- SWIM |>
plot3 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
geom_density(
fill = "gold",
bw = 0.5, # a very small bandwidth
kernel = "gaussian" # Gaussian kernel (the default)
+
) labs(title = 'fill = "gold"\nkernel = "gaussian"', tag = "C")
<- SWIM |>
plot4 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
geom_density(
fill = "gold",
bw = 50, # a larger bandwidth
kernel = "rectangular" # Gaussian kernel (the default)
+
) labs(title = 'fill = "gold"\nkernel = "rectangular"', tag = "D")
plot(gridExtra::arrangeGrob(plot1, plot2, plot3, plot4, ncol = 2))
The plots present the data in different ways. Choosing the best type of plot will involve some consideration. You should not feel compelled to use plots out-of-the-box.
Mapping a fill
variable
In the earlier example, we set fill = "gold"
but we could also map fill
to an existing variable in the data frame. By default, the fill will be opaque, so you might be better off with an alpha adjustment.
<- SWIM |>
plot1 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
geom_density(aes(fill = School)) +
labs(title = 'aes(fill = School)) + alpha = 1', tag = "A")
<- SWIM |>
plot2 filter(Time < 500) |>
ggplot(mapping = aes(Time)) +
geom_density(aes(fill = School), alpha = .5) +
labs(title = 'aes(fill = School)) + alpha = .5', tag = "B")
plot(gridExtra::arrangeGrob(plot1, plot2, ncol = 1))
Although the alpha may not be the best with this setting, you can now see all of the distributions. Mapping a variable to fill
will always result in plot using the same x-axis positioning. In other words, you will always have plots that will overlap unless of course the distributions are naturally far apart. We will introduce other approaches to distributions on different horizontal levels if the goal is to present the grouping in a way that prevents visual overlap.
Session Info
sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] htmltools_0.5.8.1 DT_0.33 vroom_1.6.5 lubridate_1.9.3
[5] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[9] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
[13] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] utf8_1.2.4 generics_0.1.3 stringi_1.8.4 hms_1.1.3
[5] digest_0.6.36 magrittr_2.0.3 evaluate_0.24.0 grid_4.4.1
[9] timechange_0.3.0 fastmap_1.2.0 R.oo_1.26.0 rprojroot_2.0.4
[13] jsonlite_1.8.8 R.utils_2.12.3 gridExtra_2.3 fansi_1.0.6
[17] scales_1.3.0 cli_3.6.3 rlang_1.1.4 crayon_1.5.3
[21] R.methodsS3_1.8.2 bit64_4.0.5 munsell_0.5.1 withr_3.0.1
[25] yaml_2.3.10 tools_4.4.1 tzdb_0.4.0 colorspace_2.1-0
[29] here_1.0.1 vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4
[33] htmlwidgets_1.6.4 bit_4.0.5 pkgconfig_2.0.3 pillar_1.9.0
[37] gtable_0.3.5 glue_1.7.0 xfun_0.45 tidyselect_1.2.1
[41] rstudioapi_0.16.0 knitr_1.47 farver_2.1.2 patchwork_1.2.0
[45] rmarkdown_2.27 labeling_0.4.3 compiler_4.4.1