Histograms and density plots

Author

Gabriel I. Cook

Published

October 30, 2024

Overview

This module focuses on the Cartesian Coordinate system and visualizing data distributions using {ggplot2} geom_histogram() and geom_density(). This module demonstrates some ways to visualize the distribution of a single variable along either the x or y axis. When adding another aesthetic (e.g., fill), this distribution could be visualized along the x axis but for different groupings for a second, fill variable. Visualizations containing multiple levels of a second variable create problems for geom_histogram() out-of-the-box which can be addressed with geom_density() but this approach too can create perceptual problems for more complicated plots. We will also address ways to solve the limitations of visualizing more complex data arrangement in a later section and introduce the {ggridges} library to produce geom_ridgeline() plots.

To Do

External Functions

Provided in class:

view_html(): for viewing data frames in html format, from /src/my_functions.R

You can use this in your own work space but I am having a challenge rendering this of the website, so I’ll default to print() on occasion.

R.utils::sourceDirectory(here::here("src", "functions"))

Libraries

{dplyr} 1.1.4: for selecting, filtering, and mutating
{ggplot2} 3.5.1: for plotting

Load libraries

library(dplyr)
library(ggplot2)

Loading Data

To examine some associations, we will use some swimming event times which can be accessed from:

https://raw.githubusercontent.com/slicesofdata/dataviz24/main/data/processed/cleaned-2023-cms-invite.csv

To access the data, either read the file directly from the url using read.csv() and assign the data frame a name like SWIM:

read.csv("https://raw.githubusercontent.com/slicesofdata/dataviz24/main/data/processed/cleaned-2023-cms-invite.csv")

Or download it and save to the /data/processed directory and read from there.

SWIM <- read.csv(here::here("data", "processed", "cleaned-2023-cms-invite.csv"))

Histograms

`geom_histogram()`

The histogram will plot the counts of common instances within a vector. Plotting a geom_histogram() will require mapping a variable to either x or y in aes(). The only difference will be the distribution orientation, which could have course be changed also with coord_flip().

If x = Time:

SWIM |>
  ggplot(mapping = aes(x = Time)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

If y = Time:

SWIM |>
  ggplot(mapping = aes(y = Time)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

If you have both x and y mapped, you will get an error.

SWIM |>
  ggplot(mapping = aes(x = Distance, y = Time)) +
  geom_histogram()

Error in `geom_histogram()`:
! Problem while computing stat.
ℹ Error occurred in the 1st layer.
Caused by error in `setup_params()`:
! `stat_bin()` must only have an x or y aesthetic.

Histograms depend on the chosen bin width

Histograms are data representations typically use to display counts, frequencies, or instances of events falling within some upper and lower bound range for numeric or continuous variables. For example, counts of salaries for every $5k, weights for every 5 pounds, etc. Histogram functions will typically determine these ranges on their own, which may or may not be suitable for your visualization. All of these elements can be customized.

Setting `bins` Number

The bins parameter in geom_histogram() controls number of bins, defaults to 30, and is overridden by the binwidth argument. Depending on your data, this may or may not provide the best representation of data. You can also specify the number of bins. This will force geom_histogram() to adjust the bin width based on the number of bind you specify. Let’s create two examples.

plot1 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  geom_histogram(bins = 20) +
  labs(title = "bins = 20", tag = "A")

plot2 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  geom_histogram(bins = 60) +
  labs(title = "bins = 60", tag = "B")

Because the total number of bins will capture all frequency counts in the data, setting fewer bins will result in plots that have more counts per bin than will plots with many bins. Examine the counts along the y-axis to see this.

plot(gridExtra::arrangeGrob(plot1, plot2, ncol = 2))

Not only is the narrowness of the bars associated with more bins but the this is the result of fewer counts, which ultimately affects your range along the y axis. This means that when you plot two bars side-by-side (to facilitate comparison of bars by their heights), the height will be confounded by the scale; you are not comparing bars on aligned scales (See Cleveland & McGill (1984)). This results in perceptual problems.

Setting `binwidth`

When you plot a geom_histogram(), R reminds you to consider the binwidth by providing the error: stat_bin() using bins = 30. Pick better value with binwidth. The binwidth argument controls the bin width along the X-axis and this argument overrides the bin argument.

You could make your bin width narrow relative to the data being plotted. For example, 1/10 of a second (e.g., binwidth = .1) for swim events. Similarly, if your vector contains discrete values, binwidth = 1 would plot the frequency count for all instances.

plot1 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  geom_histogram(binwidth = .1) +
  labs(title = "binwidth = .1", tag = "A")

But this produces a very granular plot that is not very useful. The counts are so few and the distribution is not very apparent visually. By contrast, if you set it too wide (e.g., binwidth = 30), then you might loose a lot of detail. Notice also that the x axis does not provide tick marks that help with the binning, whether narrow or wide.

plot2 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  geom_histogram(binwidth = 30) +
  labs(title = "binwidth = 30", tag = "B")

plot(gridExtra::arrangeGrob(plot1, plot2, ncol = 2))

Is this plot better?

plot3 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  geom_histogram(binwidth = 5) +
  labs(title = "binwidth = 5", tag = "C")

plot(gridExtra::arrangeGrob(plot1, plot2, ncol = 1))

Centering the bin

Once you are happy with the bin width, you can think about centering them. center will be used to specify the center of the bin containing the bin value. For now, there is some adjustment, though not perfect, to the scale to see the differences.

plot1 <- SWIM |>
  filter(Time < 200) |>
  ggplot(mapping = aes(Time)) +
  geom_histogram(binwidth = 10) +
  ggtitle("binwidth = 10") +
  # add some detailing to see better
  scale_x_continuous(breaks = seq(0, 350, 10)) +
  labs(title = "default center", tag = "A")

plot2 <- SWIM |>
  filter(Time < 200) |>
  ggplot(mapping = aes(Time)) +
  geom_histogram(binwidth = 10,
                 center = 1) +
  # add some detailing to see better 
  scale_x_continuous(breaks = seq(0, 350, 10)) +
  labs(title = "center = 1", tag = "B")

plot3 <- SWIM |>
  filter(Time < 200) |>
  ggplot(mapping = aes(Time)) +
  geom_histogram(binwidth = 10,
                 center = 0) +
  # add some detailing to see better
  scale_x_continuous(breaks = seq(0, 350, 10)) +
  labs(title = "center = 0", tag = "C")

plot4 <- SWIM |>
  filter(Time < 200) |>
  ggplot(mapping = aes(Time)) +
  geom_histogram(binwidth = 10,
                 center = 5) +
  # add some detailing to see better 
  scale_x_continuous(breaks = seq(0, 350, 10)) +
  labs(title = "center = 5", tag = "D")

plot(gridExtra::arrangeGrob(plot1, plot2, plot3, plot4, ncol = 2))

For more details on center, see ?geom_histogram.

Color and other Aesthetics

And can also change the color of the bars in the histogram or make them transparent (e.g., fill = "transparent").

plot1 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  geom_histogram(
    fill = "tomato2",
    color = "black",
    linetype = "dashed"
  ) +
  labs(title = 'fill = "tomato2"\nlinetype = "dashed"',
      tag = "A"
      )

plot2 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  geom_histogram(
    fill = "transparent",
    color = "tomato2",
    linetype = "solid"
  ) +
  labs(title = 'color = "tomato2"\nlinetype = "solid"',
       tag = "B"
  )

Two arrangements of the plots illustrate their differences.

patchwork::plot_layout(plot1 + plot2)

$ncol

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


$nrow
NULL

$byrow
NULL

$widths
NULL

$heights
NULL

$guides
NULL

$tag_level
NULL

$axes
NULL

$axis_titles
NULL

$design
NULL

attr(,"class")
[1] "plot_layout"

patchwork::plot_layout(plot1 / plot2)

$ncol

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


$nrow
NULL

$byrow
NULL

$widths
NULL

$heights
NULL

$guides
NULL

$tag_level
NULL

$axes
NULL

$axis_titles
NULL

$design
NULL

attr(,"class")
[1] "plot_layout"

Stacked histograms

You can stack bars in histograms just as you did with geom_bar(). Mapping a variable to fill will introduce the distribution for levels of the fill variable. However, if you think stacking a histogram will be useful, it really isn’t.

SWIM |>
  ggplot(mapping = aes(x = Time, fill = Team)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Even if the bars are not opaque, they are not much better. In general, histograms with overlapped bars a rarely useful.

SWIM |>
  ggplot(mapping = aes(x = Time, fill = Team)) +
  geom_histogram(alpha = .5)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There are certainly things that you can do to the plot but there are better options.

Histograms with y as Density

The default operation of geom_histogram() is to plot the frequency counts along the y axis corresponding to the x variable mapped in aes(). The statistical transformation adjusted using after_stat(). You will see that the default mapping to the y variable is after_stat(count). We can add this explicitly either in geom_histogram() to across ggplot() and geom_histogram()

For example:

  ggplot() + 
  geom_histogram(mapping = aes(x = Time,
                               y = after_stat(count)
                               )
  )
  
  ggplot(mapping = aes(x = Time)) + 
  geom_histogram(mapping = aes(y = after_stat(count)))

`after_stat(count)`

(plot1 <- SWIM |>
  filter(Time < 500) |>
  ggplot() +
  geom_histogram(mapping = aes(x = Time,
                               y = after_stat(count)
                     )
                 ) +
  labs(title = 'aes(y = after_stat(count))',
       tag = "A")
)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The plot looks the same as we have seen before because this is the default operation. We can, however, change the statistical transformation using after_stat() to create a density plot.

`after_stat(density)`

(plot2 <- SWIM |>
  filter(Time < 500) |>
  ggplot() +
  geom_histogram(mapping = aes(x = Time,
                               y = after_stat(density)
                     )
                 ) +
  labs(title = 'aes(y = after_stat(density))', 
       tag = "B"
       )
)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Using {patchwork} to create a grob, we can divide the grid space using / so that the plots are on top of each other.

$ncol


$nrow
NULL

$byrow
NULL

$widths
NULL

$heights
NULL

$guides
NULL

$tag_level
NULL

$axes
NULL

$axis_titles
NULL

$design
NULL

attr(,"class")
[1] "plot_layout"

Notice, however, that the visualization of the binned bars along with the density function does not provide provide what you might expect to see. That’s because a density function is continuous whereas histograms are not.

If you add a density layer, you can see them both together:

plot2 + geom_density(mapping = aes(x = Time))

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histogram as Proportion of Counts

Not everyone is adept at reading histogram distributions representing frequency counts or probability densities. Sometimes you just want to plot the data such that your y-axis represent percentages. With new versions of {ggplot2}, you will accomplish this using the after_stat() function.

Just to illustrate what’s going on, we can use the after_stat() default with geom_histogram() to see what it does and then we can edit it.

(plot1 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  geom_histogram() +
  labs(title = "default",
       tag = "A")
)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Because geom_histogram() is applying the statistical transformation, stat as the frequency count and plotting along the y-axis, we can add after_stat(count) to y inside the aes() and see what happens to the plot.

(plot2 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  geom_histogram(mapping = aes(y = after_stat(count))) +
  labs(title = "aes(y = after_stat(count))", 
       tag = "B"
       )
)

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

But because a proportion of counts in a bin is calculated as counts in bin/ sum of all counts, we can specify the math aes(y = after_stat(count)/sum(after_stat(count))).

plot3 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  geom_histogram(mapping = aes(y = after_stat(count)/sum(after_stat(count)))) +
  labs(title = "count/sum(counts)",
       tag = "C")

Looking at the plots together:

plot(gridExtra::arrangeGrob(plot1, plot2, plot3, ncol = 1))

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

And although adjusting scales is addressed in a different module, we can clean up the scale here just as an example of how scale functions can clean up axes. We would need to do something like adjust the tick labels on the y axis as percents using scale_y_continuous(labels = scales::percent).

**Creating `geom_*()`s with `stat_bin`**

Histograms using `stat_bin`

Although you are familiar with the usage of geom_()s that include statistical transformations (remember all plots have a stat_()), statistical transformation functions with naming convention stat_*() allow for geoms. For stat_bin, which create bins, the default geom = "bar" and position = "stack", which you saw with the stacking of the colored bars using geom_histogram().

stat_bin(
  mapping = NULL,
  data = NULL,
  geom = "bar",
  position = "stack",
  ...,
  binwidth = NULL,
  bins = NULL,
  center = NULL,
  boundary = NULL,
  breaks = NULL,
  closed = c("right", "left"),
  pad = FALSE,
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE
)

geom = "bar"

SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  stat_bin(
    geom = "bar"
  )

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Densities or Area Plots using `stat_bin`

You can modify the geom to "density" or "area" by passing arguments.

stat_bin(geom = "density")
stat_bin(geom = "area"

Or even modify stat to a “density” plot:

stat_bin(stat = "density")

Starting with geom = "density":

plot1 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  stat_bin(
    geom = "density",
    fill = "gold",
    col  = "black"
  ) +
  labs(title = 'geom = "density"',
       tag = "A"
       )

And then geom = "area":

plot2 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  stat_bin(
    geom = "area",
    fill = "gold",
    col = "black"
  ) +
  labs(title = 'geom = "area"',
       tag = "B"
  )

plot(gridExtra::arrangeGrob(plot1, plot2, ncol = 1))

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As you can see, the geom settings produce the same plots.

Kernel Density Estimate (KDE) Plots

Whereas histograms display raw counts, KDEs display proportions. (Total area = 1)

`geom_density()`

geom_density(
  mapping = NULL,
  data = NULL,
  stat = "density",
  position = "identity",
  ...,
  na.rm = FALSE,
  orientation = NA,
  show.legend = NA,
  inherit.aes = TRUE,
  outline.type = "upper"
)

We can also change aesthetics using:

- color: line color
- size: line thickness
- linetype: line type 
- fill: fill color of the area 
- alpha: opacity/transparency

Let’s create some plots.

plot1 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  geom_density() +   
  labs(title = "no fill", tag = "A")

plot2 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  geom_density(fill = "gold") +
  labs(title = 'fill = "gold"', tag = "B")

plot3 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  geom_density(
    fill = "gold",
    bw = 0.5,               # a very small bandwidth
    kernel = "gaussian"     # Gaussian kernel (the default)
  ) +
  labs(title = 'fill = "gold"\nkernel = "gaussian"', tag = "C")

plot4 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  geom_density(
    fill = "gold",
    bw = 50,                   # a larger bandwidth
    kernel = "rectangular"     # Gaussian kernel (the default)
  ) +
  labs(title = 'fill = "gold"\nkernel = "rectangular"', tag = "D")


plot(gridExtra::arrangeGrob(plot1, plot2, plot3, plot4, ncol = 2))

The plots present the data in different ways. Choosing the best type of plot will involve some consideration. You should not feel compelled to use plots out-of-the-box.

Mapping a `fill` variable

In the earlier example, we set fill = "gold" but we could also map fill to an existing variable in the data frame. By default, the fill will be opaque, so you might be better off with an alpha adjustment.

plot1 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  geom_density(aes(fill = School)) +
  labs(title = 'aes(fill = School)) + alpha = 1', tag = "A")

plot2 <- SWIM |>
  filter(Time < 500) |>
  ggplot(mapping = aes(Time)) +
  geom_density(aes(fill = School), alpha = .5) +
  labs(title = 'aes(fill = School)) + alpha = .5', tag = "B")

plot(gridExtra::arrangeGrob(plot1, plot2, ncol = 1))

Although the alpha may not be the best with this setting, you can now see all of the distributions. Mapping a variable to fill will always result in plot using the same x-axis positioning. In other words, you will always have plots that will overlap unless of course the distributions are naturally far apart. We will introduce other approaches to distributions on different horizontal levels if the goal is to present the grouping in a way that prevents visual overlap.

Session Info

sessionInfo()

R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] htmltools_0.5.8.1 DT_0.33           vroom_1.6.5       lubridate_1.9.3  
 [5] forcats_1.0.0     stringr_1.5.1     dplyr_1.1.4       purrr_1.0.2      
 [9] readr_2.1.5       tidyr_1.3.1       tibble_3.2.1      ggplot2_3.5.1    
[13] tidyverse_2.0.0  

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        generics_0.1.3    stringi_1.8.4     hms_1.1.3        
 [5] digest_0.6.36     magrittr_2.0.3    evaluate_0.24.0   grid_4.4.1       
 [9] timechange_0.3.0  fastmap_1.2.0     R.oo_1.26.0       rprojroot_2.0.4  
[13] jsonlite_1.8.8    R.utils_2.12.3    gridExtra_2.3     fansi_1.0.6      
[17] scales_1.3.0      cli_3.6.3         rlang_1.1.4       crayon_1.5.3     
[21] R.methodsS3_1.8.2 bit64_4.0.5       munsell_0.5.1     withr_3.0.1      
[25] yaml_2.3.10       tools_4.4.1       tzdb_0.4.0        colorspace_2.1-0 
[29] here_1.0.1        vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4  
[33] htmlwidgets_1.6.4 bit_4.0.5         pkgconfig_2.0.3   pillar_1.9.0     
[37] gtable_0.3.5      glue_1.7.0        xfun_0.45         tidyselect_1.2.1 
[41] rstudioapi_0.16.0 knitr_1.47        farver_2.1.2      patchwork_1.2.0  
[45] rmarkdown_2.27    labeling_0.4.3    compiler_4.4.1