Summarizing Data: Using across() and Lists
Overview
This module introduces some more advanced approaches to summarizing data. They are advanced insofar as they involve creating lists of summary functions and using dplyr::across() to run those functions across multiple variables. They are more complicated than adding lines of expressions to summarize(), but the cost of the added complexity buys you simpler, more reproducible code.
Libraries
library(tibble)
library(dplyr)
Create a Data Frame
DATA <- data.frame(
  id = 1:6,
  group = c(rep("a", 3), rep("b", 3)),
  x = c(11, 21, 13, 51, 44, NA),
  y = c(30, 20, 33, NA, 20, 19)
)
# Alternatively, define the same data row-wise with tribble()
DATA <- tribble(
  ~id, ~group, ~x, ~y,
  1, "a", 11, 30,
  2, "a", 21, 20,
  3, "a", 13, 33,
  4, "b", 51, NA,
  5, "b", 44, 20,
  6, "b", NA, 19
)
Standard Data-Summary Techniques
Summarizing Vectors
mean(DATA$x, na.rm = TRUE)
[1] 28
mean(DATA$y, na.rm = TRUE)
[1] 24.4
median(DATA$x, na.rm = TRUE)
[1] 21
median(DATA$y, na.rm = TRUE)
[1] 20
NOTE: The mean and median of x differ.
Summarizing Vectors in Data Frames
1 Variable, 2 Metrics
DATA |>
  summarize(x = mean(x, na.rm = TRUE),
            x_median = median(x, na.rm = TRUE)
            )
# A tibble: 1 × 2
x x_median
<dbl> <dbl>
1 28 28
Warning: The mean and median of x are the same. The median() is computed on the new value of x that is assigned by the first line in summarize() (i.e., x = mean(x, na.rm = TRUE)), not on the original column. You would want to use a new variable name.
Solution: Assign to names other than x:
DATA |>
  summarize(mean = mean(x, na.rm = TRUE),
            median = median(x, na.rm = TRUE)
            )
# A tibble: 1 × 2
mean median
<dbl> <dbl>
1 28 21
2 Variables, 1 Metric
DATA |>
  summarize(x_mean = mean(x, na.rm = TRUE),
            y_mean = mean(y, na.rm = TRUE)
            )
# A tibble: 1 × 2
x_mean y_mean
<dbl> <dbl>
1 28 24.4
NOTE: Are the means really the same?
2 Variables, 2 Metrics
DATA |>
  summarize(x_mean = mean(x, na.rm = TRUE),
            y_mean = mean(y, na.rm = TRUE),
            x_median = median(x, na.rm = TRUE),
            y_median = median(y, na.rm = TRUE)
            )
# A tibble: 1 × 4
x_mean y_mean x_median y_median
<dbl> <dbl> <dbl> <dbl>
1 28 24.4 21 20
2 Variables, 2 Metrics with Grouping
DATA |>
  group_by(group) |>
  summarize(x_mean = mean(x, na.rm = TRUE),
            y_mean = mean(y, na.rm = TRUE),
            x_median = median(x, na.rm = TRUE),
            y_median = median(y, na.rm = TRUE)
            )
# A tibble: 2 × 5
group x_mean y_mean x_median y_median
<chr> <dbl> <dbl> <dbl> <dbl>
1 a 15 27.7 13 30
2 b 47.5 19.5 47.5 19.5
Advanced Data-Summary Techniques
Coding each variable to include in the summarized data frame can be tedious.
- use across()
- pass a list of functions
Summarizing Across with dplyr::across()
across() is used when you want to iterate a function or set of functions across multiple variables. You pass arguments for the columns you want to summarize, the function(s) specifying how to summarize, and, optionally, the names of the new output variables. The signature, with its default arguments, is:
across(.cols,
       .fns, ..., .names = NULL,
       .unpack = FALSE
       )
dplyr::across(): Parameters/Arguments
- .cols: the columns to perform a function upon
- .fns: the function(s) to apply to the columns in .cols
- .names: a glue specification that describes how to name the output columns; use {.col} to stand for the selected column name, and {.fn} for the function being applied; the default, NULL, is equivalent to "{.col}" when a single function is passed and "{.col}_{.fn}" when a list of functions is passed
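To make the glue specification concrete, here is a minimal sketch of a custom .names pattern. The toy data frame and the "avg_" prefix are assumptions made for illustration; they are not part of the module's DATA object.

```r
library(dplyr)

# Hypothetical toy data, used only for this illustration
toy <- tibble::tibble(a = c(1, 2, 3), b = c(10, 20, NA))

named_out <- toy |>
  summarize(across(.cols = c(a, b),
                   .fns = ~mean(.x, na.rm = TRUE),
                   # literal text plus {.col}: each output is "avg_" + column name
                   .names = "avg_{.col}"
                   ))

names(named_out)
```

Literal text in the pattern is copied as-is, and only the parts in braces are substituted, so the output columns here are named avg_a and avg_b.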
dplyr::across(): Passing Arguments
- .cols = c(x, y)
- .fns = ~mean(.x, na.rm = TRUE)
Accepting the default argument for .names: .names = NULL
DATA |>
  group_by(group) |>
  summarize(across(.cols = c(x, y),
                   .fns = ~mean(.x, na.rm = TRUE)
                   ) )
# A tibble: 2 × 3
group x y
<chr> <dbl> <dbl>
1 a 15 27.7
2 b 47.5 19.5
Passing an argument to .names: .names = "{col}_{fn}"
DATA |>
  group_by(group) |>
  summarize(across(.cols = c(x, y),
                   .fns = ~mean(.x, na.rm = TRUE),
                   .names = "{col}_{fn}"
                   ) )
# A tibble: 2 × 3
group x_1 y_1
<chr> <dbl> <dbl>
1 a 15 27.7
2 b 47.5 19.5
Or using a quoted vector: .cols = c("x", "y")
DATA |>
  group_by(group) |>
  summarize(across(.cols = c("x", "y"),
                   .fns = ~mean(.x, na.rm = TRUE),
                   .names = "{col}_{fn}"
                   ) )
# A tibble: 2 × 3
group x_1 y_1
<chr> <dbl> <dbl>
1 a 15 27.7
2 b 47.5 19.5
Or passing a character vector stored in an object:
- must use all_of() or any_of() for variable selection
- .cols = all_of(summarize_these)
- .cols = summarize_these will produce a warning
summarize_these <- c("x", "y") # create the vector to pass
DATA |>
  group_by(group) |>
  summarize(across(.cols = any_of(summarize_these),
                   .fns = ~mean(.x, na.rm = TRUE),
                   .names = "{col}_{fn}"
                   ) )
# A tibble: 2 × 3
group x_1 y_1
<chr> <dbl> <dbl>
1 a 15 27.7
2 b 47.5 19.5
Note: defining a vector of variables can be a helpful solution when you have multiple summary tables for which you use the same variables.
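The choice between all_of() and any_of() matters when the vector names a column that does not exist. The following sketch illustrates the difference under assumed toy data; the objects toy and wanted are hypothetical and not part of the module.

```r
library(dplyr)

# Hypothetical toy data; "z" is deliberately absent from it
toy <- tibble::tibble(x = c(1, 2, 3), y = c(4, 5, NA))
wanted <- c("x", "y", "z")

# any_of() silently skips names that are not columns
lenient <- toy |>
  summarize(across(.cols = any_of(wanted),
                   .fns = ~mean(.x, na.rm = TRUE)))

# all_of() is strict: a missing name raises an error, caught here
strict <- tryCatch(
  toy |>
    summarize(across(.cols = all_of(wanted),
                     .fns = ~mean(.x, na.rm = TRUE))),
  error = function(e) "error: a requested column does not exist"
)

names(lenient)
strict
```

Use all_of() when a missing column should stop the analysis, and any_of() when the same vector is reused across data frames that may not all contain every variable.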
Annoyances with across()
- More complicated
- Although {col} is useful (e.g., x and y), {fn} results in a numeric value which is not diagnostic of the function
Using Lists
- Lists hold all different types of objects as their elements
- Vectors, data frames, functions
my_list <- list(
  num_vect = c(1, 2),
  char_vect = c("2", "3", "5"),
  dataframe = DATA,
  afunction = function(x) {mean(x)}
)
Silly Example
- Call the function element
- Apply it to the numeric vector element
my_list$afunction(my_list$num_vect)
[1] 1.5
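Because functions are ordinary list elements, a list can also hold several of them and each can be applied in turn. This is a small sketch of that idea; stat_list and values are hypothetical names introduced here and are separate from my_list.

```r
# Hypothetical list of plain functions (separate from my_list)
stat_list <- list(
  mean  = function(v) mean(v, na.rm = TRUE),
  range = function(v) max(v, na.rm = TRUE) - min(v, na.rm = TRUE)
)

values <- c(11, 21, 13, NA)

# Pull one element out by name and call it
stat_list$mean(values)

# Or apply every function in the list to the same vector;
# the element names carry over to the result
results <- sapply(stat_list, function(f) f(values))
results
```

The element names label the results, which is the same mechanism across() relies on when it fills in {fn} from a named list.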
Passing a List of Functions to .fns
- .cols = c(x, y)
- .fns = list(~mean(.x, na.rm = TRUE))
DATA |>
  group_by(group) |>
  summarize(across(.cols = c(x, y),
                   .fns = list(~mean(.x, na.rm = TRUE)),
                   .names = "{col}_{fn}"
                   ) )
# A tibble: 2 × 3
group x_1 y_1
<chr> <dbl> <dbl>
1 a 15 27.7
2 b 47.5 19.5
Note: The ~ is used as a lambda-like operator that applies the function to each variable passed to .cols. In list(~mean(.x, na.rm = TRUE)), the .x does not refer to the x column in the data frame but to the values of whichever variable is currently being processed. In this case, .x would be x and then y, in that order.
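The formula shorthand is one of several ways to write the same anonymous function. The sketch below compares three spellings on assumed toy data (toy is hypothetical); the backslash form requires R 4.1 or later, the same version that introduced the |> pipe.

```r
library(dplyr)

# Hypothetical toy data, used only for this comparison
toy <- tibble::tibble(x = c(1, 2, NA), y = c(10, NA, 30))

# Formula style: .x stands for the current column
formula_style <- toy |>
  summarize(across(c(x, y), ~mean(.x, na.rm = TRUE)))

# Anonymous function shorthand (R >= 4.1): the argument name is your choice
shorthand_style <- toy |>
  summarize(across(c(x, y), \(v) mean(v, na.rm = TRUE)))

# Full anonymous function
function_style <- toy |>
  summarize(across(c(x, y), function(v) mean(v, na.rm = TRUE)))

all.equal(formula_style, shorthand_style)
all.equal(formula_style, function_style)
```

All three produce the same summary. Naming the argument something other than x, as with v here, avoids any confusion with the x column.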
Same Annoyances
- More complicated (requires remembering the function, in the list, and the ~)
- Although {col} is useful (e.g., x and y), {fn} results in a numeric value which is not diagnostic of the function
Giving the List Elements Names
- .cols = c(x, y)
- .fns = list(some_name = ~mean(.x, na.rm = TRUE))
DATA |>
  group_by(group) |>
  summarize(across(.cols = c(x, y),
                   .fns = list(some_name = ~mean(.x, na.rm = TRUE)),
                   .names = "{col}_{fn}"
                   ) )
# A tibble: 2 × 3
group x_some_name y_some_name
<chr> <dbl> <dbl>
1 a 15 27.7
2 b 47.5 19.5
Annoyances
- Still complicated (requires remembering the function, in the list, and the ~)
Passing a List Object to .fns
- create a summary_funcs list of function(s)
- .fns = summary_funcs
Create a list containing the mean function from the previous example, written here with na.omit() rather than na.rm = TRUE:
summary_funcs <- list(
  mean = ~mean(na.omit(.x))
)
Then pass it to .fns with .fns = summary_funcs:
DATA |>
  summarize(across(.cols = c(x, y),
                   .fns = summary_funcs,
                   .names = "{col}_{fn}"
                   ) )
# A tibble: 1 × 2
x_mean y_mean
<dbl> <dbl>
1 28 24.4
Pair with .cols = summarize_these to summarize the variables in summarize_these using the function(s) in summary_funcs:
DATA |>
  summarize(across(.cols = summarize_these,
                   .fns = summary_funcs,
                   .names = "{col}_{fn}"
                   ) )
Note: If you try to execute this, pay attention to the warning message!
! Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(summarize_these)
# Now:
data %>% select(all_of(summarize_these))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
Solution: When passing an external vector, you need to use all_of() or any_of().
DATA |>
  summarize(across(.cols = all_of(summarize_these),
                   .fns = summary_funcs,
                   .names = "{col}_{fn}"
                   ) )
# A tibble: 1 × 2
x_mean y_mean
<dbl> <dbl>
1 28 24.4
Annoyances/Benefits
- Requires creating other objects
- Simplifies the code
- Does not require remembering the functions, the list, and ~ after created once
Adding More Functions as List Elements
Add functions to the list to accomplish more
summary_funcs <- list(
  mean = ~mean(na.omit(.x)),
  median = ~median(na.omit(.x)),
  sd = ~sd(na.omit(.x)),
  n = ~length(na.omit(.x))
)
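Before using such a list inside across(), it can help to check each element on a plain vector. The formulas are not directly callable, so this sketch converts them with rlang::as_function(); rlang is installed as a dependency of dplyr. The vector check_values is a hypothetical test input, and the list is repeated so the snippet runs on its own.

```r
# Copy of the list so this snippet is self-contained
summary_funcs <- list(
  mean = ~mean(na.omit(.x)),
  median = ~median(na.omit(.x)),
  sd = ~sd(na.omit(.x)),
  n = ~length(na.omit(.x))
)

# Hypothetical test vector containing a missing value
check_values <- c(51, 44, NA)

# Convert each formula to a function, then call it on the vector
checked <- sapply(summary_funcs,
                  function(f) rlang::as_function(f)(check_values))
checked
```

The n element counts only the non-missing values, so it reports 2 for this vector rather than 3.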
Passing the List Object to .fns: Grouping
- group_by(group)
- .fns = summary_funcs
DATA |>
  group_by(group) |>
  summarize(across(.cols = all_of(summarize_these),
                   .fns = summary_funcs,
                   .names = "{col}_{fn}"
                   ) )
# A tibble: 2 × 9
group x_mean x_median x_sd x_n y_mean y_median y_sd y_n
<chr> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
1 a 15 13 5.29 3 27.7 30 6.81 3
2 b 47.5 47.5 4.95 2 19.5 19.5 0.707 2
Annoyances/Benefits
- Creating and remembering the list object name
- Solution: Add an .R script to your functions directory and make a code snippet for the object name.
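One way to act on the script suggestion is to keep the list definition in its own file and source() it wherever it is needed. The sketch below writes the file to a temporary directory only so that it runs anywhere; the file name summary_funcs.R and its location are assumptions, and a real project would use a fixed path in its functions directory.

```r
# Hypothetical location; replace with a path in your own functions directory
script_path <- file.path(tempdir(), "summary_funcs.R")

# Write the list definition to the script (done once)
writeLines(c(
  "summary_funcs <- list(",
  "  mean = ~mean(na.omit(.x)),",
  "  n = ~length(na.omit(.x))",
  ")"
), script_path)

# In any analysis script: load the definition instead of retyping it
source(script_path)
names(summary_funcs)
```

Keeping the definition in one file means a change to the summary functions is made once and picked up by every script that sources it.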