Summarizing Data: Advanced Techniques

Libraries

{tibble}, {dplyr}
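The two packages named above can be attached as follows. This is a minimal sketch that assumes both are already installed; the install step is not part of the slides.

```r
# Attach the packages used throughout these slides.
# Assumes they are installed, e.g. via install.packages(c("tibble", "dplyr")).
library(tibble)  # tribble() for row-wise data frame creation
library(dplyr)   # summarize(), group_by(), across()
```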

Create a Data Frame

DATA <- tribble(
  ~id, ~group, ~x, ~y,
  1, "a", 11, 30,
  2, "a", 21, 20,
  3, "a", 13, 33,
  4, "b", 51, NA,
  5, "b", 44, 20,
  6, "b", NA, 19
)

Standard Data-Summary Techniques

Summarizing Vectors

Use the $ operator to extract a vector from a data frame

mean(DATA$x, na.rm = TRUE)

[1] 28

mean(DATA$y, na.rm = TRUE)

[1] 24.4

median(DATA$x, na.rm = TRUE)

[1] 21

median(DATA$y, na.rm = TRUE)

[1] 20

NOTE: The mean and median of x differ.

Summarizing Vectors in Data Frames

1 Variable, 2 Metrics

DATA |>
  summarize(x = mean(x, na.rm = TRUE),
            x_median = median(x, na.rm = TRUE)
            )

# A tibble: 1 × 2
      x x_median
  <dbl>    <dbl>
1    28       28

Warning: The mean and median of x are the same. summarize() evaluates its arguments in order, so median() is computed on the new value of x assigned by the first argument (i.e., x = mean(x, na.rm = TRUE)), not on the original column. Assign the mean to a new variable name instead.
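The in-order evaluation behind this warning can be checked directly. The sketch below rebuilds the example data and reverses the two expressions so that the median is taken before the name x is overwritten; the object name reordered is only illustrative.

```r
library(tibble)
library(dplyr)

DATA <- tribble(
  ~id, ~group, ~x, ~y,
  1, "a", 11, 30,
  2, "a", 21, 20,
  3, "a", 13, 33,
  4, "b", 51, NA,
  5, "b", 44, 20,
  6, "b", NA, 19
)

# summarize() evaluates its arguments in order, so the median is taken
# from the original column before the name x is reassigned to the mean.
reordered <- DATA |>
  summarize(x_median = median(x, na.rm = TRUE),
            x = mean(x, na.rm = TRUE))

reordered$x_median  # 21, the median of the original column
reordered$x         # 28, the mean
```

Reordering works but is fragile; giving each output a distinct name is the safer habit.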

1 Variable, 2 Metrics (Cont.)

Assigning to names other than x:

DATA |>
  summarize(mean = mean(x, na.rm = TRUE),
            median = median(x, na.rm = TRUE)
            )

# A tibble: 1 × 2
   mean median
  <dbl>  <dbl>
1    28     21

NOTE: The mean and median of x differ.

2 Variables, 1 Metric

DATA |>
  summarize(x_mean = mean(x, na.rm = TRUE),
            y_mean = mean(y, na.rm = TRUE)
            )

# A tibble: 1 × 2
  x_mean y_mean
   <dbl>  <dbl>
1     28   24.4

2 Variables, 2 Metrics

DATA |>
  summarize(x_mean = mean(x, na.rm = TRUE),
            y_mean = mean(y, na.rm = TRUE),
            x_median = median(x, na.rm = TRUE),
            y_median = median(y, na.rm = TRUE)
            )

# A tibble: 1 × 4
  x_mean y_mean x_median y_median
   <dbl>  <dbl>    <dbl>    <dbl>
1     28   24.4       21       20

2 Variables, 2 Metrics with Grouping

DATA |>
  group_by(group) |>
  summarize(x_mean = mean(x, na.rm = TRUE),
            y_mean = mean(y, na.rm = TRUE),
            x_median = median(x, na.rm = TRUE),
            y_median = median(y, na.rm = TRUE)
            )

# A tibble: 2 × 5
  group x_mean y_mean x_median y_median
  <chr>  <dbl>  <dbl>    <dbl>    <dbl>
1 a       15     27.7     13       30  
2 b       47.5   19.5     47.5     19.5

Advanced Data-Summary Techniques

Writing out every variable and metric to include in the summarized data frame can be tedious.

Use across()
Pass a list of functions

Summarizing Across with `dplyr::across()`

across() is used when you want to apply a function, or a set of functions, across multiple variables. You pass it the columns to summarize, the function(s) that specify how to summarize them, and, optionally, a pattern for naming the new output variables.

across(.cols, 
       .fns, ..., 
       .names = NULL, 
       .unpack = FALSE
       )

`dplyr::across()`: Parameters/Arguments

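.cols: the columns to apply the function(s) to
.fns: the function(s) to apply to each column in .cols
.names: a glue specification that describes how to name the output columns; use {.col} to stand for the selected column name and {.fn} for the name of the function being applied (the examples here write these as {col} and {fn}, which dplyr also accepts); the default (NULL) is equivalent to "{.col}" when a single function is passed and "{.col}_{.fn}" when a list of functions is passed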
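The default naming behavior of .names can be seen by leaving it unset. The sketch below compares a single function with a named list of one function; the object names single and listed are only illustrative.

```r
library(tibble)
library(dplyr)

DATA <- tribble(
  ~id, ~group, ~x, ~y,
  1, "a", 11, 30,
  2, "a", 21, 20,
  3, "a", 13, 33,
  4, "b", 51, NA,
  5, "b", 44, 20,
  6, "b", NA, 19
)

# A single function: output columns keep the input column names.
single <- DATA |>
  summarize(across(.cols = c(x, y),
                   .fns = ~mean(.x, na.rm = TRUE)))

# A named list of functions: output columns are <column>_<function name>.
listed <- DATA |>
  summarize(across(.cols = c(x, y),
                   .fns = list(mean = ~mean(.x, na.rm = TRUE))))

names(single)  # "x" "y"
names(listed)  # "x_mean" "y_mean"
```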

`dplyr::across()`: Passing Arguments (Cont.)

.cols = c(x, y)
.fns = ~mean(.x, na.rm = TRUE)
.names = NULL (default argument)
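The placeholder in the formula matters: inside across(), .x stands for the current column, whereas a bare x refers to the column named x in the data. The sketch below contrasts the two on the example data; wrong and right are illustrative names.

```r
library(tibble)
library(dplyr)

DATA <- tribble(
  ~id, ~group, ~x, ~y,
  1, "a", 11, 30,
  2, "a", 21, 20,
  3, "a", 13, 33,
  4, "b", 51, NA,
  5, "b", 44, 20,
  6, "b", NA, 19
)

# Bare x: every output column is computed from column x.
wrong <- DATA |>
  summarize(across(.cols = c(x, y), .fns = ~mean(x, na.rm = TRUE)))

# .x: each output column is computed from its own input column.
right <- DATA |>
  summarize(across(.cols = c(x, y), .fns = ~mean(.x, na.rm = TRUE)))

wrong$y  # 28, the mean of x reported under the name y
right$y  # 24.4, the mean of y
```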

`dplyr::across()`: Passing Arguments (Cont.)

DATA |>
  group_by(group) |>
  summarize(across(.cols = c(x, y), 
                   .fns = ~mean(.x, na.rm = TRUE)
                   )
            )

# A tibble: 2 × 3
  group     x     y
  <chr> <dbl> <dbl>
1 a      15    27.7
2 b      47.5  19.5

`dplyr::across()`: Passing Arguments (Cont.)

Passing an argument to .names, .names = "{col}_{fn}":

DATA |>
  group_by(group) |>
  summarize(across(.cols = c(x, y), 
                   .fns = ~mean(.x, na.rm = TRUE),
                   .names = "{col}_{fn}"
                   )
            )

# A tibble: 2 × 3
  group   x_1   y_1
  <chr> <dbl> <dbl>
1 a      15    27.7
2 b      47.5  19.5

`dplyr::across()`: Passing Arguments (Cont.)

Or pass the column names as a character vector to .cols: .cols = c("x", "y")

DATA |>
  group_by(group) |>
  summarize(across(.cols = c("x", "y"), 
                   .fns = ~mean(.x, na.rm = TRUE),
                   .names = "{col}_{fn}"
                   )
            )

# A tibble: 2 × 3
  group   x_1   y_1
  <chr> <dbl> <dbl>
1 a      15    27.7
2 b      47.5  19.5

`dplyr::across()`: Passing Arguments (Cont.)

Or pass a character vector stored in an object to .cols:

Wrap the object in all_of() or any_of() for variable selection
.cols = all_of(summarize_these)
.cols = summarize_these on its own will produce a warning
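The two helpers differ in how they treat names that are not in the data: all_of() is strict and any_of() is lenient. The sketch below illustrates this with a hypothetical vector, wanted, that includes a column z the example data does not have.

```r
library(tibble)
library(dplyr)

DATA <- tribble(
  ~id, ~group, ~x, ~y,
  1, "a", 11, 30,
  2, "a", 21, 20,
  3, "a", 13, 33,
  4, "b", 51, NA,
  5, "b", 44, 20,
  6, "b", NA, 19
)

wanted <- c("x", "y", "z")  # "z" is hypothetical and absent from DATA

# any_of() silently skips names that are missing from the data.
lenient <- DATA |>
  summarize(across(.cols = any_of(wanted), .fns = ~mean(.x, na.rm = TRUE)))
names(lenient)  # "x" "y"

# all_of() raises an error when any name is missing; catch it for display.
strict <- tryCatch(
  DATA |>
    summarize(across(.cols = all_of(wanted), .fns = ~mean(.x, na.rm = TRUE))),
  error = function(e) "error: column not found"
)
strict
```

Choosing all_of() surfaces typos in the vector early, while any_of() suits a shared vector applied to data frames that may lack some columns.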

`dplyr::across()`: Passing Arguments (Cont.)

summarize_these <- c("x", "y")  # create the vector to pass

DATA |>
  group_by(group) |>
  summarize(across(.cols = any_of(summarize_these), 
                   .fns = ~mean(.x, na.rm = TRUE),
                   .names = "{col}_{fn}"
                   )
            )

# A tibble: 2 × 3
  group   x_1   y_1
  <chr> <dbl> <dbl>
1 a      15    27.7
2 b      47.5  19.5

Defining a vector of variables can be a helpful solution when you have multiple summary tables for which you use the same variables.

Annoyances with `across()`

More complicated
Although {col} is useful (e.g., x and y), {fn} resolves to the function's position (e.g., 1), which does not identify the function

Using Lists

Lists can hold objects of any type as their elements
e.g., vectors, data frames, functions

Using Lists: Example of a List

my_list <- 
  list(
    num_vect = c(1, 2),
    char_vect = c("2", "3", "5"),
    dataframe = DATA,
    afunction = function(x) {mean(x)}
    )

Silly Example

Call the function element
Apply it to the numeric vector element

Silly Example (Cont.)

Obtain the mean (i.e., afunction) of the numeric vector (i.e., num_vect):

my_list$afunction(my_list$num_vect)

[1] 1.5
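Because functions are ordinary list elements, a list of them can be applied to the same vector one after another, which is roughly what a list passed to .fns asks across() to do for each column. The base R sketch below uses hypothetical names funs and values.

```r
# A named list whose elements are functions.
funs <- list(mean = mean, median = median)

# A small numeric vector to summarize.
values <- c(1, 2, 10)

# Call each function in the list on the same vector;
# the result is a numeric vector named after the list elements.
results <- sapply(funs, function(f) f(values))

names(results)         # "mean" "median"
results[["median"]]    # 2
```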

Passing a List of Functions to `.fns`

.cols = c(x, y)
.fns = list(~mean(.x, na.rm = TRUE))

Passing a List of Functions to `.fns` (Cont.)

DATA |>
  group_by(group) |>
  summarize(across(.cols = c(x, y), 
                   .fns = list(~mean(.x, na.rm = TRUE)),
                   .names = "{col}_{fn}"
                   )
            )

# A tibble: 2 × 3
  group   x_1   y_1
  <chr> <dbl> <dbl>
1 a      15    27.7
2 b      47.5  19.5

Passing a List of Functions to `.fns` (Cont.)

The ~ creates a lambda-like (anonymous) function that across() applies to each selected column in turn. In list(~mean(.x, na.rm = TRUE)), the .x is a placeholder, not a column named x in the data frame: it stands for the values of whichever variable passed to .cols is currently being summarized. Here, .x is first x and then y.
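The formula shorthand is one of several ways to write the same anonymous function. The sketch below compares it with function() and with the \(x) shorthand, which, like the |> pipe used in these slides, assumes R 4.1 or later. The argument name col is arbitrary.

```r
library(tibble)
library(dplyr)

DATA <- tribble(
  ~id, ~group, ~x, ~y,
  1, "a", 11, 30,
  2, "a", 21, 20,
  3, "a", 13, 33,
  4, "b", 51, NA,
  5, "b", 44, 20,
  6, "b", NA, 19
)

# Formula shorthand: .x is the current column.
with_formula <- DATA |>
  summarize(across(c(x, y), ~mean(.x, na.rm = TRUE)))

# Ordinary anonymous function: col is the current column.
with_function <- DATA |>
  summarize(across(c(x, y), function(col) mean(col, na.rm = TRUE)))

# Backslash shorthand for an anonymous function (R >= 4.1).
with_lambda <- DATA |>
  summarize(across(c(x, y), \(col) mean(col, na.rm = TRUE)))

# All three produce the same column means.
unlist(with_formula)
unlist(with_function)
unlist(with_lambda)
```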

Same Annoyances

More complicated (requires remembering the function, the list, and the ~)
Although {col} is useful (e.g., x and y), {fn} resolves to the function's position (e.g., 1), which does not identify the function

Giving the List Elements Names

.cols = c(x, y)
.fns = list(some_name = ~mean(.x, na.rm = TRUE))

Giving the List Elements Names (Cont.)

DATA |>
  group_by(group) |>
  summarize(across(.cols = c(x, y), 
                   .fns = list(some_name = ~mean(.x, na.rm = TRUE)),
                   .names = "{col}_{fn}"
                   )
            )

# A tibble: 2 × 3
  group x_some_name y_some_name
  <chr>       <dbl>       <dbl>
1 a            15          27.7
2 b            47.5        19.5
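Once the list elements are named, the glue pattern in .names can arrange the pieces freely. The sketch below uses a hypothetical element name, avg, and a hypothetical pattern, "{.fn}_of_{.col}", to put the function name first.

```r
library(tibble)
library(dplyr)

DATA <- tribble(
  ~id, ~group, ~x, ~y,
  1, "a", 11, 30,
  2, "a", 21, 20,
  3, "a", 13, 33,
  4, "b", 51, NA,
  5, "b", 44, 20,
  6, "b", NA, 19
)

# {.fn} is replaced by the list element name, {.col} by the column name.
renamed <- DATA |>
  group_by(group) |>
  summarize(across(.cols = c(x, y),
                   .fns = list(avg = ~mean(.x, na.rm = TRUE)),
                   .names = "{.fn}_of_{.col}"
                   )
            )

names(renamed)  # "group" "avg_of_x" "avg_of_y"
```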

Annoyances

Still complicated (requires remembering the function, the list, and the ~)

Passing a List Object to `.fns`

Create summary_funcs, a list of function(s)
.fns = summary_funcs

Passing a List Object to `.fns` (Cont.)

Create a list containing the ~mean(.x, na.rm = TRUE) used in the previous example:

summary_funcs <- list(
  mean = ~mean(.x, na.rm = TRUE)  
  )

Passing a List Object to `.fns` (Cont.)

Then pass to .fns, .fns = summary_funcs:

DATA |>
  summarize(across(.cols = c(x, y), 
                   .fns = summary_funcs,
                   .names = "{col}_{fn}"
                   )
            )

# A tibble: 1 × 2
  x_mean y_mean
   <dbl>  <dbl>
1     28   24.4

Passing a List Object to `.fns` (Cont.)

Pair with .cols = all_of(summarize_these) to summarize the variables in summarize_these using the function(s) in summary_funcs:

DATA |>
  summarize(across(.cols = all_of(summarize_these),
                   .fns = summary_funcs,
                   .names = "{col}_{fn}"
                   )
            )
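With both objects defined once, the same pair can drive more than one summary table. The sketch below reuses them for an overall and a grouped summary, wrapping the character vector in all_of() as the earlier slide recommends; overall and by_group are illustrative names.

```r
library(tibble)
library(dplyr)

DATA <- tribble(
  ~id, ~group, ~x, ~y,
  1, "a", 11, 30,
  2, "a", 21, 20,
  3, "a", 13, 33,
  4, "b", 51, NA,
  5, "b", 44, 20,
  6, "b", NA, 19
)

summarize_these <- c("x", "y")                         # columns to summarize
summary_funcs <- list(mean = ~mean(.x, na.rm = TRUE))  # how to summarize them

# Overall summary: one row.
overall <- DATA |>
  summarize(across(.cols = all_of(summarize_these),
                   .fns = summary_funcs,
                   .names = "{col}_{fn}"))

# Grouped summary: one row per group, same columns and functions.
by_group <- DATA |>
  group_by(group) |>
  summarize(across(.cols = all_of(summarize_these),
                   .fns = summary_funcs,
                   .names = "{col}_{fn}"))

names(overall)   # "x_mean" "y_mean"
by_group$x_mean  # 15.0 47.5
```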

Annoyances/Benefits

Requires creating other objects
Simplifies the code
Does not require remembering the functions, the list, and the ~ once the list has been created

Adding More Functions as List Elements

Add functions to the list to accomplish more

summary_funcs <- list(
  mean = ~mean(.x, na.rm = TRUE),
  median = ~median(.x, na.rm = TRUE),
  sd = ~sd(.x, na.rm = TRUE),
  n = ~length(na.omit(.x))  # no na.rm parameter in length()
  )
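The n element counts non-missing values by dropping NA before taking the length. A common alternative counts the values that are not NA directly. The base R sketch below checks that both agree on a small hypothetical vector, values.

```r
# A vector with one missing value (the x values for group "b").
values <- c(51, 44, NA)

# Approach used in summary_funcs: drop NA values, then count what is left.
n_omit <- length(na.omit(values))

# Alternative: count the TRUE values of "is not NA".
n_sum <- sum(!is.na(values))

n_omit  # 2
n_sum   # 2
```

Either form could be used in the list, for example n = ~sum(!is.na(.x)).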

Passing the List Object to `.fns`: Grouping

group_by(group)
.fns = summary_funcs

Passing the List Object to `.fns`: Grouping (Cont.)

DATA |>
  group_by(group) |>
  summarize(across(.cols = c(x, y), 
                   .fns = summary_funcs,
                   .names = "{col}_{fn}"
                   )
            )

# A tibble: 2 × 9
  group x_mean x_median  x_sd   x_n y_mean y_median  y_sd   y_n
  <chr>  <dbl>    <dbl> <dbl> <int>  <dbl>    <dbl> <dbl> <int>
1 a       15       13    5.29     3   27.7     30   6.81      3
2 b       47.5     47.5  4.95     2   19.5     19.5 0.707     2

Annoyances/Benefits

Creating and remembering the list object name
Solution: Add an .R script to your functions directory and make a code snippet for the object name.
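One way to set this up is to keep the list in its own script and source() it wherever it is needed. The sketch below is only illustrative: the file name summary_funcs.R is hypothetical, and a temporary directory stands in for a project's functions directory so the example runs anywhere.

```r
# Hypothetical location; a temporary directory stands in for the
# functions directory of a real project.
script_path <- file.path(tempdir(), "summary_funcs.R")

# Write the script once. In practice this file would be edited by hand.
writeLines(c(
  "summary_funcs <- list(",
  "  mean   = ~mean(.x, na.rm = TRUE),",
  "  median = ~median(.x, na.rm = TRUE)",
  ")"
), script_path)

# Load the list object into the session from the script.
source(script_path)

names(summary_funcs)  # "mean" "median"
```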
