Data subsets and summaries

Author

Gabriel I. Cook

Published

October 31, 2024

Overview

This module demonstrates how to use {dplyr} to subset data, mutate variables, and summarize variables in data frames. Many data summaries involve creating group-level statistics so we will also cover grouping across variables. {lubridate} will be used to handle time vectors. Other functions covered with examples include openxlsx::read.xlsx(), readRDS() , saveRDS() , across(), relocate(), arrange(), and list() (for passing a list object to summarize()). Some elements are advanced topics related to preparing data in order to be summarized.

To Do

Watch video assignment before class for points and for being prepared for class exercises. Bring questions to class. Come prepared to tackle an exercise.

Supplemental Readings

Task

Watch associated video(s) and review module content as necessary for exercises, homework, or projects.

Libraries

  • {here}: 1.0.1: for path management
  • {dplyr} 1.1.4: for selecting, filtering, and mutating
  • {lubridate} 1.9.3: for handling date and time vectors
  • {openxlsx} 4.2.5.2: for reading .xlsx worksheets

External Functions

Provided:

view_html(): for viewing data frames in html format, from /src/functions/view_html.R

R.utils::sourceDirectory(here::here("src", "functions"))

Libraries

We will work with a few different libraries for data manipulation. Let’s load them into our work space using library().

library(dplyr)
library(stringr)
library(lubridate)

Reading and Looking at Data

In order to perform any data summary, you need data. We will use raw data files named "cms-top-all-time-2023-swim.csv" and "cms-top-all-time-2023-swim.xlsx" which can be located and downloaded from the Data tab above. You should download these to /data/raw. We could just access the file from the full URL path but we will later want to save a cleaned version of the file for later use. Modifying the file name will be easier than hard coding the path in both places. Thus, we will use here() to build the file path and pass to the relevant functions. There is also additional content involving mutate() in order to prepare the data file for summarizing.

Reading .csv files

This example is a walk through using the raw data file.

We will use here() to build and pass the file string as the argument to file and assign to an object named DAT. If column names are not in the file, modify the header argument. You could use dplyr::read_csv() also if you desire.

DAT <- read.csv(file = here::here("data", "raw", "cms-top-all-time-2023-swim.csv"))

Reading .xlsx files

This example is a walk through using the raw data file.

We will use here() to build and pass the file string to the xlxsFile argument and also pass sheet = "swim" so the swim worksheet is read from the file. If {openxlsx} is not loaded, call the function with reference to the library and assign to an object named DAT.

DAT <- openxlsx::read.xlsx(
  xlsxFile = here::here("data", "raw", "cms-top-all-time-2023-swim.xlsx"),
  sheet = "swim"
  )

Looking at Data

If data wrangling and cleaning skills are not needed, jump to summarizing. In order to prepare a data frame for summarizing, however, you need to ensure it is cleaned and saved.

View its contents:

head(DAT)    # or view_html(DAT)
   time             name year   event   team
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31    Ava Sealander 2022 50 FREE Athena
3 23.49        Kelly Ngo 2016 50 FREE Athena
4 23.71        Helen Liu 2014 50 FREE Athena
5 23.76      Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena

We have several variables: time, name, year, event, team. Some appear to be numeric and some are characters. Passing the data frame to dplyr::glimpse() will provide more detail (see also str()).

dplyr::glimpse(DAT) # or DAT |> dplyr::glimpse()
Rows: 440
Columns: 5
$ time  <chr> "23.29", "23.31", "23.49", "23.71", "23.76", "23.77", "23.77", "…
$ name  <chr> "Jocelyn Crawford", "Ava Sealander", "Kelly Ngo", "Helen Liu", "…
$ year  <chr> "2019", "2022", "2016", "2014", "2014", "2020", "2020", "2010", …
$ event <chr> "50 FREE", "50 FREE", "50 FREE", "50 FREE", "50 FREE", "50 FREE"…
$ team  <chr> "Athena", "Athena", "Athena", "Athena", "Athena", "Athena", "Ath…

Hmm, looks like there are <chr> and <dbl> variables and some numbers are in quotes. Objects in quotes are strings. So we need to clean up the file before we can even get a summary. Other than looking at the structure of data frame using glimpse() or str(), you can examine vectors individually.

When vectors are in data frames, you can reference them using the $ notation: my_dataframe$my_variable. Passing this to is.numeric() will also tell you whether the vector is numeric by returning either TRUE or FALSE.

is.numeric(DAT$time)
[1] FALSE

OK, so this variable is definitely not numeric. Looking at the file, we need to convert at least one variable that should be numbers from character to numeric.

Modifying a Data Frame or Tibble Using {dplyr}

Use dplyr::mutate() to create, modify, and delete column variables. At every least, you will need to pass a data frame as the first argument, .data and then a name-value pair as the second argument. There are other arguments that are experimental which will not be the focus of this module.

mutate(
  .data,
  ...,
  .by = NULL,
  .keep = c("all", "used", "unused", "none"),
  .before = NULL,
  .after = NULL
)

We will use mutate() in conjunction with Base R’s piping operator, |>.

Creating a new variable

We will pass the data frame into mutate() and then specify a name-value variable pair. In order to keep the print out manageable, we will also use the slice() function.

mutate(data_frame, 
       new_variable_name = variable
       )

Create new variables that are set to a constant number or string:

DAT |>
  slice(1:5) |>   # rows 1 through 5
  mutate(newvar1 = 9999) |>
  mutate(newvar2 = "Student") 
   time             name year   event   team newvar1 newvar2
1 23.29 Jocelyn Crawford 2019 50 FREE Athena    9999 Student
2 23.31    Ava Sealander 2022 50 FREE Athena    9999 Student
3 23.49        Kelly Ngo 2016 50 FREE Athena    9999 Student
4 23.71        Helen Liu 2014 50 FREE Athena    9999 Student
5 23.76      Michele Kee 2014 50 FREE Athena    9999 Student

You can see that each row in the data frame will take on the paired value.

Modifying a new variable

New variables are modified using the same name-value pairing approach. When you modify a variable, you are taking an existing variable to setting it to another value.

Set an existing variable equal to a constant

Just use an existing variable name (left of = in name-value pair).

DAT |>
  slice(1:5) |>         # rows 1 through 5
  mutate(time = 1) |>
  mutate(name = "0") 
  time name year   event   team
1    1    0 2019 50 FREE Athena
2    1    0 2022 50 FREE Athena
3    1    0 2016 50 FREE Athena
4    1    0 2014 50 FREE Athena
5    1    0 2014 50 FREE Athena

OK, that’s not very helpful. We just replaced our existing variables with nothing useful. You can see that name is still a <chr> type.

Set an existing variable equal to another value

As long as {dplyr} can result the character elements of the vector, as.numeric() will convert the character strings to numbers. For example:

as.numeric(c("1", "3.2", "6.99"))
[1] 1.00 3.20 6.99

We can illustrate in a data frame by creating character value that will serve as the constant, and use as.numeric() just to illustrate this example.

DAT |>
  slice(1:5) |>            # rows 1 through 5
  mutate(name = as.numeric("0")) 
   time name year   event   team
1 23.29    0 2019 50 FREE Athena
2 23.31    0 2022 50 FREE Athena
3 23.49    0 2016 50 FREE Athena
4 23.71    0 2014 50 FREE Athena
5 23.76    0 2014 50 FREE Athena

And now name is a <dbl>, which is a type of numeric. We can see this by selecting columns from the data frame where() the variable is.numeric().

DAT |>
  slice(1:5) |>   # rows 1 through 5
  mutate(name = as.numeric("0")) |>
  select(where(is.numeric))
  name
1    0
2    0
3    0
4    0
5    0

But if we try to convert time to numeric this way, you will see that the complex numbers will be converted to NAs, or missing.

as.numeric(DAT$time)
Warning: NAs introduced by coercion
  [1] 23.29 23.31 23.49 23.71 23.76 23.77 23.77 23.87 23.93 24.02 51.05 51.24
 [13] 51.41 51.56 51.56 51.88 52.05 52.05 52.14 52.17    NA    NA    NA    NA
 [25]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [37]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [49]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [61] 55.66 55.67 55.91 56.11 56.74 56.83 57.18 57.36 57.47 57.56    NA    NA
 [73]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [85]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [97]    NA    NA    NA    NA 54.76 54.92 54.93 55.23 55.74 56.04 56.27 56.42
[109] 56.47 56.56    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[121]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA 22.69 22.92
[133] 22.93 22.95 23.27 23.31 23.33 23.38 23.45 23.47 50.65 50.67 50.92 51.19
[145] 51.27 51.28 51.29 51.37 51.45 51.56    NA    NA    NA    NA    NA    NA
[157]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[169]    NA    NA 25.94 26.22 26.28 26.57 26.82 26.90 27.14 27.14 27.16 27.17
[181] 28.33 28.96 29.05 29.06 29.09 29.09 29.24 29.26 29.46 29.55 24.05 24.28
[193] 24.58 24.59 24.65 24.85 25.05 25.08 25.24 25.34 54.65 54.81 54.91 55.11
[205] 55.13 55.25 55.27 55.45 55.62 56.21    NA    NA    NA    NA    NA    NA
[217]    NA    NA    NA    NA 19.98 20.21 20.22 20.36 20.51 20.65 20.69 20.71
[229] 20.79 20.82 44.06 44.21 44.73 44.94 45.24 45.31 45.32 45.45 45.50 45.50
[241]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[253]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[265]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[277]    NA    NA    NA    NA 46.99 47.57 49.32 49.97 50.03 50.29 50.35 50.41
[289] 50.51 50.59    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[301] 54.88 55.80 55.92 56.03 56.27 56.35 56.36 56.45 56.49 56.49    NA    NA
[313]    NA    NA    NA    NA    NA    NA    NA    NA 47.45 47.56 47.80 48.74
[325] 48.91 49.26 49.31 49.34 49.68 49.74    NA    NA    NA    NA    NA    NA
[337]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[349]    NA    NA 19.47 19.63 19.96 20.08 20.10 20.14 20.18 20.22 20.25 20.28
[361] 43.28 43.69 43.74 44.43 44.57 44.59 44.81 44.81 44.83 44.87    NA    NA
[373]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[385]    NA    NA    NA    NA    NA    NA 22.32 22.50 22.84 23.00 23.24 23.45
[397] 23.72 23.74 23.85 23.87 24.73 24.81 24.88 25.10 25.20 25.47 25.55 25.63
[409] 26.01 26.05 20.60 21.38 21.48 21.58 21.60 21.81 21.83 22.00 22.06 22.07
[421] 47.01 47.72 48.39 48.50 48.54 48.82 49.00 49.45 49.52 49.60 54.52 55.15
[433] 55.21 55.47 55.62 56.04 56.13 56.19 56.35 56.48

The problem we have is that the data are not in a clean form. In this data frame, some elements of time are composed of numbers, decimals, and colons (e.g., x.xx, xx:xx.xx, etc.). which all make up elements that would be numbers.

Reading Processed Data

Once you have modified a data frame and saved in a processed directory, you will want to read it for other project tasks. Although the DAT object is available in this case, it will not be available in other scripts. This example demonstrates reading the processed file from /data/processed and assigning to an object. Because this file is a .Rds file, we will use readRDS() to read it.

DAT <- readRDS(file = here::here("data", "processed", "cms-top-all-time-2023-swim.Rds"))

NOTE: Clearly, the DAT object is the same as it was before reading it but you typically will not have objects available in .R scripts unless the script defines it. The main exception is when a .R script reads another script file (aka source()) that defines the object.

Quick Summary Using base R

If you want a quick summary of data, summary() will provide some basic information for you. You can pass a data frame to the function.

summary(DAT)
      time            name               year              event          
 Min.   :  61.1   Length:440         Length:440         Length:440        
 1st Qu.: 123.9   Class :character   Class :character   Class :character  
 Median :1115.7   Mode  :character   Mode  :character   Mode  :character  
 Mean   :1309.4                                                           
 3rd Qu.:2721.0                                                           
 Max.   :3476.0                                                           
     team          
 Length:440        
 Class :character  
 Mode  :character  
                   
                   
                   

Summarizing Data Using {dplyr}

However, there are many ways to summarize data. To introduce data summary techniques using {dplyr}, we will open DAT and use dplyr::summarise() or dplyr::summarize() to summarize the data. The summarise() function works similar to mutate() insofar as it creates variables but it differs insofar as the data frame returned from summarise() contains only the variable(s) referenced in as part of that summary process. This includes some function calls immediately prior to summarize(). In the example below, we summarize by creating a new variable which is set to represent some data summary technique. In essence, summarizing is for descriptive statistics. Using mean(), we can summarize the data by taking the mean of the time variable.

summarize(.data, 
   ..., 
   .by = NULL, 
   .groups = NULL
   )
DAT |>
  summarise(mean = mean(time, na.rm = T))
      mean
1 1309.388

Notice what is returned is a single value reflecting the mean of all the data in the data frame. We could have obtained the same without using {dplyr}.

mean(DAT$time, na.rm = T)        # $ notation
[1] 1309.388

But we lose flexibility of easily adding new summary procedures. We can summarize by two name-value pair functions.

DAT |>
  summarise(mean = mean(time, na.rm = T),
            sd   = sd(time, na.rm = T)
            )
      mean       sd
1 1309.388 1225.964

Now there is a mean and standard deviation for price. You can also add the sample size using dplyr::n().

DAT |>
  summarise(mean = mean(time, na.rm = T),
            sd   = sd(time, na.rm = T),
            n    = n()
            )
      mean       sd   n
1 1309.388 1225.964 440

Summarizing across() Multiple Variables

Summarizing a single variable is useful but if you want to summarize by many, you likely do not want to code a new line for each variable. In such cases, you can use across() as a helper function as was used for creating new variables with mutate() (see previous lesson).

Parameters/Arguments:

  • .cols: specifies the columns to summarize across
  • .fns: specifies the function(s) for the summary
  • .names: specifies the output variable names

Remember across() will want you to pass the columns to summarize by, .cols, the function for how to summarize, .fns, and the names for how to name the new variables, .names (which will be NULL by default). The .x here stands for passing the vector to the mean function and not the data frame. More on ~ and .x later.

Summarize across by numeric variables:

DAT |>
  summarise(across(.cols = where(is.numeric),
                   .fns  = ~mean(.x, na.rm = TRUE)
                   )
            )
      time
1 1309.388

Well, that’s now actually impressive because there is only one numeric variable. What if we had more that were piped to summarize()?

DAT |>
  mutate(num1 = time,
         num2 = time,
         num3 = time
         ) |>
  summarise(across(.cols = where(is.numeric), 
                   .fns  = ~mean(.x, na.rm = TRUE)
                   )
            )
      time     num1     num2     num3
1 1309.388 1309.388 1309.388 1309.388

That was easy.

Because across() is so powerful, let’s just add another variable to the data frame for using in examples. You might also wish to reorder the position of variables in the data frame so that they are grouped in some way. We can use dplyr::relocate() to accomplish this. We will move the time column to the position .before one of the new variables using relocate(time, .before = min).

Doing so will also show you some ways to create variables.

DAT <- DAT |>
  mutate(sec = time,      # will be redundant with time but named accurately
         min  = time/60,
         hour = time/(60*60)
         ) |>
  relocate(time, .before = sec)

Take a look:

head(DAT)
              name year   event   team time  sec      min      hour
1 Jocelyn Crawford 2019 50 FREE Athena 1409 1409 23.48333 0.3913889
2    Ava Sealander 2022 50 FREE Athena 1411 1411 23.51667 0.3919444
3        Kelly Ngo 2016 50 FREE Athena 1429 1429 23.81667 0.3969444
4        Helen Liu 2014 50 FREE Athena 1451 1451 24.18333 0.4030556
5      Michele Kee 2014 50 FREE Athena 1456 1456 24.26667 0.4044444
6 Natalia Orbach-M 2020 50 FREE Athena 1457 1457 24.28333 0.4047222

Variable names created with across() is controlled using the .names argument. The default is equivalent to .names = {.col}, which means that the name(s) are inherited from the .cols argument; the names are a stand-in for the name specification. If you wish to have control over the names, you can pass a string that that either appends (e.g.,"{.col}suffix") or prepends (e.g.,"prefix{.col}") a string to each column name. This string looks odd because it’s a special glue specification that glues together with a string with an object. We will use this concept later when using the {glue} library.

When modifying .names, include a character like "_" (e.g.,"{.col}_suffix") so that the column names and the appended text are separated, making the name easily legible. If you summarize to create means, a good suggestion is something like (e.g.,"{.col}_mean" or (e.g.,"{.col}_mn")) so that you know the variable is a mean. If you prefer the function name first, you can use a prefix (e.g.,"mean_{.col}").

DAT |>
  summarise(across(.cols = c("sec", "min", "hour"), 
                   .fns  = ~mean(.x, na.rm = TRUE),
                   .names = "{.col}_mean"
                   )
            )
  sec_mean min_mean hour_mean
1 1309.388 21.82313 0.3637189

You can see how all variables in the summary end in "_mean".

You can also glue the function and the column names together by passing .names = "{.col}_{.fn}".

DAT |>
  summarise(across(.cols = c("sec", "min", "hour"), 
                   .fns  = ~mean(.x, na.rm = TRUE),
                   .names = "{.col}_{.fn}"
                   )
            )
     sec_1    min_1    hour_1
1 1309.388 21.82313 0.3637189

You you will see there is a number that is appended to the variable name. This is because there is only one function passed to .fns. You can pass more using a special object called a list (see ?list). Unlike vectors, elements of lists need not be the same kind. Elements of lists can combinations of characters, numbers, data frames, functions, etc.

Passing multiple functions from a list() in across()

Passing functions as a list requires a little fancy coding. We will pass two functions as a list so that we can calculate both the mean() and the sd() on the variables passed to across().

A list is a special object (e.g., container) for which its elements can be different types of objects. Whereas elements of vectors can be only character or only numeric, elements of lists can hold different object. One element can be a numeric vector, another element a data frame, another element a character vector, etc. Many functions used in R will actually return lists for which elements contain different types of objects.

OK back to two or more functions. If you pass a list() with arguments for the mean and the sd (e.g., list(mean, sd), you can summarize by both. If you want to prevent errors (yes you do) and want to keep the summaries separate (probably so), you can modify .names to pass both the column and the function (e.g., "{.col}_{.fn}"). The underscore is not needed here; it only helps with readability of the variables so that you don’t end up with variable names like var1mean but instead var1_mean.

Let’s pass the summary procedures as a list to include measures of mean and standard deviation for the variables.

DAT |>
  summarise(across(.cols  = c("sec", "min", "hour"), 
                   .fns   = list(mean, sd),
                   .names = "{.col}_{.fn}")
            )
     sec_1    sec_2    min_1    min_2    hour_1    hour_2
1 1309.388 1225.964 21.82313 20.43274 0.3637189 0.3405456

Well those are not exactly the names we want but it illustrates how names are created. Because we have two summary functions for each column variable passed to across(), they are enumerated according to the order in the list (e.g., mean then standard deviation).

Fixing .names when passing lists to .cols in across()

Enumeration is not helpful for remembering variable names. There are different ways to do fix this problem, some of which may be better under certain scenarios. You have to determine what approach is best but I’ll lay out some limitations. If you pass only the functions into the list, then when you pass {.fn} to .names, the variable names in the returned data frame will take on a numeric value representing the order/element position of the functions as you entered them in the list. In this coding instance, means would be named with"_1" and standard deviation names with "_2". This approach, however, leads to confusing variable names because you have to remember which is 1 and which is 2 and of course explain this to anyone with whom you share the data. Let’s take a look.

A better approach could be to assign the mean and sd functions their own names in the list() function call. By doing so, the name is appended and the new variable is named meaningfully.

Let’s modify what we pass to .fns by passing a list containing 3 functions (e.g., mean(), sd(), and length()) and give each there name. I know this part is confusing because the () are dropped inside the list. This is just how R works. Don’t blame the messenger.

DAT |>
  summarise(across(.cols = c("sec", "min", "hour"), 
                   .fns  = list(mean = mean, 
                                sd = sd,
                                n = length
                                ),
                   .names = "{.col}_{.fn}")
            )
  sec_mean   sec_sd sec_n min_mean   min_sd min_n hour_mean   hour_sd hour_n
1 1309.388 1225.964   440 21.82313 20.43274   440 0.3637189 0.3405456    440

Importantly, however, certain functions like mean() will operate in ways you might not expect. One one hand, it does what we expect when all elements can be used to calculate the mean.

mean(DAT$time)
[1] 1309.388

On the other hand, if there is a missing value, it does not computer the mean but instead something else. Let’s add an NA to the vector using c() to see what happens.

mean(c(DAT$time, NA))
[1] NA

Understanding NAs when passing lists to .cols in across()

The mean() function returns NA rather than a mean. If there is just one NA, mean() returns NA. By design this is actually good.

Let’s also try sd() for the standard deviation of a vector:

sd(c(DAT$time, NA))
[1] NA

The median of a vector, median():

median(c(DAT$time, NA))
[1] NA

The length of a vector, length():

length(c(DAT$time, NA))
[1] 441

Well, that’s interesting. By default length() will return the number of elements of the vector including NAs but by default mean() will not return the mean of a vector with NAs because na.rm = FALSE by default. If you wish to calculate the mean by removing the NAs, pass na.rm = TRUE.

mean(c(DAT$time, NA), na.rm = T)
[1] 1309.388

Make note, however, that the length of this vector without NAs is shorter than the length with NAs. We can test this hypothesis on a vector with and without the NA by using na.omit() to omit any of them. Using our vector we added an NA, let’s omit it.

na.omit(c(DAT$time, NA))
  [1] 1409.00 1411.00 1429.00 1451.00 1456.00 1457.00 1457.00 1467.00 1473.00
 [10] 1442.00 3065.00 3084.00 3101.00 3116.00 3116.00 3148.00 3125.00 3125.00
 [19] 3134.00 3137.00  110.30  111.89  112.40  112.55  112.57  112.67  112.80
 [28]  112.83  112.88  112.89  297.37  299.22  299.78  300.06  300.21  300.47
 [37]  301.17  301.99  302.16  302.23  614.33  615.40  622.59  622.84  624.14
 [46]  624.82  626.34  626.89  633.15  636.60 1018.48 1023.21 1026.78 1029.09
 [55] 1029.26 1032.18 1032.46 1033.58 1041.98 1044.49 3366.00 3367.00 3391.00
 [64] 3371.00 3434.00 3443.00 3438.00 3456.00 3467.00 3476.00  119.91  122.10
 [73]  123.73  123.85  123.92  125.10  125.56  125.71  125.76  126.13   61.84
 [82]   63.28   63.51   63.91   64.16   64.19   64.65   64.66   64.77   64.94
 [91]  134.83  135.62  138.99  139.06  139.78  140.01  141.17  141.77  142.34
[100]  142.83 3316.00 3332.00 3333.00 3323.00 3374.00 3364.00 3387.00 3402.00
[109] 3407.00 3416.00  121.84  122.58  123.80  124.05  124.42  124.81  126.40
[118]  127.12  127.33  127.42  120.69  123.59  123.79  124.74  125.41  126.82
[127]  127.12  127.14  127.81  127.94 1389.00 1412.00 1413.00 1415.00 1407.00
[136] 1411.00 1413.00 1418.00 1425.00 1427.00 3065.00 3067.00 3092.00 3079.00
[145] 3087.00 3088.00 3089.00 3097.00 3105.00 3116.00  110.91  110.98  111.39
[154]  111.41  111.56  111.64  111.95  112.53  112.78  113.04  255.73  267.18
[163]  270.33  270.77  271.96  272.45  272.57  272.68  274.16  274.27 1594.00
[172] 1582.00 1588.00 1617.00 1642.00 1650.00 1634.00 1634.00 1636.00 1637.00
[181] 1713.00 1776.00 1745.00 1746.00 1749.00 1749.00 1764.00 1766.00 1786.00
[190] 1795.00 1445.00 1468.00 1498.00 1499.00 1505.00 1525.00 1505.00 1508.00
[199] 1524.00 1534.00 3305.00 3321.00 3331.00 3311.00 3313.00 3325.00 3327.00
[208] 3345.00 3362.00 3381.00   61.10   62.88   62.89   63.67   64.10   64.16
[217]   64.24   64.26   64.35   64.43 1238.00 1221.00 1222.00 1236.00 1251.00
[226] 1265.00 1269.00 1271.00 1279.00 1282.00 2646.00 2661.00 2713.00 2734.00
[235] 2724.00 2731.00 2732.00 2745.00 2750.00 2750.00   98.35   98.88   99.07
[244]   99.35   99.63   99.80   99.82  100.30  100.31  100.50  265.67  268.11
[253]  268.89  269.32  271.64  272.52  272.65  272.94  272.98  274.70  554.11
[262]  564.43  575.78  576.64  579.27  580.02  581.48  585.72  586.63  587.09
[271]  917.24  932.19  945.57  947.40  952.94  953.75  956.57  957.89  962.45
[280]  963.38 2859.00 2877.00 2972.00 3037.00 3003.00 3029.00 3035.00 3041.00
[289] 3051.00 3059.00  105.05  105.67  106.51  108.84  109.01  109.38  110.32
[298]  110.43  111.07  111.57 3328.00 3380.00 3392.00 3363.00 3387.00 3395.00
[307] 3396.00 3405.00 3409.00 3409.00  119.90  121.18  121.45  121.60  121.66
[316]  121.77  121.78  122.89  123.19  123.23 2865.00 2876.00 2900.00 2954.00
[325] 2971.00 2966.00 2971.00 2974.00 3008.00 3014.00  103.96  108.70  109.24
[334]  109.95  109.96  110.34  110.47  110.49  110.51  110.76  106.97  108.74
[343]  109.74  110.51  110.78  111.11  111.24  111.48  111.82  111.83 1187.00
[352] 1203.00 1236.00 1208.00 1210.00 1214.00 1218.00 1222.00 1225.00 1228.00
[361] 2608.00 2649.00 2654.00 2683.00 2697.00 2699.00 2721.00 2721.00 2723.00
[370] 2727.00   97.98   98.49   99.09   99.19   99.66  100.21  100.22  100.44
[379]  100.70  100.70  235.61  236.68  236.88  239.02  239.13  239.17  240.63
[388]  241.14  242.38  242.99 1352.00 1370.00 1404.00 1380.00 1404.00 1425.00
[397] 1452.00 1454.00 1465.00 1467.00 1513.00 1521.00 1528.00 1510.00 1520.00
[406] 1547.00 1555.00 1563.00 1561.00 1565.00 1260.00 1298.00 1308.00 1318.00
[415] 1320.00 1341.00 1343.00 1320.00 1326.00 1327.00 2821.00 2892.00 2919.00
[424] 2930.00 2934.00 2962.00 2940.00 2985.00 2992.00 3000.00 3292.00 3315.00
[433] 3321.00 3347.00 3362.00 3364.00 3373.00 3379.00 3395.00 3408.00
attr(,"na.action")
[1] 441
attr(,"class")
[1] "omit"

And then get the length when NAs are omitted:

length(na.omit(c(DAT$time, NA)))
[1] 440

This behavior is important because if you want to obtain the mean of a variable with NAs and the sample size using length(), your sample size will be inaccurate.

In order to see these operations on a data frame and in the context of dplyr::summarize(), let’s modify the data frame to make the values for sec and min in row be NA. The data frame will then contain some missing values. We can use dplyr::case_when() for the conditional rather than if.else() or if_else().

Does this look right?

DAT |>
  mutate(across(.cols = c(sec, min),        # for these columns
                .fns = ~case_when(
                  row_number() == 1 ~ NA,   # make this one NA
                  row_number() != 1 ~ .x    # keep it original 
                  )
                )
         ) |>
  head()
              name year   event   team time  sec      min      hour
1 Jocelyn Crawford 2019 50 FREE Athena 1409   NA       NA 0.3913889
2    Ava Sealander 2022 50 FREE Athena 1411 1411 23.51667 0.3919444
3        Kelly Ngo 2016 50 FREE Athena 1429 1429 23.81667 0.3969444
4        Helen Liu 2014 50 FREE Athena 1451 1451 24.18333 0.4030556
5      Michele Kee 2014 50 FREE Athena 1456 1456 24.26667 0.4044444
6 Natalia Orbach-M 2020 50 FREE Athena 1457 1457 24.28333 0.4047222

Yes, row one has NA values, so let’s clobber DAT with this new version. Remove the pipe to head() of course.

DAT <-
  DAT |>
  mutate(across(.cols = c(sec, min),        # for these columns
                .fns = ~case_when(
                  row_number() == 1 ~ NA,   # make this one NA
                  row_number() != 1 ~ .x    # keep it original 
                  )
                )
         )

Comparing some functionality when passing lists to .cols in across()

When functions do not contain arguments for dealing with NAs, there is na.omit(), a function that takes an object and removes NAs. You can just pass the variable to na.omit() and then wrap it in the metric function of interest. Also, because na.rm = T cannot be used for length(), na.omit() offers consistency across all functions and as a result, I believe, less confusion.

Unfortunately, accomplishing this task can be rather tricky and requires some new syntax. This requires usage of what’s called a “lambda” technique. You will want to incorporate ~ and .x into your code. The ~ is used to indicate that you are supplying a lambda function and use of .x is to indicate where the variable in across() is used. Using this type of syntax, we can pass functions to the .fns argument that operate across a set of variables. The ?across() documentation calls this “a {purrr}-style lambda” in the arguments section. This approach can be a little bit confusing, so I’m going to show you an example, and then walk through it step by step. You can always create code snippets so you don’t have to rely on memory write complicated code like this.

Anyway, we will precede the function with ~ and reference the vector using .x. Let’s do this and change the .fns argument slightly.

Here is a general example:

name = ~function(na.omit(.x))

We will summarize only time and sec because those variables are identical except for the row we added. We will also add dplyr::n() to see what’s going on with that function.

DAT |>
  summarise(across(.cols = c("time", "sec"), 
                   .fns  = list(mean = ~mean(na.omit(.x)),
                                len  = ~length(na.omit(.x)),
                                n    = ~dplyr::n()
                                ),
                   .names = "{.col}_{.fn}"
                   )
            ) 
  time_mean time_len time_n sec_mean sec_len sec_n
1  1309.388      440    440 1309.161     439   440

So what happened? The means and lengths for time and sec are not the same. Means differ because they are calculated by different values depending on the presence of NAs. But notice that the n’s are the same based on dplyr::n(). How can the means and differ if the n’s are the same?

So what’s the point of all of this? Well, you need to be careful not to apply functions and assume they are doing what you believe you are doing. You always need to be smarter than the code you use. Also, there is no single answer for dealing with data. Sometimes one approach will be appropriate and in other instances another approach will be. You as the data scientist need to know that there are different methods so that you an decide where to apply those different methods.

Summarize by Groups Using group_by()

Identifying how to group

When you have subgroups in your data, you will often want to create summary statistics by those group levels. A typical grouping approach is by some categorical or factor variable present in a data frame. Using glympse(), we can view all variables to see what might be of interest.

glimpse(DAT)
Rows: 440
Columns: 8
$ name  <chr> "Jocelyn Crawford", "Ava Sealander", "Kelly Ngo", "Helen Liu", "…
$ year  <chr> "2019", "2022", "2016", "2014", "2014", "2020", "2020", "2010", …
$ event <chr> "50 FREE", "50 FREE", "50 FREE", "50 FREE", "50 FREE", "50 FREE"…
$ team  <chr> "Athena", "Athena", "Athena", "Athena", "Athena", "Athena", "Ath…
$ time  <dbl> 1409.00, 1411.00, 1429.00, 1451.00, 1456.00, 1457.00, 1457.00, 1…
$ sec   <dbl> NA, 1411.00, 1429.00, 1451.00, 1456.00, 1457.00, 1457.00, 1467.0…
$ min   <dbl> NA, 23.516667, 23.816667, 24.183333, 24.266667, 24.283333, 24.28…
$ hour  <dbl> 0.39138889, 0.39194444, 0.39694444, 0.40305556, 0.40444444, 0.40…

Looks like some factor variables we can group by include name, year, event, and team.

Summarize A (one) Specific Variable by Groups Using group_by()

A single summary metric

Perhaps you only want to obtain the mean() or the sum() or the sd() for a variable. If so, this is easiest.

Grouping by one variable:

The data contain the top records for swimming events. You might be curious what year was the best of all time or what swimmer (e.g., name) has attained the most records of all time. There are different ways to accomplish this goal.

One approach that might be the most straight forward to new programmers is to mutate a constant count variable on each row which can be used to sum the counts for different groups.

DAT |>
  mutate(count = 1) |>         # mutate a new variable where all rows get a 1
  group_by(name) |>            # group by the swimmer name
  summarise(count = sum(count)) # sum the count and assign it the name count
# A tibble: 141 × 2
   name            count
   <chr>           <dbl>
 1 A Breazeale         8
 2 A Roeseler          2
 3 Aaron Lutzker       6
 4 Abel Sapirstein     1
 5 Alec Vercruysse     4
 6 Alex Mendoza        2
 7 Alex Poltash        6
 8 Allyson Yao         3
 9 Amy Fuller          1
10 Andrew Cox          4
# ℹ 131 more rows

By default, the data frame is arranged by the grouping variable (e.g., name). We can change the order of the rows by count using arrange() but this function by default sorts in an ascending manner. If you want a descending sorting, pass count to desc() to arrange the data frame in this way. We can also assign it to an object.

NAME_count <- DAT |>
  mutate(count = 1) |>
  group_by(name) |>
  summarise(count = sum(count)) |>
  arrange(desc(count))
NAME_count
# A tibble: 141 × 2
   name          count
   <chr>         <dbl>
 1 Michele Kee      11
 2 Augusta Lewis    10
 3 Matt Williams    10
 4 Kelly Ngo         9
 5 A Breazeale       8
 6 Ava Sealander     8
 7 Gary Simon        8
 8 Marco Conati      8
 9 Nic Tekieli       7
10 Aaron Lutzker     6
# ℹ 131 more rows

We can see that the top counts of all time are by Michele Kee for a total of 11. Kudos to Michele!

YEAR_count <- DAT |>
  mutate(count = 1) |>
  group_by(year) |>
  summarise(count = sum(count)) |>
  arrange(desc(count))
YEAR_count
# A tibble: 32 × 2
   year  count
   <chr> <dbl>
 1 2022     81
 2 2023     59
 3 2020     41
 4 2017     38
 5 2014     27
 6 2013     22
 7 2019     22
 8 2015     20
 9 2016     17
10 2018     16
# ℹ 22 more rows

We can see that the year with the most best are by 2022 for a total of 81. Hooray for 2022!

Grouping by two or more variables:

We can also summarize both the teams as well in order to see the top swimmer by team. If you want to summarize more than one variable, pass them both in group_by():

TEAM_NAME_count <- DAT |>
  mutate(count = 1) |>
  group_by(team, name) |>
  summarise(count = sum(count)) |>
  arrange(desc(count))
`summarise()` has grouped output by 'team'. You can override using the
`.groups` argument.
TEAM_NAME_count
# A tibble: 141 × 3
# Groups:   team [2]
   team   name          count
   <chr>  <chr>         <dbl>
 1 Athena Michele Kee      11
 2 Athena Augusta Lewis    10
 3 Stag   Matt Williams    10
 4 Athena Kelly Ngo         9
 5 Athena Ava Sealander     8
 6 Stag   A Breazeale       8
 7 Stag   Gary Simon        8
 8 Stag   Marco Conati      8
 9 Stag   Nic Tekieli       7
10 Athena Ella Blake        6
# ℹ 131 more rows

Filtering rows and then grouping by one variable:

For Athenas:

ATHENA_NAME_count <- DAT |>
  mutate(count = 1) |>
  filter(team == "Athena") |>
  group_by(name) |>
  summarise(count = sum(count)) |>
  arrange(desc(count))
ATHENA_NAME_count
# A tibble: 66 × 2
   name               count
   <chr>              <dbl>
 1 Michele Kee           11
 2 Augusta Lewis         10
 3 Kelly Ngo              9
 4 Ava Sealander          8
 5 Ella Blake             6
 6 Jamee Mitchum          6
 7 Katie Bilotti          6
 8 Mackenzie Mayfield     6
 9 Annika Jessen          5
10 Jocelyn Crawford       5
# ℹ 56 more rows

We can see that the top counts of all time are by Michele Kee with a total of 11. Nice work Michele!

And for Stags:

STAG_NAME_count <- DAT |>
  mutate(count = 1) |>
  filter(team == "Stag") |>
  group_by(name) |>
  summarise(count = sum(count)) |>
  arrange(desc(count))
STAG_NAME_count
# A tibble: 75 × 2
   name          count
   <chr>         <dbl>
 1 Matt Williams    10
 2 A Breazeale       8
 3 Gary Simon        8
 4 Marco Conati      8
 5 Nic Tekieli       7
 6 Aaron Lutzker     6
 7 Alex Poltash      6
 8 Blake Weber       6
 9 Sam Willett       5
10 Tom Harrison      5
# ℹ 65 more rows

We can see that the top counts of all time are by Matt Williams with 10. Go Matt!

To wrap up this example, sometimes working with separate data frames using filter() can provide more useful or manageable summaries.Perhaps you only want to obtain the mean() or the sum() or the sd() for a single variable. If so, this approach may be easiest.

If you are curious, here is a story covering “How CMS teams became the Stags and Athenas”.

Multiple summary metrics

Sometimes you need more than one summary statistic, for example, the mean() and the sd(). This is a little more complex to code.

Note: In these code blocks, some arguments may be removed for readability.

Grouping by one variable:

DAT |>
  group_by(event) |>
  summarise(sec_mean = mean(sec, na.rm = T),  
            sec_median = median(sec, na.rm = T)
            )
# A tibble: 22 × 3
   event               sec_mean sec_median
   <chr>                  <dbl>      <dbl>
 1 100 BACK               3209.      3212.
 2 100 BREAST             1725.      1696.
 3 100 BRST-Relay Spl.    1710.      1678.
 4 100 FLY                3158.      3165 
 5 100 FLY-Relay Spl.     3135.      3152.
 6 100 FREE               2917.      2908.
 7 100 FREE-Relay Spl.    2888.      2896 
 8 1000 FREE               601.       601.
 9 1650 FREE               990.       991.
10 200 BACK                116.       116.
# ℹ 12 more rows

We can also write the summary functions as a list inside across() along with passing .names = "{.col}_{.fn}" if you want the variables named automatically. This approach is more complex but is more flexible.

When you want to summarize across multiple variables using a list of functions, you will want to make sure your .fns argument passes a function using a {purrr}-style lambda (e.g.,~) for that the function is applied across the variables. You will also want to edit .names = "{.col}_{.fn}" so that the naming is done automatically rather than hard coding the names.

DAT |>
  group_by(event) |>
  summarise(across(.cols = "sec",
                   .fns  = list(mean = ~mean(na.omit(.x)),
                                median = ~median(na.omit(.x))
                                ),
                   .names = "{.col}_{.fn}"
                   )
            ) 
# A tibble: 22 × 3
   event               sec_mean sec_median
   <chr>                  <dbl>      <dbl>
 1 100 BACK               3209.      3212.
 2 100 BREAST             1725.      1696.
 3 100 BRST-Relay Spl.    1710.      1678.
 4 100 FLY                3158.      3165 
 5 100 FLY-Relay Spl.     3135.      3152.
 6 100 FREE               2917.      2908.
 7 100 FREE-Relay Spl.    2888.      2896 
 8 1000 FREE               601.       601.
 9 1650 FREE               990.       991.
10 200 BACK                116.       116.
# ℹ 12 more rows

Grouping by two or more variables:

DAT |>
  group_by(event, team) |>
  summarise(sec_mean = mean(sec, na.rm = T),  
            sec_median = median(sec, na.rm = T)
            )
`summarise()` has grouped output by 'event'. You can override using the
`.groups` argument.
# A tibble: 44 × 4
# Groups:   event [22]
   event               team   sec_mean sec_median
   <chr>               <chr>     <dbl>      <dbl>
 1 100 BACK            Athena   3421.      3436  
 2 100 BACK            Stag     2996.      3032  
 3 100 BREAST          Athena     64.0       64.2
 4 100 BREAST          Stag     3386.      3394. 
 5 100 BRST-Relay Spl. Athena     63.6       64.1
 6 100 BRST-Relay Spl. Stag     3356.      3363  
 7 100 FLY             Athena   3365.      3369  
 8 100 FLY             Stag     2950.      2968. 
 9 100 FLY-Relay Spl.  Athena   3332.      3326  
10 100 FLY-Relay Spl.  Stag     2938.      2937  
# ℹ 34 more rows

Or pass the list:

DAT |>
  group_by(event, team) |>
  summarise(across(.cols = "sec",
                   .fns  = list(mean = ~mean(na.omit(.x)),
                                median = ~median(na.omit(.x))
                                ),
                   .names = "{.col}_{.fn}"
                   )
            ) 
`summarise()` has grouped output by 'event'. You can override using the
`.groups` argument.
# A tibble: 44 × 4
# Groups:   event [22]
   event               team   sec_mean sec_median
   <chr>               <chr>     <dbl>      <dbl>
 1 100 BACK            Athena   3421.      3436  
 2 100 BACK            Stag     2996.      3032  
 3 100 BREAST          Athena     64.0       64.2
 4 100 BREAST          Stag     3386.      3394. 
 5 100 BRST-Relay Spl. Athena     63.6       64.1
 6 100 BRST-Relay Spl. Stag     3356.      3363  
 7 100 FLY             Athena   3365.      3369  
 8 100 FLY             Stag     2950.      2968. 
 9 100 FLY-Relay Spl.  Athena   3332.      3326  
10 100 FLY-Relay Spl.  Stag     2938.      2937  
# ℹ 34 more rows

Summarizing Multiple Variables by Groups Using group_by()

So far, we have shown how to summarize a single variable either with or without grouping by levels of another variable. Summaries, however, are often done for multiple variables in data frame. For example, you might want to obtain the mean() for multiple variables or obtain the mean() and the max() (or some other summary statistic) for multiple variables. The following examples prepare you for such tasks.

A single summary metric

Grouping by one variable:

DAT |>
  group_by(event) |>
  summarise(across(.cols = c("sec", "min"),
                   .fns = ~mean(.x, na.rm = T),
                   .names = "{.col}_{.fn}"
                   )
            ) 
# A tibble: 22 × 3
   event               sec_1 min_1
   <chr>               <dbl> <dbl>
 1 100 BACK            3209. 53.5 
 2 100 BREAST          1725. 28.8 
 3 100 BRST-Relay Spl. 1710. 28.5 
 4 100 FLY             3158. 52.6 
 5 100 FLY-Relay Spl.  3135. 52.2 
 6 100 FREE            2917. 48.6 
 7 100 FREE-Relay Spl. 2888. 48.1 
 8 1000 FREE            601. 10.0 
 9 1650 FREE            990. 16.5 
10 200 BACK             116.  1.94
# ℹ 12 more rows

Grouping by two or more variables:

DAT |>
  group_by(event, team) |>
  summarise(across(.cols = c("sec", "min"),
                   .fns = ~mean(.x, na.rm = T),
                   .names = "{.col}_{.fn}"
                   )
            ) 
`summarise()` has grouped output by 'event'. You can override using the
`.groups` argument.
# A tibble: 44 × 4
# Groups:   event [22]
   event               team    sec_1 min_1
   <chr>               <chr>   <dbl> <dbl>
 1 100 BACK            Athena 3421.  57.0 
 2 100 BACK            Stag   2996.  49.9 
 3 100 BREAST          Athena   64.0  1.07
 4 100 BREAST          Stag   3386.  56.4 
 5 100 BRST-Relay Spl. Athena   63.6  1.06
 6 100 BRST-Relay Spl. Stag   3356.  55.9 
 7 100 FLY             Athena 3365.  56.1 
 8 100 FLY             Stag   2950.  49.2 
 9 100 FLY-Relay Spl.  Athena 3332.  55.5 
10 100 FLY-Relay Spl.  Stag   2938.  49.0 
# ℹ 34 more rows

Multiple summary metrics

And if you need to summarize using multiple metrics, then pass the list into .fns:

Grouping by one variable:

DAT |>
  group_by(event) |>
  summarise(across(.cols = c("sec", "min"),
                   .fns  = list(mean = ~mean(na.omit(.x)),
                                median = ~median(na.omit(.x))
                                ),
                   .names = "{.col}_{.fn}"
                   )
            ) 
# A tibble: 22 × 5
   event               sec_mean sec_median min_mean min_median
   <chr>                  <dbl>      <dbl>    <dbl>      <dbl>
 1 100 BACK               3209.      3212.    53.5       53.5 
 2 100 BREAST             1725.      1696.    28.8       28.3 
 3 100 BRST-Relay Spl.    1710.      1678.    28.5       28.0 
 4 100 FLY                3158.      3165     52.6       52.8 
 5 100 FLY-Relay Spl.     3135.      3152.    52.2       52.5 
 6 100 FREE               2917.      2908.    48.6       48.5 
 7 100 FREE-Relay Spl.    2888.      2896     48.1       48.3 
 8 1000 FREE               601.       601.    10.0       10.0 
 9 1650 FREE               990.       991.    16.5       16.5 
10 200 BACK                116.       116.     1.94       1.93
# ℹ 12 more rows

Grouping by two or more variables:

DAT |>
  group_by(event, team) |>
  summarise(across(.cols = c("sec", "min"),
                   .fns  = list(mean = ~mean(na.omit(.x)),
                                median = ~median(na.omit(.x))
                                ),
                   .names = "{.col}_{.fn}"
                   )
            ) 
`summarise()` has grouped output by 'event'. You can override using the
`.groups` argument.
# A tibble: 44 × 6
# Groups:   event [22]
   event               team   sec_mean sec_median min_mean min_median
   <chr>               <chr>     <dbl>      <dbl>    <dbl>      <dbl>
 1 100 BACK            Athena   3421.      3436      57.0       57.3 
 2 100 BACK            Stag     2996.      3032      49.9       50.5 
 3 100 BREAST          Athena     64.0       64.2     1.07       1.07
 4 100 BREAST          Stag     3386.      3394.     56.4       56.6 
 5 100 BRST-Relay Spl. Athena     63.6       64.1     1.06       1.07
 6 100 BRST-Relay Spl. Stag     3356.      3363      55.9       56.0 
 7 100 FLY             Athena   3365.      3369      56.1       56.2 
 8 100 FLY             Stag     2950.      2968.     49.2       49.5 
 9 100 FLY-Relay Spl.  Athena   3332.      3326      55.5       55.4 
10 100 FLY-Relay Spl.  Stag     2938.      2937      49.0       49.0 
# ℹ 34 more rows

Depends on the order in group_by(), so change the order:

DAT |>
  group_by(team, event) |>
  summarise(across(.cols = "sec",
                   .fns  = list(mean = ~mean(na.omit(.x)),
                                n = ~length(na.omit(.x))
                                ),
                   .names = "{.col}_{.fn}"
                   )
            ) 
`summarise()` has grouped output by 'team'. You can override using the
`.groups` argument.
# A tibble: 44 × 4
# Groups:   team [2]
   team   event               sec_mean sec_n
   <chr>  <chr>                  <dbl> <int>
 1 Athena 100 BACK              3421.     10
 2 Athena 100 BREAST              64.0    10
 3 Athena 100 BRST-Relay Spl.     63.6    10
 4 Athena 100 FLY               3365.     10
 5 Athena 100 FLY-Relay Spl.    3332.     10
 6 Athena 100 FREE              3115.     10
 7 Athena 100 FREE-Relay Spl.   3088.     10
 8 Athena 1000 FREE              625.     10
 9 Athena 1650 FREE             1031.     10
10 Athena 200 BACK               124.     10
# ℹ 34 more rows

Notice the change in the order of the column variables. But remember, you can change the order later using select() and/or relocate().

Summarizing Multiple Variables With Reference by Name`

Variables that are numeric

You can also pass variables that are a certain type, like numeric.

DAT |>
  group_by(team, event) |>
  summarise(across(.cols = where(is.numeric), 
                   .fns = ~mean(.x, na.rm = TRUE), 
                   .names = "{.col}")
            )
`summarise()` has grouped output by 'team'. You can override using the
`.groups` argument.
# A tibble: 44 × 6
# Groups:   team [2]
   team   event                 time    sec   min   hour
   <chr>  <chr>                <dbl>  <dbl> <dbl>  <dbl>
 1 Athena 100 BACK            3421.  3421.  57.0  0.950 
 2 Athena 100 BREAST            64.0   64.0  1.07 0.0178
 3 Athena 100 BRST-Relay Spl.   63.6   63.6  1.06 0.0177
 4 Athena 100 FLY             3365.  3365.  56.1  0.935 
 5 Athena 100 FLY-Relay Spl.  3332.  3332.  55.5  0.926 
 6 Athena 100 FREE            3115.  3115.  51.9  0.865 
 7 Athena 100 FREE-Relay Spl. 3088.  3088.  51.5  0.858 
 8 Athena 1000 FREE            625.   625.  10.4  0.174 
 9 Athena 1650 FREE           1031.  1031.  17.2  0.286 
10 Athena 200 BACK             124.   124.   2.07 0.0345
# ℹ 34 more rows

Variables by pattern match

This approach is fun, especially if you have already named variables in ways that make selection really useful. This data frame is constrained a bit so the examples may be silly.

Using starts_with()

DAT |>
  group_by(team, event) |>
  summarise(across(.cols = starts_with("t"),
                   .fns  = list(mean = ~mean(na.omit(.x)),
                                n = ~length(na.omit(.x))
                                ),
                   .names = "{.col}_{.fn}"
                   )
            ) 
`summarise()` has grouped output by 'team'. You can override using the
`.groups` argument.
# A tibble: 44 × 4
# Groups:   team [2]
   team   event               time_mean time_n
   <chr>  <chr>                   <dbl>  <int>
 1 Athena 100 BACK               3421.      10
 2 Athena 100 BREAST               64.0     10
 3 Athena 100 BRST-Relay Spl.      63.6     10
 4 Athena 100 FLY                3365.      10
 5 Athena 100 FLY-Relay Spl.     3332.      10
 6 Athena 100 FREE               3115.      10
 7 Athena 100 FREE-Relay Spl.    3088.      10
 8 Athena 1000 FREE               625.      10
 9 Athena 1650 FREE              1031.      10
10 Athena 200 BACK                124.      10
# ℹ 34 more rows

Using & for complex selection

You obviously cannot calculate numeric metrics for character variables. But how might you select variables that contain a certain character pattern but are also numeric? You cannot nest these functions (e.g., where(is.numeric(contains("pattern")))). You can, however, pass the functions separately.

DAT |>
  group_by(team, event) |>
  summarise(across(.cols = contains("e") & where(is.numeric),
                   .fns  = list(mean = ~mean(na.omit(.x)),
                                n = ~length(na.omit(.x))
                                ),
                   .names = "{.col}_{.fn}"
                   )
            ) 
`summarise()` has grouped output by 'team'. You can override using the
`.groups` argument.
# A tibble: 44 × 6
# Groups:   team [2]
   team   event               time_mean time_n sec_mean sec_n
   <chr>  <chr>                   <dbl>  <int>    <dbl> <int>
 1 Athena 100 BACK               3421.      10   3421.     10
 2 Athena 100 BREAST               64.0     10     64.0    10
 3 Athena 100 BRST-Relay Spl.      63.6     10     63.6    10
 4 Athena 100 FLY                3365.      10   3365.     10
 5 Athena 100 FLY-Relay Spl.     3332.      10   3332.     10
 6 Athena 100 FREE               3115.      10   3115.     10
 7 Athena 100 FREE-Relay Spl.    3088.      10   3088.     10
 8 Athena 1000 FREE               625.      10    625.     10
 9 Athena 1650 FREE              1031.      10   1031.     10
10 Athena 200 BACK                124.      10    124.     10
# ℹ 34 more rows

Summary

You can see that summarizing data can be fun, interesting, simple, but sometimes complex. The examples above are only limited to selecting variables by name. Of course, you can summarize() across() variables that contain() characters or that starts_with() some characters or even across variables that is.numeric() (pardon the grammar) by pairing these functions with the

A Functional Approach

You can also throw your summaries into functions if you wish. We will create a new object that is a function object. We need to give it a name and we need to define arguments to make the function operate. We will want to make sure we have numeric variables.

summarizer <- function(data, 
                       cols = NULL, 
                       ...
                       ) {
  data |>
    group_by(...) |>
    summarise(across(.cols = {{cols}} & where(is.numeric),
                     .fns = list(
                       mean = ~mean(.x, na.rm = TRUE),
                       sd   = ~sd(.x, na.rm = TRUE)
                       ), 
                     .names = "{col}_{fn}")
              )
}

Test the function:

Without grouping:

summarizer(DAT, cols = contains("e"))
# A tibble: 1 × 4
  time_mean time_sd sec_mean sec_sd
      <dbl>   <dbl>    <dbl>  <dbl>
1     1309.   1226.    1309.  1227.

With grouping:

summarizer(DAT, cols = c(min, hour), event)
# A tibble: 22 × 5
   event               min_mean min_sd hour_mean hour_sd
   <chr>                  <dbl>  <dbl>     <dbl>   <dbl>
 1 100 BACK               53.5   3.76     0.891  0.0626 
 2 100 BREAST             28.8  28.4      0.479  0.473  
 3 100 BRST-Relay Spl.    28.5  28.1      0.475  0.469  
 4 100 FLY                52.6   3.63     0.877  0.0605 
 5 100 FLY-Relay Spl.     52.2   3.44     0.871  0.0573 
 6 100 FREE               48.6   3.43     0.810  0.0571 
 7 100 FREE-Relay Spl.    48.1   3.46     0.802  0.0576 
 8 1000 FREE              10.0   0.432    0.167  0.00719
 9 1650 FREE              16.5   0.728    0.275  0.0121 
10 200 BACK                1.94  0.136    0.0324 0.00227
# ℹ 12 more rows

And of course, when you really get excited, you could add functions so that you can perform different metrics. When you are done, you can save your favorite function to a file you can source().

Session Info

sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] htmltools_0.5.8.1 DT_0.33           vroom_1.6.5       lubridate_1.9.3  
 [5] forcats_1.0.0     stringr_1.5.1     dplyr_1.1.4       purrr_1.0.2      
 [9] readr_2.1.5       tidyr_1.3.1       tibble_3.2.1      ggplot2_3.5.1    
[13] tidyverse_2.0.0  

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        generics_0.1.3    stringi_1.8.4     hms_1.1.3        
 [5] digest_0.6.36     magrittr_2.0.3    evaluate_0.24.0   grid_4.4.1       
 [9] timechange_0.3.0  fastmap_1.2.0     R.oo_1.26.0       rprojroot_2.0.4  
[13] jsonlite_1.8.8    zip_2.3.1         R.utils_2.12.3    fansi_1.0.6      
[17] scales_1.3.0      cli_3.6.3         rlang_1.1.4       crayon_1.5.3     
[21] R.methodsS3_1.8.2 bit64_4.0.5       munsell_0.5.1     withr_3.0.1      
[25] yaml_2.3.10       tools_4.4.1       tzdb_0.4.0        colorspace_2.1-0 
[29] here_1.0.1        vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4  
[33] htmlwidgets_1.6.4 bit_4.0.5         pkgconfig_2.0.3   pillar_1.9.0     
[37] openxlsx_4.2.5.2  gtable_0.3.5      Rcpp_1.0.12       glue_1.7.0       
[41] xfun_0.45         tidyselect_1.2.1  rstudioapi_0.16.0 knitr_1.47       
[45] rmarkdown_2.27    compiler_4.4.1