Data subsets and summaries

Author

Gabriel I. Cook

Published

October 31, 2024

Overview

This module demonstrates how to use {dplyr} to subset data, mutate variables, and summarize variables in data frames. Many data summaries involve creating group-level statistics so we will also cover grouping across variables. {lubridate} will be used to handle time vectors. Other functions covered with examples include openxlsx::read.xlsx(), readRDS() , saveRDS() , across(), relocate(), arrange(), and list() (for passing a list object to summarize()). Some elements are advanced topics related to preparing data in order to be summarized.

To Do

Watch video assignment before class for points and for being prepared for class exercises. Bring questions to class. Come prepared to tackle an exercise.

Supplemental Readings

Data Transformation

Task

Watch associated video(s) and review module content as necessary for exercises, homework, or projects.

Libraries

{here}: 1.0.1: for path management
{dplyr} 1.1.4: for selecting, filtering, and mutating
{lubridate} 1.9.3: for handling date and time vectors
{openxlsx} 4.2.5.2: for reading .xlsx worksheets

External Functions

Provided:

view_html(): for viewing data frames in html format, from /src/functions/view_html.R

R.utils::sourceDirectory(here::here("src", "functions"))

Libraries

We will work with a few different libraries for data manipulation. Let’s load them into our work space using library().

library(dplyr)
library(stringr)
library(lubridate)

Reading and Looking at Data

In order to perform any data summary, you need data. We will use raw data files named "cms-top-all-time-2023-swim.csv" and "cms-top-all-time-2023-swim.xlsx" which can be located and downloaded from the Data tab above. You should download these to /data/raw. We could just access the file from the full URL path but we will later want to save a cleaned version of the file for later use. Modifying the file name will be easier than hard coding the path in both places. Thus, we will use here() to build the file path and pass to the relevant functions. There is also additional content involving mutate() in order to prepare the data file for summarizing.

Reading `.csv` files

This example is a walk through using the raw data file.

We will use here() to build and pass the file string as the argument to file and assign to an object named DAT. If column names are not in the file, modify the header argument. You could use dplyr::read_csv() also if you desire.

DAT <- read.csv(file = here::here("data", "raw", "cms-top-all-time-2023-swim.csv"))

Reading `.xlsx` files

This example is a walk through using the raw data file.

We will use here() to build and pass the file string to the xlxsFile argument and also pass sheet = "swim" so the swim worksheet is read from the file. If {openxlsx} is not loaded, call the function with reference to the library and assign to an object named DAT.

DAT <- openxlsx::read.xlsx(
  xlsxFile = here::here("data", "raw", "cms-top-all-time-2023-swim.xlsx"),
  sheet = "swim"
  )

Looking at Data

If data wrangling and cleaning skills are not needed, jump to summarizing. In order to prepare a data frame for summarizing, however, you need to ensure it is cleaned and saved.

View its contents:

head(DAT)    # or view_html(DAT)

   time             name year   event   team
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31    Ava Sealander 2022 50 FREE Athena
3 23.49        Kelly Ngo 2016 50 FREE Athena
4 23.71        Helen Liu 2014 50 FREE Athena
5 23.76      Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena

We have several variables: time, name, year, event, team. Some appear to be numeric and some are characters. Passing the data frame to dplyr::glimpse() will provide more detail (see also str()).

dplyr::glimpse(DAT) # or DAT |> dplyr::glimpse()

Rows: 440
Columns: 5
$ time  <chr> "23.29", "23.31", "23.49", "23.71", "23.76", "23.77", "23.77", "…
$ name  <chr> "Jocelyn Crawford", "Ava Sealander", "Kelly Ngo", "Helen Liu", "…
$ year  <chr> "2019", "2022", "2016", "2014", "2014", "2020", "2020", "2010", …
$ event <chr> "50 FREE", "50 FREE", "50 FREE", "50 FREE", "50 FREE", "50 FREE"…
$ team  <chr> "Athena", "Athena", "Athena", "Athena", "Athena", "Athena", "Ath…

Hmm, looks like there are <chr> and <dbl> variables and some numbers are in quotes. Objects in quotes are strings. So we need to clean up the file before we can even get a summary. Other than looking at the structure of data frame using glimpse() or str(), you can examine vectors individually.

When vectors are in data frames, you can reference them using the $ notation: my_dataframe$my_variable. Passing this to is.numeric() will also tell you whether the vector is numeric by returning either TRUE or FALSE.

is.numeric(DAT$time)

[1] FALSE

OK, so this variable is definitely not numeric. Looking at the file, we need to convert at least one variable that should be numbers from character to numeric.

Modifying a Data Frame or Tibble Using {dplyr}

Use dplyr::mutate() to create, modify, and delete column variables. At every least, you will need to pass a data frame as the first argument, .data and then a name-value pair as the second argument. There are other arguments that are experimental which will not be the focus of this module.

mutate(
  .data,
  ...,
  .by = NULL,
  .keep = c("all", "used", "unused", "none"),
  .before = NULL,
  .after = NULL
)

We will use mutate() in conjunction with Base R’s piping operator, |>.

Creating a new variable

We will pass the data frame into mutate() and then specify a name-value variable pair. In order to keep the print out manageable, we will also use the slice() function.

mutate(data_frame, 
       new_variable_name = variable
       )

Create new variables that are set to a constant number or string:

DAT |>
  slice(1:5) |>   # rows 1 through 5
  mutate(newvar1 = 9999) |>
  mutate(newvar2 = "Student")

   time             name year   event   team newvar1 newvar2
1 23.29 Jocelyn Crawford 2019 50 FREE Athena    9999 Student
2 23.31    Ava Sealander 2022 50 FREE Athena    9999 Student
3 23.49        Kelly Ngo 2016 50 FREE Athena    9999 Student
4 23.71        Helen Liu 2014 50 FREE Athena    9999 Student
5 23.76      Michele Kee 2014 50 FREE Athena    9999 Student

You can see that each row in the data frame will take on the paired value.

Modifying a new variable

New variables are modified using the same name-value pairing approach. When you modify a variable, you are taking an existing variable to setting it to another value.

Set an existing variable equal to a constant

Just use an existing variable name (left of = in name-value pair).

DAT |>
  slice(1:5) |>         # rows 1 through 5
  mutate(time = 1) |>
  mutate(name = "0")

  time name year   event   team
1    1    0 2019 50 FREE Athena
2    1    0 2022 50 FREE Athena
3    1    0 2016 50 FREE Athena
4    1    0 2014 50 FREE Athena
5    1    0 2014 50 FREE Athena

OK, that’s not very helpful. We just replaced our existing variables with nothing useful. You can see that name is still a <chr> type.

Set an existing variable equal to another value

As long as {dplyr} can result the character elements of the vector, as.numeric() will convert the character strings to numbers. For example:

as.numeric(c("1", "3.2", "6.99"))

[1] 1.00 3.20 6.99

We can illustrate in a data frame by creating character value that will serve as the constant, and use as.numeric() just to illustrate this example.

DAT |>
  slice(1:5) |>            # rows 1 through 5
  mutate(name = as.numeric("0"))

   time name year   event   team
1 23.29    0 2019 50 FREE Athena
2 23.31    0 2022 50 FREE Athena
3 23.49    0 2016 50 FREE Athena
4 23.71    0 2014 50 FREE Athena
5 23.76    0 2014 50 FREE Athena

And now name is a <dbl>, which is a type of numeric. We can see this by selecting columns from the data frame where() the variable is.numeric().

DAT |>
  slice(1:5) |>   # rows 1 through 5
  mutate(name = as.numeric("0")) |>
  select(where(is.numeric))

But if we try to convert time to numeric this way, you will see that the complex numbers will be converted to NAs, or missing.

as.numeric(DAT$time)

Warning: NAs introduced by coercion

  [1] 23.29 23.31 23.49 23.71 23.76 23.77 23.77 23.87 23.93 24.02 51.05 51.24
 [13] 51.41 51.56 51.56 51.88 52.05 52.05 52.14 52.17    NA    NA    NA    NA
 [25]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [37]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [49]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [61] 55.66 55.67 55.91 56.11 56.74 56.83 57.18 57.36 57.47 57.56    NA    NA
 [73]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [85]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 [97]    NA    NA    NA    NA 54.76 54.92 54.93 55.23 55.74 56.04 56.27 56.42
[109] 56.47 56.56    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[121]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA 22.69 22.92
[133] 22.93 22.95 23.27 23.31 23.33 23.38 23.45 23.47 50.65 50.67 50.92 51.19
[145] 51.27 51.28 51.29 51.37 51.45 51.56    NA    NA    NA    NA    NA    NA
[157]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[169]    NA    NA 25.94 26.22 26.28 26.57 26.82 26.90 27.14 27.14 27.16 27.17
[181] 28.33 28.96 29.05 29.06 29.09 29.09 29.24 29.26 29.46 29.55 24.05 24.28
[193] 24.58 24.59 24.65 24.85 25.05 25.08 25.24 25.34 54.65 54.81 54.91 55.11
[205] 55.13 55.25 55.27 55.45 55.62 56.21    NA    NA    NA    NA    NA    NA
[217]    NA    NA    NA    NA 19.98 20.21 20.22 20.36 20.51 20.65 20.69 20.71
[229] 20.79 20.82 44.06 44.21 44.73 44.94 45.24 45.31 45.32 45.45 45.50 45.50
[241]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[253]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[265]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[277]    NA    NA    NA    NA 46.99 47.57 49.32 49.97 50.03 50.29 50.35 50.41
[289] 50.51 50.59    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[301] 54.88 55.80 55.92 56.03 56.27 56.35 56.36 56.45 56.49 56.49    NA    NA
[313]    NA    NA    NA    NA    NA    NA    NA    NA 47.45 47.56 47.80 48.74
[325] 48.91 49.26 49.31 49.34 49.68 49.74    NA    NA    NA    NA    NA    NA
[337]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[349]    NA    NA 19.47 19.63 19.96 20.08 20.10 20.14 20.18 20.22 20.25 20.28
[361] 43.28 43.69 43.74 44.43 44.57 44.59 44.81 44.81 44.83 44.87    NA    NA
[373]    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
[385]    NA    NA    NA    NA    NA    NA 22.32 22.50 22.84 23.00 23.24 23.45
[397] 23.72 23.74 23.85 23.87 24.73 24.81 24.88 25.10 25.20 25.47 25.55 25.63
[409] 26.01 26.05 20.60 21.38 21.48 21.58 21.60 21.81 21.83 22.00 22.06 22.07
[421] 47.01 47.72 48.39 48.50 48.54 48.82 49.00 49.45 49.52 49.60 54.52 55.15
[433] 55.21 55.47 55.62 56.04 56.13 56.19 56.35 56.48

The problem we have is that the data are not in a clean form. In this data frame, some elements of time are composed of numbers, decimals, and colons (e.g., x.xx, xx:xx.xx, etc.). which all make up elements that would be numbers.

Converting variables that are time related using {lubridate}

{lubridate} is a library for dealing with dates and times. It is also part of the tidyverse} ecosystem. If it is not loaded, called the function by referencing the library name.

The period_to_seconds() function will convert periods to seconds depending on the period format. We need to pass to it an object that equates to a period of seconds.

For time, the format is: hour minute second (e.g., hms). And there is a hms() function to handle this. Let’s see how it works before modifying the data frame. Load the library if it is not loaded.

We will pass a single character string and convert it to hms with hms() and then convert that to seconds using period_to_seconds(). Because 60 minutes and 1 hour is the same number of seconds, we should end up with the same values. We can also use ms() to convert the format into minutes and seconds and pass to period_to_seconds(). Following the examples, we will apply to the vector in the data frame.

period_to_seconds(lubridate::ms())
period_to_seconds(lubridate::hms())

Sixty minutes and one hour to hms:

hms("00:60:00")

[1] "60M 0S"

hms("01:00:00")

[1] "1H 0M 0S"

One day to hms:

hms("24:00:00")

[1] "24H 0M 0S"

Sixty minutes and one hour to hms to seconds:

period_to_seconds(hms("00:60:00"))

[1] 3600

period_to_seconds(hms("01:00:00"))

[1] 3600

OK, good. So let’s see if we can convert time. Because hms() is passed into period_to_seconds(), we first need to verify that hms() can handle it.

hms(DAT$time)

Warning in .parse_hms(..., order = "HMS", quiet = quiet): Some strings failed
to parse

  [1] NA            NA            NA            NA            NA           
  [6] NA            NA            NA            NA            NA           
 [11] NA            NA            NA            NA            NA           
 [16] NA            NA            NA            NA            NA           
 [21] "1H 50M 30S"  "1H 51M 89S"  "1H 52M 40S"  "1H 52M 55S"  "1H 52M 57S" 
 [26] "1H 52M 67S"  "1H 52M 80S"  "1H 52M 83S"  "1H 52M 88S"  "1H 52M 89S" 
 [31] "4H 57M 37S"  "4H 59M 22S"  "4H 59M 78S"  "5H 0M 6S"    "5H 0M 21S"  
 [36] "5H 0M 47S"   "5H 1M 17S"   "5H 1M 99S"   "5H 2M 16S"   "5H 2M 23S"  
 [41] "10H 14M 33S" "10H 15M 40S" "10H 22M 59S" "10H 22M 84S" "10H 24M 14S"
 [46] "10H 24M 82S" "10H 26M 34S" "10H 26M 89S" "10H 33M 15S" "10H 36M 60S"
 [51] "16H 58M 48S" "17H 3M 21S"  "17H 6M 78S"  "17H 9M 9S"   "17H 9M 26S" 
 [56] "17H 12M 18S" "17H 12M 46S" "17H 13M 58S" "17H 21M 98S" "17H 24M 49S"
 [61] NA            NA            NA            NA            NA           
 [66] NA            NA            NA            NA            NA           
 [71] "1H 59M 91S"  "2H 2M 10S"   "2H 3M 73S"   "2H 3M 85S"   "2H 3M 92S"  
 [76] "2H 5M 10S"   "2H 5M 56S"   "2H 5M 71S"   "2H 5M 76S"   "2H 6M 13S"  
 [81] "1H 1M 84S"   "1H 3M 28S"   "1H 3M 51S"   "1H 3M 91S"   "1H 4M 16S"  
 [86] "1H 4M 19S"   "1H 4M 65S"   "1H 4M 66S"   "1H 4M 77S"   "1H 4M 94S"  
 [91] "2H 14M 83S"  "2H 15M 62S"  "2H 18M 99S"  "2H 19M 6S"   "2H 19M 78S" 
 [96] "2H 20M 1S"   "2H 21M 17S"  "2H 21M 77S"  "2H 22M 34S"  "2H 22M 83S" 
[101] NA            NA            NA            NA            NA           
[106] NA            NA            NA            NA            NA           
[111] "2H 1M 84S"   "2H 2M 58S"   "2H 3M 80S"   "2H 4M 5S"    "2H 4M 42S"  
[116] "2H 4M 81S"   "2H 6M 40S"   "2H 7M 12S"   "2H 7M 33S"   "2H 7M 42S"  
[121] "2H 0M 69S"   "2H 3M 59S"   "2H 3M 79S"   "2H 4M 74S"   "2H 5M 41S"  
[126] "2H 6M 82S"   "2H 7M 12S"   "2H 7M 14S"   "2H 7M 81S"   "2H 7M 94S"  
[131] NA            NA            NA            NA            NA           
[136] NA            NA            NA            NA            NA           
[141] NA            NA            NA            NA            NA           
[146] NA            NA            NA            NA            NA           
[151] "1H 50M 91S"  "1H 50M 98S"  "1H 51M 39S"  "1H 51M 41S"  "1H 51M 56S" 
[156] "1H 51M 64S"  "1H 51M 95S"  "1H 52M 53S"  "1H 52M 78S"  "1H 53M 4S"  
[161] "4H 15M 73S"  "4H 27M 18S"  "4H 30M 33S"  "4H 30M 77S"  "4H 31M 96S" 
[166] "4H 32M 45S"  "4H 32M 57S"  "4H 32M 68S"  "4H 34M 16S"  "4H 34M 27S" 
[171] NA            NA            NA            NA            NA           
[176] NA            NA            NA            NA            NA           
[181] NA            NA            NA            NA            NA           
[186] NA            NA            NA            NA            NA           
[191] NA            NA            NA            NA            NA           
[196] NA            NA            NA            NA            NA           
[201] NA            NA            NA            NA            NA           
[206] NA            NA            NA            NA            NA           
[211] "1H 1M 10S"   "1H 2M 88S"   "1H 2M 89S"   "1H 3M 67S"   "1H 4M 10S"  
[216] "1H 4M 16S"   "1H 4M 24S"   "1H 4M 26S"   "1H 4M 35S"   "1H 4M 43S"  
[221] NA            NA            NA            NA            NA           
[226] NA            NA            NA            NA            NA           
[231] NA            NA            NA            NA            NA           
[236] NA            NA            NA            NA            NA           
[241] "1H 38M 35S"  "1H 38M 88S"  "1H 39M 7S"   "1H 39M 35S"  "1H 39M 63S" 
[246] "1H 39M 80S"  "1H 39M 82S"  "1H 40M 30S"  "1H 40M 31S"  "1H 40M 50S" 
[251] "4H 25M 67S"  "4H 28M 11S"  "4H 28M 89S"  "4H 29M 32S"  "4H 31M 64S" 
[256] "4H 32M 52S"  "4H 32M 65S"  "4H 32M 94S"  "4H 32M 98S"  "4H 34M 70S" 
[261] "9H 14M 11S"  "9H 24M 43S"  "9H 35M 78S"  "9H 36M 64S"  "9H 39M 27S" 
[266] "9H 40M 2S"   "9H 41M 48S"  "9H 45M 72S"  "9H 46M 63S"  "9H 47M 9S"  
[271] "15H 17M 24S" "15H 32M 19S" "15H 45M 57S" "15H 47M 40S" "15H 52M 94S"
[276] "15H 53M 75S" "15H 56M 57S" "15H 57M 89S" "16H 2M 45S"  "16H 3M 38S" 
[281] NA            NA            NA            NA            NA           
[286] NA            NA            NA            NA            NA           
[291] "1H 45M 5S"   "1H 45M 67S"  "1H 46M 51S"  "1H 48M 84S"  "1H 49M 1S"  
[296] "1H 49M 38S"  "1H 50M 32S"  "1H 50M 43S"  "1H 51M 7S"   "1H 51M 57S" 
[301] NA            NA            NA            NA            NA           
[306] NA            NA            NA            NA            NA           
[311] "1H 59M 90S"  "2H 1M 18S"   "2H 1M 45S"   "2H 1M 60S"   "2H 1M 66S"  
[316] "2H 1M 77S"   "2H 1M 78S"   "2H 2M 89S"   "2H 3M 19S"   "2H 3M 23S"  
[321] NA            NA            NA            NA            NA           
[326] NA            NA            NA            NA            NA           
[331] "1H 43M 96S"  "1H 48M 70S"  "1H 49M 24S"  "1H 49M 95S"  "1H 49M 96S" 
[336] "1H 50M 34S"  "1H 50M 47S"  "1H 50M 49S"  "1H 50M 51S"  "1H 50M 76S" 
[341] "1H 46M 97S"  "1H 48M 74S"  "1H 49M 74S"  "1H 50M 51S"  "1H 50M 78S" 
[346] "1H 51M 11S"  "1H 51M 24S"  "1H 51M 48S"  "1H 51M 82S"  "1H 51M 83S" 
[351] NA            NA            NA            NA            NA           
[356] NA            NA            NA            NA            NA           
[361] NA            NA            NA            NA            NA           
[366] NA            NA            NA            NA            NA           
[371] "1H 37M 98S"  "1H 38M 49S"  "1H 39M 9S"   "1H 39M 19S"  "1H 39M 66S" 
[376] "1H 40M 21S"  "1H 40M 22S"  "1H 40M 44S"  "1H 40M 70S"  "1H 40M 70S" 
[381] "3H 55M 61S"  "3H 56M 68S"  "3H 56M 88S"  "3H 59M 2S"   "3H 59M 13S" 
[386] "3H 59M 17S"  "4H 0M 63S"   "4H 1M 14S"   "4H 2M 38S"   "4H 2M 99S"  
[391] NA            NA            NA            NA            NA           
[396] NA            NA            NA            NA            NA           
[401] NA            NA            NA            NA            NA           
[406] NA            NA            NA            NA            NA           
[411] NA            NA            NA            NA            NA           
[416] NA            NA            NA            NA            NA           
[421] NA            NA            NA            NA            NA           
[426] NA            NA            NA            NA            NA           
[431] NA            NA            NA            NA            NA           
[436] NA            NA            NA            NA            NA

Yikes! Note the warning and look at the output. Some strings failed and turned to NA. Looking and the time vector again, we see that contains both values like 1:52.83 and 23.87. If there is only one :, we should be able to use ms().

ms(DAT$time)

  [1] "23M 29S"    "23M 31S"    "23M 49S"    "23M 71S"    "23M 76S"   
  [6] "23M 77S"    "23M 77S"    "23M 87S"    "23M 93S"    "24M 2S"    
 [11] "51M 5S"     "51M 24S"    "51M 41S"    "51M 56S"    "51M 56S"   
 [16] "51M 88S"    "52M 5S"     "52M 5S"     "52M 14S"    "52M 17S"   
 [21] "1M 50.3S"   "1M 51.89S"  "1M 52.4S"   "1M 52.55S"  "1M 52.57S" 
 [26] "1M 52.67S"  "1M 52.8S"   "1M 52.83S"  "1M 52.88S"  "1M 52.89S" 
 [31] "4M 57.37S"  "4M 59.22S"  "4M 59.78S"  "5M 0.06S"   "5M 0.21S"  
 [36] "5M 0.47S"   "5M 1.17S"   "5M 1.99S"   "5M 2.16S"   "5M 2.23S"  
 [41] "10M 14.33S" "10M 15.4S"  "10M 22.59S" "10M 22.84S" "10M 24.14S"
 [46] "10M 24.82S" "10M 26.34S" "10M 26.89S" "10M 33.15S" "10M 36.6S" 
 [51] "16M 58.48S" "17M 3.21S"  "17M 6.78S"  "17M 9.09S"  "17M 9.26S" 
 [56] "17M 12.18S" "17M 12.46S" "17M 13.58S" "17M 21.98S" "17M 24.49S"
 [61] "55M 66S"    "55M 67S"    "55M 91S"    "56M 11S"    "56M 74S"   
 [66] "56M 83S"    "57M 18S"    "57M 36S"    "57M 47S"    "57M 56S"   
 [71] "1M 59.91S"  "2M 2.1S"    "2M 3.73S"   "2M 3.85S"   "2M 3.92S"  
 [76] "2M 5.1S"    "2M 5.56S"   "2M 5.71S"   "2M 5.76S"   "2M 6.13S"  
 [81] "1M 1.84S"   "1M 3.28S"   "1M 3.51S"   "1M 3.91S"   "1M 4.16S"  
 [86] "1M 4.19S"   "1M 4.65S"   "1M 4.66S"   "1M 4.77S"   "1M 4.94S"  
 [91] "2M 14.83S"  "2M 15.62S"  "2M 18.99S"  "2M 19.06S"  "2M 19.78S" 
 [96] "2M 20.01S"  "2M 21.17S"  "2M 21.77S"  "2M 22.34S"  "2M 22.83S" 
[101] "54M 76S"    "54M 92S"    "54M 93S"    "55M 23S"    "55M 74S"   
[106] "56M 4S"     "56M 27S"    "56M 42S"    "56M 47S"    "56M 56S"   
[111] "2M 1.84S"   "2M 2.58S"   "2M 3.8S"    "2M 4.05S"   "2M 4.42S"  
[116] "2M 4.81S"   "2M 6.4S"    "2M 7.12S"   "2M 7.33S"   "2M 7.42S"  
[121] "2M 0.69S"   "2M 3.59S"   "2M 3.79S"   "2M 4.74S"   "2M 5.41S"  
[126] "2M 6.82S"   "2M 7.12S"   "2M 7.14S"   "2M 7.81S"   "2M 7.94S"  
[131] "22M 69S"    "22M 92S"    "22M 93S"    "22M 95S"    "23M 27S"   
[136] "23M 31S"    "23M 33S"    "23M 38S"    "23M 45S"    "23M 47S"   
[141] "50M 65S"    "50M 67S"    "50M 92S"    "51M 19S"    "51M 27S"   
[146] "51M 28S"    "51M 29S"    "51M 37S"    "51M 45S"    "51M 56S"   
[151] "1M 50.91S"  "1M 50.98S"  "1M 51.39S"  "1M 51.41S"  "1M 51.56S" 
[156] "1M 51.64S"  "1M 51.95S"  "1M 52.53S"  "1M 52.78S"  "1M 53.04S" 
[161] "4M 15.73S"  "4M 27.18S"  "4M 30.33S"  "4M 30.77S"  "4M 31.96S" 
[166] "4M 32.45S"  "4M 32.57S"  "4M 32.68S"  "4M 34.16S"  "4M 34.27S" 
[171] "25M 94S"    "26M 22S"    "26M 28S"    "26M 57S"    "26M 82S"   
[176] "26M 90S"    "27M 14S"    "27M 14S"    "27M 16S"    "27M 17S"   
[181] "28M 33S"    "28M 96S"    "29M 5S"     "29M 6S"     "29M 9S"    
[186] "29M 9S"     "29M 24S"    "29M 26S"    "29M 46S"    "29M 55S"   
[191] "24M 5S"     "24M 28S"    "24M 58S"    "24M 59S"    "24M 65S"   
[196] "24M 85S"    "25M 5S"     "25M 8S"     "25M 24S"    "25M 34S"   
[201] "54M 65S"    "54M 81S"    "54M 91S"    "55M 11S"    "55M 13S"   
[206] "55M 25S"    "55M 27S"    "55M 45S"    "55M 62S"    "56M 21S"   
[211] "1M 1.1S"    "1M 2.88S"   "1M 2.89S"   "1M 3.67S"   "1M 4.1S"   
[216] "1M 4.16S"   "1M 4.24S"   "1M 4.26S"   "1M 4.35S"   "1M 4.43S"  
[221] "19M 98S"    "20M 21S"    "20M 22S"    "20M 36S"    "20M 51S"   
[226] "20M 65S"    "20M 69S"    "20M 71S"    "20M 79S"    "20M 82S"   
[231] "44M 6S"     "44M 21S"    "44M 73S"    "44M 94S"    "45M 24S"   
[236] "45M 31S"    "45M 32S"    "45M 45S"    "45M 50S"    "45M 50S"   
[241] "1M 38.35S"  "1M 38.88S"  "1M 39.07S"  "1M 39.35S"  "1M 39.63S" 
[246] "1M 39.8S"   "1M 39.82S"  "1M 40.3S"   "1M 40.31S"  "1M 40.5S"  
[251] "4M 25.67S"  "4M 28.11S"  "4M 28.89S"  "4M 29.32S"  "4M 31.64S" 
[256] "4M 32.52S"  "4M 32.65S"  "4M 32.94S"  "4M 32.98S"  "4M 34.7S"  
[261] "9M 14.11S"  "9M 24.43S"  "9M 35.78S"  "9M 36.64S"  "9M 39.27S" 
[266] "9M 40.02S"  "9M 41.48S"  "9M 45.72S"  "9M 46.63S"  "9M 47.09S" 
[271] "15M 17.24S" "15M 32.19S" "15M 45.57S" "15M 47.4S"  "15M 52.94S"
[276] "15M 53.75S" "15M 56.57S" "15M 57.89S" "16M 2.45S"  "16M 3.38S" 
[281] "46M 99S"    "47M 57S"    "49M 32S"    "49M 97S"    "50M 3S"    
[286] "50M 29S"    "50M 35S"    "50M 41S"    "50M 51S"    "50M 59S"   
[291] "1M 45.05S"  "1M 45.67S"  "1M 46.51S"  "1M 48.84S"  "1M 49.01S" 
[296] "1M 49.38S"  "1M 50.32S"  "1M 50.43S"  "1M 51.07S"  "1M 51.57S" 
[301] "54M 88S"    "55M 80S"    "55M 92S"    "56M 3S"     "56M 27S"   
[306] "56M 35S"    "56M 36S"    "56M 45S"    "56M 49S"    "56M 49S"   
[311] "1M 59.9S"   "2M 1.18S"   "2M 1.45S"   "2M 1.6S"    "2M 1.66S"  
[316] "2M 1.77S"   "2M 1.78S"   "2M 2.89S"   "2M 3.19S"   "2M 3.23S"  
[321] "47M 45S"    "47M 56S"    "47M 80S"    "48M 74S"    "48M 91S"   
[326] "49M 26S"    "49M 31S"    "49M 34S"    "49M 68S"    "49M 74S"   
[331] "1M 43.96S"  "1M 48.7S"   "1M 49.24S"  "1M 49.95S"  "1M 49.96S" 
[336] "1M 50.34S"  "1M 50.47S"  "1M 50.49S"  "1M 50.51S"  "1M 50.76S" 
[341] "1M 46.97S"  "1M 48.74S"  "1M 49.74S"  "1M 50.51S"  "1M 50.78S" 
[346] "1M 51.11S"  "1M 51.24S"  "1M 51.48S"  "1M 51.82S"  "1M 51.83S" 
[351] "19M 47S"    "19M 63S"    "19M 96S"    "20M 8S"     "20M 10S"   
[356] "20M 14S"    "20M 18S"    "20M 22S"    "20M 25S"    "20M 28S"   
[361] "43M 28S"    "43M 69S"    "43M 74S"    "44M 43S"    "44M 57S"   
[366] "44M 59S"    "44M 81S"    "44M 81S"    "44M 83S"    "44M 87S"   
[371] "1M 37.98S"  "1M 38.49S"  "1M 39.09S"  "1M 39.19S"  "1M 39.66S" 
[376] "1M 40.21S"  "1M 40.22S"  "1M 40.44S"  "1M 40.7S"   "1M 40.7S"  
[381] "3M 55.61S"  "3M 56.68S"  "3M 56.88S"  "3M 59.02S"  "3M 59.13S" 
[386] "3M 59.17S"  "4M 0.63S"   "4M 1.14S"   "4M 2.38S"   "4M 2.99S"  
[391] "22M 32S"    "22M 50S"    "22M 84S"    "23M 0S"     "23M 24S"   
[396] "23M 45S"    "23M 72S"    "23M 74S"    "23M 85S"    "23M 87S"   
[401] "24M 73S"    "24M 81S"    "24M 88S"    "25M 10S"    "25M 20S"   
[406] "25M 47S"    "25M 55S"    "25M 63S"    "26M 1S"     "26M 5S"    
[411] "20M 60S"    "21M 38S"    "21M 48S"    "21M 58S"    "21M 60S"   
[416] "21M 81S"    "21M 83S"    "22M 0S"     "22M 6S"     "22M 7S"    
[421] "47M 1S"     "47M 72S"    "48M 39S"    "48M 50S"    "48M 54S"   
[426] "48M 82S"    "49M 0S"     "49M 45S"    "49M 52S"    "49M 60S"   
[431] "54M 52S"    "55M 15S"    "55M 21S"    "55M 47S"    "55M 62S"   
[436] "56M 4S"     "56M 13S"    "56M 19S"    "56M 35S"    "56M 48S"

Great! All elements are in the form of "xxM XXs". But does {lubridate} also convert seconds to a numeric value when using period_to_seconds()? If loaded, we can also remove the name of the library when calling the functions.

is.numeric(period_to_seconds(ms(DAT$time)))

[1] TRUE

So let’s go ahead and modify the character vector named time to a numeric vector representing seconds.

period_to_seconds(ms(DAT$time))

  [1] 1409.00 1411.00 1429.00 1451.00 1456.00 1457.00 1457.00 1467.00 1473.00
 [10] 1442.00 3065.00 3084.00 3101.00 3116.00 3116.00 3148.00 3125.00 3125.00
 [19] 3134.00 3137.00  110.30  111.89  112.40  112.55  112.57  112.67  112.80
 [28]  112.83  112.88  112.89  297.37  299.22  299.78  300.06  300.21  300.47
 [37]  301.17  301.99  302.16  302.23  614.33  615.40  622.59  622.84  624.14
 [46]  624.82  626.34  626.89  633.15  636.60 1018.48 1023.21 1026.78 1029.09
 [55] 1029.26 1032.18 1032.46 1033.58 1041.98 1044.49 3366.00 3367.00 3391.00
 [64] 3371.00 3434.00 3443.00 3438.00 3456.00 3467.00 3476.00  119.91  122.10
 [73]  123.73  123.85  123.92  125.10  125.56  125.71  125.76  126.13   61.84
 [82]   63.28   63.51   63.91   64.16   64.19   64.65   64.66   64.77   64.94
 [91]  134.83  135.62  138.99  139.06  139.78  140.01  141.17  141.77  142.34
[100]  142.83 3316.00 3332.00 3333.00 3323.00 3374.00 3364.00 3387.00 3402.00
[109] 3407.00 3416.00  121.84  122.58  123.80  124.05  124.42  124.81  126.40
[118]  127.12  127.33  127.42  120.69  123.59  123.79  124.74  125.41  126.82
[127]  127.12  127.14  127.81  127.94 1389.00 1412.00 1413.00 1415.00 1407.00
[136] 1411.00 1413.00 1418.00 1425.00 1427.00 3065.00 3067.00 3092.00 3079.00
[145] 3087.00 3088.00 3089.00 3097.00 3105.00 3116.00  110.91  110.98  111.39
[154]  111.41  111.56  111.64  111.95  112.53  112.78  113.04  255.73  267.18
[163]  270.33  270.77  271.96  272.45  272.57  272.68  274.16  274.27 1594.00
[172] 1582.00 1588.00 1617.00 1642.00 1650.00 1634.00 1634.00 1636.00 1637.00
[181] 1713.00 1776.00 1745.00 1746.00 1749.00 1749.00 1764.00 1766.00 1786.00
[190] 1795.00 1445.00 1468.00 1498.00 1499.00 1505.00 1525.00 1505.00 1508.00
[199] 1524.00 1534.00 3305.00 3321.00 3331.00 3311.00 3313.00 3325.00 3327.00
[208] 3345.00 3362.00 3381.00   61.10   62.88   62.89   63.67   64.10   64.16
[217]   64.24   64.26   64.35   64.43 1238.00 1221.00 1222.00 1236.00 1251.00
[226] 1265.00 1269.00 1271.00 1279.00 1282.00 2646.00 2661.00 2713.00 2734.00
[235] 2724.00 2731.00 2732.00 2745.00 2750.00 2750.00   98.35   98.88   99.07
[244]   99.35   99.63   99.80   99.82  100.30  100.31  100.50  265.67  268.11
[253]  268.89  269.32  271.64  272.52  272.65  272.94  272.98  274.70  554.11
[262]  564.43  575.78  576.64  579.27  580.02  581.48  585.72  586.63  587.09
[271]  917.24  932.19  945.57  947.40  952.94  953.75  956.57  957.89  962.45
[280]  963.38 2859.00 2877.00 2972.00 3037.00 3003.00 3029.00 3035.00 3041.00
[289] 3051.00 3059.00  105.05  105.67  106.51  108.84  109.01  109.38  110.32
[298]  110.43  111.07  111.57 3328.00 3380.00 3392.00 3363.00 3387.00 3395.00
[307] 3396.00 3405.00 3409.00 3409.00  119.90  121.18  121.45  121.60  121.66
[316]  121.77  121.78  122.89  123.19  123.23 2865.00 2876.00 2900.00 2954.00
[325] 2971.00 2966.00 2971.00 2974.00 3008.00 3014.00  103.96  108.70  109.24
[334]  109.95  109.96  110.34  110.47  110.49  110.51  110.76  106.97  108.74
[343]  109.74  110.51  110.78  111.11  111.24  111.48  111.82  111.83 1187.00
[352] 1203.00 1236.00 1208.00 1210.00 1214.00 1218.00 1222.00 1225.00 1228.00
[361] 2608.00 2649.00 2654.00 2683.00 2697.00 2699.00 2721.00 2721.00 2723.00
[370] 2727.00   97.98   98.49   99.09   99.19   99.66  100.21  100.22  100.44
[379]  100.70  100.70  235.61  236.68  236.88  239.02  239.13  239.17  240.63
[388]  241.14  242.38  242.99 1352.00 1370.00 1404.00 1380.00 1404.00 1425.00
[397] 1452.00 1454.00 1465.00 1467.00 1513.00 1521.00 1528.00 1510.00 1520.00
[406] 1547.00 1555.00 1563.00 1561.00 1565.00 1260.00 1298.00 1308.00 1318.00
[415] 1320.00 1341.00 1343.00 1320.00 1326.00 1327.00 2821.00 2892.00 2919.00
[424] 2930.00 2934.00 2962.00 2940.00 2985.00 2992.00 3000.00 3292.00 3315.00
[433] 3321.00 3347.00 3362.00 3364.00 3373.00 3379.00 3395.00 3408.00

Perfect! Let’s mutate() that variable in the data frame.

DAT |>
  slice(1:5) |>   # rows 1 through 5
  mutate(time = period_to_seconds(ms(time)))

  time             name year   event   team
1 1409 Jocelyn Crawford 2019 50 FREE Athena
2 1411    Ava Sealander 2022 50 FREE Athena
3 1429        Kelly Ngo 2016 50 FREE Athena
4 1451        Helen Liu 2014 50 FREE Athena
5 1456      Michele Kee 2014 50 FREE Athena

Let’s look at DAT now and see those seconds. You can look at the entire data frame if you wish rather than its head().

head(DAT)

   time             name year   event   team
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31    Ava Sealander 2022 50 FREE Athena
3 23.49        Kelly Ngo 2016 50 FREE Athena
4 23.71        Helen Liu 2014 50 FREE Athena
5 23.76      Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena

The data frame has not changed. The final step is to assign assign the returned data frame to an object. Remove the slice() so we get the entire data frame.

DAT <- DAT |>
  mutate(time = period_to_seconds(ms(time)))

Now let’s write this to /data/processed as a .Rds file using saveRDS().

here::here("data", "processed", "cms-top-all-time-2023-swim.Rds")

[1] "C:/Users/gcook/Sync/git/dataviz/dataviz24/data/processed/cms-top-all-time-2023-swim.Rds"

Looks good. Pass the data frame and a file name arguments to object and file, respectively.

saveRDS(object = DAT, 
        file = here::here("data", "processed", "cms-top-all-time-2023-swim.Rds")
        )

Verify your file it the directory.

Reading Processed Data

Once you have modified a data frame and saved in a processed directory, you will want to read it for other project tasks. Although the DAT object is available in this case, it will not be available in other scripts. This example demonstrates reading the processed file from /data/processed and assigning to an object. Because this file is a .Rds file, we will use readRDS() to read it.

DAT <- readRDS(file = here::here("data", "processed", "cms-top-all-time-2023-swim.Rds"))

NOTE: Clearly, the DAT object is the same as it was before reading it but you typically will not have objects available in .R scripts unless the script defines it. The main exception is when a .R script reads another script file (aka source()) that defines the object.

Quick Summary Using `base R`

If you want a quick summary of data, summary() will provide some basic information for you. You can pass a data frame to the function.

summary(DAT)

      time            name               year              event          
 Min.   :  61.1   Length:440         Length:440         Length:440        
 1st Qu.: 123.9   Class :character   Class :character   Class :character  
 Median :1115.7   Mode  :character   Mode  :character   Mode  :character  
 Mean   :1309.4                                                           
 3rd Qu.:2721.0                                                           
 Max.   :3476.0                                                           
     team          
 Length:440        
 Class :character  
 Mode  :character

Summarizing Data Using {dplyr}

However, there are many ways to summarize data. To introduce data summary techniques using {dplyr}, we will open DAT and use dplyr::summarise() or dplyr::summarize() to summarize the data. The summarise() function works similar to mutate() insofar as it creates variables but it differs insofar as the data frame returned from summarise() contains only the variable(s) referenced in as part of that summary process. This includes some function calls immediately prior to summarize(). In the example below, we summarize by creating a new variable which is set to represent some data summary technique. In essence, summarizing is for descriptive statistics. Using mean(), we can summarize the data by taking the mean of the time variable.

summarize(.data, 
   ..., 
   .by = NULL, 
   .groups = NULL
   )

DAT |>
  summarise(mean = mean(time, na.rm = T))

      mean
1 1309.388

Notice what is returned is a single value reflecting the mean of all the data in the data frame. We could have obtained the same without using {dplyr}.

mean(DAT$time, na.rm = T)        # $ notation

[1] 1309.388

But we lose flexibility of easily adding new summary procedures. We can summarize by two name-value pair functions.

DAT |>
  summarise(mean = mean(time, na.rm = T),
            sd   = sd(time, na.rm = T)
            )

      mean       sd
1 1309.388 1225.964

Now there is a mean and standard deviation for price. You can also add the sample size using dplyr::n().

DAT |>
  summarise(mean = mean(time, na.rm = T),
            sd   = sd(time, na.rm = T),
            n    = n()
            )

      mean       sd   n
1 1309.388 1225.964 440

Summarizing `across()` Multiple Variables

Summarizing a single variable is useful but if you want to summarize by many, you likely do not want to code a new line for each variable. In such cases, you can use across() as a helper function as was used for creating new variables with mutate() (see previous lesson).

Parameters/Arguments:

.cols: specifies the columns to summarize across
.fns: specifies the function(s) for the summary
.names: specifies the output variable names

Remember across() will want you to pass the columns to summarize by, .cols, the function for how to summarize, .fns, and the names for how to name the new variables, .names (which will be NULL by default). The .x here stands for passing the vector to the mean function and not the data frame. More on ~ and .x later.

Summarize across by numeric variables:

DAT |>
  summarise(across(.cols = where(is.numeric),
                   .fns  = ~mean(.x, na.rm = TRUE)
                   )
            )

      time
1 1309.388

Well, that’s now actually impressive because there is only one numeric variable. What if we had more that were piped to summarize()?

DAT |>
  mutate(num1 = time,
         num2 = time,
         num3 = time
         ) |>
  summarise(across(.cols = where(is.numeric), 
                   .fns  = ~mean(.x, na.rm = TRUE)
                   )
            )

      time     num1     num2     num3
1 1309.388 1309.388 1309.388 1309.388

That was easy.

Because across() is so powerful, let’s just add another variable to the data frame for using in examples. You might also wish to reorder the position of variables in the data frame so that they are grouped in some way. We can use dplyr::relocate() to accomplish this. We will move the time column to the position .before one of the new variables using relocate(time, .before = min).

Doing so will also show you some ways to create variables.

DAT <- DAT |>
  mutate(sec = time,      # will be redundant with time but named accurately
         min  = time/60,
         hour = time/(60*60)
         ) |>
  relocate(time, .before = sec)

Take a look:

head(DAT)

              name year   event   team time  sec      min      hour
1 Jocelyn Crawford 2019 50 FREE Athena 1409 1409 23.48333 0.3913889
2    Ava Sealander 2022 50 FREE Athena 1411 1411 23.51667 0.3919444
3        Kelly Ngo 2016 50 FREE Athena 1429 1429 23.81667 0.3969444
4        Helen Liu 2014 50 FREE Athena 1451 1451 24.18333 0.4030556
5      Michele Kee 2014 50 FREE Athena 1456 1456 24.26667 0.4044444
6 Natalia Orbach-M 2020 50 FREE Athena 1457 1457 24.28333 0.4047222

Variable names created with across() is controlled using the .names argument. The default is equivalent to .names = {.col}, which means that the name(s) are inherited from the .cols argument; the names are a stand-in for the name specification. If you wish to have control over the names, you can pass a string that that either appends (e.g.,"{.col}suffix") or prepends (e.g.,"prefix{.col}") a string to each column name. This string looks odd because it’s a special glue specification that glues together with a string with an object. We will use this concept later when using the {glue} library.

When modifying .names, include a character like "_" (e.g.,"{.col}_suffix") so that the column names and the appended text are separated, making the name easily legible. If you summarize to create means, a good suggestion is something like (e.g.,"{.col}_mean" or (e.g.,"{.col}_mn")) so that you know the variable is a mean. If you prefer the function name first, you can use a prefix (e.g.,"mean_{.col}").

DAT |>
  summarise(across(.cols = c("sec", "min", "hour"), 
                   .fns  = ~mean(.x, na.rm = TRUE),
                   .names = "{.col}_mean"
                   )
            )

  sec_mean min_mean hour_mean
1 1309.388 21.82313 0.3637189

You can see how all variables in the summary end in "_mean".

You can also glue the function and the column names together by passing .names = "{.col}_{.fn}".

DAT |>
  summarise(across(.cols = c("sec", "min", "hour"), 
                   .fns  = ~mean(.x, na.rm = TRUE),
                   .names = "{.col}_{.fn}"
                   )
            )

     sec_1    min_1    hour_1
1 1309.388 21.82313 0.3637189

You you will see there is a number that is appended to the variable name. This is because there is only one function passed to .fns. You can pass more using a special object called a list (see ?list). Unlike vectors, elements of lists need not be the same kind. Elements of lists can combinations of characters, numbers, data frames, functions, etc.

Passing multiple functions from a `list()` in `across()`

Passing functions as a list requires a little fancy coding. We will pass two functions as a list so that we can calculate both the mean() and the sd() on the variables passed to across().

A list is a special object (e.g., container) for which its elements can be different types of objects. Whereas elements of vectors can be only character or only numeric, elements of lists can hold different object. One element can be a numeric vector, another element a data frame, another element a character vector, etc. Many functions used in R will actually return lists for which elements contain different types of objects.

OK back to two or more functions. If you pass a list() with arguments for the mean and the sd (e.g., list(mean, sd), you can summarize by both. If you want to prevent errors (yes you do) and want to keep the summaries separate (probably so), you can modify .names to pass both the column and the function (e.g., "{.col}_{.fn}"). The underscore is not needed here; it only helps with readability of the variables so that you don’t end up with variable names like var1mean but instead var1_mean.

Let’s pass the summary procedures as a list to include measures of mean and standard deviation for the variables.

DAT |>
  summarise(across(.cols  = c("sec", "min", "hour"), 
                   .fns   = list(mean, sd),
                   .names = "{.col}_{.fn}")
            )

     sec_1    sec_2    min_1    min_2    hour_1    hour_2
1 1309.388 1225.964 21.82313 20.43274 0.3637189 0.3405456

Well those are not exactly the names we want but it illustrates how names are created. Because we have two summary functions for each column variable passed to across(), they are enumerated according to the order in the list (e.g., mean then standard deviation).

Fixing `.names` when passing lists to `.cols` in `across()`

Enumeration is not helpful for remembering variable names. There are different ways to do fix this problem, some of which may be better under certain scenarios. You have to determine what approach is best but I’ll lay out some limitations. If you pass only the functions into the list, then when you pass {.fn} to .names, the variable names in the returned data frame will take on a numeric value representing the order/element position of the functions as you entered them in the list. In this coding instance, means would be named with"_1" and standard deviation names with "_2". This approach, however, leads to confusing variable names because you have to remember which is 1 and which is 2 and of course explain this to anyone with whom you share the data. Let’s take a look.

A better approach could be to assign the mean and sd functions their own names in the list() function call. By doing so, the name is appended and the new variable is named meaningfully.

Let’s modify what we pass to .fns by passing a list containing 3 functions (e.g., mean(), sd(), and length()) and give each there name. I know this part is confusing because the () are dropped inside the list. This is just how R works. Don’t blame the messenger.

DAT |>
  summarise(across(.cols = c("sec", "min", "hour"), 
                   .fns  = list(mean = mean, 
                                sd = sd,
                                n = length
                                ),
                   .names = "{.col}_{.fn}")
            )

  sec_mean   sec_sd sec_n min_mean   min_sd min_n hour_mean   hour_sd hour_n
1 1309.388 1225.964   440 21.82313 20.43274   440 0.3637189 0.3405456    440

Importantly, however, certain functions like mean() will operate in ways you might not expect. One one hand, it does what we expect when all elements can be used to calculate the mean.

mean(DAT$time)

[1] 1309.388

On the other hand, if there is a missing value, it does not computer the mean but instead something else. Let’s add an NA to the vector using c() to see what happens.

mean(c(DAT$time, NA))

[1] NA

Understanding `NA`s when passing lists to `.cols` in `across()`

The mean() function returns NA rather than a mean. If there is just one NA, mean() returns NA. By design this is actually good.

Let’s also try sd() for the standard deviation of a vector:

sd(c(DAT$time, NA))

[1] NA

The median of a vector, median():

median(c(DAT$time, NA))

[1] NA

The length of a vector, length():

length(c(DAT$time, NA))

[1] 441

Well, that’s interesting. By default length() will return the number of elements of the vector including NAs but by default mean() will not return the mean of a vector with NAs because na.rm = FALSE by default. If you wish to calculate the mean by removing the NAs, pass na.rm = TRUE.

mean(c(DAT$time, NA), na.rm = T)

[1] 1309.388

Make note, however, that the length of this vector without NAs is shorter than the length with NAs. We can test this hypothesis on a vector with and without the NA by using na.omit() to omit any of them. Using our vector we added an NA, let’s omit it.

na.omit(c(DAT$time, NA))

  [1] 1409.00 1411.00 1429.00 1451.00 1456.00 1457.00 1457.00 1467.00 1473.00
 [10] 1442.00 3065.00 3084.00 3101.00 3116.00 3116.00 3148.00 3125.00 3125.00
 [19] 3134.00 3137.00  110.30  111.89  112.40  112.55  112.57  112.67  112.80
 [28]  112.83  112.88  112.89  297.37  299.22  299.78  300.06  300.21  300.47
 [37]  301.17  301.99  302.16  302.23  614.33  615.40  622.59  622.84  624.14
 [46]  624.82  626.34  626.89  633.15  636.60 1018.48 1023.21 1026.78 1029.09
 [55] 1029.26 1032.18 1032.46 1033.58 1041.98 1044.49 3366.00 3367.00 3391.00
 [64] 3371.00 3434.00 3443.00 3438.00 3456.00 3467.00 3476.00  119.91  122.10
 [73]  123.73  123.85  123.92  125.10  125.56  125.71  125.76  126.13   61.84
 [82]   63.28   63.51   63.91   64.16   64.19   64.65   64.66   64.77   64.94
 [91]  134.83  135.62  138.99  139.06  139.78  140.01  141.17  141.77  142.34
[100]  142.83 3316.00 3332.00 3333.00 3323.00 3374.00 3364.00 3387.00 3402.00
[109] 3407.00 3416.00  121.84  122.58  123.80  124.05  124.42  124.81  126.40
[118]  127.12  127.33  127.42  120.69  123.59  123.79  124.74  125.41  126.82
[127]  127.12  127.14  127.81  127.94 1389.00 1412.00 1413.00 1415.00 1407.00
[136] 1411.00 1413.00 1418.00 1425.00 1427.00 3065.00 3067.00 3092.00 3079.00
[145] 3087.00 3088.00 3089.00 3097.00 3105.00 3116.00  110.91  110.98  111.39
[154]  111.41  111.56  111.64  111.95  112.53  112.78  113.04  255.73  267.18
[163]  270.33  270.77  271.96  272.45  272.57  272.68  274.16  274.27 1594.00
[172] 1582.00 1588.00 1617.00 1642.00 1650.00 1634.00 1634.00 1636.00 1637.00
[181] 1713.00 1776.00 1745.00 1746.00 1749.00 1749.00 1764.00 1766.00 1786.00
[190] 1795.00 1445.00 1468.00 1498.00 1499.00 1505.00 1525.00 1505.00 1508.00
[199] 1524.00 1534.00 3305.00 3321.00 3331.00 3311.00 3313.00 3325.00 3327.00
[208] 3345.00 3362.00 3381.00   61.10   62.88   62.89   63.67   64.10   64.16
[217]   64.24   64.26   64.35   64.43 1238.00 1221.00 1222.00 1236.00 1251.00
[226] 1265.00 1269.00 1271.00 1279.00 1282.00 2646.00 2661.00 2713.00 2734.00
[235] 2724.00 2731.00 2732.00 2745.00 2750.00 2750.00   98.35   98.88   99.07
[244]   99.35   99.63   99.80   99.82  100.30  100.31  100.50  265.67  268.11
[253]  268.89  269.32  271.64  272.52  272.65  272.94  272.98  274.70  554.11
[262]  564.43  575.78  576.64  579.27  580.02  581.48  585.72  586.63  587.09
[271]  917.24  932.19  945.57  947.40  952.94  953.75  956.57  957.89  962.45
[280]  963.38 2859.00 2877.00 2972.00 3037.00 3003.00 3029.00 3035.00 3041.00
[289] 3051.00 3059.00  105.05  105.67  106.51  108.84  109.01  109.38  110.32
[298]  110.43  111.07  111.57 3328.00 3380.00 3392.00 3363.00 3387.00 3395.00
[307] 3396.00 3405.00 3409.00 3409.00  119.90  121.18  121.45  121.60  121.66
[316]  121.77  121.78  122.89  123.19  123.23 2865.00 2876.00 2900.00 2954.00
[325] 2971.00 2966.00 2971.00 2974.00 3008.00 3014.00  103.96  108.70  109.24
[334]  109.95  109.96  110.34  110.47  110.49  110.51  110.76  106.97  108.74
[343]  109.74  110.51  110.78  111.11  111.24  111.48  111.82  111.83 1187.00
[352] 1203.00 1236.00 1208.00 1210.00 1214.00 1218.00 1222.00 1225.00 1228.00
[361] 2608.00 2649.00 2654.00 2683.00 2697.00 2699.00 2721.00 2721.00 2723.00
[370] 2727.00   97.98   98.49   99.09   99.19   99.66  100.21  100.22  100.44
[379]  100.70  100.70  235.61  236.68  236.88  239.02  239.13  239.17  240.63
[388]  241.14  242.38  242.99 1352.00 1370.00 1404.00 1380.00 1404.00 1425.00
[397] 1452.00 1454.00 1465.00 1467.00 1513.00 1521.00 1528.00 1510.00 1520.00
[406] 1547.00 1555.00 1563.00 1561.00 1565.00 1260.00 1298.00 1308.00 1318.00
[415] 1320.00 1341.00 1343.00 1320.00 1326.00 1327.00 2821.00 2892.00 2919.00
[424] 2930.00 2934.00 2962.00 2940.00 2985.00 2992.00 3000.00 3292.00 3315.00
[433] 3321.00 3347.00 3362.00 3364.00 3373.00 3379.00 3395.00 3408.00
attr(,"na.action")
[1] 441
attr(,"class")
[1] "omit"

And then get the length when NAs are omitted:

length(na.omit(c(DAT$time, NA)))

[1] 440

This behavior is important because if you want to obtain the mean of a variable with NAs and the sample size using length(), your sample size will be inaccurate.

In order to see these operations on a data frame and in the context of dplyr::summarize(), let’s modify the data frame to make the values for sec and min in row be NA. The data frame will then contain some missing values. We can use dplyr::case_when() for the conditional rather than if.else() or if_else().

Does this look right?

DAT |>
  mutate(across(.cols = c(sec, min),        # for these columns
                .fns = ~case_when(
                  row_number() == 1 ~ NA,   # make this one NA
                  row_number() != 1 ~ .x    # keep it original 
                  )
                )
         ) |>
  head()

              name year   event   team time  sec      min      hour
1 Jocelyn Crawford 2019 50 FREE Athena 1409   NA       NA 0.3913889
2    Ava Sealander 2022 50 FREE Athena 1411 1411 23.51667 0.3919444
3        Kelly Ngo 2016 50 FREE Athena 1429 1429 23.81667 0.3969444
4        Helen Liu 2014 50 FREE Athena 1451 1451 24.18333 0.4030556
5      Michele Kee 2014 50 FREE Athena 1456 1456 24.26667 0.4044444
6 Natalia Orbach-M 2020 50 FREE Athena 1457 1457 24.28333 0.4047222

Yes, row one has NA values, so let’s clobber DAT with this new version. Remove the pipe to head() of course.

DAT <-
  DAT |>
  mutate(across(.cols = c(sec, min),        # for these columns
                .fns = ~case_when(
                  row_number() == 1 ~ NA,   # make this one NA
                  row_number() != 1 ~ .x    # keep it original 
                  )
                )
         )

Comparing some functionality when passing lists to `.cols` in `across()`

When functions do not contain arguments for dealing with NAs, there is na.omit(), a function that takes an object and removes NAs. You can just pass the variable to na.omit() and then wrap it in the metric function of interest. Also, because na.rm = T cannot be used for length(), na.omit() offers consistency across all functions and as a result, I believe, less confusion.

Unfortunately, accomplishing this task can be rather tricky and requires some new syntax. This requires usage of what’s called a “lambda” technique. You will want to incorporate ~ and .x into your code. The ~ is used to indicate that you are supplying a lambda function and use of .x is to indicate where the variable in across() is used. Using this type of syntax, we can pass functions to the .fns argument that operate across a set of variables. The ?across() documentation calls this “a {purrr}-style lambda” in the arguments section. This approach can be a little bit confusing, so I’m going to show you an example, and then walk through it step by step. You can always create code snippets so you don’t have to rely on memory write complicated code like this.

Anyway, we will precede the function with ~ and reference the vector using .x. Let’s do this and change the .fns argument slightly.

Here is a general example:

name = ~function(na.omit(.x))

We will summarize only time and sec because those variables are identical except for the row we added. We will also add dplyr::n() to see what’s going on with that function.

DAT |>
  summarise(across(.cols = c("time", "sec"), 
                   .fns  = list(mean = ~mean(na.omit(.x)),
                                len  = ~length(na.omit(.x)),
                                n    = ~dplyr::n()
                                ),
                   .names = "{.col}_{.fn}"
                   )
            )

  time_mean time_len time_n sec_mean sec_len sec_n
1  1309.388      440    440 1309.161     439   440

So what happened? The means and lengths for time and sec are not the same. Means differ because they are calculated by different values depending on the presence of NAs. But notice that the n’s are the same based on dplyr::n(). How can the means and differ if the n’s are the same?

So what’s the point of all of this? Well, you need to be careful not to apply functions and assume they are doing what you believe you are doing. You always need to be smarter than the code you use. Also, there is no single answer for dealing with data. Sometimes one approach will be appropriate and in other instances another approach will be. You as the data scientist need to know that there are different methods so that you an decide where to apply those different methods.

Summarize by Groups Using `group_by()`

Identifying how to group

When you have subgroups in your data, you will often want to create summary statistics by those group levels. A typical grouping approach is by some categorical or factor variable present in a data frame. Using glympse(), we can view all variables to see what might be of interest.

glimpse(DAT)

Rows: 440
Columns: 8
$ name  <chr> "Jocelyn Crawford", "Ava Sealander", "Kelly Ngo", "Helen Liu", "…
$ year  <chr> "2019", "2022", "2016", "2014", "2014", "2020", "2020", "2010", …
$ event <chr> "50 FREE", "50 FREE", "50 FREE", "50 FREE", "50 FREE", "50 FREE"…
$ team  <chr> "Athena", "Athena", "Athena", "Athena", "Athena", "Athena", "Ath…
$ time  <dbl> 1409.00, 1411.00, 1429.00, 1451.00, 1456.00, 1457.00, 1457.00, 1…
$ sec   <dbl> NA, 1411.00, 1429.00, 1451.00, 1456.00, 1457.00, 1457.00, 1467.0…
$ min   <dbl> NA, 23.516667, 23.816667, 24.183333, 24.266667, 24.283333, 24.28…
$ hour  <dbl> 0.39138889, 0.39194444, 0.39694444, 0.40305556, 0.40444444, 0.40…

Looks like some factor variables we can group by include name, year, event, and team.

Summarize A (one) Specific Variable by Groups Using `group_by()`

A single summary metric

Perhaps you only want to obtain the mean() or the sum() or the sd() for a variable. If so, this is easiest.

Grouping by one variable:

The data contain the top records for swimming events. You might be curious what year was the best of all time or what swimmer (e.g., name) has attained the most records of all time. There are different ways to accomplish this goal.

One approach that might be the most straight forward to new programmers is to mutate a constant count variable on each row which can be used to sum the counts for different groups.

DAT |>
  mutate(count = 1) |>         # mutate a new variable where all rows get a 1
  group_by(name) |>            # group by the swimmer name
  summarise(count = sum(count)) # sum the count and assign it the name count

# A tibble: 141 × 2
   name            count
   <chr>           <dbl>
 1 A Breazeale         8
 2 A Roeseler          2
 3 Aaron Lutzker       6
 4 Abel Sapirstein     1
 5 Alec Vercruysse     4
 6 Alex Mendoza        2
 7 Alex Poltash        6
 8 Allyson Yao         3
 9 Amy Fuller          1
10 Andrew Cox          4
# ℹ 131 more rows

By default, the data frame is arranged by the grouping variable (e.g., name). We can change the order of the rows by count using arrange() but this function by default sorts in an ascending manner. If you want a descending sorting, pass count to desc() to arrange the data frame in this way. We can also assign it to an object.

NAME_count <- DAT |>
  mutate(count = 1) |>
  group_by(name) |>
  summarise(count = sum(count)) |>
  arrange(desc(count))

NAME_count

# A tibble: 141 × 2
   name          count
   <chr>         <dbl>
 1 Michele Kee      11
 2 Augusta Lewis    10
 3 Matt Williams    10
 4 Kelly Ngo         9
 5 A Breazeale       8
 6 Ava Sealander     8
 7 Gary Simon        8
 8 Marco Conati      8
 9 Nic Tekieli       7
10 Aaron Lutzker     6
# ℹ 131 more rows

We can see that the top counts of all time are by Michele Kee for a total of 11. Kudos to Michele!

YEAR_count <- DAT |>
  mutate(count = 1) |>
  group_by(year) |>
  summarise(count = sum(count)) |>
  arrange(desc(count))

YEAR_count

# A tibble: 32 × 2
   year  count
   <chr> <dbl>
 1 2022     81
 2 2023     59
 3 2020     41
 4 2017     38
 5 2014     27
 6 2013     22
 7 2019     22
 8 2015     20
 9 2016     17
10 2018     16
# ℹ 22 more rows

We can see that the year with the most best are by 2022 for a total of 81. Hooray for 2022!

Grouping by two or more variables:

We can also summarize both the teams as well in order to see the top swimmer by team. If you want to summarize more than one variable, pass them both in group_by():

TEAM_NAME_count <- DAT |>
  mutate(count = 1) |>
  group_by(team, name) |>
  summarise(count = sum(count)) |>
  arrange(desc(count))

`summarise()` has grouped output by 'team'. You can override using the
`.groups` argument.

TEAM_NAME_count

# A tibble: 141 × 3
# Groups:   team [2]
   team   name          count
   <chr>  <chr>         <dbl>
 1 Athena Michele Kee      11
 2 Athena Augusta Lewis    10
 3 Stag   Matt Williams    10
 4 Athena Kelly Ngo         9
 5 Athena Ava Sealander     8
 6 Stag   A Breazeale       8
 7 Stag   Gary Simon        8
 8 Stag   Marco Conati      8
 9 Stag   Nic Tekieli       7
10 Athena Ella Blake        6
# ℹ 131 more rows

Filtering rows and then grouping by one variable:

For Athenas:

ATHENA_NAME_count <- DAT |>
  mutate(count = 1) |>
  filter(team == "Athena") |>
  group_by(name) |>
  summarise(count = sum(count)) |>
  arrange(desc(count))

ATHENA_NAME_count

# A tibble: 66 × 2
   name               count
   <chr>              <dbl>
 1 Michele Kee           11
 2 Augusta Lewis         10
 3 Kelly Ngo              9
 4 Ava Sealander          8
 5 Ella Blake             6
 6 Jamee Mitchum          6
 7 Katie Bilotti          6
 8 Mackenzie Mayfield     6
 9 Annika Jessen          5
10 Jocelyn Crawford       5
# ℹ 56 more rows

We can see that the top counts of all time are by Michele Kee with a total of 11. Nice work Michele!

And for Stags:

STAG_NAME_count <- DAT |>
  mutate(count = 1) |>
  filter(team == "Stag") |>
  group_by(name) |>
  summarise(count = sum(count)) |>
  arrange(desc(count))

STAG_NAME_count

# A tibble: 75 × 2
   name          count
   <chr>         <dbl>
 1 Matt Williams    10
 2 A Breazeale       8
 3 Gary Simon        8
 4 Marco Conati      8
 5 Nic Tekieli       7
 6 Aaron Lutzker     6
 7 Alex Poltash      6
 8 Blake Weber       6
 9 Sam Willett       5
10 Tom Harrison      5
# ℹ 65 more rows

We can see that the top counts of all time are by Matt Williams with 10. Go Matt!

To wrap up this example, sometimes working with separate data frames using filter() can provide more useful or manageable summaries.Perhaps you only want to obtain the mean() or the sum() or the sd() for a single variable. If so, this approach may be easiest.

If you are curious, here is a story covering “How CMS teams became the Stags and Athenas”.

Multiple summary metrics

Sometimes you need more than one summary statistic, for example, the mean() and the sd(). This is a little more complex to code.

Note: In these code blocks, some arguments may be removed for readability.

Grouping by one variable:

DAT |>
  group_by(event) |>
  summarise(sec_mean = mean(sec, na.rm = T),  
            sec_median = median(sec, na.rm = T)
            )

# A tibble: 22 × 3
   event               sec_mean sec_median
   <chr>                  <dbl>      <dbl>
 1 100 BACK               3209.      3212.
 2 100 BREAST             1725.      1696.
 3 100 BRST-Relay Spl.    1710.      1678.
 4 100 FLY                3158.      3165 
 5 100 FLY-Relay Spl.     3135.      3152.
 6 100 FREE               2917.      2908.
 7 100 FREE-Relay Spl.    2888.      2896 
 8 1000 FREE               601.       601.
 9 1650 FREE               990.       991.
10 200 BACK                116.       116.
# ℹ 12 more rows

We can also write the summary functions as a list inside across() along with passing .names = "{.col}_{.fn}" if you want the variables named automatically. This approach is more complex but is more flexible.

When you want to summarize across multiple variables using a list of functions, you will want to make sure your .fns argument passes a function using a {purrr}-style lambda (e.g.,~) for that the function is applied across the variables. You will also want to edit .names = "{.col}_{.fn}" so that the naming is done automatically rather than hard coding the names.

DAT |>
  group_by(event) |>
  summarise(across(.cols = "sec",
                   .fns  = list(mean = ~mean(na.omit(.x)),
                                median = ~median(na.omit(.x))
                                ),
                   .names = "{.col}_{.fn}"
                   )
            )

# A tibble: 22 × 3
   event               sec_mean sec_median
   <chr>                  <dbl>      <dbl>
 1 100 BACK               3209.      3212.
 2 100 BREAST             1725.      1696.
 3 100 BRST-Relay Spl.    1710.      1678.
 4 100 FLY                3158.      3165 
 5 100 FLY-Relay Spl.     3135.      3152.
 6 100 FREE               2917.      2908.
 7 100 FREE-Relay Spl.    2888.      2896 
 8 1000 FREE               601.       601.
 9 1650 FREE               990.       991.
10 200 BACK                116.       116.
# ℹ 12 more rows

Grouping by two or more variables:

DAT |>
  group_by(event, team) |>
  summarise(sec_mean = mean(sec, na.rm = T),  
            sec_median = median(sec, na.rm = T)
            )

`summarise()` has grouped output by 'event'. You can override using the
`.groups` argument.

# A tibble: 44 × 4
# Groups:   event [22]
   event               team   sec_mean sec_median
   <chr>               <chr>     <dbl>      <dbl>
 1 100 BACK            Athena   3421.      3436  
 2 100 BACK            Stag     2996.      3032  
 3 100 BREAST          Athena     64.0       64.2
 4 100 BREAST          Stag     3386.      3394. 
 5 100 BRST-Relay Spl. Athena     63.6       64.1
 6 100 BRST-Relay Spl. Stag     3356.      3363  
 7 100 FLY             Athena   3365.      3369  
 8 100 FLY             Stag     2950.      2968. 
 9 100 FLY-Relay Spl.  Athena   3332.      3326  
10 100 FLY-Relay Spl.  Stag     2938.      2937  
# ℹ 34 more rows

Or pass the list:

DAT |>
  group_by(event, team) |>
  summarise(across(.cols = "sec",
                   .fns  = list(mean = ~mean(na.omit(.x)),
                                median = ~median(na.omit(.x))
                                ),
                   .names = "{.col}_{.fn}"
                   )
            )

`summarise()` has grouped output by 'event'. You can override using the
`.groups` argument.

# A tibble: 44 × 4
# Groups:   event [22]
   event               team   sec_mean sec_median
   <chr>               <chr>     <dbl>      <dbl>
 1 100 BACK            Athena   3421.      3436  
 2 100 BACK            Stag     2996.      3032  
 3 100 BREAST          Athena     64.0       64.2
 4 100 BREAST          Stag     3386.      3394. 
 5 100 BRST-Relay Spl. Athena     63.6       64.1
 6 100 BRST-Relay Spl. Stag     3356.      3363  
 7 100 FLY             Athena   3365.      3369  
 8 100 FLY             Stag     2950.      2968. 
 9 100 FLY-Relay Spl.  Athena   3332.      3326  
10 100 FLY-Relay Spl.  Stag     2938.      2937  
# ℹ 34 more rows

Summarizing Multiple Variables by Groups Using `group_by()`

So far, we have shown how to summarize a single variable either with or without grouping by levels of another variable. Summaries, however, are often done for multiple variables in data frame. For example, you might want to obtain the mean() for multiple variables or obtain the mean() and the max() (or some other summary statistic) for multiple variables. The following examples prepare you for such tasks.

A single summary metric

Grouping by one variable:

DAT |>
  group_by(event) |>
  summarise(across(.cols = c("sec", "min"),
                   .fns = ~mean(.x, na.rm = T),
                   .names = "{.col}_{.fn}"
                   )
            )

# A tibble: 22 × 3
   event               sec_1 min_1
   <chr>               <dbl> <dbl>
 1 100 BACK            3209. 53.5 
 2 100 BREAST          1725. 28.8 
 3 100 BRST-Relay Spl. 1710. 28.5 
 4 100 FLY             3158. 52.6 
 5 100 FLY-Relay Spl.  3135. 52.2 
 6 100 FREE            2917. 48.6 
 7 100 FREE-Relay Spl. 2888. 48.1 
 8 1000 FREE            601. 10.0 
 9 1650 FREE            990. 16.5 
10 200 BACK             116.  1.94
# ℹ 12 more rows

Grouping by two or more variables:

DAT |>
  group_by(event, team) |>
  summarise(across(.cols = c("sec", "min"),
                   .fns = ~mean(.x, na.rm = T),
                   .names = "{.col}_{.fn}"
                   )
            )

`summarise()` has grouped output by 'event'. You can override using the
`.groups` argument.

# A tibble: 44 × 4
# Groups:   event [22]
   event               team    sec_1 min_1
   <chr>               <chr>   <dbl> <dbl>
 1 100 BACK            Athena 3421.  57.0 
 2 100 BACK            Stag   2996.  49.9 
 3 100 BREAST          Athena   64.0  1.07
 4 100 BREAST          Stag   3386.  56.4 
 5 100 BRST-Relay Spl. Athena   63.6  1.06
 6 100 BRST-Relay Spl. Stag   3356.  55.9 
 7 100 FLY             Athena 3365.  56.1 
 8 100 FLY             Stag   2950.  49.2 
 9 100 FLY-Relay Spl.  Athena 3332.  55.5 
10 100 FLY-Relay Spl.  Stag   2938.  49.0 
# ℹ 34 more rows

Multiple summary metrics

And if you need to summarize using multiple metrics, then pass the list into .fns:

Grouping by one variable:

DAT |>
  group_by(event) |>
  summarise(across(.cols = c("sec", "min"),
                   .fns  = list(mean = ~mean(na.omit(.x)),
                                median = ~median(na.omit(.x))
                                ),
                   .names = "{.col}_{.fn}"
                   )
            )

# A tibble: 22 × 5
   event               sec_mean sec_median min_mean min_median
   <chr>                  <dbl>      <dbl>    <dbl>      <dbl>
 1 100 BACK               3209.      3212.    53.5       53.5 
 2 100 BREAST             1725.      1696.    28.8       28.3 
 3 100 BRST-Relay Spl.    1710.      1678.    28.5       28.0 
 4 100 FLY                3158.      3165     52.6       52.8 
 5 100 FLY-Relay Spl.     3135.      3152.    52.2       52.5 
 6 100 FREE               2917.      2908.    48.6       48.5 
 7 100 FREE-Relay Spl.    2888.      2896     48.1       48.3 
 8 1000 FREE               601.       601.    10.0       10.0 
 9 1650 FREE               990.       991.    16.5       16.5 
10 200 BACK                116.       116.     1.94       1.93
# ℹ 12 more rows

Grouping by two or more variables:

DAT |>
  group_by(event, team) |>
  summarise(across(.cols = c("sec", "min"),
                   .fns  = list(mean = ~mean(na.omit(.x)),
                                median = ~median(na.omit(.x))
                                ),
                   .names = "{.col}_{.fn}"
                   )
            )

`summarise()` has grouped output by 'event'. You can override using the
`.groups` argument.

# A tibble: 44 × 6
# Groups:   event [22]
   event               team   sec_mean sec_median min_mean min_median
   <chr>               <chr>     <dbl>      <dbl>    <dbl>      <dbl>
 1 100 BACK            Athena   3421.      3436      57.0       57.3 
 2 100 BACK            Stag     2996.      3032      49.9       50.5 
 3 100 BREAST          Athena     64.0       64.2     1.07       1.07
 4 100 BREAST          Stag     3386.      3394.     56.4       56.6 
 5 100 BRST-Relay Spl. Athena     63.6       64.1     1.06       1.07
 6 100 BRST-Relay Spl. Stag     3356.      3363      55.9       56.0 
 7 100 FLY             Athena   3365.      3369      56.1       56.2 
 8 100 FLY             Stag     2950.      2968.     49.2       49.5 
 9 100 FLY-Relay Spl.  Athena   3332.      3326      55.5       55.4 
10 100 FLY-Relay Spl.  Stag     2938.      2937      49.0       49.0 
# ℹ 34 more rows

Depends on the order in group_by(), so change the order:

DAT |>
  group_by(team, event) |>
  summarise(across(.cols = "sec",
                   .fns  = list(mean = ~mean(na.omit(.x)),
                                n = ~length(na.omit(.x))
                                ),
                   .names = "{.col}_{.fn}"
                   )
            )

`summarise()` has grouped output by 'team'. You can override using the
`.groups` argument.

# A tibble: 44 × 4
# Groups:   team [2]
   team   event               sec_mean sec_n
   <chr>  <chr>                  <dbl> <int>
 1 Athena 100 BACK              3421.     10
 2 Athena 100 BREAST              64.0    10
 3 Athena 100 BRST-Relay Spl.     63.6    10
 4 Athena 100 FLY               3365.     10
 5 Athena 100 FLY-Relay Spl.    3332.     10
 6 Athena 100 FREE              3115.     10
 7 Athena 100 FREE-Relay Spl.   3088.     10
 8 Athena 1000 FREE              625.     10
 9 Athena 1650 FREE             1031.     10
10 Athena 200 BACK               124.     10
# ℹ 34 more rows

Notice the change in the order of the column variables. But remember, you can change the order later using select() and/or relocate().

Summarizing Multiple Variables With Reference by Name`

Variables that are numeric

You can also pass variables that are a certain type, like numeric.

DAT |>
  group_by(team, event) |>
  summarise(across(.cols = where(is.numeric), 
                   .fns = ~mean(.x, na.rm = TRUE), 
                   .names = "{.col}")
            )

`summarise()` has grouped output by 'team'. You can override using the
`.groups` argument.

# A tibble: 44 × 6
# Groups:   team [2]
   team   event                 time    sec   min   hour
   <chr>  <chr>                <dbl>  <dbl> <dbl>  <dbl>
 1 Athena 100 BACK            3421.  3421.  57.0  0.950 
 2 Athena 100 BREAST            64.0   64.0  1.07 0.0178
 3 Athena 100 BRST-Relay Spl.   63.6   63.6  1.06 0.0177
 4 Athena 100 FLY             3365.  3365.  56.1  0.935 
 5 Athena 100 FLY-Relay Spl.  3332.  3332.  55.5  0.926 
 6 Athena 100 FREE            3115.  3115.  51.9  0.865 
 7 Athena 100 FREE-Relay Spl. 3088.  3088.  51.5  0.858 
 8 Athena 1000 FREE            625.   625.  10.4  0.174 
 9 Athena 1650 FREE           1031.  1031.  17.2  0.286 
10 Athena 200 BACK             124.   124.   2.07 0.0345
# ℹ 34 more rows

Variables by pattern match

This approach is fun, especially if you have already named variables in ways that make selection really useful. This data frame is constrained a bit so the examples may be silly.

Using starts_with()

DAT |>
  group_by(team, event) |>
  summarise(across(.cols = starts_with("t"),
                   .fns  = list(mean = ~mean(na.omit(.x)),
                                n = ~length(na.omit(.x))
                                ),
                   .names = "{.col}_{.fn}"
                   )
            )

`summarise()` has grouped output by 'team'. You can override using the
`.groups` argument.

# A tibble: 44 × 4
# Groups:   team [2]
   team   event               time_mean time_n
   <chr>  <chr>                   <dbl>  <int>
 1 Athena 100 BACK               3421.      10
 2 Athena 100 BREAST               64.0     10
 3 Athena 100 BRST-Relay Spl.      63.6     10
 4 Athena 100 FLY                3365.      10
 5 Athena 100 FLY-Relay Spl.     3332.      10
 6 Athena 100 FREE               3115.      10
 7 Athena 100 FREE-Relay Spl.    3088.      10
 8 Athena 1000 FREE               625.      10
 9 Athena 1650 FREE              1031.      10
10 Athena 200 BACK                124.      10
# ℹ 34 more rows

Using `&` for complex selection

You obviously cannot calculate numeric metrics for character variables. But how might you select variables that contain a certain character pattern but are also numeric? You cannot nest these functions (e.g., where(is.numeric(contains("pattern")))). You can, however, pass the functions separately.

DAT |>
  group_by(team, event) |>
  summarise(across(.cols = contains("e") & where(is.numeric),
                   .fns  = list(mean = ~mean(na.omit(.x)),
                                n = ~length(na.omit(.x))
                                ),
                   .names = "{.col}_{.fn}"
                   )
            )

`summarise()` has grouped output by 'team'. You can override using the
`.groups` argument.

# A tibble: 44 × 6
# Groups:   team [2]
   team   event               time_mean time_n sec_mean sec_n
   <chr>  <chr>                   <dbl>  <int>    <dbl> <int>
 1 Athena 100 BACK               3421.      10   3421.     10
 2 Athena 100 BREAST               64.0     10     64.0    10
 3 Athena 100 BRST-Relay Spl.      63.6     10     63.6    10
 4 Athena 100 FLY                3365.      10   3365.     10
 5 Athena 100 FLY-Relay Spl.     3332.      10   3332.     10
 6 Athena 100 FREE               3115.      10   3115.     10
 7 Athena 100 FREE-Relay Spl.    3088.      10   3088.     10
 8 Athena 1000 FREE               625.      10    625.     10
 9 Athena 1650 FREE              1031.      10   1031.     10
10 Athena 200 BACK                124.      10    124.     10
# ℹ 34 more rows

Summary

You can see that summarizing data can be fun, interesting, simple, but sometimes complex. The examples above are only limited to selecting variables by name. Of course, you can summarize() across() variables that contain() characters or that starts_with() some characters or even across variables that is.numeric() (pardon the grammar) by pairing these functions with the

A Functional Approach

You can also throw your summaries into functions if you wish. We will create a new object that is a function object. We need to give it a name and we need to define arguments to make the function operate. We will want to make sure we have numeric variables.

summarizer <- function(data, 
                       cols = NULL, 
                       ...
                       ) {
  data |>
    group_by(...) |>
    summarise(across(.cols = {{cols}} & where(is.numeric),
                     .fns = list(
                       mean = ~mean(.x, na.rm = TRUE),
                       sd   = ~sd(.x, na.rm = TRUE)
                       ), 
                     .names = "{col}_{fn}")
              )
}

Test the function:

Without grouping:

summarizer(DAT, cols = contains("e"))

# A tibble: 1 × 4
  time_mean time_sd sec_mean sec_sd
      <dbl>   <dbl>    <dbl>  <dbl>
1     1309.   1226.    1309.  1227.

With grouping:

summarizer(DAT, cols = c(min, hour), event)

# A tibble: 22 × 5
   event               min_mean min_sd hour_mean hour_sd
   <chr>                  <dbl>  <dbl>     <dbl>   <dbl>
 1 100 BACK               53.5   3.76     0.891  0.0626 
 2 100 BREAST             28.8  28.4      0.479  0.473  
 3 100 BRST-Relay Spl.    28.5  28.1      0.475  0.469  
 4 100 FLY                52.6   3.63     0.877  0.0605 
 5 100 FLY-Relay Spl.     52.2   3.44     0.871  0.0573 
 6 100 FREE               48.6   3.43     0.810  0.0571 
 7 100 FREE-Relay Spl.    48.1   3.46     0.802  0.0576 
 8 1000 FREE              10.0   0.432    0.167  0.00719
 9 1650 FREE              16.5   0.728    0.275  0.0121 
10 200 BACK                1.94  0.136    0.0324 0.00227
# ℹ 12 more rows

And of course, when you really get excited, you could add functions so that you can perform different metrics. When you are done, you can save your favorite function to a file you can source().

Session Info

sessionInfo()

R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] htmltools_0.5.8.1 DT_0.33           vroom_1.6.5       lubridate_1.9.3  
 [5] forcats_1.0.0     stringr_1.5.1     dplyr_1.1.4       purrr_1.0.2      
 [9] readr_2.1.5       tidyr_1.3.1       tibble_3.2.1      ggplot2_3.5.1    
[13] tidyverse_2.0.0  

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        generics_0.1.3    stringi_1.8.4     hms_1.1.3        
 [5] digest_0.6.36     magrittr_2.0.3    evaluate_0.24.0   grid_4.4.1       
 [9] timechange_0.3.0  fastmap_1.2.0     R.oo_1.26.0       rprojroot_2.0.4  
[13] jsonlite_1.8.8    zip_2.3.1         R.utils_2.12.3    fansi_1.0.6      
[17] scales_1.3.0      cli_3.6.3         rlang_1.1.4       crayon_1.5.3     
[21] R.methodsS3_1.8.2 bit64_4.0.5       munsell_0.5.1     withr_3.0.1      
[25] yaml_2.3.10       tools_4.4.1       tzdb_0.4.0        colorspace_2.1-0 
[29] here_1.0.1        vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4  
[33] htmlwidgets_1.6.4 bit_4.0.5         pkgconfig_2.0.3   pillar_1.9.0     
[37] openxlsx_4.2.5.2  gtable_0.3.5      Rcpp_1.0.12       glue_1.7.0       
[41] xfun_0.45         tidyselect_1.2.1  rstudioapi_0.16.0 knitr_1.47       
[45] rmarkdown_2.27    compiler_4.4.1

Overview

To Do

Supplemental Readings

Task

Libraries

External Functions

Libraries

Reading and Looking at Data

Reading .csv files

Reading .xlsx files

Looking at Data

Modifying a Data Frame or Tibble Using {dplyr}

Creating a new variable

Modifying a new variable

Set an existing variable equal to a constant

Set an existing variable equal to another value

Converting variables that are time related using {lubridate}

Reading Processed Data

Quick Summary Using base R

Summarizing Data Using {dplyr}

Summarizing across() Multiple Variables

Summarize across by numeric variables:

Passing multiple functions from a list() in across()

Fixing .names when passing lists to .cols in across()

Understanding NAs when passing lists to .cols in across()

Comparing some functionality when passing lists to .cols in across()

Summarize by Groups Using group_by()

Identifying how to group

Summarize A (one) Specific Variable by Groups Using group_by()

A single summary metric

Grouping by one variable:

Grouping by two or more variables:

Filtering rows and then grouping by one variable:

Multiple summary metrics

Grouping by one variable:

Grouping by two or more variables:

Summarizing Multiple Variables by Groups Using group_by()

A single summary metric

Grouping by one variable:

Grouping by two or more variables:

Multiple summary metrics

Grouping by one variable:

Grouping by two or more variables:

Summarizing Multiple Variables With Reference by Name`

Variables that are numeric

Variables by pattern match

Using starts_with()

Using & for complex selection

Summary

A Functional Approach

Session Info

Reading `.csv` files

Reading `.xlsx` files

Quick Summary Using `base R`

Summarizing `across()` Multiple Variables

Passing multiple functions from a `list()` in `across()`

Fixing `.names` when passing lists to `.cols` in `across()`

Understanding `NA`s when passing lists to `.cols` in `across()`

Comparing some functionality when passing lists to `.cols` in `across()`

Summarize by Groups Using `group_by()`

Summarize A (one) Specific Variable by Groups Using `group_by()`

Summarizing Multiple Variables by Groups Using `group_by()`

Using `&` for complex selection