R.utils::sourceDirectory(here::here("src", "functions"))
Data subsets and summaries
Overview
This module demonstrates how to use {dplyr} to subset data, mutate variables, and summarize variables in data frames. Many data summaries involve creating group-level statistics so we will also cover grouping across variables. {lubridate} will be used to handle time vectors. Other functions covered with examples include openxlsx::read.xlsx()
, readRDS()
, saveRDS()
, across()
, relocate()
, arrange()
, and list()
(for passing a list object to summarize()
). Some elements are advanced topics related to preparing data for summarizing.
To Do
Watch video assignment before class for points and for being prepared for class exercises. Bring questions to class. Come prepared to tackle an exercise.
Supplemental Readings
Task
Watch associated video(s) and review module content as necessary for exercises, homework, or projects.
Libraries
- {here} 1.0.1: for path management
- {dplyr} 1.1.4: for selecting, filtering, and mutating
- {lubridate} 1.9.3: for handling date and time vectors
- {openxlsx} 4.2.5.2: for reading .xlsx worksheets
External Functions
Provided:
view_html()
: for viewing data frames in html format, from /src/functions/view_html.R
Libraries
We will work with a few different libraries for data manipulation. Let’s load them into our work space using library()
.
library(dplyr)
library(stringr)
library(lubridate)
Reading and Looking at Data
In order to perform any data summary, you need data. We will use raw data files named "cms-top-all-time-2023-swim.csv"
and "cms-top-all-time-2023-swim.xlsx"
which can be located and downloaded from the Data tab above. You should download these to /data/raw
. We could just access the file from the full URL path, but we will later want to save a cleaned version of the file for reuse, and modifying the file name will be easier than hard coding the path in both places. Thus, we will use here()
to build the file path and pass to the relevant functions. There is also additional content involving mutate()
in order to prepare the data file for summarizing.
Reading .csv
files
This example is a walk through using the raw data file.
We will use here()
to build and pass the file string as the argument to file
and assign to an object named DAT
. If column names are not in the file, modify the header
argument. You could also use readr::read_csv() if you desire (an alternative is sketched below).
DAT <- read.csv(file = here::here("data", "raw", "cms-top-all-time-2023-swim.csv"))
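If you prefer {readr} (assuming it is installed), an equivalent read might look like the following sketch; note that read_csv() returns a tibble rather than a base data frame.
DAT <- readr::read_csv(file = here::here("data", "raw", "cms-top-all-time-2023-swim.csv"))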
Reading .xlsx
files
This example is a walk through using the raw data file.
We will use here()
to build and pass the file string to the xlsxFile
argument and also pass sheet = "swim"
so the swim worksheet is read from the file. If {openxlsx} is not loaded, call the function with reference to the library and assign to an object named DAT
.
DAT <- openxlsx::read.xlsx(
  xlsxFile = here::here("data", "raw", "cms-top-all-time-2023-swim.xlsx"),
  sheet = "swim"
)
Looking at Data
If data wrangling and cleaning skills are not needed, jump to summarizing. In order to prepare a data frame for summarizing, however, you need to ensure it is cleaned and saved.
View its contents:
head(DAT) # or view_html(DAT)
time name year event team
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31 Ava Sealander 2022 50 FREE Athena
3 23.49 Kelly Ngo 2016 50 FREE Athena
4 23.71 Helen Liu 2014 50 FREE Athena
5 23.76 Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena
We have several variables: time, name, year, event, team. Some appear to be numeric and some are characters. Passing the data frame to dplyr::glimpse()
will provide more detail (see also str()
).
dplyr::glimpse(DAT) # or DAT |> dplyr::glimpse()
Rows: 440
Columns: 5
$ time <chr> "23.29", "23.31", "23.49", "23.71", "23.76", "23.77", "23.77", "…
$ name <chr> "Jocelyn Crawford", "Ava Sealander", "Kelly Ngo", "Helen Liu", "…
$ year <chr> "2019", "2022", "2016", "2014", "2014", "2020", "2020", "2010", …
$ event <chr> "50 FREE", "50 FREE", "50 FREE", "50 FREE", "50 FREE", "50 FREE"…
$ team <chr> "Athena", "Athena", "Athena", "Athena", "Athena", "Athena", "Ath…
Hmm, looks like all of the variables are <chr>
and some numbers are in quotes. Values in quotes are strings. So we need to clean up the file before we can even get a summary. Other than looking at the structure of a data frame using glimpse()
or str()
, you can examine vectors individually.
When vectors are in data frames, you can reference them using the $
notation: my_dataframe$my_variable
. Passing this to is.numeric()
will also tell you whether the vector is numeric by returning either TRUE
or FALSE
.
is.numeric(DAT$time)
[1] FALSE
OK, so this variable is definitely not numeric. Looking at the file, we need to convert at least one variable that should be numbers from character to numeric.
Modifying a Data Frame or Tibble Using {dplyr}
Use dplyr::mutate()
to create, modify, and delete column variables. At the very least, you will need to pass a data frame as the first argument, .data
and then a name-value pair as the second argument. There are other arguments that are experimental which will not be the focus of this module.
mutate(
  .data,
  ...,
  .by = NULL,
  .keep = c("all", "used", "unused", "none"),
  .before = NULL,
  .after = NULL
)
We will use mutate()
in conjunction with Base R
’s piping operator, |>
.
Creating a new variable
We will pass the data frame into mutate()
and then specify a name-value variable pair. In order to keep the print out manageable, we will also use the slice()
function.
mutate(data_frame,
new_variable_name = variable
)
Create new variables that are set to a constant number or string:
DAT |>
  slice(1:5) |> # rows 1 through 5
  mutate(newvar1 = 9999) |>
  mutate(newvar2 = "Student")
time name year event team newvar1 newvar2
1 23.29 Jocelyn Crawford 2019 50 FREE Athena 9999 Student
2 23.31 Ava Sealander 2022 50 FREE Athena 9999 Student
3 23.49 Kelly Ngo 2016 50 FREE Athena 9999 Student
4 23.71 Helen Liu 2014 50 FREE Athena 9999 Student
5 23.76 Michele Kee 2014 50 FREE Athena 9999 Student
You can see that each row in the data frame will take on the paired value.
Modifying an existing variable
Existing variables are modified using the same name-value pairing approach. When you modify a variable, you are taking an existing variable and setting it to another value.
Set an existing variable equal to a constant
Just use an existing variable name (left of =
in name-value pair).
DAT |>
  slice(1:5) |> # rows 1 through 5
  mutate(time = 1) |>
  mutate(name = "0")
time name year event team
1 1 0 2019 50 FREE Athena
2 1 0 2022 50 FREE Athena
3 1 0 2016 50 FREE Athena
4 1 0 2014 50 FREE Athena
5 1 0 2014 50 FREE Athena
OK, that’s not very helpful. We just replaced our existing variables with nothing useful. You can see that name is still a <chr>
type.
Set an existing variable equal to another value
As long as R can interpret the character elements of the vector as numbers, as.numeric()
will convert the character strings to numbers. For example:
as.numeric(c("1", "3.2", "6.99"))
[1] 1.00 3.20 6.99
We can illustrate this in a data frame by creating a character value to serve as the constant and converting it with as.numeric()
.
DAT |>
  slice(1:5) |> # rows 1 through 5
  mutate(name = as.numeric("0"))
time name year event team
1 23.29 0 2019 50 FREE Athena
2 23.31 0 2022 50 FREE Athena
3 23.49 0 2016 50 FREE Athena
4 23.71 0 2014 50 FREE Athena
5 23.76 0 2014 50 FREE Athena
And now name
is a <dbl>
, which is a type of numeric. We can see this by selecting columns from the data frame where()
the variable is.numeric()
.
DAT |>
  slice(1:5) |> # rows 1 through 5
  mutate(name = as.numeric("0")) |>
  select(where(is.numeric))
name
1 0
2 0
3 0
4 0
5 0
But if we try to convert time
to numeric this way, you will see that values that are not plain numbers (e.g., times containing colons) are converted to NA
s, or missing values.
as.numeric(DAT$time)
Warning: NAs introduced by coercion
[1] 23.29 23.31 23.49 23.71 23.76 23.77 23.77 23.87 23.93 24.02 51.05 51.24
[13] 51.41 51.56 51.56 51.88 52.05 52.05 52.14 52.17 NA NA NA NA
[25] NA NA NA NA NA NA NA NA NA NA NA NA
[37] NA NA NA NA NA NA NA NA NA NA NA NA
[49] NA NA NA NA NA NA NA NA NA NA NA NA
[61] 55.66 55.67 55.91 56.11 56.74 56.83 57.18 57.36 57.47 57.56 NA NA
[73] NA NA NA NA NA NA NA NA NA NA NA NA
[85] NA NA NA NA NA NA NA NA NA NA NA NA
[97] NA NA NA NA 54.76 54.92 54.93 55.23 55.74 56.04 56.27 56.42
[109] 56.47 56.56 NA NA NA NA NA NA NA NA NA NA
[121] NA NA NA NA NA NA NA NA NA NA 22.69 22.92
[133] 22.93 22.95 23.27 23.31 23.33 23.38 23.45 23.47 50.65 50.67 50.92 51.19
[145] 51.27 51.28 51.29 51.37 51.45 51.56 NA NA NA NA NA NA
[157] NA NA NA NA NA NA NA NA NA NA NA NA
[169] NA NA 25.94 26.22 26.28 26.57 26.82 26.90 27.14 27.14 27.16 27.17
[181] 28.33 28.96 29.05 29.06 29.09 29.09 29.24 29.26 29.46 29.55 24.05 24.28
[193] 24.58 24.59 24.65 24.85 25.05 25.08 25.24 25.34 54.65 54.81 54.91 55.11
[205] 55.13 55.25 55.27 55.45 55.62 56.21 NA NA NA NA NA NA
[217] NA NA NA NA 19.98 20.21 20.22 20.36 20.51 20.65 20.69 20.71
[229] 20.79 20.82 44.06 44.21 44.73 44.94 45.24 45.31 45.32 45.45 45.50 45.50
[241] NA NA NA NA NA NA NA NA NA NA NA NA
[253] NA NA NA NA NA NA NA NA NA NA NA NA
[265] NA NA NA NA NA NA NA NA NA NA NA NA
[277] NA NA NA NA 46.99 47.57 49.32 49.97 50.03 50.29 50.35 50.41
[289] 50.51 50.59 NA NA NA NA NA NA NA NA NA NA
[301] 54.88 55.80 55.92 56.03 56.27 56.35 56.36 56.45 56.49 56.49 NA NA
[313] NA NA NA NA NA NA NA NA 47.45 47.56 47.80 48.74
[325] 48.91 49.26 49.31 49.34 49.68 49.74 NA NA NA NA NA NA
[337] NA NA NA NA NA NA NA NA NA NA NA NA
[349] NA NA 19.47 19.63 19.96 20.08 20.10 20.14 20.18 20.22 20.25 20.28
[361] 43.28 43.69 43.74 44.43 44.57 44.59 44.81 44.81 44.83 44.87 NA NA
[373] NA NA NA NA NA NA NA NA NA NA NA NA
[385] NA NA NA NA NA NA 22.32 22.50 22.84 23.00 23.24 23.45
[397] 23.72 23.74 23.85 23.87 24.73 24.81 24.88 25.10 25.20 25.47 25.55 25.63
[409] 26.01 26.05 20.60 21.38 21.48 21.58 21.60 21.81 21.83 22.00 22.06 22.07
[421] 47.01 47.72 48.39 48.50 48.54 48.82 49.00 49.45 49.52 49.60 54.52 55.15
[433] 55.21 55.47 55.62 56.04 56.13 56.19 56.35 56.48
The problem we have is that the data are not in a clean form. In this data frame, some elements of time
are composed of numbers, decimals, and colons (e.g., x.xx, xx:xx.xx), which cannot all be coerced directly to numbers even though they represent numeric times.
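As a hedged sketch of one possible cleaning step (not necessarily how the provided processed file was built), we could convert entries containing a colon to seconds with {lubridate}, leave plain numbers alone, and then save the result to /data/processed with saveRDS().
# One possible cleaning sketch: colon-formatted times become seconds via {lubridate};
# plain values are coerced with as.numeric(). Warnings from the branch not used for a
# given row can be ignored.
DAT <- DAT |>
  mutate(time = if_else(stringr::str_detect(time, ":"),
                        lubridate::period_to_seconds(lubridate::ms(time)),
                        as.numeric(time)
                        )
         )

# Save the cleaned data frame for later use
saveRDS(DAT, file = here::here("data", "processed", "cms-top-all-time-2023-swim.Rds"))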
Reading Processed Data
Once you have modified a data frame and saved in a processed directory, you will want to read it for other project tasks. Although the DAT
object is available in this case, it will not be available in other scripts. This example demonstrates reading the processed file from /data/processed
and assigning to an object. Because this file is a .Rds
file, we will use readRDS()
to read it.
DAT <- readRDS(file = here::here("data", "processed", "cms-top-all-time-2023-swim.Rds"))
NOTE: Clearly, the DAT
object is the same as it was before reading it but you typically will not have objects available in .R
scripts unless the script defines it. The main exception is when a .R
script reads another script file (aka source()
) that defines the object.
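For instance, a script could define the provided view_html() function by sourcing the file that contains it (path as listed under External Functions above):
# Define view_html() in this script by reading the script file that creates it
source(here::here("src", "functions", "view_html.R"))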
Quick Summary Using base R
If you want a quick summary of data, summary()
will provide some basic information for you. You can pass a data frame to the function.
summary(DAT)
time name year event
Min. : 61.1 Length:440 Length:440 Length:440
1st Qu.: 123.9 Class :character Class :character Class :character
Median :1115.7 Mode :character Mode :character Mode :character
Mean :1309.4
3rd Qu.:2721.0
Max. :3476.0
team
Length:440
Class :character
Mode :character
Summarizing Data Using {dplyr}
However, there are many ways to summarize data. To introduce data summary techniques using {dplyr}, we will open DAT
and use dplyr::summarise()
or dplyr::summarize()
to summarize the data. The summarise()
function works similar to mutate()
insofar as it creates variables but it differs insofar as the data frame returned from summarise()
contains only the variable(s) referenced in as part of that summary process. This includes some function calls immediately prior to summarize()
. In the example below, we summarize by creating a new variable which is set to represent some data summary technique. In essence, summarizing is for descriptive statistics. Using mean()
, we can summarize the data by taking the mean of the time
variable.
summarize(.data,
..., .by = NULL,
.groups = NULL
)
DAT |>
  summarise(mean = mean(time, na.rm = T))
mean
1 1309.388
Notice what is returned is a single value reflecting the mean of all the data in the data frame. We could have obtained the same without using {dplyr}.
mean(DAT$time, na.rm = T) # $ notation
[1] 1309.388
But we lose the flexibility of easily adding new summary procedures. We can summarize with two name-value pairs.
DAT |>
  summarise(mean = mean(time, na.rm = T),
            sd = sd(time, na.rm = T)
            )
mean sd
1 1309.388 1225.964
Now there is a mean and standard deviation for time. You can also add the sample size using dplyr::n()
.
DAT |>
  summarise(mean = mean(time, na.rm = T),
            sd = sd(time, na.rm = T),
            n = n()
            )
mean sd n
1 1309.388 1225.964 440
Summarizing across()
Multiple Variables
Summarizing a single variable is useful but if you want to summarize by many, you likely do not want to code a new line for each variable. In such cases, you can use across()
as a helper function as was used for creating new variables with mutate()
(see previous lesson).
Parameters/Arguments:
- .cols: specifies the columns to summarize across
- .fns: specifies the function(s) for the summary
- .names: specifies the output variable names
Remember across()
will want you to pass the columns to summarize by, .cols
, the function for how to summarize, .fns
, and the names for how to name the new variables, .names
(which will be NULL
by default). The .x
here stands for passing the vector to the mean function and not the data frame. More on ~
and .x
later.
Summarize across by numeric variables:
DAT |>
  summarise(across(.cols = where(is.numeric),
                   .fns = ~mean(.x, na.rm = TRUE)
                   )
            )
time
1 1309.388
Well, that’s not actually impressive because there is only one numeric variable. What if we had more that were piped to summarize()
?
DAT |>
  mutate(num1 = time,
         num2 = time,
         num3 = time
         ) |>
  summarise(across(.cols = where(is.numeric),
                   .fns = ~mean(.x, na.rm = TRUE)
                   )
            )
time num1 num2 num3
1 1309.388 1309.388 1309.388 1309.388
That was easy.
Because across()
is so powerful, let’s just add a few more variables to the data frame for use in examples. You might also wish to reorder the position of variables in the data frame so that they are grouped in some way. We can use dplyr::relocate()
to accomplish this. We will move the time
column to the position .before
one of the new variables using relocate(time, .before = sec)
.
Doing so will also show you some ways to create variables.
DAT <- DAT |>
  mutate(sec = time, # will be redundant with time but named accurately
         min = time/60,
         hour = time/(60*60)
         ) |>
  relocate(time, .before = sec)
Take a look:
head(DAT)
name year event team time sec min hour
1 Jocelyn Crawford 2019 50 FREE Athena 1409 1409 23.48333 0.3913889
2 Ava Sealander 2022 50 FREE Athena 1411 1411 23.51667 0.3919444
3 Kelly Ngo 2016 50 FREE Athena 1429 1429 23.81667 0.3969444
4 Helen Liu 2014 50 FREE Athena 1451 1451 24.18333 0.4030556
5 Michele Kee 2014 50 FREE Athena 1456 1456 24.26667 0.4044444
6 Natalia Orbach-M 2020 50 FREE Athena 1457 1457 24.28333 0.4047222
Variable names created with across() are controlled using the .names argument. The default is equivalent to .names = "{.col}", which means the names are inherited from the columns selected by the .cols argument. If you wish to have control over the names, you can pass a string that either appends (e.g., "{.col}suffix") or prepends (e.g., "prefix{.col}") text to each column name. This string looks odd because it is a special glue specification that glues a string together with an object. We will use this concept later when using the {glue} library.
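As a small preview of the glue idea (assuming the {glue} package is installed), a name in braces inside the string is replaced by the value of an object with that name:
col <- "sec"
glue::glue("{col}_mean") # returns "sec_mean"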
When modifying .names, include a separator character like "_" (e.g., "{.col}_suffix") so that the column name and the appended text are separated, making the name easily legible. If you summarize to create means, a good suggestion is something like "{.col}_mean" or "{.col}_mn" so that you know the variable is a mean. If you prefer the function name first, you can use a prefix (e.g., "mean_{.col}").
DAT |>
  summarise(across(.cols = c("sec", "min", "hour"),
                   .fns = ~mean(.x, na.rm = TRUE),
                   .names = "{.col}_mean"
                   )
            )
sec_mean min_mean hour_mean
1 1309.388 21.82313 0.3637189
You can see how all variables in the summary end in "_mean"
.
You can also glue the function and the column names together by passing .names = "{.col}_{.fn}"
.
DAT |>
  summarise(across(.cols = c("sec", "min", "hour"),
                   .fns = ~mean(.x, na.rm = TRUE),
                   .names = "{.col}_{.fn}"
                   )
            )
sec_1 min_1 hour_1
1 1309.388 21.82313 0.3637189
You will see there is a number appended to the variable name. This is because there is only one function passed to .fns
. You can pass more using a special object called a list (see ?list
). Unlike vectors, elements of lists need not be the same kind. Elements of lists can be combinations of characters, numbers, data frames, functions, etc.
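For example, one list can hold a numeric vector, a character string, and even a function:
my_list <- list(times = c(23.29, 23.31), # a numeric vector
                team = "Athena", # a character string
                f = mean # a function
                )
str(my_list) # each element keeps its own type
my_list$f(my_list$times) # apply the stored function to the stored vector: 23.3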
Passing multiple functions from a list()
in across()
Passing functions as a list requires a little fancy coding. We will pass two functions as a list
so that we can calculate both the mean()
and the sd()
on the variables passed to across()
.
A list
is a special object (e.g., container) for which its elements can be different types of objects. Whereas elements of vectors
can be only character or only numeric, elements of lists can hold different kinds of objects. One element can be a numeric vector, another a data frame, another a character vector, etc. Many functions used in R will actually return lists whose elements contain different types of objects.
OK back to two or more functions. If you pass a list()
with arguments for the mean
and the sd
(e.g., list(mean, sd)
), you can summarize by both. If you want to prevent errors (yes, you do) and want to keep the summaries separate (probably so), you can modify .names
to pass both the column and the function (e.g., "{.col}_{.fn}"
). The underscore is not needed here; it only helps with readability of the variables so that you don’t end up with variable names like var1mean
but instead var1_mean
.
Let’s pass the summary procedures as a list
to include measures of mean and standard deviation for the variables.
DAT |>
  summarise(across(.cols = c("sec", "min", "hour"),
                   .fns = list(mean, sd),
                   .names = "{.col}_{.fn}")
            )
sec_1 sec_2 min_1 min_2 hour_1 hour_2
1 1309.388 1225.964 21.82313 20.43274 0.3637189 0.3405456
Well those are not exactly the names we want but it illustrates how names are created. Because we have two summary functions for each column variable passed to across()
, they are enumerated according to the order in the list (e.g., mean then standard deviation).
Fixing .names
when passing lists to .cols
in across()
Enumeration is not helpful for remembering variable names. There are different ways to fix this problem, some of which may be better under certain scenarios. You have to determine what approach is best but I’ll lay out some limitations. If you pass only the functions into the list, then when you pass {.fn}
to .names
, the variable names in the returned data frame will take on a numeric value representing the order/element position of the functions as you entered them in the list
. In this coding instance, means would be named with "_1" and standard deviations with "_2". This approach, however, leads to confusing variable names because you have to remember which is 1 and which is 2 and, of course, explain this to anyone with whom you share the data (we saw this in the output above).
A better approach could be to assign the mean
and sd
functions their own names in the list()
function call. By doing so, the name is appended and the new variable is named meaningfully.
Let’s modify what we pass to .fns
by passing a list containing 3 functions (e.g., mean()
, sd()
, and length()
) and give each its own name. I know this part is confusing because the ()
are dropped inside the list. This is just how R
works. Don’t blame the messenger.
DAT |>
  summarise(across(.cols = c("sec", "min", "hour"),
                   .fns = list(mean = mean,
                               sd = sd,
                               n = length
                               ),
                   .names = "{.col}_{.fn}")
            )
sec_mean sec_sd sec_n min_mean min_sd min_n hour_mean hour_sd hour_n
1 1309.388 1225.964 440 21.82313 20.43274 440 0.3637189 0.3405456 440
Importantly, however, certain functions like mean()
will operate in ways you might not expect. On the one hand, it does what we expect when all elements can be used to calculate the mean.
mean(DAT$time)
[1] 1309.388
On the other hand, if there is a missing value, it does not compute the mean but instead returns something else. Let’s add an NA
to the vector using c()
to see what happens.
mean(c(DAT$time, NA))
[1] NA
Understanding NA
s when passing lists to .cols
in across()
The mean()
function returns NA
rather than a mean. If there is just one NA
, mean()
returns NA
. By design this is actually good.
Let’s also try sd()
for the standard deviation of a vector:
sd(c(DAT$time, NA))
[1] NA
The median of a vector, median()
:
median(c(DAT$time, NA))
[1] NA
The length of a vector, length()
:
length(c(DAT$time, NA))
[1] 441
Well, that’s interesting. By default length()
will return the number of elements of the vector including NA
s but by default mean()
will not return the mean of a vector with NA
s because na.rm = FALSE
by default. If you wish to calculate the mean by removing the NA
s, pass na.rm = TRUE
.
mean(c(DAT$time, NA), na.rm = T)
[1] 1309.388
Make note, however, that the length of this vector without NA
s is shorter than the length with NA
s. We can test this hypothesis on a vector with and without the NA
by using na.omit()
to omit any of them. Using the vector to which we added an NA
, let’s omit it.
na.omit(c(DAT$time, NA))
[1] 1409.00 1411.00 1429.00 1451.00 1456.00 1457.00 1457.00 1467.00 1473.00
[10] 1442.00 3065.00 3084.00 3101.00 3116.00 3116.00 3148.00 3125.00 3125.00
[19] 3134.00 3137.00 110.30 111.89 112.40 112.55 112.57 112.67 112.80
[28] 112.83 112.88 112.89 297.37 299.22 299.78 300.06 300.21 300.47
[37] 301.17 301.99 302.16 302.23 614.33 615.40 622.59 622.84 624.14
[46] 624.82 626.34 626.89 633.15 636.60 1018.48 1023.21 1026.78 1029.09
[55] 1029.26 1032.18 1032.46 1033.58 1041.98 1044.49 3366.00 3367.00 3391.00
[64] 3371.00 3434.00 3443.00 3438.00 3456.00 3467.00 3476.00 119.91 122.10
[73] 123.73 123.85 123.92 125.10 125.56 125.71 125.76 126.13 61.84
[82] 63.28 63.51 63.91 64.16 64.19 64.65 64.66 64.77 64.94
[91] 134.83 135.62 138.99 139.06 139.78 140.01 141.17 141.77 142.34
[100] 142.83 3316.00 3332.00 3333.00 3323.00 3374.00 3364.00 3387.00 3402.00
[109] 3407.00 3416.00 121.84 122.58 123.80 124.05 124.42 124.81 126.40
[118] 127.12 127.33 127.42 120.69 123.59 123.79 124.74 125.41 126.82
[127] 127.12 127.14 127.81 127.94 1389.00 1412.00 1413.00 1415.00 1407.00
[136] 1411.00 1413.00 1418.00 1425.00 1427.00 3065.00 3067.00 3092.00 3079.00
[145] 3087.00 3088.00 3089.00 3097.00 3105.00 3116.00 110.91 110.98 111.39
[154] 111.41 111.56 111.64 111.95 112.53 112.78 113.04 255.73 267.18
[163] 270.33 270.77 271.96 272.45 272.57 272.68 274.16 274.27 1594.00
[172] 1582.00 1588.00 1617.00 1642.00 1650.00 1634.00 1634.00 1636.00 1637.00
[181] 1713.00 1776.00 1745.00 1746.00 1749.00 1749.00 1764.00 1766.00 1786.00
[190] 1795.00 1445.00 1468.00 1498.00 1499.00 1505.00 1525.00 1505.00 1508.00
[199] 1524.00 1534.00 3305.00 3321.00 3331.00 3311.00 3313.00 3325.00 3327.00
[208] 3345.00 3362.00 3381.00 61.10 62.88 62.89 63.67 64.10 64.16
[217] 64.24 64.26 64.35 64.43 1238.00 1221.00 1222.00 1236.00 1251.00
[226] 1265.00 1269.00 1271.00 1279.00 1282.00 2646.00 2661.00 2713.00 2734.00
[235] 2724.00 2731.00 2732.00 2745.00 2750.00 2750.00 98.35 98.88 99.07
[244] 99.35 99.63 99.80 99.82 100.30 100.31 100.50 265.67 268.11
[253] 268.89 269.32 271.64 272.52 272.65 272.94 272.98 274.70 554.11
[262] 564.43 575.78 576.64 579.27 580.02 581.48 585.72 586.63 587.09
[271] 917.24 932.19 945.57 947.40 952.94 953.75 956.57 957.89 962.45
[280] 963.38 2859.00 2877.00 2972.00 3037.00 3003.00 3029.00 3035.00 3041.00
[289] 3051.00 3059.00 105.05 105.67 106.51 108.84 109.01 109.38 110.32
[298] 110.43 111.07 111.57 3328.00 3380.00 3392.00 3363.00 3387.00 3395.00
[307] 3396.00 3405.00 3409.00 3409.00 119.90 121.18 121.45 121.60 121.66
[316] 121.77 121.78 122.89 123.19 123.23 2865.00 2876.00 2900.00 2954.00
[325] 2971.00 2966.00 2971.00 2974.00 3008.00 3014.00 103.96 108.70 109.24
[334] 109.95 109.96 110.34 110.47 110.49 110.51 110.76 106.97 108.74
[343] 109.74 110.51 110.78 111.11 111.24 111.48 111.82 111.83 1187.00
[352] 1203.00 1236.00 1208.00 1210.00 1214.00 1218.00 1222.00 1225.00 1228.00
[361] 2608.00 2649.00 2654.00 2683.00 2697.00 2699.00 2721.00 2721.00 2723.00
[370] 2727.00 97.98 98.49 99.09 99.19 99.66 100.21 100.22 100.44
[379] 100.70 100.70 235.61 236.68 236.88 239.02 239.13 239.17 240.63
[388] 241.14 242.38 242.99 1352.00 1370.00 1404.00 1380.00 1404.00 1425.00
[397] 1452.00 1454.00 1465.00 1467.00 1513.00 1521.00 1528.00 1510.00 1520.00
[406] 1547.00 1555.00 1563.00 1561.00 1565.00 1260.00 1298.00 1308.00 1318.00
[415] 1320.00 1341.00 1343.00 1320.00 1326.00 1327.00 2821.00 2892.00 2919.00
[424] 2930.00 2934.00 2962.00 2940.00 2985.00 2992.00 3000.00 3292.00 3315.00
[433] 3321.00 3347.00 3362.00 3364.00 3373.00 3379.00 3395.00 3408.00
attr(,"na.action")
[1] 441
attr(,"class")
[1] "omit"
And then get the length when NA
s are omitted:
length(na.omit(c(DAT$time, NA)))
[1] 440
This behavior is important because if you want to obtain the mean of a variable with NA
s and the sample size using length()
, your sample size will be inaccurate.
In order to see these operations on a data frame and in the context of dplyr::summarize()
, let’s modify the data frame to make the values for sec
and min
in row 1 be NA
. The data frame will then contain some missing values. We can use dplyr::case_when()
for the conditional rather than ifelse()
or if_else()
.
Does this look right?
DAT |>
  mutate(across(.cols = c(sec, min), # for these columns
                .fns = ~case_when(
                  row_number() == 1 ~ NA, # make this one NA
                  row_number() != 1 ~ .x # keep it original
                )
                )
         ) |>
  head()
name year event team time sec min hour
1 Jocelyn Crawford 2019 50 FREE Athena 1409 NA NA 0.3913889
2 Ava Sealander 2022 50 FREE Athena 1411 1411 23.51667 0.3919444
3 Kelly Ngo 2016 50 FREE Athena 1429 1429 23.81667 0.3969444
4 Helen Liu 2014 50 FREE Athena 1451 1451 24.18333 0.4030556
5 Michele Kee 2014 50 FREE Athena 1456 1456 24.26667 0.4044444
6 Natalia Orbach-M 2020 50 FREE Athena 1457 1457 24.28333 0.4047222
Yes, row one has NA
values, so let’s clobber DAT
with this new version. Remove the pipe to head()
of course.
DAT <-
  DAT |>
  mutate(across(.cols = c(sec, min), # for these columns
                .fns = ~case_when(
                  row_number() == 1 ~ NA, # make this one NA
                  row_number() != 1 ~ .x # keep it original
                )
                )
         )
Comparing some functionality when passing lists to .cols
in across()
When functions do not contain arguments for dealing with NA
s, there is na.omit()
, a function that takes an object and removes NA
s. You can just pass the variable to na.omit()
and then wrap it in the metric function of interest. Also, because na.rm = T
cannot be used for length()
, na.omit()
offers consistency across all functions and as a result, I believe, less confusion.
Unfortunately, accomplishing this task can be rather tricky and requires some new syntax. This requires usage of what’s called a “lambda” technique. You will want to incorporate ~
and .x
into your code. The ~
is used to indicate that you are supplying a lambda function and use of .x
is to indicate where the variable in across()
is used. Using this type of syntax, we can pass functions to the .fns
argument that operate across a set of variables. The ?across()
documentation calls this “a {purrr}-style lambda” in the arguments section. This approach can be a little bit confusing, so I’m going to show you an example, and then walk through it step by step. You can always create code snippets so you don’t have to rely on memory to write complicated code like this.
Anyway, we will precede the function with ~
and reference the vector using .x
. Let’s do this and change the .fns
argument slightly.
Here is a general example:
name = ~function(na.omit(.x))
We will summarize only time
and sec
because those variables are identical except for the row we added. We will also add dplyr::n()
to see what’s going on with that function.
DAT |>
  summarise(across(.cols = c("time", "sec"),
                   .fns = list(mean = ~mean(na.omit(.x)),
                               len = ~length(na.omit(.x)),
                               n = ~dplyr::n()
                               ),
                   .names = "{.col}_{.fn}"
                   )
            )
time_mean time_len time_n sec_mean sec_len sec_n
1 1309.388 440 440 1309.161 439 440
So what happened? The means and lengths for time
and sec
are not the same. The means differ because they are calculated from different sets of values depending on the presence of NA
s. But notice that the n’s are the same based on dplyr::n()
. How can the means differ if the n’s are the same? Because dplyr::n() counts the rows in the data (or group) regardless of missingness, whereas length(na.omit(.x)) counts only the non-missing values.
So what’s the point of all of this? Well, you need to be careful not to apply functions and assume they are doing what you believe they are doing. You always need to be smarter than the code you use. Also, there is no single answer for dealing with data. Sometimes one approach will be appropriate and in other instances another approach will be. You as the data scientist need to know that there are different methods so that you can decide where to apply them.
Summarize by Groups Using group_by()
Identifying how to group
When you have subgroups in your data, you will often want to create summary statistics by those group levels. A typical grouping approach is by some categorical or factor variable present in a data frame. Using glimpse()
, we can view all variables to see what might be of interest.
glimpse(DAT)
Rows: 440
Columns: 8
$ name <chr> "Jocelyn Crawford", "Ava Sealander", "Kelly Ngo", "Helen Liu", "…
$ year <chr> "2019", "2022", "2016", "2014", "2014", "2020", "2020", "2010", …
$ event <chr> "50 FREE", "50 FREE", "50 FREE", "50 FREE", "50 FREE", "50 FREE"…
$ team <chr> "Athena", "Athena", "Athena", "Athena", "Athena", "Athena", "Ath…
$ time <dbl> 1409.00, 1411.00, 1429.00, 1451.00, 1456.00, 1457.00, 1457.00, 1…
$ sec <dbl> NA, 1411.00, 1429.00, 1451.00, 1456.00, 1457.00, 1457.00, 1467.0…
$ min <dbl> NA, 23.516667, 23.816667, 24.183333, 24.266667, 24.283333, 24.28…
$ hour <dbl> 0.39138889, 0.39194444, 0.39694444, 0.40305556, 0.40444444, 0.40…
Looks like some factor variables we can group by include name
, year
, event
, and team
.
Summarize A (one) Specific Variable by Groups Using group_by()
A single summary metric
Perhaps you only want to obtain the mean()
or the sum()
or the sd()
for a variable. If so, this is easiest.
Grouping by one variable:
The data contain the top records for swimming events. You might be curious what year
was the best of all time or what swimmer (e.g., name
) has attained the most records of all time. There are different ways to accomplish this goal.
One approach that might be the most straightforward for new programmers is to mutate a constant count variable on each row, which can then be summed for different groups.
DAT |>
  mutate(count = 1) |> # mutate a new variable where all rows get a 1
  group_by(name) |> # group by the swimmer name
  summarise(count = sum(count)) # sum the count and assign it the name count
# A tibble: 141 × 2
name count
<chr> <dbl>
1 A Breazeale 8
2 A Roeseler 2
3 Aaron Lutzker 6
4 Abel Sapirstein 1
5 Alec Vercruysse 4
6 Alex Mendoza 2
7 Alex Poltash 6
8 Allyson Yao 3
9 Amy Fuller 1
10 Andrew Cox 4
# ℹ 131 more rows
By default, the data frame is arranged by the grouping variable (e.g., name
). We can change the order of the rows by count
using arrange()
but this function by default sorts in an ascending manner. If you want a descending sorting, pass count
to desc()
to arrange the data frame in this way. We can also assign it to an object.
NAME_count <- DAT |>
  mutate(count = 1) |>
  group_by(name) |>
  summarise(count = sum(count)) |>
  arrange(desc(count))
NAME_count
# A tibble: 141 × 2
name count
<chr> <dbl>
1 Michele Kee 11
2 Augusta Lewis 10
3 Matt Williams 10
4 Kelly Ngo 9
5 A Breazeale 8
6 Ava Sealander 8
7 Gary Simon 8
8 Marco Conati 8
9 Nic Tekieli 7
10 Aaron Lutzker 6
# ℹ 131 more rows
We can see that the top counts of all time are by Michele Kee for a total of 11. Kudos to Michele!
YEAR_count <- DAT |>
  mutate(count = 1) |>
  group_by(year) |>
  summarise(count = sum(count)) |>
  arrange(desc(count))
YEAR_count
# A tibble: 32 × 2
year count
<chr> <dbl>
1 2022 81
2 2023 59
3 2020 41
4 2017 38
5 2014 27
6 2013 22
7 2019 22
8 2015 20
9 2016 17
10 2018 16
# ℹ 22 more rows
We can see that the year with the most records is 2022, with a total of 81. Hooray for 2022!
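As an aside, the mutate/group_by/summarise pattern above can be collapsed into a single call with dplyr::count(); the sketch below should return the same counts per year.
DAT |>
  count(year, name = "count", sort = TRUE) # one row per year, sorted descending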
Grouping by two or more variables:
We can also group by team in order to see the top swimmers within each team. If you want to group by more than one variable, pass them both to group_by()
:
TEAM_NAME_count <- DAT |>
  mutate(count = 1) |>
  group_by(team, name) |>
  summarise(count = sum(count)) |>
  arrange(desc(count))
`summarise()` has grouped output by 'team'. You can override using the
`.groups` argument.
TEAM_NAME_count
# A tibble: 141 × 3
# Groups: team [2]
team name count
<chr> <chr> <dbl>
1 Athena Michele Kee 11
2 Athena Augusta Lewis 10
3 Stag Matt Williams 10
4 Athena Kelly Ngo 9
5 Athena Ava Sealander 8
6 Stag A Breazeale 8
7 Stag Gary Simon 8
8 Stag Marco Conati 8
9 Stag Nic Tekieli 7
10 Athena Ella Blake 6
# ℹ 131 more rows
Filtering rows and then grouping by one variable:
For Athenas:
ATHENA_NAME_count <- DAT |>
  mutate(count = 1) |>
  filter(team == "Athena") |>
  group_by(name) |>
  summarise(count = sum(count)) |>
  arrange(desc(count))
ATHENA_NAME_count
# A tibble: 66 × 2
name count
<chr> <dbl>
1 Michele Kee 11
2 Augusta Lewis 10
3 Kelly Ngo 9
4 Ava Sealander 8
5 Ella Blake 6
6 Jamee Mitchum 6
7 Katie Bilotti 6
8 Mackenzie Mayfield 6
9 Annika Jessen 5
10 Jocelyn Crawford 5
# ℹ 56 more rows
We can see that the top counts of all time are by Michele Kee with a total of 11. Nice work Michele!
And for Stags:
STAG_NAME_count <- DAT |>
  mutate(count = 1) |>
  filter(team == "Stag") |>
  group_by(name) |>
  summarise(count = sum(count)) |>
  arrange(desc(count))
STAG_NAME_count
# A tibble: 75 × 2
name count
<chr> <dbl>
1 Matt Williams 10
2 A Breazeale 8
3 Gary Simon 8
4 Marco Conati 8
5 Nic Tekieli 7
6 Aaron Lutzker 6
7 Alex Poltash 6
8 Blake Weber 6
9 Sam Willett 5
10 Tom Harrison 5
# ℹ 65 more rows
We can see that the top counts of all time are by Matt Williams with 10. Go Matt!
To wrap up this example, sometimes working with separate data frames using filter()
can provide more useful or manageable summaries. Perhaps you only want to obtain the mean()
or the sum()
or the sd()
for a single variable. If so, this approach may be easiest.
If you are curious, here is a story covering “How CMS teams became the Stags and Athenas”.
Multiple summary metrics
Sometimes you need more than one summary statistic, for example, the mean()
and the sd()
. This is a little more complex to code.
Note: In these code blocks, some arguments may be removed for readability.
Grouping by one variable:
DAT |>
  group_by(event) |>
  summarise(sec_mean = mean(sec, na.rm = T),
            sec_median = median(sec, na.rm = T)
            )
# A tibble: 22 × 3
event sec_mean sec_median
<chr> <dbl> <dbl>
1 100 BACK 3209. 3212.
2 100 BREAST 1725. 1696.
3 100 BRST-Relay Spl. 1710. 1678.
4 100 FLY 3158. 3165
5 100 FLY-Relay Spl. 3135. 3152.
6 100 FREE 2917. 2908.
7 100 FREE-Relay Spl. 2888. 2896
8 1000 FREE 601. 601.
9 1650 FREE 990. 991.
10 200 BACK 116. 116.
# ℹ 12 more rows
We can also write the summary functions as a list inside across()
along with passing .names = "{.col}_{.fn}"
if you want the variables named automatically. This approach is more complex but is more flexible.
When you want to summarize across multiple variables using a list of functions, you will want to make sure your .fns
argument passes a function using a {purrr}-style lambda (e.g.,~
) so that the function is applied across the variables. You will also want to set .names = "{.col}_{.fn}"
so that the naming is done automatically rather than hard coding the names.
DAT |>
  group_by(event) |>
  summarise(across(.cols = "sec",
                   .fns = list(mean = ~mean(na.omit(.x)),
                               median = ~median(na.omit(.x))
                               ),
                   .names = "{.col}_{.fn}"
                   )
            )
# A tibble: 22 × 3
event sec_mean sec_median
<chr> <dbl> <dbl>
1 100 BACK 3209. 3212.
2 100 BREAST 1725. 1696.
3 100 BRST-Relay Spl. 1710. 1678.
4 100 FLY 3158. 3165
5 100 FLY-Relay Spl. 3135. 3152.
6 100 FREE 2917. 2908.
7 100 FREE-Relay Spl. 2888. 2896
8 1000 FREE 601. 601.
9 1650 FREE 990. 991.
10 200 BACK 116. 116.
# ℹ 12 more rows
Grouping by two or more variables:
DAT |>
  group_by(event, team) |>
  summarise(sec_mean = mean(sec, na.rm = T),
            sec_median = median(sec, na.rm = T)
            )
`summarise()` has grouped output by 'event'. You can override using the
`.groups` argument.
# A tibble: 44 × 4
# Groups: event [22]
event team sec_mean sec_median
<chr> <chr> <dbl> <dbl>
1 100 BACK Athena 3421. 3436
2 100 BACK Stag 2996. 3032
3 100 BREAST Athena 64.0 64.2
4 100 BREAST Stag 3386. 3394.
5 100 BRST-Relay Spl. Athena 63.6 64.1
6 100 BRST-Relay Spl. Stag 3356. 3363
7 100 FLY Athena 3365. 3369
8 100 FLY Stag 2950. 2968.
9 100 FLY-Relay Spl. Athena 3332. 3326
10 100 FLY-Relay Spl. Stag 2938. 2937
# ℹ 34 more rows
Or pass the list:
DAT |>
  group_by(event, team) |>
  summarise(across(.cols = "sec",
                   .fns = list(mean = ~mean(na.omit(.x)),
                               median = ~median(na.omit(.x))
                               ),
                   .names = "{.col}_{.fn}"
                   )
            )
`summarise()` has grouped output by 'event'. You can override using the
`.groups` argument.
# A tibble: 44 × 4
# Groups: event [22]
event team sec_mean sec_median
<chr> <chr> <dbl> <dbl>
1 100 BACK Athena 3421. 3436
2 100 BACK Stag 2996. 3032
3 100 BREAST Athena 64.0 64.2
4 100 BREAST Stag 3386. 3394.
5 100 BRST-Relay Spl. Athena 63.6 64.1
6 100 BRST-Relay Spl. Stag 3356. 3363
7 100 FLY Athena 3365. 3369
8 100 FLY Stag 2950. 2968.
9 100 FLY-Relay Spl. Athena 3332. 3326
10 100 FLY-Relay Spl. Stag 2938. 2937
# ℹ 34 more rows
Summarizing Multiple Variables by Groups Using group_by()
So far, we have shown how to summarize a single variable either with or without grouping by levels of another variable. Summaries, however, are often done for multiple variables in a data frame. For example, you might want to obtain the mean()
for multiple variables or obtain the mean()
and the max()
(or some other summary statistic) for multiple variables. The following examples prepare you for such tasks.
A single summary metric
Grouping by one variable:
DAT |>
  group_by(event) |>
  summarise(across(.cols = c("sec", "min"),
                   .fns = ~mean(.x, na.rm = T),
                   .names = "{.col}_{.fn}"
                   )
            )
# A tibble: 22 × 3
event sec_1 min_1
<chr> <dbl> <dbl>
1 100 BACK 3209. 53.5
2 100 BREAST 1725. 28.8
3 100 BRST-Relay Spl. 1710. 28.5
4 100 FLY 3158. 52.6
5 100 FLY-Relay Spl. 3135. 52.2
6 100 FREE 2917. 48.6
7 100 FREE-Relay Spl. 2888. 48.1
8 1000 FREE 601. 10.0
9 1650 FREE 990. 16.5
10 200 BACK 116. 1.94
# ℹ 12 more rows
Grouping by two or more variables:
DAT |>
  group_by(event, team) |>
  summarise(across(.cols = c("sec", "min"),
                   .fns = ~mean(.x, na.rm = T),
                   .names = "{.col}_{.fn}"
                   )
            )
`summarise()` has grouped output by 'event'. You can override using the
`.groups` argument.
# A tibble: 44 × 4
# Groups: event [22]
event team sec_1 min_1
<chr> <chr> <dbl> <dbl>
1 100 BACK Athena 3421. 57.0
2 100 BACK Stag 2996. 49.9
3 100 BREAST Athena 64.0 1.07
4 100 BREAST Stag 3386. 56.4
5 100 BRST-Relay Spl. Athena 63.6 1.06
6 100 BRST-Relay Spl. Stag 3356. 55.9
7 100 FLY Athena 3365. 56.1
8 100 FLY Stag 2950. 49.2
9 100 FLY-Relay Spl. Athena 3332. 55.5
10 100 FLY-Relay Spl. Stag 2938. 49.0
# ℹ 34 more rows
Multiple summary metrics
And if you need to summarize using multiple metrics, then pass the list into .fns
:
Grouping by one variable:
DAT |>
  group_by(event) |>
  summarise(across(.cols = c("sec", "min"),
                   .fns = list(mean = ~mean(na.omit(.x)),
                               median = ~median(na.omit(.x))
                               ),
                   .names = "{.col}_{.fn}"
                   )
            )
# A tibble: 22 × 5
event sec_mean sec_median min_mean min_median
<chr> <dbl> <dbl> <dbl> <dbl>
1 100 BACK 3209. 3212. 53.5 53.5
2 100 BREAST 1725. 1696. 28.8 28.3
3 100 BRST-Relay Spl. 1710. 1678. 28.5 28.0
4 100 FLY 3158. 3165 52.6 52.8
5 100 FLY-Relay Spl. 3135. 3152. 52.2 52.5
6 100 FREE 2917. 2908. 48.6 48.5
7 100 FREE-Relay Spl. 2888. 2896 48.1 48.3
8 1000 FREE 601. 601. 10.0 10.0
9 1650 FREE 990. 991. 16.5 16.5
10 200 BACK 116. 116. 1.94 1.93
# ℹ 12 more rows
Grouping by two or more variables:
DAT |>
  group_by(event, team) |>
  summarise(across(.cols = c("sec", "min"),
                   .fns = list(mean = ~mean(na.omit(.x)),
                               median = ~median(na.omit(.x))
                               ),
                   .names = "{.col}_{.fn}"
                   )
            )
`summarise()` has grouped output by 'event'. You can override using the
`.groups` argument.
# A tibble: 44 × 6
# Groups: event [22]
event team sec_mean sec_median min_mean min_median
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 100 BACK Athena 3421. 3436 57.0 57.3
2 100 BACK Stag 2996. 3032 49.9 50.5
3 100 BREAST Athena 64.0 64.2 1.07 1.07
4 100 BREAST Stag 3386. 3394. 56.4 56.6
5 100 BRST-Relay Spl. Athena 63.6 64.1 1.06 1.07
6 100 BRST-Relay Spl. Stag 3356. 3363 55.9 56.0
7 100 FLY Athena 3365. 3369 56.1 56.2
8 100 FLY Stag 2950. 2968. 49.2 49.5
9 100 FLY-Relay Spl. Athena 3332. 3326 55.5 55.4
10 100 FLY-Relay Spl. Stag 2938. 2937 49.0 49.0
# ℹ 34 more rows
The grouped output depends on the order of the variables in group_by(), so let’s change the order:
DAT |>
  group_by(team, event) |>
  summarise(across(.cols = "sec",
                   .fns = list(mean = ~mean(na.omit(.x)),
                               n = ~length(na.omit(.x))
                               ),
                   .names = "{.col}_{.fn}"
                   )
            )
`summarise()` has grouped output by 'team'. You can override using the
`.groups` argument.
# A tibble: 44 × 4
# Groups: team [2]
team event sec_mean sec_n
<chr> <chr> <dbl> <int>
1 Athena 100 BACK 3421. 10
2 Athena 100 BREAST 64.0 10
3 Athena 100 BRST-Relay Spl. 63.6 10
4 Athena 100 FLY 3365. 10
5 Athena 100 FLY-Relay Spl. 3332. 10
6 Athena 100 FREE 3115. 10
7 Athena 100 FREE-Relay Spl. 3088. 10
8 Athena 1000 FREE 625. 10
9 Athena 1650 FREE 1031. 10
10 Athena 200 BACK 124. 10
# ℹ 34 more rows
Notice the change in the order of the column variables. But remember, you can change the order later using select()
and/or relocate()
.
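For example, the previous summary could be piped into relocate() to move the count ahead of the mean:
DAT |>
  group_by(team, event) |>
  summarise(across(.cols = "sec",
                   .fns = list(mean = ~mean(na.omit(.x)),
                               n = ~length(na.omit(.x))
                               ),
                   .names = "{.col}_{.fn}"
                   )
            ) |>
  relocate(sec_n, .before = sec_mean) # put the count before the mean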
Summarizing Multiple Variables With Reference by Name
Variables that are numeric
You can also pass variables that are a certain type, like numeric.
DAT |>
  group_by(team, event) |>
  summarise(across(.cols = where(is.numeric),
                   .fns = ~mean(.x, na.rm = TRUE),
                   .names = "{.col}")
            )
`summarise()` has grouped output by 'team'. You can override using the
`.groups` argument.
# A tibble: 44 × 6
# Groups: team [2]
team event time sec min hour
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Athena 100 BACK 3421. 3421. 57.0 0.950
2 Athena 100 BREAST 64.0 64.0 1.07 0.0178
3 Athena 100 BRST-Relay Spl. 63.6 63.6 1.06 0.0177
4 Athena 100 FLY 3365. 3365. 56.1 0.935
5 Athena 100 FLY-Relay Spl. 3332. 3332. 55.5 0.926
6 Athena 100 FREE 3115. 3115. 51.9 0.865
7 Athena 100 FREE-Relay Spl. 3088. 3088. 51.5 0.858
8 Athena 1000 FREE 625. 625. 10.4 0.174
9 Athena 1650 FREE 1031. 1031. 17.2 0.286
10 Athena 200 BACK 124. 124. 2.07 0.0345
# ℹ 34 more rows
Variables by pattern match
This approach is fun, especially if you have already named variables in ways that make selection really useful. This data frame is constrained a bit so the examples may be silly.
Using starts_with()
DAT |>
  group_by(team, event) |>
  summarise(across(.cols = starts_with("t"),
                   .fns = list(mean = ~mean(na.omit(.x)),
                               n = ~length(na.omit(.x))
                               ),
                   .names = "{.col}_{.fn}"
                   )
            )
`summarise()` has grouped output by 'team'. You can override using the
`.groups` argument.
# A tibble: 44 × 4
# Groups: team [2]
team event time_mean time_n
<chr> <chr> <dbl> <int>
1 Athena 100 BACK 3421. 10
2 Athena 100 BREAST 64.0 10
3 Athena 100 BRST-Relay Spl. 63.6 10
4 Athena 100 FLY 3365. 10
5 Athena 100 FLY-Relay Spl. 3332. 10
6 Athena 100 FREE 3115. 10
7 Athena 100 FREE-Relay Spl. 3088. 10
8 Athena 1000 FREE 625. 10
9 Athena 1650 FREE 1031. 10
10 Athena 200 BACK 124. 10
# ℹ 34 more rows
Using &
for complex selection
You obviously cannot calculate numeric metrics for character variables. But how might you select variables that contain a certain character pattern but are also numeric? You cannot nest these functions (e.g., where(is.numeric(contains("pattern")))
). You can, however, pass the selection helpers separately and combine them with &.
DAT |>
  group_by(team, event) |>
  summarise(across(.cols = contains("e") & where(is.numeric),
                   .fns = list(mean = ~mean(na.omit(.x)),
                               n = ~length(na.omit(.x))
                               ),
                   .names = "{.col}_{.fn}"
                   )
            )
`summarise()` has grouped output by 'team'. You can override using the
`.groups` argument.
# A tibble: 44 × 6
# Groups: team [2]
team event time_mean time_n sec_mean sec_n
<chr> <chr> <dbl> <int> <dbl> <int>
1 Athena 100 BACK 3421. 10 3421. 10
2 Athena 100 BREAST 64.0 10 64.0 10
3 Athena 100 BRST-Relay Spl. 63.6 10 63.6 10
4 Athena 100 FLY 3365. 10 3365. 10
5 Athena 100 FLY-Relay Spl. 3332. 10 3332. 10
6 Athena 100 FREE 3115. 10 3115. 10
7 Athena 100 FREE-Relay Spl. 3088. 10 3088. 10
8 Athena 1000 FREE 625. 10 625. 10
9 Athena 1650 FREE 1031. 10 1031. 10
10 Athena 200 BACK 124. 10 124. 10
# ℹ 34 more rows
Summary
You can see that summarizing data can be fun, interesting, simple, but sometimes complex. The examples above are not limited to selecting variables by name. Of course, you can summarize() across() variables that contains() certain characters, that starts_with() some characters, or even variables where is.numeric() is true (pardon the grammar), by pairing these selection helpers with across().
A Functional Approach
You can also throw your summaries into functions if you wish. We will create a new object that is a function object. We need to give it a name and we need to define arguments to make the function operate. We will want to make sure we have numeric variables.
summarizer <- function(data,
                       cols = NULL,
                       ...
                       ) {
  data |>
    group_by(...) |>
    summarise(across(.cols = {{cols}} & where(is.numeric),
                     .fns = list(
                       mean = ~mean(.x, na.rm = TRUE),
                       sd = ~sd(.x, na.rm = TRUE)
                     ),
                     .names = "{col}_{fn}")
              )
}
Test the function:
Without grouping:
summarizer(DAT, cols = contains("e"))
# A tibble: 1 × 4
time_mean time_sd sec_mean sec_sd
<dbl> <dbl> <dbl> <dbl>
1 1309. 1226. 1309. 1227.
With grouping:
summarizer(DAT, cols = c(min, hour), event)
# A tibble: 22 × 5
event min_mean min_sd hour_mean hour_sd
<chr> <dbl> <dbl> <dbl> <dbl>
1 100 BACK 53.5 3.76 0.891 0.0626
2 100 BREAST 28.8 28.4 0.479 0.473
3 100 BRST-Relay Spl. 28.5 28.1 0.475 0.469
4 100 FLY 52.6 3.63 0.877 0.0605
5 100 FLY-Relay Spl. 52.2 3.44 0.871 0.0573
6 100 FREE 48.6 3.43 0.810 0.0571
7 100 FREE-Relay Spl. 48.1 3.46 0.802 0.0576
8 1000 FREE 10.0 0.432 0.167 0.00719
9 1650 FREE 16.5 0.728 0.275 0.0121
10 200 BACK 1.94 0.136 0.0324 0.00227
# ℹ 12 more rows
And of course, when you really get excited, you could add more functions so that you can compute different metrics. When you are done, you can save your favorite function to a file you can source()
.
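As a hedged sketch (the file name here is only illustrative), you could write the function definition to its own script with dump() and then load it in another script with source():
# Write the summarizer() definition to its own script file
dump("summarizer", file = here::here("src", "functions", "summarizer.R"))

# Later, in another script, load the function before using it
source(here::here("src", "functions", "summarizer.R"))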
Session Info
sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] htmltools_0.5.8.1 DT_0.33 vroom_1.6.5 lubridate_1.9.3
[5] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[9] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
[13] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] utf8_1.2.4 generics_0.1.3 stringi_1.8.4 hms_1.1.3
[5] digest_0.6.36 magrittr_2.0.3 evaluate_0.24.0 grid_4.4.1
[9] timechange_0.3.0 fastmap_1.2.0 R.oo_1.26.0 rprojroot_2.0.4
[13] jsonlite_1.8.8 zip_2.3.1 R.utils_2.12.3 fansi_1.0.6
[17] scales_1.3.0 cli_3.6.3 rlang_1.1.4 crayon_1.5.3
[21] R.methodsS3_1.8.2 bit64_4.0.5 munsell_0.5.1 withr_3.0.1
[25] yaml_2.3.10 tools_4.4.1 tzdb_0.4.0 colorspace_2.1-0
[29] here_1.0.1 vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4
[33] htmlwidgets_1.6.4 bit_4.0.5 pkgconfig_2.0.3 pillar_1.9.0
[37] openxlsx_4.2.5.2 gtable_0.3.5 Rcpp_1.0.12 glue_1.7.0
[41] xfun_0.45 tidyselect_1.2.1 rstudioapi_0.16.0 knitr_1.47
[45] rmarkdown_2.27 compiler_4.4.1