Summarizing Data

Author

Gabriel I. Cook

Published

February 29, 2024

Overview

This module describes how to perform statistical transformations on data, including producing data summaries. In order to summarize data, we will again use {dplyr}. The way data are summarized will influence the way the data can be modeled as statistical models typically rely on understanding how predictor variables (independent) can account for variability in outcome variables (dependent). Eliminating variability in data by aggregating will affect the model you can use as well as the statistical inference draw from that statistical model. We will also discuss differences in grouping structure for dplyr::mutate() and dplyr::summarize() that will affect both variable computation and plot creation.

Readings and Preparation

Before Class: First, read to familiarize yourself with the concepts rather than master them. I will assume that you attend class with some level of basic understanding of concepts and working of functions. The goal of reading should be to understand and implement code functions as well as support your understanding and help your troubleshooting of problems. This cannot happen if you just read the content without interacting with it, however reading is absolutely essential to being successful during class time.

Class: In class, some functions and concepts will be introduced and we will practice implementing code through exercises.

Supplementary Readings

To Do

Read through the module. You can use the R console or open up an R Markdown (e.g., .Rmd) file to follow along interactively. If you instead prefer to simply read through the content so that you can understand the concepts without coding, that is fine too. You will practice using the concepts by working on activities. Reading the module will provide you with confidence working on those activities and prevent you from feeling lost while completing activities. Testing out some code may provide you more confidence.

External Functions

Provided in class:

view_html(): for viewing data frames in html format

You can access this remotely from:

source("https://raw.githubusercontent.com/slicesofdata/fods24/main/r/functions/view_html.R")

Or from your own project work space by loading functions in /r/functions.

Libraries

  • {dplyr} 1.1.4: for selecting, filtering, and mutating

Load libraries

library(dplyr)

Loading Data

For this exercise, we will use some data from a 2023 CMS swim meet located at: “https://github.com/slicesofdata/dataviz23/raw/main/data/swim/cleaned-2023-CMS-Invite.csv”.

SWIM <- readr::read_csv(
  file = "https://github.com/slicesofdata/dataviz23/raw/main/data/swim/cleaned-2023-CMS-Invite.csv",
  show_col_types = F
  )

Flavors of Data

Data come in many flavors. Data are collected on virtually everything you could think of. If there is a record of something, there is data. The stars, weather, television viewing, social media use, ocean currents, brand preference, parenting, symphonies, learning to code, graphical perceptual. When there is no record of something, you could collect data on it. For example, you could start logging the quantity of quality of advice your grandparents offer you. You may likely discover that advice changes based on their age. You may also consider logging the number to times you see your grandparents or your parents/guardians annually in order to determine approximately how many more times you will see them before they die, if they haven’t yet. In fact, some folks have already worked on this problem using actuarial tables. Seeyourfolks will provide a model estimate for you. See for yourself. Sure, it may be in error but all models have error. Nevertheless, prediction may change your life.

Data Organization and Variability

In some industries, data may be organized at a high level. For example, someone examining population growth may have data at a census level or an annual level. Someone who studies education may had data organized at a state level or county level. In other industries, data may be organized at a fine-grained level. For example, someone studying health outcomes may have data about individuals daily if daily health checks are part of a data-collection protocol. Someone who studies education may design a study to collect data, or obtain data, on daily-performance of students. Even further, their data may contain repeated measurements of specific topics (e.g., 10 different addition problems rather than one). Similarly, someone studying some aspect of human cognition may have high-level data like overall accuracy and total time needed to complete the task or they might have data at a low level containing trial-level accuracy and reaction timed data for each item.

Depending the organization of data and its level of granularity, one may be limited to a set of statistical models to examine. Because variability in data plays a significant role in testing models, the absence of variability will prevent modeling using statistical techniques that require that variability. For example, if your data set contains only one value for a country, you are likely limited to models that allow for comparisons or associations across countries or across time. For example, you could examine data to understand how population growth for a single country changes over time or you could examine population growth in one country compares to one or more than one country. Using the educational example, one could examine changes in a state over time or one could examine states to one another. By contrast, if you have daily data per student for a variety of topics, you have variability in data to examine at a variety of levels. You could examine how an individual’s performance changes across time in general or specifically for a specific topic. You can also examine how the individual performs on one type of task relative to another. You could also examine changes over time in performance on one task compared with changes over time for other tasks. Stepping up the hierarchy, you could compare performance of individuals across teachers, teacher’s sections, schools, school districts, counties, states, etc. because there is variability in the data to try to understand.

Although the data structure is typically conceived from top to bottom, which lower levels nested under higher levels, I will present the opposite order to prevent you reading from the bottom up.

Daily Performance
- on an item type (e.g., 4 + 17)
-- on a task (e.g., addition)
--- for a person (e.g, Bill)
---- in a teacher's class (e.g., Mrs. Collins, Section 1)
----- in a school (e.g., Robert D. Wilson)
------ in a district (e.g., Western Wayne)
------- in a country (e.g., Wayne)
-------- in a state (e.g., Pennsylvania)
--------- in a country (e.g., United States of America)

The structure at all these levels means that you can perform more complex models like linear-mixed models/hierarchical linear modeling. The variability in the data at all these levels allows you to account for the variability in ways you cannot if you only have top-level data. When you have the variability, you can examine for effect like random effects or fixed effects, which are not applicable to simpler linear models. To the extent that teachers in District A teach differently or motivate students differently from those teachers in District B, you could examine the data for these differences and account for them statistically (hold them constant as if teachers were the same) at higher levels of the data. In fact, that repeated-measures ANOVA model you are taught in your statistics class is not be appropriate for data with repeated measurements under different conditions because two individuals who with the same average performance may differ in their variability. For example, two students answer the subtraction problem (14 - 7) on average in 10.0 seconds across 5 instances. Mildred is consistent and answers it between 8 and 12 seconds. Tilford, however, answers with a range of 5 to 20 seconds. Although performance time is equated, the variability represents something very different in ability.

Making Data

We will work with a tiny data frame containing data for performance scores from students in schools in districts.

DATA <- 
  data.frame(
    Student = c("Jim", "Sally", "June", "Mildred", "Tilford", "Beavis", "Herman", "Peppa", "Kay", "Jake", "Name Missing"),
    Score  = c(80, 85, 79, 80, 81, 91, 89, 60, 65, 67, 65), 
    School = c("A", "A", "B", "B", "C", "C", "C", "D", "D", "E", "E"),
    District = c("West", "West", "West", "West", "East", "East", "East", "North", "North" , "North", "North")
  ) 

You can see that the performance for all students is on average, for example, if you didn’t know where they went to school. We would take only the mean.

Summarizing Data Using {dplyr}

However, there are many ways to summarize data. To introduce data summary techniques using {dplyr}, we will use dplyr::summarise() or dplyr::summarize() to summarize the data in the data frame. The summarize() (or summarise()) function works similar to mutate() insofar as it creates variables but it differs insofar as the data frame returned from summarize() contains only the variable(s) referenced in as part of that summary process. Importantly, summarize() will be performed based on some functions called immediately prior to summarize().

summarize(
   .data, 
   ..., 
   .by = NULL, 
   .groups = NULL
   )

Key Parameters/Arguments:

  • .data: a data frame
  • ...: name-value pairs of variable names and functions for the summary operation

Summarizing as a Constant

When piping code, the data frame will be inherited; the parameter and argument can be omitted. You will need to pass the expressions for the summary, however. Like mutate() The expression will require a variable name as well as a value or function.

Although summarize() can create variables like mutate(), it does so by grouping the data in some way. We will cover using summarize() with group_by() to group data but for now, you can see grouping happens automatically, making summarize() a different function than mutate(). The functions serve different purposes. A good rule of thumb for remembering the differences is that mutate() is used to add column variables to a data frame to make it bigger and summarize() is used to reduce the data frame by summarizing the data within it.

If we assign a variable name a value, we see that the data frame that is returned is now reduced to a single row. Specifically, the default operation is that summarize() groups all rows of the data frame together.

DATA |>
  summarize(my_new_var = 1)
  my_new_var
1          1

Clearly, this is a ridiculous summary of the data. Let’s do something that makes more sense.

Summarizing with a Function

You will likely want to summarize the data in some way, for example, by obtaining the mean, sum, count, standard deviation, or some other statistic from the data. If we wanted to obtain the mean of some variable in the data frame, we would need to specify the variable that we want to summarize, the function for how to summarize that variable and then we would want to specify the variable name for the data frame to contain.

General case:

summarized_var = the_function(my_var)

Specific case:

mean = mean(my_var)

Let’s summarize the performance Score in the data frame. In the example below, we summarize data by creating a new variable which is set to represent some data summary technique. In essence, summarizing is for descriptive statistics. Using mean(), we can summarize the data by taking the mean of the Score variable.

DATA |>
  summarize(bob = mean(Score))
       bob
1 76.54545

We see that the returned data frame contains only one row. By default, the data frame has been reduced to the mean across all rows. The column variable is named bob. Let’s make the variable in the returned data frame make more sense.

DATA |>
  summarize(mean_score = mean(Score))
  mean_score
1   76.54545

Notice what is returned is a single value reflecting the mean of all the data in the data frame. We could have obtained the same without using {dplyr}.

mean(DATA$Score)        # $ notation
[1] 76.54545

However, we lose flexibility of easily adding new summary procedures using this approach. Instead, summarize() allows for multiple name-value pair functions.

Summarizing with Multiple Functions

Let’s create the mean and the standard deviation of Score using mean(). Because you will use mean() often, you should understand how it works, so let’s first look at its parameters and default arguments more carefully.

mean(x, trim = 0, na.rm = FALSE, ...)

By default, the data are not trimmed and NA values are not removed. If NAs exist in your vector, the function will break unless you pass TRUE as the argument to it, na.rm = TRUE. The same is true for sd() for computing the standard deviation.

Let’s summarize Score using both functions. We see that the data frame has one row and two columns reflecting the variables specified in the function.

DATA |>
  summarise(mean = mean(Score, na.rm = T),
            sd   = sd(Score, na.rm = T)
            )
      mean       sd
1 76.54545 10.56753

You see the basic operation of summarize(). By default, variables are summarized using functions that aggregate the data across all rows. The resulting summary value is entered as column specified by the named we provided. Of course, however, we may not want to aggregate all of the data as a single group. Rather, we may wish to summarize by grouping data based on subgroups, factors, etc.

Aggregating Data

Data aggregation refers to how data are organized or compiled. When you have multiple levels of data as presented in the example above, data aggregation refers to the process of taking the raw data and gathering it in some way to summarize for statistical analysis or data visualization. For example, you could summarize data by individual, class, school, district, county, and state. Alternatively, you could summarize data by district, county, and state and ignore individuals, class, and school. The two types of data aggregation will result in different numbers of cases (rows) in a data frame most obviously but also in the actual numbers.

The variability in individuals as well as in subgroups can be useful to answering your data question. Models like linear mixed models are beyond the scope of this course, however. Whether you look into those types of model or you don’t, you should realize that you are making a big assumption in your data when you aggregate at any level. That assumption is that the variability in the data that are aggregated is the same across all cases in the data. Specifically, by aggregating Tilford’s and Mildred’s individual data in order to obtain their average performance, you have lost the variability at the individual level and essentially arbitrarily set that variability to 0. At higher level, aggregating the data by School also means that you lose this information and make the assumption that there is no variability among individuals in each school (e.g., all individuals have the same value). Such assumptions are particularly relevant to repeated-measures models (e.g., repeated-measures ANOVA does not use this information) and are potentially not assumptions you want to make about your data.

In the end, how you interpret a model or a data visualization will be dependent on that aggregation of the data, modeled or displayed. The take home message about aggregating data, then, is that you trade off a loss of information for some summary. Only you can weigh the costs and benefits of this trade off. That said, sometimes you have no choice because your client, your employer, or your data resource has only data an an aggregate level. In those situations, you cannot do much except acknowledge the information loss in the ability of the model to predict an outcome. If, however, that aggregated level can still be summarized or aggregated in some way, you are back to reflecting on issues of aggregation.

Aggregating Data using group_by()

{dplyr} has one main verb function for grouping data. group_by() orders the groups in an ascending order so that functions performed on the subgroups will return a data frame that is ordered based on the grouping variable(s). Other functions like group_map(), group_nest(), group_split(), and group_trim() exist but they are experimental for {dplyr} rpackageVersion(“dplyr”)`. Although they allow for fun tasks like passing data subsets to statistical models, their experimental life cycle suggests you might not get to comfortable with them.

group_by() is used to group data frames by one or more variables.

group_by(.data, 
         ..., 
         .add = FALSE, 
         .drop = group_by_drop_default(.data)
         )

Key Parameters/Arguments:

  • .data: a data frame
  • ...: variables or computations to group by.

The .add and .drop parameters are special cases of the function and will be addressed later. For now, let’s look at the basic functionality of group_by() used prior to summarize().

Data aggregation using group_by()

Let’s say we want to group the data by School and then summarize. The returned data frame now contains two columns, one for the grouping variable followed by a second for the specification in summarize(). Importantly, the grouping variables will always be organized in the data frame to the left of variables created in the summary.

DATA |>
  group_by(School) |>
  summarize(Score = mean(Score))   # get the school level aggregated data
# A tibble: 5 × 2
  School Score
  <chr>  <dbl>
1 A       82.5
2 B       79.5
3 C       87  
4 D       62.5
5 E       66  

Let’s say we try to determine mean performance for “all students” if we had the data aggregated as the school level and not the individual level. As a reminder, the mean across all rows in the data frame was rmean(DATA$Score, na.rm = T)`.

If we want to summarize the data frame that was summarized, we might assume that we can pipe the data to another add another summarize().

DATA |>
  group_by(School) |>                # group by school
  summarize(Score = mean(Score)) |>  # get the school level aggregated data
  summarize(Score = mean(Score))     # then calculate average across all schools
# A tibble: 1 × 1
  Score
  <dbl>
1  75.5

Note: If you noticed and wonder why the mean value returned here differs from rmean(DATA$Score, na.rm = T)`, make a mental note. We will address that later.

Let’s say we try to determine average performance for “all students” if we had the data aggregated as the district level and not the individual level.

DATA |>
  group_by(District) |>                # group by district
  summarize(Score = mean(Score)) |>    # get the district level aggregated data
  summarize(Score = mean(Score))       # then calculate average across all districts
# A tibble: 1 × 1
  Score
  <dbl>
1  77.4

So what’s the average performance for students? Well, that interpretation differs based on how values are treated. At the highest level, all students’ Scores are weighed equally. Each student contributes to the data in the same way. When aggregated by School, each school contributes to the calculation equally, even if the number of bodies per school differs. And finally, when data are aggregated at the District level, all districts are treated equally in the calculation independent on the number of schools in a district. Only with weighting means would you end up with the same average performance.

Measures of variability in the data, however, reveal something else. Schools differ in variability and schools within districts vary as well. Depending on the level of aggregation, you may never notice interesting patterns in data that lead to questions to investigate and later policies to change. For example, addressing a school that is left behind others or a district that is left behind others. Aggregation can also lead to inefficient allocations of resources. If one school in a district needs help rather than all schools in the district needing help, the cost may differ substantially.

std_error <- function(x) { sd(na.omit(x)) / sqrt(length(na.omit(x))) }

Use std_error() in summary:

DATA |>
  group_by(School, District) |>
  summarize(Var = var(Score),
            SD  = sd(Score),
            SEM = std_error(Score),
            min = min(Score),
            max = max(Score)
  ) |> mutate(range = max - min)
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
# A tibble: 5 × 8
# Groups:   School [5]
  School District   Var    SD   SEM   min   max range
  <chr>  <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A      West      12.5 3.54   2.5     80    85     5
2 B      West       0.5 0.707  0.5     79    80     1
3 C      East      28   5.29   3.06    81    91    10
4 D      North     12.5 3.54   2.5     60    65     5
5 E      North      2   1.41   1       65    67     2

Comparing mutate() with summarize()

{dplyr} has two main verb functions for creating variables (e.g., mutate() and summarize()). They operate differently and their returned tibble also differ. If you have not been paying attention to the returned tibble post calculation, you might have overlooked the differences.

Grouping structure of group_by() and mutate()

Let’s mutate() the mean for Score after sub-setting with group_by().

DATA |>
  group_by(School, District) |>
  mutate(Score = mean(na.omit(Score)))
# A tibble: 11 × 4
# Groups:   School, District [5]
   Student      Score School District
   <chr>        <dbl> <chr>  <chr>   
 1 Jim           82.5 A      West    
 2 Sally         82.5 A      West    
 3 June          79.5 B      West    
 4 Mildred       79.5 B      West    
 5 Tilford       87   C      East    
 6 Beavis        87   C      East    
 7 Herman        87   C      East    
 8 Peppa         62.5 D      North   
 9 Kay           62.5 D      North   
10 Jake          66   E      North   
11 Name Missing  66   E      North   

There are two things to watch when using this mutate() following group_by().

  • 1. A Score will be assigned to each row/case or Student in the data frame.

    • When schools and districts are grouped, each student in the same school will have the same assigned average value. All rows are maintained, none dropped.
  • 2. The returned tibble takes on a new structure.

    • Looking at the feedback displayed in the console, you see the following report preceding the data.
A tibble: 11 × 3
# Groups:   School, District [5]

You also see this structure using glimpse().

DATA |>
  group_by(School, District) |>
  mutate(Score = mean(na.omit(Score))) |>
  glimpse()
Rows: 11
Columns: 4
Groups: School, District [5]
$ Student  <chr> "Jim", "Sally", "June", "Mildred", "Tilford", "Beavis", "Herm…
$ Score    <dbl> 82.5, 82.5, 79.5, 79.5, 87.0, 87.0, 87.0, 62.5, 62.5, 66.0, 6…
$ School   <chr> "A", "A", "B", "B", "C", "C", "C", "D", "D", "E", "E"
$ District <chr> "West", "West", "West", "West", "East", "East", "East", "Nort…
Rows: 11
Columns: 3
Groups: School, District [5]

What does the [5] mean? Well, before we answer this, let’s use summarize().

Grouping structure of group_by() and summarize()

Let’s summarize() the mean for Score after sub-setting with group_by().

DATA |>
  group_by(School, District) |>
  summarize(Score = mean(Score))
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
# A tibble: 5 × 3
# Groups:   School [5]
  School District Score
  <chr>  <chr>    <dbl>
1 A      West      82.5
2 B      West      79.5
3 C      East      87  
4 D      North     62.5
5 E      North     66  

There are two things to watch when using this summarize() following group_by().

1. A Score average is assigned to each row/case or School in this aggregated data frame.

When schools and districts are grouped, each school will have its own average value. Rows from the data frame are dropped as are columns that are not passed to group_by(). In this case, School, District, and the new variable, Score are returned.

2. The returned tibble takes on a new structure.

The feedback in the console is a little more detailed here.

`summarise()` has grouped output by 'School'. You can override using the `.groups` argument.
# A tibble: 5 × 3
# Groups:   School [5]

You see reference to overriding the grouping using .groups. According to the tidyverse documentation for grouping, this parameter “controls the grouping structure of the output. The historical behaviour of removing the right hand side grouping variable corresponds to .groups = "drop_last" without a message or .groups = NULL with a message (the default)”.

You have most likely paid little attention to this message. By default summarize() keeps the first grouping variable passed to group_by() in the returned tibble. This is why you see School referenced and not District or both variables. So, do you want your data frame to contain groups or no groups? Stated differently, do you just want that data summarized by your grouping variables and have a simple nxm data frame or do you want something more complex?

To see the structure better, let’s first look a the structure of DATA and then more closely at the summarized version.

str(DATA)
'data.frame':   11 obs. of  4 variables:
 $ Student : chr  "Jim" "Sally" "June" "Mildred" ...
 $ Score   : num  80 85 79 80 81 91 89 60 65 67 ...
 $ School  : chr  "A" "A" "B" "B" ...
 $ District: chr  "West" "West" "West" "West" ...

Let’s assign the summarized data frame to an object for inspection.

DATA_SUM <- 
  DATA |>
  group_by(School, District) |>
  summarize(Score = mean(Score))
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.

For DATA_SUM, the structure is different. In particular, you see reference to (S3: grouped_df/tbl_df/tbl/data.frame). This tells you that the data frame is not a simple nxm but contains groups. You could think of this as 3 nxm data frames organized together. For more details, see the dplyr documentation.

str(DATA_SUM)
gropd_df [5 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
 $ School  : chr [1:5] "A" "B" "C" "D" ...
 $ District: chr [1:5] "West" "West" "East" "North" ...
 $ Score   : num [1:5] 82.5 79.5 87 62.5 66
 - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
  ..$ School: chr [1:5] "A" "B" "C" "D" ...
  ..$ .rows : list<int> [1:5] 
  .. ..$ : int 1
  .. ..$ : int 2
  .. ..$ : int 3
  .. ..$ : int 4
  .. ..$ : int 5
  .. ..@ ptype: int(0) 
  ..- attr(*, ".drop")= logi TRUE

group_by() and summarize() with .groups

When you group data using group_by() and then summarize(), the summary variables (e.g., mean) will result in a single row for each level of a single grouping variable. If there are more than one grouping variable, the additional grouping variable will be introduced to the data frame as a second column. The total number of rows in the data frame will be equal to the number of levels of group 1 x number of levels of group 2 if an only if each levels of group 1 has a corresponding level for group 2,

for example:

sex   age    mean
men   young  x
men   old    x
women young  x
women old    x

If there is no pairing of levels in the data (e.g., no men who are old), that row will be omitted from the returned data frame.

for example:

sex   age    mean
men   young  x
women young  x
women old    x

rather than:

sex   age    mean
men   young  x
men   old    NA
women young  x
women old    x

You really need to query ?dplyr::summarize to understand .groups, which controls the grouping of the returned data frame. This is also experimental to summarize(), so it might not be available in the future.

Parameters/Arguments:

There are four argument options:

  • "drop_last": dropping the last level of grouping. This was the only supported option before version 1.0.0.
  • "drop": All levels of grouping are dropped.
  • "keep": Same grouping structure as .data.
  • "rowwise": Each row is its own group.

Let’s work through them individually and see how they work.

.groups = "drop_last":

DATA |>
  group_by(School, District) |>
  summarize(Score = mean(Score),
            .groups = "drop_last"
            )
# A tibble: 5 × 3
# Groups:   School [5]
  School District Score
  <chr>  <chr>    <dbl>
1 A      West      82.5
2 B      West      79.5
3 C      East      87  
4 D      North     62.5
5 E      North     66  

You will see that the grouping for District is not in the grouping structure because it was the last grouping.

# A tibble: 5 × 3
# Groups:   School [5]
  School   District   Score

.groups = "drop":

DATA |>
  group_by(School, District) |>
  summarize(Score = mean(Score),
            .groups = "drop"
            )
# A tibble: 5 × 3
  School District Score
  <chr>  <chr>    <dbl>
1 A      West      82.5
2 B      West      79.5
3 C      East      87  
4 D      North     62.5
5 E      North     66  

You will see that all groups are dropped from the grouping structure.

# A tibble: 5 × 3
  School District Score

.groups = "keep":

DATA |>
  group_by(School, District) |>
  summarize(Score = mean(Score),
            .groups = "keep"
            )
# A tibble: 5 × 3
# Groups:   School, District [5]
  School District Score
  <chr>  <chr>    <dbl>
1 A      West      82.5
2 B      West      79.5
3 C      East      87  
4 D      North     62.5
5 E      North     66  

You will see that all groups are kept in the grouping structure.

# A tibble: 5 × 3
# Groups:   School, District [5]
  School District Score

.groups = "rowwise":

DATA |>
  group_by(School, District) |>
  summarize(Score = mean(Score),
            .groups = "rowwise"
            )
# A tibble: 5 × 3
# Rowwise:  School, District
  School District Score
  <chr>  <chr>    <dbl>
1 A      West      82.5
2 B      West      79.5
3 C      East      87  
4 D      North     62.5
5 E      North     66  

You will see that there are groups but they are not based on group_by().

# A tibble: 5 × 3
# Rowwise:  School, District
  School District Score

The grouping structure always matters for subsequent computations. But with "rowwise" you will see this clearly. Following from above, let’s say we wanted to compute the mean across all of the Schools in the data frame returned by group_by() then summarize(). We add mutate(mean = mean(Score)) on a new line of our piped code block.

DATA |>
  group_by(School, District) |>
  summarize(Score = mean(Score),
            .groups = "rowwise"
            ) |>
  mutate(mean = mean(Score))
# A tibble: 5 × 4
# Rowwise:  School, District
  School District Score  mean
  <chr>  <chr>    <dbl> <dbl>
1 A      West      82.5  82.5
2 B      West      79.5  79.5
3 C      East      87    87  
4 D      North     62.5  62.5
5 E      North     66    66  

The mean column does not contain a single mean replicated for each row in the data frame. Rather the grouping was per row, so each row has its one mean. The mean of a single value is, of course, itself.

Revisiting ".groups = drop_last":

Returning to the default .groups structure, which is "drop_last", we can add the same mutate() to see what happens.

DATA |>
  group_by(School, District) |>
  summarize(Score = mean(Score),
            .groups = "drop_last"
            ) |>
  mutate(mean = mean(Score))
# A tibble: 5 × 4
# Groups:   School [5]
  School District Score  mean
  <chr>  <chr>    <dbl> <dbl>
1 A      West      82.5  82.5
2 B      West      79.5  79.5
3 C      East      87    87  
4 D      North     62.5  62.5
5 E      North     66    66  

The mean column still has different values but they are replicated on rows with the same School group. This is because the data are grouped that way.

Revisiting .groups = "drop":

Let’s again add the same mutate() to see what happens.

DATA |>
  group_by(School, District) |>
  summarize(Score = mean(Score),
            .groups = "drop"
            ) |>
  mutate(mean = mean(Score))
# A tibble: 5 × 4
  School District Score  mean
  <chr>  <chr>    <dbl> <dbl>
1 A      West      82.5  75.5
2 B      West      79.5  75.5
3 C      East      87    75.5
4 D      North     62.5  75.5
5 E      North     66    75.5

Consider ungroup():

.groups = "drop" in effect works the same as does ungroup(). The grouping structure is broken and all subsequent operations are based on the ungrouped data frame.

DATA |>
  group_by(School, District) |>
  summarize(Score = mean(Score),
            .groups = "drop_last"
            ) |>
  ungroup() |>
  mutate(mean = mean(Score))
# A tibble: 5 × 4
  School District Score  mean
  <chr>  <chr>    <dbl> <dbl>
1 A      West      82.5  75.5
2 B      West      79.5  75.5
3 C      East      87    75.5
4 D      North     62.5  75.5
5 E      North     66    75.5

Keep in mind that functions work as they are programmed to work. Functions do not work like you think they work. You must understand the function and check our work to ensure your calculations are what you intend them to be.

The order of operations matters

To illustrate further, consider you want to calculate some summary statistics. You set out to obtain the mean and the standard deviation for your data. Those computations will be performed according to the grouping structure.

When you compute standard deviation and the mean of Score, you assign the mean to Score because you want your plot to contain a clean name rather than one like Score_Mean that you will have to address in the plot.

DATA |>
  group_by(School, District) |>
  summarize(Score_sd = sd(Score),   # get sd 
            Score = mean(Score)     # get mean, assign to same name
            ) 
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
# A tibble: 5 × 4
# Groups:   School [5]
  School District Score_sd Score
  <chr>  <chr>       <dbl> <dbl>
1 A      West        3.54   82.5
2 B      West        0.707  79.5
3 C      East        5.29   87  
4 D      North       3.54   62.5
5 E      North       1.41   66  

Both variables use the same data because the standard deviation assigns the value to a different variable name Score_sd. The mean() is not based on some changed variable.

The output is different from the one returned when the mean is computed before the standard deviation and in particular when the mean is assigned to Score. In this case, the standard deviation is based on this new Score variable. Because the standard deviation is a measure of variability, and Score does not vary based on the grouping structure, NA is returned.

DATA |>
  group_by(School, District) |>
  summarize(Score = mean(Score),   # get mean, assign to same name
            Score_sd = sd(Score)   # get sd         
            )
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
# A tibble: 5 × 4
# Groups:   School [5]
  School District Score Score_sd
  <chr>  <chr>    <dbl>    <dbl>
1 A      West      82.5       NA
2 B      West      79.5       NA
3 C      East      87         NA
4 D      North     62.5       NA
5 E      North     66         NA

If you really wanted the standard deviation of all the means, consider ungrouping and then compute the standard deviation or use .groups = "drop" in summarize(). Realize, however, this latter functionality is experimental and may not work sometime later. You are likely better off using ungroup().

DATA |>
  group_by(School, District) |>
  summarize(Score = mean(Score)) |>  # get mean, assign to same name
  ungroup() |>
  mutate(sd = sd(Score))
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
# A tibble: 5 × 4
  School District Score    sd
  <chr>  <chr>    <dbl> <dbl>
1 A      West      82.5  10.7
2 B      West      79.5  10.7
3 C      East      87    10.7
4 D      North     62.5  10.7
5 E      North     66    10.7

More on Grouping Structure

We can now also consider ways to create column variables and a corresponding grouping structure within the group_by() verb.

New Variable Columns

group_by() is not designed to create new variables but rather create groups. The function, however, does not require variables in a data frame in order to group based on their levels or differences in values.

group_by() using a function

Heretofore, we have grouped by column variables in a data frame. But you can also group in other ways, for example, by a variable calculated by a function. For example, if you wanted to group by standard deviations, you could calculated the standard deviation (Score-mean(Score)) / sd(Score) for each student and then cut() that variable into groups. cut will take a numeric vector variable and turn it into a factor variable of n groups as determined by what you pass to breaks. There is also an argument to make the factor ordered if you wish, ordered_result = TRUE but the default behavior does not order the factor. You can also change the level labels if you inspect the function.

DATA |>
  group_by(z_factor = cut( 
    x = ((Score - mean(Score)) / sd(Score) ), 
    breaks = 5, 
    ordered_result = T
    )
  )
# A tibble: 11 × 5
# Groups:   z_factor [4]
   Student      Score School District z_factor       
   <chr>        <dbl> <chr>  <chr>    <ord>          
 1 Jim             80 A      West     (0.194,0.781]  
 2 Sally           85 A      West     (0.781,1.37]   
 3 June            79 B      West     (0.194,0.781]  
 4 Mildred         80 B      West     (0.194,0.781]  
 5 Tilford         81 C      East     (0.194,0.781]  
 6 Beavis          91 C      East     (0.781,1.37]   
 7 Herman          89 C      East     (0.781,1.37]   
 8 Peppa           60 D      North    (-1.57,-0.979] 
 9 Kay             65 D      North    (-1.57,-0.979] 
10 Jake            67 E      North    (-0.979,-0.392]
11 Name Missing    65 E      North    (-1.57,-0.979] 

You can see that there are 4 different levels of the grouping variable. The tibble is grouped of course too.

If you struggle with remembering the formula for a z score, scale() will to the same thing.

DATA |>
  mutate(z = (Score - mean(Score))/sd(Score),
         scale = scale(Score)
         )
        Student Score School District          z      scale
1           Jim    80      A     West  0.3269018  0.3269018
2         Sally    85      A     West  0.8000492  0.8000492
3          June    79      B     West  0.2322724  0.2322724
4       Mildred    80      B     West  0.3269018  0.3269018
5       Tilford    81      C     East  0.4215313  0.4215313
6        Beavis    91      C     East  1.3678261  1.3678261
7        Herman    89      C     East  1.1785671  1.1785671
8         Peppa    60      D    North -1.5656877 -1.5656877
9           Kay    65      D    North -1.0925403 -1.0925403
10         Jake    67      E    North -0.9032814 -0.9032814
11 Name Missing    65      E    North -1.0925403 -1.0925403

If you want n groups based on specific breaks, then pass a vector of those break units. For example, if we want levels to correspond to some meaningful standard-deviation cuts. For example, if you wanted to group those with z scores ranging from infinitely negative to -2, -2 to -1, - to 1, 1 to 2, and 2 to infinitely large, you could specify the breaks. If there are no values in those ranges, then there won’t be any data for those breaks.

DATA |>
  group_by(z_factor = cut( scale(Score), 
                          breaks = c(-Inf, -2, -1, 1, 2, Inf),
                          ordered_result = T
                          ), 
           ) |>
  ungroup() 
# A tibble: 11 × 5
   Student      Score School District z_factor
   <chr>        <dbl> <chr>  <chr>    <ord>   
 1 Jim             80 A      West     (-1,1]  
 2 Sally           85 A      West     (-1,1]  
 3 June            79 B      West     (-1,1]  
 4 Mildred         80 B      West     (-1,1]  
 5 Tilford         81 C      East     (-1,1]  
 6 Beavis          91 C      East     (1,2]   
 7 Herman          89 C      East     (1,2]   
 8 Peppa           60 D      North    (-2,-1] 
 9 Kay             65 D      North    (-2,-1] 
10 Jake            67 E      North    (-1,1]  
11 Name Missing    65 E      North    (-2,-1] 

You can also just group by the function without using cut() if you wish to group by those with identical values on the variable. In this case, using count() or tally() reveals there are only two instances with the same score.

DATA |>
  group_by(z_factor = ((Score - mean(Score)) / sd(Score) )) |>
  count(sort = TRUE)     # tally(sort = TRUE)
# A tibble: 9 × 2
# Groups:   z_factor [9]
  z_factor     n
     <dbl> <int>
1   -1.09      2
2    0.327     2
3   -1.57      1
4   -0.903     1
5    0.232     1
6    0.422     1
7    0.800     1
8    1.18      1
9    1.37      1

Add/drop Grouping Variable Columns

By default, group_by() on a data frame that is already grouped (see earlier on grouped data frame), the existing grouping structure will be replaced by new grouping structure.

Let’s get an example. Please note that this example assigned the tibble to an object but whether you assign or not, the grouped structure exists. So functions that follow the group_by() keep that structure.

(school_grouped <- 
  DATA |>
  group_by(School) 
)
# A tibble: 11 × 4
# Groups:   School [5]
   Student      Score School District
   <chr>        <dbl> <chr>  <chr>   
 1 Jim             80 A      West    
 2 Sally           85 A      West    
 3 June            79 B      West    
 4 Mildred         80 B      West    
 5 Tilford         81 C      East    
 6 Beavis          91 C      East    
 7 Herman          89 C      East    
 8 Peppa           60 D      North   
 9 Kay             65 D      North   
10 Jake            67 E      North   
11 Name Missing    65 E      North   

Now group by a new column:

school_grouped |>
  group_by(District)
# A tibble: 11 × 4
# Groups:   District [3]
   Student      Score School District
   <chr>        <dbl> <chr>  <chr>   
 1 Jim             80 A      West    
 2 Sally           85 A      West    
 3 June            79 B      West    
 4 Mildred         80 B      West    
 5 Tilford         81 C      East    
 6 Beavis          91 C      East    
 7 Herman          89 C      East    
 8 Peppa           60 D      North   
 9 Kay             65 D      North   
10 Jake            67 E      North   
11 Name Missing    65 E      North   

You will see the grouping structure has changed.

# A tibble: 11 × 4
# Groups:   District [3]

Retain Grouping

If you want to retain the existing group, you would need to use .add = TRUE.

school_grouped |>
  group_by(District, .add = TRUE)
# A tibble: 11 × 4
# Groups:   School, District [5]
   Student      Score School District
   <chr>        <dbl> <chr>  <chr>   
 1 Jim             80 A      West    
 2 Sally           85 A      West    
 3 June            79 B      West    
 4 Mildred         80 B      West    
 5 Tilford         81 C      East    
 6 Beavis          91 C      East    
 7 Herman          89 C      East    
 8 Peppa           60 D      North   
 9 Kay             65 D      North   
10 Jake            67 E      North   
11 Name Missing    65 E      North   

Notice in the console, that both School and District are included in the groups.

# A tibble: 11 × 4
# Groups:   School, District [5]

Why Care About Grouping Structure

Well, the functions you apply to a grouped tibble will sometimes lead to calculations that are not what you intend. We provided some examples earlier but given the importance of the issue, we may benefit from another example.

Let’s say you want to compute the average for each school and add that school average for each student. This would tell you how the student differs from their school performance. Then you want to obtain the mean of all the schools and see how the school differs from all the schools.

You code it out:

DATA |>
  # group 
  group_by(School) |>
  # calculate the mean of Score for each School
  mutate(School_Mean = mean(Score)) |>
  # calculate the mean of all values (think Grand Mean from stats)
  mutate(Mean_of_all = mean(School_Mean)) |>
  # calculate the school performance relative to all
  mutate(School_Performance = factor(
    case_when(
      School_Mean < Mean_of_all ~ "Below Average",
      School_Mean == Mean_of_all ~ "Average",
      School_Mean > Mean_of_all ~ "Above Average"
      ), ordered = T)
  )
# A tibble: 11 × 7
# Groups:   School [5]
   Student      Score School District School_Mean Mean_of_all School_Performance
   <chr>        <dbl> <chr>  <chr>          <dbl>       <dbl> <ord>             
 1 Jim             80 A      West            82.5        82.5 Average           
 2 Sally           85 A      West            82.5        82.5 Average           
 3 June            79 B      West            79.5        79.5 Average           
 4 Mildred         80 B      West            79.5        79.5 Average           
 5 Tilford         81 C      East            87          87   Average           
 6 Beavis          91 C      East            87          87   Average           
 7 Herman          89 C      East            87          87   Average           
 8 Peppa           60 D      North           62.5        62.5 Average           
 9 Kay             65 D      North           62.5        62.5 Average           
10 Jake            67 E      North           66          66   Average           
11 Name Missing    65 E      North           66          66   Average           

Perfect. OK, let’s plot it. Oh but first let me inspect the data. Um, why is School_Performance the same for all Students? Schools are not performing the same so they cannot all be average. Check your case_when() for errors because that where the new variable was created. Then you spend 40 days and 40 nights trying to fix your code. No matter what you do with case_when(), you cannot fix the problem. So you try to create 42 data frames to solve your problem. Even grandma knows that is a ridiculous strategy. She suggests you read the documentation for all of the functions you used because one time her pot-luck cake flopped and she inspected the expiration date for all of her ingredients and found the baking soda was old.

You discover that the grouping structure is retained on all operations until that grouping structure is removed or replaced.

ungroup()

Whereas summarize has a .groups functionality, mutate() does not. In order to have a simple data frame use follow group_by() and mutate() with ungroup(). But ungroup() can be used following both functions if that approach makes more sense to you.

Let’s add an ungroup() directly after performing the operation that needed the grouping structure.

DATA |>
  # group 
  group_by(School) |>
  # calculate the mean of Score for each School
  mutate(School_Mean = mean(Score)) |>  
  # break that grouping structure
  ungroup() |>
  # calculate the mean of all values (think Grand Mean from stats)
  mutate(Mean_of_all = mean(School_Mean)) |>
  # calculate the school performance relative to all
  mutate(School_Performance = factor(
    case_when(
      School_Mean < Mean_of_all ~ "Below Average",
      School_Mean == Mean_of_all ~ "Average",
      School_Mean > Mean_of_all ~ "Above Average"
      ), ordered = T)
  )
# A tibble: 11 × 7
   Student      Score School District School_Mean Mean_of_all School_Performance
   <chr>        <dbl> <chr>  <chr>          <dbl>       <dbl> <ord>             
 1 Jim             80 A      West            82.5        76.5 Above Average     
 2 Sally           85 A      West            82.5        76.5 Above Average     
 3 June            79 B      West            79.5        76.5 Above Average     
 4 Mildred         80 B      West            79.5        76.5 Above Average     
 5 Tilford         81 C      East            87          76.5 Above Average     
 6 Beavis          91 C      East            87          76.5 Above Average     
 7 Herman          89 C      East            87          76.5 Above Average     
 8 Peppa           60 D      North           62.5        76.5 Below Average     
 9 Kay             65 D      North           62.5        76.5 Below Average     
10 Jake            67 E      North           66          76.5 Below Average     
11 Name Missing    65 E      North           66          76.5 Below Average     

Yes! Mean_of_all is now a true grand mean based on all of the data with no grouping at all. Comparing Mean_of_all to School_Mean provides two classifications.

Variable loss with summarize()

Because summarize() returns a tibble based on the grouping structure, you will lose all variables that are not in the grouping structure. By design, it no longer allows you to return a data frame with duplicated rows based on the variables passed to group_by().

Returning to the school data, what this means, is that you cannot retain the School if you group by the District in order to obtain averages by district.

DATA |>
  group_by(District) |>
  summarize(Mean = mean(Score))
# A tibble: 3 × 2
  District  Mean
  <chr>    <dbl>
1 East      87  
2 North     64.2
3 West      81  

If you try to add School to group_by(), then you will only obtain the averages by schools within districts.

DATA |>
  group_by(School, District) |>
  summarize(Mean = mean(Score)) 
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
# A tibble: 5 × 3
# Groups:   School [5]
  School District  Mean
  <chr>  <chr>    <dbl>
1 A      West      82.5
2 B      West      79.5
3 C      East      87  
4 D      North     62.5
5 E      North     66  

You can try to summarize based on a variable without assign it to a function but that is no longer allowed with summarize(). The following code block will throw an error.

DATA |>
  group_by(District) |>
  summarize(Mean = mean(Score),
          School               # add school 
          )

An alternative to summarize(): reframe()

Instead, you can use reframe(), which allows for previous functionality of summarize() that is now deprecated.

DATA |>
  group_by(District) |>
  reframe(Mean = mean(Score),
          School,
          Student
          )
# A tibble: 11 × 4
   District  Mean School Student     
   <chr>    <dbl> <chr>  <chr>       
 1 East      87   C      Tilford     
 2 East      87   C      Beavis      
 3 East      87   C      Herman      
 4 North     64.2 D      Peppa       
 5 North     64.2 D      Kay         
 6 North     64.2 E      Jake        
 7 North     64.2 E      Name Missing
 8 West      81   A      Jim         
 9 West      81   A      Sally       
10 West      81   B      June        
11 West      81   B      Mildred     

Remember that the operations in summarize(), mutate(), and reframe() all depend on the grouping structure. If you wanted to add variables, you would need to group a different way and you would need to add all variables into summarize() every step of the way.

DATA |>
  group_by(District) |>
  reframe(District_Mean = mean(Score),
          School, 
          Student,
          Score
          ) |>
  ungroup() |>
  group_by(School) %>%
  reframe(School_Mean = mean(Score),
          School, 
          Student,
          Score,
          District_Mean
          ) |>
  ungroup()
# A tibble: 11 × 5
   School School_Mean Student      Score District_Mean
   <chr>        <dbl> <chr>        <dbl>         <dbl>
 1 A             82.5 Jim             80          81  
 2 A             82.5 Sally           85          81  
 3 B             79.5 June            79          81  
 4 B             79.5 Mildred         80          81  
 5 C             87   Tilford         81          87  
 6 C             87   Beavis          91          87  
 7 C             87   Herman          89          87  
 8 D             62.5 Peppa           60          64.2
 9 D             62.5 Kay             65          64.2
10 E             66   Jake            67          64.2
11 E             66   Name Missing    65          64.2

Retaining distinct data after mutate() using distinct()

This is just too tedious. Your better option is mutate() as all variables will be added to the full data frame. All variables are neatly and logically appended to the right hand side of the data frame.

DATA |>
  group_by(District) |>
  mutate(District_Mean = mean(Score)) |>
  group_by(School) |>
  mutate(School_Mean = mean(Score))
# A tibble: 11 × 6
# Groups:   School [5]
   Student      Score School District District_Mean School_Mean
   <chr>        <dbl> <chr>  <chr>            <dbl>       <dbl>
 1 Jim             80 A      West              81          82.5
 2 Sally           85 A      West              81          82.5
 3 June            79 B      West              81          79.5
 4 Mildred         80 B      West              81          79.5
 5 Tilford         81 C      East              87          87  
 6 Beavis          91 C      East              87          87  
 7 Herman          89 C      East              87          87  
 8 Peppa           60 D      North             64.2        62.5
 9 Kay             65 D      North             64.2        62.5
10 Jake            67 E      North             64.2        66  
11 Name Missing    65 E      North             64.2        66  

Also, when you use a new group_by(), it will replace previous grouping structure by default. The following two code blocks return the same data frame.

DATA |>
  group_by(District) |>
  mutate(District_Mean = mean(Score)) |>
  group_by(School) |>
  mutate(School_Mean = mean(Score))

With ungroup():

DATA |>
  group_by(District) |>
  mutate(District_Mean = mean(Score)) |>
  ungroup() |>                            # ungroup
  group_by(School) |>
  mutate(School_Mean = mean(Score))

If you wish those variables to remain, you have a couple of options. The first would be to use mutate().

Notice, however, that if you want to extract the summaries from the large data frame as you would have had you used summarize(), you cannot just select() your column of interest because the repetitions will exist for each row in the data frame.

DATA |>
  group_by(District) |>
  mutate(District_Mean = mean(Score)) |>
  select(c("School", "District_Mean"))    # if passing a vector, use any_of() or all_of())
Adding missing grouping variables: `District`
# A tibble: 11 × 3
# Groups:   District [3]
   District School District_Mean
   <chr>    <chr>          <dbl>
 1 West     A               81  
 2 West     A               81  
 3 West     B               81  
 4 West     B               81  
 5 East     C               87  
 6 East     C               87  
 7 East     C               87  
 8 North    D               64.2
 9 North    D               64.2
10 North    E               64.2
11 North    E               64.2

You can obtain the rows that are distinct() (unique) and then pick() the ones to keep.

DATA |>
  group_by(District) |>
  mutate(District_Mean = mean(Score)) |>
  distinct(pick(c("District", "District_Mean")))
# A tibble: 3 × 2
# Groups:   District [3]
  District District_Mean
  <chr>            <dbl>
1 West              81  
2 East              87  
3 North             64.2

But you will need to ungroup() before distinct() because, as we have mentioned before, all subsequent functions other than group_by() will maintain the previous grouping structure by default.

DATA |>
  group_by(School) |>
  mutate(School_Mean = mean(Score)) |>
  ungroup()
# A tibble: 11 × 5
   Student      Score School District School_Mean
   <chr>        <dbl> <chr>  <chr>          <dbl>
 1 Jim             80 A      West            82.5
 2 Sally           85 A      West            82.5
 3 June            79 B      West            79.5
 4 Mildred         80 B      West            79.5
 5 Tilford         81 C      East            87  
 6 Beavis          91 C      East            87  
 7 Herman          89 C      East            87  
 8 Peppa           60 D      North           62.5
 9 Kay             65 D      North           62.5
10 Jake            67 E      North           66  
11 Name Missing    65 E      North           66  

Summary

How data are aggregated will affect both summary statistics, the models that you can run, and the data visualizations that you can create to represent the data. With each level of aggregation, fine-grained or case-level data are lost. That fine-grained detail may be computed with measures like variability and confidence levels but those measures are summaries themselves. Including them is not the same as including all data. That said, including all data may influence model outcome and statistical inference. The data will also influence the interpretation of plots to the extent that outlying points or grouping characteristics (point proximity) drive attention to physical locations of plots at the cost of attending to other locations.

A main workhorse of adding variables and creating aggregated data in the R language is the {dplyr} library. You must, however, be sensitive to the way its functions work so that the variables you create in either a full or summarized data frame represent your intentions and goals.

We have seen that a lot of customization can be applied to our visualizations of data and have shown just a little bit of what can be done. Many more functions and parameters can be used to customize to greater extents.

Advanced Summarizing Techniques

For more advanced summarizing techniques, see this module on advanced summarizing techniques.

Session Info

R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] dplyr_1.1.4

loaded via a namespace (and not attached):
 [1] bit_4.0.5           jsonlite_1.8.7      crayon_1.5.2       
 [4] compiler_4.3.2      BiocManager_1.30.22 renv_1.0.3         
 [7] tidyselect_1.2.0    stringr_1.5.1       parallel_4.3.2     
[10] yaml_2.3.7          fastmap_1.1.1       here_1.0.1         
[13] readr_2.1.4         R6_2.5.1            generics_0.1.3     
[16] curl_5.2.0          knitr_1.45          htmlwidgets_1.6.4  
[19] tibble_3.2.1        rprojroot_2.0.3     pillar_1.9.0       
[22] tzdb_0.4.0          R.utils_2.12.2      rlang_1.1.1        
[25] utf8_1.2.4          stringi_1.7.12      xfun_0.40          
[28] bit64_4.0.5         cli_3.6.1           withr_2.5.0        
[31] magrittr_2.0.3      digest_0.6.33       vroom_1.6.5        
[34] rstudioapi_0.15.0   hms_1.1.3           lifecycle_1.0.3    
[37] R.methodsS3_1.8.2   R.oo_1.25.0         vctrs_0.6.5        
[40] evaluate_0.21       glue_1.6.2          fansi_1.0.4        
[43] rmarkdown_2.25      tools_4.3.2         pkgconfig_2.0.3    
[46] htmltools_0.5.7