source("https://raw.githubusercontent.com/slicesofdata/fods24/main/r/functions/view_html.R")
Summarizing Data
Overview
This module describes how to perform statistical transformations on data, including producing data summaries. In order to summarize data, we will again use {dplyr}. The way data are summarized will influence the way the data can be modeled as statistical models typically rely on understanding how predictor variables (independent) can account for variability in outcome variables (dependent). Eliminating variability in data by aggregating will affect the model you can use as well as the statistical inference draw from that statistical model. We will also discuss differences in grouping structure for dplyr::mutate()
and dplyr::summarize()
that will affect both variable computation and plot creation.
Readings and Preparation
Before Class: First, read to familiarize yourself with the concepts rather than master them. I will assume that you attend class with some level of basic understanding of concepts and working of functions. The goal of reading should be to understand and implement code functions as well as support your understanding and help your troubleshooting of problems. This cannot happen if you just read the content without interacting with it, however reading is absolutely essential to being successful during class time.
Class: In class, some functions and concepts will be introduced and we will practice implementing code through exercises.
Supplementary Readings
To Do
Read through the module. You can use the R
console or open up an R Markdown
(e.g., .Rmd
) file to follow along interactively. If you instead prefer to simply read through the content so that you can understand the concepts without coding, that is fine too. You will practice using the concepts by working on activities. Reading the module will provide you with confidence working on those activities and prevent you from feeling lost while completing activities. Testing out some code may provide you more confidence.
External Functions
Provided in class:
view_html()
: for viewing data frames in html format
You can access this remotely from:
Or from your own project work space by loading functions in /r/functions
.
Libraries
- {dplyr} 1.1.4: for selecting, filtering, and mutating
Load libraries
library(dplyr)
Loading Data
For this exercise, we will use some data from a 2023 CMS swim meet located at: “https://github.com/slicesofdata/dataviz23/raw/main/data/swim/cleaned-2023-CMS-Invite.csv”.
<- readr::read_csv(
SWIM file = "https://github.com/slicesofdata/dataviz23/raw/main/data/swim/cleaned-2023-CMS-Invite.csv",
show_col_types = F
)
Flavors of Data
Data come in many flavors. Data are collected on virtually everything you could think of. If there is a record of something, there is data. The stars, weather, television viewing, social media use, ocean currents, brand preference, parenting, symphonies, learning to code, graphical perceptual. When there is no record of something, you could collect data on it. For example, you could start logging the quantity of quality of advice your grandparents offer you. You may likely discover that advice changes based on their age. You may also consider logging the number to times you see your grandparents or your parents/guardians annually in order to determine approximately how many more times you will see them before they die, if they haven’t yet. In fact, some folks have already worked on this problem using actuarial tables. Seeyourfolks will provide a model estimate for you. See for yourself. Sure, it may be in error but all models have error. Nevertheless, prediction may change your life.
Data Organization and Variability
In some industries, data may be organized at a high level. For example, someone examining population growth may have data at a census level or an annual level. Someone who studies education may had data organized at a state level or county level. In other industries, data may be organized at a fine-grained level. For example, someone studying health outcomes may have data about individuals daily if daily health checks are part of a data-collection protocol. Someone who studies education may design a study to collect data, or obtain data, on daily-performance of students. Even further, their data may contain repeated measurements of specific topics (e.g., 10 different addition problems rather than one). Similarly, someone studying some aspect of human cognition may have high-level data like overall accuracy and total time needed to complete the task or they might have data at a low level containing trial-level accuracy and reaction timed data for each item.
Depending the organization of data and its level of granularity, one may be limited to a set of statistical models to examine. Because variability in data plays a significant role in testing models, the absence of variability will prevent modeling using statistical techniques that require that variability. For example, if your data set contains only one value for a country, you are likely limited to models that allow for comparisons or associations across countries or across time. For example, you could examine data to understand how population growth for a single country changes over time or you could examine population growth in one country compares to one or more than one country. Using the educational example, one could examine changes in a state over time or one could examine states to one another. By contrast, if you have daily data per student for a variety of topics, you have variability in data to examine at a variety of levels. You could examine how an individual’s performance changes across time in general or specifically for a specific topic. You can also examine how the individual performs on one type of task relative to another. You could also examine changes over time in performance on one task compared with changes over time for other tasks. Stepping up the hierarchy, you could compare performance of individuals across teachers, teacher’s sections, schools, school districts, counties, states, etc. because there is variability in the data to try to understand.
Although the data structure is typically conceived from top to bottom, which lower levels nested under higher levels, I will present the opposite order to prevent you reading from the bottom up.
Daily Performance
- on an item type (e.g., 4 + 17)
-- on a task (e.g., addition)
--- for a person (e.g, Bill)
---- in a teacher's class (e.g., Mrs. Collins, Section 1)
----- in a school (e.g., Robert D. Wilson)
------ in a district (e.g., Western Wayne)
------- in a country (e.g., Wayne)
-------- in a state (e.g., Pennsylvania)
--------- in a country (e.g., United States of America)
The structure at all these levels means that you can perform more complex models like linear-mixed models/hierarchical linear modeling. The variability in the data at all these levels allows you to account for the variability in ways you cannot if you only have top-level data. When you have the variability, you can examine for effect like random effects or fixed effects, which are not applicable to simpler linear models. To the extent that teachers in District A teach differently or motivate students differently from those teachers in District B, you could examine the data for these differences and account for them statistically (hold them constant as if teachers were the same) at higher levels of the data. In fact, that repeated-measures ANOVA model you are taught in your statistics class is not be appropriate for data with repeated measurements under different conditions because two individuals who with the same average performance may differ in their variability. For example, two students answer the subtraction problem (14 - 7) on average in 10.0 seconds across 5 instances. Mildred is consistent and answers it between 8 and 12 seconds. Tilford, however, answers with a range of 5 to 20 seconds. Although performance time is equated, the variability represents something very different in ability.
Making Data
We will work with a tiny data frame containing data for performance scores from students in schools in districts.
<-
DATA data.frame(
Student = c("Jim", "Sally", "June", "Mildred", "Tilford", "Beavis", "Herman", "Peppa", "Kay", "Jake", "Name Missing"),
Score = c(80, 85, 79, 80, 81, 91, 89, 60, 65, 67, 65),
School = c("A", "A", "B", "B", "C", "C", "C", "D", "D", "E", "E"),
District = c("West", "West", "West", "West", "East", "East", "East", "North", "North" , "North", "North")
)
You can see that the performance for all students is on average, for example, if you didn’t know where they went to school. We would take only the mean.
Summarizing Data Using {dplyr}
However, there are many ways to summarize data. To introduce data summary techniques using {dplyr}, we will use dplyr::summarise()
or dplyr::summarize()
to summarize the data in the data frame. The summarize()
(or summarise()
) function works similar to mutate()
insofar as it creates variables but it differs insofar as the data frame returned from summarize()
contains only the variable(s) referenced in as part of that summary process. Importantly, summarize()
will be performed based on some functions called immediately prior to summarize()
.
summarize(
.data,
..., .by = NULL,
.groups = NULL
)
Key Parameters/Arguments:
.data
: a data frame...
: name-value pairs of variable names and functions for the summary operation
Summarizing as a Constant
When piping code, the data frame will be inherited; the parameter and argument can be omitted. You will need to pass the expressions for the summary, however. Like mutate()
The expression will require a variable name as well as a value or function.
Although summarize()
can create variables like mutate()
, it does so by grouping the data in some way. We will cover using summarize()
with group_by()
to group data but for now, you can see grouping happens automatically, making summarize()
a different function than mutate()
. The functions serve different purposes. A good rule of thumb for remembering the differences is that mutate()
is used to add column variables to a data frame to make it bigger and summarize()
is used to reduce the data frame by summarizing the data within it.
If we assign a variable name a value, we see that the data frame that is returned is now reduced to a single row. Specifically, the default operation is that summarize()
groups all rows of the data frame together.
|>
DATA summarize(my_new_var = 1)
my_new_var
1 1
Clearly, this is a ridiculous summary of the data. Let’s do something that makes more sense.
Summarizing with a Function
You will likely want to summarize the data in some way, for example, by obtaining the mean, sum, count, standard deviation, or some other statistic from the data. If we wanted to obtain the mean of some variable in the data frame, we would need to specify the variable that we want to summarize, the function for how to summarize that variable and then we would want to specify the variable name for the data frame to contain.
General case:
= the_function(my_var) summarized_var
Specific case:
= mean(my_var) mean
Let’s summarize the performance Score
in the data frame. In the example below, we summarize data by creating a new variable which is set to represent some data summary technique. In essence, summarizing is for descriptive statistics. Using mean()
, we can summarize the data by taking the mean of the Score
variable.
|>
DATA summarize(bob = mean(Score))
bob
1 76.54545
We see that the returned data frame contains only one row. By default, the data frame has been reduced to the mean across all rows. The column variable is named bob
. Let’s make the variable in the returned data frame make more sense.
|>
DATA summarize(mean_score = mean(Score))
mean_score
1 76.54545
Notice what is returned is a single value reflecting the mean of all the data in the data frame. We could have obtained the same without using {dplyr}.
mean(DATA$Score) # $ notation
[1] 76.54545
However, we lose flexibility of easily adding new summary procedures using this approach. Instead, summarize()
allows for multiple name-value pair functions.
Summarizing with Multiple Functions
Let’s create the mean and the standard deviation of Score
using mean()
. Because you will use mean()
often, you should understand how it works, so let’s first look at its parameters and default arguments more carefully.
mean(x, trim = 0, na.rm = FALSE, ...)
By default, the data are not trimmed and NA
values are not removed. If NA
s exist in your vector, the function will break unless you pass TRUE
as the argument to it, na.rm = TRUE
. The same is true for sd()
for computing the standard deviation.
Let’s summarize Score
using both functions. We see that the data frame has one row and two columns reflecting the variables specified in the function.
|>
DATA summarise(mean = mean(Score, na.rm = T),
sd = sd(Score, na.rm = T)
)
mean sd
1 76.54545 10.56753
You see the basic operation of summarize()
. By default, variables are summarized using functions that aggregate the data across all rows. The resulting summary value is entered as column specified by the named we provided. Of course, however, we may not want to aggregate all of the data as a single group. Rather, we may wish to summarize by grouping data based on subgroups, factors, etc.
Aggregating Data
Data aggregation refers to how data are organized or compiled. When you have multiple levels of data as presented in the example above, data aggregation refers to the process of taking the raw data and gathering it in some way to summarize for statistical analysis or data visualization. For example, you could summarize data by individual, class, school, district, county, and state. Alternatively, you could summarize data by district, county, and state and ignore individuals, class, and school. The two types of data aggregation will result in different numbers of cases (rows) in a data frame most obviously but also in the actual numbers.
The variability in individuals as well as in subgroups can be useful to answering your data question. Models like linear mixed models are beyond the scope of this course, however. Whether you look into those types of model or you don’t, you should realize that you are making a big assumption in your data when you aggregate at any level. That assumption is that the variability in the data that are aggregated is the same across all cases in the data. Specifically, by aggregating Tilford’s and Mildred’s individual data in order to obtain their average performance, you have lost the variability at the individual level and essentially arbitrarily set that variability to 0. At higher level, aggregating the data by School
also means that you lose this information and make the assumption that there is no variability among individuals in each school (e.g., all individuals have the same value). Such assumptions are particularly relevant to repeated-measures models (e.g., repeated-measures ANOVA does not use this information) and are potentially not assumptions you want to make about your data.
In the end, how you interpret a model or a data visualization will be dependent on that aggregation of the data, modeled or displayed. The take home message about aggregating data, then, is that you trade off a loss of information for some summary. Only you can weigh the costs and benefits of this trade off. That said, sometimes you have no choice because your client, your employer, or your data resource has only data an an aggregate level. In those situations, you cannot do much except acknowledge the information loss in the ability of the model to predict an outcome. If, however, that aggregated level can still be summarized or aggregated in some way, you are back to reflecting on issues of aggregation.
Aggregating Data using group_by()
{dplyr} has one main verb function for grouping data. group_by()
orders the groups in an ascending order so that functions performed on the subgroups will return a data frame that is ordered based on the grouping variable(s). Other functions like group_map()
, group_nest()
, group_split()
, and group_trim()
exist but they are experimental for {dplyr} r
packageVersion(“dplyr”)`. Although they allow for fun tasks like passing data subsets to statistical models, their experimental life cycle suggests you might not get to comfortable with them.
group_by()
is used to group data frames by one or more variables.
group_by(.data,
..., .add = FALSE,
.drop = group_by_drop_default(.data)
)
Key Parameters/Arguments:
.data
: a data frame...
: variables or computations to group by.
The .add
and .drop
parameters are special cases of the function and will be addressed later. For now, let’s look at the basic functionality of group_by()
used prior to summarize().
Data aggregation using group_by()
Let’s say we want to group the data by School
and then summarize. The returned data frame now contains two columns, one for the grouping variable followed by a second for the specification in summarize()
. Importantly, the grouping variables will always be organized in the data frame to the left of variables created in the summary.
|>
DATA group_by(School) |>
summarize(Score = mean(Score)) # get the school level aggregated data
# A tibble: 5 × 2
School Score
<chr> <dbl>
1 A 82.5
2 B 79.5
3 C 87
4 D 62.5
5 E 66
Let’s say we try to determine mean performance for “all students” if we had the data aggregated as the school level and not the individual level. As a reminder, the mean across all rows in the data frame was r
mean(DATA$Score, na.rm = T)`.
If we want to summarize the data frame that was summarized, we might assume that we can pipe the data to another add another summarize()
.
|>
DATA group_by(School) |> # group by school
summarize(Score = mean(Score)) |> # get the school level aggregated data
summarize(Score = mean(Score)) # then calculate average across all schools
# A tibble: 1 × 1
Score
<dbl>
1 75.5
Note: If you noticed and wonder why the mean value returned here differs from r
mean(DATA$Score, na.rm = T)`, make a mental note. We will address that later.
Let’s say we try to determine average performance for “all students” if we had the data aggregated as the district level and not the individual level.
|>
DATA group_by(District) |> # group by district
summarize(Score = mean(Score)) |> # get the district level aggregated data
summarize(Score = mean(Score)) # then calculate average across all districts
# A tibble: 1 × 1
Score
<dbl>
1 77.4
So what’s the average performance for students? Well, that interpretation differs based on how values are treated. At the highest level, all students’ Scores
are weighed equally. Each student contributes to the data in the same way. When aggregated by School
, each school contributes to the calculation equally, even if the number of bodies per school differs. And finally, when data are aggregated at the District
level, all districts are treated equally in the calculation independent on the number of schools in a district. Only with weighting means would you end up with the same average performance.
Measures of variability in the data, however, reveal something else. Schools differ in variability and schools within districts vary as well. Depending on the level of aggregation, you may never notice interesting patterns in data that lead to questions to investigate and later policies to change. For example, addressing a school that is left behind others or a district that is left behind others. Aggregation can also lead to inefficient allocations of resources. If one school in a district needs help rather than all schools in the district needing help, the cost may differ substantially.
<- function(x) { sd(na.omit(x)) / sqrt(length(na.omit(x))) } std_error
Use std_error()
in summary:
|>
DATA group_by(School, District) |>
summarize(Var = var(Score),
SD = sd(Score),
SEM = std_error(Score),
min = min(Score),
max = max(Score)
|> mutate(range = max - min) )
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
# A tibble: 5 × 8
# Groups: School [5]
School District Var SD SEM min max range
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A West 12.5 3.54 2.5 80 85 5
2 B West 0.5 0.707 0.5 79 80 1
3 C East 28 5.29 3.06 81 91 10
4 D North 12.5 3.54 2.5 60 65 5
5 E North 2 1.41 1 65 67 2
Comparing mutate()
with summarize()
{dplyr} has two main verb functions for creating variables (e.g., mutate()
and summarize()
). They operate differently and their returned tibble also differ. If you have not been paying attention to the returned tibble post calculation, you might have overlooked the differences.
Grouping structure of group_by()
and mutate()
Let’s mutate()
the mean for Score
after sub-setting with group_by()
.
|>
DATA group_by(School, District) |>
mutate(Score = mean(na.omit(Score)))
# A tibble: 11 × 4
# Groups: School, District [5]
Student Score School District
<chr> <dbl> <chr> <chr>
1 Jim 82.5 A West
2 Sally 82.5 A West
3 June 79.5 B West
4 Mildred 79.5 B West
5 Tilford 87 C East
6 Beavis 87 C East
7 Herman 87 C East
8 Peppa 62.5 D North
9 Kay 62.5 D North
10 Jake 66 E North
11 Name Missing 66 E North
There are two things to watch when using this mutate()
following group_by()
.
1. A
Score
will be assigned to each row/case orStudent
in the data frame.- When schools and districts are grouped, each student in the same school will have the same assigned average value. All rows are maintained, none dropped.
2. The returned tibble takes on a new structure.
- Looking at the feedback displayed in the console, you see the following report preceding the data.
A tibble: 11 × 3
# Groups: School, District [5]
You also see this structure using glimpse()
.
|>
DATA group_by(School, District) |>
mutate(Score = mean(na.omit(Score))) |>
glimpse()
Rows: 11
Columns: 4
Groups: School, District [5]
$ Student <chr> "Jim", "Sally", "June", "Mildred", "Tilford", "Beavis", "Herm…
$ Score <dbl> 82.5, 82.5, 79.5, 79.5, 87.0, 87.0, 87.0, 62.5, 62.5, 66.0, 6…
$ School <chr> "A", "A", "B", "B", "C", "C", "C", "D", "D", "E", "E"
$ District <chr> "West", "West", "West", "West", "East", "East", "East", "Nort…
Rows: 11
Columns: 3
Groups: School, District [5]
What does the [5]
mean? Well, before we answer this, let’s use summarize()
.
Grouping structure of group_by()
and summarize()
Let’s summarize()
the mean for Score
after sub-setting with group_by()
.
|>
DATA group_by(School, District) |>
summarize(Score = mean(Score))
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
# A tibble: 5 × 3
# Groups: School [5]
School District Score
<chr> <chr> <dbl>
1 A West 82.5
2 B West 79.5
3 C East 87
4 D North 62.5
5 E North 66
There are two things to watch when using this summarize()
following group_by()
.
1. A Score
average is assigned to each row/case or School
in this aggregated data frame.
When schools and districts are grouped, each school will have its own average value. Rows from the data frame are dropped as are columns that are not passed to group_by()
. In this case, School
, District
, and the new variable, Score
are returned.
2. The returned tibble takes on a new structure.
The feedback in the console is a little more detailed here.
`summarise()` has grouped output by 'School'. You can override using the `.groups` argument.
# A tibble: 5 × 3
# Groups: School [5]
You see reference to overriding the grouping using .groups
. According to the tidyverse documentation for grouping, this parameter “controls the grouping structure of the output. The historical behaviour of removing the right hand side grouping variable corresponds to .groups = "drop_last"
without a message or .groups = NULL
with a message (the default)”.
You have most likely paid little attention to this message. By default summarize()
keeps the first grouping variable passed to group_by()
in the returned tibble. This is why you see School
referenced and not District
or both variables. So, do you want your data frame to contain groups or no groups? Stated differently, do you just want that data summarized by your grouping variables and have a simple nxm data frame or do you want something more complex?
To see the structure better, let’s first look a the structure of DATA
and then more closely at the summarized version.
str(DATA)
'data.frame': 11 obs. of 4 variables:
$ Student : chr "Jim" "Sally" "June" "Mildred" ...
$ Score : num 80 85 79 80 81 91 89 60 65 67 ...
$ School : chr "A" "A" "B" "B" ...
$ District: chr "West" "West" "West" "West" ...
Let’s assign the summarized data frame to an object for inspection.
<-
DATA_SUM |>
DATA group_by(School, District) |>
summarize(Score = mean(Score))
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
For DATA_SUM
, the structure is different. In particular, you see reference to (S3: grouped_df/tbl_df/tbl/data.frame)
. This tells you that the data frame is not a simple nxm but contains groups. You could think of this as 3 nxm data frames organized together. For more details, see the dplyr documentation.
str(DATA_SUM)
gropd_df [5 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
$ School : chr [1:5] "A" "B" "C" "D" ...
$ District: chr [1:5] "West" "West" "East" "North" ...
$ Score : num [1:5] 82.5 79.5 87 62.5 66
- attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
..$ School: chr [1:5] "A" "B" "C" "D" ...
..$ .rows : list<int> [1:5]
.. ..$ : int 1
.. ..$ : int 2
.. ..$ : int 3
.. ..$ : int 4
.. ..$ : int 5
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE
group_by()
and summarize()
with .groups
When you group data using group_by()
and then summarize()
, the summary variables (e.g., mean) will result in a single row for each level of a single grouping variable. If there are more than one grouping variable, the additional grouping variable will be introduced to the data frame as a second column. The total number of rows in the data frame will be equal to the number of levels of group 1 x number of levels of group 2 if an only if each levels of group 1 has a corresponding level for group 2,
for example:
sex age mean
men young x
men old x
women young x
women old x
If there is no pairing of levels in the data (e.g., no men who are old), that row will be omitted from the returned data frame.
for example:
sex age mean
men young x
women young x
women old x
rather than:
sex age mean
men young x
men old NA
women young x
women old x
You really need to query ?dplyr::summarize
to understand .groups
, which controls the grouping of the returned data frame. This is also experimental to summarize()
, so it might not be available in the future.
Parameters/Arguments:
There are four argument options:
"drop_last"
: dropping the last level of grouping. This was the only supported option before version 1.0.0."drop"
: All levels of grouping are dropped."keep"
: Same grouping structure as .data."rowwise"
: Each row is its own group.
Let’s work through them individually and see how they work.
.groups = "drop_last"
:
|>
DATA group_by(School, District) |>
summarize(Score = mean(Score),
.groups = "drop_last"
)
# A tibble: 5 × 3
# Groups: School [5]
School District Score
<chr> <chr> <dbl>
1 A West 82.5
2 B West 79.5
3 C East 87
4 D North 62.5
5 E North 66
You will see that the grouping for District
is not in the grouping structure because it was the last grouping.
# A tibble: 5 × 3
# Groups: School [5]
School District Score
.groups = "drop"
:
|>
DATA group_by(School, District) |>
summarize(Score = mean(Score),
.groups = "drop"
)
# A tibble: 5 × 3
School District Score
<chr> <chr> <dbl>
1 A West 82.5
2 B West 79.5
3 C East 87
4 D North 62.5
5 E North 66
You will see that all groups are dropped from the grouping structure.
# A tibble: 5 × 3
School District Score
.groups = "keep"
:
|>
DATA group_by(School, District) |>
summarize(Score = mean(Score),
.groups = "keep"
)
# A tibble: 5 × 3
# Groups: School, District [5]
School District Score
<chr> <chr> <dbl>
1 A West 82.5
2 B West 79.5
3 C East 87
4 D North 62.5
5 E North 66
You will see that all groups are kept in the grouping structure.
# A tibble: 5 × 3
# Groups: School, District [5]
School District Score
.groups = "rowwise"
:
|>
DATA group_by(School, District) |>
summarize(Score = mean(Score),
.groups = "rowwise"
)
# A tibble: 5 × 3
# Rowwise: School, District
School District Score
<chr> <chr> <dbl>
1 A West 82.5
2 B West 79.5
3 C East 87
4 D North 62.5
5 E North 66
You will see that there are groups but they are not based on group_by()
.
# A tibble: 5 × 3
# Rowwise: School, District
School District Score
The grouping structure always matters for subsequent computations. But with "rowwise"
you will see this clearly. Following from above, let’s say we wanted to compute the mean across all of the School
s in the data frame returned by group_by()
then summarize()
. We add mutate(mean = mean(Score))
on a new line of our piped code block.
|>
DATA group_by(School, District) |>
summarize(Score = mean(Score),
.groups = "rowwise"
|>
) mutate(mean = mean(Score))
# A tibble: 5 × 4
# Rowwise: School, District
School District Score mean
<chr> <chr> <dbl> <dbl>
1 A West 82.5 82.5
2 B West 79.5 79.5
3 C East 87 87
4 D North 62.5 62.5
5 E North 66 66
The mean
column does not contain a single mean replicated for each row in the data frame. Rather the grouping was per row, so each row has its one mean. The mean of a single value is, of course, itself.
Revisiting ".groups = drop_last"
:
Returning to the default .groups
structure, which is "drop_last"
, we can add the same mutate()
to see what happens.
|>
DATA group_by(School, District) |>
summarize(Score = mean(Score),
.groups = "drop_last"
|>
) mutate(mean = mean(Score))
# A tibble: 5 × 4
# Groups: School [5]
School District Score mean
<chr> <chr> <dbl> <dbl>
1 A West 82.5 82.5
2 B West 79.5 79.5
3 C East 87 87
4 D North 62.5 62.5
5 E North 66 66
The mean
column still has different values but they are replicated on rows with the same School
group. This is because the data are grouped that way.
Revisiting .groups = "drop"
:
Let’s again add the same mutate()
to see what happens.
|>
DATA group_by(School, District) |>
summarize(Score = mean(Score),
.groups = "drop"
|>
) mutate(mean = mean(Score))
# A tibble: 5 × 4
School District Score mean
<chr> <chr> <dbl> <dbl>
1 A West 82.5 75.5
2 B West 79.5 75.5
3 C East 87 75.5
4 D North 62.5 75.5
5 E North 66 75.5
Consider ungroup()
:
.groups = "drop"
in effect works the same as does ungroup()
. The grouping structure is broken and all subsequent operations are based on the ungrouped data frame.
|>
DATA group_by(School, District) |>
summarize(Score = mean(Score),
.groups = "drop_last"
|>
) ungroup() |>
mutate(mean = mean(Score))
# A tibble: 5 × 4
School District Score mean
<chr> <chr> <dbl> <dbl>
1 A West 82.5 75.5
2 B West 79.5 75.5
3 C East 87 75.5
4 D North 62.5 75.5
5 E North 66 75.5
Keep in mind that functions work as they are programmed to work. Functions do not work like you think they work. You must understand the function and check our work to ensure your calculations are what you intend them to be.
The order of operations matters
To illustrate further, consider you want to calculate some summary statistics. You set out to obtain the mean and the standard deviation for your data. Those computations will be performed according to the grouping structure.
When you compute standard deviation and the mean of Score
, you assign the mean to Score
because you want your plot to contain a clean name rather than one like Score_Mean
that you will have to address in the plot.
|>
DATA group_by(School, District) |>
summarize(Score_sd = sd(Score), # get sd
Score = mean(Score) # get mean, assign to same name
)
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
# A tibble: 5 × 4
# Groups: School [5]
School District Score_sd Score
<chr> <chr> <dbl> <dbl>
1 A West 3.54 82.5
2 B West 0.707 79.5
3 C East 5.29 87
4 D North 3.54 62.5
5 E North 1.41 66
Both variables use the same data because the standard deviation assigns the value to a different variable name Score_sd
. The mean()
is not based on some changed variable.
The output is different from the one returned when the mean is computed before the standard deviation and in particular when the mean is assigned to Score
. In this case, the standard deviation is based on this new Score
variable. Because the standard deviation is a measure of variability, and Score
does not vary based on the grouping structure, NA
is returned.
|>
DATA group_by(School, District) |>
summarize(Score = mean(Score), # get mean, assign to same name
Score_sd = sd(Score) # get sd
)
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
# A tibble: 5 × 4
# Groups: School [5]
School District Score Score_sd
<chr> <chr> <dbl> <dbl>
1 A West 82.5 NA
2 B West 79.5 NA
3 C East 87 NA
4 D North 62.5 NA
5 E North 66 NA
If you really wanted the standard deviation of all the means, consider ungrouping and then compute the standard deviation or use .groups = "drop"
in summarize()
. Realize, however, this latter functionality is experimental and may not work sometime later. You are likely better off using ungroup()
.
|>
DATA group_by(School, District) |>
summarize(Score = mean(Score)) |> # get mean, assign to same name
ungroup() |>
mutate(sd = sd(Score))
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
# A tibble: 5 × 4
School District Score sd
<chr> <chr> <dbl> <dbl>
1 A West 82.5 10.7
2 B West 79.5 10.7
3 C East 87 10.7
4 D North 62.5 10.7
5 E North 66 10.7
More on Grouping Structure
We can now also consider ways to create column variables and a corresponding grouping structure within the group_by()
verb.
New Variable Columns
group_by()
is not designed to create new variables but rather create groups. The function, however, does not require variables in a data frame in order to group based on their levels or differences in values.
group_by()
using a function
Heretofore, we have grouped by column variables in a data frame. But you can also group in other ways, for example, by a variable calculated by a function. For example, if you wanted to group by standard deviations, you could calculated the standard deviation (Score-mean(Score)) / sd(Score)
for each student and then cut()
that variable into groups. cut
will take a numeric vector variable and turn it into a factor variable of n groups as determined by what you pass to breaks
. There is also an argument to make the factor ordered if you wish, ordered_result = TRUE
but the default behavior does not order the factor. You can also change the level labels if you inspect the function.
|>
DATA group_by(z_factor = cut(
x = ((Score - mean(Score)) / sd(Score) ),
breaks = 5,
ordered_result = T
) )
# A tibble: 11 × 5
# Groups: z_factor [4]
Student Score School District z_factor
<chr> <dbl> <chr> <chr> <ord>
1 Jim 80 A West (0.194,0.781]
2 Sally 85 A West (0.781,1.37]
3 June 79 B West (0.194,0.781]
4 Mildred 80 B West (0.194,0.781]
5 Tilford 81 C East (0.194,0.781]
6 Beavis 91 C East (0.781,1.37]
7 Herman 89 C East (0.781,1.37]
8 Peppa 60 D North (-1.57,-0.979]
9 Kay 65 D North (-1.57,-0.979]
10 Jake 67 E North (-0.979,-0.392]
11 Name Missing 65 E North (-1.57,-0.979]
You can see that there are 4 different levels of the grouping variable. The tibble is grouped of course too.
If you struggle with remembering the formula for a z score, scale()
will to the same thing.
|>
DATA mutate(z = (Score - mean(Score))/sd(Score),
scale = scale(Score)
)
Student Score School District z scale
1 Jim 80 A West 0.3269018 0.3269018
2 Sally 85 A West 0.8000492 0.8000492
3 June 79 B West 0.2322724 0.2322724
4 Mildred 80 B West 0.3269018 0.3269018
5 Tilford 81 C East 0.4215313 0.4215313
6 Beavis 91 C East 1.3678261 1.3678261
7 Herman 89 C East 1.1785671 1.1785671
8 Peppa 60 D North -1.5656877 -1.5656877
9 Kay 65 D North -1.0925403 -1.0925403
10 Jake 67 E North -0.9032814 -0.9032814
11 Name Missing 65 E North -1.0925403 -1.0925403
If you want n groups based on specific breaks, then pass a vector of those break units. For example, if we want levels to correspond to some meaningful standard-deviation cuts. For example, if you wanted to group those with z scores ranging from infinitely negative to -2, -2 to -1, - to 1, 1 to 2, and 2 to infinitely large, you could specify the breaks. If there are no values in those ranges, then there won’t be any data for those breaks.
|>
DATA group_by(z_factor = cut( scale(Score),
breaks = c(-Inf, -2, -1, 1, 2, Inf),
ordered_result = T
), |>
) ungroup()
# A tibble: 11 × 5
Student Score School District z_factor
<chr> <dbl> <chr> <chr> <ord>
1 Jim 80 A West (-1,1]
2 Sally 85 A West (-1,1]
3 June 79 B West (-1,1]
4 Mildred 80 B West (-1,1]
5 Tilford 81 C East (-1,1]
6 Beavis 91 C East (1,2]
7 Herman 89 C East (1,2]
8 Peppa 60 D North (-2,-1]
9 Kay 65 D North (-2,-1]
10 Jake 67 E North (-1,1]
11 Name Missing 65 E North (-2,-1]
You can also just group by the function without using cut()
if you wish to group by those with identical values on the variable. In this case, using count()
or tally()
reveals there are only two instances with the same score.
|>
DATA group_by(z_factor = ((Score - mean(Score)) / sd(Score) )) |>
count(sort = TRUE) # tally(sort = TRUE)
# A tibble: 9 × 2
# Groups: z_factor [9]
z_factor n
<dbl> <int>
1 -1.09 2
2 0.327 2
3 -1.57 1
4 -0.903 1
5 0.232 1
6 0.422 1
7 0.800 1
8 1.18 1
9 1.37 1
Add/drop Grouping Variable Columns
By default, group_by()
on a data frame that is already grouped (see earlier on grouped data frame), the existing grouping structure will be replaced by new grouping structure.
Let’s get an example. Please note that this example assigned the tibble to an object but whether you assign or not, the grouped structure exists. So functions that follow the group_by()
keep that structure.
<-
(school_grouped |>
DATA group_by(School)
)
# A tibble: 11 × 4
# Groups: School [5]
Student Score School District
<chr> <dbl> <chr> <chr>
1 Jim 80 A West
2 Sally 85 A West
3 June 79 B West
4 Mildred 80 B West
5 Tilford 81 C East
6 Beavis 91 C East
7 Herman 89 C East
8 Peppa 60 D North
9 Kay 65 D North
10 Jake 67 E North
11 Name Missing 65 E North
Now group by a new column:
|>
school_grouped group_by(District)
# A tibble: 11 × 4
# Groups: District [3]
Student Score School District
<chr> <dbl> <chr> <chr>
1 Jim 80 A West
2 Sally 85 A West
3 June 79 B West
4 Mildred 80 B West
5 Tilford 81 C East
6 Beavis 91 C East
7 Herman 89 C East
8 Peppa 60 D North
9 Kay 65 D North
10 Jake 67 E North
11 Name Missing 65 E North
You will see the grouping structure has changed.
# A tibble: 11 × 4
# Groups: District [3]
Retain Grouping
If you want to retain the existing group, you would need to use .add = TRUE
.
|>
school_grouped group_by(District, .add = TRUE)
# A tibble: 11 × 4
# Groups: School, District [5]
Student Score School District
<chr> <dbl> <chr> <chr>
1 Jim 80 A West
2 Sally 85 A West
3 June 79 B West
4 Mildred 80 B West
5 Tilford 81 C East
6 Beavis 91 C East
7 Herman 89 C East
8 Peppa 60 D North
9 Kay 65 D North
10 Jake 67 E North
11 Name Missing 65 E North
Notice in the console, that both School
and District
are included in the groups.
# A tibble: 11 × 4
# Groups: School, District [5]
Why Care About Grouping Structure
Well, the functions you apply to a grouped tibble will sometimes lead to calculations that are not what you intend. We provided some examples earlier but given the importance of the issue, we may benefit from another example.
Let’s say you want to compute the average for each school and add that school average for each student. This would tell you how the student differs from their school performance. Then you want to obtain the mean of all the schools and see how the school differs from all the schools.
You code it out:
|>
DATA # group
group_by(School) |>
# calculate the mean of Score for each School
mutate(School_Mean = mean(Score)) |>
# calculate the mean of all values (think Grand Mean from stats)
mutate(Mean_of_all = mean(School_Mean)) |>
# calculate the school performance relative to all
mutate(School_Performance = factor(
case_when(
< Mean_of_all ~ "Below Average",
School_Mean == Mean_of_all ~ "Average",
School_Mean > Mean_of_all ~ "Above Average"
School_Mean ordered = T)
), )
# A tibble: 11 × 7
# Groups: School [5]
Student Score School District School_Mean Mean_of_all School_Performance
<chr> <dbl> <chr> <chr> <dbl> <dbl> <ord>
1 Jim 80 A West 82.5 82.5 Average
2 Sally 85 A West 82.5 82.5 Average
3 June 79 B West 79.5 79.5 Average
4 Mildred 80 B West 79.5 79.5 Average
5 Tilford 81 C East 87 87 Average
6 Beavis 91 C East 87 87 Average
7 Herman 89 C East 87 87 Average
8 Peppa 60 D North 62.5 62.5 Average
9 Kay 65 D North 62.5 62.5 Average
10 Jake 67 E North 66 66 Average
11 Name Missing 65 E North 66 66 Average
Perfect. OK, let’s plot it. Oh but first let me inspect the data. Um, why is School_Performance
the same for all Students
? Schools are not performing the same so they cannot all be average. Check your case_when()
for errors because that where the new variable was created. Then you spend 40 days and 40 nights trying to fix your code. No matter what you do with case_when()
, you cannot fix the problem. So you try to create 42 data frames to solve your problem. Even grandma knows that is a ridiculous strategy. She suggests you read the documentation for all of the functions you used because one time her pot-luck cake flopped and she inspected the expiration date for all of her ingredients and found the baking soda was old.
You discover that the grouping structure is retained on all operations until that grouping structure is removed or replaced.
ungroup()
Whereas summarize has a .groups
functionality, mutate()
does not. In order to have a simple data frame use follow group_by()
and mutate()
with ungroup()
. But ungroup()
can be used following both functions if that approach makes more sense to you.
Let’s add an ungroup()
directly after performing the operation that needed the grouping structure.
|>
DATA # group
group_by(School) |>
# calculate the mean of Score for each School
mutate(School_Mean = mean(Score)) |>
# break that grouping structure
ungroup() |>
# calculate the mean of all values (think Grand Mean from stats)
mutate(Mean_of_all = mean(School_Mean)) |>
# calculate the school performance relative to all
mutate(School_Performance = factor(
case_when(
< Mean_of_all ~ "Below Average",
School_Mean == Mean_of_all ~ "Average",
School_Mean > Mean_of_all ~ "Above Average"
School_Mean ordered = T)
), )
# A tibble: 11 × 7
Student Score School District School_Mean Mean_of_all School_Performance
<chr> <dbl> <chr> <chr> <dbl> <dbl> <ord>
1 Jim 80 A West 82.5 76.5 Above Average
2 Sally 85 A West 82.5 76.5 Above Average
3 June 79 B West 79.5 76.5 Above Average
4 Mildred 80 B West 79.5 76.5 Above Average
5 Tilford 81 C East 87 76.5 Above Average
6 Beavis 91 C East 87 76.5 Above Average
7 Herman 89 C East 87 76.5 Above Average
8 Peppa 60 D North 62.5 76.5 Below Average
9 Kay 65 D North 62.5 76.5 Below Average
10 Jake 67 E North 66 76.5 Below Average
11 Name Missing 65 E North 66 76.5 Below Average
Yes! Mean_of_all
is now a true grand mean based on all of the data with no grouping at all. Comparing Mean_of_all
to School_Mean
provides two classifications.
Variable loss with summarize()
Because summarize()
returns a tibble based on the grouping structure, you will lose all variables that are not in the grouping structure. By design, it no longer allows you to return a data frame with duplicated rows based on the variables passed to group_by()
.
Returning to the school data, what this means, is that you cannot retain the School
if you group by the District
in order to obtain averages by district.
|>
DATA group_by(District) |>
summarize(Mean = mean(Score))
# A tibble: 3 × 2
District Mean
<chr> <dbl>
1 East 87
2 North 64.2
3 West 81
If you try to add School
to group_by()
, then you will only obtain the averages by schools within districts.
|>
DATA group_by(School, District) |>
summarize(Mean = mean(Score))
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
# A tibble: 5 × 3
# Groups: School [5]
School District Mean
<chr> <chr> <dbl>
1 A West 82.5
2 B West 79.5
3 C East 87
4 D North 62.5
5 E North 66
You can try to summarize based on a variable without assign it to a function but that is no longer allowed with summarize()
. The following code block will throw an error.
|>
DATA group_by(District) |>
summarize(Mean = mean(Score),
# add school
School )
An alternative to summarize()
: reframe()
Instead, you can use reframe()
, which allows for previous functionality of summarize()
that is now deprecated.
|>
DATA group_by(District) |>
reframe(Mean = mean(Score),
School,
Student )
# A tibble: 11 × 4
District Mean School Student
<chr> <dbl> <chr> <chr>
1 East 87 C Tilford
2 East 87 C Beavis
3 East 87 C Herman
4 North 64.2 D Peppa
5 North 64.2 D Kay
6 North 64.2 E Jake
7 North 64.2 E Name Missing
8 West 81 A Jim
9 West 81 A Sally
10 West 81 B June
11 West 81 B Mildred
Remember that the operations in summarize()
, mutate()
, and reframe()
all depend on the grouping structure. If you wanted to add variables, you would need to group a different way and you would need to add all variables into summarize()
every step of the way.
|>
DATA group_by(District) |>
reframe(District_Mean = mean(Score),
School,
Student,
Score|>
) ungroup() |>
group_by(School) %>%
reframe(School_Mean = mean(Score),
School,
Student,
Score,
District_Mean|>
) ungroup()
# A tibble: 11 × 5
School School_Mean Student Score District_Mean
<chr> <dbl> <chr> <dbl> <dbl>
1 A 82.5 Jim 80 81
2 A 82.5 Sally 85 81
3 B 79.5 June 79 81
4 B 79.5 Mildred 80 81
5 C 87 Tilford 81 87
6 C 87 Beavis 91 87
7 C 87 Herman 89 87
8 D 62.5 Peppa 60 64.2
9 D 62.5 Kay 65 64.2
10 E 66 Jake 67 64.2
11 E 66 Name Missing 65 64.2
Retaining distinct data after mutate()
using distinct()
This is just too tedious. Your better option is mutate()
as all variables will be added to the full data frame. All variables are neatly and logically appended to the right hand side of the data frame.
|>
DATA group_by(District) |>
mutate(District_Mean = mean(Score)) |>
group_by(School) |>
mutate(School_Mean = mean(Score))
# A tibble: 11 × 6
# Groups: School [5]
Student Score School District District_Mean School_Mean
<chr> <dbl> <chr> <chr> <dbl> <dbl>
1 Jim 80 A West 81 82.5
2 Sally 85 A West 81 82.5
3 June 79 B West 81 79.5
4 Mildred 80 B West 81 79.5
5 Tilford 81 C East 87 87
6 Beavis 91 C East 87 87
7 Herman 89 C East 87 87
8 Peppa 60 D North 64.2 62.5
9 Kay 65 D North 64.2 62.5
10 Jake 67 E North 64.2 66
11 Name Missing 65 E North 64.2 66
Also, when you use a new group_by()
, it will replace previous grouping structure by default. The following two code blocks return the same data frame.
|>
DATA group_by(District) |>
mutate(District_Mean = mean(Score)) |>
group_by(School) |>
mutate(School_Mean = mean(Score))
With ungroup()
:
|>
DATA group_by(District) |>
mutate(District_Mean = mean(Score)) |>
ungroup() |> # ungroup
group_by(School) |>
mutate(School_Mean = mean(Score))
If you wish those variables to remain, you have a couple of options. The first would be to use mutate()
.
Notice, however, that if you want to extract the summaries from the large data frame as you would have had you used summarize()
, you cannot just select()
your column of interest because the repetitions will exist for each row in the data frame.
|>
DATA group_by(District) |>
mutate(District_Mean = mean(Score)) |>
select(c("School", "District_Mean")) # if passing a vector, use any_of() or all_of())
Adding missing grouping variables: `District`
# A tibble: 11 × 3
# Groups: District [3]
District School District_Mean
<chr> <chr> <dbl>
1 West A 81
2 West A 81
3 West B 81
4 West B 81
5 East C 87
6 East C 87
7 East C 87
8 North D 64.2
9 North D 64.2
10 North E 64.2
11 North E 64.2
You can obtain the rows that are distinct()
(unique) and then pick()
the ones to keep.
|>
DATA group_by(District) |>
mutate(District_Mean = mean(Score)) |>
distinct(pick(c("District", "District_Mean")))
# A tibble: 3 × 2
# Groups: District [3]
District District_Mean
<chr> <dbl>
1 West 81
2 East 87
3 North 64.2
But you will need to ungroup()
before distinct()
because, as we have mentioned before, all subsequent functions other than group_by()
will maintain the previous grouping structure by default.
|>
DATA group_by(School) |>
mutate(School_Mean = mean(Score)) |>
ungroup()
# A tibble: 11 × 5
Student Score School District School_Mean
<chr> <dbl> <chr> <chr> <dbl>
1 Jim 80 A West 82.5
2 Sally 85 A West 82.5
3 June 79 B West 79.5
4 Mildred 80 B West 79.5
5 Tilford 81 C East 87
6 Beavis 91 C East 87
7 Herman 89 C East 87
8 Peppa 60 D North 62.5
9 Kay 65 D North 62.5
10 Jake 67 E North 66
11 Name Missing 65 E North 66
Summary
How data are aggregated will affect both summary statistics, the models that you can run, and the data visualizations that you can create to represent the data. With each level of aggregation, fine-grained or case-level data are lost. That fine-grained detail may be computed with measures like variability and confidence levels but those measures are summaries themselves. Including them is not the same as including all data. That said, including all data may influence model outcome and statistical inference. The data will also influence the interpretation of plots to the extent that outlying points or grouping characteristics (point proximity) drive attention to physical locations of plots at the cost of attending to other locations.
A main workhorse of adding variables and creating aggregated data in the R
language is the {dplyr} library. You must, however, be sensitive to the way its functions work so that the variables you create in either a full or summarized data frame represent your intentions and goals.
We have seen that a lot of customization can be applied to our visualizations of data and have shown just a little bit of what can be done. Many more functions and parameters can be used to customize to greater extents.
Advanced Summarizing Techniques
For more advanced summarizing techniques, see this module on advanced summarizing techniques.
Session Info
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] dplyr_1.1.4
loaded via a namespace (and not attached):
[1] bit_4.0.5 jsonlite_1.8.7 crayon_1.5.2
[4] compiler_4.3.2 BiocManager_1.30.22 renv_1.0.3
[7] tidyselect_1.2.0 stringr_1.5.1 parallel_4.3.2
[10] yaml_2.3.7 fastmap_1.1.1 here_1.0.1
[13] readr_2.1.4 R6_2.5.1 generics_0.1.3
[16] curl_5.2.0 knitr_1.45 htmlwidgets_1.6.4
[19] tibble_3.2.1 rprojroot_2.0.3 pillar_1.9.0
[22] tzdb_0.4.0 R.utils_2.12.2 rlang_1.1.1
[25] utf8_1.2.4 stringi_1.7.12 xfun_0.40
[28] bit64_4.0.5 cli_3.6.1 withr_2.5.0
[31] magrittr_2.0.3 digest_0.6.33 vroom_1.6.5
[34] rstudioapi_0.15.0 hms_1.1.3 lifecycle_1.0.3
[37] R.methodsS3_1.8.2 R.oo_1.25.0 vctrs_0.6.5
[40] evaluate_0.21 glue_1.6.2 fansi_1.0.4
[43] rmarkdown_2.25 tools_4.3.2 pkgconfig_2.0.3
[46] htmltools_0.5.7