Statistical transformations
Overview
This module describes some differences in data summaries and operations of functions in {dplyr}. Any form of data aggregation (e.g., data summary) will influence the contents of a data frame and consequently, how data are communicated visually. Depending on the data information you wish to communicate in your visualization, you may need to use different data frames or you may need to summarize a single data frame in a different ways either using {dplyr} functions like group_by()
and summarize()
or **{ggplot2}* functions like stat_summary()
. For example, you may wish to communicate the min()
and max()
of some numeric variable in a single line-plot visualization in order to convey changes in both metrics. Alternatively, you may wish to display both mean()
and median()
values across time in the same visualization as seen below. You may also need to create multiple data frames if a single data frame contains either too much or too little information. We will also discuss differences in grouping structure for dplyr::mutate()
and dplyr::summarize()
that will affect both variable computation and plot creation.
To Do
Review corresponding Canvas video(s).
Optional Readings
External Functions
Provided in class:
view_html()
: for viewing data frames in html format, from /src/my_functions.R
You can use this in your own work space but I am having a challenge rendering this of the website, so I’ll default to print()
on occasion.
::sourceDirectory(here::here("src", "functions")) R.utils
Libraries
- {dplyr} 1.1.4: for selecting, filtering, and mutating
- {ggplot2} 3.5.1: for plotting
Load libraries
library(dplyr)
library(ggplot2)
Loading Data
To examine some associations, we will use some swimming event times which can be accessed from:
https://raw.githubusercontent.com/slicesofdata/dataviz24/main/data/processed/cleaned-2023-cms-invite.csv
To access the data, either read the file directly from the url using read.csv()
and assign the data frame a name like SWIM
:
read.csv("https://raw.githubusercontent.com/slicesofdata/dataviz24/main/data/processed/cleaned-2023-cms-invite.csv")
Or download it and save to the /data/processed
directory and read from there.
<- read.csv(here::here("data", "processed", "cleaned-2023-cms-invite.csv")) SWIM
Flavors of Data
Data come in many flavors. Data are collected on virtually everything you could think of. If there is a record of something, there is data. The stars, weather, television viewing, social media use, ocean currents, brand preference, parenting, symphonies, learning to code, graphical perceptual. When there is no record of something, you could collect data on it. For example, you could start logging the quantity of quality of advice your grandparents offer you. You may likely discover that advice changes based on their age. You may also consider logging the number to times you see your grandparents or your parents/guardians annually in order to determine approximately how many more times you will see them before they die, if they haven’t yet. In fact, some folks have already worked on this problem using actuarial tables. Seeyourfolks will provide a model estimate for you. See for yourself. Sure, it may be in error but all models have error. Nevertheless, prediction may change your life.
Data Organization and Variability
In some industries, data may be organized at a high level. For example, someone examining population growth may have data at a census level or an annual level. Someone who studies education may had data organized at a state level or county level. In other industries, data may be organized at a fine-grained level. For example, someone studying health outcomes may have data about individuals daily if daily health checks are part of a data-collection protocol. Someone who studies education may design a study to collect data, or obtain data, on daily-performance of students. Even further, their data may contain repeated measurements of specific topics (e.g., 10 different addition problems rather than one). Similarly, someone studying some aspect of human cognition may have high-level data like overall accuracy and total time needed to complete the task or they might have data at a low level containing trial-level accuracy and reaction timed data for each item.
Depending the organization of data and its level of granularity, one may be limited to a set of statistical models to examine. Because variability in data plays a significant role in testing models, the absence of variability will prevent modeling using statistical techniques that require that variability. For example, if your data set contains only one value for a country, you are likely limited to models that allow for comparisons or associations across countries or across time. For example, you could examine data to understand how population growth for a single country changes over time or you could examine population growth in one country compares to one or more than one country. Using the educational example, one could examine changes in a state over time or one could examine states to one another. By contrast, if you have daily data per student for a variety of topics, you have variability in data to examine at a variety of levels. You could examine how an individual’s performance changes across time in general or specifically for a specific topic. You can also examine how the individual performs on one type of task relative to another. You could also examine changes over time in performance on one task compared with changes over time for other tasks. Stepping up the hierarchy, you could compare performance of individuals across teachers, teacher’s sections, schools, school districts, counties, states, etc. because there is variability in the data to try to understand.
Although the data structure is typically conceived from top to bottom, which lower levels nested under higher levels, I will present the opposite order to prevent you reading from the bottom up.
Daily Performance
- on an item type (e.g., 4 + 17)
-- on a task (e.g., addition)
--- for a person (e.g, Bill)
---- in a teacher's class (e.g., Mrs. Collins, Section 1)
----- in a school (e.g., Robert D. Wilson)
------ in a district (e.g., Western Wayne)
------- in a country (e.g., Wayne)
-------- in a state (e.g., Pennsylvania)
--------- in a country (e.g., United States of America)
The structure at all these levels means that you can perform more complex models like linear-mixed models/hierarchical linear modeling. The variability in the data at all these levels allows you to account for the variability in ways you cannot if you only have top-level data. When you have the variability, you can examine for effect like random effects or fixed effects, which are not applicable to simpler linear models. To the extent that teachers in District A teach differently or motivate students differently from those teachers in District B, you could examine the data for these differences and account for them statistically (hold them constant as if teachers were the same) at higher levels of the data. In fact, that repeated-measures ANOVA model you are taught in your statistics class is not be appropriate for data with repeated measurements under different conditions because two individuals who with the same average performance may differ in their variability. For example, two students answer the subtraction problem (14 - 7) on average in 10.0 seconds across 5 instances. Mildred is consistent and answers it between 8 and 12 seconds. Tilford, however, answers with a range of 5 to 20 seconds. Although performance time is equated, the variability represents something very different in ability.
Aggregating Data
Data aggregation refers to how data are organized or compiled. When you have multiple levels of data as presented in the example above, data aggregation refers to the process of taking the raw data and gathering it in some way to summarize for statistical analysis or data visualization. For example, you could summarize data by individual, class, school, district, county, and state. Alternatively, you could summarize data by district, county, and state and ignore individuals, class, and school. The two types of data aggregation will result in different numbers of cases (rows) in a data frame most obviously but also in the actual numbers.
The variability in individuals as well as in subgroups can be useful to answering your data question. Models like linear mixed models are beyond the scope of this course, however. Whether you look into those types of model or you don’t, you should realize that you are making a big assumption in your data when you aggregate at any level. That assumption is that the variability in the data that are aggregated is the same across all cases in the data. Specifically, by aggregating Tilford’s and Mildred’s individual data in order to obtain their average performance, you have lost the variability at the individual level and essentially arbitrarily set that variability to 0. At higher level, aggregating the data by School
also means that you lose this information and make the assumption that there is no variability among individuals in each school (e.g., all individuals have the same value). Such assumptions are particularly relevant to repeated-measures models (e.g., repeated-measures ANOVA does not use this information) and are potentially not assumptions you want to make about your data.
In the end, how you interpret a model or a data visualization will be dependent on that aggregation of the data, modeled or displayed. The take home message about aggregating data, then, is that you trade off a loss of information for some summary. Only you can weigh the costs and benefits of this trade off. That said, sometimes you have no choice because your client, your employer, or your data resource has only data an an aggregate level. In those situations, you cannot do much except acknowledge the information loss in the ability of the model to predict an outcome. If, however, that aggregated level can still be summarized or aggregated in some way, you are back to reflecting on issues of aggregation.
Statistical Summaries
Whenever you create visualizations using {ggplot2}, there is some statistical transformation performed on the data. The data that are plotting represent that transformation. In some cases, the transformations are simple and in other cases they are complex. Let’s review some simple examples.
Identity and Count Transformations
Looking at the geom_point()
function, we see that stat = "identity"
by default. When "identity"
is set to stat
, this will call the stat_identity()
function for performing the transformation. Identity transformations are essentially a WYSIWYG (what you see is what you get) transformation, so this is not what you typically think of as a transformation.
geom_point(
mapping = NULL,
data = NULL,
stat = "identity",
position = "identity",
...,na.rm = FALSE,
show.legend = NA,
inherit.aes = TRUE
)
Using data, we see an example of geom_point()
with this default operation.
|>
SWIM ggplot(mapping = aes(x = Split50, y = Time)) +
geom_point(stat = "identity")
Warning: Removed 15 rows containing missing values or values outside the scale range
(`geom_point()`).
Other functions perform other statistical transformations on the data. For example, by default, geom_bar()
will count the instances of a variable because the statistical transformation is stat = "count"
.
geom_bar(
mapping = NULL,
data = NULL,
stat = "count",
position = "stack",
...,just = 0.5,
width = NULL,
na.rm = FALSE,
orientation = NA,
show.legend = NA,
inherit.aes = TRUE
)
And example with geom_bar()
shows this but of course, we need to specify only an x
or a y
aesthetic for geom_bar()
. In order to count something, we can use a variable that is not as distributed as those just used.
|>
SWIM ggplot(mapping = aes(x = Event)) +
geom_bar(stat = "count")
But how is geom_bar()
doing this transformation? Well, when stat = "count"
, the stat_count()
function is called with a default geom of geom = "bar"
, thus plotting the counts with bars.
stat_count(
mapping = NULL,
data = NULL,
geom = "bar",
position = "stack",
...,width = NULL,
na.rm = FALSE,
orientation = NA,
show.legend = NA,
inherit.aes = TRUE
)
Using stat_count()
specifically, we can create the same plot.
|>
SWIM ggplot(mapping = aes(x = Event)) +
stat_count(geom = "bar")
Other functions like stat_summary()
or stat_boxplot()
perform more complicated statistical transformations than simple counting. Let’s now address some of the complex transformations.
Statistical Transformations using stat_summary()
The fun
parameter
The stat_summary()
function will perform statistical transforms on your data frame, and then use that modified data frame to plotting. The most common ways to transform the data is to pass arguments to the fun
and fun.data
parameters.
stat_summary(
mapping = NULL,
data = NULL,
geom = "pointrange",
position = "identity",
...,fun.data = NULL,
fun = NULL,
fun.max = NULL,
fun.min = NULL,
fun.args = list(),
na.rm = FALSE,
orientation = NA,
show.legend = NA,
inherit.aes = TRUE,
fun.y = deprecated(),
fun.ymin = deprecated(),
fun.ymax = deprecated()
)
Initializing a plot and then adding stat_summary()
layer will run the defaults. You will see a geom = "pointrange"
which calls geom_pointrange()
. A geom_pointrange()
will need a y
, ymin
, and ymax
which by default is computed using mean_se()
(see console message), which is a mean +/- 1 standard error.
|>
SWIM ggplot(mapping = aes(x = Event, y = Time)) +
stat_summary()
No summary function supplied, defaulting to `mean_se()`
OK, so this is a complicated first step. Let’s use a simpler approach by setting fun = mean
to compute the mean in the form of a geom = "bar"
.
|>
SWIM ggplot(mapping = aes(x = Event, y = Time)) +
stat_summary(fun = mean,
geom = "bar"
)
And setting fun = median
to computes the median:
|>
SWIM ggplot(mapping = aes(x = Event, y = Time)) +
stat_summary(fun = median,
geom = "bar"
)
Adding a second layer illustates how you can change the function and the geom. In particular, you can see when the distribution is skewed positively, negatively, or not so much.
|>
SWIM ggplot(mapping = aes(x = Event, y = Time)) +
stat_summary(fun = median,
geom = "bar"
+
) stat_summary(fun = mean,
geom = "point"
)
The fun.data
parameter
Compared with fun
, fun.data
will need a function that returns a data frame with a y
, ymin
, and a ymax
. The mean_se
function used by default in stat_summary()
provides those values but we can set them manually.
|>
SWIM ggplot(mapping = aes(x = Event, y = Time)) +
stat_summary(fun.data = mean_se)
But if you want to plot 2 standard errors of the mean, you will need to use fun.args = list(mult = 2)
for passing a multiple of the mean_se
function. This really isn’t so bad when you understand how to do it.
|>
SWIM ggplot(mapping = aes(x = Event, y = Time)) +
stat_summary(fun.data = mean_se,
fun.args = list(mult = 2))
But {ggplot2} has other built in functions, like mean_sdl
which will compute a mean +/- 1 standard deviation.
|>
SWIM ggplot(mapping = aes(x = Event, y = Time)) +
stat_summary(fun.data = mean_sdl)
And mean_cl_boot
will compute a mean and a bootstrapped confidence interval.
|>
SWIM ggplot(mapping = aes(x = Event, y = Time)) +
stat_summary(fun.data = mean_cl_boot)
We can also create our own functions to pass to fun
if there is single value or to pass to fun.data
if there is a y
, ymin
, and ymax
. Let’s create them using function()
.
<- function(x) mean(x, na.rm = TRUE, trim = 0.1) # Trim 10% of the values
trimmed_mean
<- function(x) {
iqr_summary = na.omit(x)
x data.frame(
y = median(x),
ymin = quantile(x, 0.25),
ymax = quantile(x, 0.75)
) }
Pass fun = trimmed_mean
:
|>
SWIM ggplot(mapping = aes(x = Event, y = Time)) +
stat_summary(fun = trimmed_mean)
Or pass fun.data = iqr_summary
:
|>
SWIM ggplot(mapping = aes(x = Event, y = Time)) +
stat_summary(fun.data = iqr_summary)
Statistical Transformations using geom_boxplot()
Although you can plot quartiles as just shown, a box-and-whisked plot is the best for this. Beyond geom_mooth()
and stat_summary()
, geom_boxplot()
also performs statistical transformations. Although this geom
is beyond the scope of this topic, we can at least see the statistical transformation is very unique, stat = "boxplot"
.
When you want to perform a compl
geom_boxplot(
mapping = NULL,
data = NULL,
stat = "boxplot",
position = "dodge2",
...,outliers = TRUE,
outlier.colour = NULL,
outlier.color = NULL,
outlier.fill = NULL,
outlier.shape = 19,
outlier.size = 1.5,
outlier.stroke = 0.5,
outlier.alpha = NULL,
notch = FALSE,
notchwidth = 0.5,
staplewidth = 0,
varwidth = FALSE,
na.rm = FALSE,
orientation = NA,
show.legend = NA,
inherit.aes = TRUE
)
We can see how the box plot provides quartile information as we did with a custom function. The boxplot provides additional information as well. A topic for another time.
|>
SWIM filter(Time < 500) |>
ggplot(mapping = aes(x = Event, y = Time)) +
geom_boxplot()
Aggregating Data using group_by()
{dplyr} has one main verb function for grouping data. group_by()
orders the groups in an ascending order so that functions performed on the subgroups will return a data frame that is ordered based on the grouping variable(s). Other functions like group_map()
, group_nest()
, group_split()
, and group_trim()
exist but they are experimental for {dplyr} r
packageVersion(“dplyr”)`. Although they allow for fun tasks like passing data subsets to statistical models, their experimental life cycle suggests you might not get to comfortable with them.
Data aggregation using group_by()
Let’s go through a simple example of data aggregation using group_by()
.
<- data.frame(
DATA Student = c("Jim", "Sally", "June", "Mildred", "Tilford", "Beavis", "Herman", "Peppa", "Kay", "Jake", "Name Missing"),
Score = c(80, 85, 79, 80, 81, 91, 89, 60, 65, 67, 65),
School = c("A", "A", "B", "B", "C", "C", "C", "D", "D", "E", "E"),
District = c("West", "West", "West", "West", "East", "East", "East", "North", "North" , "North", "North")
)
You can see that the performance for all students is on average, for example, if you didn’t know where they went to school. We would take only the mean.
|>
DATA summarize(Score = mean(Score)) # calculate average
Score
1 76.54545
Let’s say we try to determine average performance for “all students” if we had the data aggregated as the school level and not the individual level.
|>
DATA group_by(School) |>
summarize(Score = mean(Score)) |> # get the school level aggregated data
summarize(Score = mean(Score)) # then calculate average across all schools
# A tibble: 1 × 1
Score
<dbl>
1 75.5
Let’s say we try to determine average performance for “all students” if we had the data aggregated as the district level and not the individual level.
|>
DATA group_by(District) |>
summarize(Score = mean(Score)) |> # get the district level aggregated data
summarize(Score = mean(Score)) # then calculate average across all districts
# A tibble: 1 × 1
Score
<dbl>
1 77.4
So what’s the average performance for students? Well, that interpretation differs based on how values are treated. At the highest level, all students’ Scores
are weighed equally. Each student contributes to the data in the same way. When aggregated by School
, each school contributes to the calculation equally, even if the number of bodies per school differs. And finally, when data are aggregated at the District
level, all districts are treated equally in the calculation independent on the number of schools in a district. Only with weighting means would you end up with the same average performance.
Measures of variability in the data, however, reveal something else. Schools differ in variability and schools within districts vary as well. Depending on the level of aggregation, you may never notice interesting patterns in data that lead to questions to investigate and later policies to change. For example, addressing a school that is left behind others or a district that is left behind others. Aggregation can also lead to inefficient allocations of resources. If one school in a district needs help rather than all schools in the district needing help, the cost may differ substantially.
<- function(x) { sd(na.omit(x)) / sqrt(length(na.omit(x))) }
std_error
|>
DATA group_by(School, District) |>
summarize(Var = var(Score),
SD = sd(Score),
SEM = std_error(Score),
min = min(Score),
max = max(Score)
|>
) mutate(range = (max - min))
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
# A tibble: 5 × 8
# Groups: School [5]
School District Var SD SEM min max range
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A West 12.5 3.54 2.5 80 85 5
2 B West 0.5 0.707 0.5 79 80 1
3 C East 28 5.29 3.06 81 91 10
4 D North 12.5 3.54 2.5 60 65 5
5 E North 2 1.41 1 65 67 2
Comparing mutate() with summarize()
{dplyr} has two main verb functions for creating variables. They operate differently and their returned tibble also differ. If you have not been paying attention to the returned tibble post calculation, you might have overlooked the differences.
Grouping structure of group_by()
and mutate()
Let’s mutate()
the mean for Score
after sub-setting with group_by()
.
|>
DATA group_by(School, District) |>
mutate(Score = mean(na.omit(Score)))
# A tibble: 11 × 4
# Groups: School, District [5]
Student Score School District
<chr> <dbl> <chr> <chr>
1 Jim 82.5 A West
2 Sally 82.5 A West
3 June 79.5 B West
4 Mildred 79.5 B West
5 Tilford 87 C East
6 Beavis 87 C East
7 Herman 87 C East
8 Peppa 62.5 D North
9 Kay 62.5 D North
10 Jake 66 E North
11 Name Missing 66 E North
There are two things to watch when using this mutate()
following group_by()
.
1. A Score
will be assigned to each row/case or Student
in the data frame.
When schools and districts are grouped, each student in the same school will have the same assigned average value. All rows are maintained, none dropped.
2. The returned tibble takes on a new structure.
Looking at the feedback in the console, you see the following report preceding the data.
A tibble: 11 × 3
# Groups: School, District [5]
You also see this structure using glimpse()
.
|>
DATA group_by(School, District) |>
mutate(Score = mean(na.omit(Score))) |>
glimpse()
Rows: 11
Columns: 4
Groups: School, District [5]
$ Student <chr> "Jim", "Sally", "June", "Mildred", "Tilford", "Beavis", "Herm…
$ Score <dbl> 82.5, 82.5, 79.5, 79.5, 87.0, 87.0, 87.0, 62.5, 62.5, 66.0, 6…
$ School <chr> "A", "A", "B", "B", "C", "C", "C", "D", "D", "E", "E"
$ District <chr> "West", "West", "West", "West", "East", "East", "East", "Nort…
Rows: 11
Columns: 3
Groups: School, District [5]
What does the [5]
mean? Well, before we answer this, let’s use summarize()
.
Grouping structure of group_by()
and summarize()
Let’s summarize()
the mean for Score
after sub-setting with group_by()
.
|>
DATA group_by(School, District) |>
summarize(Score = mean(Score))
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
# A tibble: 5 × 3
# Groups: School [5]
School District Score
<chr> <chr> <dbl>
1 A West 82.5
2 B West 79.5
3 C East 87
4 D North 62.5
5 E North 66
There are two things to watch when using this summarize()
following group_by()
.
1. A Score
average is assigned to each row/case or School
in this aggregated data frame.
When schools and districts are grouped, each school will have its own average value. Rows from the data frame are dropped as are columns that are not passed to group_by()
. In this case, School
, District
, and the new variable, Score
are returned.
2. The returned tibble takes on a new structure.
The feedback in the console is a little more detailed here.
`summarise()` has grouped output by 'School'. You can override using the `.groups` argument.
# A tibble: 5 × 3
# Groups: School [5]
You see reference to overriding the grouping using .groups
. According to the tidyverse documentation for grouping, this parameter “controls the grouping structure of the output. The historical behavior of removing the right hand side grouping variable corresponds to .groups = "drop_last"
without a message or .groups = NULL
with a message (the default)”.
You have most likely paid little attention to this message. By default summarize()
keeps the first grouping variable passed to group_by()
in the returned tibble. This is why you see School
referenced and not District
or both variables. So, do you want your data frame to contain groups or no groups? Stated differently, do you just want that data summarized by your grouping variables and have a simple nxm data frame or do you want something more complex?
To see the structure better, let’s first look a the structure of DATA
and then more closely at the summarized version.
str(DATA)
'data.frame': 11 obs. of 4 variables:
$ Student : chr "Jim" "Sally" "June" "Mildred" ...
$ Score : num 80 85 79 80 81 91 89 60 65 67 ...
$ School : chr "A" "A" "B" "B" ...
$ District: chr "West" "West" "West" "West" ...
Let’s assign the summarized data frame to an object for inspection.
<-
DSUM |>
DATA group_by(School, District) |>
summarize(Score = mean(Score))
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
For DSUM
, the structure is different. In particular, you see reference to (S3: grouped_df/tbl_df/tbl/data.frame)
. This tells you that the data frame is not a simple nxm but contains groups. You could think of this as 3 nxm data frames organized together. For more details, see the dplyr documentation.
str(DSUM)
gropd_df [5 × 3] (S3: grouped_df/tbl_df/tbl/data.frame)
$ School : chr [1:5] "A" "B" "C" "D" ...
$ District: chr [1:5] "West" "West" "East" "North" ...
$ Score : num [1:5] 82.5 79.5 87 62.5 66
- attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
..$ School: chr [1:5] "A" "B" "C" "D" ...
..$ .rows : list<int> [1:5]
.. ..$ : int 1
.. ..$ : int 2
.. ..$ : int 3
.. ..$ : int 4
.. ..$ : int 5
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE
group_by()
and summarize()
with .groups
When you group data using group_by()
and then summarize()
, the summary variables (e.g., mean) will result in a single row for each level of a single grouping variable. If there are more than one grouping variable, the additional grouping variable will be introduced to the data frame as a second column. The total number of rows in the data frame will be equal to the number of levels of group 1 x number of levels of group 2 if an only if each levels of group 1 has a corresponding level for group 2,
for example:
sex age mean
men young x
men old x
women young x
women old x
If there is no pairing of levels in the data (e.g., no men who are old), that row will be omitted from the returned data frame.
for example:
sex age mean
men young x
women young x
women old x
rather than:
sex age mean
men young x
men old NA
women young x
women old x
You really need to query ?dplyr::summarize
to understand .groups
, which controls the grouping of the returned data frame. This is also experimental to summarize()
, so it might not be available in the future.
There are four argument options:
"drop_last"
: dropping the last level of grouping. This was the only supported option before version 1.0.0."drop"
: All levels of grouping are dropped."keep"
: Same grouping structure as .data."rowwise"
: Each row is its own group.
.groups = "drop_last"
:
|>
SWIM group_by(Event, Distance) |>
summarize(Time = mean(Time),
.groups = "drop_last")
# A tibble: 14 × 3
# Groups: Event [6]
Event Distance Time
<chr> <int> <dbl>
1 Backstroke 100 59.8
2 Backstroke 200 124.
3 Breaststroke 100 61.6
4 Breaststroke 200 139.
5 Butterfly 100 55.3
6 Butterfly 200 125.
7 Freestyle 50 23.6
8 Freestyle 100 51.9
9 Freestyle 200 107.
10 Freestyle 500 306.
11 Freestyle 1650 1143.
12 IM 200 126.
13 IM 400 269.
14 Medley 200 104.
You will see that the grouping for Distance
is not in the grouping structure because it was the last grouping.
# A tibble: 14 × 3
# Groups: Event [6]
Event Distance Time
.groups = "drop"
:
|>
SWIM group_by(Event, Distance) |>
summarize(Time = mean(Time),
.groups = "drop")
# A tibble: 14 × 3
Event Distance Time
<chr> <int> <dbl>
1 Backstroke 100 59.8
2 Backstroke 200 124.
3 Breaststroke 100 61.6
4 Breaststroke 200 139.
5 Butterfly 100 55.3
6 Butterfly 200 125.
7 Freestyle 50 23.6
8 Freestyle 100 51.9
9 Freestyle 200 107.
10 Freestyle 500 306.
11 Freestyle 1650 1143.
12 IM 200 126.
13 IM 400 269.
14 Medley 200 104.
You will see that all group are dropped from the grouping structure.
# A tibble: 14 × 3
Event Distance Time
.groups = "keep"
:
|>
SWIM group_by(Event, Distance) |>
summarize(Time = mean(Time),
.groups = "keep")
# A tibble: 14 × 3
# Groups: Event, Distance [14]
Event Distance Time
<chr> <int> <dbl>
1 Backstroke 100 59.8
2 Backstroke 200 124.
3 Breaststroke 100 61.6
4 Breaststroke 200 139.
5 Butterfly 100 55.3
6 Butterfly 200 125.
7 Freestyle 50 23.6
8 Freestyle 100 51.9
9 Freestyle 200 107.
10 Freestyle 500 306.
11 Freestyle 1650 1143.
12 IM 200 126.
13 IM 400 269.
14 Medley 200 104.
You will see that all group are kept in the grouping structure.
# A tibble: 14 × 3
# Groups: Event, Distance [14]
Event Distance Time
.groups = "rowwise"
:
|>
SWIM group_by(Event, Distance) |>
summarize(Time = mean(Time),
.groups = "rowwise")
# A tibble: 14 × 3
# Rowwise: Event, Distance
Event Distance Time
<chr> <int> <dbl>
1 Backstroke 100 59.8
2 Backstroke 200 124.
3 Breaststroke 100 61.6
4 Breaststroke 200 139.
5 Butterfly 100 55.3
6 Butterfly 200 125.
7 Freestyle 50 23.6
8 Freestyle 100 51.9
9 Freestyle 200 107.
10 Freestyle 500 306.
11 Freestyle 1650 1143.
12 IM 200 126.
13 IM 400 269.
14 Medley 200 104.
You will see that there groups but these are not based on the
# A tibble: 14 × 3
# Rowwise: Event, Distance
The grouping structure always matters for subsequent computations. But with "rowwise"
you will see this clearly. Following from above, let’s say we wanted to compute the mean across all of the Events
in the data frame returned by group_by()
then summarize()
. We add mutate(mean = mean(Time))
on a new line of our piped code block.
|>
SWIM group_by(Event, Distance) |>
summarize(Time = mean(Time),
.groups = "rowwise") |>
mutate(mean = mean(Time))
# A tibble: 14 × 4
# Rowwise: Event, Distance
Event Distance Time mean
<chr> <int> <dbl> <dbl>
1 Backstroke 100 59.8 59.8
2 Backstroke 200 124. 124.
3 Breaststroke 100 61.6 61.6
4 Breaststroke 200 139. 139.
5 Butterfly 100 55.3 55.3
6 Butterfly 200 125. 125.
7 Freestyle 50 23.6 23.6
8 Freestyle 100 51.9 51.9
9 Freestyle 200 107. 107.
10 Freestyle 500 306. 306.
11 Freestyle 1650 1143. 1143.
12 IM 200 126. 126.
13 IM 400 269. 269.
14 Medley 200 104. 104.
The mean
column does not contain a single mean replicated for each row in the data frame. Rather the grouping was per row, so each row has its one mean. The mean of a single value is, of course, itself.
Revisiting ".groups = drop_last"
:
Returning to the default .groups
structure, which is "drop_last"
, we can add the same mutate()
to see what happens.
|>
SWIM group_by(Event, Distance) |>
summarize(Time = mean(Time),
.groups = "drop_last") |>
mutate(mean = mean(Time))
# A tibble: 14 × 4
# Groups: Event [6]
Event Distance Time mean
<chr> <int> <dbl> <dbl>
1 Backstroke 100 59.8 92.1
2 Backstroke 200 124. 92.1
3 Breaststroke 100 61.6 100.
4 Breaststroke 200 139. 100.
5 Butterfly 100 55.3 90.4
6 Butterfly 200 125. 90.4
7 Freestyle 50 23.6 326.
8 Freestyle 100 51.9 326.
9 Freestyle 200 107. 326.
10 Freestyle 500 306. 326.
11 Freestyle 1650 1143. 326.
12 IM 200 126. 197.
13 IM 400 269. 197.
14 Medley 200 104. 104.
The mean
column still has different values but they are replicated on rows with the same Event
group. This is because the data are grouped that way.
Revisiting .groups = "drop"
:
Let’s again add the same mutate()
to see what happens.
|>
SWIM group_by(Event, Distance) |>
summarize(Time = mean(Time),
.groups = "drop") |>
mutate(mean = mean(Time))
# A tibble: 14 × 4
Event Distance Time mean
<chr> <int> <dbl> <dbl>
1 Backstroke 100 59.8 193.
2 Backstroke 200 124. 193.
3 Breaststroke 100 61.6 193.
4 Breaststroke 200 139. 193.
5 Butterfly 100 55.3 193.
6 Butterfly 200 125. 193.
7 Freestyle 50 23.6 193.
8 Freestyle 100 51.9 193.
9 Freestyle 200 107. 193.
10 Freestyle 500 306. 193.
11 Freestyle 1650 1143. 193.
12 IM 200 126. 193.
13 IM 400 269. 193.
14 Medley 200 104. 193.
Consider ungroup()
:
.groups = "drop"
in effect works the same as does ungroup()
. The grouping structure is broken and all subsequent operations are based on the ungrouped data frame.
|>
SWIM group_by(Event, Distance) |>
summarize(Time = mean(Time)) |>
ungroup() |>
mutate(mean = mean(Time))
`summarise()` has grouped output by 'Event'. You can override using the
`.groups` argument.
# A tibble: 14 × 4
Event Distance Time mean
<chr> <int> <dbl> <dbl>
1 Backstroke 100 59.8 193.
2 Backstroke 200 124. 193.
3 Breaststroke 100 61.6 193.
4 Breaststroke 200 139. 193.
5 Butterfly 100 55.3 193.
6 Butterfly 200 125. 193.
7 Freestyle 50 23.6 193.
8 Freestyle 100 51.9 193.
9 Freestyle 200 107. 193.
10 Freestyle 500 306. 193.
11 Freestyle 1650 1143. 193.
12 IM 200 126. 193.
13 IM 400 269. 193.
14 Medley 200 104. 193.
Keep in mind that functions work as they are programmed to work. Functions do not work like you think they work. You must understand the function and check our work to ensure your calculations are what you intend them to be.
The order of operations matters
To illustrate further, consider you want to calculate some summary statistics. You set out to obtain the mean and the standard deviation for your data. Those computations will be performed according to the grouping structure.
When you compute standard deviation and the mean of Time
, you assign the mean to Time
because you want your plot to contain a clean name rather than one like Time_Mean
that you will have to address in the plot.
|>
SWIM group_by(Event, Distance) |>
summarize(Time_sd = sd(Time), # get sd of Time
Time = mean(Time) # get mean, assign to same name
)
`summarise()` has grouped output by 'Event'. You can override using the
`.groups` argument.
# A tibble: 14 × 4
# Groups: Event [6]
Event Distance Time_sd Time
<chr> <int> <dbl> <dbl>
1 Backstroke 100 4.47 59.8
2 Backstroke 200 8.02 124.
3 Breaststroke 100 5.76 61.6
4 Breaststroke 200 10.5 139.
5 Butterfly 100 4.32 55.3
6 Butterfly 200 7.74 125.
7 Freestyle 50 1.99 23.6
8 Freestyle 100 4.00 51.9
9 Freestyle 200 10.8 107.
10 Freestyle 500 20.3 306.
11 Freestyle 1650 98.2 1143.
12 IM 200 8.13 126.
13 IM 400 22.2 269.
14 Medley 200 3.41 104.
Both variables use the same data because the standard deviation assigns the value to a different variable name Time_sd
. The mean()
is not based on some changed variable.
The output is different from the one returned when the mean is computed before the standard deviation and in particular when the mean is assigned to Time
. In this case, the standard deviation is based on this new Time
variable. Because the standard deviation is a measure of variability, and Time
does not vary based on the grouping structure, NA
is returned.
|>
SWIM group_by(Event, Distance) |>
summarize(Time = mean(Time), # get mean, assign to same name
Time_sd = sd(Time) # get sd of Time
)
`summarise()` has grouped output by 'Event'. You can override using the
`.groups` argument.
# A tibble: 14 × 4
# Groups: Event [6]
Event Distance Time Time_sd
<chr> <int> <dbl> <dbl>
1 Backstroke 100 59.8 NA
2 Backstroke 200 124. NA
3 Breaststroke 100 61.6 NA
4 Breaststroke 200 139. NA
5 Butterfly 100 55.3 NA
6 Butterfly 200 125. NA
7 Freestyle 50 23.6 NA
8 Freestyle 100 51.9 NA
9 Freestyle 200 107. NA
10 Freestyle 500 306. NA
11 Freestyle 1650 1143. NA
12 IM 200 126. NA
13 IM 400 269. NA
14 Medley 200 104. NA
If you really wanted the standard deviation of all the means, consider ungrouping and then compute the standard deviation or use .groups = "drop"
in summarize()
. Realize, however, this latter functionality is experimental and may not work sometime later. You are likely better off using ungroup()
.
|>
SWIM group_by(Event, Distance) |>
summarize(Time = mean(Time)) |> # get mean, assign to same name
ungroup() |>
mutate(sd = sd(Time))
`summarise()` has grouped output by 'Event'. You can override using the
`.groups` argument.
# A tibble: 14 × 4
Event Distance Time sd
<chr> <int> <dbl> <dbl>
1 Backstroke 100 59.8 285.
2 Backstroke 200 124. 285.
3 Breaststroke 100 61.6 285.
4 Breaststroke 200 139. 285.
5 Butterfly 100 55.3 285.
6 Butterfly 200 125. 285.
7 Freestyle 50 23.6 285.
8 Freestyle 100 51.9 285.
9 Freestyle 200 107. 285.
10 Freestyle 500 306. 285.
11 Freestyle 1650 1143. 285.
12 IM 200 126. 285.
13 IM 400 269. 285.
14 Medley 200 104. 285.
More on Grouping Structure
We can now also consider ways to create column variables and a corresponding grouping structure within the group_by()
verb.
New Variable Columns
group_by()
is not designed to create new variables but rather create groups. The function, however, does not require variables in a data frame in order to group based on their levels or differences in values.
group_by()
using a function
Heretofore, we have grouped by column variables in a data frame. But you can also group in other ways, for example, by a variable calculated by a function. For example, if you wanted to group by standard deviations, you could calculated the standard deviation (Score-mean(Score)) / sd(Score)
for each student and then cut()
that variable into groups. cut
will take a numeric vector variable and turn it into a factor variable of n groups as determined by what you pass to breaks
. There is also an argument to make the factor ordered if you wish, ordered_result = TRUE
but the default behavior does not order the factor. You can also change the level labels if you inspect the function.
|>
DATA group_by(z_factor = cut(x = ((Score - mean(Score)) / sd(Score)),
breaks = 5,
ordered_result = TRUE
) )
# A tibble: 11 × 5
# Groups: z_factor [4]
Student Score School District z_factor
<chr> <dbl> <chr> <chr> <ord>
1 Jim 80 A West (0.194,0.781]
2 Sally 85 A West (0.781,1.37]
3 June 79 B West (0.194,0.781]
4 Mildred 80 B West (0.194,0.781]
5 Tilford 81 C East (0.194,0.781]
6 Beavis 91 C East (0.781,1.37]
7 Herman 89 C East (0.781,1.37]
8 Peppa 60 D North (-1.57,-0.979]
9 Kay 65 D North (-1.57,-0.979]
10 Jake 67 E North (-0.979,-0.392]
11 Name Missing 65 E North (-1.57,-0.979]
You can see that there are 4 different levels of the grouping variable. The tibble is grouped of course too.
If you struggle with remembering the formula for a z score, scale()
will to the same thing.
|>
DATA mutate(z = (Score - mean(Score)) / sd(Score),
scale = scale(Score)
)
Student Score School District z scale
1 Jim 80 A West 0.3269018 0.3269018
2 Sally 85 A West 0.8000492 0.8000492
3 June 79 B West 0.2322724 0.2322724
4 Mildred 80 B West 0.3269018 0.3269018
5 Tilford 81 C East 0.4215313 0.4215313
6 Beavis 91 C East 1.3678261 1.3678261
7 Herman 89 C East 1.1785671 1.1785671
8 Peppa 60 D North -1.5656877 -1.5656877
9 Kay 65 D North -1.0925403 -1.0925403
10 Jake 67 E North -0.9032814 -0.9032814
11 Name Missing 65 E North -1.0925403 -1.0925403
If you want n groups based on specific breaks, then pass a vector of those break units. For example, if we want levels to correspond to some meaningful standard-deviation cuts. For example, if you wanted to group those with z scores ranging from infinitely negative to -2, -2 to -1, - to 1, 1 to 2, and 2 to infinitely large, you could specify the breaks. If there are no values in those ranges, then there won’t be any data for those breaks.
|>
DATA group_by(z_factor = cut(x = scale(Score),
breaks = c(-Inf, -2, -1, 1, 2, Inf),
ordered_result = TRUE
), |>
) ungroup()
# A tibble: 11 × 5
Student Score School District z_factor
<chr> <dbl> <chr> <chr> <ord>
1 Jim 80 A West (-1,1]
2 Sally 85 A West (-1,1]
3 June 79 B West (-1,1]
4 Mildred 80 B West (-1,1]
5 Tilford 81 C East (-1,1]
6 Beavis 91 C East (1,2]
7 Herman 89 C East (1,2]
8 Peppa 60 D North (-2,-1]
9 Kay 65 D North (-2,-1]
10 Jake 67 E North (-1,1]
11 Name Missing 65 E North (-2,-1]
You can also just group by the function without using cut()
if you wish to group by those with identical values on the variable. In this case, using count()
or tally()
reveals there are only two instances with the same score.
|>
DATA group_by(z_factor = ((Score - mean(Score)) / sd(Score))) |>
count(sort = TRUE) # tally(sort = TRUE)
# A tibble: 9 × 2
# Groups: z_factor [9]
z_factor n
<dbl> <int>
1 -1.09 2
2 0.327 2
3 -1.57 1
4 -0.903 1
5 0.232 1
6 0.422 1
7 0.800 1
8 1.18 1
9 1.37 1
Add/drop Grouping Variable Columns
By default, group_by()
on a data frame that is already grouped (see earlier on grouped data frame), the existing grouping structure will be replaced by new grouping structure.
Let’s get an example. Please note that this example assigned the tibble to an object but whether you assign or not, the grouped structure exists. So functions that follow the group_by()
keep that structure.
<-
school_grouped |>
DATA group_by(School)
school_grouped
# A tibble: 11 × 4
# Groups: School [5]
Student Score School District
<chr> <dbl> <chr> <chr>
1 Jim 80 A West
2 Sally 85 A West
3 June 79 B West
4 Mildred 80 B West
5 Tilford 81 C East
6 Beavis 91 C East
7 Herman 89 C East
8 Peppa 60 D North
9 Kay 65 D North
10 Jake 67 E North
11 Name Missing 65 E North
Now group by a new column:
|>
school_grouped group_by(District)
# A tibble: 11 × 4
# Groups: District [3]
Student Score School District
<chr> <dbl> <chr> <chr>
1 Jim 80 A West
2 Sally 85 A West
3 June 79 B West
4 Mildred 80 B West
5 Tilford 81 C East
6 Beavis 91 C East
7 Herman 89 C East
8 Peppa 60 D North
9 Kay 65 D North
10 Jake 67 E North
11 Name Missing 65 E North
You will see the grouping structure has changed.
# A tibble: 11 × 4
# Groups: District [3]
Retain Grouping
If you want to retain the existing group, you would need to use .add = TRUE
.
|>
school_grouped group_by(District, .add = TRUE)
# A tibble: 11 × 4
# Groups: School, District [5]
Student Score School District
<chr> <dbl> <chr> <chr>
1 Jim 80 A West
2 Sally 85 A West
3 June 79 B West
4 Mildred 80 B West
5 Tilford 81 C East
6 Beavis 91 C East
7 Herman 89 C East
8 Peppa 60 D North
9 Kay 65 D North
10 Jake 67 E North
11 Name Missing 65 E North
Notice in the console, that both School
and District
are included in the groups.
# A tibble: 11 × 4
# Groups: School, District [5]
Why Care About Grouping Structure
Well, the functions you apply to a grouped tibble will sometimes lead to calculations that are not what you intend. We provided some examples earlier but given the importance of the issue, we may benefit from another example.
Let’s say you want to compute the average for each school and add that school average for each student. This would tell you how the student differs from their school performance. Then you want to obtain the mean of all the schools and see how the school differs from all the schools.
You code it out:
|>
DATA # group
group_by(School) |>
# calculate the mean of Score for each School
mutate(School_Mean = mean(Score)) |>
# calculate the mean of all values (think Grand Mean from stats)
mutate(Mean_of_all = mean(School_Mean)) |>
# calculate the school performance relative to all
mutate(School_Performance = factor(case_when(
< Mean_of_all ~ "Below Average",
School_Mean == Mean_of_all ~ "Average",
School_Mean > Mean_of_all ~ "Above Average"
School_Mean
), ordered = TRUE)
)
# A tibble: 11 × 7
# Groups: School [5]
Student Score School District School_Mean Mean_of_all School_Performance
<chr> <dbl> <chr> <chr> <dbl> <dbl> <ord>
1 Jim 80 A West 82.5 82.5 Average
2 Sally 85 A West 82.5 82.5 Average
3 June 79 B West 79.5 79.5 Average
4 Mildred 80 B West 79.5 79.5 Average
5 Tilford 81 C East 87 87 Average
6 Beavis 91 C East 87 87 Average
7 Herman 89 C East 87 87 Average
8 Peppa 60 D North 62.5 62.5 Average
9 Kay 65 D North 62.5 62.5 Average
10 Jake 67 E North 66 66 Average
11 Name Missing 65 E North 66 66 Average
Perfect. OK, let’s plot it. Oh but first let me inspect the data. Um, why is School_Performance
the same for all Students
? Schools are not performing the same so they cannot all be average. Check your case_when()
for errors because that where the new variable was created. Then you spend 40 days and 40 nights trying to fix your code. No matter what you do with case_when()
, you cannot fix the problem. So you try to create 42 data frames to solve your problem. Even grandma knows that is a ridiculous strategy. She suggests you read the documentation for all of the functions you used because one time her pot-luck cake flopped and she inspected the expiration date for all of her ingredients and found the baking soda was old.
You discover that the grouping structure is retained on all operations until that grouping structure is removed or replaced.
ungroup()
Whereas summarize has a .groups
functionality, mutate()
does not. In order to have a simple data frame use follow group_by()
and mutate()
with ungroup()
. But ungroup()
can be used following both functions if that approach makes more sense to you.
Let’s add an ungroup()
directly after performing the operation that needed the grouping structure.
|>
DATA # group
group_by(School) |>
# calculate the mean of Score for each School
mutate(School_Mean = mean(Score)) |>
# break that grouping structure
ungroup() |>
# calculate the mean of all values (think Grand Mean from stats)
mutate(Mean_of_all = mean(School_Mean)) |>
# calculate the school performance relative to all
mutate(School_Performance = factor(case_when(
< Mean_of_all ~ "Below Average",
School_Mean == Mean_of_all ~ "Average",
School_Mean > Mean_of_all ~ "Above Average"
School_Mean
), ordered = TRUE)
)
# A tibble: 11 × 7
Student Score School District School_Mean Mean_of_all School_Performance
<chr> <dbl> <chr> <chr> <dbl> <dbl> <ord>
1 Jim 80 A West 82.5 76.5 Above Average
2 Sally 85 A West 82.5 76.5 Above Average
3 June 79 B West 79.5 76.5 Above Average
4 Mildred 80 B West 79.5 76.5 Above Average
5 Tilford 81 C East 87 76.5 Above Average
6 Beavis 91 C East 87 76.5 Above Average
7 Herman 89 C East 87 76.5 Above Average
8 Peppa 60 D North 62.5 76.5 Below Average
9 Kay 65 D North 62.5 76.5 Below Average
10 Jake 67 E North 66 76.5 Below Average
11 Name Missing 65 E North 66 76.5 Below Average
Yes! Mean_of_all
is now a true grand mean based on all of the data with no grouping at all. Comparing Mean_of_all
to School_Mean
provides two classifications.
Variable loss with summarize()
Because summarize()
returns a tibble based on the grouping structure, you will lose all variables that are not in the grouping structure. By design, it no longer allows you to return a data frame with duplicated rows based on the variables passed to group_by()
.
Returning to the school data, what this means, is that you cannot retain the School
if you group by the District
in order to obtain averages by district.
|>
DATA group_by(District) |>
summarize(Mean = mean(Score))
# A tibble: 3 × 2
District Mean
<chr> <dbl>
1 East 87
2 North 64.2
3 West 81
If you try to add School
to group_by()
, then you will only obtain the averages by schools within districts.
|>
DATA group_by(School, District) |>
summarize(Mean = mean(Score))
`summarise()` has grouped output by 'School'. You can override using the
`.groups` argument.
# A tibble: 5 × 3
# Groups: School [5]
School District Mean
<chr> <chr> <dbl>
1 A West 82.5
2 B West 79.5
3 C East 87
4 D North 62.5
5 E North 66
You can try to summarize based on a variable without assign it to a function but that is no longer allowed with summarize()
. The following code block will throw an error.
DATA |>
group_by(District) |>
summarize(Mean = mean(Score),
School # add school
)
An alternative to summarize()
: reframe()
Instead, you can use reframe()
, which allows for previous functionality of summarize()
that is now deprecated.
|>
DATA group_by(District) |>
reframe(Mean = mean(Score),
School,
Student )
# A tibble: 11 × 4
District Mean School Student
<chr> <dbl> <chr> <chr>
1 East 87 C Tilford
2 East 87 C Beavis
3 East 87 C Herman
4 North 64.2 D Peppa
5 North 64.2 D Kay
6 North 64.2 E Jake
7 North 64.2 E Name Missing
8 West 81 A Jim
9 West 81 A Sally
10 West 81 B June
11 West 81 B Mildred
Remember that the operations in summarize()
, mutate()
, and reframe()
all depend on the grouping structure. If you wanted to add variables, you would need to group a different way and you would need to add all variables into summarize()
every step of the way.
|>
DATA group_by(District) |>
reframe(District_Mean = mean(Score),
School,
Student,
Score|>
) ungroup() |>
group_by(School) |>
reframe(School_Mean = mean(Score),
School,
Student,
Score,
District_Mean|>
) ungroup()
# A tibble: 11 × 5
School School_Mean Student Score District_Mean
<chr> <dbl> <chr> <dbl> <dbl>
1 A 82.5 Jim 80 81
2 A 82.5 Sally 85 81
3 B 79.5 June 79 81
4 B 79.5 Mildred 80 81
5 C 87 Tilford 81 87
6 C 87 Beavis 91 87
7 C 87 Herman 89 87
8 D 62.5 Peppa 60 64.2
9 D 62.5 Kay 65 64.2
10 E 66 Jake 67 64.2
11 E 66 Name Missing 65 64.2
Retaining distinct data after mutate()
using distinct()`
This is just too tedious. Your better option is mutate()
as all variables will be added to the full data frame. All variables are neatly and logically appended to the right hand side of the data frame.
|>
DATA group_by(District) |>
mutate(District_Mean = mean(Score)) |>
group_by(School) |>
mutate(School_Mean = mean(Score))
# A tibble: 11 × 6
# Groups: School [5]
Student Score School District District_Mean School_Mean
<chr> <dbl> <chr> <chr> <dbl> <dbl>
1 Jim 80 A West 81 82.5
2 Sally 85 A West 81 82.5
3 June 79 B West 81 79.5
4 Mildred 80 B West 81 79.5
5 Tilford 81 C East 87 87
6 Beavis 91 C East 87 87
7 Herman 89 C East 87 87
8 Peppa 60 D North 64.2 62.5
9 Kay 65 D North 64.2 62.5
10 Jake 67 E North 64.2 66
11 Name Missing 65 E North 64.2 66
Also, when you use a new group_by()
, it will replace previous grouping structure by default. The following two code blocks return the same data frame.
DATA |>
group_by(District) |>
mutate(District_Mean = mean(Score)) |>
group_by(School) |>
mutate(School_Mean = mean(Score))
DATA |>
group_by(District) |>
mutate(District_Mean = mean(Score)) |>
ungroup() |> # ungroup
group_by(School) |>
mutate(School_Mean = mean(Score))
If you wish those variables to remain, you have a couple of options. The first would be to use mutate()
.
Notice, however, that if you want to extract the summaries from the large data frame as you would have had you used summarize()
, you cannot just select()
your column of interest because the repetitions will exist for each row in the data frame.
|>
DATA group_by(District) |>
mutate(District_Mean = mean(Score)) |>
select(School, District_Mean)
Adding missing grouping variables: `District`
# A tibble: 11 × 3
# Groups: District [3]
District School District_Mean
<chr> <chr> <dbl>
1 West A 81
2 West A 81
3 West B 81
4 West B 81
5 East C 87
6 East C 87
7 East C 87
8 North D 64.2
9 North D 64.2
10 North E 64.2
11 North E 64.2
You can obtain the rows that are distinct()
(unique) and then pick()
the ones to keep.
|>
DATA group_by(District) |>
mutate(District_Mean = mean(Score)) |>
distinct(pick(District, District_Mean))
# A tibble: 3 × 2
# Groups: District [3]
District District_Mean
<chr> <dbl>
1 West 81
2 East 87
3 North 64.2
But you will need to ungroup()
before distinct()
because, as we have mentioned before, all subsequent functions other than group_by()
will maintain the previous grouping structure by default.
|>
DATA group_by(School) |>
mutate(School_Mean = mean(Score)) |>
ungroup()
# A tibble: 11 × 5
Student Score School District School_Mean
<chr> <dbl> <chr> <chr> <dbl>
1 Jim 80 A West 82.5
2 Sally 85 A West 82.5
3 June 79 B West 79.5
4 Mildred 80 B West 79.5
5 Tilford 81 C East 87
6 Beavis 91 C East 87
7 Herman 89 C East 87
8 Peppa 60 D North 62.5
9 Kay 65 D North 62.5
10 Jake 67 E North 66
11 Name Missing 65 E North 66
Visualizing Data
Visualizations of data aggregated differently will also convey information differently. The four variants of the data presented as bars communicate
Note: When you create geom_bar()
or geom_col()
plots, the bar width
will change depending on the number of the bars in the plot. A legend position will also affect bar width. Thus, the four resulting visualizations will contain different bar widths even if set to the same value.
<-
avg_plot |>
DATA summarize(Score = mean(Score)) |>
mutate(Average = "Average") |>
ggplot(mapping = aes(x = Average, Score)) +
geom_point() +
ylim(0, 100) +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) + labs(tag = "A")
<-
district_plot |>
DATA ggplot(mapping = aes(District, Score)) +
geom_point(position = position_jitter(width = .1)) +
ylim(0, 100) +
theme(axis.ticks.x = element_blank()) + labs(tag = "B")
<-
school_plot |>
DATA ggplot(mapping = aes(School, Score)) +
geom_point(position = position_jitter(width = .1)) +
ylim(0, 100) +
theme(axis.ticks.x = element_blank()) + labs(tag = "C")
<-
school_district_plot |>
DATA ggplot(mapping = aes(School, Score, col = District)) +
geom_point(position = position_jitter(width = .1)) +
ylim(0, 100) +
theme(legend.position = "top",
axis.ticks.x = element_blank()) + labs(tag = "D")
suppressWarnings(
plot(gridExtra::arrangeGrob(avg_plot,
district_plot,
school_plot, ncol = 2)
school_district_plot,
) )
Adding Data Summaries to Plots
But, let’s consider you don’t want to use color for your plot because the printing fees costs too much. Or perhaps, you are giving a talk to an audience of people who have achromatopsia. Whatever the reason, you need shapes. There are different ways to map shapes to variables. One way is with aes(shape = variable)
. Just like with mapping col
or fill
to variables results in a palette assignment, a set of shapes are used by default as well. If you don’t like those shapes, you can pass a name vector to scale_shape_manual(values = my_name_vector)
. Shapes, however, can also be mapped by variables as their identity. We have discussed identity values before. For shape, you need to find the shape values to assign a variable and then add a scale_shape_identity()
layer.
For this example, we will add two new variables to the data frame and then map each to the full data and the group data, respectively.
First, let’s look at different shapes.
::show_point_shapes() ggpubr
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
A Single Data Frame using mutate()
Let’s say that we want a plot that contains all of the data points jittered as well as the group means jittered so they don’t overlap. If we want one data frame to handle it, we will need to mutate()
the group means following group_by(Event, School)
. In the event that there happen to be exact or very similar means for the Schools, we will want to jitter the group means using position = position_jitter(height = 0, width = .1)
.
Let’s see what happens. But to lead into another step, we will just assign a new data frame to SWIM_NEW
.
<-
SWIM_NEW |>
SWIM filter(Time < 500) |>
mutate(
School = ifelse(stringr::str_detect(School, "Clare"), "CMS", "PP"),
Shape = ifelse(stringr::str_detect(School, "CMS"), 1, 2),
Shape2 = ifelse(stringr::str_detect(School, "CMS"), 21, 24),
|>
) # group both (creates nxm subsets)
group_by(Event, School) |>
mutate(Mean = mean(Time)) |>
ungroup()
Then pass to ggplot()
:
<-
full_identity_plot |>
SWIM_NEW ggplot(mapping = aes(x = Event)
+
) # plot all data (y = Time)
geom_point(mapping = aes(y = Time,
shape = Shape
),stroke = 1.2, # make a little thinker
alpha = .4, # a little transparent
position = position_jitter(seed = 167,
height = 0,
width = .2)
+
) # Add a layer using the grouped data (y = Mean)
geom_point(mapping = aes(y = Mean,
shape = Shape2 # filled
), fill = "grey50", # fill color
size = 3, # larger size
alpha = 1, # opaque
position = position_jitter(seed = 167, height = 0, width = .1)
+
) coord_flip() +
scale_shape_identity() + # use the actual values in the mapped variable
labs(caption = "Caption illustration:\nCMS (cicle), PP (triangle); Mean (filled)")
full_identity_plot
The Jitter problem with full data frame
Jittering takes place at the row level. Because the full data frame is used, each row plotted gets jitter. This means that the group mean that is on each row (redundant per group) is plotted. In such cases, you might be better off using one data frame for the full plot and a summarized data frame for the other plot. You can define the data frame in ggplot()
or in the specific geom_point()
. In this example, we will define it in ggplot()
along with aesthetics. This will be inherited by default for the first geom_point()
and then we can specify for the second geom_point()
.
Because there are different data frames, Time
can be used in both data sets. There is no need to have two different variables as there is for the full data set.
<-
SWIMMEAN |>
SWIM_NEW # group both (creates nxm subsets)
group_by(Event, School) |>
summarize(Time = mean(Time)) |>
ungroup() |>
# and we need to add a shape too
mutate(Shape = ifelse(stringr::str_detect(
string = tolower(School), pattern = "cms"),
21, 24)) # filled
`summarise()` has grouped output by 'Event'. You can override using the
`.groups` argument.
Notice that this data frame contains only four variables: Event
, School
, Time
, and Shape
. All aesthetics for a plot using that data set will be constrained to those variables. Because both data frames contain them, we can just add the aesthetics to the plot object rather than the geom.
<-
subset_identity_plot |>
SWIM_NEW ggplot(mapping = aes(x = Event,
y = Time,
shape = Shape
)+
) # plot all data (y = Time) from SWIM
geom_point(position = position_jitter(seed = 167,
height = 0,
width = .2),
alpha = .8,
size = 2
+
) # Add a layer using the grouped data (y = Time) from SWIMMEAN
geom_point(data = SWIMMEAN,
size = 3,
fill = "grey60",
stroke = 1,
alpha = 1,
position = position_jitter(seed = 167, height = 0, width = .3)
+
) coord_flip() +
scale_shape_identity() +
labs(caption = "Caption illustration:\nCMS (cicle), PP (triangle); Mean (filled)")
subset_identity_plot
Both plots together:
plot(gridExtra::arrangeGrob(full_identity_plot, subset_identity_plot, ncol = 1))
What you see in the second plot is what you might expect. Remember, that if you want to pass a grouping variable summary as an additional layer of a plot, you might be better off creating a separate data frame and pass it separately.
Examples of Color Versions
<-
color_plot1 |>
SWIM_NEW # group both (creates nxm subsets)
group_by(Event, School) |>
mutate(MeanTime = mean(Time)) |>
ungroup() |>
ggplot(mapping = aes(x = Event,
y = Time,
col = School,
fill = School
)+
) # plot all data (y = Time)
geom_point(position = position_jitter(seed = 167,
height = 0,
width = .2),
pch = 1,
alpha = .7,
stroke = 1,
size = 2
+
) # Add a layer using the grouped data (y = MeanTime)
geom_point(aes(y = MeanTime),
size = 2,
alpha = .4
+
) coord_flip() +
labs(caption = "Caption illustration:\nMeans are closed circles")
<-
color_plot2 |>
SWIM_NEW # group both (creates nxm subsets)
group_by(Event, School) |>
mutate(MeanTime = mean(Time)) |>
ungroup() |>
ggplot(mapping = aes(x = Event,
y = Time,
col = School,
fill = School
)+
) # plot all data (y = Time)
geom_point(position = position_jitter(seed = 167,
height = 0,
width = .2),
alpha = .7,
size = 2
+
) # Add a layer using the grouped data (y = MeanTime)
geom_point(aes(y = MeanTime),
pch = 1,
size = 2,
stroke = 1,
alpha = .4
+
) coord_flip() +
labs(caption = "Caption illustration:\nMeans are open circles")
plot(gridExtra::arrangeGrob(color_plot1, color_plot2, ncol = 1))
A note on the size
aesthetic
For geom_point()
, you should understand what is happening when you pass an argument (e.g., size = 2
) to the parameter because of its implications for bias in data visualizations. The size parameter scales on both the x and y dimensions proportionately which means that for any circle that doubles the size of the point, its radius will also double. When the radius doubles, however, the diameter and circumference also double. You might guess where this discussion is going.
If size is a constant, this likely will not be a problem. If, however, size is mapped to a variable, then then you have a perceptual challenge related to difficulty with comparing area. See Cleveland & McGill (1984). Moreover, because the area of a circle is proportional to the square of its radius, doubling the point size increases area fourfold.
Here is an example.
ggplot(data =
data.frame(x = 0,
y = 0,
panel = c("`size = 13`", "`size = 26`"),
size = c(13, 26)),
mapping = aes(x, y, size = size)
+
) geom_point() +
scale_x_continuous(limits = c(-5, 5),
breaks = seq(-5, 5, 1)) +
scale_y_continuous(limits = c(-5, 5),
breaks = seq(-5, 5, 1)) +
scale_size_identity() +
facet_wrap(. ~ panel) +
labs(x = NULL, y = NULL)
Summary
How data are aggregated will affect both summary statistics and data visualizations. With each level of aggregation, fine-grained or case-level data are lost and not available for plotting. That fine-grained detail may be computed with measures like variability and confidence levels but those measures are summaries themselves. Including them is not the same as including all data. That said, including all data may influence visual attention and interpretation of plots to the extent that outlying points or grouping characteristics (point proximity) drive attention to physical locations of plots at the cost of attending to other locations.
A main workhorse of adding variables and creating aggregated data in the R
language is the {dplyr} library. You must, however, be sensitive to the way its functions work so that the variables you create in either a full or summarized data frame represent your intentions and goals.
We have seen that a lot of customization can be applied to our visualizations of data and have shown just a little bit of what can be done. Many more functions and parameters can be used to customize to greater extents.
Session Info
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] htmltools_0.5.8.1 DT_0.33 vroom_1.6.5 lubridate_1.9.3
[5] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[9] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
[13] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.5 xfun_0.45 htmlwidgets_1.6.4 rstatix_0.7.2
[5] tzdb_0.4.0 vctrs_0.6.5 tools_4.4.1 generics_0.1.3
[9] fansi_1.0.6 cluster_2.1.6 pkgconfig_2.0.3 R.oo_1.26.0
[13] data.table_1.15.4 checkmate_2.3.1 lifecycle_1.0.4 compiler_4.4.1
[17] farver_2.1.2 munsell_0.5.1 carData_3.0-5 yaml_2.3.10
[21] htmlTable_2.4.2 Formula_1.2-5 car_3.1-2 pillar_1.9.0
[25] ggpubr_0.6.0 crayon_1.5.3 R.utils_2.12.3 Hmisc_5.1-3
[29] abind_1.4-5 rpart_4.1.23 tidyselect_1.2.1 digest_0.6.36
[33] stringi_1.8.4 labeling_0.4.3 rprojroot_2.0.4 fastmap_1.2.0
[37] grid_4.4.1 here_1.0.1 colorspace_2.1-0 cli_3.6.3
[41] magrittr_2.0.3 base64enc_0.1-3 utf8_1.2.4 broom_1.0.6
[45] foreign_0.8-87 withr_3.0.1 scales_1.3.0 backports_1.5.0
[49] bit64_4.0.5 timechange_0.3.0 rmarkdown_2.27 bit_4.0.5
[53] nnet_7.3-19 gridExtra_2.3 ggsignif_0.6.4 R.methodsS3_1.8.2
[57] hms_1.1.3 evaluate_0.24.0 knitr_1.47 rlang_1.1.4
[61] glue_1.7.0 rstudioapi_0.16.0 jsonlite_1.8.8 R6_2.5.1