library(dplyr)
Joining Relational Data
Overview
In this module, we will cover how to join data frames that are related in some way. For example, perhaps you have multiple data frames containing data from the same participants and you need to combine them all so that you can summarize, visualize, or model the data. Students familiar with SQL
(Structured Query Language) should have some experience with joining relational data. The joining or merging process is often a large component of data science projects, especially those that involve collaborators cleaning separate components that need to be merged together.
You can run SQL
code within an R Markdown code block but there are other libraries like {sqldf} that allow you to include SQL
to perform data queries.
For those who wish to dig deeper into learning SQL
, there are tutorials here.
Readings and Preparation
Before Class: First, read to familiarize yourself with the concepts rather than master them. I will assume that you attend class with some level of basic understanding of concepts and working of functions. The goal of reading should be to understand and implement code functions as well as support your understanding and help your troubleshooting of problems. This cannot happen if you just read the content without interacting with it, however reading is absolutely essential to being successful during class time. Work through some examples so that you have a good idea of your level of understanding and confidence.
Class: In class, some functions and concepts will be introduced and we will practice implementing code through exercises.
Supplementary Readings (Optional)
Libraries
- {here} 1.0.1: for file path management
- {dplyr} 1.1.4: for joining data frames
More generally:
- {tidyverse} 2.0.0: the tidyverse ecosystem
Creating Data Frames to Join
Let’s say that you have three data frames assigned to A
, B
, and C
. Each contains different variables and/or participants.
Data frame A
contains heights and weights for different individuals, who appear to be uniquely identify by some numeric code.
<- tibble::tribble(
(A ~id, ~height, ~weight,
1111, 60, 128,
2211, NA, 120,
9811, 60, 149,
3000, 65, 140,
))
# A tibble: 4 × 3
id height weight
<dbl> <dbl> <dbl>
1 1111 60 128
2 2211 NA 120
3 9811 60 149
4 3000 65 140
Data frame B
contains data on cities and ages for different individuals, who also appear to be uniquely identify by some numeric code.
<- tibble::tribble(
(B ~id, ~city, ~age,
1111, "Claremont", 22,
2211, "Dallas", 25,
9811, "Akron", 38,
3050, NA, 50
))
# A tibble: 4 × 3
id city age
<dbl> <chr> <dbl>
1 1111 Claremont 22
2 2211 Dallas 25
3 9811 Akron 38
4 3050 <NA> 50
Data frame C
contains names, who appear to be uniquely identify by some numeric code.
<- tibble::tribble(
(C ~id, ~name,
1111, "Sally",
2211, "Jane",
9811, NA,
NA, "Kelly"
))
# A tibble: 4 × 2
id name
<dbl> <chr>
1 1111 Sally
2 2211 Jane
3 9811 <NA>
4 NA Kelly
All three data frames appear to have at least one common variable, id
even thought the data frames don’t contain all of the same ids (e.g., 3050). If you know that id
is a unique identifier, this could be a way to merge the data from these folks with other data associated with them. The question is how to join the data frames.
Merge/Join using base R’s merge()
merge(x, y)
Key Parameters/Arguments
x
,y
: data frames, or objects to be coerced to oneby
,by.x
,by.y
: specifications for the matching column
Joining A
and B
:
merge(x = A,
y = B,
by = "id"
)
id height weight city age
1 1111 60 128 Claremont 22
2 2211 NA 120 Dallas 25
3 9811 60 149 Akron 38
We see that the returned data frame contains 3 rows and 5 columns. Although there are NA
values for some variables, the merged data frame contains the column variables from both data frames. However, the id
s that are NA
will be dropped. Similarly, id == 3050
is also dropped because the other data frames don’t contain that individual.
Joining A
and C
:
merge(x = A,
y = C,
by = "id"
)
id height weight name
1 1111 60 128 Sally
2 2211 NA 120 Jane
3 9811 60 149 <NA>
We see that the returned data frame contains 3 rows and 4 columns. Although there are NA
values for some variables, the merged data frame contains the column variables from both data frames. However, you see that when id
has value of NA
for that row, that row is dropped from the merge. This resulted in a data frame with only 3 rows/individuals.
Joining B
and C
:
merge(x = B,
y = C,
by = "id"
)
id city age name
1 1111 Claremont 22 Sally
2 2211 Dallas 25 Jane
3 9811 Akron 38 <NA>
This merge performs similarly to merging A
and C
because there are id
value of NA
and the omission of id == 3050
.
Full Join using dplyr::full_join()
A full join is used when you want to **join all rows from two data frames. When you combine the data frames using some matching value and including NA where there are no matches.
full_join(
x,
yby = NULL,
copy = FALSE,
suffix = c(".x", ".y"),
...,keep = NULL
)
Key Parameters/Arguments
x
,y
: data frames, or objects to be coerced to oneby
: specifications for the matching column
Joining A
and B
:
full_join(x = A,
y = B,
by = "id"
)
# A tibble: 5 × 5
id height weight city age
<dbl> <dbl> <dbl> <chr> <dbl>
1 1111 60 128 Claremont 22
2 2211 NA 120 Dallas 25
3 9811 60 149 Akron 38
4 3000 65 140 <NA> NA
5 3050 NA NA <NA> 50
For the most part, this full_join()
works the same a merge()
but with some differences. Whereas merge()
sorts the returned data frame by the by
argument, full_join()
does not. Also, full_join()
includes a row of data for id == 3050
.
Joining A
and C
:
full_join(x = A,
y = C,
by = "id"
)
# A tibble: 5 × 4
id height weight name
<dbl> <dbl> <dbl> <chr>
1 1111 60 128 Sally
2 2211 NA 120 Jane
3 9811 60 149 <NA>
4 3000 65 140 <NA>
5 NA NA NA Kelly
Joining B
and C
:
full_join(x = B,
y = C,
by = "id"
)
# A tibble: 5 × 4
id city age name
<dbl> <chr> <dbl> <chr>
1 1111 Claremont 22 Sally
2 2211 Dallas 25 Jane
3 9811 Akron 38 <NA>
4 3050 <NA> 50 <NA>
5 NA <NA> NA Kelly
Similarly, you see that full_join()
maintains data from both data frames even if there are NA
values anywhere.
Left Join using dplyr::left_join()
A left join will return all rows from the left data frame (e.g., the argument passed to x
) and the matched rows from the right data frame (e.g., the argument passed to y
). If there’s no match in the right data frame, it returns NA
for the columns from the right data frame, y
.
left_join(
x,
y,by = NULL,
copy = FALSE,
suffix = c(".x", ".y"),
...,keep = NULL
Key Parameters/Arguments
x
,y
: data frames, or objects to be coerced to one.by
: specifications for the matching column
Joining B
left and A
:
left_join(x = A,
y = B,
by = "id"
)
# A tibble: 4 × 5
id height weight city age
<dbl> <dbl> <dbl> <chr> <dbl>
1 1111 60 128 Claremont 22
2 2211 NA 120 Dallas 25
3 9811 60 149 Akron 38
4 3000 65 140 <NA> NA
Here, left_join()
takes B
and joins it left, into A
. What is returned is A
along with any data in B
that can be matched to the id
in A
. If there is an id
in B
that is not in A
, it won’t be included.
Joining C
left into A
:
left_join(x = A,
y = C,
by = "id"
)
# A tibble: 4 × 4
id height weight name
<dbl> <dbl> <dbl> <chr>
1 1111 60 128 Sally
2 2211 NA 120 Jane
3 9811 60 149 <NA>
4 3000 65 140 <NA>
With these data frames, left_join()
takes C
and joins it left, into C
. What is returned is B
along with any data in C
that can be matched to the id
in B
.
Joining C
left into B
:
left_join(x = B,
y = C,
by = "id"
)
# A tibble: 4 × 4
id city age name
<dbl> <chr> <dbl> <chr>
1 1111 Claremont 22 Sally
2 2211 Dallas 25 Jane
3 9811 Akron 38 <NA>
4 3050 <NA> 50 <NA>
Joining B
left into C
:
Remember that C
contains an NA
in id
.
left_join(x = C,
y = B,
by = "id"
)
# A tibble: 4 × 4
id name city age
<dbl> <chr> <chr> <dbl>
1 1111 Sally Claremont 22
2 2211 Jane Dallas 25
3 9811 <NA> Akron 38
4 NA Kelly <NA> NA
When we left_join()
takes B
and joins it left, into C
, the data frame includes the NA
row for id
.
Joining More than 2 Data Frames
When you have multiple data frames, you have to think about how you will join two first, and then join another to that joined data frame, and so on.
We will join A
and B
and assign the joined data frame to AB
.
<-
AB full_join(
x = A,
y = B,
by = "id"
)
AB
# A tibble: 5 × 5
id height weight city age
<dbl> <dbl> <dbl> <chr> <dbl>
1 1111 60 128 Claremont 22
2 2211 NA 120 Dallas 25
3 9811 60 149 Akron 38
4 3000 65 140 <NA> NA
5 3050 NA NA <NA> 50
Now, we will join C
with AB
and assign it to ABC
to represent all three data frames.
<-
ABC full_join(
x = AB,
y = C,
by = "id"
)
ABC
# A tibble: 6 × 6
id height weight city age name
<dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 1111 60 128 Claremont 22 Sally
2 2211 NA 120 Dallas 25 Jane
3 9811 60 149 Akron 38 <NA>
4 3000 65 140 <NA> NA <NA>
5 3050 NA NA <NA> 50 <NA>
6 NA NA NA <NA> NA Kelly
Writing a Function
You can also write a function to help you. Here, we can make a function to take a list of data frames and cycles through them to return a data frame based on full_join()
.
<- function(list, by) {
full_join_data_frames_in_list if (missing(by)) {
message("Please specify the 'by' parameter for merging.")
return(NULL)
else {
} <- Reduce(function(x, y) dplyr::full_join(x, y, by = by), list)
joined return(joined)
} }
Put the data frames in a list:
<- list(A, B, C) ABC_list
Pass the list to the list
parameter and supply the by
argument as a string.
full_join_data_frames_in_list(list = ABC_list,
by = "id"
)
# A tibble: 6 × 6
id height weight city age name
<dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 1111 60 128 Claremont 22 Sally
2 2211 NA 120 Dallas 25 Jane
3 9811 60 149 Akron 38 <NA>
4 3000 65 140 <NA> NA <NA>
5 3050 NA NA <NA> 50 <NA>
6 NA NA NA <NA> NA Kelly
Summary
When you are working with data obtained from different sources or when you split a large data set into pieces that need to joined later, joining or merging data frames will be required. There are several ways to join data frames as described here but also others like inner_join()
not addressed here. Moreover, join functions have additional parameters not addressed in this brief module. Each join function serves a special purpose which may best suit your use case.