Joining Relational Data

Author

Gabriel I. Cook

Overview

In this module, we will cover how to join data frames that are related in some way. For example, perhaps you have multiple data frames containing data from the same participants and you need to combine them all so that you can summarize, visualize, or model the data. Students familiar with SQL (Structured Query Language) should have some experience with joining relational data. The joining or merging process is often a large component of data science projects, especially those that involve collaborators cleaning separate components that need to be merged together.

You can run SQL code within an R Markdown code block but there are other libraries like {sqldf} that allow you to include SQL to perform data queries.

For those who wish to dig deeper into learning SQL, there are tutorials here.

Readings and Preparation

Before Class: First, read to familiarize yourself with the concepts rather than master them. I will assume that you attend class with some level of basic understanding of concepts and working of functions. The goal of reading should be to understand and implement code functions as well as support your understanding and help your troubleshooting of problems. This cannot happen if you just read the content without interacting with it, however reading is absolutely essential to being successful during class time. Work through some examples so that you have a good idea of your level of understanding and confidence.

Class: In class, some functions and concepts will be introduced and we will practice implementing code through exercises.

Supplementary Readings (Optional)

Libraries

  • {here} 1.0.1: for file path management
  • {dplyr} 1.1.4: for joining data frames

More generally:

  • {tidyverse} 2.0.0: the tidyverse ecosystem
library(dplyr)

Creating Data Frames to Join

Let’s say that you have three data frames assigned to A, B, and C. Each contains different variables and/or participants.

Data frame A contains heights and weights for different individuals, who appear to be uniquely identify by some numeric code.

(A <- tibble::tribble(
  ~id, ~height, ~weight,
  1111, 60, 128,
  2211, NA, 120,
  9811, 60, 149,
  3000, 65, 140,
))
# A tibble: 4 × 3
     id height weight
  <dbl>  <dbl>  <dbl>
1  1111     60    128
2  2211     NA    120
3  9811     60    149
4  3000     65    140

Data frame B contains data on cities and ages for different individuals, who also appear to be uniquely identify by some numeric code.

(B <- tibble::tribble(
  ~id, ~city, ~age,
  1111, "Claremont", 22,
  2211, "Dallas", 25,
  9811, "Akron", 38,
  3050, NA, 50
))
# A tibble: 4 × 3
     id city        age
  <dbl> <chr>     <dbl>
1  1111 Claremont    22
2  2211 Dallas       25
3  9811 Akron        38
4  3050 <NA>         50

Data frame C contains names, who appear to be uniquely identify by some numeric code.

(C <- tibble::tribble(
  ~id, ~name,
  1111, "Sally",
  2211, "Jane", 
  9811, NA,
  NA, "Kelly"
))
# A tibble: 4 × 2
     id name 
  <dbl> <chr>
1  1111 Sally
2  2211 Jane 
3  9811 <NA> 
4    NA Kelly

All three data frames appear to have at least one common variable, id even thought the data frames don’t contain all of the same ids (e.g., 3050). If you know that id is a unique identifier, this could be a way to merge the data from these folks with other data associated with them. The question is how to join the data frames.

Merge/Join using base R’s merge()

merge(x, y)

Key Parameters/Arguments

  • x, y: data frames, or objects to be coerced to one
  • by, by.x, by.y: specifications for the matching column

Joining A and B:

merge(x = A, 
      y = B, 
      by = "id"
      )
    id height weight      city age
1 1111     60    128 Claremont  22
2 2211     NA    120    Dallas  25
3 9811     60    149     Akron  38

We see that the returned data frame contains 3 rows and 5 columns. Although there are NA values for some variables, the merged data frame contains the column variables from both data frames. However, the ids that are NA will be dropped. Similarly, id == 3050 is also dropped because the other data frames don’t contain that individual.

Joining A and C:

merge(x = A, 
      y = C, 
      by = "id"
      )
    id height weight  name
1 1111     60    128 Sally
2 2211     NA    120  Jane
3 9811     60    149  <NA>

We see that the returned data frame contains 3 rows and 4 columns. Although there are NA values for some variables, the merged data frame contains the column variables from both data frames. However, you see that when id has value of NA for that row, that row is dropped from the merge. This resulted in a data frame with only 3 rows/individuals.

Joining B and C:

merge(x = B, 
      y = C, 
      by = "id"
      )
    id      city age  name
1 1111 Claremont  22 Sally
2 2211    Dallas  25  Jane
3 9811     Akron  38  <NA>

This merge performs similarly to merging A and C because there are id value of NA and the omission of id == 3050.

Full Join using dplyr::full_join()

A full join is used when you want to **join all rows from two data frames. When you combine the data frames using some matching value and including NA where there are no matches.

full_join(
  x,
  y
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL
)

Key Parameters/Arguments

  • x, y: data frames, or objects to be coerced to one
  • by: specifications for the matching column

Joining A and B:

full_join(x = A, 
          y = B, 
          by = "id"
          )
# A tibble: 5 × 5
     id height weight city        age
  <dbl>  <dbl>  <dbl> <chr>     <dbl>
1  1111     60    128 Claremont    22
2  2211     NA    120 Dallas       25
3  9811     60    149 Akron        38
4  3000     65    140 <NA>         NA
5  3050     NA     NA <NA>         50

For the most part, this full_join() works the same a merge() but with some differences. Whereas merge() sorts the returned data frame by the by argument, full_join() does not. Also, full_join() includes a row of data for id == 3050.

Joining A and C:

full_join(x = A, 
          y = C, 
          by = "id"
          )
# A tibble: 5 × 4
     id height weight name 
  <dbl>  <dbl>  <dbl> <chr>
1  1111     60    128 Sally
2  2211     NA    120 Jane 
3  9811     60    149 <NA> 
4  3000     65    140 <NA> 
5    NA     NA     NA Kelly

Joining B and C:

full_join(x = B, 
          y = C, 
          by = "id"
          )
# A tibble: 5 × 4
     id city        age name 
  <dbl> <chr>     <dbl> <chr>
1  1111 Claremont    22 Sally
2  2211 Dallas       25 Jane 
3  9811 Akron        38 <NA> 
4  3050 <NA>         50 <NA> 
5    NA <NA>         NA Kelly

Similarly, you see that full_join() maintains data from both data frames even if there are NA values anywhere.

Left Join using dplyr::left_join()

A left join will return all rows from the left data frame (e.g., the argument passed to x) and the matched rows from the right data frame (e.g., the argument passed to y). If there’s no match in the right data frame, it returns NA for the columns from the right data frame, y.

left_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL

Key Parameters/Arguments

  • x, y: data frames, or objects to be coerced to one.
  • by: specifications for the matching column

Joining B left and A:

left_join(x = A, 
          y = B, 
          by = "id"
          )
# A tibble: 4 × 5
     id height weight city        age
  <dbl>  <dbl>  <dbl> <chr>     <dbl>
1  1111     60    128 Claremont    22
2  2211     NA    120 Dallas       25
3  9811     60    149 Akron        38
4  3000     65    140 <NA>         NA

Here, left_join() takes B and joins it left, into A. What is returned is A along with any data in B that can be matched to the id in A. If there is an id in B that is not in A, it won’t be included.

Joining C left into A:

left_join(x = A, 
          y = C, 
          by = "id"
          )
# A tibble: 4 × 4
     id height weight name 
  <dbl>  <dbl>  <dbl> <chr>
1  1111     60    128 Sally
2  2211     NA    120 Jane 
3  9811     60    149 <NA> 
4  3000     65    140 <NA> 

With these data frames, left_join() takes C and joins it left, into C. What is returned is B along with any data in C that can be matched to the id in B.

Joining C left into B:

left_join(x = B, 
          y = C, 
          by = "id"
          )
# A tibble: 4 × 4
     id city        age name 
  <dbl> <chr>     <dbl> <chr>
1  1111 Claremont    22 Sally
2  2211 Dallas       25 Jane 
3  9811 Akron        38 <NA> 
4  3050 <NA>         50 <NA> 

Joining B left into C:

Remember that C contains an NA in id.

left_join(x = C, 
          y = B, 
          by = "id"
          )
# A tibble: 4 × 4
     id name  city        age
  <dbl> <chr> <chr>     <dbl>
1  1111 Sally Claremont    22
2  2211 Jane  Dallas       25
3  9811 <NA>  Akron        38
4    NA Kelly <NA>         NA

When we left_join() takes B and joins it left, into C, the data frame includes the NA row for id.

Joining More than 2 Data Frames

When you have multiple data frames, you have to think about how you will join two first, and then join another to that joined data frame, and so on.

We will join A and B and assign the joined data frame to AB.

AB <- 
  full_join(
    x = A, 
    y = B, 
    by = "id"
  )
AB
# A tibble: 5 × 5
     id height weight city        age
  <dbl>  <dbl>  <dbl> <chr>     <dbl>
1  1111     60    128 Claremont    22
2  2211     NA    120 Dallas       25
3  9811     60    149 Akron        38
4  3000     65    140 <NA>         NA
5  3050     NA     NA <NA>         50

Now, we will join C with AB and assign it to ABC to represent all three data frames.

ABC <- 
  full_join(
    x = AB, 
    y = C, 
    by = "id"
  )
ABC
# A tibble: 6 × 6
     id height weight city        age name 
  <dbl>  <dbl>  <dbl> <chr>     <dbl> <chr>
1  1111     60    128 Claremont    22 Sally
2  2211     NA    120 Dallas       25 Jane 
3  9811     60    149 Akron        38 <NA> 
4  3000     65    140 <NA>         NA <NA> 
5  3050     NA     NA <NA>         50 <NA> 
6    NA     NA     NA <NA>         NA Kelly

Writing a Function

You can also write a function to help you. Here, we can make a function to take a list of data frames and cycles through them to return a data frame based on full_join().

full_join_data_frames_in_list <- function(list, by) {
  if (missing(by)) {
    message("Please specify the 'by' parameter for merging.")
    return(NULL)
  } else {
    joined <- Reduce(function(x, y) dplyr::full_join(x, y, by = by), list)
    return(joined)
  }
}

Put the data frames in a list:

ABC_list <- list(A, B, C)

Pass the list to the list parameter and supply the by argument as a string.

full_join_data_frames_in_list(list = ABC_list,
                              by = "id"
                              )
# A tibble: 6 × 6
     id height weight city        age name 
  <dbl>  <dbl>  <dbl> <chr>     <dbl> <chr>
1  1111     60    128 Claremont    22 Sally
2  2211     NA    120 Dallas       25 Jane 
3  9811     60    149 Akron        38 <NA> 
4  3000     65    140 <NA>         NA <NA> 
5  3050     NA     NA <NA>         50 <NA> 
6    NA     NA     NA <NA>         NA Kelly

Summary

When you are working with data obtained from different sources or when you split a large data set into pieces that need to joined later, joining or merging data frames will be required. There are several ways to join data frames as described here but also others like inner_join() not addressed here. Moreover, join functions have additional parameters not addressed in this brief module. Each join function serves a special purpose which may best suit your use case.