Examining Relationships in Variables of Cognition

Author

Gabriel I. Cook

Published

March 29, 2024

Overview

We will explore some ways to examine relationships among variables. The focus will be on using Pearson’s r correlation to examine linear relationships and presenting a correlation matrix as a table. We will also visualize variable relationships in the form of scatterplots, faceted scatterplots, and generalized pairs plots.

Readings and Preparation

Before Class: First, read to familiarize yourself with the concepts rather than to master them. I will assume that you attend class with a basic understanding of the concepts and the workings of the functions. The goal of reading should be to understand and implement code functions, as well as to support your understanding and your troubleshooting of problems. This cannot happen if you just read the content without interacting with it; reading is absolutely essential to being successful during class time. Work through some examples so that you have a good idea of your level of understanding and confidence.

Class: In class, some functions and concepts will be introduced and we will practice implementing code through exercises.

Libraries

  • {here} 1.0.1: for file path management
  • {dplyr} 1.1.4: for data frame manipulation
  • {tidyr} 1.3.0: for data frame transformation
  • {correlation} 0.8.4: for correlations; from {easystats} ecosystem
  • {ggplot2} 3.4.4: for data visualization
  • {GGally} 2.2.0: for generalized pairs plots

Other:

  • {insight} 0.19.7: for model post-processing

Correlations

Correlation refers to a statistical measure that describes the extent to which two variables change together. A positive correlation indicates that the variables tend to move in the same direction, whereas a negative correlation indicates that they move in opposite directions. When variables are uncorrelated, the variability in one variable cannot be explained by the other. Some common correlation coefficients include Pearson’s product-moment correlation coefficient, Spearman’s rank correlation coefficient, and Kendall’s tau correlation coefficient.
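As a quick illustration with made-up toy vectors (hypothetical values, not the course data), base R's cor() can compute all three coefficients:

```r
# Toy vectors (hypothetical data, for illustration only)
x     <- c(1, 2, 3, 4, 5, 6)
y_pos <- c(2, 4, 5, 4, 6, 7)   # tends to rise with x
y_neg <- c(7, 6, 4, 5, 3, 2)   # tends to fall with x

cor(x, y_pos)                        # positive Pearson's r
cor(x, y_neg)                        # negative Pearson's r
cor(x, y_pos, method = "spearman")   # Spearman's rank correlation
cor(x, y_pos, method = "kendall")    # Kendall's tau
```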

Getting a Data Frame

We will use a long-format data file that contains repeated measures of a variable.
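The data file itself is read in elsewhere; as a hypothetical sketch (ids and values borrowed from the wide output shown later), the long format looks something like this, with one row per id-by-time measurement:

```r
# Hypothetical sketch of the long format; the real LONG data frame
# is read from a file and contains many more rows
LONG_demo <- data.frame(
  id    = c("AQAFS", "AQAMS", "AQAMS", "AQAMS"),
  group = "A",
  time  = c(1, 1, 2, 3),
  var1  = c(32, 27, 33, 19),
  var2  = c(0, 5, 4, 11)
)
LONG_demo
```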

Arranging Variables as Columns

When you want to correlate variables, you will need to have them as vectors of the same length. If those vectors are in data frames, you will want to arrange the variables you wish to correlate as column variables in the data frame. If, for example, you want to correlate variables that are measured repeatedly over time, each measurement occasion needs its own column.

You see that var1 and var2 are measured at each time point. Some individuals have data for one, two, or three time points. In order to correlate var1 or var2 performance across the different time points, the data need to be arranged as columns in the data frame.

Pivot Wide

Pivot the data frame, using the levels of time to create new column variables that take their values from var1 and var2.

(WIDE <-
  LONG |>
  tidyr::pivot_wider( 
    names_from  = time,            # the measurement time variable
    values_from = c(var1, var2)    # two types of variables measured
  )
)
# A tibble: 576 × 8
   id    group var1_1 var1_2 var1_3 var2_1 var2_2 var2_3
   <chr> <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 AQAFS A         32     NA     NA      0     NA     NA
 2 AQAMF A         16     NA     NA      4     NA     NA
 3 AQAMS A         27     33     19      5      4     11
 4 AQAQQ A         37     NA     22      6     NA      6
 5 AQAQP A         41     40     39      1      3      2
 6 AQAYR A         34     42     36      6      1     11
 7 AQASY A         29     24     NA      5      5     NA
 8 AQARM A          0     NA      8     23     NA      7
 9 AQFAF A         20     16     NA      4      6     NA
10 AQFJA A         31     19     21     26     24      3
# ℹ 566 more rows

Correlate

Using the correlation function, cor(), you can correlate data from two vectors. Looking at the function, you see that there are parameters for the correlation method and the values to use.

cor(x, 
    y = NULL, 
    use = "everything",
    method = c("pearson", "kendall", "spearman")
    )

Key Parameters/Arguments

  • x: a numeric vector, matrix or data frame
  • y: NULL (default) or a vector, matrix or data frame with compatible dimensions to x. The default is equivalent to y = x (but more efficient).
  • use: an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings “everything”, “all.obs”, “complete.obs”, “na.or.complete”, or “pairwise.complete.obs”.
  • method: a character string indicating which correlation coefficient (or covariance) is to be computed. One of “pearson” (default), “kendall”, or “spearman”: can be abbreviated.

Correlating Variables using cor()

Out of the box, however, cor() will require some modifications. If you specify an x and a y variable to correlate, the default is method = "pearson", which means that Pearson’s r will be computed.

cor(x = WIDE$var1_1, 
    y = WIDE$var1_2
    )
[1] NA

Dealing with NAs (Missing Observations)

But if there are NAs, you will see that the correlation is returned as NA. This is because use = "everything", which means that all values will be included in the computation. Just like other functions (e.g., mean()), when there are NAs, the result will be NA.
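This mirrors base R's general NA propagation; for example:

```r
# The same NA propagation you see in other summary functions
mean(c(1, 2, NA))                # returns NA
mean(c(1, 2, NA), na.rm = TRUE)  # returns 1.5
```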

In order to correlate only the complete cases, for which individuals have data for both the x and y variables, pass use = "complete.obs".

cor(x = WIDE$var1_1, 
    y = WIDE$var1_2,
    use = "complete.obs"
    )
[1] 0.577295

Correlating Matrices

Correlating more than two variables, however, is quite common. The x parameter in cor() will also take a matrix, so you can pass a data frame that contains the numeric variables you wish to correlate.

Selecting specific variables is easy using the : operator.

WIDE |>
  select(var1_1:var2_3) |>
  cor(use = "complete.obs")
           var1_1     var1_2     var1_3     var2_1     var2_2     var2_3
var1_1  1.0000000  0.5369649  0.4126811 -0.4877476 -0.3259415 -0.2694527
var1_2  0.5369649  1.0000000  0.6122694 -0.3838897 -0.6045711 -0.4540006
var1_3  0.4126811  0.6122694  1.0000000 -0.3138597 -0.3748611 -0.4750568
var2_1 -0.4877476 -0.3838897 -0.3138597  1.0000000  0.5824039  0.5119893
var2_2 -0.3259415 -0.6045711 -0.3748611  0.5824039  1.0000000  0.6204512
var2_3 -0.2694527 -0.4540006 -0.4750568  0.5119893  0.6204512  1.0000000

But if you want to correlate all numeric variables, you can select columns that are numeric by using where(). All you need to do is pass a function as its argument, for example, fn = is.numeric.

WIDE |>
  select(where(fn = is.numeric)) |> 
  head()
# A tibble: 6 × 6
  var1_1 var1_2 var1_3 var2_1 var2_2 var2_3
   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1     32     NA     NA      0     NA     NA
2     16     NA     NA      4     NA     NA
3     27     33     19      5      4     11
4     37     NA     22      6     NA      6
5     41     40     39      1      3      2
6     34     42     36      6      1     11

You see that all other variables are no longer in the data frame. With only numeric variables, if the entire data frame is passed as the argument to x in cor(), all variables will be correlated with each other.

WIDE |>
  select(where(fn = is.numeric)) |>
  cor()
       var1_1 var1_2 var1_3 var2_1 var2_2 var2_3
var1_1      1     NA     NA     NA     NA     NA
var1_2     NA      1     NA     NA     NA     NA
var1_3     NA     NA      1     NA     NA     NA
var2_1     NA     NA     NA      1     NA     NA
var2_2     NA     NA     NA     NA      1     NA
var2_3     NA     NA     NA     NA     NA      1

But if there are NAs, you will need to deal with them. There are two options. First, you can correlate variables at the pair level by dropping the NAs for each pair (use = "pairwise.complete.obs"), or you can drop all rows with NAs to include only complete cases (use = "complete.obs"). The latter option will correlate data by including only those who have no missing data across all of the variables. Let’s try both options.

Correlate at the pair level. Drop out those who have not contributed data to each x,y pair.

WIDE |>
  select(where(fn = is.numeric)) |>
  cor(use = "pairwise.complete.obs")
           var1_1     var1_2     var1_3     var2_1     var2_2     var2_3
var1_1  1.0000000  0.5772950  0.4325422 -0.4724163 -0.3796002 -0.3368073
var1_2  0.5772950  1.0000000  0.6067406 -0.4172432 -0.5526024 -0.4589754
var1_3  0.4325422  0.6067406  1.0000000 -0.3323334 -0.3704202 -0.5220840
var2_1 -0.4724163 -0.4172432 -0.3323334  1.0000000  0.5747723  0.4700286
var2_2 -0.3796002 -0.5526024 -0.3704202  0.5747723  1.0000000  0.6242233
var2_3 -0.3368073 -0.4589754 -0.5220840  0.4700286  0.6242233  1.0000000

Use only those who have data for all measurements.

WIDE |>
  select(where(fn = is.numeric)) |>
  cor(use = "complete.obs")
           var1_1     var1_2     var1_3     var2_1     var2_2     var2_3
var1_1  1.0000000  0.5369649  0.4126811 -0.4877476 -0.3259415 -0.2694527
var1_2  0.5369649  1.0000000  0.6122694 -0.3838897 -0.6045711 -0.4540006
var1_3  0.4126811  0.6122694  1.0000000 -0.3138597 -0.3748611 -0.4750568
var2_1 -0.4877476 -0.3838897 -0.3138597  1.0000000  0.5824039  0.5119893
var2_2 -0.3259415 -0.6045711 -0.3748611  0.5824039  1.0000000  0.6204512
var2_3 -0.2694527 -0.4540006 -0.4750568  0.5119893  0.6204512  1.0000000

Correlating Across Sub-Groups

If you want to correlate data for subgroups or subsets of your data frame, you will want to create a data-frame summary.

WIDE |>
  group_by(group) |>
  summarize(var1_var2_r = cor(var1_1, 
                              var2_1,
                              use = "complete.obs"
                              ) 
  )
# A tibble: 2 × 2
  group var1_var2_r
  <chr>       <dbl>
1 A          -0.457
2 B          -0.490

Add the other correlations:

WIDE |>
  group_by(group) |>
  summarize(var1_t1_t2_r = cor(var1_1, 
                               var1_2,
                               use = "pairwise.complete.obs"
                               ),
            var1_t1_t3_r = cor(var1_1, 
                               var1_3,
                               use = "pairwise.complete.obs"
                               ),
            var1_t2_t3_r = cor(var1_2, 
                               var1_3,
                               use = "pairwise.complete.obs"
                               )
  )
# A tibble: 2 × 4
  group var1_t1_t2_r var1_t1_t3_r var1_t2_t3_r
  <chr>        <dbl>        <dbl>        <dbl>
1 A            0.572        0.487        0.618
2 B            0.576        0.378        0.556

An Alternative to cor(): {correlation}

Base R has useful functions, but developers have created better alternatives that are built on that base functionality. A better approach to correlations is to use correlation() from the {correlation} library. The correlation() function is a wrapper around cor(), which means that it uses cor() to perform its operations. Thus, understanding the use of cor() in the preceding section is important for understanding how correlation() works.

correlation(), however, has several advantages over cor(). The {correlation} library is part of the {easystats} ecosystem of libraries. You can load either {correlation} or {easystats}, though the latter will be more helpful for modeling. You can read more about all of the methods for correlating variables at the {correlation} GitHub page.

correlation(
  data,
  data2 = NULL,
  select = NULL,
  select2 = NULL,
  rename = NULL,
  method = "pearson",
  p_adjust = "holm",
  ci = 0.95,
  bayesian = FALSE,
  bayesian_prior = "medium",
  bayesian_ci_method = "hdi",
  bayesian_test = c("pd", "rope", "bf"),
  redundant = FALSE,
  include_factors = FALSE,
  partial = FALSE,
  partial_bayesian = FALSE,
  multilevel = FALSE,
  ranktransform = FALSE,
  winsorize = FALSE,
  verbose = TRUE,
  standardize_names = getOption("easystats.standardize_names", FALSE),
  ...
)

Key Parameters/Arguments

  • data: a data frame
  • data2: an optional data frame for pair-wise correlations (e.g., correlate data with data2)
  • p_adjust: the p-value adjustment method for frequentist correlations (default: "holm")
  • ci: for creating confidence intervals around the correlations
  • partial: for calculating partial correlations
  • standardize_names: for standardizing the column names of the output
  • winsorize: for trimming data

Let’s use it to correlate some data.

Correlating with correlation()

Select the var1 variables, var1_1:var1_3, as well as the group variable; group by group; and correlate the pairwise combinations for each group. By default, the correlation will be Pearson’s r.

WIDE |>
  select(c(var1_1 : var1_3), group) |>     # select var1 variables
  group_by(group) |>
  correlation()
# Correlation Matrix (pearson-method)

Group | Parameter1 | Parameter2 |    r |       95% CI |    t |  df |         p
------------------------------------------------------------------------------
A     |     var1_1 |     var1_2 | 0.57 | [0.45, 0.67] | 8.22 | 139 | < .001***
A     |     var1_1 |     var1_3 | 0.49 | [0.33, 0.62] | 5.90 | 112 | < .001***
A     |     var1_2 |     var1_3 | 0.62 | [0.46, 0.74] | 6.90 |  77 | < .001***
B     |     var1_1 |     var1_2 | 0.58 | [0.43, 0.69] | 7.05 | 100 | < .001***
B     |     var1_1 |     var1_3 | 0.38 | [0.21, 0.53] | 4.31 | 111 | < .001***
B     |     var1_2 |     var1_3 | 0.56 | [0.37, 0.70] | 5.56 |  69 | < .001***

p-value adjustment method: Holm (1979)
Observations: 71-141

The correlation table printed by correlation() is organized by group, followed by the pairwise correlations for the variable pairs (e.g., Parameter1 and Parameter2). In addition to r, a 95% confidence interval (CI), t statistic, degrees of freedom (df), and p-value are also provided. Because these metrics are important for reporting, all of them are advantages of using correlation().

You will also notice that the degrees of freedom are not the same for all variable pairs, indicating that the default operation uses complete observations in a pairwise fashion; rows with NAs for any of the variables are not dropped wholesale. The smallest sample sizes correspond to the correlation between var1_2 and var1_3, as attrition affected data collected at time 3. Also, the p-values are adjusted by default using Holm’s method.
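Holm's adjustment is the same procedure base R provides via p.adjust(); as a small illustration with made-up p-values:

```r
# Hypothetical unadjusted p-values, for illustration
p_raw <- c(0.01, 0.02, 0.04)

# Holm: order ascending, multiply by (n - rank + 1), enforce monotonicity
p.adjust(p_raw, method = "holm")   # 0.03 0.04 0.04
```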

Correlating Complete Cases

If you want to obtain the correlation for cases that contributed to all variables, reduce the data frame by dropping any rows with NAs using drop_na().

WIDE |>
  select(c(var1_1 : var1_3), group) |>     # select var1 variables
  drop_na() |>
  group_by(group) |>
  correlation()
# Correlation Matrix (pearson-method)

Group | Parameter1 | Parameter2 |    r |       95% CI |    t | df |         p
-----------------------------------------------------------------------------
A     |     var1_1 |     var1_2 | 0.47 | [0.27, 0.63] | 4.60 | 75 | < .001***
A     |     var1_1 |     var1_3 | 0.40 | [0.19, 0.57] | 3.73 | 75 | < .001***
A     |     var1_2 |     var1_3 | 0.63 | [0.47, 0.75] | 6.95 | 75 | < .001***
B     |     var1_1 |     var1_2 | 0.62 | [0.45, 0.75] | 6.41 | 66 | < .001***
B     |     var1_1 |     var1_3 | 0.42 | [0.20, 0.60] | 3.74 | 66 | < .001***
B     |     var1_2 |     var1_3 | 0.56 | [0.37, 0.71] | 5.53 | 66 | < .001***

p-value adjustment method: Holm (1979)
Observations: 68-77

You will see that the data are now reduced to those who contributed to all measures.

Zero-Order, Part and Partial Correlations

When you calculate the correlation between two variables (a bivariate correlation), say hours studied (predictor) and grade performance (outcome), the value represents the association between the two variables without regard to other variables. This is referred to as a zero-order correlation. A zero-order correlation provides a basic idea of how two variables are related, but it does not take into account any other factors that might also influence the two variables.

A partial correlation represents the correlation between two variables after statistically controlling for the influence of other variables on both of them. For instance, the correlation between hours studied and grade performance while controlling for one or more potentially confounding variables (e.g., study quality). Thus, the partial correlation between the predictor, hours studied, and the outcome, grade performance, takes into account the relationship between the predictor and study quality and the relationship between the outcome and study quality. In other words, the partial correlation is the unique correlation between two variables after statistically removing overlapping relationships with other variables.

The part correlation (aka semi-partial) also controls for outside influence. Unlike the partial correlation, however, the part correlation accounts only for the correlation between the other variable(s) (e.g., study quality) and the predictor variable, hours studied. In this example, the part correlation does not take into account the relationship between study quality and grade performance. Because that latter relationship is not accounted for, some refer to this correlation as semi-partial.
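The distinction can be sketched with simulated (hypothetical) variables and lm() residuals, since a partial correlation is equivalent to correlating the residuals of both variables after regressing out the control variable:

```r
# Simulated (hypothetical) data: x = hours studied, y = grade, z = study quality
set.seed(1)
z <- rnorm(100)
x <- z + rnorm(100)       # predictor shares variance with z
y <- x + z + rnorm(100)   # outcome influenced by both

# Partial correlation: remove z from BOTH x and y, then correlate the residuals
r_partial <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

# Part (semi-partial) correlation: remove z from the predictor only
r_part <- cor(resid(lm(x ~ z)), y)

c(zero_order = cor(x, y), partial = r_partial, part = r_part)
```

The part correlation can never exceed the partial correlation in absolute value, because the partial also strips z-related variance from the outcome.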

Calculating Partial Correlations

To calculate partial correlations, pass partial = TRUE.

WIDE |>
  select(c(var1_1 : var1_3), group) |>     # select var1 variables
  group_by(group) |>
  correlation(partial = TRUE)
# Correlation Matrix (pearson-method)

Group | Parameter1 | Parameter2 |    r |        95% CI |    t | df |         p
------------------------------------------------------------------------------
A     |     var1_1 |     var1_2 | 0.31 | [ 0.09, 0.50] | 2.81 | 75 | 0.013*   
A     |     var1_1 |     var1_3 | 0.14 | [-0.08, 0.36] | 1.27 | 75 | 0.209    
A     |     var1_2 |     var1_3 | 0.55 | [ 0.37, 0.69] | 5.71 | 75 | < .001***
B     |     var1_1 |     var1_2 | 0.51 | [ 0.31, 0.67] | 4.82 | 66 | < .001***
B     |     var1_1 |     var1_3 | 0.11 | [-0.13, 0.34] | 0.88 | 66 | 0.382    
B     |     var1_2 |     var1_3 | 0.43 | [ 0.21, 0.60] | 3.82 | 66 | < .001***

p-value adjustment method: Holm (1979)
Observations: 68-77

You see that the sample size is again reduced to include only rows with data for all variables. This is because the partial correlation represents the relationship between each variable pair after removing the relationship between each of those two variables and the third variable. To correlate var1_1 and var1_2 while partialling out var1_3, the relationships between var1_1 and var1_3 and between var1_2 and var1_3 must both be removed. For these relationships to be accounted for, only those contributing to all measures are included in the partial correlation.

As a result, dropping NAs using drop_na() does not change the sample sizes.

WIDE |>
  select(c(var1_1 : var1_3), group) |>   
  drop_na() |>
  group_by(group) |>
  correlation(partial = TRUE)
# Correlation Matrix (pearson-method)

Group | Parameter1 | Parameter2 |    r |        95% CI |    t | df |         p
------------------------------------------------------------------------------
A     |     var1_1 |     var1_2 | 0.31 | [ 0.09, 0.50] | 2.81 | 75 | 0.012*   
A     |     var1_1 |     var1_3 | 0.15 | [-0.08, 0.36] | 1.29 | 75 | 0.200    
A     |     var1_2 |     var1_3 | 0.54 | [ 0.36, 0.68] | 5.60 | 75 | < .001***
B     |     var1_1 |     var1_2 | 0.51 | [ 0.31, 0.67] | 4.83 | 66 | < .001***
B     |     var1_1 |     var1_3 | 0.11 | [-0.13, 0.34] | 0.88 | 66 | 0.380    
B     |     var1_2 |     var1_3 | 0.43 | [ 0.21, 0.60] | 3.82 | 66 | < .001***

p-value adjustment method: Holm (1979)
Observations: 68-77

Visualizing Relationships

Tables

Presenting the data in tabular form is one of the easiest ways to communicate the relationships between the variables. Moreover, the data are already in a form that is ready for a table. The {gt} library provides a simple way to present data in a table. You can find additional information about using the library at the {gt} website.

{gt} will want to work with data frames, so we need to verify that the correlation matrix is actually a data frame by piping it to is.data.frame().

WIDE |>
  select(c(var1_1 : var1_3), group) |>   
  drop_na() |>
  group_by(group) |>
  correlation(partial = TRUE) |>
  is.data.frame()
[1] TRUE

Great, we know it is. The next step is to simply pipe the data frame to gt::gt() to create a table.

WIDE |>
  select(c(var1_1 : var1_3), group) |>   
  drop_na() |>
  group_by(group) |>
  correlation(partial = TRUE) |>
  gt::gt()
Group | Parameter1 | Parameter2 |         r |   CI |      CI_low |   CI_high |         t | df_error |            p | Method              | n_Obs
A     | var1_1     | var1_2     | 0.3091192 | 0.95 |  0.09147369 | 0.4985784 | 2.8149164 |       75 | 1.245767e-02 | Pearson correlation |    77
A     | var1_1     | var1_3     | 0.1476916 | 0.95 | -0.07889718 | 0.3597693 | 1.2932286 |       75 | 1.999012e-01 | Pearson correlation |    77
A     | var1_2     | var1_3     | 0.5430397 | 0.95 |  0.36324172 | 0.6838431 | 5.6005980 |       75 | 1.005176e-06 | Pearson correlation |    77
B     | var1_1     | var1_2     | 0.5111804 | 0.95 |  0.31061211 | 0.6681701 | 4.8318522 |       66 | 2.522863e-05 | Pearson correlation |    68
B     | var1_1     | var1_3     | 0.1080723 | 0.95 | -0.13380040 | 0.3377935 | 0.8831565 |       66 | 3.803573e-01 | Pearson correlation |    68
B     | var1_2     | var1_3     | 0.4253243 | 0.95 |  0.20799113 | 0.6026371 | 3.8178939 |       66 | 5.993056e-04 | Pearson correlation |    68

The table is somewhat busy, so we should clean it up a bit. We don’t need the Method variable, and we can round all of the numeric values in the data frame to improve its appearance. We need to pass arguments to .cols and .fns. Using {dplyr}, we can mutate() across() columns where() the data are numeric and then pass a function that will be applied to all of those columns. Let’s round to 3 digits. If you desire more granularity, you can round specific columns rather than applying the same rounding rule to all values.

WIDE |>
  select(c(var1_1 : var1_3), group) |>   
  drop_na() |>
  group_by(group) |>
  correlation(partial = TRUE) |>
  select(-Method) |>
  mutate(across(.cols = where(is.numeric),  # where the columns are numeric 
                .fns = ~round(.x, 3))       # apply the rounding function
         ) |>
  gt::gt()
Group | Parameter1 | Parameter2 |     r |   CI | CI_low | CI_high |     t | df_error |     p | n_Obs
A     | var1_1     | var1_2     | 0.309 | 0.95 |  0.091 |   0.499 | 2.815 |       75 | 0.012 |    77
A     | var1_1     | var1_3     | 0.148 | 0.95 | -0.079 |   0.360 | 1.293 |       75 | 0.200 |    77
A     | var1_2     | var1_3     | 0.543 | 0.95 |  0.363 |   0.684 | 5.601 |       75 | 0.000 |    77
B     | var1_1     | var1_2     | 0.511 | 0.95 |  0.311 |   0.668 | 4.832 |       66 | 0.000 |    68
B     | var1_1     | var1_3     | 0.108 | 0.95 | -0.134 |   0.338 | 0.883 |       66 | 0.380 |    68
B     | var1_2     | var1_3     | 0.425 | 0.95 |  0.208 |   0.603 | 3.818 |       66 | 0.001 |    68

Scatterplots using geom_point() and geom_smooth()

Using {ggplot2}, a point plot can be used to visualize the relationship between two variables.

WIDE |>
  #select(c(var1_1 : var1_3), group) |>
  ggplot(mapping = aes(
    x = var1_1, 
    y = var1_2
    )
    ) +
  geom_point() +
  theme_minimal()

To plot points from subgroups in different colors, map a variable to the color aesthetic.

WIDE |>
  ggplot(mapping = aes(
    x = var1_1, 
    y = var1_2
    )
    ) +
  geom_point(mapping = aes(col = group)) +
  theme_minimal()

If you want to add a fit line to the plot, add a geom_smooth() layer. Depending on how many rows of data there are, the default formula for the smoothing function may differ, so we will pass formula = y ~ x to geom_smooth().

WIDE |>
  ggplot(mapping = aes(
    x = var1_1, 
    y = var1_2
    )
    ) +
  geom_point() +
  geom_smooth(formula = y ~ x)

To add a linear fit line, use method = "lm".

WIDE |>
  ggplot(mapping = aes(
    x = var1_1, 
    y = var1_2
    )
    ) +
  geom_point() +
  geom_smooth(formula = y ~ x,
              method = "lm"
              ) +
  theme_minimal()

To plot separate fit lines for subgroups, map a variable to the color aesthetic.

WIDE |>
  ggplot(mapping = aes(
    x = var1_1, 
    y = var1_2
    )
    ) +
  geom_point(mapping = aes(col = group)) +
  geom_smooth(formula = y ~ x,
              method = "lm",
              mapping = aes(col = group)
              ) +
  theme_minimal()

To extend the fit lines across the full range of the plot, use fullrange = TRUE.

WIDE |>
  ggplot(mapping = aes(
    x = var1_1, 
    y = var1_2
    )
    ) +
  geom_point(mapping = aes(col = group)) +
  geom_smooth(formula = y ~ x,
              method = "lm",
              mapping = aes(col = group),
              fullrange = TRUE
              ) +
  theme_minimal()

Rather than using different colors, you can create a facet plot by subgroups using facet_wrap().

WIDE |>
  ggplot(mapping = aes(
    x = var1_1, 
    y = var1_2
    )
    ) +
  geom_point() +
  geom_smooth(formula = y ~ x,
              method = "lm",
              fullrange = TRUE
              ) +
  facet_wrap(facets = vars(group))

Generalized Pairs Plot using GGally::ggpairs()

Generalized pairs plots are data visualizations used to explore individual variables and the relationships across multiple variable pairs. They are often used in exploratory data analysis (EDA) to examine the relationships between multiple variables using point plots, bar plots, boxplots, density plots, histograms, etc. They are an extension of the traditional pairwise scatterplots created above using geom_point(), but they also allow for the visualization of relationships between more than two variables simultaneously.

GGally::ggpairs() makes creating generalized pairs plots easy.

ggpairs(
  data,
  mapping = NULL,
  columns = 1:ncol(data),
  title = NULL,
  upper = list(continuous = "cor", combo = "box_no_facet", discrete = "count", na = "na"),
  lower = list(continuous = "points", combo = "facethist", discrete = "facetbar", na =
    "na"),
  diag = list(continuous = "densityDiag", discrete = "barDiag", na = "naDiag"),
  params = NULL,
  ...
)

Key Parameters/Arguments

  • data: a data frame
  • mapping: for mapping aesthetics in ggplot objects

There are other parameters, but they are not needed for this brief introduction.

The simplest usage is to pipe a data frame containing your columns of interest to ggpairs() without passing any arguments to the function. Let’s assume first that you have only numeric data and no grouping variable. The pairs plot will provide xy point plots for all variable pairs, density plots for each variable, and Pearson’s r correlations for each variable pair.

WIDE |> 
  select(var1_1:var1_3) |>
  GGally::ggpairs()

When you have a grouping variable, you see a bar plot of sample sizes across groups, side-by-side histograms of the data for each subgroup across all variables, boxplots of the distributions for each subgroup, density plots without grouping, scatterplots, and correlations.

WIDE |> 
  select(group, var1_1:var1_3) |>
  GGally::ggpairs()

The {GGally} library is built on top of {ggplot2}, so aesthetics can be mapped as you would for ggplot objects.

WIDE |> 
  select(group, var1_1:var1_3) |>
  GGally::ggpairs(mapping = aes(col = group,    # set color by group
                                alpha = .3      # adjust transparency
                                )
                  ) 

If you do not like the default palette, you can modify the colors manually. One easy way is to create a vector of colors (e.g., c("cornflowerblue", "firebrick")) and pass it to the values parameter of scale_fill_manual(). This tells ggplot() to manually change the colors for the fill aesthetic.

new_colors <- c("cornflowerblue", "firebrick") 

WIDE |> 
  select(group, var1_1:var1_3) |>
  GGally::ggpairs(mapping = aes(col = group, 
                                alpha = .3
                                )
                  ) +
  scale_fill_manual(values = new_colors)

Whereas the fill color changed for the bars, boxplots, and density plots, the point color did not change. We also need to change the color aesthetic using scale_color_manual().

WIDE |> 
  select(group, var1_1:var1_3) |>
 
  GGally::ggpairs(mapping = aes(col = group, 
                                alpha = .3
                                )
                  ) +
  scale_color_manual(values = new_colors) +
  scale_fill_manual(values = new_colors)
Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 333 rows containing missing values
Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 349 rows containing missing values
Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
Removed 426 rows containing missing values
Warning: Removed 54 rows containing non-finite values (`stat_boxplot()`).
Warning: Removed 287 rows containing non-finite values (`stat_boxplot()`).
Warning: Removed 336 rows containing non-finite values (`stat_boxplot()`).
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 54 rows containing non-finite values (`stat_bin()`).
Warning: Removed 54 rows containing non-finite values (`stat_density()`).
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 287 rows containing non-finite values (`stat_bin()`).
Warning: Removed 333 rows containing missing values (`geom_point()`).
Warning: Removed 287 rows containing non-finite values (`stat_density()`).
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 336 rows containing non-finite values (`stat_bin()`).
Warning: Removed 349 rows containing missing values (`geom_point()`).
Warning: Removed 426 rows containing missing values (`geom_point()`).
Warning: Removed 336 rows containing non-finite values (`stat_density()`).

Voilà! The colors for both the fill and color aesthetics have changed.

Finally, if you desire to add fit lines to the plot, you can modify the lower parameter. The argument lower = list(continuous = wrap("smooth")) produces the fit lines; a full explanation is beyond the scope of this course. If you are curious, you can review the lower parameter and the wrap() function. If you want to remove the standard-error ribbon from the plot, passing se = FALSE will remove it.

WIDE |> 
  select(group, var1_1:var1_3) |>

  GGally::ggpairs(mapping = aes(col = group, 
                                alpha = .3
                                ),
                  lower = list(continuous = wrap("smooth", se = FALSE))
                  ) +
  scale_color_manual(values = new_colors) +
  scale_fill_manual(values = new_colors)

Summary

Examining variable relationships can be both exploratory and targeted. This module introduced some ways to examine relationships among variables numerically and graphically. Both approaches focused on examining linear relationships, which represent only one type of relationship (and model) of interest. More complicated models are necessary when relationships take on different forms. Generalized pairs plots both extend and facilitate the investigation of variables, both in isolation and in relation to other variables. They offer a way to communicate data visually in a variety of ways without requiring either extensive coding or multiple plots. Graphical depictions of relationships, however, do not distill data summaries to key metrics. In such cases, a table is your friend.

Session Information

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31 ucrt)
 os       Windows 11 x64 (build 22621)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.utf8
 ctype    English_United States.utf8
 tz       America/Los_Angeles
 date     2024-03-29
 pandoc   3.1.5 @ C:/PROGRA~1/Pandoc/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package      * version date (UTC) lib source
   bayestestR   * 0.13.1  2023-04-07 [1] RSPM (R 4.3.0)
   BiocManager    1.30.22 2023-08-08 [1] RSPM (R 4.3.0)
 P bit            4.0.5   2022-11-15 [?] CRAN (R 4.3.1)
 P bit64          4.0.5   2020-08-30 [?] CRAN (R 4.3.1)
 P cli            3.6.1   2023-03-23 [?] CRAN (R 4.3.1)
   coda           0.19-4  2020-09-30 [1] RSPM (R 4.3.0)
 P codetools      0.2-19  2023-02-01 [?] CRAN (R 4.3.1)
 P colorspace     2.1-0   2023-01-23 [?] CRAN (R 4.3.1)
   correlation  * 0.8.4   2023-04-06 [1] RSPM (R 4.3.0)
 P crayon         1.5.2   2022-09-29 [?] CRAN (R 4.3.1)
   datawizard   * 0.9.1   2023-12-21 [1] RSPM (R 4.3.0)
 P digest         0.6.33  2023-07-07 [?] CRAN (R 4.3.1)
   dplyr        * 1.1.4   2023-11-17 [1] RSPM (R 4.3.0)
   easystats    * 0.7.0   2023-11-05 [1] RSPM (R 4.3.0)
   effectsize   * 0.8.6   2023-09-14 [1] RSPM (R 4.3.0)
   emmeans        1.9.0   2023-12-18 [1] RSPM (R 4.3.0)
   estimability   1.4.1   2022-08-05 [1] RSPM (R 4.3.0)
 P evaluate       0.21    2023-05-05 [?] CRAN (R 4.3.1)
 P fansi          1.0.4   2023-01-22 [?] CRAN (R 4.3.1)
 P farver         2.1.1   2022-07-06 [?] CRAN (R 4.3.1)
 P fastmap        1.1.1   2023-02-24 [?] CRAN (R 4.3.1)
 P forcats      * 1.0.0   2023-01-29 [?] CRAN (R 4.3.1)
 P generics       0.1.3   2022-07-05 [?] CRAN (R 4.3.1)
   GGally       * 2.2.0   2023-11-22 [1] RSPM (R 4.3.0)
   ggplot2      * 3.4.4   2023-10-12 [1] RSPM (R 4.3.0)
   ggstats        0.5.1   2023-11-21 [1] RSPM (R 4.3.0)
   glue           1.6.2   2022-02-24 [1] RSPM (R 4.3.0)
   gt             0.10.0  2023-10-07 [1] RSPM (R 4.3.0)
 P gtable         0.3.4   2023-08-21 [?] CRAN (R 4.3.1)
   here           1.0.1   2020-12-13 [1] RSPM (R 4.3.0)
 P hms            1.1.3   2023-03-21 [?] CRAN (R 4.3.1)
   htmltools      0.5.7   2023-11-03 [1] RSPM (R 4.3.0)
   htmlwidgets    1.6.4   2023-12-06 [1] RSPM (R 4.3.0)
   insight      * 0.19.7  2023-11-26 [1] RSPM (R 4.3.0)
 P jsonlite       1.8.7   2023-06-29 [?] CRAN (R 4.3.1)
   knitr          1.45    2023-10-30 [1] RSPM (R 4.3.0)
 P labeling       0.4.2   2020-10-20 [?] CRAN (R 4.3.0)
 P lattice        0.21-8  2023-04-05 [?] CRAN (R 4.3.1)
 P lifecycle      1.0.3   2022-10-07 [?] CRAN (R 4.3.1)
   lubridate    * 1.9.3   2023-09-27 [1] RSPM (R 4.3.0)
   magrittr       2.0.3   2022-03-30 [1] RSPM (R 4.3.0)
 P MASS           7.3-60  2023-05-04 [?] CRAN (R 4.3.1)
 P Matrix         1.6-1   2023-08-14 [?] CRAN (R 4.3.1)
 P mgcv           1.8-42  2023-03-02 [?] CRAN (R 4.3.1)
   modelbased   * 0.8.6   2023-01-13 [1] RSPM (R 4.3.0)
   multcomp       1.4-25  2023-06-20 [1] RSPM (R 4.3.0)
 P munsell        0.5.0   2018-06-12 [?] CRAN (R 4.3.1)
 P mvtnorm        1.2-2   2023-06-08 [?] CRAN (R 4.3.1)
 P nlme           3.1-162 2023-01-31 [?] CRAN (R 4.3.1)
   parameters   * 0.21.3  2023-11-02 [1] RSPM (R 4.3.0)
   performance  * 0.10.8  2023-10-30 [1] RSPM (R 4.3.0)
 P pillar         1.9.0   2023-03-22 [?] CRAN (R 4.3.1)
 P pkgconfig      2.0.3   2019-09-22 [?] CRAN (R 4.3.1)
   plyr           1.8.9   2023-10-02 [1] RSPM (R 4.3.0)
 P purrr        * 1.0.2   2023-08-10 [?] CRAN (R 4.3.1)
 P R.methodsS3    1.8.2   2022-06-13 [?] CRAN (R 4.3.0)
 P R.oo           1.25.0  2022-06-12 [?] CRAN (R 4.3.0)
 P R.utils        2.12.2  2022-11-11 [?] CRAN (R 4.3.1)
 P R6             2.5.1   2021-08-19 [?] CRAN (R 4.3.1)
 P RColorBrewer   1.1-3   2022-04-03 [?] CRAN (R 4.3.0)
 P Rcpp           1.0.11  2023-07-06 [?] CRAN (R 4.3.1)
   readr        * 2.1.4   2023-02-10 [1] RSPM (R 4.3.0)
   renv           1.0.3   2023-09-19 [1] RSPM (R 4.3.0)
   report       * 0.5.8   2023-12-07 [1] RSPM (R 4.3.0)
 P rlang          1.1.1   2023-04-28 [?] CRAN (R 4.3.1)
   rmarkdown      2.25    2023-09-18 [1] RSPM (R 4.3.0)
 P rprojroot      2.0.3   2022-04-02 [?] CRAN (R 4.3.1)
   rstudioapi     0.15.0  2023-07-07 [1] RSPM (R 4.3.0)
   sandwich       3.1-0   2023-12-11 [1] RSPM (R 4.3.0)
 P sass           0.4.7   2023-07-15 [?] CRAN (R 4.3.1)
   scales         1.3.0   2023-11-28 [1] RSPM (R 4.3.0)
   see          * 0.8.1   2023-11-03 [1] RSPM (R 4.3.0)
   sessioninfo    1.2.2   2021-12-06 [1] RSPM (R 4.3.0)
 P stringi        1.7.12  2023-01-11 [?] CRAN (R 4.3.0)
   stringr      * 1.5.1   2023-11-14 [1] RSPM (R 4.3.0)
 P survival       3.5-7   2023-08-14 [?] CRAN (R 4.3.2)
   TH.data        1.1-2   2023-04-17 [1] RSPM (R 4.3.0)
 P tibble       * 3.2.1   2023-03-20 [?] CRAN (R 4.3.1)
 P tidyr        * 1.3.0   2023-01-24 [?] CRAN (R 4.3.1)
 P tidyselect     1.2.0   2022-10-10 [?] CRAN (R 4.3.1)
   tidyverse    * 2.0.0   2023-02-22 [1] RSPM (R 4.3.0)
 P timechange     0.2.0   2023-01-11 [?] CRAN (R 4.3.1)
 P tzdb           0.4.0   2023-05-12 [?] CRAN (R 4.3.1)
   utf8           1.2.4   2023-10-22 [1] RSPM (R 4.3.0)
   vctrs          0.6.5   2023-12-01 [1] RSPM (R 4.3.0)
   vroom        * 1.6.5   2023-12-05 [1] RSPM (R 4.3.0)
 P withr          2.5.0   2022-03-03 [?] CRAN (R 4.3.1)
 P xfun           0.40    2023-08-09 [?] CRAN (R 4.3.1)
 P xml2           1.3.5   2023-07-06 [?] CRAN (R 4.3.1)
   xtable         1.8-4   2019-04-21 [1] RSPM (R 4.3.0)
 P yaml           2.3.7   2023-01-23 [?] CRAN (R 4.3.0)
   zoo            1.8-12  2023-04-13 [1] RSPM (R 4.3.0)

 [1] C:/Users/gcook/Sync/git/fods24/renv/library/R-4.3/x86_64-w64-mingw32
 [2] C:/Users/gcook/AppData/Local/R/cache/R/renv/sandbox/R-4.3/x86_64-w64-mingw32/5b568fc0

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────