Data frame manipulation and wrangling

Author

Gabriel I. Cook

Published

October 31, 2024

Overview

In this module, we will start manipulating data frames. Data frames are composed of column variables and row observations/cases, and we will address how to perform operations on both. We will learn how to use {dplyr}, rather than base R, to select, remove, add, and modify vectors in data frames. In addition, we will address how to filter and arrange rows in various ways. The focus will be on cleaning or wrangling data. Helper functions from libraries like {tidyr}, {stringr}, and {tidyselect} are also used in conjunction with {dplyr}.

Readings and Preparation

Before Class: First, read to familiarize yourself with the concepts rather than master them. I will assume that you attend class with some basic understanding of the concepts and of how the functions work. The goal of reading should be to understand and implement code functions, as well as to support your understanding and your troubleshooting of problems. This cannot happen if you just read the content without interacting with it; at the same time, reading is absolutely essential to being successful during class time. Work through some examples so that you have a good idea of your level of understanding and confidence.

Class: In class, some functions and concepts will be introduced and we will practice implementing code through exercises.

Supplementary Readings (Optional)

Libraries

  • {here} 1.0.1: for path management
  • {dplyr} 1.1.4: for selecting, filtering, and mutating
  • {stringr} 1.5.1: for working with strings
  • {tidyselect} 1.2.1: for selecting sets from strings

External Functions

view_html(): for viewing data frames in HTML format, which I created as an alternative to View(). You can source() this from your /src/functions directory or download from here.
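For example, assuming the function is saved in a file named view_html.R inside /src/functions (the file name here is an assumption), you could source that single file:

source(here::here("src", "functions", "view_html.R"))   # hypothetical file name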

To define all functions in your /src/functions directory, R.utils::sourceDirectory() will be helpful:

R.utils::sourceDirectory(here::here("src", "functions"))

Loading Libraries

We will work with a few different libraries for data manipulation. Let’s load them into our work space using library().

library(dplyr)
library(stringr)
library(tidyselect)

Make note of any warnings that appear when loading libraries. Some libraries contain functions with the same names; the function from the most recently loaded library will take precedence (mask the other). You can avoid confusion by using :: to call a function from a particular library (e.g., libraryname::functionname()).
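For example, both {dplyr} and {stats} define a function named filter(); the prefix removes any ambiguity about which one you are calling:

dplyr::filter(USArrests, Murder > 15)   # the data-frame filter from {dplyr}
stats::filter(1:10, rep(1/3, 3))        # the time-series filter from {stats}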

A Grammar for Data Wrangling

The {dplyr} package presents a type of grammar for wrangling data (Wickham and Francois 2014). {dplyr} is part of the {tidyverse} ecosystem which loads using library(tidyverse), or specifically using library(dplyr). {dplyr} also imports functions from {tidyselect}, which is also part of the {tidyverse} ecosystem.

The functions from any R package can be called after loading the package library. One does not need to reference the package explicitly when calling a function but to avoid confusion, we will often reference the package/library and the function together (e.g., package::function()).

Working with data involves creating new column variables, removing or renaming column variables, removing or deleting row observations, sorting rows, and summarizing data (often by groups/factors). Consequently, there are five main function verbs for working with data in a data frame: select, filter, mutate, arrange, and summarize (a short sketch chaining several of them appears after the list below). Their corresponding functions are:

  • select(dataframe, variables_to_select): select columns to retain
  • mutate(dataframe, variables_to_create) and dplyr::rename(): add or modify existing columns
  • filter(dataframe, rows_to_select): subset by rows
  • arrange(dataframe, variable_to_sort_by): sort rows
  • summarize(dataframe, procedures_for_summarizing) in conjunction with dplyr::group_by(): aggregate the data in some way
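Here is a minimal sketch (using the built-in USArrests data) of several of these verbs chained together with the |> pipe; each function is covered in detail below:

USArrests |>
  select(Murder, UrbanPop) |>               # keep two columns
  filter(UrbanPop > 80) |>                  # keep rows for highly urban states
  mutate(murder_per_10k = Murder / 10) |>   # add a derived column
  arrange(desc(murder_per_10k))             # sort rows, highest value first

# summarize() is typically paired with group_by()
USArrests |>
  group_by(high_urban = UrbanPop > 70) |>   # a made-up grouping variable
  summarize(mean_murder = mean(Murder))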

Some Common Ways for Selecting Variables Using {dplyr}

You may often wish to subset data frames containing multiple variables to make them more lean and manageable for carrying out a specific goal. For such tasks, select() will be used.

Selecting Variables using dplyr::select()

You can reference variables by their names or by their column position in a data frame (e.g., 1, 25, etc.).

Using select(), you can select columns/variables from a data frame. The variables you select are retained; any variables you do not select are omitted from the returned data frame.

Before using a function you have never used before, you should always review the documentation to learn what the function is designed to accomplish.

help(dplyr::select)

Or:

?dplyr::select

You can see that select() is used for selecting variables from a data frame and there are many selection features for accomplishing such tasks.

Parameters/Arguments:

  • .data: a data frame
  • ...: one or more expressions separated by commas

We will definitely need a data frame and we will need variables to select.

Let’s start with a data frame. The USArrests data set is built into R so let’s just use that for now. Let’s use select() to subset a data frame in two ways, first without piping and then with piping (the preferred method).

  1. Without piping

At a bare minimum, we will need to pass a data frame to select(). But what happens if we don’t pass variables to select?

select(.data = USArrests)
data frame with 0 columns and 50 rows

select() is designed to take a data frame and select column variables from that data frame. If you do not specify what to select, you will see that the returned data frame has 0 columns and 50 rows.

  2. Piping using |>

If you pipe the data frame into the function, .data = USArrests is inherited and you do not need to declare it within select().

USArrests |>
  select()
data frame with 0 columns and 50 rows

Again, the returned data frame has 0 columns and 50 rows. If you use select() without specifying any variables for the function to select, as seen in both examples, you will have a data frame that contains no columns. Let’s address some methods for selecting variables.

Selecting Variables by Column Position

Variables are columns in a data frame, so they can be selected by their element position. If you wanted to select the first and third columns, one way is to specify each position separated by a comma.

USArrests |>        # take the data frame
  select(1, 3)      # select columns 1 and 3
               Murder UrbanPop
Alabama          13.2       58
Alaska           10.0       48
Arizona           8.1       80
Arkansas          8.8       50
California        9.0       91
Colorado          7.9       78
Connecticut       3.3       77
Delaware          5.9       72
Florida          15.4       80
Georgia          17.4       60
Hawaii            5.3       83
Idaho             2.6       54
Illinois         10.4       83
Indiana           7.2       65
Iowa              2.2       57
Kansas            6.0       66
Kentucky          9.7       52
Louisiana        15.4       66
Maine             2.1       51
Maryland         11.3       67
Massachusetts     4.4       85
Michigan         12.1       74
Minnesota         2.7       66
Mississippi      16.1       44
Missouri          9.0       70
Montana           6.0       53
Nebraska          4.3       62
Nevada           12.2       81
New Hampshire     2.1       56
New Jersey        7.4       89
New Mexico       11.4       70
New York         11.1       86
North Carolina   13.0       45
North Dakota      0.8       44
Ohio              7.3       75
Oklahoma          6.6       68
Oregon            4.9       67
Pennsylvania      6.3       72
Rhode Island      3.4       87
South Carolina   14.4       48
South Dakota      3.8       45
Tennessee        13.2       59
Texas            12.7       80
Utah              3.2       80
Vermont           2.2       32
Virginia          8.5       63
Washington        4.0       73
West Virginia     5.7       39
Wisconsin         2.6       66
Wyoming           6.8       60

Let’s remind ourselves of how piping works. Remember that piping modifies the data frame in steps, with each pipe passing the current form of the data frame to the next function. Not that you would ever want to write two lines of code and perform this extra step to select only column 1, but you could pipe the data frame to another select().

USArrests |>        # take the data frame
  select(1, 3) |>   # select columns 1 and 3
  select(1)         # then select column 1 only
               Murder
Alabama          13.2
Alaska           10.0
Arizona           8.1
Arkansas          8.8
California        9.0
Colorado          7.9
Connecticut       3.3
Delaware          5.9
Florida          15.4
Georgia          17.4
Hawaii            5.3
Idaho             2.6
Illinois         10.4
Indiana           7.2
Iowa              2.2
Kansas            6.0
Kentucky          9.7
Louisiana        15.4
Maine             2.1
Maryland         11.3
Massachusetts     4.4
Michigan         12.1
Minnesota         2.7
Mississippi      16.1
Missouri          9.0
Montana           6.0
Nebraska          4.3
Nevada           12.2
New Hampshire     2.1
New Jersey        7.4
New Mexico       11.4
New York         11.1
North Carolina   13.0
North Dakota      0.8
Ohio              7.3
Oklahoma          6.6
Oregon            4.9
Pennsylvania      6.3
Rhode Island      3.4
South Carolina   14.4
South Dakota      3.8
Tennessee        13.2
Texas            12.7
Utah              3.2
Vermont           2.2
Virginia          8.5
Washington        4.0
West Virginia     5.7
Wisconsin         2.6
Wyoming           6.8

The better approach is just to select column 1 in a single line.

USArrests |>
  select(1)
               Murder
Alabama          13.2
Alaska           10.0
Arizona           8.1
Arkansas          8.8
California        9.0
Colorado          7.9
Connecticut       3.3
Delaware          5.9
Florida          15.4
Georgia          17.4
Hawaii            5.3
Idaho             2.6
Illinois         10.4
Indiana           7.2
Iowa              2.2
Kansas            6.0
Kentucky          9.7
Louisiana        15.4
Maine             2.1
Maryland         11.3
Massachusetts     4.4
Michigan         12.1
Minnesota         2.7
Mississippi      16.1
Missouri          9.0
Montana           6.0
Nebraska          4.3
Nevada           12.2
New Hampshire     2.1
New Jersey        7.4
New Mexico       11.4
New York         11.1
North Carolina   13.0
North Dakota      0.8
Ohio              7.3
Oklahoma          6.6
Oregon            4.9
Pennsylvania      6.3
Rhode Island      3.4
South Carolina   14.4
South Dakota      3.8
Tennessee        13.2
Texas            12.7
Utah              3.2
Vermont           2.2
Virginia          8.5
Washington        4.0
West Virginia     5.7
Wisconsin         2.6
Wyoming           6.8

Selecting Variables by Column Name

Rather than position, you will likely wish to select by variable name. We will address several approaches for doing so. Each approach is beneficial in different contexts.

In order to know the names of the variables in a data frame, you can pass the data frame to the names() function, which will return the column names.

USArrests |>
  names()
[1] "Murder"   "Assault"  "UrbanPop" "Rape"    

Or pass the data frame to head() to see the names atop each column.

USArrests |>
  head()
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
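Another option is dplyr::glimpse(), which prints each column name along with its type and first few values:

USArrests |>
  glimpse()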

There are many ways to inspect variable names but those will suffice for now.

Selecting Variables as Unquoted Expressions

Variables can be passed separately without quotes. We can pass two arguments, one with Murder and the other with Assault, separating each with a comma. Because of some special characteristics of {dplyr}, we can use variable names without quotes. Let’s also reduce the returned data frame using head() in order to save space.

USArrests |> 
  select(Murder, Assault) |> 
  head()
           Murder Assault
Alabama      13.2     236
Alaska       10.0     263
Arizona       8.1     294
Arkansas      8.8     190
California    9.0     276
Colorado      7.9     204

Selecting Variables as Characters/Strings

You are most likely familiar with variable names as quoted strings. We can pass two arguments here using the variable names in quotes. This approach is more similar to procedures used with base R functions as you cannot pass unquoted expressions in base R.

USArrests |> 
  select("Murder", "Assault") |> 
  head()
           Murder Assault
Alabama      13.2     236
Alaska       10.0     263
Arizona       8.1     294
Arkansas      8.8     190
California    9.0     276
Colorado      7.9     204

Selecting Variables as a Character Vector

What if your variables were in a vector, c("Murder", "Assault")?

USArrests |>
  select(c("Murder", "Assault")) |> 
  head()
           Murder Assault
Alabama      13.2     236
Alaska       10.0     263
Arizona       8.1     294
Arkansas      8.8     190
California    9.0     276
Colorado      7.9     204

The same will work if you pass a numeric vector, c(1, 3).
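As a quick check of that claim, the following returns the same columns as the positional example shown earlier:

USArrests |>
  select(c(1, 3)) |> 
  head()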

Selecting Using an External Vector Object

In many use cases, you will have a vector of variable names that you will want to pass rather than type individually. This approach also reduces confusion when you have key variables that you want to work with. Create a vector using c() (to combine elements) once and then use it again and again. If you make any changes, you will only have to make them once, rather than in multiple places.

First, create the vector using c() that contains two string elements.

c("Murder", "Assault")
[1] "Murder"  "Assault"

At this point, there is no object holding the vector. We can assign the vector to a name so that we have a vector object. Let’s name it keep_vars to represent variables in the data frame that we want to keep.

Assign the vector to keep_vars using the assignment operator, <-:

keep_vars <- c("Murder", "Assault")

Inspecting the keep_vars object by typing its name will return it. You can see that it contains 2 character/string elements.

keep_vars
[1] "Murder"  "Assault"

Now that we know what we are dealing with, pass the vector object as an argument to select():

USArrests |>
  select(keep_vars) 
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(keep_vars)

  # Now:
  data %>% select(all_of(keep_vars))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
               Murder Assault
Alabama          13.2     236
Alaska           10.0     263
Arizona           8.1     294
Arkansas          8.8     190
California        9.0     276
Colorado          7.9     204
Connecticut       3.3     110
Delaware          5.9     238
Florida          15.4     335
Georgia          17.4     211
Hawaii            5.3      46
Idaho             2.6     120
Illinois         10.4     249
Indiana           7.2     113
Iowa              2.2      56
Kansas            6.0     115
Kentucky          9.7     109
Louisiana        15.4     249
Maine             2.1      83
Maryland         11.3     300
Massachusetts     4.4     149
Michigan         12.1     255
Minnesota         2.7      72
Mississippi      16.1     259
Missouri          9.0     178
Montana           6.0     109
Nebraska          4.3     102
Nevada           12.2     252
New Hampshire     2.1      57
New Jersey        7.4     159
New Mexico       11.4     285
New York         11.1     254
North Carolina   13.0     337
North Dakota      0.8      45
Ohio              7.3     120
Oklahoma          6.6     151
Oregon            4.9     159
Pennsylvania      6.3     106
Rhode Island      3.4     174
South Carolina   14.4     279
South Dakota      3.8      86
Tennessee        13.2     188
Texas            12.7     201
Utah              3.2     120
Vermont           2.2      48
Virginia          8.5     156
Washington        4.0     145
West Virginia     5.7      81
Wisconsin         2.6      53
Wyoming           6.8     161

Hmm. Warning message.

Note: Passing a bare external vector like select(keep_vars) is deprecated, so if you see this approach someplace, it won’t work properly now or in the future. The warning message states that you cannot do this without some modification. What is suggested is all_of(). When your variables are in an external vector, all_of(keep_vars) tells select() that you want all of the variables named in that vector. Yes, this is an extra step, but it is worth the flexibility.

USArrests |>
  select(all_of(keep_vars)) |> 
  head()
           Murder Assault
Alabama      13.2     236
Alaska       10.0     263
Arizona       8.1     294
Arkansas      8.8     190
California    9.0     276
Colorado      7.9     204

Using all_of() could be problematic if even one variable in your vector does not exist. Let’s add some variables that are not in the data frame and then pass that vector.

keep_vars_more <- c("Murder", "Assault", "Var_x", "Var_y", "Var_z")

USArrests |>
  select(all_of(keep_vars_more)) |> 
  head()

You will see the following error message:

Error in `all_of()`:
! Can't subset columns that don't exist.
✖ Columns `Var_x`, `Var_y`, and `Var_z` don't exist.

Tip: Rather than using tidyselect::all_of(), use tidyselect::any_of(). In this instance, any of the variables that exist in the data frame will be selected. Those variables that are not in the data frame will be ignored. With this approach, your code will not break if something changes.

USArrests |>
  select(any_of(keep_vars_more)) |> 
  head()
           Murder Assault
Alabama      13.2     236
Alaska       10.0     263
Arizona       8.1     294
Arkansas      8.8     190
California    9.0     276
Colorado      7.9     204

Selecting Variables Starting with or Ending with Certain Characters

When you load the {dplyr} library, functions from other libraries are imported along with {dplyr}’s own functions. These imported functions are designed to work with each other, so the people who maintain the libraries have packaged them up nicely so you don’t have to load separate libraries.

Many of the functions are imported from the {tidyselect} library and these functions are what give you additional manipulation ability. Some imported functions are: all_of(), any_of(), contains(), ends_with(), everything(), last_col(), matches(), and starts_with().
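Two of those imported helpers are worth a quick sketch here. everything() selects all remaining columns (handy for reordering) and last_col() selects the final column:

USArrests |>
  select(UrbanPop, everything()) |>   # move UrbanPop to the front, keep the rest
  head()

USArrests |>
  select(last_col()) |>               # select only the last column (Rape)
  head()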

With functions like starts_with(), contains(), and ends_with(), you can select variables containing character patterns in their names. For example, rather than code out all variables for a memory span task, you might want to grab variables that contain the characters "span".

Rather than hard coding variable names (e.g., "measure_time1", "measure_time2", c("Murder", "Assault"), etc.) to pass as an argument to select(), you can use another function to do the heavy lifting for you (e.g., starts_with()). Such a function is often referred to as a helper function. In the case of starts_with(), the names of the variables that start with a specific character pattern get passed to select(). This is a good example of functional programming, and of working smarter, not harder: rather than coding exactly what to pass as an argument, you use another function and pass its returned object as that argument.

Selecting Variables Using starts_with()

Understanding starts_with()

We first need to understand what dplyr::starts_with() actually does. The function name should provide you some insight but for more information, use ?starts_with. You should always read the docs on functions so that you know how they work. You cannot brute force a function to do something it does not do. One thing to keep in mind is that starts_with() and many other functions operate on vectors and not data frames. Yes, data frames are composed of vectors, but you cannot apply many functions to an entire data frame without performing some iteration of the function for each vector in that data frame.

starts_with(match, 
            ignore.case = TRUE, 
            vars = NULL
            )

Parameters/Arguments:

  • match: a character vector
  • ignore.case: if TRUE, the default, ignores case when matching names. This is most flexible.
  • vars: a character vector of variable names. If not supplied, the variables are taken from the current selection context (as established by functions like select() or pivot_longer()).

The best way to understand the function is to use it (after reading the documentation). Let’s try out starts_with(). Let’s set a required pattern with match = some character and, because vars = NULL by default, let’s also set vars = some character vector. If you reviewed the documentation or looked at the arguments above, you will see that vars is not the second parameter, so you will want to name it in the function call if it occupies that second position, as seen below. We will first test this function without using it on a data frame.

We set match = "a" and vars = c("Hello", "Hi", "Bye"). This will look for character "a" to match the elements of a character vector c("Hello", "Hi", "Bye").

starts_with(match = "a", 
            vars = c("Hello", "Hi", "Bye")
            )
integer(0)

What is returned is integer(0), which is R’s way of saying “there is no match”. Not very intuitive. Hmm, OK. Let’s try another character.

starts_with(match = "b", 
            vars = c("Hello", "Hi", "Bye")
            )
[1] 3

OK, so now an integer is returned (yes, try is.integer() if you don’t believe me).

is.integer(
  starts_with("b", 
              vars = c("Hello", "Hi", "Bye")
              ))
[1] TRUE

And even though you only see one value, this object is still a vector. We can check using is.vector().

is.vector(
  starts_with("b", 
              vars = c("Hello", "Hi", "Bye")
              )
  )
[1] TRUE

Importantly, starts_with() returns a vector of values. What do the values correspond to? Each value refers to the element index/position of a match, and the number of elements in the vector refers to the number of matches found. Because the third string element "Bye" starts with "b", that’s what is returned.

Try something else:

starts_with("h", 
            vars = c("Hello", "Hi", "Bye")
            )
[1] 1 2

Now the function returns a vector containing 1 and 2, giving length = 2, indicating that both the first and the second elements start with “h”.

Importantly, letter casing is ignored because ignore.case = TRUE by default. Set to FALSE if you want your match to be case sensitive.

starts_with("h", 
            vars = c("Hello", "Hi", "Bye"), 
            ignore.case = FALSE
            )
integer(0)

And now there are no matches.

Selecting Variables from a Data Frame using starts_with()

You will typically use starts_with() when working with a data frame and in conjunction with other functions, like select(), in order to select variables that start with certain characters. When you use starts_with() in the context of select(), the vars argument is filled in automatically with the names of the columns of the data frame passed to select().

From .data, select variable names that start with this pattern of characters.

Example:

select(.data = mydataframe,
       starts_with(match = "my pattern",
                   vars = "var names of mydataframe"
                   )
      )

Let’s provide two use cases:

  1. Without Piping

You can pass the data frame as the first argument to select(). You can then pass the (now modified) data frame with a |> pipe to head() in order to see the first 6 rows only.

select(.data = USArrests, 
       starts_with("m")
       ) |> 
  head()
           Murder
Alabama      13.2
Alaska       10.0
Arizona       8.1
Arkansas      8.8
California    9.0
Colorado      7.9

As long as you remember that the first argument is the data frame, you can omit the parameter name.

select(USArrests, 
       starts_with("m")
       ) |> 
  head()
           Murder
Alabama      13.2
Alaska       10.0
Arizona       8.1
Arkansas      8.8
California    9.0
Colorado      7.9

Notice that the only column variable that is returned is Murder.

  2. With Piping

A better approach, and one used more commonly, is to start with the data frame and then |> it to your functions. Such an approach also makes code editing easier.

Pipe the data frame into select(), which assumes the first argument is a data frame. This is a process you will see used often by others.

USArrests |>
  select(starts_with("m")) |> 
  head()
           Murder
Alabama      13.2
Alaska       10.0
Arizona       8.1
Arkansas      8.8
California    9.0
Colorado      7.9

WARNING: When piping data frames into functions, some students set a data frame within select() as shown below. This will either cause R to throw an error message or will lead to you working with incorrect data. This is especially true if you are specifying different data frames. Always remember that functions inherit what has been piped to them, either data frames or vectors.

Do not also set .data = USArrests:

USArrests |>
  select(.data = USArrests, 
         starts_with("m")
         ) |> 
  head()
Selecting Variables using ends_with()

You can use dplyr::ends_with() to find matches at the end of strings. This function is useful if you have variables whose names end with their measurement type (e.g., “_rt” or “_acc” for reaction times and accuracy, respectively). Having the foresight to name variables in such a way that they are easy to access later is a good example of working smarter, not harder. A hypothetical illustration of this naming scheme appears after the example below.

USArrests |>
  select(ends_with("t")) |> 
  head()
           Assault
Alabama        236
Alaska         263
Arizona        294
Arkansas       190
California     276
Colorado       204
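To illustrate the “_rt”/“_acc” naming idea mentioned above, here is a small sketch with a made-up data frame (the object and variable names are hypothetical):

task_data <- data.frame(
  id         = 1:3,
  recall_rt  = c(512, 430, 488),   # made-up reaction times (ms)
  recall_acc = c(.90, .85, .95),   # made-up accuracy proportions
  span_rt    = c(600, 555, 610),
  span_acc   = c(.80, .75, .88)
)

task_data |>
  select(ends_with("_rt"))   # returns recall_rt and span_rt only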

Selecting and Selecting Out Variables By/Between Index

So far, we have addressed selecting variables. There are, however, many approaches not only for selecting but also for selecting out, or omitting, column variables. In other words, rather than including 14 out of 15 variables, you could simply exclude 1 of the 15 variables to achieve the same result.

You can pass multiple arguments for each specification or you can pass a single vector that contains all specifications.

  • select(1, 2): select first and second columns

  • select(c(1, 2)): select first and second columns

  • select(-c(1, 2)) or select(!c(1, 2)): select out first and second columns

  • select(1:2): select first through second columns

  • select(c(1:2)): select first through second columns

  • select(-c(1:2)) or select(!c(1:2)): select out first through second columns

Potential Recommendation: Use options utilizing c() to pass a vector because this habit will be more versatile with base R functionality. However, online solutions will likely not take this approach.

Let’s first make a data frame to work with. We will name the data frame DAT even though this is a terrible name because it lacks diagnosticity for what the data represent. Using data.frame(), define variable names and for each assign a vector of elements of the same length. Some vectors are numeric and others are character/string as seen by the quotes. They are assigned to the name DAT using the <- operator. Within the function, however, notice that you cannot assign vectors to variables using <-. Inside functions, you use = for assignment, or rather for passing arguments to their parameters.

DAT <- 
  data.frame(
    Id  = c(100, 101, 102, 103, 104, 100, 105),
    Sex = c('male', 'female', 'Male', NA, 'man', "male", "neither"),
    Age = c(25, 33, 27, 40, 44, 25, 40),
    Renting = c("yes", NA, "yes", NA, "no", "yes", "yes")
)

Select columns 1 and 2:

DAT |>
  select(1, 2) 
   Id     Sex
1 100    male
2 101  female
3 102    Male
4 103    <NA>
5 104     man
6 100    male
7 105 neither

Select columns 1 and 2 as a vector containing values 1 and 2:

DAT |>
  select(c(1, 2)) 
   Id     Sex
1 100    male
2 101  female
3 102    Male
4 103    <NA>
5 104     man
6 100    male
7 105 neither

Select out columns 1 and 2 as a vector containing values 1 and 2:

DAT |>
  select(-c(1,2)) 
  Age Renting
1  25     yes
2  33    <NA>
3  27     yes
4  40    <NA>
5  44      no
6  25     yes
7  40     yes

Or not these columns using !:

DAT |>
  select(!c(1,2)) 
  Age Renting
1  25     yes
2  33    <NA>
3  27     yes
4  40    <NA>
5  44      no
6  25     yes
7  40     yes

Select from columns 1 through 2 using the : operator:

DAT |>
  select(1:2) 
   Id     Sex
1 100    male
2 101  female
3 102    Male
4 103    <NA>
5 104     man
6 100    male
7 105 neither

Select from columns 1 through 3 using a vector (notice the inclusion of c()) containing the : operator:

DAT |>
  select(c(1:3)) 
   Id     Sex Age
1 100    male  25
2 101  female  33
3 102    Male  27
4 103    <NA>  40
5 104     man  44
6 100    male  25
7 105 neither  40

Select out columns 1 through 3 using -:

DAT |>
  select(-c(1:3))   
  Renting
1     yes
2    <NA>
3     yes
4    <NA>
5      no
6     yes
7     yes

Selecting and Selecting Out Variables By or Between Character Name

These approaches are similar to those offered earlier except that some involve passing variables by their name (e.g., character names). Whereas the order of the variables in a data frame may move around, the names may be more stable or permanent, at least after you have cleaned up the names. Consequently, passing variables by name may be more foolproof.

You don’t have to be familiar with all approaches and you may settle on using one that makes the most sense to you.

  • select("var1", "var2")

  • select(c("var1", "var2"))

  • select(-c("var1", "var2"))

  • select(!c("var1", "var2"))

  • select(var1:var2)

  • select(c("var1":"var2"))

  • select(-c("var1":"var2"))

  • select(!c("var1":"var2"))

Recommendation: use options utilizing c() as this will be more versatile with base R functionality.

These function approaches also work but they may lead to some confusion regarding usage of quotes:

  • select(var1, var2)
  • select(c(var1, var2))
  • select(-c(var1, var2))
  • select(!c(var1, var2))

Select variables Id through Age using the : operator:

DAT |>
  select(Id:Age)          # select from here to there
   Id     Sex Age
1 100    male  25
2 101  female  33
3 102    Male  27
4 103    <NA>  40
5 104     man  44
6 100    male  25
7 105 neither  40

Select variables Id through Age passed as strings using the : operator:

DAT |>
  select("Id":"Age")      # select from here to there
   Id     Sex Age
1 100    male  25
2 101  female  33
3 102    Male  27
4 103    <NA>  40
5 104     man  44
6 100    male  25
7 105 neither  40

Select variables Id through Age as a vector containing the variable names passed as strings and using the : operator:

DAT |>
  select(c("Id":"Age"))    # select from here to there
   Id     Sex Age
1 100    male  25
2 101  female  33
3 102    Male  27
4 103    <NA>  40
5 104     man  44
6 100    male  25
7 105 neither  40

Select out, or remove, variables Id through Age as a vector (e.g., c()) containing the variable names passed as strings and using the : operator:

DAT |>
  select(-c("Id":"Age"))     # select out from here to there
  Renting
1     yes
2    <NA>
3     yes
4    <NA>
5      no
6     yes
7     yes

You can also use the ! operator to select NOT these variables (and therefore all the others):

DAT |>
  select(!c("Id":"Age"))                    # select out from here to there
  Renting
1     yes
2    <NA>
3     yes
4    <NA>
5      no
6     yes
7     yes

Sometimes you will use another function to help select() variables from the data frame. This is the case when you want to select variables without specifying them by name or position but rather by some quality. For example, you may want to select variables with similar patterns in their names, or select variables that are numeric only.

When you use one function inside of another function, the inner function may be referred to as a helper function. In the following examples, we use starts_with(), ends_with(), and contains() as helper functions for select().

Selecting and Selecting Out Variables by Characters in Their Names

  • select(starts_with("some character or set of characters"))
  • select(ends_with("some character or set of characters"))
  • select(contains("some character or set of characters"))

Selecting using starts_with()

Select variables which start with character “i”:

DAT |> 
  select(starts_with("i"))
   Id
1 100
2 101
3 102
4 103
5 104
6 100
7 105

Select variables which DO NOT start with character “s”:

DAT |> 
  select(-starts_with("s"))
   Id Age Renting
1 100  25     yes
2 101  33    <NA>
3 102  27     yes
4 103  40    <NA>
5 104  44      no
6 100  25     yes
7 105  40     yes

Selecting using ends_with()

Select variables which end with character “e”:

DAT |> 
  select(ends_with("e"))
  Age
1  25
2  33
3  27
4  40
5  44
6  25
7  40

Select variables which DO NOT end with character “e”:

DAT |> 
  select(-ends_with("e"))
   Id     Sex Renting
1 100    male     yes
2 101  female    <NA>
3 102    Male     yes
4 103    <NA>    <NA>
5 104     man      no
6 100    male     yes
7 105 neither     yes

Selecting using contains()

Select variables which contain character “g”:

DAT |> 
  select(contains("g"))
  Age Renting
1  25     yes
2  33    <NA>
3  27     yes
4  40    <NA>
5  44      no
6  25     yes
7  40     yes

Select variables which DO NOT contain character “g”:

DAT |> 
  select(-contains("g"))
   Id     Sex
1 100    male
2 101  female
3 102    Male
4 103    <NA>
5 104     man
6 100    male
7 105 neither

Selecting using regular expressions

Sometimes you do not want to search for specific character strings. Instead, you might want to look for strings that contain particular character patterns. For example, typing out var1, var2, var3 could be annoying. You might prefer just grabbing variables that match the pattern "var".

We will use some regular expressions, or regex, to help us. For the regex, we will need to specify some code for finding patterns. For example, \\d{4} will search for patterns with 4 digits, whereas the dot star .* will match any pattern. In particular, . refers to any character (e.g., digit, alpha character, or any other special character) and * means zero or more times, so \\d{4}.* would match any variable name that starts with 4 digits followed by anything. A more specific pattern will restrict the match to names containing that exact character string. Some examples will be helpful.
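If you want to sanity-check a pattern before using it inside select(), you can test it on a simple character vector with base R’s grepl(), which returns TRUE for each element that matches:

test_names <- c("var_1", "var_11", "var_3", "var1_var", "Age")

grepl("\\d",  test_names)   # contains a digit anywhere: TRUE TRUE TRUE TRUE FALSE
grepl("\\d$", test_names)   # ends with a digit:         TRUE TRUE TRUE FALSE FALSE
grepl(".*",   test_names)   # matches anything:          all TRUE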

A brief note on mutate(). In order to work through some examples using regular expressions, we will add, or create, new variables in the data frame. The dplyr::mutate() function will be explained in detail later, but for now just understand that it can be used to add variables. We will add 4 new variables with very similar names, which also contain different numbers of digits. Some digits will be at the end of the variable name and some digits will be contained someplace within the name. For simplicity, we will set all rows of each variable equal to the value of 1. Each variable will be passed as an argument to mutate() and thus separated by a comma. When they are coded on separate lines, they are easier to see and remove. Focus on making code easy to read.

DAT |> 
  mutate(var_1    = 1,
         var_11   = 1,
         var_3    = 1,
         var1_var = 1
         )
Selecting variables of all patterns using ".*"

To select variables using a regular expression, use matches():

.* will grab all column variable names. As a regular expression, it translates to any character and any number of instances. Put quotes around it to make it a string.

DAT |> 
  select(matches(".*")) 
   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes
Selecting variables with digits using "\\d"

Add the new variables and pipe (|>) into select(). The regular expression \\d will grab all variables containing a digit. Put quotes around it so that it is a string:

DAT |>                 
  mutate(var_1    = 1,
         var_11   = 1,
         var_3    = 1,
         var1_var = 1
         ) |>
  select(matches("\\d"))
  var_1 var_11 var_3 var1_var
1     1      1     1        1
2     1      1     1        1
3     1      1     1        1
4     1      1     1        1
5     1      1     1        1
6     1      1     1        1
7     1      1     1        1
Selecting variables that start with and contain characters using "v.*\\d"

v.*\\d will grab all variables whose names contain a v followed by any characters and then a digit (for our new variables, the v happens to be at the start):

DAT |>
    mutate(var_1    = 1,
           var_11   = 1,
           var_3    = 1,
           var1_var = 1
         ) |>
  select(matches("v.*\\d"))
  var_1 var_11 var_3 var1_var
1     1      1     1        1
2     1      1     1        1
3     1      1     1        1
4     1      1     1        1
5     1      1     1        1
6     1      1     1        1
7     1      1     1        1
Selecting variables that end with a digit using "\\d$"

\\d$ will grab all variables ending in a digit ($ means at the end):

DAT |> 
    mutate(var_1    = 1,
           var_11   = 1,
           var_3    = 1,
           var1_var = 1
           ) |>
  select(matches("\\d$"))
  var_1 var_11 var_3
1     1      1     1
2     1      1     1
3     1      1     1
4     1      1     1
5     1      1     1
6     1      1     1
7     1      1     1

Selecting variables by negation

You can also negate a regular expression match. For example, if you want to exclude variable names ending with a digit, place - (or !) in front of matches():

DAT |>
  mutate(var_1    = 1,
         var_11   = 1,
         var_3    = 1,
         var1_var = 1
         ) |>
  select(-matches("\\d$"))
   Id     Sex Age Renting var1_var
1 100    male  25     yes        1
2 101  female  33    <NA>        1
3 102    Male  27     yes        1
4 103    <NA>  40    <NA>        1
5 104     man  44      no        1
6 100    male  25     yes        1
7 105 neither  40     yes        1

Note: The functions will return both lowercase and uppercase variable name matches because the default behavior is ignore.case = TRUE. Set it to FALSE if you want to perform precise surgery on the variable names, as sketched below.
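A brief sketch of that behavior using matches() on DAT (defined above): the case-insensitive default finds Age, while the case-sensitive call finds nothing:

DAT |>
  select(matches("age"))                        # ignore.case = TRUE (default) returns Age

DAT |>
  select(matches("age", ignore.case = FALSE))   # no lowercase "age" in any name: 0 columns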

Selecting and Selecting Out Variables by Type

A very useful function for selecting variables by type is from the {tidyselect} library, tidyselect::where(). Examine how it works.

help(where)

It has one main parameter, fn, to which you pass another function. where() will return variables for which the function argument returns TRUE. Some common functions one might wish to pass as arguments to fn include is.numeric() for numeric columns, is.character() for character/string columns, or is.factor() for factor columns. For all of these, the is. part of the function name informs you that the function is asking whether something is of that type.

For example,

is.numeric(c(1, 2, 3, 4))
[1] TRUE
is.numeric(c("1", "2", "3", "4"))
[1] FALSE

The complement to is.*() is as.*(), which is used to convert vectors. Convert, then check.

is.numeric(as.numeric(c("1", "2", "3", "4")))
[1] TRUE

One tricky part of using where() is that you cannot include the () after the function name. Instead, we need to drop those off and pass the bare function name. Let’s try this with the data frame.

Select variables that are numeric:

DAT |>
  select(where(is.numeric))
   Id Age
1 100  25
2 101  33
3 102  27
4 103  40
5 104  44
6 100  25
7 105  40

Now, select() will return only the numeric columns.
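where() also accepts an anonymous function, which is useful when the condition is more specific than a simple type check. For example, assuming you wanted only the numeric columns whose mean exceeds 50 (an arbitrary threshold for illustration):

DAT |>
  select(where(\(x) is.numeric(x) && mean(x) > 50))   # keeps Id only; the mean of Age is about 33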

Select variables that are NOT numeric:

DAT |> 
  select(-where(is.numeric))
      Sex Renting
1    male     yes
2  female    <NA>
3    Male     yes
4    <NA>    <NA>
5     man      no
6    male     yes
7 neither     yes

Select variables that are character:

DAT |> 
  select(where(is.character))
      Sex Renting
1    male     yes
2  female    <NA>
3    Male     yes
4    <NA>    <NA>
5     man      no
6    male     yes
7 neither     yes

Select variables that are NOT character:

DAT |>
  select(-where(is.character))
   Id Age
1 100  25
2 101  33
3 102  27
4 103  40
5 104  44
6 100  25
7 105  40

Select variables that are logical (TRUE or FALSE):

DAT |> 
  select(where(is.logical))
data frame with 0 columns and 7 rows

Select variables that are NOT logical (TRUE or FALSE):

DAT |> 
  select(-where(is.logical))
   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes

Creating/Mutating Variables

When we want to add new variables or modify existing ones, we will use mutate(). Before using the function, review what it does using ?mutate.

mutate(
  .data,
  ...,
  .by = NULL,
  .keep = c("all", "used", "unused", "none"),
  .before = NULL,
  .after = NULL
)

Parameters/Arguments:

  • .data: a data frame
  • ...: name-value pairs; name is the name of the column and the value would be the value or function used

Let’s start with a constant and move up from there.

Creating/Mutating A Constant Value

Let’s create a new variable, new_var1, and set it equal to 1.

USArrests |>
  mutate(new_var1 = 1)
               Murder Assault UrbanPop Rape new_var1
Alabama          13.2     236       58 21.2        1
Alaska           10.0     263       48 44.5        1
Arizona           8.1     294       80 31.0        1
Arkansas          8.8     190       50 19.5        1
California        9.0     276       91 40.6        1
Colorado          7.9     204       78 38.7        1
Connecticut       3.3     110       77 11.1        1
Delaware          5.9     238       72 15.8        1
Florida          15.4     335       80 31.9        1
Georgia          17.4     211       60 25.8        1
Hawaii            5.3      46       83 20.2        1
Idaho             2.6     120       54 14.2        1
Illinois         10.4     249       83 24.0        1
Indiana           7.2     113       65 21.0        1
Iowa              2.2      56       57 11.3        1
Kansas            6.0     115       66 18.0        1
Kentucky          9.7     109       52 16.3        1
Louisiana        15.4     249       66 22.2        1
Maine             2.1      83       51  7.8        1
Maryland         11.3     300       67 27.8        1
Massachusetts     4.4     149       85 16.3        1
Michigan         12.1     255       74 35.1        1
Minnesota         2.7      72       66 14.9        1
Mississippi      16.1     259       44 17.1        1
Missouri          9.0     178       70 28.2        1
Montana           6.0     109       53 16.4        1
Nebraska          4.3     102       62 16.5        1
Nevada           12.2     252       81 46.0        1
New Hampshire     2.1      57       56  9.5        1
New Jersey        7.4     159       89 18.8        1
New Mexico       11.4     285       70 32.1        1
New York         11.1     254       86 26.1        1
North Carolina   13.0     337       45 16.1        1
North Dakota      0.8      45       44  7.3        1
Ohio              7.3     120       75 21.4        1
Oklahoma          6.6     151       68 20.0        1
Oregon            4.9     159       67 29.3        1
Pennsylvania      6.3     106       72 14.9        1
Rhode Island      3.4     174       87  8.3        1
South Carolina   14.4     279       48 22.5        1
South Dakota      3.8      86       45 12.8        1
Tennessee        13.2     188       59 26.9        1
Texas            12.7     201       80 25.5        1
Utah              3.2     120       80 22.9        1
Vermont           2.2      48       32 11.2        1
Virginia          8.5     156       63 20.7        1
Washington        4.0     145       73 26.2        1
West Virginia     5.7      81       39  9.3        1
Wisconsin         2.6      53       66 10.8        1
Wyoming           6.8     161       60 15.6        1

To mutate more than one variable, you can add a new name-value pair as a new argument:

USArrests |>
  mutate(new_var1 = 1,
         new_var2 = 9999
         )
               Murder Assault UrbanPop Rape new_var1 new_var2
Alabama          13.2     236       58 21.2        1     9999
Alaska           10.0     263       48 44.5        1     9999
Arizona           8.1     294       80 31.0        1     9999
Arkansas          8.8     190       50 19.5        1     9999
California        9.0     276       91 40.6        1     9999
Colorado          7.9     204       78 38.7        1     9999
Connecticut       3.3     110       77 11.1        1     9999
Delaware          5.9     238       72 15.8        1     9999
Florida          15.4     335       80 31.9        1     9999
Georgia          17.4     211       60 25.8        1     9999
Hawaii            5.3      46       83 20.2        1     9999
Idaho             2.6     120       54 14.2        1     9999
Illinois         10.4     249       83 24.0        1     9999
Indiana           7.2     113       65 21.0        1     9999
Iowa              2.2      56       57 11.3        1     9999
Kansas            6.0     115       66 18.0        1     9999
Kentucky          9.7     109       52 16.3        1     9999
Louisiana        15.4     249       66 22.2        1     9999
Maine             2.1      83       51  7.8        1     9999
Maryland         11.3     300       67 27.8        1     9999
Massachusetts     4.4     149       85 16.3        1     9999
Michigan         12.1     255       74 35.1        1     9999
Minnesota         2.7      72       66 14.9        1     9999
Mississippi      16.1     259       44 17.1        1     9999
Missouri          9.0     178       70 28.2        1     9999
Montana           6.0     109       53 16.4        1     9999
Nebraska          4.3     102       62 16.5        1     9999
Nevada           12.2     252       81 46.0        1     9999
New Hampshire     2.1      57       56  9.5        1     9999
New Jersey        7.4     159       89 18.8        1     9999
New Mexico       11.4     285       70 32.1        1     9999
New York         11.1     254       86 26.1        1     9999
North Carolina   13.0     337       45 16.1        1     9999
North Dakota      0.8      45       44  7.3        1     9999
Ohio              7.3     120       75 21.4        1     9999
Oklahoma          6.6     151       68 20.0        1     9999
Oregon            4.9     159       67 29.3        1     9999
Pennsylvania      6.3     106       72 14.9        1     9999
Rhode Island      3.4     174       87  8.3        1     9999
South Carolina   14.4     279       48 22.5        1     9999
South Dakota      3.8      86       45 12.8        1     9999
Tennessee        13.2     188       59 26.9        1     9999
Texas            12.7     201       80 25.5        1     9999
Utah              3.2     120       80 22.9        1     9999
Vermont           2.2      48       32 11.2        1     9999
Virginia          8.5     156       63 20.7        1     9999
Washington        4.0     145       73 26.2        1     9999
West Virginia     5.7      81       39  9.3        1     9999
Wisconsin         2.6      53       66 10.8        1     9999
Wyoming           6.8     161       60 15.6        1     9999

Or you can chain another mutate() by piping (note that because the same name is reused here, the second call overwrites new_var1 rather than adding a second variable):

USArrests |>
  mutate(new_var1 = 1) |>  # add the variable 
  mutate(new_var1 = 9999)  # then overwrite it with a new value
               Murder Assault UrbanPop Rape new_var1
Alabama          13.2     236       58 21.2     9999
Alaska           10.0     263       48 44.5     9999
Arizona           8.1     294       80 31.0     9999
Arkansas          8.8     190       50 19.5     9999
California        9.0     276       91 40.6     9999
Colorado          7.9     204       78 38.7     9999
Connecticut       3.3     110       77 11.1     9999
Delaware          5.9     238       72 15.8     9999
Florida          15.4     335       80 31.9     9999
Georgia          17.4     211       60 25.8     9999
Hawaii            5.3      46       83 20.2     9999
Idaho             2.6     120       54 14.2     9999
Illinois         10.4     249       83 24.0     9999
Indiana           7.2     113       65 21.0     9999
Iowa              2.2      56       57 11.3     9999
Kansas            6.0     115       66 18.0     9999
Kentucky          9.7     109       52 16.3     9999
Louisiana        15.4     249       66 22.2     9999
Maine             2.1      83       51  7.8     9999
Maryland         11.3     300       67 27.8     9999
Massachusetts     4.4     149       85 16.3     9999
Michigan         12.1     255       74 35.1     9999
Minnesota         2.7      72       66 14.9     9999
Mississippi      16.1     259       44 17.1     9999
Missouri          9.0     178       70 28.2     9999
Montana           6.0     109       53 16.4     9999
Nebraska          4.3     102       62 16.5     9999
Nevada           12.2     252       81 46.0     9999
New Hampshire     2.1      57       56  9.5     9999
New Jersey        7.4     159       89 18.8     9999
New Mexico       11.4     285       70 32.1     9999
New York         11.1     254       86 26.1     9999
North Carolina   13.0     337       45 16.1     9999
North Dakota      0.8      45       44  7.3     9999
Ohio              7.3     120       75 21.4     9999
Oklahoma          6.6     151       68 20.0     9999
Oregon            4.9     159       67 29.3     9999
Pennsylvania      6.3     106       72 14.9     9999
Rhode Island      3.4     174       87  8.3     9999
South Carolina   14.4     279       48 22.5     9999
South Dakota      3.8      86       45 12.8     9999
Tennessee        13.2     188       59 26.9     9999
Texas            12.7     201       80 25.5     9999
Utah              3.2     120       80 22.9     9999
Vermont           2.2      48       32 11.2     9999
Virginia          8.5     156       63 20.7     9999
Washington        4.0     145       73 26.2     9999
West Virginia     5.7      81       39  9.3     9999
Wisconsin         2.6      53       66 10.8     9999
Wyoming           6.8     161       60 15.6     9999

Whatever your approach, keep each variable specification on a separate line. Don’t put both variable specifications on the same line just to save space as seen here:

USArrests |>
  mutate(new_var1 = 1, new_var1 = 9999)   
               Murder Assault UrbanPop Rape new_var1
Alabama          13.2     236       58 21.2     9999
Alaska           10.0     263       48 44.5     9999
Arizona           8.1     294       80 31.0     9999
Arkansas          8.8     190       50 19.5     9999
California        9.0     276       91 40.6     9999
Colorado          7.9     204       78 38.7     9999
Connecticut       3.3     110       77 11.1     9999
Delaware          5.9     238       72 15.8     9999
Florida          15.4     335       80 31.9     9999
Georgia          17.4     211       60 25.8     9999
Hawaii            5.3      46       83 20.2     9999
Idaho             2.6     120       54 14.2     9999
Illinois         10.4     249       83 24.0     9999
Indiana           7.2     113       65 21.0     9999
Iowa              2.2      56       57 11.3     9999
Kansas            6.0     115       66 18.0     9999
Kentucky          9.7     109       52 16.3     9999
Louisiana        15.4     249       66 22.2     9999
Maine             2.1      83       51  7.8     9999
Maryland         11.3     300       67 27.8     9999
Massachusetts     4.4     149       85 16.3     9999
Michigan         12.1     255       74 35.1     9999
Minnesota         2.7      72       66 14.9     9999
Mississippi      16.1     259       44 17.1     9999
Missouri          9.0     178       70 28.2     9999
Montana           6.0     109       53 16.4     9999
Nebraska          4.3     102       62 16.5     9999
Nevada           12.2     252       81 46.0     9999
New Hampshire     2.1      57       56  9.5     9999
New Jersey        7.4     159       89 18.8     9999
New Mexico       11.4     285       70 32.1     9999
New York         11.1     254       86 26.1     9999
North Carolina   13.0     337       45 16.1     9999
North Dakota      0.8      45       44  7.3     9999
Ohio              7.3     120       75 21.4     9999
Oklahoma          6.6     151       68 20.0     9999
Oregon            4.9     159       67 29.3     9999
Pennsylvania      6.3     106       72 14.9     9999
Rhode Island      3.4     174       87  8.3     9999
South Carolina   14.4     279       48 22.5     9999
South Dakota      3.8      86       45 12.8     9999
Tennessee        13.2     188       59 26.9     9999
Texas            12.7     201       80 25.5     9999
Utah              3.2     120       80 22.9     9999
Vermont           2.2      48       32 11.2     9999
Virginia          8.5     156       63 20.7     9999
Washington        4.0     145       73 26.2     9999
West Virginia     5.7      81       39  9.3     9999
Wisconsin         2.6      53       66 10.8     9999
Wyoming           6.8     161       60 15.6     9999

Although there is nothing technically incorrect with doing so, (1) this is harder to read, (2) the R community won’t like it, and importantly (3) it is less flexible when you want to temporarily comment out a line of your code as seen here.

USArrests |>
#  mutate(new_var1 = 1) |>  # add the variable 
  mutate(new_var1 = 9999)  # then add another variable 
               Murder Assault UrbanPop Rape new_var1
Alabama          13.2     236       58 21.2     9999
Alaska           10.0     263       48 44.5     9999
Arizona           8.1     294       80 31.0     9999
Arkansas          8.8     190       50 19.5     9999
California        9.0     276       91 40.6     9999
Colorado          7.9     204       78 38.7     9999
Connecticut       3.3     110       77 11.1     9999
Delaware          5.9     238       72 15.8     9999
Florida          15.4     335       80 31.9     9999
Georgia          17.4     211       60 25.8     9999
Hawaii            5.3      46       83 20.2     9999
Idaho             2.6     120       54 14.2     9999
Illinois         10.4     249       83 24.0     9999
Indiana           7.2     113       65 21.0     9999
Iowa              2.2      56       57 11.3     9999
Kansas            6.0     115       66 18.0     9999
Kentucky          9.7     109       52 16.3     9999
Louisiana        15.4     249       66 22.2     9999
Maine             2.1      83       51  7.8     9999
Maryland         11.3     300       67 27.8     9999
Massachusetts     4.4     149       85 16.3     9999
Michigan         12.1     255       74 35.1     9999
Minnesota         2.7      72       66 14.9     9999
Mississippi      16.1     259       44 17.1     9999
Missouri          9.0     178       70 28.2     9999
Montana           6.0     109       53 16.4     9999
Nebraska          4.3     102       62 16.5     9999
Nevada           12.2     252       81 46.0     9999
New Hampshire     2.1      57       56  9.5     9999
New Jersey        7.4     159       89 18.8     9999
New Mexico       11.4     285       70 32.1     9999
New York         11.1     254       86 26.1     9999
North Carolina   13.0     337       45 16.1     9999
North Dakota      0.8      45       44  7.3     9999
Ohio              7.3     120       75 21.4     9999
Oklahoma          6.6     151       68 20.0     9999
Oregon            4.9     159       67 29.3     9999
Pennsylvania      6.3     106       72 14.9     9999
Rhode Island      3.4     174       87  8.3     9999
South Carolina   14.4     279       48 22.5     9999
South Dakota      3.8      86       45 12.8     9999
Tennessee        13.2     188       59 26.9     9999
Texas            12.7     201       80 25.5     9999
Utah              3.2     120       80 22.9     9999
Vermont           2.2      48       32 11.2     9999
Virginia          8.5     156       63 20.7     9999
Washington        4.0     145       73 26.2     9999
West Virginia     5.7      81       39  9.3     9999
Wisconsin         2.6      53       66 10.8     9999
Wyoming           6.8     161       60 15.6     9999

Same for:

USArrests |>
  mutate(#new_var1 = 1,
         new_var2 = 9999
         )
               Murder Assault UrbanPop Rape new_var2
Alabama          13.2     236       58 21.2     9999
Alaska           10.0     263       48 44.5     9999
Arizona           8.1     294       80 31.0     9999
Arkansas          8.8     190       50 19.5     9999
California        9.0     276       91 40.6     9999
Colorado          7.9     204       78 38.7     9999
Connecticut       3.3     110       77 11.1     9999
Delaware          5.9     238       72 15.8     9999
Florida          15.4     335       80 31.9     9999
Georgia          17.4     211       60 25.8     9999
Hawaii            5.3      46       83 20.2     9999
Idaho             2.6     120       54 14.2     9999
Illinois         10.4     249       83 24.0     9999
Indiana           7.2     113       65 21.0     9999
Iowa              2.2      56       57 11.3     9999
Kansas            6.0     115       66 18.0     9999
Kentucky          9.7     109       52 16.3     9999
Louisiana        15.4     249       66 22.2     9999
Maine             2.1      83       51  7.8     9999
Maryland         11.3     300       67 27.8     9999
Massachusetts     4.4     149       85 16.3     9999
Michigan         12.1     255       74 35.1     9999
Minnesota         2.7      72       66 14.9     9999
Mississippi      16.1     259       44 17.1     9999
Missouri          9.0     178       70 28.2     9999
Montana           6.0     109       53 16.4     9999
Nebraska          4.3     102       62 16.5     9999
Nevada           12.2     252       81 46.0     9999
New Hampshire     2.1      57       56  9.5     9999
New Jersey        7.4     159       89 18.8     9999
New Mexico       11.4     285       70 32.1     9999
New York         11.1     254       86 26.1     9999
North Carolina   13.0     337       45 16.1     9999
North Dakota      0.8      45       44  7.3     9999
Ohio              7.3     120       75 21.4     9999
Oklahoma          6.6     151       68 20.0     9999
Oregon            4.9     159       67 29.3     9999
Pennsylvania      6.3     106       72 14.9     9999
Rhode Island      3.4     174       87  8.3     9999
South Carolina   14.4     279       48 22.5     9999
South Dakota      3.8      86       45 12.8     9999
Tennessee        13.2     188       59 26.9     9999
Texas            12.7     201       80 25.5     9999
Utah              3.2     120       80 22.9     9999
Vermont           2.2      48       32 11.2     9999
Virginia          8.5     156       63 20.7     9999
Washington        4.0     145       73 26.2     9999
West Virginia     5.7      81       39  9.3     9999
Wisconsin         2.6      53       66 10.8     9999
Wyoming           6.8     161       60 15.6     9999

Creating/Mutating A Constant Character

USArrests |>
  mutate(new_var1 = "Below Average")
               Murder Assault UrbanPop Rape      new_var1
Alabama          13.2     236       58 21.2 Below Average
Alaska           10.0     263       48 44.5 Below Average
Arizona           8.1     294       80 31.0 Below Average
Arkansas          8.8     190       50 19.5 Below Average
California        9.0     276       91 40.6 Below Average
Colorado          7.9     204       78 38.7 Below Average
Connecticut       3.3     110       77 11.1 Below Average
Delaware          5.9     238       72 15.8 Below Average
Florida          15.4     335       80 31.9 Below Average
Georgia          17.4     211       60 25.8 Below Average
Hawaii            5.3      46       83 20.2 Below Average
Idaho             2.6     120       54 14.2 Below Average
Illinois         10.4     249       83 24.0 Below Average
Indiana           7.2     113       65 21.0 Below Average
Iowa              2.2      56       57 11.3 Below Average
Kansas            6.0     115       66 18.0 Below Average
Kentucky          9.7     109       52 16.3 Below Average
Louisiana        15.4     249       66 22.2 Below Average
Maine             2.1      83       51  7.8 Below Average
Maryland         11.3     300       67 27.8 Below Average
Massachusetts     4.4     149       85 16.3 Below Average
Michigan         12.1     255       74 35.1 Below Average
Minnesota         2.7      72       66 14.9 Below Average
Mississippi      16.1     259       44 17.1 Below Average
Missouri          9.0     178       70 28.2 Below Average
Montana           6.0     109       53 16.4 Below Average
Nebraska          4.3     102       62 16.5 Below Average
Nevada           12.2     252       81 46.0 Below Average
New Hampshire     2.1      57       56  9.5 Below Average
New Jersey        7.4     159       89 18.8 Below Average
New Mexico       11.4     285       70 32.1 Below Average
New York         11.1     254       86 26.1 Below Average
North Carolina   13.0     337       45 16.1 Below Average
North Dakota      0.8      45       44  7.3 Below Average
Ohio              7.3     120       75 21.4 Below Average
Oklahoma          6.6     151       68 20.0 Below Average
Oregon            4.9     159       67 29.3 Below Average
Pennsylvania      6.3     106       72 14.9 Below Average
Rhode Island      3.4     174       87  8.3 Below Average
South Carolina   14.4     279       48 22.5 Below Average
South Dakota      3.8      86       45 12.8 Below Average
Tennessee        13.2     188       59 26.9 Below Average
Texas            12.7     201       80 25.5 Below Average
Utah              3.2     120       80 22.9 Below Average
Vermont           2.2      48       32 11.2 Below Average
Virginia          8.5     156       63 20.7 Below Average
Washington        4.0     145       73 26.2 Below Average
West Virginia     5.7      81       39  9.3 Below Average
Wisconsin         2.6      53       66 10.8 Below Average
Wyoming           6.8     161       60 15.6 Below Average

Creating/Mutating A Logical Object

Logical objects are TRUE or FALSE. We can perform the logical test by asking whether Assault > Murder. If the Assault value for a row is greater than the Murder value, TRUE will be returned, otherwise FALSE.

USArrests |>
  mutate(new_var = Assault > Murder)
               Murder Assault UrbanPop Rape new_var
Alabama          13.2     236       58 21.2    TRUE
Alaska           10.0     263       48 44.5    TRUE
Arizona           8.1     294       80 31.0    TRUE
Arkansas          8.8     190       50 19.5    TRUE
California        9.0     276       91 40.6    TRUE
Colorado          7.9     204       78 38.7    TRUE
Connecticut       3.3     110       77 11.1    TRUE
Delaware          5.9     238       72 15.8    TRUE
Florida          15.4     335       80 31.9    TRUE
Georgia          17.4     211       60 25.8    TRUE
Hawaii            5.3      46       83 20.2    TRUE
Idaho             2.6     120       54 14.2    TRUE
Illinois         10.4     249       83 24.0    TRUE
Indiana           7.2     113       65 21.0    TRUE
Iowa              2.2      56       57 11.3    TRUE
Kansas            6.0     115       66 18.0    TRUE
Kentucky          9.7     109       52 16.3    TRUE
Louisiana        15.4     249       66 22.2    TRUE
Maine             2.1      83       51  7.8    TRUE
Maryland         11.3     300       67 27.8    TRUE
Massachusetts     4.4     149       85 16.3    TRUE
Michigan         12.1     255       74 35.1    TRUE
Minnesota         2.7      72       66 14.9    TRUE
Mississippi      16.1     259       44 17.1    TRUE
Missouri          9.0     178       70 28.2    TRUE
Montana           6.0     109       53 16.4    TRUE
Nebraska          4.3     102       62 16.5    TRUE
Nevada           12.2     252       81 46.0    TRUE
New Hampshire     2.1      57       56  9.5    TRUE
New Jersey        7.4     159       89 18.8    TRUE
New Mexico       11.4     285       70 32.1    TRUE
New York         11.1     254       86 26.1    TRUE
North Carolina   13.0     337       45 16.1    TRUE
North Dakota      0.8      45       44  7.3    TRUE
Ohio              7.3     120       75 21.4    TRUE
Oklahoma          6.6     151       68 20.0    TRUE
Oregon            4.9     159       67 29.3    TRUE
Pennsylvania      6.3     106       72 14.9    TRUE
Rhode Island      3.4     174       87  8.3    TRUE
South Carolina   14.4     279       48 22.5    TRUE
South Dakota      3.8      86       45 12.8    TRUE
Tennessee        13.2     188       59 26.9    TRUE
Texas            12.7     201       80 25.5    TRUE
Utah              3.2     120       80 22.9    TRUE
Vermont           2.2      48       32 11.2    TRUE
Virginia          8.5     156       63 20.7    TRUE
Washington        4.0     145       73 26.2    TRUE
West Virginia     5.7      81       39  9.3    TRUE
Wisconsin         2.6      53       66 10.8    TRUE
Wyoming           6.8     161       60 15.6    TRUE

We see that every state (row) has a higher assault rate than murder rate. No surprise.
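To check a claim like this without scanning all 50 rows, you can sum the logical vector, because sum() treats TRUE as 1 and FALSE as 0. A quick sketch using the $ operator:

sum(USArrests$Assault > USArrests$Murder)   # counts the TRUE rows; here, all 50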

Creating/Mutating Based on Function

Here, the value in the name-value pair will come from a function. For example, let’s create a variable based on the mean() of an existing variable.

USArrests |>
  mutate(Mean_Assault = mean(Assault))
               Murder Assault UrbanPop Rape Mean_Assault
Alabama          13.2     236       58 21.2       170.76
Alaska           10.0     263       48 44.5       170.76
Arizona           8.1     294       80 31.0       170.76
Arkansas          8.8     190       50 19.5       170.76
California        9.0     276       91 40.6       170.76
Colorado          7.9     204       78 38.7       170.76
Connecticut       3.3     110       77 11.1       170.76
Delaware          5.9     238       72 15.8       170.76
Florida          15.4     335       80 31.9       170.76
Georgia          17.4     211       60 25.8       170.76
Hawaii            5.3      46       83 20.2       170.76
Idaho             2.6     120       54 14.2       170.76
Illinois         10.4     249       83 24.0       170.76
Indiana           7.2     113       65 21.0       170.76
Iowa              2.2      56       57 11.3       170.76
Kansas            6.0     115       66 18.0       170.76
Kentucky          9.7     109       52 16.3       170.76
Louisiana        15.4     249       66 22.2       170.76
Maine             2.1      83       51  7.8       170.76
Maryland         11.3     300       67 27.8       170.76
Massachusetts     4.4     149       85 16.3       170.76
Michigan         12.1     255       74 35.1       170.76
Minnesota         2.7      72       66 14.9       170.76
Mississippi      16.1     259       44 17.1       170.76
Missouri          9.0     178       70 28.2       170.76
Montana           6.0     109       53 16.4       170.76
Nebraska          4.3     102       62 16.5       170.76
Nevada           12.2     252       81 46.0       170.76
New Hampshire     2.1      57       56  9.5       170.76
New Jersey        7.4     159       89 18.8       170.76
New Mexico       11.4     285       70 32.1       170.76
New York         11.1     254       86 26.1       170.76
North Carolina   13.0     337       45 16.1       170.76
North Dakota      0.8      45       44  7.3       170.76
Ohio              7.3     120       75 21.4       170.76
Oklahoma          6.6     151       68 20.0       170.76
Oregon            4.9     159       67 29.3       170.76
Pennsylvania      6.3     106       72 14.9       170.76
Rhode Island      3.4     174       87  8.3       170.76
South Carolina   14.4     279       48 22.5       170.76
South Dakota      3.8      86       45 12.8       170.76
Tennessee        13.2     188       59 26.9       170.76
Texas            12.7     201       80 25.5       170.76
Utah              3.2     120       80 22.9       170.76
Vermont           2.2      48       32 11.2       170.76
Virginia          8.5     156       63 20.7       170.76
Washington        4.0     145       73 26.2       170.76
West Virginia     5.7      81       39  9.3       170.76
Wisconsin         2.6      53       66 10.8       170.76
Wyoming           6.8     161       60 15.6       170.76

Notice that the new column contains the mean for all rows.
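Functions can also appear inside arithmetic expressions. As a small sketch (the column name Assault_centered is arbitrary), you could mean-center a variable by subtracting its mean from each row:

USArrests |>
  mutate(Assault_centered = Assault - mean(Assault))   # deviation from the mean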

We may also wish to categorize rows based on their relative Assault data. How about assault lower than the mean assault?

USArrests |>
  mutate(Lower_than_Mean_Assault = Assault < mean(Assault))
               Murder Assault UrbanPop Rape Lower_than_Mean_Assault
Alabama          13.2     236       58 21.2                   FALSE
Alaska           10.0     263       48 44.5                   FALSE
Arizona           8.1     294       80 31.0                   FALSE
Arkansas          8.8     190       50 19.5                   FALSE
California        9.0     276       91 40.6                   FALSE
Colorado          7.9     204       78 38.7                   FALSE
Connecticut       3.3     110       77 11.1                    TRUE
Delaware          5.9     238       72 15.8                   FALSE
Florida          15.4     335       80 31.9                   FALSE
Georgia          17.4     211       60 25.8                   FALSE
Hawaii            5.3      46       83 20.2                    TRUE
Idaho             2.6     120       54 14.2                    TRUE
Illinois         10.4     249       83 24.0                   FALSE
Indiana           7.2     113       65 21.0                    TRUE
Iowa              2.2      56       57 11.3                    TRUE
Kansas            6.0     115       66 18.0                    TRUE
Kentucky          9.7     109       52 16.3                    TRUE
Louisiana        15.4     249       66 22.2                   FALSE
Maine             2.1      83       51  7.8                    TRUE
Maryland         11.3     300       67 27.8                   FALSE
Massachusetts     4.4     149       85 16.3                    TRUE
Michigan         12.1     255       74 35.1                   FALSE
Minnesota         2.7      72       66 14.9                    TRUE
Mississippi      16.1     259       44 17.1                   FALSE
Missouri          9.0     178       70 28.2                   FALSE
Montana           6.0     109       53 16.4                    TRUE
Nebraska          4.3     102       62 16.5                    TRUE
Nevada           12.2     252       81 46.0                   FALSE
New Hampshire     2.1      57       56  9.5                    TRUE
New Jersey        7.4     159       89 18.8                    TRUE
New Mexico       11.4     285       70 32.1                   FALSE
New York         11.1     254       86 26.1                   FALSE
North Carolina   13.0     337       45 16.1                   FALSE
North Dakota      0.8      45       44  7.3                    TRUE
Ohio              7.3     120       75 21.4                    TRUE
Oklahoma          6.6     151       68 20.0                    TRUE
Oregon            4.9     159       67 29.3                    TRUE
Pennsylvania      6.3     106       72 14.9                    TRUE
Rhode Island      3.4     174       87  8.3                   FALSE
South Carolina   14.4     279       48 22.5                   FALSE
South Dakota      3.8      86       45 12.8                    TRUE
Tennessee        13.2     188       59 26.9                   FALSE
Texas            12.7     201       80 25.5                   FALSE
Utah              3.2     120       80 22.9                    TRUE
Vermont           2.2      48       32 11.2                    TRUE
Virginia          8.5     156       63 20.7                    TRUE
Washington        4.0     145       73 26.2                    TRUE
West Virginia     5.7      81       39  9.3                    TRUE
Wisconsin         2.6      53       66 10.8                    TRUE
Wyoming           6.8     161       60 15.6                    TRUE

TRUE and FALSE make terrible variable levels, however. Using ifelse(), we can specify conditions for categorization. For example, rows that are below the mean take on one value; otherwise, rows take on a different value.

Example:

ifelse(Assault < Mean_Assault, "Below", "Equal or Above")

If the Assault value is < Mean_Assault, assign the row a value of "Below"; otherwise, assign "Equal or Above". Let’s use ifelse() to create a name-value pair:

Relative_Assult = ifelse(Assault < Mean_Assault, "Below", "Equal or Above")

USArrests |>
  mutate(Mean_Assault = mean(Assault), 
         Relative_Assult = ifelse(Assault < Mean_Assault, "Below", "Equal or Above")
         )
               Murder Assault UrbanPop Rape Mean_Assault Relative_Assult
Alabama          13.2     236       58 21.2       170.76  Equal or Above
Alaska           10.0     263       48 44.5       170.76  Equal or Above
Arizona           8.1     294       80 31.0       170.76  Equal or Above
Arkansas          8.8     190       50 19.5       170.76  Equal or Above
California        9.0     276       91 40.6       170.76  Equal or Above
Colorado          7.9     204       78 38.7       170.76  Equal or Above
Connecticut       3.3     110       77 11.1       170.76           Below
Delaware          5.9     238       72 15.8       170.76  Equal or Above
Florida          15.4     335       80 31.9       170.76  Equal or Above
Georgia          17.4     211       60 25.8       170.76  Equal or Above
Hawaii            5.3      46       83 20.2       170.76           Below
Idaho             2.6     120       54 14.2       170.76           Below
Illinois         10.4     249       83 24.0       170.76  Equal or Above
Indiana           7.2     113       65 21.0       170.76           Below
Iowa              2.2      56       57 11.3       170.76           Below
Kansas            6.0     115       66 18.0       170.76           Below
Kentucky          9.7     109       52 16.3       170.76           Below
Louisiana        15.4     249       66 22.2       170.76  Equal or Above
Maine             2.1      83       51  7.8       170.76           Below
Maryland         11.3     300       67 27.8       170.76  Equal or Above
Massachusetts     4.4     149       85 16.3       170.76           Below
Michigan         12.1     255       74 35.1       170.76  Equal or Above
Minnesota         2.7      72       66 14.9       170.76           Below
Mississippi      16.1     259       44 17.1       170.76  Equal or Above
Missouri          9.0     178       70 28.2       170.76  Equal or Above
Montana           6.0     109       53 16.4       170.76           Below
Nebraska          4.3     102       62 16.5       170.76           Below
Nevada           12.2     252       81 46.0       170.76  Equal or Above
New Hampshire     2.1      57       56  9.5       170.76           Below
New Jersey        7.4     159       89 18.8       170.76           Below
New Mexico       11.4     285       70 32.1       170.76  Equal or Above
New York         11.1     254       86 26.1       170.76  Equal or Above
North Carolina   13.0     337       45 16.1       170.76  Equal or Above
North Dakota      0.8      45       44  7.3       170.76           Below
Ohio              7.3     120       75 21.4       170.76           Below
Oklahoma          6.6     151       68 20.0       170.76           Below
Oregon            4.9     159       67 29.3       170.76           Below
Pennsylvania      6.3     106       72 14.9       170.76           Below
Rhode Island      3.4     174       87  8.3       170.76  Equal or Above
South Carolina   14.4     279       48 22.5       170.76  Equal or Above
South Dakota      3.8      86       45 12.8       170.76           Below
Tennessee        13.2     188       59 26.9       170.76  Equal or Above
Texas            12.7     201       80 25.5       170.76  Equal or Above
Utah              3.2     120       80 22.9       170.76           Below
Vermont           2.2      48       32 11.2       170.76           Below
Virginia          8.5     156       63 20.7       170.76           Below
Washington        4.0     145       73 26.2       170.76           Below
West Virginia     5.7      81       39  9.3       170.76           Below
Wisconsin         2.6      53       66 10.8       170.76           Below
Wyoming           6.8     161       60 15.6       170.76           Below

But you could also combine the steps:

USArrests |>
  mutate(Relative_Assult = ifelse(Assault < mean(Assault), "Below", "Equal or Above"))
               Murder Assault UrbanPop Rape Relative_Assult
Alabama          13.2     236       58 21.2  Equal or Above
Alaska           10.0     263       48 44.5  Equal or Above
Arizona           8.1     294       80 31.0  Equal or Above
Arkansas          8.8     190       50 19.5  Equal or Above
California        9.0     276       91 40.6  Equal or Above
Colorado          7.9     204       78 38.7  Equal or Above
Connecticut       3.3     110       77 11.1           Below
Delaware          5.9     238       72 15.8  Equal or Above
Florida          15.4     335       80 31.9  Equal or Above
Georgia          17.4     211       60 25.8  Equal or Above
Hawaii            5.3      46       83 20.2           Below
Idaho             2.6     120       54 14.2           Below
Illinois         10.4     249       83 24.0  Equal or Above
Indiana           7.2     113       65 21.0           Below
Iowa              2.2      56       57 11.3           Below
Kansas            6.0     115       66 18.0           Below
Kentucky          9.7     109       52 16.3           Below
Louisiana        15.4     249       66 22.2  Equal or Above
Maine             2.1      83       51  7.8           Below
Maryland         11.3     300       67 27.8  Equal or Above
Massachusetts     4.4     149       85 16.3           Below
Michigan         12.1     255       74 35.1  Equal or Above
Minnesota         2.7      72       66 14.9           Below
Mississippi      16.1     259       44 17.1  Equal or Above
Missouri          9.0     178       70 28.2  Equal or Above
Montana           6.0     109       53 16.4           Below
Nebraska          4.3     102       62 16.5           Below
Nevada           12.2     252       81 46.0  Equal or Above
New Hampshire     2.1      57       56  9.5           Below
New Jersey        7.4     159       89 18.8           Below
New Mexico       11.4     285       70 32.1  Equal or Above
New York         11.1     254       86 26.1  Equal or Above
North Carolina   13.0     337       45 16.1  Equal or Above
North Dakota      0.8      45       44  7.3           Below
Ohio              7.3     120       75 21.4           Below
Oklahoma          6.6     151       68 20.0           Below
Oregon            4.9     159       67 29.3           Below
Pennsylvania      6.3     106       72 14.9           Below
Rhode Island      3.4     174       87  8.3  Equal or Above
South Carolina   14.4     279       48 22.5  Equal or Above
South Dakota      3.8      86       45 12.8           Below
Tennessee        13.2     188       59 26.9  Equal or Above
Texas            12.7     201       80 25.5  Equal or Above
Utah              3.2     120       80 22.9           Below
Vermont           2.2      48       32 11.2           Below
Virginia          8.5     156       63 20.7           Below
Washington        4.0     145       73 26.2           Below
West Virginia     5.7      81       39  9.3           Below
Wisconsin         2.6      53       66 10.8           Below
Wyoming           6.8     161       60 15.6           Below
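If you prefer the two-step version with the helper column but do not want to keep Mean_Assault in the result, you can drop it afterward with select(). A sketch (output omitted):

USArrests |>
  mutate(Mean_Assault = mean(Assault), 
         Relative_Assult = ifelse(Assault < Mean_Assault, "Below", "Equal or Above")
         ) |>
  select(-Mean_Assault)   # drop the intermediate column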

Filtering Rows

Data files are messy and, as a result, require cleaning. You will have missing rows, incorrect variable names, files with columns that share the same name, NAs, numbers stored as strings, duplicate rows of data, people who completed a survey twice, and all sorts of unimaginable and unbelievable data problems. Cleaning is therefore important.

Whereas selecting is the process of retaining variable columns, filtering is the process of retaining rows from a data set. For example, if you want your data frame to contain only individuals who are 20 or more years old, you would filter to retain them and remove others. If you want to explore data for only individuals from Kansas, you would filter rows to retain only them and remove all others.

Data frame manipulation may involve keeping only certain rows of data, for example, only male or only female respondents, those who do not contain missing values (e.g., NAs) for a specific column variable, those who are of a certain age (or born in a certain year), those who are above (or below) some acceptable criterion, etc.

When a column variable has more than one value (e.g., check using unique() to determine the unique elements contained), you may wish to filter on some but not others.

You may even need to filter rows in a data frame that are distinct (e.g., not duplicate responses). This is often a good first step in order to determine the size of the usable data set. dplyr::distinct() makes de-duplicating easy as this function will return only distinct rows.

Understanding Filtering Operators

We will filter data using filter() from {dplyr}

filter(.data, 
       ...
       )

Parameters/Arguments:

  • .data: a data frame
  • ...: expressions that return a logical value (TRUE or FALSE)

NOTE: Filtering cases using the filter() verb works by removing rows that do not match a specific criterion and returning the data frame without the mismatched rows. It keeps only rows that match (are TRUE for) your specification.

Some useful filtering operators and functions include: ==, >, >=, &, |, !, xor(), c(), is.na(), between(), near().
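A few quick illustrations of these on small vectors; the same expressions work inside filter() on a column:

3 > 2                        # TRUE
c(1, 2, 3) >= 2              # FALSE  TRUE  TRUE
TRUE & FALSE                 # FALSE (both must be TRUE)
TRUE | FALSE                 # TRUE  (either may be TRUE)
xor(TRUE, TRUE)              # FALSE (exactly one must be TRUE)
is.na(c(1, NA, 3))           # FALSE  TRUE FALSE
dplyr::between(5, 1, 10)     # TRUE  (1 <= 5 and 5 <= 10)
dplyr::near(0.1 + 0.2, 0.3)  # TRUE  (safe equality test for decimals)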

Filtering Using filter()

We will work with different data frames because of the nature of their contents.

What does the data frame look like again?

DAT    # or print(DAT)
   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes

Rows/observations/cases can be filtered to “include” only certain matched conditions or can be filtered to “exclude” by negating those matched conditions. If the column variable Sex is in the data frame and cases are 'male', 'men', 'female', 'women', 'neither', NA, etc., you can specify the Sex column variable and then the row matching condition(s).

Filtering Matches to a Specific Character Value

The first argument of filter() is a data frame, and the function call filter(DAT, Sex == 'female') will filter the data frame named DAT to include rows for which the Sex column equals 'female'. In other words, the TRUE rows.

filter(DAT, 
       Sex == 'female'
       )
   Id    Sex Age Renting
1 101 female  33    <NA>

Similarly, the function call filter(DAT, Sex == 'male') can be read “filter the data frame to include rows for which the value of Sex == 'male' is TRUE”.

More flexibly, however, you could specify a vector of acceptable strings using c(). filter(Sex %in% c('male')) keeps only rows for which the value of Sex is in a string vector containing the single string 'male', whereas filter(Sex %in% c('male', 'Man')) keeps rows for which the value of Sex is either 'male' or 'Man'. Cases containing 'Male', 'Men' (R is a case-sensitive language), or 'female', for example, will not be included in the returned data frame because they do not match any value in the string vector.
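As a small sketch using the DAT data frame shown above (output omitted), 'Man' is listed only to show that unmatched strings are simply ignored:

DAT |>
  filter(Sex %in% c('male', 'Man'))   # keep rows whose Sex value is in the vector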

Filtering by a Relative Value

Let’s |> the data frame to filter():

USArrests |>
  filter(Murder > 10)
               Murder Assault UrbanPop Rape
Alabama          13.2     236       58 21.2
Florida          15.4     335       80 31.9
Georgia          17.4     211       60 25.8
Illinois         10.4     249       83 24.0
Louisiana        15.4     249       66 22.2
Maryland         11.3     300       67 27.8
Michigan         12.1     255       74 35.1
Mississippi      16.1     259       44 17.1
Nevada           12.2     252       81 46.0
New Mexico       11.4     285       70 32.1
New York         11.1     254       86 26.1
North Carolina   13.0     337       45 16.1
South Carolina   14.4     279       48 22.5
Tennessee        13.2     188       59 26.9
Texas            12.7     201       80 25.5

The rows with Murder above 10 will be retained (they are TRUE), others removed (they are FALSE).

Filtering by a Specific Value

To filter() by a specific value rather than a relative value, you will need to perform a test of equality using the == operator.

For some quick examples:

"Hello" == "Hello"
[1] TRUE
"Hello" == "hello"
[1] FALSE

Because filter() operates on rows of data frames, the test of equality to a value is performed on each row. When the test is TRUE, that row will be retained.

Get rows for which Murder == 12.1:

USArrests |>
  filter(Murder == 12.1)
         Murder Assault UrbanPop Rape
Michigan   12.1     255       74 35.1

Using a previous example with this data set, we could filter based on a character variable if we had one, so let’s add one.

USArrests |>
  mutate(Relative_Murder = ifelse(Murder < mean(Murder), "Below", "Equal or Above")) |>
  filter(Relative_Murder == "Below")
              Murder Assault UrbanPop Rape Relative_Murder
Connecticut      3.3     110       77 11.1           Below
Delaware         5.9     238       72 15.8           Below
Hawaii           5.3      46       83 20.2           Below
Idaho            2.6     120       54 14.2           Below
Indiana          7.2     113       65 21.0           Below
Iowa             2.2      56       57 11.3           Below
Kansas           6.0     115       66 18.0           Below
Maine            2.1      83       51  7.8           Below
Massachusetts    4.4     149       85 16.3           Below
Minnesota        2.7      72       66 14.9           Below
Montana          6.0     109       53 16.4           Below
Nebraska         4.3     102       62 16.5           Below
New Hampshire    2.1      57       56  9.5           Below
New Jersey       7.4     159       89 18.8           Below
North Dakota     0.8      45       44  7.3           Below
Ohio             7.3     120       75 21.4           Below
Oklahoma         6.6     151       68 20.0           Below
Oregon           4.9     159       67 29.3           Below
Pennsylvania     6.3     106       72 14.9           Below
Rhode Island     3.4     174       87  8.3           Below
South Dakota     3.8      86       45 12.8           Below
Utah             3.2     120       80 22.9           Below
Vermont          2.2      48       32 11.2           Below
Washington       4.0     145       73 26.2           Below
West Virginia    5.7      81       39  9.3           Below
Wisconsin        2.6      53       66 10.8           Below
Wyoming          6.8     161       60 15.6           Below

Filtering Using a Function

The mean() function will return the mean of a vector. We can pull() a variable from the data frame and pipe it to mean():

USArrests |>
  pull(Murder) |>
  mean()
[1] 7.788

Alternatively, use the $ operator:

mean(USArrests$Murder)
[1] 7.788

We now see the mean is 7.788. We can filter the data frame to retain rows for which Murder > 7.788:

USArrests |>
  filter(Murder > 7.788)
               Murder Assault UrbanPop Rape
Alabama          13.2     236       58 21.2
Alaska           10.0     263       48 44.5
Arizona           8.1     294       80 31.0
Arkansas          8.8     190       50 19.5
California        9.0     276       91 40.6
Colorado          7.9     204       78 38.7
Florida          15.4     335       80 31.9
Georgia          17.4     211       60 25.8
Illinois         10.4     249       83 24.0
Kentucky          9.7     109       52 16.3
Louisiana        15.4     249       66 22.2
Maryland         11.3     300       67 27.8
Michigan         12.1     255       74 35.1
Mississippi      16.1     259       44 17.1
Missouri          9.0     178       70 28.2
Nevada           12.2     252       81 46.0
New Mexico       11.4     285       70 32.1
New York         11.1     254       86 26.1
North Carolina   13.0     337       45 16.1
South Carolina   14.4     279       48 22.5
Tennessee        13.2     188       59 26.9
Texas            12.7     201       80 25.5
Virginia          8.5     156       63 20.7

This approach, however, is not helpful when your data change. Even if you think your data will not change, data updates, filtering of rows or cases, and different cleaning methods all have a high probability of changing the mean. Working smarter rather than harder means ensuring that you circumvent such issues.

Never hard code!

Instead, use mean() to do this for us.

USArrests |>
  filter(Murder > mean(Murder))
               Murder Assault UrbanPop Rape
Alabama          13.2     236       58 21.2
Alaska           10.0     263       48 44.5
Arizona           8.1     294       80 31.0
Arkansas          8.8     190       50 19.5
California        9.0     276       91 40.6
Colorado          7.9     204       78 38.7
Florida          15.4     335       80 31.9
Georgia          17.4     211       60 25.8
Illinois         10.4     249       83 24.0
Kentucky          9.7     109       52 16.3
Louisiana        15.4     249       66 22.2
Maryland         11.3     300       67 27.8
Michigan         12.1     255       74 35.1
Mississippi      16.1     259       44 17.1
Missouri          9.0     178       70 28.2
Nevada           12.2     252       81 46.0
New Mexico       11.4     285       70 32.1
New York         11.1     254       86 26.1
North Carolina   13.0     337       45 16.1
South Carolina   14.4     279       48 22.5
Tennessee        13.2     188       59 26.9
Texas            12.7     201       80 25.5
Virginia          8.5     156       63 20.7

If you have been paying attention, you know that you need to understand how the functions you use work.

help(mean)

In the usage section, you see that by default, mean() will not remove NA (missing) values from the vector.

mean(x, 
     trim = 0, 
     na.rm = FALSE, 
     ...
     )

For example:

mean(c(1, 9, 88, NA))
[1] NA

The mean of the non-missing values is clearly not NA, but the function is operating just as it should. If an NA appears anywhere in your vector, the code above will return NA. By default, na.rm = FALSE, which means that NA values will not be removed; if any exist, the mean is returned as NA.

The smarter approach is to pass TRUE to the na.rm parameter (na.rm is short for “NA remove”).

mean(c(1, 9, 88, NA), na.rm = TRUE)
[1] 32.66667

And for the data frame:

USArrests |>
  filter(Murder > mean(Murder, na.rm = TRUE))
               Murder Assault UrbanPop Rape
Alabama          13.2     236       58 21.2
Alaska           10.0     263       48 44.5
Arizona           8.1     294       80 31.0
Arkansas          8.8     190       50 19.5
California        9.0     276       91 40.6
Colorado          7.9     204       78 38.7
Florida          15.4     335       80 31.9
Georgia          17.4     211       60 25.8
Illinois         10.4     249       83 24.0
Kentucky          9.7     109       52 16.3
Louisiana        15.4     249       66 22.2
Maryland         11.3     300       67 27.8
Michigan         12.1     255       74 35.1
Mississippi      16.1     259       44 17.1
Missouri          9.0     178       70 28.2
Nevada           12.2     252       81 46.0
New Mexico       11.4     285       70 32.1
New York         11.1     254       86 26.1
North Carolina   13.0     337       45 16.1
South Carolina   14.4     279       48 22.5
Tennessee        13.2     188       59 26.9
Texas            12.7     201       80 25.5
Virginia          8.5     156       63 20.7

Notice that nothing changes here because there were no NA values in the variable, but there could be at some point.

Filtering Missing Data (NAs)

is.na() will return a logical vector (or, for a data frame, a logical matrix) in which TRUE marks a missing value.

Try on the entire data frame:

is.na(DAT)
        Id   Sex   Age Renting
[1,] FALSE FALSE FALSE   FALSE
[2,] FALSE FALSE FALSE    TRUE
[3,] FALSE FALSE FALSE   FALSE
[4,] FALSE  TRUE FALSE    TRUE
[5,] FALSE FALSE FALSE   FALSE
[6,] FALSE FALSE FALSE   FALSE
[7,] FALSE FALSE FALSE   FALSE

You can see that some columns contain cases/rows with TRUE indicating the cell contains NA.

We will use the negation operator, !, to illustrate some filtering approaches. Because filter() removes FALSE cases and retains TRUE ones, you may sometimes need to negate a function so that you keep the rows you want.

  • na.omit(): removes rows with NAs
  • filter(is.na(column_name)): keep rows with NA in specific variable
  • filter(!is.na(column_name)): remove (don’t keep) rows with NA in specific variable
  • filter(., complete.cases(.)): remove rows with NAs
  • filter(filter_complete_cases()): remove rows with any NAs (custom function)

Filter using na.omit()

Filter the entire data frame:

DAT |>
  na.omit()                # omit any rows with NAs 
   Id     Sex Age Renting
1 100    male  25     yes
3 102    Male  27     yes
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes
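A related option, since {tidyr} is part of the same ecosystem, is tidyr::drop_na(), which drops rows containing NA in any column, or only in the columns you name. A sketch (output omitted):

DAT |>
  tidyr::drop_na()        # drop rows with NA in any column (like na.omit())

DAT |>
  tidyr::drop_na(Sex)     # drop rows with NA in Sex only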

Filter using is.na() and !is.na()

Filter based on a specific variable:

DAT |>
  filter(is.na(Sex))       # keep NAs by variable
   Id  Sex Age Renting
1 103 <NA>  40    <NA>

But more likely, your goal is to keep everything that is not NA:

DAT |>
  filter(!is.na(Sex))      # remove NAs by variable
   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 104     man  44      no
5 100    male  25     yes
6 105 neither  40     yes

And filter step-by-step for each variable using |> and separate function calls:

DAT |>
  filter(!is.na(Sex)) |>
  filter(!is.na(Renting))
   Id     Sex Age Renting
1 100    male  25     yes
2 102    Male  27     yes
3 104     man  44      no
4 100    male  25     yes
5 105 neither  40     yes

So why use separate lines of code if you can use & all in one line? One reason is that separate function calls written as separate lines of code make code inclusion/exclusion extremely easy.

Comment out what you do not want using #:

DAT |>
  #filter(!is.na(Sex)) |>
  filter(!is.na(Renting))
   Id     Sex Age Renting
1 100    male  25     yes
2 102    Male  27     yes
3 104     man  44      no
4 100    male  25     yes
5 105 neither  40     yes

Filter using complete.cases()

The complete.cases() function returns a logical vector in which TRUE reflects that the row has complete information and no missing values. Using complete.cases() along with filter(), you retain all TRUE rows.

There is a caveat, however: you will need to pipe the data frame with {magrittr}’s %>% piping operator. This involves passing . as the argument to the .data parameter of filter() (the first parameter).

DAT %>%
  filter(.data = ., complete.cases(.))
   Id     Sex Age Renting
1 100    male  25     yes
2 102    Male  27     yes
3 104     man  44      no
4 100    male  25     yes
5 105 neither  40     yes

If you use |>, you will get an error:

DAT |>
  filter(.data = ., complete.cases(.))

A way around this is to create your own function in a .R script and add it to /src/functions. We can name it filter_complete_cases().

filter_complete_cases <- function() {
  complete.cases(dplyr::cur_data())
}

And then pipe using |>:

DAT |>
  filter(filter_complete_cases())
Warning: There was 1 warning in `filter()`.
ℹ In argument: `filter_complete_cases()`.
Caused by warning:
! `cur_data()` was deprecated in dplyr 1.1.0.
ℹ Please use `pick()` instead.
   Id     Sex Age Renting
1 100    male  25     yes
2 102    Male  27     yes
3 104     man  44      no
4 100    male  25     yes
5 105 neither  40     yes
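If you would rather avoid the deprecated cur_data(), the warning points to pick(). One minimal alternative (assuming dplyr >= 1.1.0) is to call complete.cases() on pick(everything()) directly inside filter():

DAT |>
  filter(complete.cases(pick(everything())))   # row-wise completeness across all columns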

You choose your method.

Removing duplicate rows using distinct()

We will examine some other filtering methods that do not use filter() specifically but that perform filtering operations.

  • dplyr::distinct(): remove duplicate rows
  • dplyr::distinct(., column): remove duplicate rows by specific column
  • na.omit(): remove any row with NA’s (missing values)

Let’s use the simple DAT data frame.

DAT
   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes

Remember that data frames are composed of rows and columns. As discussed previously, we can subset in base R to illustrate a point.

Row 1, all columns:

DAT[1,]
   Id  Sex Age Renting
1 100 male  25     yes

Row 6, all columns:

DAT[6,]
   Id  Sex Age Renting
6 100 male  25     yes

Notice that rows 1 and 6 are the same person (the same Id) and have exactly the same data for all variables. If we compare the two rows logically, we confirm that every comparison is TRUE.

DAT[1,] == DAT[6,]
    Id  Sex  Age Renting
1 TRUE TRUE TRUE    TRUE

You clearly do not want duplicate data. So let’s just remove any rows that are identical using distinct().

Removing Duplicates Across All Variables

DAT |>
  distinct()       # Remove exact duplicates across all columns
   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 105 neither  40     yes

If you know each row is unique based on a variable in the data frame, you can also use distinct() to remove duplicates for a specific variable. Make sure that this variable specification is actually one that you would not want duplicates of.

Removing Duplicates Based on a Variable

DAT |>           
  distinct(Id)     #  Remove duplicates by variable; passes unique values for data frame
   Id
1 100
2 101
3 102
4 103
5 104
6 105

But this function simply returns the unique values in Id. To retain the variables, set .keep_all = TRUE. If you want to remove duplicates and assign the cleaned data frame to an object, you would likely want to keep all of your variables.

DAT |>             
  distinct(Id, 
           .keep_all = T
           )
   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 105 neither  40     yes

Notice, however, that this removed only the later instance of Id == 100 (distinct() keeps the first occurrence). Which row to keep is a judgment call. The first, the last, neither, the average? Is there a correct answer?
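If you decided, for example, to keep the last instance of each Id rather than the first, one option among several is slice_tail() with its by argument (available in dplyr >= 1.1.0). This is a sketch of the mechanics, not a claim about which row is correct:

DAT |>
  slice_tail(n = 1, by = Id)   # keep the final row within each Id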

The Data Manipulation Workflow: Putting It All Together

Of course, all of these functions can be paired with select(), mutate(), filter(), etc. Here is the data manipulation workflow:

dataframe |>
    select(...) |>     # select variables of interest
    mutate(...) |>     # then create new variables
    filter(...) |>     # then filter by rows
    group_by(...) |>   # then group for subsetting
    summarize(...)     # then summarize
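As a concrete (if contrived) pass through this workflow using USArrests; the column choices are arbitrary, and group_by()/summarize() appear here only to show where they sit in the chain:

USArrests |>
  select(Murder, Assault, UrbanPop) |>                     # select variables of interest
  mutate(Relative_Murder = ifelse(Murder < mean(Murder),
                                  "Below", "Equal or Above")) |>  # create a new variable
  filter(UrbanPop > 50) |>                                 # filter rows
  group_by(Relative_Murder) |>                             # group
  summarize(Mean_Assault = mean(Assault, na.rm = TRUE))    # summarize by group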

Many More Use-Case Examples (Optional)

Filter Cases using ==

Filter rows for which the Sex variable is equal to the string 'female':

DAT |>
  filter(Sex == 'female')
   Id    Sex Age Renting
1 101 female  33    <NA>

Filter rows for which the Sex variable is not equal to the string 'female':

DAT |>
  filter(Sex != 'female')
   Id     Sex Age Renting
1 100    male  25     yes
2 102    Male  27     yes
3 104     man  44      no
4 100    male  25     yes
5 105 neither  40     yes

Filter rows for which the Sex variable is equal to the string 'female' AND Age is greater than the numeric 27:

DAT |>
  filter(Sex == 'female' & Age > 27) # this "AND" that
   Id    Sex Age Renting
1 101 female  33    <NA>

Filter rows for which the Sex variable is equal to the string 'female' OR Age is greater than the numeric 27:

DAT |>
  filter(Sex == 'female' | Age > 27) # this "OR" that
   Id     Sex Age Renting
1 101  female  33    <NA>
2 103    <NA>  40    <NA>
3 104     man  44      no
4 105 neither  40     yes

A cleaner method involves separate lines of code. Although cleaner, this will not allow the “OR” option because the data frame that is returned from the first filter() is passed to the second filter() and all cases other than "female" have already been removed from the data frame.

DAT |>
  filter(Sex == 'female') |>    # keep female (and add another pipe)
  filter(Age >= 27)             # keep only those equal to or older than 27
   Id    Sex Age Renting
1 101 female  33    <NA>

Filtering Cases by Value

Filter by < and > or <= or >=

Keep those less than:

DAT |> 
  filter(Age < 40)   
   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 102   Male  27     yes
4 100   male  25     yes

Keep older than:

DAT |> 
  filter(Age > 40)  
   Id Sex Age Renting
1 104 man  44      no

Keep equal to or older than:

DAT |> 
  filter(Age >= 40)  
   Id     Sex Age Renting
1 103    <NA>  40    <NA>
2 104     man  44      no
3 105 neither  40     yes

Filter Cases by Conditional X or Y Using | Operator…

Using the “OR” operator, |, cases can be included if “this” OR “that” condition is TRUE.

Filter numbers:

DAT |>
  filter(Age == 25 | Age == 40)    # keep rows where Age equals 25 OR 40
   Id     Sex Age Renting
1 100    male  25     yes
2 103    <NA>  40    <NA>
3 100    male  25     yes
4 105 neither  40     yes

Filter characters:

DAT |>
  filter(Sex == 'male' | Sex == 'female')
   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 100   male  25     yes

Although dplyr::filter(Sex %in% c('male', 'female')) would be easier.

Filter rows of variables of both types:

DAT |>
  filter(Sex == 'male' | Age == 27)  
   Id  Sex Age Renting
1 100 male  25     yes
2 102 Male  27     yes
3 100 male  25     yes

Filter Cases Between Values with between()

Between ages 27 and 33:

DAT |>
  filter(between(Age, 27, 33))
   Id    Sex Age Renting
1 101 female  33    <NA>
2 102   Male  27     yes

Filter by range using the %in% operator (read %in% as “in”)

Though less flexible than using between(), %in% may be easier to remember:

DAT |>
  filter(Age %in% 20:43)    # keep rows whose Age value is IN the range 20:43
   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 100    male  25     yes
6 105 neither  40     yes

One’s age is in the range from 20 through 43.

If a vector object is already defined (e.g., my_levels = c('male', 'female')), you can also use it for filtering. This approach is useful when data manipulation involves reusing the same reference; it simplifies coding and reduces errors because the specification is defined only once.

my_levels = c('male', 'female')

DAT |>
  filter(Sex %in% my_levels)
   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 100   male  25     yes

Filter by exclusion

When inclusion of values is inappropriate, exclusion of them may be useful. The ! operator means “NOT” in R, so you can use it to accomplish the opposite of a statement. For example, dplyr::filter(., !Sex %in% c('male', NA)) reads as “filter the data frame to include rows for which the value of Sex is NOT in the vector”.

Exclude rows in the Sex variable that are NA or 'male':

DAT |>
  filter(!Sex %in% c('male', NA))  # keep only if NOT in vector
   Id     Sex Age Renting
1 101  female  33    <NA>
2 102    Male  27     yes
3 104     man  44      no
4 105 neither  40     yes

Exclude rows in the Sex variable that are 'male' or 'Men':

DAT |>
  filter(!Sex %in% c('male', 'Men'))  # keep only if NOT in vector
   Id     Sex Age Renting
1 101  female  33    <NA>
2 102    Male  27     yes
3 103    <NA>  40    <NA>
4 104     man  44      no
5 105 neither  40     yes

Filter by conditional X and Y using & operator…

By range:

DAT |>
  filter(Id >= 102 & Age <= 43)    
   Id     Sex Age Renting
1 102    Male  27     yes
2 103    <NA>  40    <NA>
3 105 neither  40     yes
DAT |>
  filter(Age >= 20 & Age <= 43)    
   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 100    male  25     yes
6 105 neither  40     yes

Note: writing Age == 20:43 won’t work here. Can you figure out why? The comparison below offers a hint (and with seven rows against a 24-value range, R would also try to recycle the shorter vector).
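To see why, compare == (element-wise comparison) with %in% (membership) on a small vector:

x <- c(25, 33, 27)
x == c(25, 27, 33)    # element-wise: TRUE FALSE FALSE
x %in% c(25, 27, 33)  # membership:   TRUE  TRUE  TRUE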

Filter Cases Containing Characters Using stringr::str_detect() and %in%:

If you want to filter cases that contain certain characters, unfortunately, you cannot use contains() as you would for variable names. For matching characters in levels of variables, you’ll need something like stringr::str_detect() or grepl().

  • stringr::str_detect(): returns matching conditions

The example below uses str_detect() to detect the presence of a character match; it returns a logical vector with TRUE for matching cases. When paired with filter(), the data frame is filtered to contain the cases that match the pattern for that variable.

Example: stringr::str_detect(my_variable, "pattern"):

Let’s look for levels of Sex for which "ma" is detected.

DAT |> 
  filter(stringr::str_detect(Sex, "ma")) 
   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 104    man  44      no
4 100   male  25     yes

But the case for which Sex is 'Male' is now missing. This is because stringr::str_detect() performs a case-sensitive pattern match.

You can fix this in a couple of ways:

  1. use mutate() to convert the cases in Sex to lowercase in the data frame, or
  2. wrap Sex in tolower() inside filter() to make cases lowercase.

The first option might be better if you want to fix the problem in the data frame.

Other casing functions are:

  • tolower(): returns lower case of string; see also stringr::str_to_lower()
  • toupper(): returns upper case of string; see also stringr::str_to_upper()
  • tools::toTitleCase(): returns Title Case (capitalize first letter of string)
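For a quick sense of what these return on a simple string (illustrative only):

tolower("Male Respondent")              # "male respondent"
toupper("Male Respondent")              # "MALE RESPONDENT"
tools::toTitleCase("male respondent")   # "Male Respondent"

Here is option 2, wrapping Sex in tolower() inside filter(), in action: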
DAT |>
  filter(stringr::str_detect(tolower(Sex), "ma"))
   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 102   Male  27     yes
4 104    man  44      no
5 100   male  25     yes

For some, reading this code may be difficult. To understand it, read it from the inside out, like layers of an onion: first understand the innermost function, whose returned value is passed to the next function, and so forth.

For example, elements of the Sex vector are converted to lower case and returned using tolower(). We can demonstrate this using only the vector from the data frame using the $ operator, DAT$Sex.

The initial vector:

DAT$Sex
[1] "male"    "female"  "Male"    NA        "man"     "male"    "neither"

To lowercase:

tolower(DAT$Sex)
[1] "male"    "female"  "male"    NA        "man"     "male"    "neither"

Next, stringr::str_detect() will return a vector of TRUE or FALSE for all element positions of DAT$Sex for which the (now lowercase) string matches "ma":

stringr::str_detect(tolower(DAT$Sex), "ma")
[1]  TRUE  TRUE  TRUE    NA  TRUE  TRUE FALSE

Another approach would be to perform the steps in pieces by first mutating the data frame so that elements in the Sex vector are converted to lowercase and then piping that mutated data frame to filter().

We can use dplyr::mutate(), introduced above, to accomplish this goal. mutate() is used to add, update, and otherwise modify variables in a data frame or tibble.

Mutate Sex using mutate():

DAT |>
  mutate(Sex = tolower(Sex))
   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes

Now mutate() and pipe (|>) the mutated data frame to filter():

DAT |>
  mutate(Sex = tolower(Sex)) |>
  filter(stringr::str_detect(Sex, "ma"))
   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 102   male  27     yes
4 104    man  44      no
5 100   male  25     yes

Pairing this piping approach with toTitleCase() may be better because only the first letter of each word is capitalized, so you can search for “Ma”.

DAT |>
  mutate(Sex = tools::toTitleCase(Sex)) |>
  filter(stringr::str_detect(Sex, "Ma"))
   Id  Sex Age Renting
1 100 Male  25     yes
2 102 Male  27     yes
3 104  Man  44      no
4 100 Male  25     yes

But notice that 'Male' and 'Man' remain separate levels, so the inconsistency is still a problem.

So what are the unique elements in DAT$Sex? Your friend here is unique(). And maybe you are too.

unique(DAT$Sex)
[1] "male"    "female"  "Male"    NA        "man"     "neither"

You can hard code a fix using %in% and case_when(). case_when() is {dplyr}’s solution for multi-case conditionals.

DAT |>
  mutate(Sex = tolower(Sex), # convert to lowercase first
         Sex = case_when(    # then recode cases when matched
           Sex %in% c("male", "man") ~ "Male",
           Sex %in% c("female", "woman") ~ "Female"
         ))
   Id    Sex Age Renting
1 100   Male  25     yes
2 101 Female  33    <NA>
3 102   Male  27     yes
4 103   <NA>  40    <NA>
5 104   Male  44      no
6 100   Male  25     yes
7 105   <NA>  40     yes

For a more flexible fix, you can use str_detect():

BUT beware that the order of conditions matters with this approach. Because the string “female” contains the characters “ma”, you could accidentally recode all “female” cases to “Male” if you list the “ma” condition before the “fe” condition in case_when().

DAT |>
  mutate(Sex = tolower(Sex),   # convert first
         Sex = case_when(      # then recode
           stringr::str_detect(Sex, "fe") ~ "Female",
           stringr::str_detect(Sex, "ma") ~ "Male",
         ))
   Id    Sex Age Renting
1 100   Male  25     yes
2 101 Female  33    <NA>
3 102   Male  27     yes
4 103   <NA>  40    <NA>
5 104   Male  44      no
6 100   Male  25     yes
7 105   <NA>  40     yes

Note that cases not matched by any condition declared in case_when() will be recoded as NA, which is what happens for “neither”. Make sure to include all of the ways you wish to recode cases.
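If you want unmatched cases (such as “neither”) to be kept as-is rather than recoded to NA, case_when() accepts a .default argument (dplyr >= 1.1.0). A sketch:

DAT |>
  mutate(Sex = tolower(Sex),
         Sex = case_when(
           stringr::str_detect(Sex, "fe") ~ "Female",
           stringr::str_detect(Sex, "ma") ~ "Male",
           .default = Sex                  # keep anything not matched above
         ))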

Session Information

sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tidyselect_1.2.1  htmltools_0.5.8.1 DT_0.33           vroom_1.6.5      
 [5] lubridate_1.9.3   forcats_1.0.0     stringr_1.5.1     dplyr_1.1.4      
 [9] purrr_1.0.2       readr_2.1.5       tidyr_1.3.1       tibble_3.2.1     
[13] ggplot2_3.5.1     tidyverse_2.0.0  

loaded via a namespace (and not attached):
 [1] bit_4.0.5         gtable_0.3.5      jsonlite_1.8.8    crayon_1.5.3     
 [5] compiler_4.4.1    scales_1.3.0      yaml_2.3.10       fastmap_1.2.0    
 [9] here_1.0.1        R6_2.5.1          generics_0.1.3    knitr_1.47       
[13] htmlwidgets_1.6.4 munsell_0.5.1     rprojroot_2.0.4   tzdb_0.4.0       
[17] pillar_1.9.0      R.utils_2.12.3    rlang_1.1.4       utf8_1.2.4       
[21] stringi_1.8.4     xfun_0.45         bit64_4.0.5       timechange_0.3.0 
[25] cli_3.6.3         withr_3.0.1       magrittr_2.0.3    digest_0.6.36    
[29] grid_4.4.1        rstudioapi_0.16.0 hms_1.1.3         lifecycle_1.0.4  
[33] R.methodsS3_1.8.2 R.oo_1.26.0       vctrs_0.6.5       evaluate_0.24.0  
[37] glue_1.7.0        fansi_1.0.6       colorspace_2.1-0  rmarkdown_2.27   
[41] tools_4.4.1       pkgconfig_2.0.3