Selecting, Filtering, & Mutating

Author

Gabriel I. Cook

Published

February 9, 2024

Under construction.

This page is a work in progress and may contain areas that need more detail or that required syntactical, grammatical, and typographical changes. If you find some part requiring some editing, please let me know so I can fix it for you.

Overview

In this module, we will start manipulating data frames. Data frames are composed of column variables and row observations/cases and we will address how to perform operations on both. We will learn how to use {dplyr}, rather than base R to select, remove, add, and modify vectors in data frames. In addition, we will address how to filter and arrange rows in various ways. The focus will be on cleaning or wrangling data. Helper functions from libraries like {tidyr}, {stringr}, and {tidyselect} are also used in conjunction with {dplyr}.

Readings and Preparation

Before Class: First, read to familiarize yourself with the concepts rather than master them. I will assume that you attend class with some level of basic understanding of concepts and working of functions. The goal of reading should be to understand and implement code functions as well as support your understanding and help your troubleshooting of problems. This cannot happen if you just read the content without interacting with it, however reading is absolutely essential to being successful during class time. Work through some examples so that you have a good idea of your level of understanding and confidence.

Class: In class, some functions and concepts will be introduced and we will practice implementing code through exercises.

Supplementary Readings (Optional)

Libraries

{here}: 1.0.1: for path management
{dplyr} 1.1.4: for selecting, filtering, and mutating
{stringr}: 1.5.1: for working with strings
{tidyselect}: 1.2.0: for selecting sets from strings
{magrittr}: 2.0.3: for piping with %>% (plus other piping operators)

External Functions

view_html(): for viewing data frames in HTML format. You can source this from your /r/functions directory.

R.utils::sourceDirectory(here::here("r", "functions"))

If for some reason, you do not have the function saved, you can get it from the remote source.

source("https://raw.githubusercontent.com/slicesofdata/fods24/main/r/functions/view_html.R")

Loading Libraries

We will work with a few different libraries for data manipulation. Let’s load them into our work space using library().

library(dplyr)
library(stringr)
library(tidyselect)
library(magrittr)

Make note of any warnings that appear when loading libraries. There are some libraries that contain functions with the same names. Be aware that the most recently loaded library function will take precedence. You can avoid confusion using :: to call a function from a particular library (e.g., libraryname::functionname()).

A Grammar for Data Wrangling

The {dplyr} package presents a type of grammar for wrangling data (Wickham and Francois 2014). {dplyr} is part of the {tidyverse} ecosystem which loads using library(tidyverse), or specifically using library(dplyr). {dplyr} also imports functions from {tidyselect}, which is also part of the {tidyverse} ecosystem.

The functions from any R package can be called after loading the package library. One does not need to reference the package explicitly when calling a function but to avoid confusion, we will often reference the package/library and the function together (e.g., package::function()).

Working with data involves creating new column variables, removing or renaming column variables, removing or deleting row observations, sorting, and summarizing data (often by groups/factors). etc. Consequently, there are five main function verbs for working with data in a data frame: select, filter, mutate, arange, and summarize. Their corresponding functions are:

select(dataframe, variables_to_select): select columns to retain
mutate(dataframe, variables_to_create) and dplyr::rename(): add or modify existing columns
filter(dataframe, rows_to_select): subset by rows
arrange(dataframe, variable_to_sort_by): sort rows
summarize(dataframe, procedures_for_summarizing) in conjunction with dplyr::group_by(): aggregate the data in some way

Some Common Ways for Selecting Variables Using `{dplyr}`

You may often wish to subset data frames containing multiple variables to make them more lean and manageable for carrying out a specific goal. For such tasks, select() will be used.

Selecting Variables using `dplyr::select()`

You can reference variables by their names or by their column position in a data frame (e.g., 1, 25, etc.).

Using select(), you can select columns/variables from a data frame. The variables you select are retained and variables omitted that you don’t select are omitted in the returned data frame.

Before using a function you have never used before, you would always review the documentation for what the function is designed to accomplish.

help(dplyr::select)

Or:

?dplyr::select

You can see that select() is used for selecting variables from a data frame and there are many selection features for accomplishing such tasks.

Parameters/Arguments:

.data: a data frame
...: one ore more expressions separated by commas

We will definitely need a data frame and we will need variables to select.

Let’s start with a data frame. The USAressts data is built into R so let’s just use that for now. Let’s use select() to subset a data frame in two ways, first without piping and then with piping (preferred method).

Without piping

At a bare minimum, we will need to pass a data frame to select() at very least. But what happens if we don’t pass variables to select?

select(.data = USArrests)

data frame with 0 columns and 50 rows

You will see the that the returned data frame has 0 columns and 50 rows.

Piping using |>

If you pipe the data frame into the function, .data = USArrests is inherited and you do not need to declare it within select().

USArrests |>
  select()

data frame with 0 columns and 50 rows

Again, the returned data frame has 0 columns and 50 rows. If you use select() without specifying any variables for the function to select, as seen in both examples, you will have data frame that contains no columns. Let’s address some methods for selecting variables.

Selecting Variables by Column Position

Variables are columns in a data frame, so they can be selected by their element position (previously covered here). If you wanted to select the first and third column, one way is to specify each position separated by a comma.

USArrests |>        # take the data frame
  select(1, 3)      # select columns 1 and 3

               Murder UrbanPop
Alabama          13.2       58
Alaska           10.0       48
Arizona           8.1       80
Arkansas          8.8       50
California        9.0       91
Colorado          7.9       78
Connecticut       3.3       77
Delaware          5.9       72
Florida          15.4       80
Georgia          17.4       60
Hawaii            5.3       83
Idaho             2.6       54
Illinois         10.4       83
Indiana           7.2       65
Iowa              2.2       57
Kansas            6.0       66
Kentucky          9.7       52
Louisiana        15.4       66
Maine             2.1       51
Maryland         11.3       67
Massachusetts     4.4       85
Michigan         12.1       74
Minnesota         2.7       66
Mississippi      16.1       44
Missouri          9.0       70
Montana           6.0       53
Nebraska          4.3       62
Nevada           12.2       81
New Hampshire     2.1       56
New Jersey        7.4       89
New Mexico       11.4       70
New York         11.1       86
North Carolina   13.0       45
North Dakota      0.8       44
Ohio              7.3       75
Oklahoma          6.6       68
Oregon            4.9       67
Pennsylvania      6.3       72
Rhode Island      3.4       87
South Carolina   14.4       48
South Dakota      3.8       45
Tennessee        13.2       59
Texas            12.7       80
Utah              3.2       80
Vermont           2.2       32
Virginia          8.5       63
Washington        4.0       73
West Virginia     5.7       39
Wisconsin         2.6       66
Wyoming           6.8       60

Let’s remind ourselves of how piping works. Remember that piping will modify the data frame in steps, with each pipe passing the current form of the data frame to the next function. Not that you ever want to write two lines of code and perform this extra step to select only column 1 but you could pipe the data frame to another select().

USArrests |>        # take the data frame
  select(1, 3) |>   # select columns 1 and 3
  select(1)         # then select column 1 only

               Murder
Alabama          13.2
Alaska           10.0
Arizona           8.1
Arkansas          8.8
California        9.0
Colorado          7.9
Connecticut       3.3
Delaware          5.9
Florida          15.4
Georgia          17.4
Hawaii            5.3
Idaho             2.6
Illinois         10.4
Indiana           7.2
Iowa              2.2
Kansas            6.0
Kentucky          9.7
Louisiana        15.4
Maine             2.1
Maryland         11.3
Massachusetts     4.4
Michigan         12.1
Minnesota         2.7
Mississippi      16.1
Missouri          9.0
Montana           6.0
Nebraska          4.3
Nevada           12.2
New Hampshire     2.1
New Jersey        7.4
New Mexico       11.4
New York         11.1
North Carolina   13.0
North Dakota      0.8
Ohio              7.3
Oklahoma          6.6
Oregon            4.9
Pennsylvania      6.3
Rhode Island      3.4
South Carolina   14.4
South Dakota      3.8
Tennessee        13.2
Texas            12.7
Utah              3.2
Vermont           2.2
Virginia          8.5
Washington        4.0
West Virginia     5.7
Wisconsin         2.6
Wyoming           6.8

The better approach is just to select column 1 in a single line.

USArrests |>
  select(1)

               Murder
Alabama          13.2
Alaska           10.0
Arizona           8.1
Arkansas          8.8
California        9.0
Colorado          7.9
Connecticut       3.3
Delaware          5.9
Florida          15.4
Georgia          17.4
Hawaii            5.3
Idaho             2.6
Illinois         10.4
Indiana           7.2
Iowa              2.2
Kansas            6.0
Kentucky          9.7
Louisiana        15.4
Maine             2.1
Maryland         11.3
Massachusetts     4.4
Michigan         12.1
Minnesota         2.7
Mississippi      16.1
Missouri          9.0
Montana           6.0
Nebraska          4.3
Nevada           12.2
New Hampshire     2.1
New Jersey        7.4
New Mexico       11.4
New York         11.1
North Carolina   13.0
North Dakota      0.8
Ohio              7.3
Oklahoma          6.6
Oregon            4.9
Pennsylvania      6.3
Rhode Island      3.4
South Carolina   14.4
South Dakota      3.8
Tennessee        13.2
Texas            12.7
Utah              3.2
Vermont           2.2
Virginia          8.5
Washington        4.0
West Virginia     5.7
Wisconsin         2.6
Wyoming           6.8

Selecting Variables by Column Name

Rather than position, you will likely wish to select by variable name. We will address several approaches for doing so. Each approach are benefical in different contexts.

Selecting Variables as Unquoted Expressions

Variables can be passed separately without quotes. We can pass two arguments, one with Murder and the other with Assault, separating each with a comma. Because of some special characteristics of the {dplyr}, we can use variable names without quotes. Let’s also reduce the returned data frame using head() in order to save space.

USArrests |> 
  select(Murder, Assault) |> 
  head()

           Murder Assault
Alabama      13.2     236
Alaska       10.0     263
Arizona       8.1     294
Arkansas      8.8     190
California    9.0     276
Colorado      7.9     204

Selecting Variables as Characters/Strings

You are most likely familiar with variable names as quoted strings. We can pass two arguments here using the variable names in quotes. This approach is more similar to a procedures use with in base R functions as you cannot pass unquoted expressions in base R.

USArrests |> 
  select("Murder", "Assault") |> 
  head()

           Murder Assault
Alabama      13.2     236
Alaska       10.0     263
Arizona       8.1     294
Arkansas      8.8     190
California    9.0     276
Colorado      7.9     204

Selecting Variables as a Character Vector

What if your variables were in a vector, c("Murder", "Assault"))?

USArrests |>
  select(c("Murder", "Assault")) |> 
  head()

           Murder Assault
Alabama      13.2     236
Alaska       10.0     263
Arizona       8.1     294
Arkansas      8.8     190
California    9.0     276
Colorado      7.9     204

The same will work if you passed a numeric vector c(1, 3).

Selecting Using an External Vector Object

In many use cases, you will will have a vector of variable names that would will want to pass rather than type individually. This approach also reduces confusion when you have key variables that you want to work with. Create a vector once and then use again and again. If you make any changes, you will only have to make them once, rather than in multiple places.

First, create the vector object:

keep_vars <- c("Murder", "Assault")

Second, Then pass the vector into the function:

USArrests |>
  select(keep_vars)

Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(keep_vars)

  # Now:
  data %>% select(all_of(keep_vars))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.

               Murder Assault
Alabama          13.2     236
Alaska           10.0     263
Arizona           8.1     294
Arkansas          8.8     190
California        9.0     276
Colorado          7.9     204
Connecticut       3.3     110
Delaware          5.9     238
Florida          15.4     335
Georgia          17.4     211
Hawaii            5.3      46
Idaho             2.6     120
Illinois         10.4     249
Indiana           7.2     113
Iowa              2.2      56
Kansas            6.0     115
Kentucky          9.7     109
Louisiana        15.4     249
Maine             2.1      83
Maryland         11.3     300
Massachusetts     4.4     149
Michigan         12.1     255
Minnesota         2.7      72
Mississippi      16.1     259
Missouri          9.0     178
Montana           6.0     109
Nebraska          4.3     102
Nevada           12.2     252
New Hampshire     2.1      57
New Jersey        7.4     159
New Mexico       11.4     285
New York         11.1     254
North Carolina   13.0     337
North Dakota      0.8      45
Ohio              7.3     120
Oklahoma          6.6     151
Oregon            4.9     159
Pennsylvania      6.3     106
Rhode Island      3.4     174
South Carolina   14.4     279
South Dakota      3.8      86
Tennessee        13.2     188
Texas            12.7     201
Utah              3.2     120
Vermont           2.2      48
Virginia          8.5     156
Washington        4.0     145
West Virginia     5.7      81
Wisconsin         2.6      53
Wyoming           6.8     161

Hmm. Warning message.

Note: select(keep_vars) is deprecated as a solution, so if you see this approach someplace, it won’t work properly at this time or in the future. The warning message states that cannot do this without some modifications. What is suggested is all_of(). When your variables are in an external variable, all_of(keep_vars) will tell select() that your variables are all of the variables in the vector. Yes, this is an extra step but worth the flexibility.

USArrests |>
  select(all_of(keep_vars)) |> 
  head()

           Murder Assault
Alabama      13.2     236
Alaska       10.0     263
Arizona       8.1     294
Arkansas      8.8     190
California    9.0     276
Colorado      7.9     204

But this could be problematic if a variable in your vector does not exists. Let’s add some variables that are not in the data frame and then pass that vector.

keep_vars_more <- c("Murder", "Assault", "Var_x", "Var_y", "Var_z")

USArrests |>
  select(all_of(keep_vars_more)) |> 
  head()

You will see the following error message:

Error in `all_of()`:
! Can't subset columns that don't exist.
✖ Columns `Var_x`, `Var_y`, and `Var_z` don't exist.

Tip: Rather than using all_of(), use any_of(). In this instance, any of the variables that exist in the data frame will be selection. Those variables that are not in the data frame will be ignored. With this approach, your code will not break if there something changes.

USArrests |>
  select(any_of(keep_vars_more)) |> 
  head()

           Murder Assault
Alabama      13.2     236
Alaska       10.0     263
Arizona       8.1     294
Arkansas      8.8     190
California    9.0     276
Colorado      7.9     204

Selecting Variables Starting with or Ending with Certain Characters

One thing about {dplyr}, when you load the library, there are functions from other libraries that are imported along with {dplyr}’s own functions. These important functions are designed to work with each other, so the people who maintain the libraries have packaged them up nicely so you don’t have to load separate libraries.

Many of the functions are imported from the {tidyselect} library and these functions are what give you additional manipulation ability. Some imported functions are: all_of(), any_of(), contains(), ends_with(), everything(), last_col(), matches(), and starts_with().

With functions like starts_with(), contains(), and ends_with(), you can select variables containing character patterns in their names. For example, rather than code out all variables for a memory span task, you might want to grab variables that contain the characters "span".

Rather than hard coding variable names (e.g., "span_time1", "span_time2", c("Murder", "Assault")) to pass as an argument to select(), you would use another function to perform the heaving lifting for you (e.g.,starts_with()). In this case, the variable names that starts_with() specific character patterns get passed to select(). This process represents a good example of whe what is referred to as functional programming. This is also a good example of working smarter, not harder. Rather than coding specifically what to pass as an argument to another function, you utilize another function to pass its returned object as an argument to another function.

Selecting Variables Using `starts_with()`

Understanding `starts_with()`

But first, we need to understand what starts_with(), actually does. The function name should provide you some insight but for more information, use ?starts_with.

starts_with(match, 
            ignore.case = TRUE, 
            vars = NULL
            )

Parameters/Arguments:

match: a character vector
ignore.case: if TRUE, the default, ignores case when matching names. This is most flexible.
vars: a character vector of variable names. If not supplied, the variables are taken from the current selection context (as established by functions like select() or pivot_longer()).

The best way to understand the function is to use it (after reading the documentation). Let’s try out starts_with(). Let’s set a required pattern match = some character and because vars = NULL by default, let’s just set vars = some character vector. If you reviewed the documentation or looked at the arguments above, vars is not the second parameter, so you will want to name it in the function call if it takes that second position as seen below. We will first test this function without using it on a data frame.

We set match = "a" and vars = c("Hello", "Hi", "Bye"). This will look for character "a" to match the elements of a character vector c("Hello", "Hi", "Bye").

starts_with(match = "a", 
            vars = c("Hello", "Hi", "Bye")
            )

integer(0)

What is returned is integer(0), which is actually speak for “there is no match”. Not very intuitive. Hmm, OK. Let’s try another character.

starts_with(match = "b", 
            vars = c("Hello", "Hi", "Bye")
            )

[1] 3

OK, so now an integer is returned (yes, try is.integer() if you don’t believe me).

is.integer(
  starts_with("b", 
              vars = c("Hello", "Hi", "Bye")
              ))

[1] TRUE

And even though you only see one value, this object is still a vector. We can check using is.vector().

is.vector(
  starts_with("b", 
              vars = c("Hello", "Hi", "Bye")
              ))

[1] TRUE

Importantly, starts_with() returns a vector of values. What do the values correspond to? The values in the vector refer to the element index/position of a match and the number of the elements in the vector tells refers to the number of matches found. Because the third string element "Bye" starts with "b", that’s what is returned.

Try something else:

starts_with("h", 
            vars = c("Hello", "Hi", "Bye")
            )

[1] 1 2

Now the function returns a vector containing 1 and a 2, giving length = 2, representing both the first and the second elements start with “h”.

Importantly, letter casing is ignored because ignore.case = TRUE by default. Set to FALSE if you want your match to be case sensitive.

starts_with("h", 
            vars = c("Hello", "Hi", "Bye"), 
            ignore.case = F
            )

integer(0)

And now there are no matches.

Selecting Variables from a Data Frame using `starts_with()`

You will typically use starts_with() when working with a data frame and in conjunction with other functions, like select() in order to select variables that start with certain characters. When using starts_with() in the context of select(), the vars argument is essentially passing vars = the names of the columns of the data frame passed to select().

Example:

select(mydataframe,
       starts_with(match = "my pattern",
                   vars = "var names of mydataframe")
      )

Let’s provide two use cases:

Without Piping

You can pass the data frame as the first argument to select().

select(USArrests, 
       starts_with("m")
       ) |> 
  head()

           Murder
Alabama      13.2
Alaska       10.0
Arizona       8.1
Arkansas      8.8
California    9.0
Colorado      7.9

Notice that the only column variable that is returned is Murder.

With Piping

You can also pipe the data frame into select(), which assumes the first argument is a data frame. This is a process you will see used often by others.

USArrests |>
  select(starts_with("m")) |> 
  head()

           Murder
Alabama      13.2
Alaska       10.0
Arizona       8.1
Arkansas      8.8
California    9.0
Colorado      7.9

Selecting Variables using `ends_with()`

You can use ends_with() to find matches at the end of strings. This function is useful if you have variables that end based on their measurement type (e.g., “_rt” or “_acc” for reaction times and accuracy, respectively). Having the foresight to create variables is such a way to make accessing them easily later is a good example of working smarter, not harder.

USArrests |>
  select(ends_with("t")) |> 
  head()

           Assault
Alabama        236
Alaska         263
Arizona        294
Arkansas       190
California     276
Colorado       204

Selecting and Selecting Out Variables By/Between Index

So far, we have address selecting variables. There are, however, many approaches for selecting but also selecting out or omitting column variables. In other words, rather than including 14 out of 15 variables, you could simply exclude 1 of the 15 variables to achieve the same result.

You can pass multiple arguments for each specification or you can pass a single vector that contains all specifications.

select(1, 2): select first and second columns
select(c(1, 2)): select first and second columns
select(-c(1, 2)): select out first and second columns
select(1:2): select first through second columns
select(c(1:2)): select first through second columns
select(-c(1:2)): select out first through second columns

Potential Recommendation: use options utilizing c() to pass a vector because this habit will be more versatile with base R functionality. However, online solutions will likely not take this approach.

Let’s make a data frame to work with first. We will name the data frame DAT even though this is a terrible name because it lacks diagnosticity.

DAT <- 
  data.frame(
    Id  = c(100, 101, 102, 103, 104, 100, 105),
    Sex = c('male', 'female', 'Male', NA, 'man', "male", "neither"),
    Age = c(25, 33, 27, 40, 44, 25, 40),
    Renting = c("yes", NA, "yes", NA, "no", "yes", "yes")
)

Select columns 1 and 2:

DAT |>
  select(1, 2)

   Id     Sex
1 100    male
2 101  female
3 102    Male
4 103    <NA>
5 104     man
6 100    male
7 105 neither

Select columns 1 and 2 as a vector containing values 1 and 2:

DAT |>
  select(c(1, 2))

   Id     Sex
1 100    male
2 101  female
3 102    Male
4 103    <NA>
5 104     man
6 100    male
7 105 neither

Select out columns 1 and 2 as a vector containing values 1 and 2:

DAT |>
  select(-c(1,2))

  Age Renting
1  25     yes
2  33    <NA>
3  27     yes
4  40    <NA>
5  44      no
6  25     yes
7  40     yes

Select from columns 1 to 2 using the : operator:

DAT |>
  select(1:2)

   Id     Sex
1 100    male
2 101  female
3 102    Male
4 103    <NA>
5 104     man
6 100    male
7 105 neither

Select from columns 1 through 3 using a vector containing the : operator:

DAT |>
  select(c(1:3))

   Id     Sex Age
1 100    male  25
2 101  female  33
3 102    Male  27
4 103    <NA>  40
5 104     man  44
6 100    male  25
7 105 neither  40

Select out columns 1 through 3 using a vector containing the : operator:

DAT |>
  select(-c(1:3))   # select out from here to there

  Renting
1     yes
2    <NA>
3     yes
4    <NA>
5      no
6     yes
7     yes

Selecting and Selecting Out Variables By or Between Character Name

These approaches are similar to those offered earlier except that some involve passing variables by their name (e.g., character names). Whereas the order of the variables in a data frame may move around, the names may be more stable or permanent, at least after you have cleaned up the names. Consequently, passing variables by name may be more foolproof.

You don’t have to be familiar with all approaches and you may settle on using one that makes the most sense to you.

select("var1", "var2")
select(c("var1", "var2"))
select(-c("var1", "var2"))
select(var1:var2))
select(c("var1":"var2))
select(-c("var1":"var2))

Recommendation: use options utilizing c() as this will be more versatile with base R functionality.

These function approaches also work but they may lead to some confusion regarding usage of quotes:

select(var1, var2)
select(c(var1, var2))
select(-c(var1, var2))

Select variables Id though Age using the : operator:

DAT |>
  select(Id:Age)                            # select from here to there

   Id     Sex Age
1 100    male  25
2 101  female  33
3 102    Male  27
4 103    <NA>  40
5 104     man  44
6 100    male  25
7 105 neither  40

Select variables Id though Age passed as strings using the : operator:

DAT |>
  select("Id":"Age")                        # select from here to there

   Id     Sex Age
1 100    male  25
2 101  female  33
3 102    Male  27
4 103    <NA>  40
5 104     man  44
6 100    male  25
7 105 neither  40

Select variables Id though Age as a vector containing the variable names passed as strings and using the : operator:

DAT |>
  select(c("Id":"Age"))                     # select from here to there

   Id     Sex Age
1 100    male  25
2 101  female  33
3 102    Male  27
4 103    <NA>  40
5 104     man  44
6 100    male  25
7 105 neither  40

Select out variables Id though Age as a vector containing the variable names passed as strings and using the : operator:

DAT |>
  select(-c("Id":"Age"))                    # select out from here to there

  Renting
1     yes
2    <NA>
3     yes
4    <NA>
5      no
6     yes
7     yes

You can also use the ! operator to select NOT these variables (therefore, all others)

DAT |>
  select(!c("Id":"Age"))                    # select out from here to there

  Renting
1     yes
2    <NA>
3     yes
4    <NA>
5      no
6     yes
7     yes

Selecting and Selecting Out Variables Characters in Their Names

select(starts_with("characters"))
select(ends_with("characters"))
select(contains('e'))

Selecting using `starts_with()`

Select variables which start with character “i”:

DAT |> 
  select(starts_with('i'))

Select variables which DO NOT start with character “s”:

DAT |> 
  select(-starts_with('s'))

   Id Age Renting
1 100  25     yes
2 101  33    <NA>
3 102  27     yes
4 103  40    <NA>
5 104  44      no
6 100  25     yes
7 105  40     yes

Selecting using `ends_with()`

Select variables which end with character “e”:

DAT |> 
  select(ends_with('e'))

Select variables which end with character “e”:

DAT |> 
  select(-ends_with('e'))

   Id     Sex Renting
1 100    male     yes
2 101  female    <NA>
3 102    Male     yes
4 103    <NA>    <NA>
5 104     man      no
6 100    male     yes
7 105 neither     yes

Selecting using `contains()`

Select variables which contain character “g”:

DAT |> 
  select(contains('g'))

  Age Renting
1  25     yes
2  33    <NA>
3  27     yes
4  40    <NA>
5  44      no
6  25     yes
7  40     yes

Select variables which DO NOT contain character “g”:

DAT |> 
  select(-contains('g'))

   Id     Sex
1 100    male
2 101  female
3 102    Male
4 103    <NA>
5 104     man
6 100    male
7 105 neither

Selecting using regular expressions

Sometimes you do not want to search for specific character strings. Instead, you might want to look for strings that contain particular character patterns. For example, typing out var1, var2, var3 could be annoying. You might prefer just grabbing variables that match the pattern "var".

We will use some regular expressions, or regex, to help us. For the regex, you we will need to specify some code for finding patterns. For example, \\d{4} will search for patterns with 4 digits, whereas the dot star .* will help with all patterns. In particular, . refers to any character (e.g,. digit, alpha character, or any other special character) and * means zero or more times, so this pattern will search for all files that start with 4 digits followed by anything in the name. A pattern will restrict the search to files containing that exact character string. Some example will be helpful.

**Selecting variables of all patterns using `".*"`**

Select variables containing a regular expression, use matches():

.* will grab all names because it means any character and any number of times

DAT |> 
  select(matches(".*"))

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes

Selecting variables with digits using `"\\d"`

\\d will grab all variables containing a digit:

DAT |> 
  mutate(var_1    = 1,
         var_11   = 1,
         var_3    = 1,
         var1_var = 1
         ) |>
  select(matches("\\d"))

  var_1 var_11 var_3 var1_var
1     1      1     1        1
2     1      1     1        1
3     1      1     1        1
4     1      1     1        1
5     1      1     1        1
6     1      1     1        1
7     1      1     1        1

**Selecting variables that start and contain characters `"v.*\\d"`**

v.*\\d will grab all variables that start with v and then contain any characters which are followed by a digit:

DAT |>
    mutate(var_1    = 1,
           var_11   = 1,
           var_3    = 1,
           var1_var = 1
         ) |>
  select(matches("v.*\\d"))

  var_1 var_11 var_3 var1_var
1     1      1     1        1
2     1      1     1        1
3     1      1     1        1
4     1      1     1        1
5     1      1     1        1
6     1      1     1        1
7     1      1     1        1

Selecting variables that end with a digit using `"\\d$"`

\\d$ will grab all variables ending in a digit ($ means end):

DAT |> 
    mutate(var_1    = 1,
           var_11   = 1,
           var_3    = 1,
           var1_var = 1
           ) |>
  select(matches("\\d$"))

  var_1 var_11 var_3
1     1      1     1
2     1      1     1
3     1      1     1
4     1      1     1
5     1      1     1
6     1      1     1
7     1      1     1

Selecting variables by negation

You can also negate all regular expression matches if you want to exclude:

DAT |>
      mutate(var_1    = 1,
             var_11   = 1,
             var_3    = 1,
             var1_var = 1
             ) |>
  select(-matches("\\d$"))

   Id     Sex Age Renting var1_var
1 100    male  25     yes        1
2 101  female  33    <NA>        1
3 102    Male  27     yes        1
4 103    <NA>  40    <NA>        1
5 104     man  44      no        1
6 100    male  25     yes        1
7 105 neither  40     yes        1

Note: The functions will return lowercase and uppercase variable name matches because the default behavior is ignore.case = TRUE. Set to FALSE if you want to perform precise surgery on the variables.

Selecting and Selecting Out Variables by Type

Select variables that are numeric:

DAT |>
  select(where(is.numeric))

Select variables that are NOT numeric:

DAT |> 
  select(-where(is.numeric))

      Sex Renting
1    male     yes
2  female    <NA>
3    Male     yes
4    <NA>    <NA>
5     man      no
6    male     yes
7 neither     yes

Select variables that are character:

DAT |> 
  select(where(is.character))

      Sex Renting
1    male     yes
2  female    <NA>
3    Male     yes
4    <NA>    <NA>
5     man      no
6    male     yes
7 neither     yes

Select variables that are NOT character:

DAT |>
  select(-where(is.character))

Select variables that are logical (TRUE to FALSE):

DAT |> 
  select(where(is.logical))

data frame with 0 columns and 7 rows

Select variables that are NOT logical (TRUE to FALSE):

DAT |> 
  select(-where(is.logical))

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes

Creating/Mutating Variables

When we want to add new variables or modify existing ones, we will use mutate(). Before using the function, review what it does using ?mutate.

mutate(
  .data,
  ...,
  .by = NULL,
  .keep = c("all", "used", "unused", "none"),
  .before = NULL,
  .after = NULL
)

Parameters/Arguments:

.data: a data frame
...: name-value pairs; name is the name of the column and the value would be the value or function used

Let’s start with a constant and move up from there.

Creating/Mutating A Constant Value

Let’s create a new variable, new_var and set it equal to 1.

USArrests |>
  mutate(new_var1 = 1)

               Murder Assault UrbanPop Rape new_var1
Alabama          13.2     236       58 21.2        1
Alaska           10.0     263       48 44.5        1
Arizona           8.1     294       80 31.0        1
Arkansas          8.8     190       50 19.5        1
California        9.0     276       91 40.6        1
Colorado          7.9     204       78 38.7        1
Connecticut       3.3     110       77 11.1        1
Delaware          5.9     238       72 15.8        1
Florida          15.4     335       80 31.9        1
Georgia          17.4     211       60 25.8        1
Hawaii            5.3      46       83 20.2        1
Idaho             2.6     120       54 14.2        1
Illinois         10.4     249       83 24.0        1
Indiana           7.2     113       65 21.0        1
Iowa              2.2      56       57 11.3        1
Kansas            6.0     115       66 18.0        1
Kentucky          9.7     109       52 16.3        1
Louisiana        15.4     249       66 22.2        1
Maine             2.1      83       51  7.8        1
Maryland         11.3     300       67 27.8        1
Massachusetts     4.4     149       85 16.3        1
Michigan         12.1     255       74 35.1        1
Minnesota         2.7      72       66 14.9        1
Mississippi      16.1     259       44 17.1        1
Missouri          9.0     178       70 28.2        1
Montana           6.0     109       53 16.4        1
Nebraska          4.3     102       62 16.5        1
Nevada           12.2     252       81 46.0        1
New Hampshire     2.1      57       56  9.5        1
New Jersey        7.4     159       89 18.8        1
New Mexico       11.4     285       70 32.1        1
New York         11.1     254       86 26.1        1
North Carolina   13.0     337       45 16.1        1
North Dakota      0.8      45       44  7.3        1
Ohio              7.3     120       75 21.4        1
Oklahoma          6.6     151       68 20.0        1
Oregon            4.9     159       67 29.3        1
Pennsylvania      6.3     106       72 14.9        1
Rhode Island      3.4     174       87  8.3        1
South Carolina   14.4     279       48 22.5        1
South Dakota      3.8      86       45 12.8        1
Tennessee        13.2     188       59 26.9        1
Texas            12.7     201       80 25.5        1
Utah              3.2     120       80 22.9        1
Vermont           2.2      48       32 11.2        1
Virginia          8.5     156       63 20.7        1
Washington        4.0     145       73 26.2        1
West Virginia     5.7      81       39  9.3        1
Wisconsin         2.6      53       66 10.8        1
Wyoming           6.8     161       60 15.6        1

To mutate more than one variable, you can add a new name-value pair as a new argument:

USArrests |>
  mutate(new_var1 = 1,
         new_var2 = 9999
         )

               Murder Assault UrbanPop Rape new_var1 new_var2
Alabama          13.2     236       58 21.2        1     9999
Alaska           10.0     263       48 44.5        1     9999
Arizona           8.1     294       80 31.0        1     9999
Arkansas          8.8     190       50 19.5        1     9999
California        9.0     276       91 40.6        1     9999
Colorado          7.9     204       78 38.7        1     9999
Connecticut       3.3     110       77 11.1        1     9999
Delaware          5.9     238       72 15.8        1     9999
Florida          15.4     335       80 31.9        1     9999
Georgia          17.4     211       60 25.8        1     9999
Hawaii            5.3      46       83 20.2        1     9999
Idaho             2.6     120       54 14.2        1     9999
Illinois         10.4     249       83 24.0        1     9999
Indiana           7.2     113       65 21.0        1     9999
Iowa              2.2      56       57 11.3        1     9999
Kansas            6.0     115       66 18.0        1     9999
Kentucky          9.7     109       52 16.3        1     9999
Louisiana        15.4     249       66 22.2        1     9999
Maine             2.1      83       51  7.8        1     9999
Maryland         11.3     300       67 27.8        1     9999
Massachusetts     4.4     149       85 16.3        1     9999
Michigan         12.1     255       74 35.1        1     9999
Minnesota         2.7      72       66 14.9        1     9999
Mississippi      16.1     259       44 17.1        1     9999
Missouri          9.0     178       70 28.2        1     9999
Montana           6.0     109       53 16.4        1     9999
Nebraska          4.3     102       62 16.5        1     9999
Nevada           12.2     252       81 46.0        1     9999
New Hampshire     2.1      57       56  9.5        1     9999
New Jersey        7.4     159       89 18.8        1     9999
New Mexico       11.4     285       70 32.1        1     9999
New York         11.1     254       86 26.1        1     9999
North Carolina   13.0     337       45 16.1        1     9999
North Dakota      0.8      45       44  7.3        1     9999
Ohio              7.3     120       75 21.4        1     9999
Oklahoma          6.6     151       68 20.0        1     9999
Oregon            4.9     159       67 29.3        1     9999
Pennsylvania      6.3     106       72 14.9        1     9999
Rhode Island      3.4     174       87  8.3        1     9999
South Carolina   14.4     279       48 22.5        1     9999
South Dakota      3.8      86       45 12.8        1     9999
Tennessee        13.2     188       59 26.9        1     9999
Texas            12.7     201       80 25.5        1     9999
Utah              3.2     120       80 22.9        1     9999
Vermont           2.2      48       32 11.2        1     9999
Virginia          8.5     156       63 20.7        1     9999
Washington        4.0     145       73 26.2        1     9999
West Virginia     5.7      81       39  9.3        1     9999
Wisconsin         2.6      53       66 10.8        1     9999
Wyoming           6.8     161       60 15.6        1     9999

Or you can add a new mutate() by piping:

USArrests |>
  mutate(new_var1 = 1) |>  # add the variable 
  mutate(new_var1 = 9999)  # then add another variable

               Murder Assault UrbanPop Rape new_var1
Alabama          13.2     236       58 21.2     9999
Alaska           10.0     263       48 44.5     9999
Arizona           8.1     294       80 31.0     9999
Arkansas          8.8     190       50 19.5     9999
California        9.0     276       91 40.6     9999
Colorado          7.9     204       78 38.7     9999
Connecticut       3.3     110       77 11.1     9999
Delaware          5.9     238       72 15.8     9999
Florida          15.4     335       80 31.9     9999
Georgia          17.4     211       60 25.8     9999
Hawaii            5.3      46       83 20.2     9999
Idaho             2.6     120       54 14.2     9999
Illinois         10.4     249       83 24.0     9999
Indiana           7.2     113       65 21.0     9999
Iowa              2.2      56       57 11.3     9999
Kansas            6.0     115       66 18.0     9999
Kentucky          9.7     109       52 16.3     9999
Louisiana        15.4     249       66 22.2     9999
Maine             2.1      83       51  7.8     9999
Maryland         11.3     300       67 27.8     9999
Massachusetts     4.4     149       85 16.3     9999
Michigan         12.1     255       74 35.1     9999
Minnesota         2.7      72       66 14.9     9999
Mississippi      16.1     259       44 17.1     9999
Missouri          9.0     178       70 28.2     9999
Montana           6.0     109       53 16.4     9999
Nebraska          4.3     102       62 16.5     9999
Nevada           12.2     252       81 46.0     9999
New Hampshire     2.1      57       56  9.5     9999
New Jersey        7.4     159       89 18.8     9999
New Mexico       11.4     285       70 32.1     9999
New York         11.1     254       86 26.1     9999
North Carolina   13.0     337       45 16.1     9999
North Dakota      0.8      45       44  7.3     9999
Ohio              7.3     120       75 21.4     9999
Oklahoma          6.6     151       68 20.0     9999
Oregon            4.9     159       67 29.3     9999
Pennsylvania      6.3     106       72 14.9     9999
Rhode Island      3.4     174       87  8.3     9999
South Carolina   14.4     279       48 22.5     9999
South Dakota      3.8      86       45 12.8     9999
Tennessee        13.2     188       59 26.9     9999
Texas            12.7     201       80 25.5     9999
Utah              3.2     120       80 22.9     9999
Vermont           2.2      48       32 11.2     9999
Virginia          8.5     156       63 20.7     9999
Washington        4.0     145       73 26.2     9999
West Virginia     5.7      81       39  9.3     9999
Wisconsin         2.6      53       66 10.8     9999
Wyoming           6.8     161       60 15.6     9999

Whatever your approach, keep the variable creation on separate lines. Don’t do this:

USArrests |>
  mutate(new_var1 = 1, new_var1 = 9999)

               Murder Assault UrbanPop Rape new_var1
Alabama          13.2     236       58 21.2     9999
Alaska           10.0     263       48 44.5     9999
Arizona           8.1     294       80 31.0     9999
Arkansas          8.8     190       50 19.5     9999
California        9.0     276       91 40.6     9999
Colorado          7.9     204       78 38.7     9999
Connecticut       3.3     110       77 11.1     9999
Delaware          5.9     238       72 15.8     9999
Florida          15.4     335       80 31.9     9999
Georgia          17.4     211       60 25.8     9999
Hawaii            5.3      46       83 20.2     9999
Idaho             2.6     120       54 14.2     9999
Illinois         10.4     249       83 24.0     9999
Indiana           7.2     113       65 21.0     9999
Iowa              2.2      56       57 11.3     9999
Kansas            6.0     115       66 18.0     9999
Kentucky          9.7     109       52 16.3     9999
Louisiana        15.4     249       66 22.2     9999
Maine             2.1      83       51  7.8     9999
Maryland         11.3     300       67 27.8     9999
Massachusetts     4.4     149       85 16.3     9999
Michigan         12.1     255       74 35.1     9999
Minnesota         2.7      72       66 14.9     9999
Mississippi      16.1     259       44 17.1     9999
Missouri          9.0     178       70 28.2     9999
Montana           6.0     109       53 16.4     9999
Nebraska          4.3     102       62 16.5     9999
Nevada           12.2     252       81 46.0     9999
New Hampshire     2.1      57       56  9.5     9999
New Jersey        7.4     159       89 18.8     9999
New Mexico       11.4     285       70 32.1     9999
New York         11.1     254       86 26.1     9999
North Carolina   13.0     337       45 16.1     9999
North Dakota      0.8      45       44  7.3     9999
Ohio              7.3     120       75 21.4     9999
Oklahoma          6.6     151       68 20.0     9999
Oregon            4.9     159       67 29.3     9999
Pennsylvania      6.3     106       72 14.9     9999
Rhode Island      3.4     174       87  8.3     9999
South Carolina   14.4     279       48 22.5     9999
South Dakota      3.8      86       45 12.8     9999
Tennessee        13.2     188       59 26.9     9999
Texas            12.7     201       80 25.5     9999
Utah              3.2     120       80 22.9     9999
Vermont           2.2      48       32 11.2     9999
Virginia          8.5     156       63 20.7     9999
Washington        4.0     145       73 26.2     9999
West Virginia     5.7      81       39  9.3     9999
Wisconsin         2.6      53       66 10.8     9999
Wyoming           6.8     161       60 15.6     9999

Although there is nothing technically incorrect with doing so, (1) this is harder to read, (2) the R community won’t like it, and importantly (3) it is less flexible when you want to temporarily comment out a line of your code as seen here.

USArrests |>
#  mutate(new_var1 = 1) |>  # add the variable 
  mutate(new_var1 = 9999)  # then add another variable

               Murder Assault UrbanPop Rape new_var1
Alabama          13.2     236       58 21.2     9999
Alaska           10.0     263       48 44.5     9999
Arizona           8.1     294       80 31.0     9999
Arkansas          8.8     190       50 19.5     9999
California        9.0     276       91 40.6     9999
Colorado          7.9     204       78 38.7     9999
Connecticut       3.3     110       77 11.1     9999
Delaware          5.9     238       72 15.8     9999
Florida          15.4     335       80 31.9     9999
Georgia          17.4     211       60 25.8     9999
Hawaii            5.3      46       83 20.2     9999
Idaho             2.6     120       54 14.2     9999
Illinois         10.4     249       83 24.0     9999
Indiana           7.2     113       65 21.0     9999
Iowa              2.2      56       57 11.3     9999
Kansas            6.0     115       66 18.0     9999
Kentucky          9.7     109       52 16.3     9999
Louisiana        15.4     249       66 22.2     9999
Maine             2.1      83       51  7.8     9999
Maryland         11.3     300       67 27.8     9999
Massachusetts     4.4     149       85 16.3     9999
Michigan         12.1     255       74 35.1     9999
Minnesota         2.7      72       66 14.9     9999
Mississippi      16.1     259       44 17.1     9999
Missouri          9.0     178       70 28.2     9999
Montana           6.0     109       53 16.4     9999
Nebraska          4.3     102       62 16.5     9999
Nevada           12.2     252       81 46.0     9999
New Hampshire     2.1      57       56  9.5     9999
New Jersey        7.4     159       89 18.8     9999
New Mexico       11.4     285       70 32.1     9999
New York         11.1     254       86 26.1     9999
North Carolina   13.0     337       45 16.1     9999
North Dakota      0.8      45       44  7.3     9999
Ohio              7.3     120       75 21.4     9999
Oklahoma          6.6     151       68 20.0     9999
Oregon            4.9     159       67 29.3     9999
Pennsylvania      6.3     106       72 14.9     9999
Rhode Island      3.4     174       87  8.3     9999
South Carolina   14.4     279       48 22.5     9999
South Dakota      3.8      86       45 12.8     9999
Tennessee        13.2     188       59 26.9     9999
Texas            12.7     201       80 25.5     9999
Utah              3.2     120       80 22.9     9999
Vermont           2.2      48       32 11.2     9999
Virginia          8.5     156       63 20.7     9999
Washington        4.0     145       73 26.2     9999
West Virginia     5.7      81       39  9.3     9999
Wisconsin         2.6      53       66 10.8     9999
Wyoming           6.8     161       60 15.6     9999

Creating/Mutating A Constant Character

USArrests |>
  mutate(new_var1 = "Below Average",
         )

               Murder Assault UrbanPop Rape      new_var1
Alabama          13.2     236       58 21.2 Below Average
Alaska           10.0     263       48 44.5 Below Average
Arizona           8.1     294       80 31.0 Below Average
Arkansas          8.8     190       50 19.5 Below Average
California        9.0     276       91 40.6 Below Average
Colorado          7.9     204       78 38.7 Below Average
Connecticut       3.3     110       77 11.1 Below Average
Delaware          5.9     238       72 15.8 Below Average
Florida          15.4     335       80 31.9 Below Average
Georgia          17.4     211       60 25.8 Below Average
Hawaii            5.3      46       83 20.2 Below Average
Idaho             2.6     120       54 14.2 Below Average
Illinois         10.4     249       83 24.0 Below Average
Indiana           7.2     113       65 21.0 Below Average
Iowa              2.2      56       57 11.3 Below Average
Kansas            6.0     115       66 18.0 Below Average
Kentucky          9.7     109       52 16.3 Below Average
Louisiana        15.4     249       66 22.2 Below Average
Maine             2.1      83       51  7.8 Below Average
Maryland         11.3     300       67 27.8 Below Average
Massachusetts     4.4     149       85 16.3 Below Average
Michigan         12.1     255       74 35.1 Below Average
Minnesota         2.7      72       66 14.9 Below Average
Mississippi      16.1     259       44 17.1 Below Average
Missouri          9.0     178       70 28.2 Below Average
Montana           6.0     109       53 16.4 Below Average
Nebraska          4.3     102       62 16.5 Below Average
Nevada           12.2     252       81 46.0 Below Average
New Hampshire     2.1      57       56  9.5 Below Average
New Jersey        7.4     159       89 18.8 Below Average
New Mexico       11.4     285       70 32.1 Below Average
New York         11.1     254       86 26.1 Below Average
North Carolina   13.0     337       45 16.1 Below Average
North Dakota      0.8      45       44  7.3 Below Average
Ohio              7.3     120       75 21.4 Below Average
Oklahoma          6.6     151       68 20.0 Below Average
Oregon            4.9     159       67 29.3 Below Average
Pennsylvania      6.3     106       72 14.9 Below Average
Rhode Island      3.4     174       87  8.3 Below Average
South Carolina   14.4     279       48 22.5 Below Average
South Dakota      3.8      86       45 12.8 Below Average
Tennessee        13.2     188       59 26.9 Below Average
Texas            12.7     201       80 25.5 Below Average
Utah              3.2     120       80 22.9 Below Average
Vermont           2.2      48       32 11.2 Below Average
Virginia          8.5     156       63 20.7 Below Average
Washington        4.0     145       73 26.2 Below Average
West Virginia     5.7      81       39  9.3 Below Average
Wisconsin         2.6      53       66 10.8 Below Average
Wyoming           6.8     161       60 15.6 Below Average

Creating/Mutating A Logical Object

Logical objects are TRUE or FALSE. We can perform the logical test by asking whether Assault > Murder. If the Assault value for a row is greater than the Murder value, TRUE will be returned, otherwise FALSE.

USArrests |>
  mutate(new_var = Assault > Murder)

               Murder Assault UrbanPop Rape new_var
Alabama          13.2     236       58 21.2    TRUE
Alaska           10.0     263       48 44.5    TRUE
Arizona           8.1     294       80 31.0    TRUE
Arkansas          8.8     190       50 19.5    TRUE
California        9.0     276       91 40.6    TRUE
Colorado          7.9     204       78 38.7    TRUE
Connecticut       3.3     110       77 11.1    TRUE
Delaware          5.9     238       72 15.8    TRUE
Florida          15.4     335       80 31.9    TRUE
Georgia          17.4     211       60 25.8    TRUE
Hawaii            5.3      46       83 20.2    TRUE
Idaho             2.6     120       54 14.2    TRUE
Illinois         10.4     249       83 24.0    TRUE
Indiana           7.2     113       65 21.0    TRUE
Iowa              2.2      56       57 11.3    TRUE
Kansas            6.0     115       66 18.0    TRUE
Kentucky          9.7     109       52 16.3    TRUE
Louisiana        15.4     249       66 22.2    TRUE
Maine             2.1      83       51  7.8    TRUE
Maryland         11.3     300       67 27.8    TRUE
Massachusetts     4.4     149       85 16.3    TRUE
Michigan         12.1     255       74 35.1    TRUE
Minnesota         2.7      72       66 14.9    TRUE
Mississippi      16.1     259       44 17.1    TRUE
Missouri          9.0     178       70 28.2    TRUE
Montana           6.0     109       53 16.4    TRUE
Nebraska          4.3     102       62 16.5    TRUE
Nevada           12.2     252       81 46.0    TRUE
New Hampshire     2.1      57       56  9.5    TRUE
New Jersey        7.4     159       89 18.8    TRUE
New Mexico       11.4     285       70 32.1    TRUE
New York         11.1     254       86 26.1    TRUE
North Carolina   13.0     337       45 16.1    TRUE
North Dakota      0.8      45       44  7.3    TRUE
Ohio              7.3     120       75 21.4    TRUE
Oklahoma          6.6     151       68 20.0    TRUE
Oregon            4.9     159       67 29.3    TRUE
Pennsylvania      6.3     106       72 14.9    TRUE
Rhode Island      3.4     174       87  8.3    TRUE
South Carolina   14.4     279       48 22.5    TRUE
South Dakota      3.8      86       45 12.8    TRUE
Tennessee        13.2     188       59 26.9    TRUE
Texas            12.7     201       80 25.5    TRUE
Utah              3.2     120       80 22.9    TRUE
Vermont           2.2      48       32 11.2    TRUE
Virginia          8.5     156       63 20.7    TRUE
Washington        4.0     145       73 26.2    TRUE
West Virginia     5.7      81       39  9.3    TRUE
Wisconsin         2.6      53       66 10.8    TRUE
Wyoming           6.8     161       60 15.6    TRUE

We see that all states (rows) have more assults than murders. No surprise.

Creating/Mutating Based on Function

Here, the name value pair will contain a function. For example, let’s create a variable based on the mean() of a variable.

USArrests |>
  mutate(Mean_Assault = mean(Assault))

               Murder Assault UrbanPop Rape Mean_Assault
Alabama          13.2     236       58 21.2       170.76
Alaska           10.0     263       48 44.5       170.76
Arizona           8.1     294       80 31.0       170.76
Arkansas          8.8     190       50 19.5       170.76
California        9.0     276       91 40.6       170.76
Colorado          7.9     204       78 38.7       170.76
Connecticut       3.3     110       77 11.1       170.76
Delaware          5.9     238       72 15.8       170.76
Florida          15.4     335       80 31.9       170.76
Georgia          17.4     211       60 25.8       170.76
Hawaii            5.3      46       83 20.2       170.76
Idaho             2.6     120       54 14.2       170.76
Illinois         10.4     249       83 24.0       170.76
Indiana           7.2     113       65 21.0       170.76
Iowa              2.2      56       57 11.3       170.76
Kansas            6.0     115       66 18.0       170.76
Kentucky          9.7     109       52 16.3       170.76
Louisiana        15.4     249       66 22.2       170.76
Maine             2.1      83       51  7.8       170.76
Maryland         11.3     300       67 27.8       170.76
Massachusetts     4.4     149       85 16.3       170.76
Michigan         12.1     255       74 35.1       170.76
Minnesota         2.7      72       66 14.9       170.76
Mississippi      16.1     259       44 17.1       170.76
Missouri          9.0     178       70 28.2       170.76
Montana           6.0     109       53 16.4       170.76
Nebraska          4.3     102       62 16.5       170.76
Nevada           12.2     252       81 46.0       170.76
New Hampshire     2.1      57       56  9.5       170.76
New Jersey        7.4     159       89 18.8       170.76
New Mexico       11.4     285       70 32.1       170.76
New York         11.1     254       86 26.1       170.76
North Carolina   13.0     337       45 16.1       170.76
North Dakota      0.8      45       44  7.3       170.76
Ohio              7.3     120       75 21.4       170.76
Oklahoma          6.6     151       68 20.0       170.76
Oregon            4.9     159       67 29.3       170.76
Pennsylvania      6.3     106       72 14.9       170.76
Rhode Island      3.4     174       87  8.3       170.76
South Carolina   14.4     279       48 22.5       170.76
South Dakota      3.8      86       45 12.8       170.76
Tennessee        13.2     188       59 26.9       170.76
Texas            12.7     201       80 25.5       170.76
Utah              3.2     120       80 22.9       170.76
Vermont           2.2      48       32 11.2       170.76
Virginia          8.5     156       63 20.7       170.76
Washington        4.0     145       73 26.2       170.76
West Virginia     5.7      81       39  9.3       170.76
Wisconsin         2.6      53       66 10.8       170.76
Wyoming           6.8     161       60 15.6       170.76

Notice that the new column contains the mean for all rows.

We may also wish to categorize rows based on their relative Assault data. How about assault lower than the mean assault?

USArrests |>
  mutate(Lower_than_Mean_Assault = Assault < mean(Assault))

               Murder Assault UrbanPop Rape Lower_than_Mean_Assault
Alabama          13.2     236       58 21.2                   FALSE
Alaska           10.0     263       48 44.5                   FALSE
Arizona           8.1     294       80 31.0                   FALSE
Arkansas          8.8     190       50 19.5                   FALSE
California        9.0     276       91 40.6                   FALSE
Colorado          7.9     204       78 38.7                   FALSE
Connecticut       3.3     110       77 11.1                    TRUE
Delaware          5.9     238       72 15.8                   FALSE
Florida          15.4     335       80 31.9                   FALSE
Georgia          17.4     211       60 25.8                   FALSE
Hawaii            5.3      46       83 20.2                    TRUE
Idaho             2.6     120       54 14.2                    TRUE
Illinois         10.4     249       83 24.0                   FALSE
Indiana           7.2     113       65 21.0                    TRUE
Iowa              2.2      56       57 11.3                    TRUE
Kansas            6.0     115       66 18.0                    TRUE
Kentucky          9.7     109       52 16.3                    TRUE
Louisiana        15.4     249       66 22.2                   FALSE
Maine             2.1      83       51  7.8                    TRUE
Maryland         11.3     300       67 27.8                   FALSE
Massachusetts     4.4     149       85 16.3                    TRUE
Michigan         12.1     255       74 35.1                   FALSE
Minnesota         2.7      72       66 14.9                    TRUE
Mississippi      16.1     259       44 17.1                   FALSE
Missouri          9.0     178       70 28.2                   FALSE
Montana           6.0     109       53 16.4                    TRUE
Nebraska          4.3     102       62 16.5                    TRUE
Nevada           12.2     252       81 46.0                   FALSE
New Hampshire     2.1      57       56  9.5                    TRUE
New Jersey        7.4     159       89 18.8                    TRUE
New Mexico       11.4     285       70 32.1                   FALSE
New York         11.1     254       86 26.1                   FALSE
North Carolina   13.0     337       45 16.1                   FALSE
North Dakota      0.8      45       44  7.3                    TRUE
Ohio              7.3     120       75 21.4                    TRUE
Oklahoma          6.6     151       68 20.0                    TRUE
Oregon            4.9     159       67 29.3                    TRUE
Pennsylvania      6.3     106       72 14.9                    TRUE
Rhode Island      3.4     174       87  8.3                   FALSE
South Carolina   14.4     279       48 22.5                   FALSE
South Dakota      3.8      86       45 12.8                    TRUE
Tennessee        13.2     188       59 26.9                   FALSE
Texas            12.7     201       80 25.5                   FALSE
Utah              3.2     120       80 22.9                    TRUE
Vermont           2.2      48       32 11.2                    TRUE
Virginia          8.5     156       63 20.7                    TRUE
Washington        4.0     145       73 26.2                    TRUE
West Virginia     5.7      81       39  9.3                    TRUE
Wisconsin         2.6      53       66 10.8                    TRUE
Wyoming           6.8     161       60 15.6                    TRUE

TRUE and False make terrible variable levels, however. Using ifelse(), we will specify conditions for categorization. For example, make rows that are below the mean take on a value, else/otherwise, make the rows a different value.

Example:

ifelse(Assault < Mean_Assault, "Below", "Equal or Above"))

If the Assault value is < Mean_Assault, assign the row a value of "Below", else, assign "Equal or Above". Let’s use ifelse() to create a name-value pair

Relative_Assult = ifelse(Assault < Mean_Assault, "Below", "Equal or Above"))

USArrests |>
  mutate(Mean_Assault = mean(Assault), 
         Relative_Assult = ifelse(Assault < Mean_Assault, "Below", "Equal or Above")
         )

               Murder Assault UrbanPop Rape Mean_Assault Relative_Assult
Alabama          13.2     236       58 21.2       170.76  Equal or Above
Alaska           10.0     263       48 44.5       170.76  Equal or Above
Arizona           8.1     294       80 31.0       170.76  Equal or Above
Arkansas          8.8     190       50 19.5       170.76  Equal or Above
California        9.0     276       91 40.6       170.76  Equal or Above
Colorado          7.9     204       78 38.7       170.76  Equal or Above
Connecticut       3.3     110       77 11.1       170.76           Below
Delaware          5.9     238       72 15.8       170.76  Equal or Above
Florida          15.4     335       80 31.9       170.76  Equal or Above
Georgia          17.4     211       60 25.8       170.76  Equal or Above
Hawaii            5.3      46       83 20.2       170.76           Below
Idaho             2.6     120       54 14.2       170.76           Below
Illinois         10.4     249       83 24.0       170.76  Equal or Above
Indiana           7.2     113       65 21.0       170.76           Below
Iowa              2.2      56       57 11.3       170.76           Below
Kansas            6.0     115       66 18.0       170.76           Below
Kentucky          9.7     109       52 16.3       170.76           Below
Louisiana        15.4     249       66 22.2       170.76  Equal or Above
Maine             2.1      83       51  7.8       170.76           Below
Maryland         11.3     300       67 27.8       170.76  Equal or Above
Massachusetts     4.4     149       85 16.3       170.76           Below
Michigan         12.1     255       74 35.1       170.76  Equal or Above
Minnesota         2.7      72       66 14.9       170.76           Below
Mississippi      16.1     259       44 17.1       170.76  Equal or Above
Missouri          9.0     178       70 28.2       170.76  Equal or Above
Montana           6.0     109       53 16.4       170.76           Below
Nebraska          4.3     102       62 16.5       170.76           Below
Nevada           12.2     252       81 46.0       170.76  Equal or Above
New Hampshire     2.1      57       56  9.5       170.76           Below
New Jersey        7.4     159       89 18.8       170.76           Below
New Mexico       11.4     285       70 32.1       170.76  Equal or Above
New York         11.1     254       86 26.1       170.76  Equal or Above
North Carolina   13.0     337       45 16.1       170.76  Equal or Above
North Dakota      0.8      45       44  7.3       170.76           Below
Ohio              7.3     120       75 21.4       170.76           Below
Oklahoma          6.6     151       68 20.0       170.76           Below
Oregon            4.9     159       67 29.3       170.76           Below
Pennsylvania      6.3     106       72 14.9       170.76           Below
Rhode Island      3.4     174       87  8.3       170.76  Equal or Above
South Carolina   14.4     279       48 22.5       170.76  Equal or Above
South Dakota      3.8      86       45 12.8       170.76           Below
Tennessee        13.2     188       59 26.9       170.76  Equal or Above
Texas            12.7     201       80 25.5       170.76  Equal or Above
Utah              3.2     120       80 22.9       170.76           Below
Vermont           2.2      48       32 11.2       170.76           Below
Virginia          8.5     156       63 20.7       170.76           Below
Washington        4.0     145       73 26.2       170.76           Below
West Virginia     5.7      81       39  9.3       170.76           Below
Wisconsin         2.6      53       66 10.8       170.76           Below
Wyoming           6.8     161       60 15.6       170.76           Below

Bu you could also combine the steps together:

USArrests |>
  mutate(Relative_Assult = ifelse(Assault < mean(Assault), "Below", "Equal or Above"))

               Murder Assault UrbanPop Rape Relative_Assult
Alabama          13.2     236       58 21.2  Equal or Above
Alaska           10.0     263       48 44.5  Equal or Above
Arizona           8.1     294       80 31.0  Equal or Above
Arkansas          8.8     190       50 19.5  Equal or Above
California        9.0     276       91 40.6  Equal or Above
Colorado          7.9     204       78 38.7  Equal or Above
Connecticut       3.3     110       77 11.1           Below
Delaware          5.9     238       72 15.8  Equal or Above
Florida          15.4     335       80 31.9  Equal or Above
Georgia          17.4     211       60 25.8  Equal or Above
Hawaii            5.3      46       83 20.2           Below
Idaho             2.6     120       54 14.2           Below
Illinois         10.4     249       83 24.0  Equal or Above
Indiana           7.2     113       65 21.0           Below
Iowa              2.2      56       57 11.3           Below
Kansas            6.0     115       66 18.0           Below
Kentucky          9.7     109       52 16.3           Below
Louisiana        15.4     249       66 22.2  Equal or Above
Maine             2.1      83       51  7.8           Below
Maryland         11.3     300       67 27.8  Equal or Above
Massachusetts     4.4     149       85 16.3           Below
Michigan         12.1     255       74 35.1  Equal or Above
Minnesota         2.7      72       66 14.9           Below
Mississippi      16.1     259       44 17.1  Equal or Above
Missouri          9.0     178       70 28.2  Equal or Above
Montana           6.0     109       53 16.4           Below
Nebraska          4.3     102       62 16.5           Below
Nevada           12.2     252       81 46.0  Equal or Above
New Hampshire     2.1      57       56  9.5           Below
New Jersey        7.4     159       89 18.8           Below
New Mexico       11.4     285       70 32.1  Equal or Above
New York         11.1     254       86 26.1  Equal or Above
North Carolina   13.0     337       45 16.1  Equal or Above
North Dakota      0.8      45       44  7.3           Below
Ohio              7.3     120       75 21.4           Below
Oklahoma          6.6     151       68 20.0           Below
Oregon            4.9     159       67 29.3           Below
Pennsylvania      6.3     106       72 14.9           Below
Rhode Island      3.4     174       87  8.3  Equal or Above
South Carolina   14.4     279       48 22.5  Equal or Above
South Dakota      3.8      86       45 12.8           Below
Tennessee        13.2     188       59 26.9  Equal or Above
Texas            12.7     201       80 25.5  Equal or Above
Utah              3.2     120       80 22.9           Below
Vermont           2.2      48       32 11.2           Below
Virginia          8.5     156       63 20.7           Below
Washington        4.0     145       73 26.2           Below
West Virginia     5.7      81       39  9.3           Below
Wisconsin         2.6      53       66 10.8           Below
Wyoming           6.8     161       60 15.6           Below

Filtering Rows

Data files are messy and as a result require cleaning. You will have missing rows, incorrect variable names, files with columns named the same, NAs, strings for numbers, duplicate rows of data, people who completed a survey twice, and all sorts of unimaginable and unbelievable data problems. So cleaning is important.

Whereas selecting is the processes of retaining variable columns, filtering is the process of retaining rows from a data set. For example, if you want your data frame to contain only individuals who are 20 or more years old, you would filter to retain them and remove others. If you want to explore data for only individuals from Kansas, you would filter rows to retain only them and remove all others.

Data frame manipulation may involve keeping only certain rows for data, for example, male or female respondents, male respondents, those who do not contain missing values (e.g., NAs) for a specific column variable, who are of a certain age (or born in in certain year), who are above (or below) some acceptable criterion, etc.

When a column variable has more than one value (e.g., check using unique() to determine the unique elements contained), you may wish to filter on some but not others.

You may even need to filter rows in a data frame that are distinct (e.g., not duplicate responses). This is often a good first step in order to determine the size of the usable data set. dplyr::distinct() makes de-duplicating easy as this function will return only distinct rows.

Understanding Filtering Operators

We will filter data using filter() from {dplyr}

filter(.data, 
       ...
       )

Parameters/Arguments:

.data: a data frame
...: expressions that return a logical value (TRUE or FALSE)

NOTE:: Filtering cases using the dplyr::filter() verbs works by removing rows that do not match a specific criterion and then by returning the data frame that omits the mismatched condition. It keeps only rows that are TRUE (match) your specification.

Some useful filtering operators and functions include: ==, >, >=, &, |, !, xor(), c(), is.na(), between(), near().

Filtering Using `filter()`

We will work with different data frames because the the nature of their contents.

What does the data frame look like again?

DAT

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes

Row/Observations/Cases can be filtered to “include” only certain matched conditions or can be filtered to “exclude” by negating those matched conditions. If the column variable Sex is in the data frame and cases are 'male', 'men', 'female', 'women', 'neither', NA, etc., you can specify the column Sex variable and then the row matching condition(s).

Filtering Matches to a Specific Character Value

The first argument in dplyr::filter() is a data frame, and the function all dplyr::filter(DAT, Sex == 'female') will filter the data frame named DAT to include rows for which the sex column equals 'female'. In other words, TRUE rows.

dplyr::filter(DAT, 
              Sex == 'female'
              )

   Id    Sex Age Renting
1 101 female  33    <NA>

Similarly, the function call dplyr::filter(., Sex == 'male') can be read “filter the data frame to include rows for which the value of Sex == 'male' is TRUE”.

More flexibly, however, you could specify a vector containing acceptable strings using c(). dplyr::filter(., Sex %in% c('male')) filters the rows to include only those for which the value for sex is in the string vector which includes a single string,'male' whereas dplyr::filter(., Sex %in% c('male', 'Man')) filters the rows to include only those for which the value for Sex is in the string vector which includes 'male' and 'Man'. Cases containing 'Male', 'Men' (R is a case-sensitive language), or 'female', for example, will not be included in the returned data frame because they do not match values in the string vector.

Filtering by a Relative Value

USArrests |>
  filter(Murder > 10)

               Murder Assault UrbanPop Rape
Alabama          13.2     236       58 21.2
Florida          15.4     335       80 31.9
Georgia          17.4     211       60 25.8
Illinois         10.4     249       83 24.0
Louisiana        15.4     249       66 22.2
Maryland         11.3     300       67 27.8
Michigan         12.1     255       74 35.1
Mississippi      16.1     259       44 17.1
Nevada           12.2     252       81 46.0
New Mexico       11.4     285       70 32.1
New York         11.1     254       86 26.1
North Carolina   13.0     337       45 16.1
South Carolina   14.4     279       48 22.5
Tennessee        13.2     188       59 26.9
Texas            12.7     201       80 25.5

Those rows with Murder above 10 will be retained, others removed.

Filtering by a Specific Value

USArrests |>
  filter(Murder == 12.1)

         Murder Assault UrbanPop Rape
Michigan   12.1     255       74 35.1

Those rows with Murder == 12.1 will be retained.

Using a previous example with this data set, we can filter based on a character variable if we had one. Let’s add one.

USArrests |>
  mutate(Relative_Murder = ifelse(Murder < mean(Murder), "Below", "Equal or Above")) |>
  filter(Relative_Murder == "Below")

              Murder Assault UrbanPop Rape Relative_Murder
Connecticut      3.3     110       77 11.1           Below
Delaware         5.9     238       72 15.8           Below
Hawaii           5.3      46       83 20.2           Below
Idaho            2.6     120       54 14.2           Below
Indiana          7.2     113       65 21.0           Below
Iowa             2.2      56       57 11.3           Below
Kansas           6.0     115       66 18.0           Below
Maine            2.1      83       51  7.8           Below
Massachusetts    4.4     149       85 16.3           Below
Minnesota        2.7      72       66 14.9           Below
Montana          6.0     109       53 16.4           Below
Nebraska         4.3     102       62 16.5           Below
New Hampshire    2.1      57       56  9.5           Below
New Jersey       7.4     159       89 18.8           Below
North Dakota     0.8      45       44  7.3           Below
Ohio             7.3     120       75 21.4           Below
Oklahoma         6.6     151       68 20.0           Below
Oregon           4.9     159       67 29.3           Below
Pennsylvania     6.3     106       72 14.9           Below
Rhode Island     3.4     174       87  8.3           Below
South Dakota     3.8      86       45 12.8           Below
Utah             3.2     120       80 22.9           Below
Vermont          2.2      48       32 11.2           Below
Washington       4.0     145       73 26.2           Below
West Virginia    5.7      81       39  9.3           Below
Wisconsin        2.6      53       66 10.8           Below
Wyoming          6.8     161       60 15.6           Below

Filtering Using on a Function

USArrests |>
  filter(Murder > mean(Murder))

               Murder Assault UrbanPop Rape
Alabama          13.2     236       58 21.2
Alaska           10.0     263       48 44.5
Arizona           8.1     294       80 31.0
Arkansas          8.8     190       50 19.5
California        9.0     276       91 40.6
Colorado          7.9     204       78 38.7
Florida          15.4     335       80 31.9
Georgia          17.4     211       60 25.8
Illinois         10.4     249       83 24.0
Kentucky          9.7     109       52 16.3
Louisiana        15.4     249       66 22.2
Maryland         11.3     300       67 27.8
Michigan         12.1     255       74 35.1
Mississippi      16.1     259       44 17.1
Missouri          9.0     178       70 28.2
Nevada           12.2     252       81 46.0
New Mexico       11.4     285       70 32.1
New York         11.1     254       86 26.1
North Carolina   13.0     337       45 16.1
South Carolina   14.4     279       48 22.5
Tennessee        13.2     188       59 26.9
Texas            12.7     201       80 25.5
Virginia          8.5     156       63 20.7

Filtering Missing Data (`NAs`)

is.na() will return a logical vector for which TRUE represents there are missing values.

Try on the entire data frame…

is.na(DAT)

        Id   Sex   Age Renting
[1,] FALSE FALSE FALSE   FALSE
[2,] FALSE FALSE FALSE    TRUE
[3,] FALSE FALSE FALSE   FALSE
[4,] FALSE  TRUE FALSE    TRUE
[5,] FALSE FALSE FALSE   FALSE
[6,] FALSE FALSE FALSE   FALSE
[7,] FALSE FALSE FALSE   FALSE

You can see that some columns contain cases/rows with TRUE indicating the cell contains NA.

The negation operator, !, will be used to illustrate some filtering approaches. Because filter() will filter out FALSE cases and retain TRUE ones, so you may sometimes need to negate a function so that you keep the rows you want to keep.

na.omit(): removes rows with NAs
filter(is.na(column_name)): keep rows with NA in specific variable
filter(!is.na(column_name)): remove rows with NA in specific variable
filter(complete.cases(.)): remove rows with NAs

Filter using `na.omit()`

DAT |>
  na.omit()           # omit any rows with NAs

   Id     Sex Age Renting
1 100    male  25     yes
3 102    Male  27     yes
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes

Filter using `is.na()` and `!is.na()`

DAT |>
  filter(is.na(Sex))       # keep NAs by variable

   Id  Sex Age Renting
1 103 <NA>  40    <NA>

But your goal may likely be to keep everything that is not NA:

DAT |>
  filter(!is.na(Sex))      # remove NAs by variable

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 104     man  44      no
5 100    male  25     yes
6 105 neither  40     yes

And filter step-by-step for each variable using |> and separate function calls:

DAT |>
  filter(!is.na(Sex)) |>
  filter(!is.na(Renting))

   Id     Sex Age Renting
1 100    male  25     yes
2 102    Male  27     yes
3 104     man  44      no
4 100    male  25     yes
5 105 neither  40     yes

So why use separate lines of code if you can use & all in one line? One reason is that separate function calls written as separate lines of code make code inclusion/exclusion extremely easy.

Comment out what you don’t want using #:

DAT |>
  #filter(., !is.na(Sex)) |>
  filter(!is.na(Renting))

   Id     Sex Age Renting
1 100    male  25     yes
2 102    Male  27     yes
3 104     man  44      no
4 100    male  25     yes
5 105 neither  40     yes

Filter using `complete.cases()`

The complete.cases() function returns a logical vector for which TRUE reflects the row has complete information and no missing cases. Using complete.cases() along with filter(), you would retain all rows TRUE rows.

DAT %>%
  dplyr::filter(complete.cases(.))

   Id     Sex Age Renting
1 100    male  25     yes
2 102    Male  27     yes
3 104     man  44      no
4 100    male  25     yes
5 105 neither  40     yes

Removing duplicate rows using `distinct()`

We will examine some other filtering methods that do not use filter() specifically but that which perform filtering operations.

dplyr::distinct(): remove duplicate rows
dplyr::distinct(., column): remove duplicate rows by specific column
na.omit(): remove any row with NA’s (missing values)

Let’s use the simple DAT data frame.

DAT

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes

Remember that data frames are composed of rows and columns. As discussed previously we can subset in base R to illustrate a point.

Row row 1, all columns:

DAT[1,]

   Id  Sex Age Renting
1 100 male  25     yes

Row row 6, all columns:

DAT[6,]

   Id  Sex Age Renting
6 100 male  25     yes

Notice that rows 1 and 6 are the same person (e.g., Id) and have exactly the same data for all variables. If we compare the two logically, we will confirm that all is TRUE.

DAT[1,] == DAT[6,]

    Id  Sex  Age Renting
1 TRUE TRUE TRUE    TRUE

You clearly do not want duplicates of data. So let’s just remove any rows that are identical using distinct()

Removing Duplicates Across All Variables

DAT |>
  distinct()       # Remove exact duplicates across all columns

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 105 neither  40     yes

If you know each row is unique based on a variable in the data frame, you can also use distinct() to remove duplicates for a specific variable. Make sure that this variable specification is actually one that you would not want duplicates of.

Removing Duplicates Based on a Variable

DAT |>           
  distinct(Id)     #  Remove duplicates by variable; passes unique values for data frame

But this function simply returns the unique values in Id. To retain the variables, set .keep_all = T. If you want to remove duplicates and assign the cleaned data frame to an object, you would likely want to keep all of your variables.

DAT |>             
  distinct(Id, 
           .keep_all = T
           ) #

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 105 neither  40     yes

Notice, however, this only removed the last instance or Id == 100. Which row to include is a judgment call. The first, the last, neither, the average? Is there a correct answer?

The Data Manipulation Workflow: Putting It All Together

Of course, all of these functions can be paired with select(), mutate(), filter(), etc. Here is the data manipulation workflow.

dataframe |>
    select(...) |>     # select variables of interest
    mutate(...) |>     # then create new variables
    filter(...) |>     # then filter by rows
    group_by(...) |>   # then group for subsetting
    summarize(...)     # then summarize

Many More Use-Case Examples (Optional)

Filter Cases using `==`

Filter rows for which the Sex variable is equal to the string 'female':

DAT |>
  dplyr::filter(Sex == 'female')

   Id    Sex Age Renting
1 101 female  33    <NA>

Filter rows for which the Sex variable is not equal to the string 'female':

DAT |>
  dplyr::filter(Sex != 'female')

   Id     Sex Age Renting
1 100    male  25     yes
2 102    Male  27     yes
3 104     man  44      no
4 100    male  25     yes
5 105 neither  40     yes

Filter rows for which the Sex variable is equal to the string 'female' AND Age is greater than the numeric 27:

DAT |>
  dplyr::filter(Sex == 'female' & Age > 27) # this "AND" that

   Id    Sex Age Renting
1 101 female  33    <NA>

Filter rows for which the Sex variable is equal to the string 'female' OR Age is greater than the numeric 27:

DAT |>
  dplyr::filter(Sex == 'female' | Age > 27) # this "OR" that

   Id     Sex Age Renting
1 101  female  33    <NA>
2 103    <NA>  40    <NA>
3 104     man  44      no
4 105 neither  40     yes

A cleaner method involves separate lines of code. Although cleaner, this will not allow the “OR” option because the data frame that is returned from the first filter() is passed to the second filter() and all cases other than "female" have already been removed from the data frame.

DAT |>
  dplyr::filter(Sex == 'female') |>    # keep female (and add another pipe)
  dplyr::filter(Age >= 27)             # keep only those equal to or older than 27

   Id    Sex Age Renting
1 101 female  33    <NA>

Filtering Cases by Value

Filter by `<` and `>` or `<=` or `>=`

Keep those less than:

DAT |> 
  dplyr::filter(Age < 40)

   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 102   Male  27     yes
4 100   male  25     yes

Keep older than:

DAT |> 
  dplyr::filter(Age > 40)

   Id Sex Age Renting
1 104 man  44      no

Keep equal to or older than:

DAT |> 
  dplyr::filter(Age >= 40)

   Id     Sex Age Renting
1 103    <NA>  40    <NA>
2 104     man  44      no
3 105 neither  40     yes

Filter Cases by Conditional X or Y Using `|` Operator…

Using the “OR” operator, |, cases can be included if “this” OR “that” condition.

Filter numbers:

DAT |>
  dplyr::filter(Age == 25 | Age == 40)    # filter out numeric values IN a range

   Id     Sex Age Renting
1 100    male  25     yes
2 103    <NA>  40    <NA>
3 100    male  25     yes
4 105 neither  40     yes

Filter characters:

DAT |>
  dplyr::filter(Sex == 'male' | Sex == 'female')

   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 100   male  25     yes

Although dplyr::filter(sex %in% c('male', 'female')) would be easier.

Filter rows of variables of both types:

DAT |>
  dplyr::filter(Sex == 'male' | Age == 27)

   Id  Sex Age Renting
1 100 male  25     yes
2 102 Male  27     yes
3 100 male  25     yes

Filter Cases Between Values with `between()`

Between ages 25 and 33:

DAT |>
  dplyr::filter(between(Age, 27, 33))

   Id    Sex Age Renting
1 101 female  33    <NA>
2 102   Male  27     yes

Filter by range using the `%in%` operator (this is IN meaning in)

Though less flexible than using between(), %in% may be easier to remember:

DAT |>
  dplyr::filter(Age %in% 20:43)    # filter out numeric values IN a range

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 100    male  25     yes
6 105 neither  40     yes

One’s age is in the range from 20 through 43.

If a vector object is already defined (e.g., my_levels = c('male', 'female')), you can use that for filtering also. Such approaches are useful when data manipulation involves reusing a reference as it simplifies coding and reduces errors because the specification is defined only once.

my_levels = c('male', 'female')

DAT |>
  dplyr::filter(Sex %in% my_levels)

   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 100   male  25     yes

Filter by exclusion

When inclusion of variables is inappropriate, exclusion of them may be useful. The ! operator means “NOT” in R so you can use that to accomplish the opposite of the statement. For example, dplyr::filter(., !sex %in% c('male', NA)) will “filter the data frame to include rows in the sex column for which the value is NOT in the vector”.

Exclude rows in the Sex variable that are NA or 'male':

DAT |>
  dplyr::filter(!Sex %in% c('male', NA))  # keep only if NOT in vector

   Id     Sex Age Renting
1 101  female  33    <NA>
2 102    Male  27     yes
3 104     man  44      no
4 105 neither  40     yes

Exclude rows in the Sex variable that are Men or 'male':

DAT |>
  dplyr::filter(!Sex %in% c('male', 'Men'))  # keep only if NOT in vector

   Id     Sex Age Renting
1 101  female  33    <NA>
2 102    Male  27     yes
3 103    <NA>  40    <NA>
4 104     man  44      no
5 105 neither  40     yes

Filter by conditional X and Y using `&` operator…

By range:

DAT |>
  dplyr::filter(Id >= 102 & Age <= 43)

   Id     Sex Age Renting
1 102    Male  27     yes
2 103    <NA>  40    <NA>
3 105 neither  40     yes

DAT |>
  dplyr::filter(Age >= 20 & Age <= 43)

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 100    male  25     yes
6 105 neither  40     yes

Note: Age 20:43 won’t work. Can you figure out why?

Filter Cases Containing Characters Using `stringr::str_detect()` and `%in%`:

If you want to filter cases that contain certain characters, unfortunately, you cannot use contains() as you would for variable names. For matching characters in levels of variables, you’ll need something like stringr::str_detect() or grepl().

stringr::str_detect(): returns matching conditions

The example below uses str_detect() to detect the presence of a character match and returns a logic vector with TRUEs for matching cases. When paired with filter(), the data frame is filtered by to contains cases that match the pattern for the variable.

Example: stringr::str_detect(my_variable, "pattern"):

Let’s look for levels of Sex for which "ma" is detected.

DAT |> 
  dplyr::filter(stringr::str_detect(Sex, "ma"))

   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 104    man  44      no
4 100   male  25     yes

But the case for which Sex = Male is now missing. This is because stringr::str_detect() is a case-sensitive pattern match.

You can fix this is a couple ways:

make the cases in Sex all lower case to mutate() the fix or
to wrap Sex in tolower() to make cases lowercase.

The first option might be better if you want to fix the problem in the data frame.

Other casing functions are:

tolower(): returns lower case of string; see also stringr::str_to_lower()
toupper(): returns upper case of string; see also stringr::str_to_upper()
tools::toTitleCase(): returns Title Case (capitalize first letter of string)

DAT |>
  filter(stringr::str_detect(tolower(Sex), "ma"))

   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 102   Male  27     yes
4 104    man  44      no
5 100   male  25     yes

For some, reading this code may be difficult. In order to understand the code, you need to read it inside out like slices of an onion. First understand the first function as the returned value is passed to the next function and so forth.

For example, elements of the Sex vector are converted to lower case and returned using tolower(). We can demonstrate this using only the vector from the data frame using the $ operator, DAT$Sex.

The initial vector:

DAT$Sex

[1] "male"    "female"  "Male"    NA        "man"     "male"    "neither"

To lowercase:

tolower(DAT$Sex)

[1] "male"    "female"  "male"    NA        "man"     "male"    "neither"

Next, stringr::str_detect() will return a vector of TRUE or FALSE for all element positions of DAT$Sex for which the (now lowercase) string matches "ma":

stringr::str_detect(tolower(DAT$Sex), "ma")

[1]  TRUE  TRUE  TRUE    NA  TRUE  TRUE FALSE

Another approach would be to perform the steps in pieces by first mutating the data frame so that elements in the Sex vector are converted to lowercase and then pipe that mutated data from to filter().

Briefly, we will introduce dplyr::mutate() to accomplish this goal. mutate() is used to add, update, etc. variables in a data frame or tibble.

Mutate Sex using mutate():

DAT |>
  mutate(Sex = tolower(Sex))

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes

Now mutate and pipe that mutated data frame using %>% (from {margittr}) or |> (base R version 4.1+) to filter():

DAT |>
  mutate(Sex = tolower(Sex)) |>
  filter(stringr::str_detect(Sex, "ma"))

   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 102   male  27     yes
4 104    man  44      no
5 100   male  25     yes

Pairing this piping approach with toTitleCase() may be better as only the first letter will be capitalized and you are looking for “Ma”.

DAT |>
  mutate(Sex = tools::toTitleCase(Sex)) |>
  filter(stringr::str_detect(Sex, "Ma"))

   Id  Sex Age Renting
1 100 Male  25     yes
2 102 Male  27     yes
3 104  Man  44      no
4 100 Male  25     yes

But notice the male and man issue is still a problem.

So what are the unique elements in DAT$Sex? Your friend here is unique(). And maybe you are too.

unique(DAT$Sex)

[1] "male"    "female"  "Male"    NA        "man"     "neither"

You can hard code a fix using %in% and case_when(). case_when() is {dplyr}’s solution for multiple-case conditional.

DAT |>
  mutate(Sex = tolower(Sex), # convert to lowercase first
         Sex = case_when(    # then recode cases when matched
           Sex %in% c("male", "man") ~ "Male",
           Sex %in% c("female", "woman") ~ "Female"
         ))

   Id    Sex Age Renting
1 100   Male  25     yes
2 101 Female  33    <NA>
3 102   Male  27     yes
4 103   <NA>  40    <NA>
5 104   Male  44      no
6 100   Male  25     yes
7 105   <NA>  40     yes

For a more flexible fix, you can use str_detect():

BUT beware that the order of operations matters with this approach. Because the string “female” contains characters “ma”, you could accidentally recode all “female” cases to “male” if you perform the case_when() conversion on “male” first.

DAT |>
  mutate(Sex = tolower(Sex),   # convert first
         Sex = case_when(      # then recode
           stringr::str_detect(Sex, "fe") ~ "Female",
           stringr::str_detect(Sex, "ma") ~ "Male",
         ))

   Id    Sex Age Renting
1 100   Male  25     yes
2 101 Female  33    <NA>
3 102   Male  27     yes
4 103   <NA>  40    <NA>
5 104   Male  44      no
6 100   Male  25     yes
7 105   <NA>  40     yes

Note that conditions in not declared in case_when() will be recoded as NA, which is what happens for "neither". Make sure to include all ways you wish you recode cases.

Session Information

sessionInfo()

R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] magrittr_2.0.3   tidyselect_1.2.0 stringr_1.5.1    dplyr_1.1.4     

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5         cli_3.6.1           knitr_1.45         
 [4] rlang_1.1.1         xfun_0.40           stringi_1.7.12     
 [7] renv_1.0.3          generics_0.1.3      jsonlite_1.8.7     
[10] glue_1.6.2          rprojroot_2.0.3     htmltools_0.5.7    
[13] fansi_1.0.4         rmarkdown_2.25      tibble_3.2.1       
[16] evaluate_0.21       fastmap_1.1.1       yaml_2.3.7         
[19] lifecycle_1.0.3     BiocManager_1.30.22 compiler_4.3.2     
[22] pkgconfig_2.0.3     htmlwidgets_1.6.4   here_1.0.1         
[25] rstudioapi_0.15.0   R.oo_1.25.0         R.utils_2.12.2     
[28] digest_0.6.33       R6_2.5.1            utf8_1.2.4         
[31] pillar_1.9.0        R.methodsS3_1.8.2   withr_2.5.0        
[34] tools_4.3.2

Overview

Readings and Preparation

Supplementary Readings (Optional)

Libraries

External Functions

Loading Libraries

A Grammar for Data Wrangling

Some Common Ways for Selecting Variables Using {dplyr}

Selecting Variables using dplyr::select()

Selecting Variables by Column Position

Selecting Variables by Column Name

Selecting Variables as Unquoted Expressions

Selecting Variables as Characters/Strings

Selecting Variables as a Character Vector

Selecting Using an External Vector Object

Selecting Variables Starting with or Ending with Certain Characters

Selecting Variables Using starts_with()

Understanding starts_with()

Selecting Variables from a Data Frame using starts_with()

Selecting Variables using ends_with()

Selecting and Selecting Out Variables By/Between Index

Selecting and Selecting Out Variables By or Between Character Name

Selecting and Selecting Out Variables Characters in Their Names

Selecting using starts_with()

Selecting using ends_with()

Selecting using contains()

Selecting using regular expressions

Selecting variables of all patterns using ".*"

Selecting variables with digits using "\\d"

Selecting variables that start and contain characters "v.*\\d"

Selecting variables that end with a digit using "\\d$"

Selecting variables by negation

Selecting and Selecting Out Variables by Type

Creating/Mutating Variables

Creating/Mutating A Constant Value

Creating/Mutating A Constant Character

Creating/Mutating A Logical Object

Creating/Mutating Based on Function

Filtering Rows

Understanding Filtering Operators

Filtering Using filter()

Filtering Matches to a Specific Character Value

Filtering by a Relative Value

Filtering by a Specific Value

Filtering Using on a Function

Filtering Missing Data (NAs)

Filter using na.omit()

Filter using is.na() and !is.na()

Filter using complete.cases()

Removing duplicate rows using distinct()

Removing Duplicates Across All Variables

Removing Duplicates Based on a Variable

The Data Manipulation Workflow: Putting It All Together

Many More Use-Case Examples (Optional)

Filter Cases using ==

Filtering Cases by Value

Filter by < and > or <= or >=

Filter Cases by Conditional X or Y Using | Operator…

Filter Cases Between Values with between()

Filter by range using the %in% operator (this is IN meaning in)

Filter by exclusion

Filter by conditional X and Y using & operator…

Filter Cases Containing Characters Using stringr::str_detect() and %in%:

Session Information

Some Common Ways for Selecting Variables Using `{dplyr}`

Selecting Variables using `dplyr::select()`

Selecting Variables Using `starts_with()`

Understanding `starts_with()`

Selecting Variables from a Data Frame using `starts_with()`

Selecting Variables using `ends_with()`

Selecting using `starts_with()`

Selecting using `ends_with()`

Selecting using `contains()`

**Selecting variables of all patterns using `".*"`**

Selecting variables with digits using `"\\d"`

**Selecting variables that start and contain characters `"v.*\\d"`**

Selecting variables that end with a digit using `"\\d$"`

Filtering Using `filter()`

Filtering Missing Data (`NAs`)

Filter using `na.omit()`

Filter using `is.na()` and `!is.na()`

Filter using `complete.cases()`

Removing duplicate rows using `distinct()`

Filter Cases using `==`

Filter by `<` and `>` or `<=` or `>=`

Filter Cases by Conditional X or Y Using `|` Operator…

Filter Cases Between Values with `between()`

Filter by range using the `%in%` operator (this is IN meaning in)

Filter by conditional X and Y using `&` operator…

Filter Cases Containing Characters Using `stringr::str_detect()` and `%in%`: