::sourceDirectory(here::here("r", "functions")) R.utils
Selecting, Filtering, & Mutating
This page is a work in progress and may contain areas that need more detail or that required syntactical, grammatical, and typographical changes. If you find some part requiring some editing, please let me know so I can fix it for you.
Overview
In this module, we will start manipulating data frames. Data frames are composed of column variables and row observations/cases and we will address how to perform operations on both. We will learn how to use {dplyr}, rather than base R
to select, remove, add, and modify vectors in data frames. In addition, we will address how to filter and arrange rows in various ways. The focus will be on cleaning or wrangling data. Helper functions from libraries like {tidyr}, {stringr}, and {tidyselect} are also used in conjunction with {dplyr}.
Readings and Preparation
Before Class: First, read to familiarize yourself with the concepts rather than master them. I will assume that you attend class with some level of basic understanding of concepts and working of functions. The goal of reading should be to understand and implement code functions as well as support your understanding and help your troubleshooting of problems. This cannot happen if you just read the content without interacting with it, however reading is absolutely essential to being successful during class time. Work through some examples so that you have a good idea of your level of understanding and confidence.
Class: In class, some functions and concepts will be introduced and we will practice implementing code through exercises.
Supplementary Readings (Optional)
Libraries
- {here}: 1.0.1: for path management
- {dplyr} 1.1.4: for selecting, filtering, and mutating
- {stringr}: 1.5.1: for working with strings
- {tidyselect}: 1.2.0: for selecting sets from strings
- {magrittr}: 2.0.3: for piping with
%>%
(plus other piping operators)
External Functions
view_html()
: for viewing data frames in HTML
format. You can source this from your /r/functions
directory.
If for some reason, you do not have the function saved, you can get it from the remote source.
source("https://raw.githubusercontent.com/slicesofdata/fods24/main/r/functions/view_html.R")
Loading Libraries
We will work with a few different libraries for data manipulation. Let’s load them into our work space using library()
.
library(dplyr)
library(stringr)
library(tidyselect)
library(magrittr)
Make note of any warnings that appear when loading libraries. There are some libraries that contain functions with the same names. Be aware that the most recently loaded library function will take precedence. You can avoid confusion using ::
to call a function from a particular library (e.g., libraryname::functionname()
).
A Grammar for Data Wrangling
The {dplyr} package presents a type of grammar for wrangling data (Wickham and Francois 2014). {dplyr} is part of the {tidyverse} ecosystem which loads using library(tidyverse)
, or specifically using library(dplyr)
. {dplyr} also imports functions from {tidyselect}, which is also part of the {tidyverse} ecosystem.
The functions from any R
package can be called after loading the package library. One does not need to reference the package explicitly when calling a function but to avoid confusion, we will often reference the package/library and the function together (e.g., package::function()
).
Working with data involves creating new column variables, removing or renaming column variables, removing or deleting row observations, sorting, and summarizing data (often by groups/factors). etc. Consequently, there are five main function verbs for working with data in a data frame: select
, filter
, mutate
, arange
, and summarize
. Their corresponding functions are:
select(dataframe, variables_to_select)
: select columns to retainmutate(dataframe, variables_to_create)
anddplyr::rename()
: add or modify existing columnsfilter(dataframe, rows_to_select)
: subset by rowsarrange(dataframe, variable_to_sort_by)
: sort rowssummarize(dataframe, procedures_for_summarizing)
in conjunction withdplyr::group_by()
: aggregate the data in some way
Some Common Ways for Selecting Variables Using {dplyr}
You may often wish to subset data frames containing multiple variables to make them more lean and manageable for carrying out a specific goal. For such tasks, select()
will be used.
Selecting Variables using dplyr::select()
You can reference variables by their names or by their column position in a data frame (e.g., 1, 25, etc.).
Using select()
, you can select columns/variables from a data frame. The variables you select are retained and variables omitted that you don’t select are omitted in the returned data frame.
Before using a function you have never used before, you would always review the documentation for what the function is designed to accomplish.
help(dplyr::select)
Or:
::select ?dplyr
You can see that select()
is used for selecting variables from a data frame and there are many selection features for accomplishing such tasks.
Parameters/Arguments:
.data
: a data frame...
: one ore more expressions separated by commas
We will definitely need a data frame and we will need variables to select.
Let’s start with a data frame. The USAressts
data is built into R
so let’s just use that for now. Let’s use select()
to subset a data frame in two ways, first without piping and then with piping (preferred method).
- Without piping
At a bare minimum, we will need to pass a data frame to select()
at very least. But what happens if we don’t pass variables to select?
select(.data = USArrests)
data frame with 0 columns and 50 rows
You will see the that the returned data frame has 0 columns and 50 rows.
- Piping using
|>
If you pipe the data frame into the function, .data = USArrests
is inherited and you do not need to declare it within select()
.
|>
USArrests select()
data frame with 0 columns and 50 rows
Again, the returned data frame has 0 columns and 50 rows. If you use select()
without specifying any variables for the function to select, as seen in both examples, you will have data frame that contains no columns. Let’s address some methods for selecting variables.
Selecting Variables by Column Position
Variables are columns in a data frame, so they can be selected by their element position (previously covered here). If you wanted to select the first and third column, one way is to specify each position separated by a comma.
|> # take the data frame
USArrests select(1, 3) # select columns 1 and 3
Murder UrbanPop
Alabama 13.2 58
Alaska 10.0 48
Arizona 8.1 80
Arkansas 8.8 50
California 9.0 91
Colorado 7.9 78
Connecticut 3.3 77
Delaware 5.9 72
Florida 15.4 80
Georgia 17.4 60
Hawaii 5.3 83
Idaho 2.6 54
Illinois 10.4 83
Indiana 7.2 65
Iowa 2.2 57
Kansas 6.0 66
Kentucky 9.7 52
Louisiana 15.4 66
Maine 2.1 51
Maryland 11.3 67
Massachusetts 4.4 85
Michigan 12.1 74
Minnesota 2.7 66
Mississippi 16.1 44
Missouri 9.0 70
Montana 6.0 53
Nebraska 4.3 62
Nevada 12.2 81
New Hampshire 2.1 56
New Jersey 7.4 89
New Mexico 11.4 70
New York 11.1 86
North Carolina 13.0 45
North Dakota 0.8 44
Ohio 7.3 75
Oklahoma 6.6 68
Oregon 4.9 67
Pennsylvania 6.3 72
Rhode Island 3.4 87
South Carolina 14.4 48
South Dakota 3.8 45
Tennessee 13.2 59
Texas 12.7 80
Utah 3.2 80
Vermont 2.2 32
Virginia 8.5 63
Washington 4.0 73
West Virginia 5.7 39
Wisconsin 2.6 66
Wyoming 6.8 60
Let’s remind ourselves of how piping works. Remember that piping will modify the data frame in steps, with each pipe passing the current form of the data frame to the next function. Not that you ever want to write two lines of code and perform this extra step to select only column 1 but you could pipe the data frame to another select()
.
|> # take the data frame
USArrests select(1, 3) |> # select columns 1 and 3
select(1) # then select column 1 only
Murder
Alabama 13.2
Alaska 10.0
Arizona 8.1
Arkansas 8.8
California 9.0
Colorado 7.9
Connecticut 3.3
Delaware 5.9
Florida 15.4
Georgia 17.4
Hawaii 5.3
Idaho 2.6
Illinois 10.4
Indiana 7.2
Iowa 2.2
Kansas 6.0
Kentucky 9.7
Louisiana 15.4
Maine 2.1
Maryland 11.3
Massachusetts 4.4
Michigan 12.1
Minnesota 2.7
Mississippi 16.1
Missouri 9.0
Montana 6.0
Nebraska 4.3
Nevada 12.2
New Hampshire 2.1
New Jersey 7.4
New Mexico 11.4
New York 11.1
North Carolina 13.0
North Dakota 0.8
Ohio 7.3
Oklahoma 6.6
Oregon 4.9
Pennsylvania 6.3
Rhode Island 3.4
South Carolina 14.4
South Dakota 3.8
Tennessee 13.2
Texas 12.7
Utah 3.2
Vermont 2.2
Virginia 8.5
Washington 4.0
West Virginia 5.7
Wisconsin 2.6
Wyoming 6.8
The better approach is just to select column 1 in a single line.
|>
USArrests select(1)
Murder
Alabama 13.2
Alaska 10.0
Arizona 8.1
Arkansas 8.8
California 9.0
Colorado 7.9
Connecticut 3.3
Delaware 5.9
Florida 15.4
Georgia 17.4
Hawaii 5.3
Idaho 2.6
Illinois 10.4
Indiana 7.2
Iowa 2.2
Kansas 6.0
Kentucky 9.7
Louisiana 15.4
Maine 2.1
Maryland 11.3
Massachusetts 4.4
Michigan 12.1
Minnesota 2.7
Mississippi 16.1
Missouri 9.0
Montana 6.0
Nebraska 4.3
Nevada 12.2
New Hampshire 2.1
New Jersey 7.4
New Mexico 11.4
New York 11.1
North Carolina 13.0
North Dakota 0.8
Ohio 7.3
Oklahoma 6.6
Oregon 4.9
Pennsylvania 6.3
Rhode Island 3.4
South Carolina 14.4
South Dakota 3.8
Tennessee 13.2
Texas 12.7
Utah 3.2
Vermont 2.2
Virginia 8.5
Washington 4.0
West Virginia 5.7
Wisconsin 2.6
Wyoming 6.8
Selecting Variables by Column Name
Rather than position, you will likely wish to select by variable name. We will address several approaches for doing so. Each approach are benefical in different contexts.
Selecting Variables as Unquoted Expressions
Variables can be passed separately without quotes. We can pass two arguments, one with Murder
and the other with Assault
, separating each with a comma. Because of some special characteristics of the {dplyr}, we can use variable names without quotes. Let’s also reduce the returned data frame using head()
in order to save space.
|>
USArrests select(Murder, Assault) |>
head()
Murder Assault
Alabama 13.2 236
Alaska 10.0 263
Arizona 8.1 294
Arkansas 8.8 190
California 9.0 276
Colorado 7.9 204
Selecting Variables as Characters/Strings
You are most likely familiar with variable names as quoted strings. We can pass two arguments here using the variable names in quotes. This approach is more similar to a procedures use with in base R
functions as you cannot pass unquoted expressions in base R
.
|>
USArrests select("Murder", "Assault") |>
head()
Murder Assault
Alabama 13.2 236
Alaska 10.0 263
Arizona 8.1 294
Arkansas 8.8 190
California 9.0 276
Colorado 7.9 204
Selecting Variables as a Character Vector
What if your variables were in a vector, c("Murder", "Assault"))
?
|>
USArrests select(c("Murder", "Assault")) |>
head()
Murder Assault
Alabama 13.2 236
Alaska 10.0 263
Arizona 8.1 294
Arkansas 8.8 190
California 9.0 276
Colorado 7.9 204
The same will work if you passed a numeric vector c(1, 3)
.
Selecting Using an External Vector Object
In many use cases, you will will have a vector of variable names that would will want to pass rather than type individually. This approach also reduces confusion when you have key variables that you want to work with. Create a vector once and then use again and again. If you make any changes, you will only have to make them once, rather than in multiple places.
First, create the vector object:
<- c("Murder", "Assault") keep_vars
Second, Then pass the vector into the function:
|>
USArrests select(keep_vars)
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(keep_vars)
# Now:
data %>% select(all_of(keep_vars))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
Murder Assault
Alabama 13.2 236
Alaska 10.0 263
Arizona 8.1 294
Arkansas 8.8 190
California 9.0 276
Colorado 7.9 204
Connecticut 3.3 110
Delaware 5.9 238
Florida 15.4 335
Georgia 17.4 211
Hawaii 5.3 46
Idaho 2.6 120
Illinois 10.4 249
Indiana 7.2 113
Iowa 2.2 56
Kansas 6.0 115
Kentucky 9.7 109
Louisiana 15.4 249
Maine 2.1 83
Maryland 11.3 300
Massachusetts 4.4 149
Michigan 12.1 255
Minnesota 2.7 72
Mississippi 16.1 259
Missouri 9.0 178
Montana 6.0 109
Nebraska 4.3 102
Nevada 12.2 252
New Hampshire 2.1 57
New Jersey 7.4 159
New Mexico 11.4 285
New York 11.1 254
North Carolina 13.0 337
North Dakota 0.8 45
Ohio 7.3 120
Oklahoma 6.6 151
Oregon 4.9 159
Pennsylvania 6.3 106
Rhode Island 3.4 174
South Carolina 14.4 279
South Dakota 3.8 86
Tennessee 13.2 188
Texas 12.7 201
Utah 3.2 120
Vermont 2.2 48
Virginia 8.5 156
Washington 4.0 145
West Virginia 5.7 81
Wisconsin 2.6 53
Wyoming 6.8 161
Hmm. Warning message.
Note: select(keep_vars)
is deprecated as a solution, so if you see this approach someplace, it won’t work properly at this time or in the future. The warning message states that cannot do this without some modifications. What is suggested is all_of()
. When your variables are in an external variable, all_of(keep_vars)
will tell select()
that your variables are all of the variables in the vector. Yes, this is an extra step but worth the flexibility.
|>
USArrests select(all_of(keep_vars)) |>
head()
Murder Assault
Alabama 13.2 236
Alaska 10.0 263
Arizona 8.1 294
Arkansas 8.8 190
California 9.0 276
Colorado 7.9 204
But this could be problematic if a variable in your vector does not exists. Let’s add some variables that are not in the data frame and then pass that vector.
<- c("Murder", "Assault", "Var_x", "Var_y", "Var_z")
keep_vars_more
|>
USArrests select(all_of(keep_vars_more)) |>
head()
You will see the following error message:
Error in `all_of()`:
! Can't subset columns that don't exist.
✖ Columns `Var_x`, `Var_y`, and `Var_z` don't exist.
Tip: Rather than using all_of()
, use any_of()
. In this instance, any of the variables that exist in the data frame will be selection. Those variables that are not in the data frame will be ignored. With this approach, your code will not break if there something changes.
|>
USArrests select(any_of(keep_vars_more)) |>
head()
Murder Assault
Alabama 13.2 236
Alaska 10.0 263
Arizona 8.1 294
Arkansas 8.8 190
California 9.0 276
Colorado 7.9 204
Selecting Variables Starting with or Ending with Certain Characters
One thing about {dplyr}, when you load the library, there are functions from other libraries that are imported along with {dplyr}’s own functions. These important functions are designed to work with each other, so the people who maintain the libraries have packaged them up nicely so you don’t have to load separate libraries.
Many of the functions are imported from the {tidyselect} library and these functions are what give you additional manipulation ability. Some imported functions are: all_of()
, any_of()
, contains()
, ends_with()
, everything()
, last_col()
, matches()
, and starts_with()
.
With functions like starts_with()
, contains()
, and ends_with()
, you can select variables containing character patterns in their names. For example, rather than code out all variables for a memory span task, you might want to grab variables that contain the characters "span"
.
Rather than hard coding variable names (e.g., "span_time1"
, "span_time2"
, c("Murder", "Assault")
) to pass as an argument to select()
, you would use another function to perform the heaving lifting for you (e.g.,starts_with()
). In this case, the variable names that starts_with()
specific character patterns get passed to select()
. This process represents a good example of whe what is referred to as functional programming. This is also a good example of working smarter, not harder. Rather than coding specifically what to pass as an argument to another function, you utilize another function to pass its returned object as an argument to another function.
Selecting Variables Using starts_with()
Understanding starts_with()
But first, we need to understand what starts_with()
, actually does. The function name should provide you some insight but for more information, use ?starts_with
.
starts_with(match,
ignore.case = TRUE,
vars = NULL
)
Parameters/Arguments:
match
: a character vectorignore.case
: ifTRUE
, the default, ignores case when matching names. This is most flexible.vars
: a character vector of variable names. If not supplied, the variables are taken from the current selection context (as established by functions likeselect()
orpivot_longer()
).
The best way to understand the function is to use it (after reading the documentation). Let’s try out starts_with()
. Let’s set a required pattern match = some character
and because vars = NULL
by default, let’s just set vars = some character vector
. If you reviewed the documentation or looked at the arguments above, vars
is not the second parameter, so you will want to name it in the function call if it takes that second position as seen below. We will first test this function without using it on a data frame.
We set match = "a"
and vars = c("Hello", "Hi", "Bye")
. This will look for character "a"
to match the elements of a character vector c("Hello", "Hi", "Bye")
.
starts_with(match = "a",
vars = c("Hello", "Hi", "Bye")
)
integer(0)
What is returned is integer(0)
, which is actually speak for “there is no match”. Not very intuitive. Hmm, OK. Let’s try another character.
starts_with(match = "b",
vars = c("Hello", "Hi", "Bye")
)
[1] 3
OK, so now an integer is returned (yes, try is.integer()
if you don’t believe me).
is.integer(
starts_with("b",
vars = c("Hello", "Hi", "Bye")
))
[1] TRUE
And even though you only see one value, this object is still a vector. We can check using is.vector()
.
is.vector(
starts_with("b",
vars = c("Hello", "Hi", "Bye")
))
[1] TRUE
Importantly, starts_with()
returns a vector of values. What do the values correspond to? The values in the vector refer to the element index/position of a match and the number of the elements in the vector tells refers to the number of matches found. Because the third string element "Bye"
starts with "b"
, that’s what is returned.
Try something else:
starts_with("h",
vars = c("Hello", "Hi", "Bye")
)
[1] 1 2
Now the function returns a vector containing 1
and a 2
, giving length = 2, representing both the first and the second elements start with “h”.
Importantly, letter casing is ignored because ignore.case = TRUE
by default. Set to FALSE
if you want your match to be case sensitive.
starts_with("h",
vars = c("Hello", "Hi", "Bye"),
ignore.case = F
)
integer(0)
And now there are no matches.
Selecting Variables from a Data Frame using starts_with()
You will typically use starts_with()
when working with a data frame and in conjunction with other functions, like select()
in order to select variables that start with certain characters. When using starts_with()
in the context of select()
, the vars
argument is essentially passing vars = the names of the columns of the data frame passed to select()
.
Example:
select(mydataframe,
starts_with(match = "my pattern",
vars = "var names of mydataframe")
)
Let’s provide two use cases:
- Without Piping
You can pass the data frame
as the first argument to select()
.
select(USArrests,
starts_with("m")
|>
) head()
Murder
Alabama 13.2
Alaska 10.0
Arizona 8.1
Arkansas 8.8
California 9.0
Colorado 7.9
Notice that the only column variable that is returned is Murder.
- With Piping
You can also pipe the data frame
into select()
, which assumes the first argument is a data frame. This is a process you will see used often by others.
|>
USArrests select(starts_with("m")) |>
head()
Murder
Alabama 13.2
Alaska 10.0
Arizona 8.1
Arkansas 8.8
California 9.0
Colorado 7.9
Selecting Variables using ends_with()
You can use ends_with()
to find matches at the end of strings. This function is useful if you have variables that end based on their measurement type (e.g., “_rt” or “_acc” for reaction times and accuracy, respectively). Having the foresight to create variables is such a way to make accessing them easily later is a good example of working smarter, not harder.
|>
USArrests select(ends_with("t")) |>
head()
Assault
Alabama 236
Alaska 263
Arizona 294
Arkansas 190
California 276
Colorado 204
Selecting and Selecting Out Variables By/Between Index
So far, we have address selecting variables. There are, however, many approaches for selecting but also selecting out or omitting column variables. In other words, rather than including 14 out of 15 variables, you could simply exclude 1 of the 15 variables to achieve the same result.
You can pass multiple arguments for each specification or you can pass a single vector that contains all specifications.
select(1, 2)
: select first and second columnsselect(c(1, 2))
: select first and second columnsselect(-c(1, 2))
: select out first and second columnsselect(1:2)
: select first through second columnsselect(c(1:2))
: select first through second columnsselect(-c(1:2))
: select out first through second columns
Potential Recommendation: use options utilizing c()
to pass a vector because this habit will be more versatile with base R
functionality. However, online solutions will likely not take this approach.
Let’s make a data frame to work with first. We will name the data frame DAT
even though this is a terrible name because it lacks diagnosticity.
<-
DAT data.frame(
Id = c(100, 101, 102, 103, 104, 100, 105),
Sex = c('male', 'female', 'Male', NA, 'man', "male", "neither"),
Age = c(25, 33, 27, 40, 44, 25, 40),
Renting = c("yes", NA, "yes", NA, "no", "yes", "yes")
)
Select columns 1 and 2:
|>
DAT select(1, 2)
Id Sex
1 100 male
2 101 female
3 102 Male
4 103 <NA>
5 104 man
6 100 male
7 105 neither
Select columns 1 and 2 as a vector containing values 1 and 2:
|>
DAT select(c(1, 2))
Id Sex
1 100 male
2 101 female
3 102 Male
4 103 <NA>
5 104 man
6 100 male
7 105 neither
Select out columns 1 and 2 as a vector containing values 1 and 2:
|>
DAT select(-c(1,2))
Age Renting
1 25 yes
2 33 <NA>
3 27 yes
4 40 <NA>
5 44 no
6 25 yes
7 40 yes
Select from columns 1 to 2 using the :
operator:
|>
DAT select(1:2)
Id Sex
1 100 male
2 101 female
3 102 Male
4 103 <NA>
5 104 man
6 100 male
7 105 neither
Select from columns 1 through 3 using a vector containing the :
operator:
|>
DAT select(c(1:3))
Id Sex Age
1 100 male 25
2 101 female 33
3 102 Male 27
4 103 <NA> 40
5 104 man 44
6 100 male 25
7 105 neither 40
Select out columns 1 through 3 using a vector containing the :
operator:
|>
DAT select(-c(1:3)) # select out from here to there
Renting
1 yes
2 <NA>
3 yes
4 <NA>
5 no
6 yes
7 yes
Selecting and Selecting Out Variables By or Between Character Name
These approaches are similar to those offered earlier except that some involve passing variables by their name (e.g., character names). Whereas the order of the variables in a data frame may move around, the names may be more stable or permanent, at least after you have cleaned up the names. Consequently, passing variables by name may be more foolproof.
You don’t have to be familiar with all approaches and you may settle on using one that makes the most sense to you.
select("var1", "var2")
select(c("var1", "var2"))
select(-c("var1", "var2"))
select(var1:var2))
select(c("var1":"var2))
select(-c("var1":"var2))
Recommendation: use options utilizing c()
as this will be more versatile with base R
functionality.
These function approaches also work but they may lead to some confusion regarding usage of quotes:
select(var1, var2)
select(c(var1, var2))
select(-c(var1, var2))
Select variables Id
though Age
using the :
operator:
|>
DAT select(Id:Age) # select from here to there
Id Sex Age
1 100 male 25
2 101 female 33
3 102 Male 27
4 103 <NA> 40
5 104 man 44
6 100 male 25
7 105 neither 40
Select variables Id
though Age
passed as strings using the :
operator:
|>
DAT select("Id":"Age") # select from here to there
Id Sex Age
1 100 male 25
2 101 female 33
3 102 Male 27
4 103 <NA> 40
5 104 man 44
6 100 male 25
7 105 neither 40
Select variables Id
though Age
as a vector containing the variable names passed as strings and using the :
operator:
|>
DAT select(c("Id":"Age")) # select from here to there
Id Sex Age
1 100 male 25
2 101 female 33
3 102 Male 27
4 103 <NA> 40
5 104 man 44
6 100 male 25
7 105 neither 40
Select out variables Id
though Age
as a vector containing the variable names passed as strings and using the :
operator:
|>
DAT select(-c("Id":"Age")) # select out from here to there
Renting
1 yes
2 <NA>
3 yes
4 <NA>
5 no
6 yes
7 yes
You can also use the !
operator to select NOT these variables (therefore, all others)
|>
DAT select(!c("Id":"Age")) # select out from here to there
Renting
1 yes
2 <NA>
3 yes
4 <NA>
5 no
6 yes
7 yes
Selecting and Selecting Out Variables Characters in Their Names
select(starts_with("characters"))
select(ends_with("characters"))
select(contains('e'))
Selecting using starts_with()
Select variables which start with character “i”:
|>
DAT select(starts_with('i'))
Id
1 100
2 101
3 102
4 103
5 104
6 100
7 105
Select variables which DO NOT start with character “s”:
|>
DAT select(-starts_with('s'))
Id Age Renting
1 100 25 yes
2 101 33 <NA>
3 102 27 yes
4 103 40 <NA>
5 104 44 no
6 100 25 yes
7 105 40 yes
Selecting using ends_with()
Select variables which end with character “e”:
|>
DAT select(ends_with('e'))
Age
1 25
2 33
3 27
4 40
5 44
6 25
7 40
Select variables which end with character “e”:
|>
DAT select(-ends_with('e'))
Id Sex Renting
1 100 male yes
2 101 female <NA>
3 102 Male yes
4 103 <NA> <NA>
5 104 man no
6 100 male yes
7 105 neither yes
Selecting using contains()
Select variables which contain character “g”:
|>
DAT select(contains('g'))
Age Renting
1 25 yes
2 33 <NA>
3 27 yes
4 40 <NA>
5 44 no
6 25 yes
7 40 yes
Select variables which DO NOT contain character “g”:
|>
DAT select(-contains('g'))
Id Sex
1 100 male
2 101 female
3 102 Male
4 103 <NA>
5 104 man
6 100 male
7 105 neither
Selecting using regular expressions
Sometimes you do not want to search for specific character strings. Instead, you might want to look for strings that contain particular character patterns. For example, typing out var1
, var2
, var3
could be annoying. You might prefer just grabbing variables that match the pattern "var"
.
We will use some regular expressions, or regex, to help us. For the regex, you we will need to specify some code for finding patterns. For example, \\d{4}
will search for patterns with 4 digits, whereas the dot star .*
will help with all patterns. In particular, .
refers to any character (e.g,. digit, alpha character, or any other special character) and *
means zero or more times, so this pattern will search for all files that start with 4 digits followed by anything in the name. A pattern will restrict the search to files containing that exact character string. Some example will be helpful.
Selecting variables of all patterns using ".*"
Select variables containing a regular expression, use matches()
:
.*
will grab all names because it means any character and any number of times
|>
DAT select(matches(".*"))
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 104 man 44 no
6 100 male 25 yes
7 105 neither 40 yes
Selecting variables with digits using "\\d"
\\d
will grab all variables containing a digit:
|>
DAT mutate(var_1 = 1,
var_11 = 1,
var_3 = 1,
var1_var = 1
|>
) select(matches("\\d"))
var_1 var_11 var_3 var1_var
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
4 1 1 1 1
5 1 1 1 1
6 1 1 1 1
7 1 1 1 1
Selecting variables that start and contain characters "v.*\\d"
v.*\\d
will grab all variables that start with v and then contain any characters which are followed by a digit:
|>
DAT mutate(var_1 = 1,
var_11 = 1,
var_3 = 1,
var1_var = 1
|>
) select(matches("v.*\\d"))
var_1 var_11 var_3 var1_var
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
4 1 1 1 1
5 1 1 1 1
6 1 1 1 1
7 1 1 1 1
Selecting variables that end with a digit using "\\d$"
\\d$
will grab all variables ending in a digit ($
means end):
|>
DAT mutate(var_1 = 1,
var_11 = 1,
var_3 = 1,
var1_var = 1
|>
) select(matches("\\d$"))
var_1 var_11 var_3
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
7 1 1 1
Selecting variables by negation
You can also negate all regular expression matches if you want to exclude:
|>
DAT mutate(var_1 = 1,
var_11 = 1,
var_3 = 1,
var1_var = 1
|>
) select(-matches("\\d$"))
Id Sex Age Renting var1_var
1 100 male 25 yes 1
2 101 female 33 <NA> 1
3 102 Male 27 yes 1
4 103 <NA> 40 <NA> 1
5 104 man 44 no 1
6 100 male 25 yes 1
7 105 neither 40 yes 1
Note: The functions will return lowercase and uppercase variable name matches because the default behavior is ignore.case = TRUE
. Set to FALSE
if you want to perform precise surgery on the variables.
Selecting and Selecting Out Variables by Type
Select variables that are numeric:
|>
DAT select(where(is.numeric))
Id Age
1 100 25
2 101 33
3 102 27
4 103 40
5 104 44
6 100 25
7 105 40
Select variables that are NOT numeric:
|>
DAT select(-where(is.numeric))
Sex Renting
1 male yes
2 female <NA>
3 Male yes
4 <NA> <NA>
5 man no
6 male yes
7 neither yes
Select variables that are character:
|>
DAT select(where(is.character))
Sex Renting
1 male yes
2 female <NA>
3 Male yes
4 <NA> <NA>
5 man no
6 male yes
7 neither yes
Select variables that are NOT character:
|>
DAT select(-where(is.character))
Id Age
1 100 25
2 101 33
3 102 27
4 103 40
5 104 44
6 100 25
7 105 40
Select variables that are logical (TRUE
to FALSE
):
|>
DAT select(where(is.logical))
data frame with 0 columns and 7 rows
Select variables that are NOT logical (TRUE
to FALSE
):
|>
DAT select(-where(is.logical))
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 104 man 44 no
6 100 male 25 yes
7 105 neither 40 yes
Creating/Mutating Variables
When we want to add new variables or modify existing ones, we will use mutate()
. Before using the function, review what it does using ?mutate
.
mutate(
.data,
...,
.by = NULL,
.keep = c("all", "used", "unused", "none"),
.before = NULL,
.after = NULL
)
Parameters/Arguments:
.data
: a data frame...
: name-value pairs; name is the name of the column and the value would be the value or function used
Let’s start with a constant and move up from there.
Creating/Mutating A Constant Value
Let’s create a new variable, new_var
and set it equal to 1
.
|>
USArrests mutate(new_var1 = 1)
Murder Assault UrbanPop Rape new_var1
Alabama 13.2 236 58 21.2 1
Alaska 10.0 263 48 44.5 1
Arizona 8.1 294 80 31.0 1
Arkansas 8.8 190 50 19.5 1
California 9.0 276 91 40.6 1
Colorado 7.9 204 78 38.7 1
Connecticut 3.3 110 77 11.1 1
Delaware 5.9 238 72 15.8 1
Florida 15.4 335 80 31.9 1
Georgia 17.4 211 60 25.8 1
Hawaii 5.3 46 83 20.2 1
Idaho 2.6 120 54 14.2 1
Illinois 10.4 249 83 24.0 1
Indiana 7.2 113 65 21.0 1
Iowa 2.2 56 57 11.3 1
Kansas 6.0 115 66 18.0 1
Kentucky 9.7 109 52 16.3 1
Louisiana 15.4 249 66 22.2 1
Maine 2.1 83 51 7.8 1
Maryland 11.3 300 67 27.8 1
Massachusetts 4.4 149 85 16.3 1
Michigan 12.1 255 74 35.1 1
Minnesota 2.7 72 66 14.9 1
Mississippi 16.1 259 44 17.1 1
Missouri 9.0 178 70 28.2 1
Montana 6.0 109 53 16.4 1
Nebraska 4.3 102 62 16.5 1
Nevada 12.2 252 81 46.0 1
New Hampshire 2.1 57 56 9.5 1
New Jersey 7.4 159 89 18.8 1
New Mexico 11.4 285 70 32.1 1
New York 11.1 254 86 26.1 1
North Carolina 13.0 337 45 16.1 1
North Dakota 0.8 45 44 7.3 1
Ohio 7.3 120 75 21.4 1
Oklahoma 6.6 151 68 20.0 1
Oregon 4.9 159 67 29.3 1
Pennsylvania 6.3 106 72 14.9 1
Rhode Island 3.4 174 87 8.3 1
South Carolina 14.4 279 48 22.5 1
South Dakota 3.8 86 45 12.8 1
Tennessee 13.2 188 59 26.9 1
Texas 12.7 201 80 25.5 1
Utah 3.2 120 80 22.9 1
Vermont 2.2 48 32 11.2 1
Virginia 8.5 156 63 20.7 1
Washington 4.0 145 73 26.2 1
West Virginia 5.7 81 39 9.3 1
Wisconsin 2.6 53 66 10.8 1
Wyoming 6.8 161 60 15.6 1
To mutate more than one variable, you can add a new name-value pair as a new argument:
|>
USArrests mutate(new_var1 = 1,
new_var2 = 9999
)
Murder Assault UrbanPop Rape new_var1 new_var2
Alabama 13.2 236 58 21.2 1 9999
Alaska 10.0 263 48 44.5 1 9999
Arizona 8.1 294 80 31.0 1 9999
Arkansas 8.8 190 50 19.5 1 9999
California 9.0 276 91 40.6 1 9999
Colorado 7.9 204 78 38.7 1 9999
Connecticut 3.3 110 77 11.1 1 9999
Delaware 5.9 238 72 15.8 1 9999
Florida 15.4 335 80 31.9 1 9999
Georgia 17.4 211 60 25.8 1 9999
Hawaii 5.3 46 83 20.2 1 9999
Idaho 2.6 120 54 14.2 1 9999
Illinois 10.4 249 83 24.0 1 9999
Indiana 7.2 113 65 21.0 1 9999
Iowa 2.2 56 57 11.3 1 9999
Kansas 6.0 115 66 18.0 1 9999
Kentucky 9.7 109 52 16.3 1 9999
Louisiana 15.4 249 66 22.2 1 9999
Maine 2.1 83 51 7.8 1 9999
Maryland 11.3 300 67 27.8 1 9999
Massachusetts 4.4 149 85 16.3 1 9999
Michigan 12.1 255 74 35.1 1 9999
Minnesota 2.7 72 66 14.9 1 9999
Mississippi 16.1 259 44 17.1 1 9999
Missouri 9.0 178 70 28.2 1 9999
Montana 6.0 109 53 16.4 1 9999
Nebraska 4.3 102 62 16.5 1 9999
Nevada 12.2 252 81 46.0 1 9999
New Hampshire 2.1 57 56 9.5 1 9999
New Jersey 7.4 159 89 18.8 1 9999
New Mexico 11.4 285 70 32.1 1 9999
New York 11.1 254 86 26.1 1 9999
North Carolina 13.0 337 45 16.1 1 9999
North Dakota 0.8 45 44 7.3 1 9999
Ohio 7.3 120 75 21.4 1 9999
Oklahoma 6.6 151 68 20.0 1 9999
Oregon 4.9 159 67 29.3 1 9999
Pennsylvania 6.3 106 72 14.9 1 9999
Rhode Island 3.4 174 87 8.3 1 9999
South Carolina 14.4 279 48 22.5 1 9999
South Dakota 3.8 86 45 12.8 1 9999
Tennessee 13.2 188 59 26.9 1 9999
Texas 12.7 201 80 25.5 1 9999
Utah 3.2 120 80 22.9 1 9999
Vermont 2.2 48 32 11.2 1 9999
Virginia 8.5 156 63 20.7 1 9999
Washington 4.0 145 73 26.2 1 9999
West Virginia 5.7 81 39 9.3 1 9999
Wisconsin 2.6 53 66 10.8 1 9999
Wyoming 6.8 161 60 15.6 1 9999
Or you can add a new mutate()
by piping:
|>
USArrests mutate(new_var1 = 1) |> # add the variable
mutate(new_var1 = 9999) # then add another variable
Murder Assault UrbanPop Rape new_var1
Alabama 13.2 236 58 21.2 9999
Alaska 10.0 263 48 44.5 9999
Arizona 8.1 294 80 31.0 9999
Arkansas 8.8 190 50 19.5 9999
California 9.0 276 91 40.6 9999
Colorado 7.9 204 78 38.7 9999
Connecticut 3.3 110 77 11.1 9999
Delaware 5.9 238 72 15.8 9999
Florida 15.4 335 80 31.9 9999
Georgia 17.4 211 60 25.8 9999
Hawaii 5.3 46 83 20.2 9999
Idaho 2.6 120 54 14.2 9999
Illinois 10.4 249 83 24.0 9999
Indiana 7.2 113 65 21.0 9999
Iowa 2.2 56 57 11.3 9999
Kansas 6.0 115 66 18.0 9999
Kentucky 9.7 109 52 16.3 9999
Louisiana 15.4 249 66 22.2 9999
Maine 2.1 83 51 7.8 9999
Maryland 11.3 300 67 27.8 9999
Massachusetts 4.4 149 85 16.3 9999
Michigan 12.1 255 74 35.1 9999
Minnesota 2.7 72 66 14.9 9999
Mississippi 16.1 259 44 17.1 9999
Missouri 9.0 178 70 28.2 9999
Montana 6.0 109 53 16.4 9999
Nebraska 4.3 102 62 16.5 9999
Nevada 12.2 252 81 46.0 9999
New Hampshire 2.1 57 56 9.5 9999
New Jersey 7.4 159 89 18.8 9999
New Mexico 11.4 285 70 32.1 9999
New York 11.1 254 86 26.1 9999
North Carolina 13.0 337 45 16.1 9999
North Dakota 0.8 45 44 7.3 9999
Ohio 7.3 120 75 21.4 9999
Oklahoma 6.6 151 68 20.0 9999
Oregon 4.9 159 67 29.3 9999
Pennsylvania 6.3 106 72 14.9 9999
Rhode Island 3.4 174 87 8.3 9999
South Carolina 14.4 279 48 22.5 9999
South Dakota 3.8 86 45 12.8 9999
Tennessee 13.2 188 59 26.9 9999
Texas 12.7 201 80 25.5 9999
Utah 3.2 120 80 22.9 9999
Vermont 2.2 48 32 11.2 9999
Virginia 8.5 156 63 20.7 9999
Washington 4.0 145 73 26.2 9999
West Virginia 5.7 81 39 9.3 9999
Wisconsin 2.6 53 66 10.8 9999
Wyoming 6.8 161 60 15.6 9999
Whatever your approach, keep the variable creation on separate lines. Don’t do this:
|>
USArrests mutate(new_var1 = 1, new_var1 = 9999)
Murder Assault UrbanPop Rape new_var1
Alabama 13.2 236 58 21.2 9999
Alaska 10.0 263 48 44.5 9999
Arizona 8.1 294 80 31.0 9999
Arkansas 8.8 190 50 19.5 9999
California 9.0 276 91 40.6 9999
Colorado 7.9 204 78 38.7 9999
Connecticut 3.3 110 77 11.1 9999
Delaware 5.9 238 72 15.8 9999
Florida 15.4 335 80 31.9 9999
Georgia 17.4 211 60 25.8 9999
Hawaii 5.3 46 83 20.2 9999
Idaho 2.6 120 54 14.2 9999
Illinois 10.4 249 83 24.0 9999
Indiana 7.2 113 65 21.0 9999
Iowa 2.2 56 57 11.3 9999
Kansas 6.0 115 66 18.0 9999
Kentucky 9.7 109 52 16.3 9999
Louisiana 15.4 249 66 22.2 9999
Maine 2.1 83 51 7.8 9999
Maryland 11.3 300 67 27.8 9999
Massachusetts 4.4 149 85 16.3 9999
Michigan 12.1 255 74 35.1 9999
Minnesota 2.7 72 66 14.9 9999
Mississippi 16.1 259 44 17.1 9999
Missouri 9.0 178 70 28.2 9999
Montana 6.0 109 53 16.4 9999
Nebraska 4.3 102 62 16.5 9999
Nevada 12.2 252 81 46.0 9999
New Hampshire 2.1 57 56 9.5 9999
New Jersey 7.4 159 89 18.8 9999
New Mexico 11.4 285 70 32.1 9999
New York 11.1 254 86 26.1 9999
North Carolina 13.0 337 45 16.1 9999
North Dakota 0.8 45 44 7.3 9999
Ohio 7.3 120 75 21.4 9999
Oklahoma 6.6 151 68 20.0 9999
Oregon 4.9 159 67 29.3 9999
Pennsylvania 6.3 106 72 14.9 9999
Rhode Island 3.4 174 87 8.3 9999
South Carolina 14.4 279 48 22.5 9999
South Dakota 3.8 86 45 12.8 9999
Tennessee 13.2 188 59 26.9 9999
Texas 12.7 201 80 25.5 9999
Utah 3.2 120 80 22.9 9999
Vermont 2.2 48 32 11.2 9999
Virginia 8.5 156 63 20.7 9999
Washington 4.0 145 73 26.2 9999
West Virginia 5.7 81 39 9.3 9999
Wisconsin 2.6 53 66 10.8 9999
Wyoming 6.8 161 60 15.6 9999
Although there is nothing technically incorrect with doing so, (1) this is harder to read, (2) the R
community won’t like it, and importantly (3) it is less flexible when you want to temporarily comment out a line of your code as seen here.
|>
USArrests # mutate(new_var1 = 1) |> # add the variable
mutate(new_var1 = 9999) # then add another variable
Murder Assault UrbanPop Rape new_var1
Alabama 13.2 236 58 21.2 9999
Alaska 10.0 263 48 44.5 9999
Arizona 8.1 294 80 31.0 9999
Arkansas 8.8 190 50 19.5 9999
California 9.0 276 91 40.6 9999
Colorado 7.9 204 78 38.7 9999
Connecticut 3.3 110 77 11.1 9999
Delaware 5.9 238 72 15.8 9999
Florida 15.4 335 80 31.9 9999
Georgia 17.4 211 60 25.8 9999
Hawaii 5.3 46 83 20.2 9999
Idaho 2.6 120 54 14.2 9999
Illinois 10.4 249 83 24.0 9999
Indiana 7.2 113 65 21.0 9999
Iowa 2.2 56 57 11.3 9999
Kansas 6.0 115 66 18.0 9999
Kentucky 9.7 109 52 16.3 9999
Louisiana 15.4 249 66 22.2 9999
Maine 2.1 83 51 7.8 9999
Maryland 11.3 300 67 27.8 9999
Massachusetts 4.4 149 85 16.3 9999
Michigan 12.1 255 74 35.1 9999
Minnesota 2.7 72 66 14.9 9999
Mississippi 16.1 259 44 17.1 9999
Missouri 9.0 178 70 28.2 9999
Montana 6.0 109 53 16.4 9999
Nebraska 4.3 102 62 16.5 9999
Nevada 12.2 252 81 46.0 9999
New Hampshire 2.1 57 56 9.5 9999
New Jersey 7.4 159 89 18.8 9999
New Mexico 11.4 285 70 32.1 9999
New York 11.1 254 86 26.1 9999
North Carolina 13.0 337 45 16.1 9999
North Dakota 0.8 45 44 7.3 9999
Ohio 7.3 120 75 21.4 9999
Oklahoma 6.6 151 68 20.0 9999
Oregon 4.9 159 67 29.3 9999
Pennsylvania 6.3 106 72 14.9 9999
Rhode Island 3.4 174 87 8.3 9999
South Carolina 14.4 279 48 22.5 9999
South Dakota 3.8 86 45 12.8 9999
Tennessee 13.2 188 59 26.9 9999
Texas 12.7 201 80 25.5 9999
Utah 3.2 120 80 22.9 9999
Vermont 2.2 48 32 11.2 9999
Virginia 8.5 156 63 20.7 9999
Washington 4.0 145 73 26.2 9999
West Virginia 5.7 81 39 9.3 9999
Wisconsin 2.6 53 66 10.8 9999
Wyoming 6.8 161 60 15.6 9999
Creating/Mutating A Constant Character
|>
USArrests mutate(new_var1 = "Below Average",
)
Murder Assault UrbanPop Rape new_var1
Alabama 13.2 236 58 21.2 Below Average
Alaska 10.0 263 48 44.5 Below Average
Arizona 8.1 294 80 31.0 Below Average
Arkansas 8.8 190 50 19.5 Below Average
California 9.0 276 91 40.6 Below Average
Colorado 7.9 204 78 38.7 Below Average
Connecticut 3.3 110 77 11.1 Below Average
Delaware 5.9 238 72 15.8 Below Average
Florida 15.4 335 80 31.9 Below Average
Georgia 17.4 211 60 25.8 Below Average
Hawaii 5.3 46 83 20.2 Below Average
Idaho 2.6 120 54 14.2 Below Average
Illinois 10.4 249 83 24.0 Below Average
Indiana 7.2 113 65 21.0 Below Average
Iowa 2.2 56 57 11.3 Below Average
Kansas 6.0 115 66 18.0 Below Average
Kentucky 9.7 109 52 16.3 Below Average
Louisiana 15.4 249 66 22.2 Below Average
Maine 2.1 83 51 7.8 Below Average
Maryland 11.3 300 67 27.8 Below Average
Massachusetts 4.4 149 85 16.3 Below Average
Michigan 12.1 255 74 35.1 Below Average
Minnesota 2.7 72 66 14.9 Below Average
Mississippi 16.1 259 44 17.1 Below Average
Missouri 9.0 178 70 28.2 Below Average
Montana 6.0 109 53 16.4 Below Average
Nebraska 4.3 102 62 16.5 Below Average
Nevada 12.2 252 81 46.0 Below Average
New Hampshire 2.1 57 56 9.5 Below Average
New Jersey 7.4 159 89 18.8 Below Average
New Mexico 11.4 285 70 32.1 Below Average
New York 11.1 254 86 26.1 Below Average
North Carolina 13.0 337 45 16.1 Below Average
North Dakota 0.8 45 44 7.3 Below Average
Ohio 7.3 120 75 21.4 Below Average
Oklahoma 6.6 151 68 20.0 Below Average
Oregon 4.9 159 67 29.3 Below Average
Pennsylvania 6.3 106 72 14.9 Below Average
Rhode Island 3.4 174 87 8.3 Below Average
South Carolina 14.4 279 48 22.5 Below Average
South Dakota 3.8 86 45 12.8 Below Average
Tennessee 13.2 188 59 26.9 Below Average
Texas 12.7 201 80 25.5 Below Average
Utah 3.2 120 80 22.9 Below Average
Vermont 2.2 48 32 11.2 Below Average
Virginia 8.5 156 63 20.7 Below Average
Washington 4.0 145 73 26.2 Below Average
West Virginia 5.7 81 39 9.3 Below Average
Wisconsin 2.6 53 66 10.8 Below Average
Wyoming 6.8 161 60 15.6 Below Average
Creating/Mutating A Logical Object
Logical objects are TRUE
or FALSE
. We can perform the logical test by asking whether Assault > Murder
. If the Assault
value for a row is greater than the Murder
value, TRUE
will be returned, otherwise FALSE
.
|>
USArrests mutate(new_var = Assault > Murder)
Murder Assault UrbanPop Rape new_var
Alabama 13.2 236 58 21.2 TRUE
Alaska 10.0 263 48 44.5 TRUE
Arizona 8.1 294 80 31.0 TRUE
Arkansas 8.8 190 50 19.5 TRUE
California 9.0 276 91 40.6 TRUE
Colorado 7.9 204 78 38.7 TRUE
Connecticut 3.3 110 77 11.1 TRUE
Delaware 5.9 238 72 15.8 TRUE
Florida 15.4 335 80 31.9 TRUE
Georgia 17.4 211 60 25.8 TRUE
Hawaii 5.3 46 83 20.2 TRUE
Idaho 2.6 120 54 14.2 TRUE
Illinois 10.4 249 83 24.0 TRUE
Indiana 7.2 113 65 21.0 TRUE
Iowa 2.2 56 57 11.3 TRUE
Kansas 6.0 115 66 18.0 TRUE
Kentucky 9.7 109 52 16.3 TRUE
Louisiana 15.4 249 66 22.2 TRUE
Maine 2.1 83 51 7.8 TRUE
Maryland 11.3 300 67 27.8 TRUE
Massachusetts 4.4 149 85 16.3 TRUE
Michigan 12.1 255 74 35.1 TRUE
Minnesota 2.7 72 66 14.9 TRUE
Mississippi 16.1 259 44 17.1 TRUE
Missouri 9.0 178 70 28.2 TRUE
Montana 6.0 109 53 16.4 TRUE
Nebraska 4.3 102 62 16.5 TRUE
Nevada 12.2 252 81 46.0 TRUE
New Hampshire 2.1 57 56 9.5 TRUE
New Jersey 7.4 159 89 18.8 TRUE
New Mexico 11.4 285 70 32.1 TRUE
New York 11.1 254 86 26.1 TRUE
North Carolina 13.0 337 45 16.1 TRUE
North Dakota 0.8 45 44 7.3 TRUE
Ohio 7.3 120 75 21.4 TRUE
Oklahoma 6.6 151 68 20.0 TRUE
Oregon 4.9 159 67 29.3 TRUE
Pennsylvania 6.3 106 72 14.9 TRUE
Rhode Island 3.4 174 87 8.3 TRUE
South Carolina 14.4 279 48 22.5 TRUE
South Dakota 3.8 86 45 12.8 TRUE
Tennessee 13.2 188 59 26.9 TRUE
Texas 12.7 201 80 25.5 TRUE
Utah 3.2 120 80 22.9 TRUE
Vermont 2.2 48 32 11.2 TRUE
Virginia 8.5 156 63 20.7 TRUE
Washington 4.0 145 73 26.2 TRUE
West Virginia 5.7 81 39 9.3 TRUE
Wisconsin 2.6 53 66 10.8 TRUE
Wyoming 6.8 161 60 15.6 TRUE
We see that all states (rows) have more assults than murders. No surprise.
Creating/Mutating Based on Function
Here, the name value pair will contain a function. For example, let’s create a variable based on the mean()
of a variable.
|>
USArrests mutate(Mean_Assault = mean(Assault))
Murder Assault UrbanPop Rape Mean_Assault
Alabama 13.2 236 58 21.2 170.76
Alaska 10.0 263 48 44.5 170.76
Arizona 8.1 294 80 31.0 170.76
Arkansas 8.8 190 50 19.5 170.76
California 9.0 276 91 40.6 170.76
Colorado 7.9 204 78 38.7 170.76
Connecticut 3.3 110 77 11.1 170.76
Delaware 5.9 238 72 15.8 170.76
Florida 15.4 335 80 31.9 170.76
Georgia 17.4 211 60 25.8 170.76
Hawaii 5.3 46 83 20.2 170.76
Idaho 2.6 120 54 14.2 170.76
Illinois 10.4 249 83 24.0 170.76
Indiana 7.2 113 65 21.0 170.76
Iowa 2.2 56 57 11.3 170.76
Kansas 6.0 115 66 18.0 170.76
Kentucky 9.7 109 52 16.3 170.76
Louisiana 15.4 249 66 22.2 170.76
Maine 2.1 83 51 7.8 170.76
Maryland 11.3 300 67 27.8 170.76
Massachusetts 4.4 149 85 16.3 170.76
Michigan 12.1 255 74 35.1 170.76
Minnesota 2.7 72 66 14.9 170.76
Mississippi 16.1 259 44 17.1 170.76
Missouri 9.0 178 70 28.2 170.76
Montana 6.0 109 53 16.4 170.76
Nebraska 4.3 102 62 16.5 170.76
Nevada 12.2 252 81 46.0 170.76
New Hampshire 2.1 57 56 9.5 170.76
New Jersey 7.4 159 89 18.8 170.76
New Mexico 11.4 285 70 32.1 170.76
New York 11.1 254 86 26.1 170.76
North Carolina 13.0 337 45 16.1 170.76
North Dakota 0.8 45 44 7.3 170.76
Ohio 7.3 120 75 21.4 170.76
Oklahoma 6.6 151 68 20.0 170.76
Oregon 4.9 159 67 29.3 170.76
Pennsylvania 6.3 106 72 14.9 170.76
Rhode Island 3.4 174 87 8.3 170.76
South Carolina 14.4 279 48 22.5 170.76
South Dakota 3.8 86 45 12.8 170.76
Tennessee 13.2 188 59 26.9 170.76
Texas 12.7 201 80 25.5 170.76
Utah 3.2 120 80 22.9 170.76
Vermont 2.2 48 32 11.2 170.76
Virginia 8.5 156 63 20.7 170.76
Washington 4.0 145 73 26.2 170.76
West Virginia 5.7 81 39 9.3 170.76
Wisconsin 2.6 53 66 10.8 170.76
Wyoming 6.8 161 60 15.6 170.76
Notice that the new column contains the mean for all rows.
We may also wish to categorize rows based on their relative Assault
data. How about assault lower than the mean assault?
|>
USArrests mutate(Lower_than_Mean_Assault = Assault < mean(Assault))
Murder Assault UrbanPop Rape Lower_than_Mean_Assault
Alabama 13.2 236 58 21.2 FALSE
Alaska 10.0 263 48 44.5 FALSE
Arizona 8.1 294 80 31.0 FALSE
Arkansas 8.8 190 50 19.5 FALSE
California 9.0 276 91 40.6 FALSE
Colorado 7.9 204 78 38.7 FALSE
Connecticut 3.3 110 77 11.1 TRUE
Delaware 5.9 238 72 15.8 FALSE
Florida 15.4 335 80 31.9 FALSE
Georgia 17.4 211 60 25.8 FALSE
Hawaii 5.3 46 83 20.2 TRUE
Idaho 2.6 120 54 14.2 TRUE
Illinois 10.4 249 83 24.0 FALSE
Indiana 7.2 113 65 21.0 TRUE
Iowa 2.2 56 57 11.3 TRUE
Kansas 6.0 115 66 18.0 TRUE
Kentucky 9.7 109 52 16.3 TRUE
Louisiana 15.4 249 66 22.2 FALSE
Maine 2.1 83 51 7.8 TRUE
Maryland 11.3 300 67 27.8 FALSE
Massachusetts 4.4 149 85 16.3 TRUE
Michigan 12.1 255 74 35.1 FALSE
Minnesota 2.7 72 66 14.9 TRUE
Mississippi 16.1 259 44 17.1 FALSE
Missouri 9.0 178 70 28.2 FALSE
Montana 6.0 109 53 16.4 TRUE
Nebraska 4.3 102 62 16.5 TRUE
Nevada 12.2 252 81 46.0 FALSE
New Hampshire 2.1 57 56 9.5 TRUE
New Jersey 7.4 159 89 18.8 TRUE
New Mexico 11.4 285 70 32.1 FALSE
New York 11.1 254 86 26.1 FALSE
North Carolina 13.0 337 45 16.1 FALSE
North Dakota 0.8 45 44 7.3 TRUE
Ohio 7.3 120 75 21.4 TRUE
Oklahoma 6.6 151 68 20.0 TRUE
Oregon 4.9 159 67 29.3 TRUE
Pennsylvania 6.3 106 72 14.9 TRUE
Rhode Island 3.4 174 87 8.3 FALSE
South Carolina 14.4 279 48 22.5 FALSE
South Dakota 3.8 86 45 12.8 TRUE
Tennessee 13.2 188 59 26.9 FALSE
Texas 12.7 201 80 25.5 FALSE
Utah 3.2 120 80 22.9 TRUE
Vermont 2.2 48 32 11.2 TRUE
Virginia 8.5 156 63 20.7 TRUE
Washington 4.0 145 73 26.2 TRUE
West Virginia 5.7 81 39 9.3 TRUE
Wisconsin 2.6 53 66 10.8 TRUE
Wyoming 6.8 161 60 15.6 TRUE
TRUE
and False
make terrible variable levels, however. Using ifelse()
, we will specify conditions for categorization. For example, make rows that are below the mean take on a value, else/otherwise, make the rows a different value.
Example:
ifelse(Assault < Mean_Assault, "Below", "Equal or Above"))
If the Assault
value is <
Mean_Assault, assign the row a value of "Below"
, else, assign "Equal or Above"
. Let’s use ifelse()
to create a name-value pair
Relative_Assult = ifelse(Assault < Mean_Assault, "Below", "Equal or Above"))
|>
USArrests mutate(Mean_Assault = mean(Assault),
Relative_Assult = ifelse(Assault < Mean_Assault, "Below", "Equal or Above")
)
Murder Assault UrbanPop Rape Mean_Assault Relative_Assult
Alabama 13.2 236 58 21.2 170.76 Equal or Above
Alaska 10.0 263 48 44.5 170.76 Equal or Above
Arizona 8.1 294 80 31.0 170.76 Equal or Above
Arkansas 8.8 190 50 19.5 170.76 Equal or Above
California 9.0 276 91 40.6 170.76 Equal or Above
Colorado 7.9 204 78 38.7 170.76 Equal or Above
Connecticut 3.3 110 77 11.1 170.76 Below
Delaware 5.9 238 72 15.8 170.76 Equal or Above
Florida 15.4 335 80 31.9 170.76 Equal or Above
Georgia 17.4 211 60 25.8 170.76 Equal or Above
Hawaii 5.3 46 83 20.2 170.76 Below
Idaho 2.6 120 54 14.2 170.76 Below
Illinois 10.4 249 83 24.0 170.76 Equal or Above
Indiana 7.2 113 65 21.0 170.76 Below
Iowa 2.2 56 57 11.3 170.76 Below
Kansas 6.0 115 66 18.0 170.76 Below
Kentucky 9.7 109 52 16.3 170.76 Below
Louisiana 15.4 249 66 22.2 170.76 Equal or Above
Maine 2.1 83 51 7.8 170.76 Below
Maryland 11.3 300 67 27.8 170.76 Equal or Above
Massachusetts 4.4 149 85 16.3 170.76 Below
Michigan 12.1 255 74 35.1 170.76 Equal or Above
Minnesota 2.7 72 66 14.9 170.76 Below
Mississippi 16.1 259 44 17.1 170.76 Equal or Above
Missouri 9.0 178 70 28.2 170.76 Equal or Above
Montana 6.0 109 53 16.4 170.76 Below
Nebraska 4.3 102 62 16.5 170.76 Below
Nevada 12.2 252 81 46.0 170.76 Equal or Above
New Hampshire 2.1 57 56 9.5 170.76 Below
New Jersey 7.4 159 89 18.8 170.76 Below
New Mexico 11.4 285 70 32.1 170.76 Equal or Above
New York 11.1 254 86 26.1 170.76 Equal or Above
North Carolina 13.0 337 45 16.1 170.76 Equal or Above
North Dakota 0.8 45 44 7.3 170.76 Below
Ohio 7.3 120 75 21.4 170.76 Below
Oklahoma 6.6 151 68 20.0 170.76 Below
Oregon 4.9 159 67 29.3 170.76 Below
Pennsylvania 6.3 106 72 14.9 170.76 Below
Rhode Island 3.4 174 87 8.3 170.76 Equal or Above
South Carolina 14.4 279 48 22.5 170.76 Equal or Above
South Dakota 3.8 86 45 12.8 170.76 Below
Tennessee 13.2 188 59 26.9 170.76 Equal or Above
Texas 12.7 201 80 25.5 170.76 Equal or Above
Utah 3.2 120 80 22.9 170.76 Below
Vermont 2.2 48 32 11.2 170.76 Below
Virginia 8.5 156 63 20.7 170.76 Below
Washington 4.0 145 73 26.2 170.76 Below
West Virginia 5.7 81 39 9.3 170.76 Below
Wisconsin 2.6 53 66 10.8 170.76 Below
Wyoming 6.8 161 60 15.6 170.76 Below
Bu you could also combine the steps together:
|>
USArrests mutate(Relative_Assult = ifelse(Assault < mean(Assault), "Below", "Equal or Above"))
Murder Assault UrbanPop Rape Relative_Assult
Alabama 13.2 236 58 21.2 Equal or Above
Alaska 10.0 263 48 44.5 Equal or Above
Arizona 8.1 294 80 31.0 Equal or Above
Arkansas 8.8 190 50 19.5 Equal or Above
California 9.0 276 91 40.6 Equal or Above
Colorado 7.9 204 78 38.7 Equal or Above
Connecticut 3.3 110 77 11.1 Below
Delaware 5.9 238 72 15.8 Equal or Above
Florida 15.4 335 80 31.9 Equal or Above
Georgia 17.4 211 60 25.8 Equal or Above
Hawaii 5.3 46 83 20.2 Below
Idaho 2.6 120 54 14.2 Below
Illinois 10.4 249 83 24.0 Equal or Above
Indiana 7.2 113 65 21.0 Below
Iowa 2.2 56 57 11.3 Below
Kansas 6.0 115 66 18.0 Below
Kentucky 9.7 109 52 16.3 Below
Louisiana 15.4 249 66 22.2 Equal or Above
Maine 2.1 83 51 7.8 Below
Maryland 11.3 300 67 27.8 Equal or Above
Massachusetts 4.4 149 85 16.3 Below
Michigan 12.1 255 74 35.1 Equal or Above
Minnesota 2.7 72 66 14.9 Below
Mississippi 16.1 259 44 17.1 Equal or Above
Missouri 9.0 178 70 28.2 Equal or Above
Montana 6.0 109 53 16.4 Below
Nebraska 4.3 102 62 16.5 Below
Nevada 12.2 252 81 46.0 Equal or Above
New Hampshire 2.1 57 56 9.5 Below
New Jersey 7.4 159 89 18.8 Below
New Mexico 11.4 285 70 32.1 Equal or Above
New York 11.1 254 86 26.1 Equal or Above
North Carolina 13.0 337 45 16.1 Equal or Above
North Dakota 0.8 45 44 7.3 Below
Ohio 7.3 120 75 21.4 Below
Oklahoma 6.6 151 68 20.0 Below
Oregon 4.9 159 67 29.3 Below
Pennsylvania 6.3 106 72 14.9 Below
Rhode Island 3.4 174 87 8.3 Equal or Above
South Carolina 14.4 279 48 22.5 Equal or Above
South Dakota 3.8 86 45 12.8 Below
Tennessee 13.2 188 59 26.9 Equal or Above
Texas 12.7 201 80 25.5 Equal or Above
Utah 3.2 120 80 22.9 Below
Vermont 2.2 48 32 11.2 Below
Virginia 8.5 156 63 20.7 Below
Washington 4.0 145 73 26.2 Below
West Virginia 5.7 81 39 9.3 Below
Wisconsin 2.6 53 66 10.8 Below
Wyoming 6.8 161 60 15.6 Below
Filtering Rows
Data files are messy and as a result require cleaning. You will have missing rows, incorrect variable names, files with columns named the same, NA
s, strings for numbers, duplicate rows of data, people who completed a survey twice, and all sorts of unimaginable and unbelievable data problems. So cleaning is important.
Whereas selecting is the processes of retaining variable columns, filtering is the process of retaining rows from a data set. For example, if you want your data frame to contain only individuals who are 20 or more years old, you would filter to retain them and remove others. If you want to explore data for only individuals from Kansas, you would filter rows to retain only them and remove all others.
Data frame manipulation may involve keeping only certain rows for data, for example, male or female respondents, male respondents, those who do not contain missing values (e.g., NA
s) for a specific column variable, who are of a certain age (or born in in certain year), who are above (or below) some acceptable criterion, etc.
When a column variable has more than one value (e.g., check using unique()
to determine the unique elements contained), you may wish to filter on some but not others.
You may even need to filter rows in a data frame that are distinct (e.g., not duplicate responses). This is often a good first step in order to determine the size of the usable data set. dplyr::distinct()
makes de-duplicating easy as this function will return only distinct rows.
Understanding Filtering Operators
We will filter data using filter()
from {dplyr}
filter(.data,
... )
Parameters/Arguments:
.data
: a data frame...
: expressions that return a logical value (TRUE
orFALSE
)
NOTE:: Filtering cases using the dplyr::filter()
verbs works by removing rows that do not match a specific criterion and then by returning the data frame that omits the mismatched condition. It keeps only rows that are TRUE
(match) your specification.
Some useful filtering operators and functions include: ==
, >
, >=
, &
, |
, !
, xor()
, c()
, is.na()
, between()
, near()
.
Filtering Using filter()
We will work with different data frames because the the nature of their contents.
What does the data frame look like again?
DAT
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 104 man 44 no
6 100 male 25 yes
7 105 neither 40 yes
Row/Observations/Cases can be filtered to “include” only certain matched conditions or can be filtered to “exclude” by negating those matched conditions. If the column variable Sex
is in the data frame and cases are 'male'
, 'men'
, 'female'
, 'women'
, 'neither'
, NA
, etc., you can specify the column Sex
variable and then the row matching condition(s).
Filtering Matches to a Specific Character Value
The first argument in dplyr::filter()
is a data frame, and the function all dplyr::filter(DAT, Sex == 'female')
will filter the data frame named DAT
to include rows for which the sex
column equals 'female'
. In other words, TRUE
rows.
::filter(DAT,
dplyr== 'female'
Sex )
Id Sex Age Renting
1 101 female 33 <NA>
Similarly, the function call dplyr::filter(., Sex == 'male')
can be read “filter the data frame to include rows for which the value of Sex == 'male'
is TRUE
”.
More flexibly, however, you could specify a vector containing acceptable strings using c()
. dplyr::filter(., Sex %in% c('male'))
filters the rows to include only those for which the value for sex
is in the string vector which includes a single string,'male'
whereas dplyr::filter(., Sex %in% c('male', 'Man'))
filters the rows to include only those for which the value for Sex
is in the string vector which includes 'male'
and 'Man'
. Cases containing 'Male'
, 'Men'
(R is a case-sensitive language), or 'female'
, for example, will not be included in the returned data frame because they do not match values in the string vector.
Filtering by a Relative Value
|>
USArrests filter(Murder > 10)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Florida 15.4 335 80 31.9
Georgia 17.4 211 60 25.8
Illinois 10.4 249 83 24.0
Louisiana 15.4 249 66 22.2
Maryland 11.3 300 67 27.8
Michigan 12.1 255 74 35.1
Mississippi 16.1 259 44 17.1
Nevada 12.2 252 81 46.0
New Mexico 11.4 285 70 32.1
New York 11.1 254 86 26.1
North Carolina 13.0 337 45 16.1
South Carolina 14.4 279 48 22.5
Tennessee 13.2 188 59 26.9
Texas 12.7 201 80 25.5
Those rows with Murder
above 10 will be retained, others removed.
Filtering by a Specific Value
|>
USArrests filter(Murder == 12.1)
Murder Assault UrbanPop Rape
Michigan 12.1 255 74 35.1
Those rows with Murder == 12.1
will be retained.
Using a previous example with this data set, we can filter based on a character variable if we had one. Let’s add one.
|>
USArrests mutate(Relative_Murder = ifelse(Murder < mean(Murder), "Below", "Equal or Above")) |>
filter(Relative_Murder == "Below")
Murder Assault UrbanPop Rape Relative_Murder
Connecticut 3.3 110 77 11.1 Below
Delaware 5.9 238 72 15.8 Below
Hawaii 5.3 46 83 20.2 Below
Idaho 2.6 120 54 14.2 Below
Indiana 7.2 113 65 21.0 Below
Iowa 2.2 56 57 11.3 Below
Kansas 6.0 115 66 18.0 Below
Maine 2.1 83 51 7.8 Below
Massachusetts 4.4 149 85 16.3 Below
Minnesota 2.7 72 66 14.9 Below
Montana 6.0 109 53 16.4 Below
Nebraska 4.3 102 62 16.5 Below
New Hampshire 2.1 57 56 9.5 Below
New Jersey 7.4 159 89 18.8 Below
North Dakota 0.8 45 44 7.3 Below
Ohio 7.3 120 75 21.4 Below
Oklahoma 6.6 151 68 20.0 Below
Oregon 4.9 159 67 29.3 Below
Pennsylvania 6.3 106 72 14.9 Below
Rhode Island 3.4 174 87 8.3 Below
South Dakota 3.8 86 45 12.8 Below
Utah 3.2 120 80 22.9 Below
Vermont 2.2 48 32 11.2 Below
Washington 4.0 145 73 26.2 Below
West Virginia 5.7 81 39 9.3 Below
Wisconsin 2.6 53 66 10.8 Below
Wyoming 6.8 161 60 15.6 Below
Filtering Using on a Function
|>
USArrests filter(Murder > mean(Murder))
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Florida 15.4 335 80 31.9
Georgia 17.4 211 60 25.8
Illinois 10.4 249 83 24.0
Kentucky 9.7 109 52 16.3
Louisiana 15.4 249 66 22.2
Maryland 11.3 300 67 27.8
Michigan 12.1 255 74 35.1
Mississippi 16.1 259 44 17.1
Missouri 9.0 178 70 28.2
Nevada 12.2 252 81 46.0
New Mexico 11.4 285 70 32.1
New York 11.1 254 86 26.1
North Carolina 13.0 337 45 16.1
South Carolina 14.4 279 48 22.5
Tennessee 13.2 188 59 26.9
Texas 12.7 201 80 25.5
Virginia 8.5 156 63 20.7
Filtering Missing Data (NAs
)
is.na()
will return a logical vector for which TRUE
represents there are missing values.
Try on the entire data frame…
is.na(DAT)
Id Sex Age Renting
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE TRUE
[3,] FALSE FALSE FALSE FALSE
[4,] FALSE TRUE FALSE TRUE
[5,] FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE
You can see that some columns contain cases/rows with TRUE
indicating the cell contains NA
.
The negation operator, !
, will be used to illustrate some filtering approaches. Because filter()
will filter out FALSE
cases and retain TRUE
ones, so you may sometimes need to negate a function so that you keep the rows you want to keep.
na.omit()
: removes rows with NAsfilter(is.na(column_name))
: keep rows with NA in specific variablefilter(!is.na(column_name))
: remove rows with NA in specific variablefilter(complete.cases(.))
: remove rows with NAs
Filter using na.omit()
|>
DAT na.omit() # omit any rows with NAs
Id Sex Age Renting
1 100 male 25 yes
3 102 Male 27 yes
5 104 man 44 no
6 100 male 25 yes
7 105 neither 40 yes
Filter using is.na()
and !is.na()
|>
DAT filter(is.na(Sex)) # keep NAs by variable
Id Sex Age Renting
1 103 <NA> 40 <NA>
But your goal may likely be to keep everything that is not NA
:
|>
DAT filter(!is.na(Sex)) # remove NAs by variable
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 104 man 44 no
5 100 male 25 yes
6 105 neither 40 yes
And filter step-by-step for each variable using |>
and separate function calls:
|>
DAT filter(!is.na(Sex)) |>
filter(!is.na(Renting))
Id Sex Age Renting
1 100 male 25 yes
2 102 Male 27 yes
3 104 man 44 no
4 100 male 25 yes
5 105 neither 40 yes
So why use separate lines of code if you can use &
all in one line? One reason is that separate function calls written as separate lines of code make code inclusion/exclusion extremely easy.
Comment out what you don’t want using #
:
|>
DAT #filter(., !is.na(Sex)) |>
filter(!is.na(Renting))
Id Sex Age Renting
1 100 male 25 yes
2 102 Male 27 yes
3 104 man 44 no
4 100 male 25 yes
5 105 neither 40 yes
Filter using complete.cases()
The complete.cases()
function returns a logical vector for which TRUE
reflects the row has complete information and no missing cases. Using complete.cases()
along with filter()
, you would retain all rows TRUE
rows.
%>%
DAT ::filter(complete.cases(.)) dplyr
Id Sex Age Renting
1 100 male 25 yes
2 102 Male 27 yes
3 104 man 44 no
4 100 male 25 yes
5 105 neither 40 yes
Removing duplicate rows using distinct()
We will examine some other filtering methods that do not use filter()
specifically but that which perform filtering operations.
dplyr::distinct()
: remove duplicate rowsdplyr::distinct(., column)
: remove duplicate rows by specific columnna.omit()
: remove any row withNA
’s (missing values)
Let’s use the simple DAT
data frame.
DAT
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 104 man 44 no
6 100 male 25 yes
7 105 neither 40 yes
Remember that data frames are composed of rows and columns. As discussed previously we can subset in base R
to illustrate a point.
Row row 1, all columns:
1,] DAT[
Id Sex Age Renting
1 100 male 25 yes
Row row 6, all columns:
6,] DAT[
Id Sex Age Renting
6 100 male 25 yes
Notice that rows 1 and 6 are the same person (e.g., Id
) and have exactly the same data for all variables. If we compare the two logically, we will confirm that all is TRUE
.
1,] == DAT[6,] DAT[
Id Sex Age Renting
1 TRUE TRUE TRUE TRUE
You clearly do not want duplicates of data. So let’s just remove any rows that are identical using distinct()
Removing Duplicates Across All Variables
|>
DAT distinct() # Remove exact duplicates across all columns
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 104 man 44 no
6 105 neither 40 yes
If you know each row is unique based on a variable in the data frame, you can also use distinct()
to remove duplicates for a specific variable. Make sure that this variable specification is actually one that you would not want duplicates of.
Removing Duplicates Based on a Variable
|>
DAT distinct(Id) # Remove duplicates by variable; passes unique values for data frame
Id
1 100
2 101
3 102
4 103
5 104
6 105
But this function simply returns the unique values in Id
. To retain the variables, set .keep_all = T
. If you want to remove duplicates and assign the cleaned data frame to an object, you would likely want to keep all of your variables.
|>
DAT distinct(Id,
.keep_all = T
# )
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 104 man 44 no
6 105 neither 40 yes
Notice, however, this only removed the last instance or Id == 100
. Which row to include is a judgment call. The first, the last, neither, the average? Is there a correct answer?
The Data Manipulation Workflow: Putting It All Together
Of course, all of these functions can be paired with select()
, mutate()
, filter()
, etc. Here is the data manipulation workflow.
|>
dataframe select(...) |> # select variables of interest
mutate(...) |> # then create new variables
filter(...) |> # then filter by rows
group_by(...) |> # then group for subsetting
summarize(...) # then summarize
Many More Use-Case Examples (Optional)
Filter Cases using ==
Filter rows for which the Sex
variable is equal to the string 'female'
:
|>
DAT ::filter(Sex == 'female') dplyr
Id Sex Age Renting
1 101 female 33 <NA>
Filter rows for which the Sex
variable is not equal to the string 'female'
:
|>
DAT ::filter(Sex != 'female') dplyr
Id Sex Age Renting
1 100 male 25 yes
2 102 Male 27 yes
3 104 man 44 no
4 100 male 25 yes
5 105 neither 40 yes
Filter rows for which the Sex
variable is equal to the string 'female'
AND Age
is greater than the numeric 27:
|>
DAT ::filter(Sex == 'female' & Age > 27) # this "AND" that dplyr
Id Sex Age Renting
1 101 female 33 <NA>
Filter rows for which the Sex
variable is equal to the string 'female'
OR Age
is greater than the numeric 27:
|>
DAT ::filter(Sex == 'female' | Age > 27) # this "OR" that dplyr
Id Sex Age Renting
1 101 female 33 <NA>
2 103 <NA> 40 <NA>
3 104 man 44 no
4 105 neither 40 yes
A cleaner method involves separate lines of code. Although cleaner, this will not allow the “OR” option because the data frame that is returned from the first filter()
is passed to the second filter()
and all cases other than "female"
have already been removed from the data frame.
|>
DAT ::filter(Sex == 'female') |> # keep female (and add another pipe)
dplyr::filter(Age >= 27) # keep only those equal to or older than 27 dplyr
Id Sex Age Renting
1 101 female 33 <NA>
Filtering Cases by Value
Filter by <
and >
or <=
or >=
Keep those less than:
|>
DAT ::filter(Age < 40) dplyr
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 100 male 25 yes
Keep older than:
|>
DAT ::filter(Age > 40) dplyr
Id Sex Age Renting
1 104 man 44 no
Keep equal to or older than:
|>
DAT ::filter(Age >= 40) dplyr
Id Sex Age Renting
1 103 <NA> 40 <NA>
2 104 man 44 no
3 105 neither 40 yes
Filter Cases by Conditional X or Y Using |
Operator…
Using the “OR” operator, |
, cases can be included if “this” OR “that” condition.
Filter numbers:
|>
DAT ::filter(Age == 25 | Age == 40) # filter out numeric values IN a range dplyr
Id Sex Age Renting
1 100 male 25 yes
2 103 <NA> 40 <NA>
3 100 male 25 yes
4 105 neither 40 yes
Filter characters:
|>
DAT ::filter(Sex == 'male' | Sex == 'female') dplyr
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 100 male 25 yes
Although dplyr::filter(sex %in% c('male', 'female'))
would be easier.
Filter rows of variables of both types:
|>
DAT ::filter(Sex == 'male' | Age == 27) dplyr
Id Sex Age Renting
1 100 male 25 yes
2 102 Male 27 yes
3 100 male 25 yes
Filter Cases Between Values with between()
Between ages 25 and 33:
|>
DAT ::filter(between(Age, 27, 33)) dplyr
Id Sex Age Renting
1 101 female 33 <NA>
2 102 Male 27 yes
Filter by range using the %in%
operator (this is IN meaning in)
Though less flexible than using between()
, %in%
may be easier to remember:
|>
DAT ::filter(Age %in% 20:43) # filter out numeric values IN a range dplyr
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 100 male 25 yes
6 105 neither 40 yes
One’s age is in the range from 20 through 43.
If a vector object is already defined (e.g., my_levels = c('male', 'female')
), you can use that for filtering also. Such approaches are useful when data manipulation involves reusing a reference as it simplifies coding and reduces errors because the specification is defined only once.
= c('male', 'female')
my_levels
|>
DAT ::filter(Sex %in% my_levels) dplyr
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 100 male 25 yes
Filter by exclusion
When inclusion of variables is inappropriate, exclusion of them may be useful. The !
operator means “NOT” in R
so you can use that to accomplish the opposite of the statement. For example, dplyr::filter(., !sex %in% c('male', NA))
will “filter the data frame to include rows in the sex
column for which the value is NOT in the vector”.
Exclude rows in the Sex
variable that are NA
or 'male'
:
|>
DAT ::filter(!Sex %in% c('male', NA)) # keep only if NOT in vector dplyr
Id Sex Age Renting
1 101 female 33 <NA>
2 102 Male 27 yes
3 104 man 44 no
4 105 neither 40 yes
Exclude rows in the Sex
variable that are Men
or 'male'
:
|>
DAT ::filter(!Sex %in% c('male', 'Men')) # keep only if NOT in vector dplyr
Id Sex Age Renting
1 101 female 33 <NA>
2 102 Male 27 yes
3 103 <NA> 40 <NA>
4 104 man 44 no
5 105 neither 40 yes
Filter by conditional X and Y using &
operator…
By range:
|>
DAT ::filter(Id >= 102 & Age <= 43) dplyr
Id Sex Age Renting
1 102 Male 27 yes
2 103 <NA> 40 <NA>
3 105 neither 40 yes
|>
DAT ::filter(Age >= 20 & Age <= 43) dplyr
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 100 male 25 yes
6 105 neither 40 yes
Note: Age 20:43 won’t work. Can you figure out why?
Filter Cases Containing Characters Using stringr::str_detect()
and %in%
:
If you want to filter cases that contain certain characters, unfortunately, you cannot use contains()
as you would for variable names. For matching characters in levels of variables, you’ll need something like stringr::str_detect()
or grepl()
.
stringr::str_detect()
: returns matching conditions
The example below uses str_detect()
to detect the presence of a character match and returns a logic vector with TRUE
s for matching cases. When paired with filter()
, the data frame is filtered by to contains cases that match the pattern for the variable.
Example: stringr::str_detect(my_variable, "pattern")
:
Let’s look for levels of Sex
for which "ma"
is detected.
|>
DAT ::filter(stringr::str_detect(Sex, "ma")) dplyr
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 104 man 44 no
4 100 male 25 yes
But the case for which Sex = Male
is now missing. This is because stringr::str_detect()
is a case-sensitive pattern match.
You can fix this is a couple ways:
- make the cases in
Sex
all lower case tomutate()
the fix or - to wrap
Sex
intolower()
to make cases lowercase.
The first option might be better if you want to fix the problem in the data frame.
Other casing functions are:
tolower()
: returns lower case of string; see alsostringr::str_to_lower()
toupper()
: returns upper case of string; see alsostringr::str_to_upper()
tools::toTitleCase()
: returns Title Case (capitalize first letter of string)
|>
DAT filter(stringr::str_detect(tolower(Sex), "ma"))
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 104 man 44 no
5 100 male 25 yes
For some, reading this code may be difficult. In order to understand the code, you need to read it inside out like slices of an onion. First understand the first function as the returned value is passed to the next function and so forth.
For example, elements of the Sex
vector are converted to lower case and returned using tolower()
. We can demonstrate this using only the vector from the data frame using the $
operator, DAT$Sex
.
The initial vector:
$Sex DAT
[1] "male" "female" "Male" NA "man" "male" "neither"
To lowercase:
tolower(DAT$Sex)
[1] "male" "female" "male" NA "man" "male" "neither"
Next, stringr::str_detect()
will return a vector of TRUE
or FALSE
for all element positions of DAT$Sex
for which the (now lowercase) string matches "ma"
:
::str_detect(tolower(DAT$Sex), "ma") stringr
[1] TRUE TRUE TRUE NA TRUE TRUE FALSE
Another approach would be to perform the steps in pieces by first mutating the data frame so that elements in the Sex
vector are converted to lowercase and then pipe that mutated data from to filter()
.
Briefly, we will introduce dplyr::mutate()
to accomplish this goal. mutate()
is used to add, update, etc. variables in a data frame or tibble.
Mutate Sex
using mutate()
:
|>
DAT mutate(Sex = tolower(Sex))
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 male 27 yes
4 103 <NA> 40 <NA>
5 104 man 44 no
6 100 male 25 yes
7 105 neither 40 yes
Now mutate and pipe that mutated data frame using %>%
(from {margittr}) or |>
(base R
version 4.1+) to filter()
:
|>
DAT mutate(Sex = tolower(Sex)) |>
filter(stringr::str_detect(Sex, "ma"))
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 male 27 yes
4 104 man 44 no
5 100 male 25 yes
Pairing this piping approach with toTitleCase()
may be better as only the first letter will be capitalized and you are looking for “Ma”.
|>
DAT mutate(Sex = tools::toTitleCase(Sex)) |>
filter(stringr::str_detect(Sex, "Ma"))
Id Sex Age Renting
1 100 Male 25 yes
2 102 Male 27 yes
3 104 Man 44 no
4 100 Male 25 yes
But notice the male
and man
issue is still a problem.
So what are the unique elements in DAT$Sex
? Your friend here is unique()
. And maybe you are too.
unique(DAT$Sex)
[1] "male" "female" "Male" NA "man" "neither"
You can hard code a fix using %in%
and case_when()
. case_when()
is {dplyr}’s solution for multiple-case conditional.
|>
DAT mutate(Sex = tolower(Sex), # convert to lowercase first
Sex = case_when( # then recode cases when matched
%in% c("male", "man") ~ "Male",
Sex %in% c("female", "woman") ~ "Female"
Sex ))
Id Sex Age Renting
1 100 Male 25 yes
2 101 Female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 104 Male 44 no
6 100 Male 25 yes
7 105 <NA> 40 yes
For a more flexible fix, you can use str_detect()
:
BUT beware that the order of operations matters with this approach. Because the string “female” contains characters “ma”, you could accidentally recode all “female” cases to “male” if you perform the case_when()
conversion on “male” first.
|>
DAT mutate(Sex = tolower(Sex), # convert first
Sex = case_when( # then recode
::str_detect(Sex, "fe") ~ "Female",
stringr::str_detect(Sex, "ma") ~ "Male",
stringr ))
Id Sex Age Renting
1 100 Male 25 yes
2 101 Female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 104 Male 44 no
6 100 Male 25 yes
7 105 <NA> 40 yes
Note that conditions in not declared in case_when()
will be recoded as NA
, which is what happens for "neither"
. Make sure to include all ways you wish you recode cases.
Session Information
sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] magrittr_2.0.3 tidyselect_1.2.0 stringr_1.5.1 dplyr_1.1.4
loaded via a namespace (and not attached):
[1] vctrs_0.6.5 cli_3.6.1 knitr_1.45
[4] rlang_1.1.1 xfun_0.40 stringi_1.7.12
[7] renv_1.0.3 generics_0.1.3 jsonlite_1.8.7
[10] glue_1.6.2 rprojroot_2.0.3 htmltools_0.5.7
[13] fansi_1.0.4 rmarkdown_2.25 tibble_3.2.1
[16] evaluate_0.21 fastmap_1.1.1 yaml_2.3.7
[19] lifecycle_1.0.3 BiocManager_1.30.22 compiler_4.3.2
[22] pkgconfig_2.0.3 htmlwidgets_1.6.4 here_1.0.1
[25] rstudioapi_0.15.0 R.oo_1.25.0 R.utils_2.12.2
[28] digest_0.6.33 R6_2.5.1 utf8_1.2.4
[31] pillar_1.9.0 R.methodsS3_1.8.2 withr_2.5.0
[34] tools_4.3.2