source(here::here("r", "my_functions.R"))
Data frame manipulation and wrangling
This page is a work in progress and may contain areas that need more detail or that required syntactical, grammatical, and typographical changes. If you find some part requiring some editing, please let me know so I can fix it for you.
Overview
This module demonstrates how to use {dplyr} to wrangle data frames. There is much that you can do with the library and this module focuses on selecting, filtering, and arranging data frames or tibbles. Helper functions from libraries like {magrittr}, {stringr}, and {tidyselect} are also used in conjunction with {dplyr}.
Libraries
- {here}: 1.0.1: for path management
- {dplyr} 1.1.2: for selecting, filtering, and mutating
- {magrittr} 2.0.3: for code clarity and piping data frame objects
- {stringr}: 1.5.0: for working with strings
- {tidyselect}: 1.2.0: for selecting sets from strings
External Functions
Provided in class:
view_html()
: for viewing data frames in html format, from /r/my_functions.R
Libraries
We will work with a few different libraries for data manipulation. Let’s load them into our work space using library()
.
library(magrittr)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(stringr)
Make note of any warnings that appear when loading libraries. There are some libraries that contain functions with the same names. Be aware that the most recently loaded library function will take precedence. You can avoid confusion using ::
to call a function from a particular library (e.g., libraryname::functionname()
).
A Grammar for Data Wrangling
The {dplyr} package presents a type of grammar for wrangling data (Wickham and Francois 2014). {dplyr} is part of the {tidyverse} ecosystem which loads using library(tidyverse)
, or specifically using library(dplyr)
. {dplyr} also imports functions from {tidyselect}, which is also part of the {tidyverse} ecosystem.
The functions from any R
package can be called after loading the package library. One does not need to reference the package explicitly when calling a function but to avoid confusion, we will often reference the package/library and the function together (e.g., package::function()
).
Working with data involves creating new column variables, removing or renaming column variables, removing or deleting row observations, sorting, and summarizing data (often by groups/factors). etc. Consequently, there are five main function verbs for working with data in a data frame: select
, filter
, mutate
, arange
, and summarize
. Their corresponding functions are:
select(dataframe, variables_to_select)
: subset by columnsmutate(dataframe, variables_to_create)
anddplyr::rename()
: add or modify existing columnsfilter(dataframe, rows_to_select)
: subset by rowsarrange(dataframe, variable_to_sort_by)
: sort rowssummarize(dataframe, procedures_for_summarizing)
in conjunction withdplyr::group_by()
: aggregate the data in some way
Some Common Ways for Selecting Variables Using {dplyr}
You can reference variables by their names or by their column position in a data frame (e.g., 1, 25, etc.).
Using select()
, you can select columns/variables from a data frame. The variables you select are retained and variables omitted that you don’t select are omitted in the returned data frame.
If you are not piping a data frame from a previous object using either |>
(now in base R
) or %>%
(from {magrittr}), the first argument passed into select()
will be the data frame. The second and subsequent arguments can be column variables you wish to select from the data frame.
select(mydataframe, myvars)
If piping a data frame with {magrittr}, .
or .data
will serve to reference the inherited data frame. This detail is not needed but adding it serves as a reminder that the data frame is the first argument of the function.
dataframe %>%
select(., myvars)
Variables can be passed by their column position.
Passing variables separately by name:
%>%
USArrests select(., 1, 3) %>% head()
Murder UrbanPop
Alabama 13.2 58
Alaska 10.0 48
Arizona 8.1 80
Arkansas 8.8 50
California 9.0 91
Colorado 7.9 78
Variables can be passed separately without quotes or collectively as a character vector.
Passing variables separately by name:
%>%
USArrests select(., Murder, Assault) %>% head()
Murder Assault
Alabama 13.2 236
Alaska 10.0 263
Arizona 8.1 294
Arkansas 8.8 190
California 9.0 276
Colorado 7.9 204
Note: passing variables by name does not work with base R
manipulation, you need to pass the name as a string.
Passing variables separately as characters:
%>%
USArrests select(., "Murder", "Assault") %>% head()
Murder Assault
Alabama 13.2 236
Alaska 10.0 263
Arizona 8.1 294
Arkansas 8.8 190
California 9.0 276
Colorado 7.9 204
Passing a character vector (more like you would do in base R
):
%>%
USArrests select(., c("Murder", "Assault")) %>% head()
Murder Assault
Alabama 13.2 236
Alaska 10.0 263
Arizona 8.1 294
Arkansas 8.8 190
California 9.0 276
Colorado 7.9 204
Passing a numeric vector of column positions:
%>%
USArrests select(., c(1, 3)) %>% head()
Murder UrbanPop
Alabama 13.2 58
Alaska 10.0 48
Arizona 8.1 80
Arkansas 8.8 50
California 9.0 91
Colorado 7.9 78
Passing an object holding a character vector containing two elements:
First, creat the vector object:
<- c("Murder", "Assault") keep_vars
Then pass the vector into the function:
%>%
USArrests select(., keep_vars) %>% head()
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(keep_vars)
# Now:
data %>% select(all_of(keep_vars))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
Murder Assault
Alabama 13.2 236
Alaska 10.0 263
Arizona 8.1 294
Arkansas 8.8 190
California 9.0 276
Colorado 7.9 204
Select Variables Starting with or Ending with Certain Characters
One thing about {dplyr}, when you load the library, there are functions from other libraries that are imported along with {dplyr}’s own functions. These important functions are designed to work with each other, so the people who maintain the libraries have packaged them up nicely so you don’t have to load separate libraries.
Many of the functions are imported from the {tidyselect} library and these functions are what give you additional manipulation ability. Some imported functions are: all_of()
, any_of()
, contains()
, ends_with()
, everything()
, last_col()
, matches()
, and starts_with()
.
With functions like starts_with()
, contains()
, and ends_with()
, you can select variables with character patterns in their names.
Rather that passing the names of the variables as the second argument (e.g., c("Murder", "Assault")
), you would pass the helper function, say starts_with()
. Whatever starts_with()
returns is what gets passed to select()
as the variables. This is what is referred to as functional programming. Rather than coding specifically what to do, you with utilize the task of another function to passed its returned object as an argument to another function.
But first, we need to see what these functions, like starts_with()
, are doing. For more information, use ?starts_with
.
starts_with(match, ignore.case = TRUE, vars = NULL)
Notice the arguments we need to pass:
match
: A character vectorignore.case
: IfTRUE
, the default, ignores case when matching names. This is most flexible.vars
: A character vector of variable names. If not supplied, the variables are taken from the current selection context (as established by functions likeselect()
orpivot_longer()
).
Let’s just try out starts_with()
on its own. Let’s set a required pattern match = some character
and because vars = NULL
by default, let’s just set vars = some character vector
. Note that vars
is not the second argument, so you will want to name it in the function call.
starts_with(match = "a", vars = c("Hello", "Hi", "Bye"))
integer(0)
Returns integer(0)
which is speak for “there is no match”. Hmm, OK. Let’s try another character.
starts_with(match = "b", vars = c("Hello", "Hi", "Bye"))
[1] 3
OK, so now an integer is returned (yes, try is.integer()
if you don’t believe me).
is.integer(starts_with("b", vars = c("Hello", "Hi", "Bye")))
[1] TRUE
Importantly, the value refers to the element index/position in the vars
vector. Because the third string "Bye"
starts with "b"
, that’s what is returned.
Try something else:
starts_with("h", vars = c("Hello", "Hi", "Bye"))
[1] 1 2
Now a vector with length = 2 is returned, representing both the first and the second elements start with “h”.
length(starts_with("h", vars = c("Hello", "Hi", "Bye")))
[1] 2
See, it’s really a vector containing the element(s) of the vars
vector matching the pattern.
And yes, this the letter casing is ignored because the default ignore.case = TRUE
. Set to FALSE
if you want your match to be case sensitive.
starts_with("h",
vars = c("Hello", "Hi", "Bye"),
ignore.case = F
)
integer(0)
OK, no matches.
You will typically use starts_with()
along with other functions. When using starts_with()
in the context of select()
, the vars
argument is essentially passing vars = the names of the columns of the data frame passed to select()
.
Example:
select(mydataframe,
starts_with(match = "my pattern",
vars = "var names of mydataframe")
)
Passing the data frame
into select()
without piping it using %>%
:
select(USArrests, starts_with("m")) %>% head()
Murder
Alabama 13.2
Alaska 10.0
Arizona 8.1
Arkansas 8.8
California 9.0
Colorado 7.9
Piping the data frame
into select()
:
%>%
USArrests select(., starts_with("m")) %>% head()
Murder
Alabama 13.2
Alaska 10.0
Arizona 8.1
Arkansas 8.8
California 9.0
Colorado 7.9
Another example:
%>%
USArrests select(., ends_with("t")) %>% head()
Assault
Alabama 236
Alaska 263
Arizona 294
Arkansas 190
California 276
Colorado 204
Selecting and Selecting Out Variables By/Between Index
There are many approaches for selecting or selecting out column variables. You can pass multiple arguments for each specification or you can pass a single vector that contains all specifications.
select(., 1,2)
: select first and second columnsselect(., c(1,2))
: select first and second columnsselect(., -c(1,2))
: select out first and second columnsselect(., 1:2)
: select first through second columnsselect(., c(1:2))
: select first through second columnsselect(., -c(1:2))
: select out first through second columns
Potential Recommendation: use options utilizing c()
to pass a vector because this habit will be more versatile with base R
functionality. However, online solutions will likely not take this approach.
Let’s make a data frame to work with first.
<- data.frame(
DAT Id = c(100, 101, 102, 103, 104, 100, 105),
Sex = c('male', 'female', 'Male', NA, 'man', "male", "neither"),
Age = c(25, 33, 27, 40, 44, 25, 40),
Renting = c("yes", NA, "yes", NA, "no", "yes", "yes")
)
Select columns 1 and 2:
%>%
DAT select(., 1,2)
Id Sex
1 100 male
2 101 female
3 102 Male
4 103 <NA>
5 104 man
6 100 male
7 105 neither
Select columns 1 and 2 as a vector containing values 1 and 2:
%>%
DAT select(., c(1,2))
Id Sex
1 100 male
2 101 female
3 102 Male
4 103 <NA>
5 104 man
6 100 male
7 105 neither
Select out columns 1 and 2 as a vector containing values 1 and 2:
%>%
DAT select(., -c(1,2))
Age Renting
1 25 yes
2 33 <NA>
3 27 yes
4 40 <NA>
5 44 no
6 25 yes
7 40 yes
Select from columns 1 to 2 using the :
operator:
%>%
DAT select(., 1:2)
Id Sex
1 100 male
2 101 female
3 102 Male
4 103 <NA>
5 104 man
6 100 male
7 105 neither
Select from columns 1 through 3 using a vector containing the :
operator:
%>%
DAT select(., c(1:3))
Id Sex Age
1 100 male 25
2 101 female 33
3 102 Male 27
4 103 <NA> 40
5 104 man 44
6 100 male 25
7 105 neither 40
Select out columns 1 through 3 using a vector containing the :
operator:
%>%
DAT select(., -c(1:3)) # select out from here to there
Renting
1 yes
2 <NA>
3 yes
4 <NA>
5 no
6 yes
7 yes
Selecting and Selecting Out Variables By or Between Character Name
These approaches are similar to those offered earlier except that some involve passing variables by their name (e.g., character names). Whereas the order of the variables in a data frame may move around, the names may be more stable or permanent, at least after you have cleaned up the names. Consequently, passing variables by name may be more foolproof.
You don’t have to be familiar with all approaches and you may settle on using one that makes the most sense to you.
select(., "var1", "var2")
select(., c("var1", "var2"))
select(., -c("var1", "var2"))
select(., var1:var2))
select(., c("var1":"var2))
select(., -c("var1":"var2))
Recommendation: use options utilizing c()
as this will be more versatile with base R
functionality.
These approaches also work but they may lead to some confusion regarding usage of quotes:
select(., var1, var2)
select(., c(var1, var2))
select(., -c(var1, var2))
Select variables Id
though Age
using the :
operator:
%>%
DAT select(., Id:Age) # select from here to there
Id Sex Age
1 100 male 25
2 101 female 33
3 102 Male 27
4 103 <NA> 40
5 104 man 44
6 100 male 25
7 105 neither 40
Select variables Id
though Age
passed as strings using the :
operator:
%>%
DAT select(., "Id":"Age") # select from here to there
Id Sex Age
1 100 male 25
2 101 female 33
3 102 Male 27
4 103 <NA> 40
5 104 man 44
6 100 male 25
7 105 neither 40
Select variables Id
though Age
as a vector containing the variable names passed as strings and using the :
operator:
%>%
DAT select(., c("Id":"Age")) # select from here to there
Id Sex Age
1 100 male 25
2 101 female 33
3 102 Male 27
4 103 <NA> 40
5 104 man 44
6 100 male 25
7 105 neither 40
Select out variables Id
though Age
as a vector containing the variable names passed as strings and using the :
operator:
%>%
DAT select(., -c("Id":"Age")) # select out from here to there
Renting
1 yes
2 <NA>
3 yes
4 <NA>
5 no
6 yes
7 yes
You can also use the !
operator to select NOT these variables (therefore, all others)
%>%
DAT select(., !c("Id":"Age")) # select out from here to there
Renting
1 yes
2 <NA>
3 yes
4 <NA>
5 no
6 yes
7 yes
Selecting and Selecting Out Variables Characters in Their Names
select(., starts_with("characters"))
select(., ends_with("characters"))
select(., contains('e'))
Select variables which start with character “i”:
%>% select(., starts_with('i')) DAT
Id
1 100
2 101
3 102
4 103
5 104
6 100
7 105
Select variables which DO NOT start with character “s”:
%>% select(., -starts_with('s')) DAT
Id Age Renting
1 100 25 yes
2 101 33 <NA>
3 102 27 yes
4 103 40 <NA>
5 104 44 no
6 100 25 yes
7 105 40 yes
Select variables which end with character “e”:
%>% select(., ends_with('e')) DAT
Age
1 25
2 33
3 27
4 40
5 44
6 25
7 40
Select variables which end with character “e”:
%>% select(., -ends_with('e')) DAT
Id Sex Renting
1 100 male yes
2 101 female <NA>
3 102 Male yes
4 103 <NA> <NA>
5 104 man no
6 100 male yes
7 105 neither yes
Select variables which contain character “g”:
%>% select(., contains('g')) DAT
Age Renting
1 25 yes
2 33 <NA>
3 27 yes
4 40 <NA>
5 44 no
6 25 yes
7 40 yes
Select variables which DO NOT contain character “g”:
%>% select(., -contains('g')) DAT
Id Sex
1 100 male
2 101 female
3 102 Male
4 103 <NA>
5 104 man
6 100 male
7 105 neither
Select variables containing a regular expression, use matches()
:
.*
will grab all names because it means any character and any number of times
%>% select(., matches(".*")) DAT
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 104 man 44 no
6 100 male 25 yes
7 105 neither 40 yes
\\d
will grab all variables containing a digit:
%>%
DAT mutate(.,
var_1 = 1,
var_11 = 1,
var_3 = 1,
var1_var = 1
%>%
) select(., matches("\\d"))
var_1 var_11 var_3 var1_var
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
4 1 1 1 1
5 1 1 1 1
6 1 1 1 1
7 1 1 1 1
v.*\\d
will grab all variables that start with v and then contain any characters which are followed by a digit:
%>%
DAT mutate(.,
var_1 = 1,
var_11 = 1,
var_3 = 1,
var1_var = 1
%>%
) select(., matches("v.*\\d"))
var_1 var_11 var_3 var1_var
1 1 1 1 1
2 1 1 1 1
3 1 1 1 1
4 1 1 1 1
5 1 1 1 1
6 1 1 1 1
7 1 1 1 1
\\d$
will grab all variables ending in a digit ($
means end):
%>%
DAT mutate(.,
var_1 = 1,
var_11 = 1,
var_3 = 1,
var1_var = 1
%>%
) select(., matches("\\d$"))
var_1 var_11 var_3
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
7 1 1 1
You can also negate all regular expression matches if you want to exclude:
%>%
DAT mutate(.,
var_1 = 1,
var_11 = 1,
var_3 = 1,
var1_var = 1
%>%
) select(., -matches("\\d$"))
Id Sex Age Renting var1_var
1 100 male 25 yes 1
2 101 female 33 <NA> 1
3 102 Male 27 yes 1
4 103 <NA> 40 <NA> 1
5 104 man 44 no 1
6 100 male 25 yes 1
7 105 neither 40 yes 1
Note: The functions will return lowercase and uppercase variable name matches because the default behavior is ignore.case = TRUE
. Set to FALSE
if you want to perform precise surgery on the variables.
Selecting and Selecting Out Variables by Type
Select variables that are numeric:
%>% select(., where(is.numeric)) DAT
Id Age
1 100 25
2 101 33
3 102 27
4 103 40
5 104 44
6 100 25
7 105 40
Select variables that are NOT numeric:
%>% select(., -where(is.numeric)) DAT
Sex Renting
1 male yes
2 female <NA>
3 Male yes
4 <NA> <NA>
5 man no
6 male yes
7 neither yes
Select variables that are character:
%>% select(., where(is.character)) DAT
Sex Renting
1 male yes
2 female <NA>
3 Male yes
4 <NA> <NA>
5 man no
6 male yes
7 neither yes
Select variables that are NOT character:
%>% select(., -where(is.character)) DAT
Id Age
1 100 25
2 101 33
3 102 27
4 103 40
5 104 44
6 100 25
7 105 40
Select variables that are logical (TRUE
to FALSE
):
%>% select(., where(is.logical)) DAT
data frame with 0 columns and 7 rows
Select variables that are NOT logical (TRUE
to FALSE
):
%>% select(., -where(is.logical)) DAT
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 104 man 44 no
6 100 male 25 yes
7 105 neither 40 yes
Cleaning Data
Data files are messy and as a result require cleaning. You will have missing rows, incorrect variable names, files with columns named the same, NA
s, strings for numbers, duplicate rows of data, people who completed a survey twice, and all sorts of unimaginable and unbelievable data problems. So cleaning is important.
Whereas select()
is used for columns, filter()
operates on rows. Data frame manipulation may involve keeping only certain rows for data, for example, male or female respondents, male respondents, those who do not contain missing values (e.g., NA
s) for a specific column variable, who are of a certain age (or born in in certain year), who are above (or below) some acceptable criterion, etc.
When a column variable has more than one value (e.g., check using unique()
to determine the unique elements contained), you may wish to filter on some but not others.
You may even need to filter rows in a data frame that are distinct (e.g., not duplicate responses). This is often a good first step in order to determine the size of the usable data set. dplyr::distinct()
makes de-duplicating easy as this function will return only distinct rows.
Removing duplicate rows using distinct()
dplyr::distinct()
: remove duplicate rowsdplyr::distinct(., column)
: remove duplicate rows by columnna.omit()
: remove any row with NA’s (missing values)
Let’s use the simple DAT
data frame.
# or with %>% print() DAT
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 104 man 44 no
6 100 male 25 yes
7 105 neither 40 yes
Notice that rows 1 and 6 are the same person (e.g., Id) and have exactly the same data for all variables.
1,] == DAT[6,] DAT[
Id Sex Age Renting
1 TRUE TRUE TRUE TRUE
Great that the rows are consistent but you don’t want their data twice. So let’s just remove any rows that are identical.
%>%
DAT distinct(.) #%>% # Remove exact duplicates
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 104 man 44 no
6 105 neither 40 yes
If you know each row is unique based on a variable in the data frame, you can also use distinct()
to remove duplicates for a specific variable. Make sure that this variable specification is actually one that you would not want duplicates of.
%>%
DAT distinct(., Id) # %>% view_html(.) # Remove duplicates by variable; passes unique values for data frame
Id
1 100
2 101
3 102
4 103
5 104
6 105
But this function simply returns the unique values in Id
. To retain the variables, set .keep_all = T
. If you want to remove duplicates and assign the cleaned data frame to an object, you would likely want to keep all of your variables.
%>%
DAT distinct(., Id, .keep_all = T) #%>% view_html(.)
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 104 man 44 no
6 105 neither 40 yes
Notice, however, this only removed the last instance or Id == 100
. Which row to include is a judgment call. The first, the last, neither, the average? Is there a correct answer?
Filtering using {dplyr} and Understanding Filtering Operators
Filtering cases using the dplyr::filter()
verbs works by removing rows that do not match a specific criterion and then by returning the data frame that omits the mismatched condition.
Some useful filtering operators and functions include: ==
, >
, >=
, &
, |
, !
, xor()
, c()
, is.na()
, between()
, near()
.
Row/Observations/Cases can be filtered to “include” only certain matched conditions or can be filtered to “exclude” by negating those matched conditions. If the column variable Sex
is in the data frame and cases are 'male'
, 'men'
, 'female'
, 'women'
, 'neither'
, NA
, etc., you can specify the column Sex
variable and then the row matching condition(s).
The first argument in dplyr::filter()
is a data frame, and the function all dplyr::filter(DAT, Sex == 'female')
will filter the data frame named DAT
to include rows for which the sex
column equals 'female'
. In other words, TRUE
rows.
::filter(DAT, Sex == 'female') dplyr
Id Sex Age Renting
1 101 female 33 <NA>
Similarly, the function call dplyr::filter(., Sex == 'male')
can be read “filter the data frame to include rows for which the value of Sex == 'male'
is TRUE
”.
More flexibly, however, you could specify a vector containing acceptable strings using c()
. dplyr::filter(., Sex %in% c('male'))
filters the rows to include only those for which the value for sex
is in the string vector which includes a single string,'male'
whereas dplyr::filter(., Sex %in% c('male', 'Man'))
filters the rows to include only those for which the value for Sex
is in the string vector which includes 'male'
and 'Man'
. Cases containing 'Male'
, 'Men'
(R is a case-sensitive language), or 'female'
, for example, will not be included in the returned data frame because they do not match values in the string vector.
Piping Multiple Filter Function Calls
In many cases, data filtering will involve different conditions for different column variables, so specifying them separately as separate lines of code is most appropriate.
When passing a data frame using %>%
from {magrittr}, the first argument for the data frame can be specified using a .
because the function inherits the data frame manipulated. However, {dplyr} also understand this so the .
can also be omitted for convenience; this is the general practice you will see in forums like stackoverflow.com. Example to follow.
Filtering Cases by Character Names/String Values
Filter Cases using ==
Filter rows for which the Sex
variable is equal to the string 'female'
:
%>%
DAT ::filter(., Sex == 'female') dplyr
Id Sex Age Renting
1 101 female 33 <NA>
Filter rows for which the Sex
variable is not equal to the string 'female'
:
%>%
DAT ::filter(., Sex != 'female') dplyr
Id Sex Age Renting
1 100 male 25 yes
2 102 Male 27 yes
3 104 man 44 no
4 100 male 25 yes
5 105 neither 40 yes
Filter rows for which the Sex
variable is equal to the string 'female'
AND Age
is greater than the numeric 27:
%>%
DAT ::filter(., Sex == 'female' & Age > 27) # this "AND" that dplyr
Id Sex Age Renting
1 101 female 33 <NA>
Filter rows for which the Sex
variable is equal to the string 'female'
OR Age
is greater than the numeric 27:
%>%
DAT ::filter(., Sex == 'female' | Age > 27) # this "OR" that dplyr
Id Sex Age Renting
1 101 female 33 <NA>
2 103 <NA> 40 <NA>
3 104 man 44 no
4 105 neither 40 yes
A cleaner method involves separate lines of code. Although cleaner, this will not allow the “OR” option because the data frame that is returned from the first filter()
is passed to the second filter()
and all cases other than "female"
have already been removed from the data frame.
%>%
DAT ::filter(., Sex == 'female') %>% # keep female (and add another pipe)
dplyr::filter(., Age >= 27) # keep only those equal to or older than 27 dplyr
Id Sex Age Renting
1 101 female 33 <NA>
Filtering Cases by Value
Filter by <
and >
or <=
or >=
…
%>% dplyr::filter(., Age < 40) # keep those less than DAT
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 100 male 25 yes
%>% dplyr::filter(., Age > 40) # keep older than DAT
Id Sex Age Renting
1 104 man 44 no
%>% dplyr::filter(., Age >= 40) # keep equal to or older than DAT
Id Sex Age Renting
1 103 <NA> 40 <NA>
2 104 man 44 no
3 105 neither 40 yes
Filter Cases by Conditional X or Y Using |
Operator…
Using the “OR” operator, |
, cases can be included if “this” OR “that” condition.
Filter numbers:
%>%
DAT ::filter(., Age == 25 | Age == 40) # filter out numeric values IN a range dplyr
Id Sex Age Renting
1 100 male 25 yes
2 103 <NA> 40 <NA>
3 100 male 25 yes
4 105 neither 40 yes
Filter characters:
%>%
DAT ::filter(., Sex == 'male' | Sex == 'female') dplyr
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 100 male 25 yes
Although dplyr::filter(sex %in% c('male', 'female'))
would be easier.
Filter rows of variables of both types:
%>%
DAT ::filter(., Sex == 'male' | Age == 27) dplyr
Id Sex Age Renting
1 100 male 25 yes
2 102 Male 27 yes
3 100 male 25 yes
Filter Cases Between Values with between()
Between ages 25 and 33:
%>%
DAT ::filter(., between(Age, 27, 33)) dplyr
Id Sex Age Renting
1 101 female 33 <NA>
2 102 Male 27 yes
Filter by range using the %in%
operator (this is IN meaning in)
Though less flexible than using between()
, %in%
may be easier to remember:
%>%
DAT ::filter(., Age %in% 20:43) # filter out numeric values IN a range dplyr
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 100 male 25 yes
6 105 neither 40 yes
One’s age is in the range from 20 through 43.
If a vector object is already defined (e.g., my_levels = c('male', 'female')
), you can use that for filtering also. Such approaches are useful when data manipulation involves reusing a reference as it simplifies coding and reduces errors because the specification is defined only once.
= c('male', 'female')
my_levels
%>%
DAT ::filter(., Sex %in% my_levels) dplyr
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 100 male 25 yes
Filter by exclusion
When inclusion of variables is inappropriate, exclusion of them may be useful. The !
operator means “NOT” in R
so you can use that to accomplish the opposite of the statement. For example, dplyr::filter(., !sex %in% c('male', NA))
will “filter the data frame to include rows in the sex
column for which the value is NOT in the vector”.
Exclude rows in the Sex
variable that are NA
or 'male'
:
%>%
DAT ::filter(., !Sex %in% c('male', NA)) # keep only if NOT in vector dplyr
Id Sex Age Renting
1 101 female 33 <NA>
2 102 Male 27 yes
3 104 man 44 no
4 105 neither 40 yes
Exclude rows in the Sex
variable that are Men
or 'male'
:
%>%
DAT ::filter(., !Sex %in% c('male', 'Men')) # keep only if NOT in vector dplyr
Id Sex Age Renting
1 101 female 33 <NA>
2 102 Male 27 yes
3 103 <NA> 40 <NA>
4 104 man 44 no
5 105 neither 40 yes
Filter by conditional X and Y using &
operator…
By range:
%>%
DAT ::filter(., Id >= 102 & Age <= 43) dplyr
Id Sex Age Renting
1 102 Male 27 yes
2 103 <NA> 40 <NA>
3 105 neither 40 yes
%>%
DAT ::filter(., Age >= 20 & Age <= 43) dplyr
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 100 male 25 yes
6 105 neither 40 yes
Note: Age 20:43 won’t work. Can you figure out why?
Filter Cases Containing Characters Using stringr::str_detect()
and %in%
:
If you want to filter cases that contain certain characters, unfortunately, you cannot use contains()
as you would for variable names. For matching characters in levels of variables, you’ll need something like stringr::str_detect()
or grepl()
.
stringr::str_detect()
: returns matching conditions
The example below uses str_detect()
to detect the presence of a character match and returns a logic vector with TRUE
s for matching cases. When paired with filter()
, the data frame is filtered by to contains cases that match the pattern for the variable.
Example: stringr::str_detect(my_variable, "pattern")
:
Let’s look for levels of Sex
for which "ma"
is detected.
%>% dplyr::filter(., stringr::str_detect(Sex, "ma")) DAT
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 104 man 44 no
4 100 male 25 yes
But the case for which Sex = Male
is now missing. This is because stringr::str_detect()
is a case-sensitive pattern match.
You can fix this is a couple ways:
- make the cases in
Sex
all lower case tomutate()
the fix or - to wrap
Sex
intolower()
to make cases lowercase.
The first option might be better if you want to fix the problem in the data frame.
Other casing functions are:
tolower()
: returns lower case of stringtoupper()
: returns upper case of stringtools::toTitleCase()
: returns Title Case (capitalize first letter of string)
%>%
DAT filter(., stringr::str_detect(tolower(Sex), "ma"))
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 104 man 44 no
5 100 male 25 yes
For some, reading this code may be difficult. In order to understand the code, you need to read it inside out like slices of an onion. First understand the first function as the returned value is passed to the next function and so forth.
For example, elements of the Sex
vector are converted to lower case and returned using tolower()
. We can demonstrate this using only the vector from the data frame using the $
operator, DAT$Sex
.
The initial vector:
$Sex DAT
[1] "male" "female" "Male" NA "man" "male" "neither"
To lowercase:
tolower(DAT$Sex)
[1] "male" "female" "male" NA "man" "male" "neither"
Next, stringr::str_detect()
will return a vector of TRUE
or FALSE
for all element positions of DAT$Sex
for which the (now lowercase) string matches "ma"
:
::str_detect(tolower(DAT$Sex), "ma") stringr
[1] TRUE TRUE TRUE NA TRUE TRUE FALSE
Another approach would be to perform the steps in pieces by first mutating the data frame so that elements in the Sex
vector are converted to lowercase and then pipe that mutated data from to filter()
.
Briefly, we will introduce dplyr::mutate()
to accomplish this goal. mutate()
is used to add, update, etc. variables in a data frame or tibble.
Mutate Sex
using mutate()
:
%>%
DAT mutate(., Sex = tolower(Sex))
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 male 27 yes
4 103 <NA> 40 <NA>
5 104 man 44 no
6 100 male 25 yes
7 105 neither 40 yes
Now mutate and pipe that mutated data frame using %>%
(from {margittr}) or |>
(base R
version 4.1+) to filter()
:
%>%
DAT mutate(., Sex = tolower(Sex)) %>%
filter(., stringr::str_detect(Sex, "ma"))
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 male 27 yes
4 104 man 44 no
5 100 male 25 yes
Pairing this piping approach with toTitleCase()
may be better as only the first letter will be capitalized and you are looking for “Ma”.
%>%
DAT mutate(., Sex = tools::toTitleCase(Sex)) %>%
filter(., stringr::str_detect(Sex, "Ma"))
Id Sex Age Renting
1 100 Male 25 yes
2 102 Male 27 yes
3 104 Man 44 no
4 100 Male 25 yes
But notice the male
and man
issue is still a problem.
So what are the unique elements in DAT$Sex
? Your friend here is unique()
. And maybe you are too.
unique(DAT$Sex)
[1] "male" "female" "Male" NA "man" "neither"
You can hard code a fix using %in%
and case_when()
. case_when()
is {dplyr}’s solution for multiple-case conditional.
%>%
DAT mutate(.,
Sex = tolower(Sex), # convert to lowercase first
Sex = case_when( # then recode cases when matched
%in% c("male", "man") ~ "Male",
Sex %in% c("female", "woman") ~ "Female"
Sex
) )
Id Sex Age Renting
1 100 Male 25 yes
2 101 Female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 104 Male 44 no
6 100 Male 25 yes
7 105 <NA> 40 yes
For a more flexible fix, you can use str_detect()
:
BUT beware that the order of operations matters with this approach. Because the string “female” contains characters “ma”, you could accidentally recode all “female” cases to “male” if you perform the case_when()
conversion on “male” first.
%>%
DAT mutate(.,
Sex = tolower(Sex), # convert first
Sex = case_when( # then recode
::str_detect(Sex, "fe") ~ "Female",
stringr::str_detect(Sex, "ma") ~ "Male",
stringr ))
Id Sex Age Renting
1 100 Male 25 yes
2 101 Female 33 <NA>
3 102 Male 27 yes
4 103 <NA> 40 <NA>
5 104 Male 44 no
6 100 Male 25 yes
7 105 <NA> 40 yes
Note that conditions in not declared in case_when()
will be recoded as NA
, which is what happens for "neither"
. Make sure to include all ways you wish you recode cases.
Filtering Missing Data (NAs
)
is.na()
will return a logical vector for which TRUE
represents there are missing values.
Try on the entire data frame…
is.na(DAT)
Id Sex Age Renting
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE TRUE
[3,] FALSE FALSE FALSE FALSE
[4,] FALSE TRUE FALSE TRUE
[5,] FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE
You can see that some columns contain cases/rows with TRUE
indicating the cell contains NA
.
The negation operator, !
, will be used to illustrate some filtering approaches. Because filter()
will filter out FALSE
cases and retain TRUE
ones, so you may sometimes need to negate a function so that you keep the rows you want to keep.
na.omit()
: removes rows with NAsfilter(., is.na(column_name))
: keep rows with NA in specific variablefilter(., !is.na(column_name))
: remove rows with NA in specific variablefilter(., complete.cases(.))
: remove rows with NAs
Filter using na.omit()
:
%>%
DAT na.omit(.) #%>% # omit any rows with NAs
Id Sex Age Renting
1 100 male 25 yes
3 102 Male 27 yes
5 104 man 44 no
6 100 male 25 yes
7 105 neither 40 yes
Filter using is.na()
and !is.na()
:
%>%
DAT filter(., is.na(Sex)) # keep NAs by variable
Id Sex Age Renting
1 103 <NA> 40 <NA>
But your goal may likely be to keep everything that is not NA
:
%>%
DAT filter(., !is.na(Sex)) # remove NAs by variable
Id Sex Age Renting
1 100 male 25 yes
2 101 female 33 <NA>
3 102 Male 27 yes
4 104 man 44 no
5 100 male 25 yes
6 105 neither 40 yes
And filter step-by-step for each variable using %>%
and separate function calls:
%>%
DAT filter(., !is.na(Sex)) %>%
filter(., !is.na(Renting))
Id Sex Age Renting
1 100 male 25 yes
2 102 Male 27 yes
3 104 man 44 no
4 100 male 25 yes
5 105 neither 40 yes
So why use separate lines of code if you can use &
all in one line? One reason is that separate function calls written as separate lines of code make code inclusion/exclusion extremely easy.
Comment out what you don’t want using #
:
%>%
DAT #filter(., !is.na(Sex)) %>%
filter(., !is.na(Renting))
Id Sex Age Renting
1 100 male 25 yes
2 102 Male 27 yes
3 104 man 44 no
4 100 male 25 yes
5 105 neither 40 yes
Filter using complete.cases()
:
The complete.cases()
function returns a logical vector for which TRUE
reflects the row has complete information and no missing cases. Using complete.cases()
along with filter()
, you would retain all rows TRUE
rows.
%>%
DAT ::filter(., complete.cases(.)) dplyr
Id Sex Age Renting
1 100 male 25 yes
2 102 Male 27 yes
3 104 man 44 no
4 100 male 25 yes
5 105 neither 40 yes
The Data Manipulation Workflow: Putting It All Together
Of course, all of these functions can be paired with select()
, mutate()
, filter()
, etc. Here is the data manipulation workflow.
dataframe %>%
select(., ...) %>% # select variables of interest
mutate(., ...) %>% # then create new variables
filter(., ...) %>% # then filter by rows
group_by(., ...) %>% # then group for subsetting
summarize(., ...) # then summarize