Data frame manipulation and wrangling

Author

Gabriel I. Cook

Published

September 26, 2023

Under construction.

This page is a work in progress and may contain areas that need more detail or that required syntactical, grammatical, and typographical changes. If you find some part requiring some editing, please let me know so I can fix it for you.

Overview

This module demonstrates how to use {dplyr} to wrangle data frames. There is much that you can do with the library and this module focuses on selecting, filtering, and arranging data frames or tibbles. Helper functions from libraries like {magrittr}, {stringr}, and {tidyselect} are also used in conjunction with {dplyr}.

Libraries

{here}: 1.0.1: for path management
{dplyr} 1.1.2: for selecting, filtering, and mutating
{magrittr} 2.0.3: for code clarity and piping data frame objects
{stringr}: 1.5.0: for working with strings
{tidyselect}: 1.2.0: for selecting sets from strings

External Functions

Provided in class:

view_html(): for viewing data frames in html format, from /r/my_functions.R

source(here::here("r", "my_functions.R"))

Libraries

We will work with a few different libraries for data manipulation. Let’s load them into our work space using library().

library(magrittr)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(stringr)

Make note of any warnings that appear when loading libraries. There are some libraries that contain functions with the same names. Be aware that the most recently loaded library function will take precedence. You can avoid confusion using :: to call a function from a particular library (e.g., libraryname::functionname()).

A Grammar for Data Wrangling

The {dplyr} package presents a type of grammar for wrangling data (Wickham and Francois 2014). {dplyr} is part of the {tidyverse} ecosystem which loads using library(tidyverse), or specifically using library(dplyr). {dplyr} also imports functions from {tidyselect}, which is also part of the {tidyverse} ecosystem.

The functions from any R package can be called after loading the package library. One does not need to reference the package explicitly when calling a function but to avoid confusion, we will often reference the package/library and the function together (e.g., package::function()).

Working with data involves creating new column variables, removing or renaming column variables, removing or deleting row observations, sorting, and summarizing data (often by groups/factors). etc. Consequently, there are five main function verbs for working with data in a data frame: select, filter, mutate, arange, and summarize. Their corresponding functions are:

select(dataframe, variables_to_select): subset by columns
mutate(dataframe, variables_to_create) and dplyr::rename(): add or modify existing columns
filter(dataframe, rows_to_select): subset by rows
arrange(dataframe, variable_to_sort_by): sort rows
summarize(dataframe, procedures_for_summarizing) in conjunction with dplyr::group_by(): aggregate the data in some way

Some Common Ways for Selecting Variables Using `{dplyr}`

You can reference variables by their names or by their column position in a data frame (e.g., 1, 25, etc.).

Using select(), you can select columns/variables from a data frame. The variables you select are retained and variables omitted that you don’t select are omitted in the returned data frame.

If you are not piping a data frame from a previous object using either |> (now in base R) or %>% (from {magrittr}), the first argument passed into select() will be the data frame. The second and subsequent arguments can be column variables you wish to select from the data frame.

select(mydataframe, myvars)

If piping a data frame with {magrittr}, . or .data will serve to reference the inherited data frame. This detail is not needed but adding it serves as a reminder that the data frame is the first argument of the function.

dataframe %>%
   select(., myvars)

Variables can be passed by their column position.

Passing variables separately by name:

USArrests %>% 
  select(., 1, 3) %>% head()

           Murder UrbanPop
Alabama      13.2       58
Alaska       10.0       48
Arizona       8.1       80
Arkansas      8.8       50
California    9.0       91
Colorado      7.9       78

Variables can be passed separately without quotes or collectively as a character vector.

Passing variables separately by name:

USArrests %>% 
  select(., Murder, Assault) %>% head()

           Murder Assault
Alabama      13.2     236
Alaska       10.0     263
Arizona       8.1     294
Arkansas      8.8     190
California    9.0     276
Colorado      7.9     204

Note: passing variables by name does not work with base R manipulation, you need to pass the name as a string.

Passing variables separately as characters:

USArrests %>% 
  select(., "Murder", "Assault") %>% head()

           Murder Assault
Alabama      13.2     236
Alaska       10.0     263
Arizona       8.1     294
Arkansas      8.8     190
California    9.0     276
Colorado      7.9     204

Passing a character vector (more like you would do in base R):

USArrests %>% 
  select(., c("Murder", "Assault")) %>% head()

           Murder Assault
Alabama      13.2     236
Alaska       10.0     263
Arizona       8.1     294
Arkansas      8.8     190
California    9.0     276
Colorado      7.9     204

Passing a numeric vector of column positions:

USArrests %>% 
  select(., c(1, 3)) %>% head()

           Murder UrbanPop
Alabama      13.2       58
Alaska       10.0       48
Arizona       8.1       80
Arkansas      8.8       50
California    9.0       91
Colorado      7.9       78

Passing an object holding a character vector containing two elements:

First, creat the vector object:

keep_vars <- c("Murder", "Assault")

Then pass the vector into the function:

USArrests %>% 
  select(., keep_vars) %>% head()

Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(keep_vars)

  # Now:
  data %>% select(all_of(keep_vars))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.

           Murder Assault
Alabama      13.2     236
Alaska       10.0     263
Arizona       8.1     294
Arkansas      8.8     190
California    9.0     276
Colorado      7.9     204

Select Variables Starting with or Ending with Certain Characters

One thing about {dplyr}, when you load the library, there are functions from other libraries that are imported along with {dplyr}’s own functions. These important functions are designed to work with each other, so the people who maintain the libraries have packaged them up nicely so you don’t have to load separate libraries.

Many of the functions are imported from the {tidyselect} library and these functions are what give you additional manipulation ability. Some imported functions are: all_of(), any_of(), contains(), ends_with(), everything(), last_col(), matches(), and starts_with().

With functions like starts_with(), contains(), and ends_with(), you can select variables with character patterns in their names.

Rather that passing the names of the variables as the second argument (e.g., c("Murder", "Assault")), you would pass the helper function, say starts_with(). Whatever starts_with() returns is what gets passed to select() as the variables. This is what is referred to as functional programming. Rather than coding specifically what to do, you with utilize the task of another function to passed its returned object as an argument to another function.

But first, we need to see what these functions, like starts_with(), are doing. For more information, use ?starts_with.

starts_with(match, ignore.case = TRUE, vars = NULL)

Notice the arguments we need to pass:

match: A character vector
ignore.case: If TRUE, the default, ignores case when matching names. This is most flexible.
vars: A character vector of variable names. If not supplied, the variables are taken from the current selection context (as established by functions like select() or pivot_longer()).

Let’s just try out starts_with() on its own. Let’s set a required pattern match = some character and because vars = NULL by default, let’s just set vars = some character vector. Note that vars is not the second argument, so you will want to name it in the function call.

starts_with(match = "a", vars = c("Hello", "Hi", "Bye"))

integer(0)

Returns integer(0) which is speak for “there is no match”. Hmm, OK. Let’s try another character.

starts_with(match = "b", vars = c("Hello", "Hi", "Bye"))

[1] 3

OK, so now an integer is returned (yes, try is.integer() if you don’t believe me).

is.integer(starts_with("b", vars = c("Hello", "Hi", "Bye")))

[1] TRUE

Importantly, the value refers to the element index/position in the vars vector. Because the third string "Bye" starts with "b", that’s what is returned.

Try something else:

starts_with("h", vars = c("Hello", "Hi", "Bye"))

[1] 1 2

Now a vector with length = 2 is returned, representing both the first and the second elements start with “h”.

length(starts_with("h", vars = c("Hello", "Hi", "Bye")))

[1] 2

See, it’s really a vector containing the element(s) of the vars vector matching the pattern.

And yes, this the letter casing is ignored because the default ignore.case = TRUE. Set to FALSE if you want your match to be case sensitive.

starts_with("h", 
            vars = c("Hello", "Hi", "Bye"), 
            ignore.case = F
            )

integer(0)

OK, no matches.

You will typically use starts_with() along with other functions. When using starts_with() in the context of select(), the vars argument is essentially passing vars = the names of the columns of the data frame passed to select().

Example:

select(mydataframe,
    starts_with(match = "my pattern",
                vars = "var names of mydataframe")
                )

Passing the data frame into select() without piping it using %>%:

select(USArrests, starts_with("m")) %>% head()

           Murder
Alabama      13.2
Alaska       10.0
Arizona       8.1
Arkansas      8.8
California    9.0
Colorado      7.9

Piping the data frame into select():

USArrests %>%
  select(., starts_with("m")) %>% head()

           Murder
Alabama      13.2
Alaska       10.0
Arizona       8.1
Arkansas      8.8
California    9.0
Colorado      7.9

Another example:

USArrests %>%
  select(., ends_with("t")) %>% head()

           Assault
Alabama        236
Alaska         263
Arizona        294
Arkansas       190
California     276
Colorado       204

Selecting and Selecting Out Variables By/Between Index

There are many approaches for selecting or selecting out column variables. You can pass multiple arguments for each specification or you can pass a single vector that contains all specifications.

select(., 1,2): select first and second columns
select(., c(1,2)): select first and second columns
select(., -c(1,2)): select out first and second columns
select(., 1:2): select first through second columns
select(., c(1:2)): select first through second columns
select(., -c(1:2)): select out first through second columns

Potential Recommendation: use options utilizing c() to pass a vector because this habit will be more versatile with base R functionality. However, online solutions will likely not take this approach.

Let’s make a data frame to work with first.

DAT <- data.frame(
  Id  = c(100, 101, 102, 103, 104, 100, 105),
  Sex = c('male', 'female', 'Male', NA, 'man', "male", "neither"),
  Age = c(25, 33, 27, 40, 44, 25, 40),
  Renting = c("yes", NA, "yes", NA, "no", "yes", "yes")
)

Select columns 1 and 2:

DAT %>%
  select(., 1,2)

   Id     Sex
1 100    male
2 101  female
3 102    Male
4 103    <NA>
5 104     man
6 100    male
7 105 neither

Select columns 1 and 2 as a vector containing values 1 and 2:

DAT %>%
  select(., c(1,2))

   Id     Sex
1 100    male
2 101  female
3 102    Male
4 103    <NA>
5 104     man
6 100    male
7 105 neither

Select out columns 1 and 2 as a vector containing values 1 and 2:

DAT %>%
  select(., -c(1,2))

  Age Renting
1  25     yes
2  33    <NA>
3  27     yes
4  40    <NA>
5  44      no
6  25     yes
7  40     yes

Select from columns 1 to 2 using the : operator:

DAT %>%
  select(., 1:2)

   Id     Sex
1 100    male
2 101  female
3 102    Male
4 103    <NA>
5 104     man
6 100    male
7 105 neither

Select from columns 1 through 3 using a vector containing the : operator:

DAT %>%
  select(., c(1:3))

   Id     Sex Age
1 100    male  25
2 101  female  33
3 102    Male  27
4 103    <NA>  40
5 104     man  44
6 100    male  25
7 105 neither  40

Select out columns 1 through 3 using a vector containing the : operator:

DAT %>%
  select(., -c(1:3))   # select out from here to there

  Renting
1     yes
2    <NA>
3     yes
4    <NA>
5      no
6     yes
7     yes

Selecting and Selecting Out Variables By or Between Character Name

These approaches are similar to those offered earlier except that some involve passing variables by their name (e.g., character names). Whereas the order of the variables in a data frame may move around, the names may be more stable or permanent, at least after you have cleaned up the names. Consequently, passing variables by name may be more foolproof.

You don’t have to be familiar with all approaches and you may settle on using one that makes the most sense to you.

select(., "var1", "var2")
select(., c("var1", "var2"))
select(., -c("var1", "var2"))
select(., var1:var2))
select(., c("var1":"var2))
select(., -c("var1":"var2))

Recommendation: use options utilizing c() as this will be more versatile with base R functionality.

These approaches also work but they may lead to some confusion regarding usage of quotes:

select(., var1, var2)
select(., c(var1, var2))
select(., -c(var1, var2))

Select variables Id though Age using the : operator:

DAT %>%
  select(., Id:Age) # select from here to there

   Id     Sex Age
1 100    male  25
2 101  female  33
3 102    Male  27
4 103    <NA>  40
5 104     man  44
6 100    male  25
7 105 neither  40

Select variables Id though Age passed as strings using the : operator:

DAT %>%
  select(., "Id":"Age") # select from here to there

   Id     Sex Age
1 100    male  25
2 101  female  33
3 102    Male  27
4 103    <NA>  40
5 104     man  44
6 100    male  25
7 105 neither  40

Select variables Id though Age as a vector containing the variable names passed as strings and using the : operator:

DAT %>%
  select(., c("Id":"Age")) # select from here to there

   Id     Sex Age
1 100    male  25
2 101  female  33
3 102    Male  27
4 103    <NA>  40
5 104     man  44
6 100    male  25
7 105 neither  40

Select out variables Id though Age as a vector containing the variable names passed as strings and using the : operator:

DAT %>%
  select(., -c("Id":"Age"))   # select out from here to there

  Renting
1     yes
2    <NA>
3     yes
4    <NA>
5      no
6     yes
7     yes

You can also use the ! operator to select NOT these variables (therefore, all others)

DAT %>%
  select(., !c("Id":"Age"))   # select out from here to there

  Renting
1     yes
2    <NA>
3     yes
4    <NA>
5      no
6     yes
7     yes

Selecting and Selecting Out Variables Characters in Their Names

select(., starts_with("characters"))
select(., ends_with("characters"))
select(., contains('e'))

Select variables which start with character “i”:

DAT %>% select(., starts_with('i'))

Select variables which DO NOT start with character “s”:

DAT %>% select(., -starts_with('s'))

   Id Age Renting
1 100  25     yes
2 101  33    <NA>
3 102  27     yes
4 103  40    <NA>
5 104  44      no
6 100  25     yes
7 105  40     yes

Select variables which end with character “e”:

DAT %>% select(., ends_with('e'))

Select variables which end with character “e”:

DAT %>% select(., -ends_with('e'))

   Id     Sex Renting
1 100    male     yes
2 101  female    <NA>
3 102    Male     yes
4 103    <NA>    <NA>
5 104     man      no
6 100    male     yes
7 105 neither     yes

Select variables which contain character “g”:

DAT %>% select(., contains('g'))

  Age Renting
1  25     yes
2  33    <NA>
3  27     yes
4  40    <NA>
5  44      no
6  25     yes
7  40     yes

Select variables which DO NOT contain character “g”:

DAT %>% select(., -contains('g'))

   Id     Sex
1 100    male
2 101  female
3 102    Male
4 103    <NA>
5 104     man
6 100    male
7 105 neither

Select variables containing a regular expression, use matches():

.* will grab all names because it means any character and any number of times

DAT %>% select(., matches(".*"))

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes

\\d will grab all variables containing a digit:

DAT %>% 
  mutate(., 
         var_1    = 1,
         var_11   = 1,
         var_3    = 1,
         var1_var = 1
         ) %>%
  select(., matches("\\d"))

  var_1 var_11 var_3 var1_var
1     1      1     1        1
2     1      1     1        1
3     1      1     1        1
4     1      1     1        1
5     1      1     1        1
6     1      1     1        1
7     1      1     1        1

v.*\\d will grab all variables that start with v and then contain any characters which are followed by a digit:

DAT %>% 
    mutate(., 
         var_1    = 1,
         var_11   = 1,
         var_3    = 1,
         var1_var = 1
         ) %>%
  select(., matches("v.*\\d"))

  var_1 var_11 var_3 var1_var
1     1      1     1        1
2     1      1     1        1
3     1      1     1        1
4     1      1     1        1
5     1      1     1        1
6     1      1     1        1
7     1      1     1        1

\\d$ will grab all variables ending in a digit ($ means end):

DAT %>% 
    mutate(., 
         var_1    = 1,
         var_11   = 1,
         var_3    = 1,
         var1_var = 1
         ) %>%
  select(., matches("\\d$"))

  var_1 var_11 var_3
1     1      1     1
2     1      1     1
3     1      1     1
4     1      1     1
5     1      1     1
6     1      1     1
7     1      1     1

You can also negate all regular expression matches if you want to exclude:

DAT %>%
      mutate(., 
         var_1    = 1,
         var_11   = 1,
         var_3    = 1,
         var1_var = 1
         ) %>%
  select(., -matches("\\d$"))

   Id     Sex Age Renting var1_var
1 100    male  25     yes        1
2 101  female  33    <NA>        1
3 102    Male  27     yes        1
4 103    <NA>  40    <NA>        1
5 104     man  44      no        1
6 100    male  25     yes        1
7 105 neither  40     yes        1

Note: The functions will return lowercase and uppercase variable name matches because the default behavior is ignore.case = TRUE. Set to FALSE if you want to perform precise surgery on the variables.

Selecting and Selecting Out Variables by Type

Select variables that are numeric:

DAT %>% select(., where(is.numeric))

Select variables that are NOT numeric:

DAT %>% select(., -where(is.numeric))

      Sex Renting
1    male     yes
2  female    <NA>
3    Male     yes
4    <NA>    <NA>
5     man      no
6    male     yes
7 neither     yes

Select variables that are character:

DAT %>% select(., where(is.character))

      Sex Renting
1    male     yes
2  female    <NA>
3    Male     yes
4    <NA>    <NA>
5     man      no
6    male     yes
7 neither     yes

Select variables that are NOT character:

DAT %>% select(., -where(is.character))

Select variables that are logical (TRUE to FALSE):

DAT %>% select(., where(is.logical))

data frame with 0 columns and 7 rows

Select variables that are NOT logical (TRUE to FALSE):

DAT %>% select(., -where(is.logical))

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes

Cleaning Data

Data files are messy and as a result require cleaning. You will have missing rows, incorrect variable names, files with columns named the same, NAs, strings for numbers, duplicate rows of data, people who completed a survey twice, and all sorts of unimaginable and unbelievable data problems. So cleaning is important.

Whereas select() is used for columns, filter() operates on rows. Data frame manipulation may involve keeping only certain rows for data, for example, male or female respondents, male respondents, those who do not contain missing values (e.g., NAs) for a specific column variable, who are of a certain age (or born in in certain year), who are above (or below) some acceptable criterion, etc.

When a column variable has more than one value (e.g., check using unique() to determine the unique elements contained), you may wish to filter on some but not others.

You may even need to filter rows in a data frame that are distinct (e.g., not duplicate responses). This is often a good first step in order to determine the size of the usable data set. dplyr::distinct() makes de-duplicating easy as this function will return only distinct rows.

Removing duplicate rows using `distinct()`

dplyr::distinct(): remove duplicate rows
dplyr::distinct(., column): remove duplicate rows by column
na.omit(): remove any row with NA’s (missing values)

Let’s use the simple DAT data frame.

DAT # or with %>% print()

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes

Notice that rows 1 and 6 are the same person (e.g., Id) and have exactly the same data for all variables.

DAT[1,] == DAT[6,]

    Id  Sex  Age Renting
1 TRUE TRUE TRUE    TRUE

Great that the rows are consistent but you don’t want their data twice. So let’s just remove any rows that are identical.

DAT %>%
  distinct(.) #%>%    # Remove exact duplicates

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 105 neither  40     yes

If you know each row is unique based on a variable in the data frame, you can also use distinct() to remove duplicates for a specific variable. Make sure that this variable specification is actually one that you would not want duplicates of.

DAT %>%             
  distinct(., Id) # %>% view_html(.) # Remove duplicates by variable; passes unique values for data frame

But this function simply returns the unique values in Id. To retain the variables, set .keep_all = T. If you want to remove duplicates and assign the cleaned data frame to an object, you would likely want to keep all of your variables.

DAT %>%             
  distinct(., Id, .keep_all = T) #%>% view_html(.)

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 105 neither  40     yes

Notice, however, this only removed the last instance or Id == 100. Which row to include is a judgment call. The first, the last, neither, the average? Is there a correct answer?

Filtering using {dplyr} and Understanding Filtering Operators

Filtering cases using the dplyr::filter() verbs works by removing rows that do not match a specific criterion and then by returning the data frame that omits the mismatched condition.

Some useful filtering operators and functions include: ==, >, >=, &, |, !, xor(), c(), is.na(), between(), near().

Row/Observations/Cases can be filtered to “include” only certain matched conditions or can be filtered to “exclude” by negating those matched conditions. If the column variable Sex is in the data frame and cases are 'male', 'men', 'female', 'women', 'neither', NA, etc., you can specify the column Sex variable and then the row matching condition(s).

The first argument in dplyr::filter() is a data frame, and the function all dplyr::filter(DAT, Sex == 'female') will filter the data frame named DAT to include rows for which the sex column equals 'female'. In other words, TRUE rows.

dplyr::filter(DAT, Sex == 'female')

   Id    Sex Age Renting
1 101 female  33    <NA>

Similarly, the function call dplyr::filter(., Sex == 'male') can be read “filter the data frame to include rows for which the value of Sex == 'male' is TRUE”.

More flexibly, however, you could specify a vector containing acceptable strings using c(). dplyr::filter(., Sex %in% c('male')) filters the rows to include only those for which the value for sex is in the string vector which includes a single string,'male' whereas dplyr::filter(., Sex %in% c('male', 'Man')) filters the rows to include only those for which the value for Sex is in the string vector which includes 'male' and 'Man'. Cases containing 'Male', 'Men' (R is a case-sensitive language), or 'female', for example, will not be included in the returned data frame because they do not match values in the string vector.

Piping Multiple Filter Function Calls

In many cases, data filtering will involve different conditions for different column variables, so specifying them separately as separate lines of code is most appropriate.

When passing a data frame using %>% from {magrittr}, the first argument for the data frame can be specified using a . because the function inherits the data frame manipulated. However, {dplyr} also understand this so the . can also be omitted for convenience; this is the general practice you will see in forums like stackoverflow.com. Example to follow.

Filtering Cases by Character Names/String Values

Filter Cases using `==`

Filter rows for which the Sex variable is equal to the string 'female':

DAT %>%
  dplyr::filter(., Sex == 'female')

   Id    Sex Age Renting
1 101 female  33    <NA>

Filter rows for which the Sex variable is not equal to the string 'female':

DAT %>%
  dplyr::filter(., Sex != 'female')

   Id     Sex Age Renting
1 100    male  25     yes
2 102    Male  27     yes
3 104     man  44      no
4 100    male  25     yes
5 105 neither  40     yes

Filter rows for which the Sex variable is equal to the string 'female' AND Age is greater than the numeric 27:

DAT %>%
  dplyr::filter(., Sex == 'female' & Age > 27) # this "AND" that

   Id    Sex Age Renting
1 101 female  33    <NA>

Filter rows for which the Sex variable is equal to the string 'female' OR Age is greater than the numeric 27:

DAT %>%
  dplyr::filter(., Sex == 'female' | Age > 27) # this "OR" that

   Id     Sex Age Renting
1 101  female  33    <NA>
2 103    <NA>  40    <NA>
3 104     man  44      no
4 105 neither  40     yes

A cleaner method involves separate lines of code. Although cleaner, this will not allow the “OR” option because the data frame that is returned from the first filter() is passed to the second filter() and all cases other than "female" have already been removed from the data frame.

DAT %>%
  dplyr::filter(., Sex == 'female') %>%   # keep female (and add another pipe)
  dplyr::filter(., Age >= 27)             # keep only those equal to or older than 27

   Id    Sex Age Renting
1 101 female  33    <NA>

Filtering Cases by Value

Filter by `<` and `>` or `<=` or `>=`…

DAT %>% dplyr::filter(., Age < 40)  # keep those less than

   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 102   Male  27     yes
4 100   male  25     yes

DAT %>% dplyr::filter(., Age > 40)  # keep older than

   Id Sex Age Renting
1 104 man  44      no

DAT %>% dplyr::filter(., Age >= 40)  # keep equal to or older than

   Id     Sex Age Renting
1 103    <NA>  40    <NA>
2 104     man  44      no
3 105 neither  40     yes

Filter Cases by Conditional X or Y Using `|` Operator…

Using the “OR” operator, |, cases can be included if “this” OR “that” condition.

Filter numbers:

DAT %>%
  dplyr::filter(., Age == 25 | Age == 40)    # filter out numeric values IN a range

   Id     Sex Age Renting
1 100    male  25     yes
2 103    <NA>  40    <NA>
3 100    male  25     yes
4 105 neither  40     yes

Filter characters:

DAT %>%
  dplyr::filter(., Sex == 'male' | Sex == 'female')

   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 100   male  25     yes

Although dplyr::filter(sex %in% c('male', 'female')) would be easier.

Filter rows of variables of both types:

DAT %>%
  dplyr::filter(., Sex == 'male' | Age == 27)

   Id  Sex Age Renting
1 100 male  25     yes
2 102 Male  27     yes
3 100 male  25     yes

Filter Cases Between Values with `between()`

Between ages 25 and 33:

DAT %>%
  dplyr::filter(., between(Age, 27, 33))

   Id    Sex Age Renting
1 101 female  33    <NA>
2 102   Male  27     yes

Filter by range using the `%in%` operator (this is IN meaning in)

Though less flexible than using between(), %in% may be easier to remember:

DAT %>%
  dplyr::filter(., Age %in% 20:43)    # filter out numeric values IN a range

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 100    male  25     yes
6 105 neither  40     yes

One’s age is in the range from 20 through 43.

If a vector object is already defined (e.g., my_levels = c('male', 'female')), you can use that for filtering also. Such approaches are useful when data manipulation involves reusing a reference as it simplifies coding and reduces errors because the specification is defined only once.

my_levels = c('male', 'female')

DAT %>%
  dplyr::filter(., Sex %in% my_levels)

   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 100   male  25     yes

Filter by exclusion

When inclusion of variables is inappropriate, exclusion of them may be useful. The ! operator means “NOT” in R so you can use that to accomplish the opposite of the statement. For example, dplyr::filter(., !sex %in% c('male', NA)) will “filter the data frame to include rows in the sex column for which the value is NOT in the vector”.

Exclude rows in the Sex variable that are NA or 'male':

DAT %>%
  dplyr::filter(., !Sex %in% c('male', NA))  # keep only if NOT in vector

   Id     Sex Age Renting
1 101  female  33    <NA>
2 102    Male  27     yes
3 104     man  44      no
4 105 neither  40     yes

Exclude rows in the Sex variable that are Men or 'male':

DAT %>%
  dplyr::filter(., !Sex %in% c('male', 'Men'))  # keep only if NOT in vector

   Id     Sex Age Renting
1 101  female  33    <NA>
2 102    Male  27     yes
3 103    <NA>  40    <NA>
4 104     man  44      no
5 105 neither  40     yes

Filter by conditional X and Y using `&` operator…

By range:

DAT %>%
  dplyr::filter(., Id >= 102 & Age <= 43)

   Id     Sex Age Renting
1 102    Male  27     yes
2 103    <NA>  40    <NA>
3 105 neither  40     yes

DAT %>%
  dplyr::filter(., Age >= 20 & Age <= 43)

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 103    <NA>  40    <NA>
5 100    male  25     yes
6 105 neither  40     yes

Note: Age 20:43 won’t work. Can you figure out why?

Filter Cases Containing Characters Using `stringr::str_detect()` and `%in%`:

If you want to filter cases that contain certain characters, unfortunately, you cannot use contains() as you would for variable names. For matching characters in levels of variables, you’ll need something like stringr::str_detect() or grepl().

stringr::str_detect(): returns matching conditions

The example below uses str_detect() to detect the presence of a character match and returns a logic vector with TRUEs for matching cases. When paired with filter(), the data frame is filtered by to contains cases that match the pattern for the variable.

Example: stringr::str_detect(my_variable, "pattern"):

Let’s look for levels of Sex for which "ma" is detected.

DAT %>% dplyr::filter(., stringr::str_detect(Sex, "ma"))

   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 104    man  44      no
4 100   male  25     yes

But the case for which Sex = Male is now missing. This is because stringr::str_detect() is a case-sensitive pattern match.

You can fix this is a couple ways:

make the cases in Sex all lower case to mutate() the fix or
to wrap Sex in tolower() to make cases lowercase.

The first option might be better if you want to fix the problem in the data frame.

Other casing functions are:

tolower(): returns lower case of string
toupper(): returns upper case of string
tools::toTitleCase(): returns Title Case (capitalize first letter of string)

DAT %>%
  filter(., stringr::str_detect(tolower(Sex), "ma"))

   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 102   Male  27     yes
4 104    man  44      no
5 100   male  25     yes

For some, reading this code may be difficult. In order to understand the code, you need to read it inside out like slices of an onion. First understand the first function as the returned value is passed to the next function and so forth.

For example, elements of the Sex vector are converted to lower case and returned using tolower(). We can demonstrate this using only the vector from the data frame using the $ operator, DAT$Sex.

The initial vector:

DAT$Sex

[1] "male"    "female"  "Male"    NA        "man"     "male"    "neither"

To lowercase:

tolower(DAT$Sex)

[1] "male"    "female"  "male"    NA        "man"     "male"    "neither"

Next, stringr::str_detect() will return a vector of TRUE or FALSE for all element positions of DAT$Sex for which the (now lowercase) string matches "ma":

stringr::str_detect(tolower(DAT$Sex), "ma")

[1]  TRUE  TRUE  TRUE    NA  TRUE  TRUE FALSE

Another approach would be to perform the steps in pieces by first mutating the data frame so that elements in the Sex vector are converted to lowercase and then pipe that mutated data from to filter().

Briefly, we will introduce dplyr::mutate() to accomplish this goal. mutate() is used to add, update, etc. variables in a data frame or tibble.

Mutate Sex using mutate():

DAT %>%
  mutate(., Sex = tolower(Sex))

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    male  27     yes
4 103    <NA>  40    <NA>
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes

Now mutate and pipe that mutated data frame using %>% (from {margittr}) or |> (base R version 4.1+) to filter():

DAT %>%
  mutate(., Sex = tolower(Sex)) %>%
  filter(., stringr::str_detect(Sex, "ma"))

   Id    Sex Age Renting
1 100   male  25     yes
2 101 female  33    <NA>
3 102   male  27     yes
4 104    man  44      no
5 100   male  25     yes

Pairing this piping approach with toTitleCase() may be better as only the first letter will be capitalized and you are looking for “Ma”.

DAT %>%
  mutate(., Sex = tools::toTitleCase(Sex)) %>%
  filter(., stringr::str_detect(Sex, "Ma"))

   Id  Sex Age Renting
1 100 Male  25     yes
2 102 Male  27     yes
3 104  Man  44      no
4 100 Male  25     yes

But notice the male and man issue is still a problem.

So what are the unique elements in DAT$Sex? Your friend here is unique(). And maybe you are too.

unique(DAT$Sex)

[1] "male"    "female"  "Male"    NA        "man"     "neither"

You can hard code a fix using %in% and case_when(). case_when() is {dplyr}’s solution for multiple-case conditional.

DAT %>%
  mutate(., 
         Sex = tolower(Sex), # convert to lowercase first
         Sex = case_when(    # then recode cases when matched
           Sex %in% c("male", "man") ~ "Male",
           Sex %in% c("female", "woman") ~ "Female"
         )
)

   Id    Sex Age Renting
1 100   Male  25     yes
2 101 Female  33    <NA>
3 102   Male  27     yes
4 103   <NA>  40    <NA>
5 104   Male  44      no
6 100   Male  25     yes
7 105   <NA>  40     yes

For a more flexible fix, you can use str_detect():

BUT beware that the order of operations matters with this approach. Because the string “female” contains characters “ma”, you could accidentally recode all “female” cases to “male” if you perform the case_when() conversion on “male” first.

DAT %>%
  mutate(., 
         Sex = tolower(Sex),   # convert first
         Sex = case_when(      # then recode
           stringr::str_detect(Sex, "fe") ~ "Female",
           stringr::str_detect(Sex, "ma") ~ "Male",
         ))

   Id    Sex Age Renting
1 100   Male  25     yes
2 101 Female  33    <NA>
3 102   Male  27     yes
4 103   <NA>  40    <NA>
5 104   Male  44      no
6 100   Male  25     yes
7 105   <NA>  40     yes

Note that conditions in not declared in case_when() will be recoded as NA, which is what happens for "neither". Make sure to include all ways you wish you recode cases.

Filtering Missing Data (`NAs`)

is.na() will return a logical vector for which TRUE represents there are missing values.

Try on the entire data frame…

is.na(DAT)

        Id   Sex   Age Renting
[1,] FALSE FALSE FALSE   FALSE
[2,] FALSE FALSE FALSE    TRUE
[3,] FALSE FALSE FALSE   FALSE
[4,] FALSE  TRUE FALSE    TRUE
[5,] FALSE FALSE FALSE   FALSE
[6,] FALSE FALSE FALSE   FALSE
[7,] FALSE FALSE FALSE   FALSE

You can see that some columns contain cases/rows with TRUE indicating the cell contains NA.

The negation operator, !, will be used to illustrate some filtering approaches. Because filter() will filter out FALSE cases and retain TRUE ones, so you may sometimes need to negate a function so that you keep the rows you want to keep.

na.omit(): removes rows with NAs
filter(., is.na(column_name)): keep rows with NA in specific variable
filter(., !is.na(column_name)): remove rows with NA in specific variable
filter(., complete.cases(.)): remove rows with NAs

Filter using `na.omit()`:

DAT %>%
  na.omit(.) #%>%     # omit any rows with NAs

   Id     Sex Age Renting
1 100    male  25     yes
3 102    Male  27     yes
5 104     man  44      no
6 100    male  25     yes
7 105 neither  40     yes

Filter using `is.na()` and `!is.na()`:

DAT %>%
  filter(., is.na(Sex))       # keep NAs by variable

   Id  Sex Age Renting
1 103 <NA>  40    <NA>

But your goal may likely be to keep everything that is not NA:

DAT %>%
  filter(., !is.na(Sex))      # remove NAs by variable

   Id     Sex Age Renting
1 100    male  25     yes
2 101  female  33    <NA>
3 102    Male  27     yes
4 104     man  44      no
5 100    male  25     yes
6 105 neither  40     yes

And filter step-by-step for each variable using %>% and separate function calls:

DAT %>%
  filter(., !is.na(Sex)) %>%      
  filter(., !is.na(Renting))

   Id     Sex Age Renting
1 100    male  25     yes
2 102    Male  27     yes
3 104     man  44      no
4 100    male  25     yes
5 105 neither  40     yes

So why use separate lines of code if you can use & all in one line? One reason is that separate function calls written as separate lines of code make code inclusion/exclusion extremely easy.

Comment out what you don’t want using #:

DAT %>%
  #filter(., !is.na(Sex)) %>%      
  filter(., !is.na(Renting))

   Id     Sex Age Renting
1 100    male  25     yes
2 102    Male  27     yes
3 104     man  44      no
4 100    male  25     yes
5 105 neither  40     yes

Filter using `complete.cases()`:

The complete.cases() function returns a logical vector for which TRUE reflects the row has complete information and no missing cases. Using complete.cases() along with filter(), you would retain all rows TRUE rows.

DAT %>%
  dplyr::filter(., complete.cases(.))

   Id     Sex Age Renting
1 100    male  25     yes
2 102    Male  27     yes
3 104     man  44      no
4 100    male  25     yes
5 105 neither  40     yes

The Data Manipulation Workflow: Putting It All Together

Of course, all of these functions can be paired with select(), mutate(), filter(), etc. Here is the data manipulation workflow.

dataframe %>%
    select(., ...) %>%     # select variables of interest
    mutate(., ...) %>%     # then create new variables
    filter(., ...) %>%     # then filter by rows
    group_by(., ...) %>%   # then group for subsetting
    summarize(., ...)      # then summarize

Overview

Libraries

External Functions

Libraries

A Grammar for Data Wrangling

Some Common Ways for Selecting Variables Using {dplyr}

Select Variables Starting with or Ending with Certain Characters

Selecting and Selecting Out Variables By/Between Index

Selecting and Selecting Out Variables By or Between Character Name

Selecting and Selecting Out Variables Characters in Their Names

Selecting and Selecting Out Variables by Type

Cleaning Data

Removing duplicate rows using distinct()

Filtering using {dplyr} and Understanding Filtering Operators

Piping Multiple Filter Function Calls

Filtering Cases by Character Names/String Values

Filter Cases using ==

Filtering Cases by Value

Filter by < and > or <= or >=…

Filter Cases by Conditional X or Y Using | Operator…

Filter Cases Between Values with between()

Filter by range using the %in% operator (this is IN meaning in)

Filter by exclusion

Filter by conditional X and Y using & operator…

Filter Cases Containing Characters Using stringr::str_detect() and %in%:

Filtering Missing Data (NAs)

Filter using na.omit():

Filter using is.na() and !is.na():

Filter using complete.cases():

The Data Manipulation Workflow: Putting It All Together

Some Common Ways for Selecting Variables Using `{dplyr}`

Removing duplicate rows using `distinct()`

Filter Cases using `==`

Filter by `<` and `>` or `<=` or `>=`…

Filter Cases by Conditional X or Y Using `|` Operator…

Filter Cases Between Values with `between()`

Filter by range using the `%in%` operator (this is IN meaning in)

Filter by conditional X and Y using `&` operator…

Filter Cases Containing Characters Using `stringr::str_detect()` and `%in%`:

Filtering Missing Data (`NAs`)

Filter using `na.omit()`:

Filter using `is.na()` and `!is.na()`:

Filter using `complete.cases()`: