Importing, Exporting, and Inspecting Data

File Types

Knowing File Delimiters

name, age, sex, location, marital_stat 
Sally, 28, Female, Michigan, Married

name; age; sex; location; marital_stat 
Sally; 28; Female; Michigan; Married
  • data are stored in files as text
  • delimiters: characters that separate text strings
  • commas, semicolons, tabs, special characters, etc.
  • delimiters help you parse data into meaningful pieces
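
A minimal sketch of how the delimiter drives parsing, using base R's read.table() on in-memory text. The sample rows mirror the example above; the object names are illustrative only.

```r
# The same record stored with two different delimiters
comma_text <- "name,age,sex,location,marital_stat\nSally,28,Female,Michigan,Married"
semi_text  <- "name;age;sex;location;marital_stat\nSally;28;Female;Michigan;Married"

# Tell the parser which character separates the fields
comma_df <- read.table(text = comma_text, sep = ",", header = TRUE)
semi_df  <- read.table(text = semi_text,  sep = ";", header = TRUE)

# With the right delimiter, both parse into five columns
ncol(comma_df)
ncol(semi_df)

# With the wrong delimiter, each row stays one unparsed string
wrong_df <- read.table(text = semi_text, sep = ",", header = TRUE)
ncol(wrong_df)
```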

Knowing File Extensions

File extensions often tell you something about the delimiter and the format of the data inside the file.

  • .csv: comma-separated value files
  • .Rds: a single serialized R object (e.g., a data frame), compressed by default
  • .tsv: tab-separated value files
  • .sav: SPSS data files
  • .xlsx: Excel Workbook files
  • … and others

Knowing the Libraries for File Types

Different libraries offer functions for opening different file types:

  • .csv/.tsv with base R or {readr}
  • .Rds with base R or even {readr}
  • .sav with {foreign} or {haven}
  • .xlsx with {readxl}
  • {vroom} will try to guess the delimiter; see vroom::vroom()
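
One way to picture the extension-to-reader mapping is a small dispatcher. This is only a sketch: read_by_extension() is a hypothetical helper, not a function from any package, and it covers just the formats that base R can read.

```r
# Hypothetical helper: choose a base R reader from the file extension
read_by_extension <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv = read.csv(path),
    tsv = read.delim(path),   # read.delim() assumes tab-separated text
    rds = readRDS(path),
    stop("No reader defined for extension: ", ext)
  )
}

# Usage with a temporary tab-separated file
tmp <- tempfile(fileext = ".tsv")
write.table(data.frame(name = "Sally", age = 28), tmp,
            sep = "\t", row.names = FALSE, quote = FALSE)
dat <- read_by_extension(tmp)
names(dat)
```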

Importing Data Files

Reading data into R

Importing with readr::read_csv()


readr::read_csv(file = <the_file_path>,
                col_names = TRUE
                )
  • Key parameters: file & col_names

  • By default, column/variable names are assumed to exist: col_names = TRUE

Importing with readr::read_csv()


readr::read_csv(file = <the_file_path>,
                col_names = FALSE
                )
  • If no column/variable names: col_names = FALSE
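
The effect of col_names can be sketched without a file on disk. This assumes {readr} 2.0 or later, where literal text wrapped in I() stands in for a file path; the sample data are made up for illustration.

```r
library(readr)

with_header    <- "name,age\nSally,28\nBill,31"
without_header <- "Sally,28\nBill,31"

# First row is treated as column names (the default)
df1 <- read_csv(I(with_header), col_names = TRUE, show_col_types = FALSE)

# No header row: readr generates the names X1, X2, ...
df2 <- read_csv(I(without_header), col_names = FALSE, show_col_types = FALSE)

names(df1)
names(df2)
```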

Importing with readRDS()


readRDS(file = <the_file_path>, 
        refhook = NULL
        )

Key parameter: file

Importing with readRDS()

Passing the file path to file:


readRDS(file = here::here("data", "file_name.Rds"))

Importing to an Object

  • Remember to assign the imported data
  • <data_frame_name> <- readRDS(...)
  • <data_frame_name> <- readr::read_csv(...)

Exporting Data Files

Writing data files from R

Exporting Options

  • Consider how the file will be accessed later
  • Consider how the file will be used later
  • .csv: simple; retains no data types
  • .Rds: retains data types; compressed; readable only as an R object
  • .xlsx: useful for non-technical users
  • feather::write_feather(): for R and Python
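
A small round trip illustrates the difference in retained data types. This sketch uses base R writers and temporary files, and assumes R 4.0 or later (where read.csv() does not convert strings to factors); the survey data frame is invented for the example.

```r
survey <- data.frame(
  name = c("Sally", "Bill"),
  marital_stat = factor(c("Married", "Single"))
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".Rds")

write.csv(survey, csv_path, row.names = FALSE)
saveRDS(survey, rds_path)

from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)

class(from_csv$marital_stat)   # the factor comes back as plain character
class(from_rds$marital_stat)   # the factor is preserved
```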

Exporting with saveRDS()


saveRDS(object = <data_frame_object>, 
        file = <the_file_path>, 
        ascii = FALSE, 
        version = NULL,
        compress = TRUE, 
        refhook = NULL
        )
  • Key parameters: object & file

Exporting with saveRDS()

Passing the data frame to object and the file path to file:


saveRDS(object = <data_frame_object>, 
        file = here::here("data", "file_name.Rds")
        )

Exporting with saveRDS()

As long as arguments are passed in the same order as their intended parameters, you don’t need to name the parameters.


saveRDS(<data_frame_object>, 
        here::here("data", "file_name.Rds")
        )
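
As a quick check of the two calling styles, the sketch below saves the same object once with named arguments (in reversed order) and once positionally. Temporary files stand in for the here::here() paths used above.

```r
dat <- data.frame(name = "Sally", age = 28)

path_named      <- tempfile(fileext = ".Rds")
path_positional <- tempfile(fileext = ".Rds")

saveRDS(file = path_named, object = dat)   # named: order does not matter
saveRDS(dat, path_positional)              # positional: order must match

# Both files hold the same data frame
same <- identical(readRDS(path_named), readRDS(path_positional))
same
```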

What to do after importing data

  • View the data frame
  • Take inventory of what issues to correct

Viewing Data Frames

  • head()/tail(): top/bottom rows of data frame
  • View(): view in tab, from base R or tibble::view()
  • * view_html(): view in Viewer, external function (will take longer)
  • dim(): the number of rows and columns, from base R
  • str(): the data structure (variables/columns), from base R
  • * dplyr::glimpse(): better str() alternative
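
A brief sketch of the base R viewing functions on a toy data frame (the data are invented for illustration):

```r
dat <- data.frame(
  name = c("Sally", "Bill", "Joe", "Ann"),
  age  = c(28, 31, 45, 22)
)

top    <- head(dat, n = 2)   # first two rows
bottom <- tail(dat, n = 2)   # last two rows
size   <- dim(dat)           # rows, then columns

top
bottom
size
str(dat)                     # each column's type and first few values
```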

Taking Inventory of Variable Names

  • Are variable names in a standard format (e.g., all lowercase)?
  • Are _ used for readability (avoid camelCase)?
  • Are variable names coherent?
  • Make a list of what needs fixing/cleaning

Taking Inventory of Variable Data Types

  • Are numeric variables numeric?
  • Are factor variables factors?
  • Are ordered factors ordered?
  • Do extra characters, spaces, etc. need to be removed?
  • Make a list of what needs fixing/cleaning
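
These questions can be answered with a few base R checks. The sketch below uses an invented data frame in which numbers are stored as text and a rating is stored as plain character.

```r
dat <- data.frame(
  age    = c("28", "31", "n/a"),          # numbers stored as text
  rating = c("low", "high", "medium")     # should be an ordered factor
)

col_types <- sapply(dat, class)           # class of every column
col_types

# Values that cannot be converted to numbers become NA
age_num <- suppressWarnings(as.numeric(dat$age))
anyNA(age_num)

is.factor(dat$rating)                     # is it a factor at all?
is.ordered(dat$rating)                    # is it an ordered factor?
```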

Key Functions

  • dplyr::glimpse(): examining data frame structure
  • names(): getting variable names
  • dplyr::rename_with() or dplyr::rename(): renaming variables
  • dplyr::relocate(): moving variables
  • dplyr::select(): selecting variables
  • dplyr::mutate(): creating/modifying variables
  • gsub(): finding/replacing character patterns in vectors
  • car::Recode(): for recoding values in vectors

Taking Inventory: Assessing Variable Types Using glimpse()

<DATA_FRAME> |> glimpse() 

Standardizing/Cleaning Variable Names

Assessing Variable Names Using names()

names(<data_frame>)

[1] "name"  "age"  "sex"  "location"  "marital_stat" 

Renaming Variables Using names()

names(<data_frame>)

[1] "name"  "age"  "sex"  "location"  "marital_stat" 
  • Because names(<data_frame>) contains a vector of variable names, we can assign a vector of different names to the object.
  • names(<data_frame>) <- c("first_name", "age", "sex", "location", "marital_status")
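
A runnable sketch of the assignment, using a one-row data frame built from the example record. The single-name replacement at the end is an additional illustration, not part of the slide.

```r
dat <- data.frame(name = "Sally", age = 28, sex = "Female",
                  location = "Michigan", marital_stat = "Married")

# Replace the whole vector of names (its length must match the column count)
names(dat) <- c("first_name", "age", "sex", "location", "marital_status")

# Or replace one name by matching its current value
names(dat)[names(dat) == "location"] <- "state"

names(dat)
```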

Renaming Variables Using dplyr::rename_with()

rename_with(.data,
            .fn, 
            .cols
            )

Offers greater control, more complex:

  • .data: the data frame containing the variables
  • .fn: the function for renaming
  • .cols: the columns to rename

Renaming All Variables Using everything()

rename_with(.data = <data_frame>,    
            .fn = tolower,           # convert names to lowercase
            .cols = everything()     # apply to all variables
            )

Renaming Variables Using dplyr::rename_with() and |>

When piping with |> or %>%, .data is omitted because the pipe supplies the data frame as the first argument.

<DATA_FRAME> |> 
     rename_with(.fn = tolower,           
                 .cols = everything()
                 )

Renaming Variable by Position in Vector

rename_with(.data = <data_frame>,    
            .fn = tolower,           # convert names to lowercase
            .cols = 1:5              # columns 1 through 5
            )

Renaming Variables by Characters Using starts_with()

rename_with(.data = <data_frame>,    
            .fn = tolower,            # convert names to lowercase
            .cols = starts_with("w")  # only columns whose names start with "w"
            )
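
The three ways of choosing columns can be compared side by side. This sketch assumes {dplyr} is installed and R 4.1 or later for the native pipe; the data frame and its column names are made up.

```r
library(dplyr)

dat <- data.frame(Name = "Sally", AGE = 28, Wave_1 = 3, Wave_2 = 4)

# Lowercase every variable name
all_lower <- dat |> rename_with(.fn = tolower, .cols = everything())

# Lowercase only the first two columns, selected by position
first_two <- dat |> rename_with(.fn = tolower, .cols = 1:2)

# Lowercase only names that begin with "w"
# (starts_with() ignores case by default)
w_only <- dat |> rename_with(.fn = tolower, .cols = starts_with("w"))

names(all_lower)
names(first_two)
names(w_only)
```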

Replacing/Removing Characters Using gsub()

Parameters/Arguments:

  • pattern: character string (or regular expression) to find (needle)
  • x, text: a character vector w/in which the pattern exists (haystack)
  • replacement: a character replacement for pattern matches
  • … others

Replacing/Removing Characters Using gsub()

Replace spaces with nothing

gsub(pattern = " ",                         # a space character                
     replacement = "",                      # empty string            
     x = c("Bill    ", "Sally ", "  Joe"),  # the vector to match               
     ignore.case = FALSE, 
     perl = FALSE,
     fixed = FALSE, 
     useBytes = FALSE
     )
[1] "Bill"  "Sally" "Joe"  
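
Note that a plain space pattern removes every space, including spaces inside a value. The sketch below contrasts that with trimming only the ends; the regular expression and trimws() are base R alternatives added for illustration.

```r
x <- c("Bill    ", " Sally Ann ", "  Joe")

# Removes every space, even the one inside "Sally Ann"
no_spaces <- gsub(pattern = " ", replacement = "", x = x)

# Removes only leading and trailing whitespace
trimmed <- gsub(pattern = "^\\s+|\\s+$", replacement = "", x = x)

# Base R shortcut for trimming both ends
trimmed2 <- trimws(x)

no_spaces
trimmed
```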

Replacing/Removing Characters (Cont.)

For more complicated string repair, use {stringr}, which we will address later.

stringr::str_replace_all()

stringr::str_replace_all(
  string = c("B-i-l-l    ", "S-all-y ", "  $Joe"), 
  pattern = "[\\$\\s-]",
  replacement = ""
  )
[1] "Bill"  "Sally" "Joe"  

Taking Inventory: Assessing Values

  • Do extra characters, spaces, etc. need to be removed?
  • Do character response values need to be recoded into numbers?
  • Do numeric response values need to be recoded to characters/factors?

Renaming Variables

Relocating/Moving Variable Position

We can use dplyr::relocate() to change the position of variables in the data frame.

Relocating/Moving Variable Position Using relocate()

Parameters/Arguments:

  • .data: a data frame
  • ...: columns to move
  • .before, .after: destination of the columns selected by ...; supplying neither moves the columns to the left-hand side, and specifying both is an error

Relocating/Moving Variable Position Using relocate()

Assuming you are piping using |> or %>%:

<DATA_FRAME> |> 
     relocate(...)

Relocating/Moving Variable Position Using relocate()

  • relocate(d, .before = a): move relative to another variable
  • relocate(a, .after = c)
  • relocate(a, b, .after = c): move more than one
  • relocate(d, .before = 1): move relative to a variable position

Relocating/Moving Variables By Vector

  • relocate(c(b, a), .before = c)
  • relocate(c("b", "a"), .before = c): when the names are given as a character vector
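
The calls above can be tried on a toy data frame with columns a through d. This sketch assumes {dplyr} is installed and R 4.1 or later for the native pipe.

```r
library(dplyr)

dat <- data.frame(a = 1, b = 2, c = 3, d = 4)

before_a  <- dat |> relocate(d, .before = a) |> names()       # d a b c
after_c   <- dat |> relocate(a, .after = c) |> names()        # b c a d
two_after <- dat |> relocate(a, b, .after = c) |> names()     # c a b d
by_pos    <- dat |> relocate(d, .before = 1) |> names()       # d a b c
by_vector <- dat |> relocate(c(b, a), .before = c) |> names() # b a c d

before_a
by_vector
```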

Functions for Changing Variable Type

  • as.numeric(), as.character(), etc.
  • as.factor() or forcats::as_factor()
  • forcats::as_ordered(): for ordered factor
  • forcats::fct_lump(): for lumping infrequent levels into an "Other" category
  • car::Recode(): for recoding values (e.g., 1='male', 'male'=1, etc.)
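
A minimal sketch of two common conversions. It uses base R's factor() with ordered = TRUE rather than the {forcats} functions listed above, and the values are invented.

```r
# Numbers stored as text become numeric
age_chr <- c("28", "31", "45")
age_num <- as.numeric(age_chr)

# Character ratings become an ordered factor with an explicit level order
rating     <- c("low", "high", "medium", "low")
rating_fct <- factor(rating, levels = c("low", "medium", "high"), ordered = TRUE)

mean(age_num)              # arithmetic now works
rating_fct < "high"        # ordered factors support comparisons
as.integer(rating_fct)     # underlying level positions
```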

Selecting Data for Subsets Using dplyr::select()

select(.data,
       ...
       )

Parameters/Arguments:

  • .data: the data frame containing the variables
  • ...: the variable names, vector of names, etc.

Selecting Data for Subsets Using dplyr::select()

  • by name, select(a, b, c, d)
  • for a sequence, select(a:d)
  • dplyr::starts_with(): beginning characters
  • dplyr::ends_with(): ending characters
  • dplyr::contains(): containing a literal string (use dplyr::matches() for a regex)

Selecting Data for Subsets Using dplyr::select()

<DATA_FRAME> |>
  select(c, d)

Selecting Data for Subsets Using dplyr::select()

<DATA_FRAME> |>
  select(starts_with("rt_"))
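
The selection styles can be compared on one small data frame. This sketch assumes {dplyr} is installed and R 4.1 or later for the native pipe; the reaction-time column names are hypothetical.

```r
library(dplyr)

dat <- data.frame(id = 1:2,
                  rt_go = c(350, 410),
                  rt_stop = c(500, 480),
                  acc_go = c(1, 0))

by_name  <- dat |> select(id, acc_go)          # by name
by_range <- dat |> select(id:rt_stop)          # a sequence of adjacent columns
by_start <- dat |> select(starts_with("rt_"))  # names beginning with "rt_"
by_end   <- dat |> select(ends_with("_go"))    # names ending with "_go"

names(by_range)
names(by_start)
names(by_end)
```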

Changing Variable Types Using mutate()

Removing Spaces Using mutate() + gsub()

<data_frame> |>  
   mutate(variable_name = gsub(" ", "", variable_name))

Removing Spaces Using mutate() + gsub()

Specify .data if not piping:

mutate(.data = <data_frame>,
       variable_name = gsub(" ", "", variable_name)
       )

Removing Spaces Using mutate() + gsub()

When piping the data frame, .data is inherited:

<data_frame> |>  
   mutate(variable_name = gsub(" ", "", variable_name))
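
A runnable version of the same pattern, assuming {dplyr} is installed and R 4.1 or later for the native pipe. The data frame is invented; only the name column is modified.

```r
library(dplyr)

dat <- data.frame(name = c("Bill    ", "Sally ", "  Joe"),
                  age  = c(40, 28, 35))

# Overwrite name with a copy that has its spaces removed
cleaned <- dat |>
  mutate(name = gsub(" ", "", name))

cleaned$name
```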

Multiple Piping

A strategy for cleaner code.

Pipe, Pipe, Pipe

<data_frame> |>         # pipe data frame to...
   # rename the variables...
   rename_with(...) |>  # pipe modified data frame to...
   
   # change the contents of the variable
   mutate(...) |>       # pipe modified-modified data frame to... 
   
   # move the location of variable(s)
   relocate(...)

Pipe, Pipe, Pipe, and Assign

Once you are sure the code works as planned, assign.

<data_frame> <-        # reassign the data frame with new content
   <data_frame> |>     # pipe data frame to...

   # rename the variables...
   rename_with(...) |> # pipe modified data frame to...
   
   # change the contents of the variable
   mutate(...) |>      # pipe modified-modified data frame to... 
   
   # move the location of variable(s)
   relocate(...)
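
Filling in the placeholders gives a complete, runnable chain. This is one possible sketch, assuming {dplyr} is installed and R 4.1 or later for the native pipe; the survey data frame and the specific cleaning steps are illustrative.

```r
library(dplyr)

survey <- data.frame(
  Name     = c("Bill    ", "Sally ", "  Joe"),
  AGE      = c(40, 28, 35),
  Location = c("Ohio", "Michigan", "Texas")
)

survey <-                                              # reassign with new content
  survey |>
  rename_with(.fn = tolower, .cols = everything()) |>  # standardize names
  mutate(name = gsub(" ", "", name)) |>                # remove spaces from values
  relocate(location, .after = name)                    # move location next to name

names(survey)
survey$name
```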