Reading data files

Author

Gabriel I. Cook

Published

October 31, 2024

Overview

In this module, you will learn how to read and write data files in different formats. Depending how you access data, the process may change so this topic is discussed in a general way with different approaches described. If you find yourself in a bind with reading data, one of these approaches will likely work. In many instances, however, readr::read_csv() will be your friend when reading files and readr::write_csv() will be for writing files.

In addition, concepts related to reading files and managing workflow are discussed. By doing so, my goal is to make you aware of some key issues that could cause some problems with data science projects. When reading files,

To Do

Readings

Task

Libraries

  • {openxlsx} 4.2.5.2: for reading Excel spreadsheets from a URL
  • {readxl} 1.4.3: for reading Excel spreadsheets
  • {readr} 2.1.5: for reading .csv, .tsv, and .fwf files

First, we need an .xlsx data file. You can obtain one locally or online from a URL.

Downloading a .zip File from a URL

GitHub handles Excel files in a way that makes downloading them challenging, so we will first download a .zip file containing the .xlxs file.

You can accomplish this in two ways:

  • Download from here, save in your project /data directory, and unzip

  • Use the URL to a zip file and download to the project /data directory, unzip the contents, and remove the .zip file using the code below

zip_url <- "https://github.com/slicesofdata/dataviz24/raw/main/data/cms-top-all-time-2023-swim.zip"

download.file(zip_url, destfile = here::here("data", "swim.zip"))

unzip(here::here("data", "swim.zip"), 
      overwrite = TRUE, 
      exdir = here::here("data")
      )

file.remove(here::here("data", "swim.zip"))

If the process worked correctly, the file should exist in the directory.

file.exists(here::here("data", "cms-top-all-time-2023-swim.xlsx"))
[1] TRUE

OK, great! Let’s move to importing files more generally.

Reading/Importing Data Files

Excel files from a URL

{readxl} 1.4.3 lacks the ability to read the file from online. We can, however, read it using {openxlsx}. The problem is that you will only be able to read a the first sheet. If the first sheet is all you need, this can work. Pass the URL to openxlsx::read.xlsx() and assign it’s contents to an object named DAT using the assignment operator <-.

URL <- "https://github.com/slicesofdata/dataviz24/raw/main/data/cms-top-all-time-2023-swim.xlsx"

DAT <- openxlsx::read.xlsx(URL, sheet = 1)

What does the head of the data file look like?

head(DAT)
   score              name year          event   team
1 525.35       Maia Presti 2015 1-Meter Diving Athena
2 514.70 Makenna Parkinson 2023 1-Meter Diving Athena
3 512.05      Emma Ng Pack 2023 1-Meter Diving Athena
4 494.95         Izzy Doud 2023 1-Meter Diving Athena
5 462.15     Carli Lessard 2015 1-Meter Diving Athena
6 447.70     Alexis Romero 2023 1-Meter Diving Athena

Just an FYI, when you want a different worksheet you will need to pass a sheet name to the sheet argument. In this case, we saved it as part of the download process. Let’s pass sheet = "swim".

head(
  openxlsx::read.xlsx(URL, sheet = "swim")
)
   time             name year   event   team
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31    Ava Sealander 2022 50 FREE Athena
3 23.49        Kelly Ngo 2016 50 FREE Athena
4 23.71        Helen Liu 2014 50 FREE Athena
5 23.76      Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena

Reading Excel Spreadsheets with {readxl}

We will use the {readxl} library to handing reading of Excel files. Because Excel files can contain multiple sheets, one goal would be to find out the sheet names using readxl::excel_sheets (see ?readxl::excel_sheets). This function takes one argument, which is the path to the file. Passing the path will return the sheet names in that file. We can pass the path string directly into the function or if the file path is already saved as a object, pass that.

In order to read an Excel spreadsheet file, you will need to specify at very least file and if you want to read a specific sheet other than the first one, then you will need to specify sheet.

  • file: a path to the file, including the file name
  • sheet: the sheet name to read

Getting Sheet Names

First, let’s assign the file path to an object because we will use this path a few times and we don’t want to keep typing it lest we make an error.

file_name <- "cms-top-all-time-2023-swim.xlsx"

We can examine the worksheet names:

readxl::excel_sheets(path = here::here("data", file_name))
[1] "diving" "swim"   "relay" 

Great, we know know the sheet names. The benefit of passing an object is that you you may wish to pass the path to another function, for example, to read a sheet from the file.

Reading a Sheet

In order to read a sheet, we will use readxl::read_excel(), which takes the file path as the first argument and the name of the desired sheet as the second argument. You might get away with passing only the path as long as your goal is to read the first sheet because this is the default action. Let’s wrap the function in head() to see the top.

head(
  readxl::read_excel(here::here("data", file_name))
)
# A tibble: 6 × 5
  score  name              year  event          team  
  <chr>  <chr>             <chr> <chr>          <chr> 
1 525.35 Maia Presti       2015  1-Meter Diving Athena
2 514.70 Makenna Parkinson 2023  1-Meter Diving Athena
3 512.05 Emma Ng Pack      2023  1-Meter Diving Athena
4 494.95 Izzy Doud         2023  1-Meter Diving Athena
5 462.15 Carli Lessard     2015  1-Meter Diving Athena
6 447.70 Alexis Romero     2023  1-Meter Diving Athena

The function also turns the file content into special object type knows as a data frame. A data frame is composed of row and column data. Sometimes data frames are messy but luckily we have a fairly clean file. You can verify using R’s built-in function is.data.frame(), which will return TRUE if it’s a data frame or FALSE if not. We will assign this to an object

is.data.frame(readxl::read_excel(here::here("data", file_name)))
[1] TRUE

But we don’t want the first sheet. Pass sheet = "swim" to read that sheet. Also, let’s read in the data and assign it to an object called DAT which will hold the data frame.

DAT <- readxl::read_excel(here::here("data", file_name), sheet = "swim")

Viewing the head of the data frame, we can see that it is composed of 5 column vectors representing variables with names: time, name, year, event, team.

head(DAT)
# A tibble: 6 × 5
  time  name             year  event   team  
  <chr> <chr>            <chr> <chr>   <chr> 
1 23.29 Jocelyn Crawford 2019  50 FREE Athena
2 23.31 Ava Sealander    2022  50 FREE Athena
3 23.49 Kelly Ngo        2016  50 FREE Athena
4 23.71 Helen Liu        2014  50 FREE Athena
5 23.76 Michele Kee      2014  50 FREE Athena
6 23.77 Natalia Orbach-M 2020  50 FREE Athena

OK, that was fun. In order to demonstrate reading local .csv files, we will take a detour into saving files.

Writing/Exporting Data Files

Despite it’s name, {readr} can save data frames too. Interestingly, at least at the time of this writing, {readr} allows you to write files in excel format even though you cannot open them.

Moving forward, you can save the data frame as a .csv file using readr::write_csv(). When writing a file using write_csv() we need to specify key arguments: x, the data frame object, and file, the file name. You do not need to set a delimiter because, given its name, .csv defaults the separation as a comma ",". Other arguments may be necessary for other files and goals.

If, however, you specify name of the file (e.g., "my file name.csv"), the data frame will be written to a default location because no file path was provided. Um, so where is that default location? Well it depends and it’s complicated depending on whether files are .R or .Rmd. By default, .Rmd files will assume the directory from which the .Rmd file, NOT THE DATA FILE, is opened. But you don’t want to save a data file to the directory where your code lives.

If you are organized, you save your .R and .Rmd files in code directory and your data files in a data directory. So yeah, the default is a problem. The solution is to specify the file path using file = the file path.

Managing file paths with {here}

When downloading the file, you may have noticed using the {here} library. A discussion of the library was delayed at the time. We will now look a little deeper into how the library simplifies working with file paths within the context of the {readr}.

What’s the best way to handle directories and file paths? Undoutedly, that is the {here} library, assuming of course you are smart enough to be using projects in RStudio. When you open a file from within a project, {here} will make the project directory the working directory. And if you are organized, your data files will be in a /data directory inside the project directory. When passing "data" as the first argument to here::here() ( e.g., here::here("data"), you will see that the function returns a string containing the full path to the project directory plus the data subdirectory.

here::here("data")
[1] "C:/Users/gcook/Sync/git/dataviz/dataviz24/data"

Writing with {readr}

Data frames can be written, or saved, as .csv files using readr::write_csv(), as r data files using readr::write_rds(), as Excel .csv files using readr::write_excel_csv(), as well as other formats.

If you query R using ?readr::write_excel_csv or ?write_excel_csv having already loaded {readr} using library(readr), you will see the arguments.

OK, so let’s save the data frame. But remember, you cannot pass the file path and name used to read the .xlsx file because the data frame you want to save is not in .xlsx file format.

If you pass that previous path string (cms-top-all-time-2023-swim.xlsx ), you will overwrite that existing Excel file, which likely is not your end goal for at least two reasons. First, the .csv would be incorrectly named as .xlsx and second, you would completely clobber your Excel spreadsheet along with all of the other worksheets that you did not read. Yeah, that would be bad.

So we need to change the file name. Specifically, we only need to change the file extension part of the path. We could simply copy and paste the path and change ".xlsx" to ".csv".

readr::write_csv(DAT, here::here("data", "cms-top-all-time-2023-swim.csv"))

Sure, that will work. But I would also like you to think about smart workflow.

A Workflow Side Note on Strings

You could avoid hard coding the change of the file extension in order to streamline you workflow. Every time to pass the path and the path changes you will need to change this by hand and doing so could be extremely annoying. For example, if you change the save location or the file name, you’ll need to make updates for all code referencing the path. To avoid potential headaches, we can instead use gsub() to examine a string, look for a pattern, and replace that pattern with another pattern. All we want to do is to change ".xlsx" or ".xls" in the string to ".csv". And because we will next want to use this new name for reading later, let’s assign the change to a new string object, file_csv.

First, let’s see what gsub() is doing.

gsub(pattern = ".xlsx|.xls",  
     replacement = ".csv", 
     x = file_name
     )
[1] "cms-top-all-time-2023-swim.csv"

Assign to an object:

file_csv <- gsub(".xlsx|.xls", ".csv", file_name)

Note: Code was simplified because the arguments were passed in the order expected by the gsub() function.

Second, pass the path object to write_csv():

readr::write_csv(x = DAT, 
                 file = here::here("data", file_csv)
                 )

Did it save? Use file.exists().

file.exists(here::here("data", file_csv))
[1] TRUE

Remember, all we have done is save the data frame. This new file will contain only the data from the spreadsheet that we read earlier. Before opening this new file, we need to take a detour on general handling of reading files with {readr}.

Reading .csv files with {readr}

The base R functions for reading/importing and writing/exporting data can be slow and may cause issues on different operating systems. By contrast, {readr} functions operate independently of the OS. As a result, we will generally use the {readr} equivalent to read.csv files is readr::read_csv() (note the library reference and function name).

If you query R using ?readr::read_csv or read_csv having already loaded the {readr} library using library(readr), you will see how to use the function.

In order to read a data file, you will need to specify at very least file, which could be a physical file or a remote file. We will address a few ways of reading data.

  • file: a path to a file, a connection, or literal data (either a single string or a raw vector)
  • delim = " ": the delimiter used to separate values; this is "," for .csv files
  • col_names = !append: the names of columns/variables

Some examples:

  • Reading a .csv file from a website
  • Reading a locale .csv file
  • Reading raw data that is comma-separated

Reading a .csv File Stored on a Website

For example, although the mtcars data is also a built-in data set in R, if it were a read actual .csv file save on some website, you can pass the URL path as the file. This file does exist on the {tidyverse} github for {readr}.

readr::read_csv(file = "https://github.com/tidyverse/readr/raw/main/inst/extdata/mtcars.csv")
Rows: 32 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

Because file if the first argument of the function, you do not need to reference it specifically. Doing so just eliminates ambiguity for more complicated function calls. You will come across a lot of examples of code that do NOT reference the arguments by name.

readr::read_csv("https://github.com/tidyverse/readr/raw/main/inst/extdata/mtcars.csv")
Rows: 32 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

By default, readr::read_csv() tries to guess whether column/variable names are present. If you know they exist, you can set col_names = TRUE.

readr::read_csv("https://github.com/tidyverse/readr/raw/main/inst/extdata/mtcars.csv", col_names = T)
Rows: 32 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

If the names are present and you set col_names = FALSE, you will get a mess because {readr} will assume the header row is data just as the rest of the file.

readr::read_csv("https://github.com/tidyverse/readr/raw/main/inst/extdata/mtcars.csv", col_names = F)
Rows: 33 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 33 × 11
   X1    X2    X3    X4    X5    X6    X7    X8    X9    X10   X11  
   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 mpg   cyl   disp  hp    drat  wt    qsec  vs    am    gear  carb 
 2 21    6     160   110   3.9   2.62  16.46 0     1     4     4    
 3 21    6     160   110   3.9   2.875 17.02 0     1     4     4    
 4 22.8  4     108   93    3.85  2.32  18.61 1     1     4     1    
 5 21.4  6     258   110   3.08  3.215 19.44 1     0     3     1    
 6 18.7  8     360   175   3.15  3.44  17.02 0     0     3     2    
 7 18.1  6     225   105   2.76  3.46  20.22 1     0     3     1    
 8 14.3  8     360   245   3.21  3.57  15.84 0     0     3     4    
 9 24.4  4     146.7 62    3.69  3.19  20    1     0     4     2    
10 22.8  4     140.8 95    3.92  3.15  22.9  1     0     4     2    
# ℹ 23 more rows

As you can see, the column names are all prefixed with “X” and the first row is now the name of the headers. names() or colnames() will return the column names, so we can apply it and see what happens. We will wrap readr::read_csv() in names(). See how this is a problem. You can use colnames() to test this too.

names(
  readr::read_csv("https://github.com/tidyverse/readr/raw/main/inst/extdata/mtcars.csv", col_names = T)
  )
Rows: 32 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

Reading a .csv File Stored Locally on your Computer

If a file actually existed on your computer, the file would not be a URL but rather the path location to where the file is stored.

And now we can read the locale file as before except we are not passing the string name but rather an object (e.g., file_csv) holding the file path and file name. Voilà.

readr::read_csv(here::here("data", file_csv))
Rows: 440 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): time, name, year, event, team

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 440 × 5
   time  name             year  event   team  
   <chr> <chr>            <chr> <chr>   <chr> 
 1 23.29 Jocelyn Crawford 2019  50 FREE Athena
 2 23.31 Ava Sealander    2022  50 FREE Athena
 3 23.49 Kelly Ngo        2016  50 FREE Athena
 4 23.71 Helen Liu        2014  50 FREE Athena
 5 23.76 Michele Kee      2014  50 FREE Athena
 6 23.77 Natalia Orbach-M 2020  50 FREE Athena
 7 23.77 Suzia Starzyk    2020  50 FREE Athena
 8 23.87 Katie Bilotti    2010  50 FREE Athena
 9 23.93 Jenni Rinker     2011  50 FREE Athena
10 24.02 Annika Sharma    2023  50 FREE Athena
# ℹ 430 more rows

Reading Raw Data that is Comma-Separated (e.g., .csv)

We will file use readr::read_csv() to read our data file (viz., cms-top-all-time-2023-swim.csv).

readr::read_csv(here::here("data", file_csv))
Rows: 440 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): time, name, year, event, team

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 440 × 5
   time  name             year  event   team  
   <chr> <chr>            <chr> <chr>   <chr> 
 1 23.29 Jocelyn Crawford 2019  50 FREE Athena
 2 23.31 Ava Sealander    2022  50 FREE Athena
 3 23.49 Kelly Ngo        2016  50 FREE Athena
 4 23.71 Helen Liu        2014  50 FREE Athena
 5 23.76 Michele Kee      2014  50 FREE Athena
 6 23.77 Natalia Orbach-M 2020  50 FREE Athena
 7 23.77 Suzia Starzyk    2020  50 FREE Athena
 8 23.87 Katie Bilotti    2010  50 FREE Athena
 9 23.93 Jenni Rinker     2011  50 FREE Athena
10 24.02 Annika Sharma    2023  50 FREE Athena
# ℹ 430 more rows

If there were only data in the file and no names representing variables on the first row, the file might look like that below. We can imitate this by skipping the first row (containing names) using skip =.

readr::read_csv(here::here("data", file_csv), skip = 1)
Rows: 439 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): 23.29, Jocelyn Crawford, 2019, 50 FREE, Athena

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 439 × 5
   `23.29` `Jocelyn Crawford` `2019` `50 FREE` Athena
   <chr>   <chr>              <chr>  <chr>     <chr> 
 1 23.31   Ava Sealander      2022   50 FREE   Athena
 2 23.49   Kelly Ngo          2016   50 FREE   Athena
 3 23.71   Helen Liu          2014   50 FREE   Athena
 4 23.76   Michele Kee        2014   50 FREE   Athena
 5 23.77   Natalia Orbach-M   2020   50 FREE   Athena
 6 23.77   Suzia Starzyk      2020   50 FREE   Athena
 7 23.87   Katie Bilotti      2010   50 FREE   Athena
 8 23.93   Jenni Rinker       2011   50 FREE   Athena
 9 24.02   Annika Sharma      2023   50 FREE   Athena
10 51.05   Kelly Ngo          2016   100 FREE  Athena
# ℹ 429 more rows

See how the first row is assumed to be names? Setting col_names = F will fix the problem. Putting the arguments on separate rows of R code might improve code legibility.

readr::read_csv(here::here("data", file_csv), 
                skip = 1,
                col_names = F
                )
Rows: 440 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): X1, X2, X3, X4, X5

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 440 × 5
   X1    X2               X3    X4      X5    
   <chr> <chr>            <chr> <chr>   <chr> 
 1 23.29 Jocelyn Crawford 2019  50 FREE Athena
 2 23.31 Ava Sealander    2022  50 FREE Athena
 3 23.49 Kelly Ngo        2016  50 FREE Athena
 4 23.71 Helen Liu        2014  50 FREE Athena
 5 23.76 Michele Kee      2014  50 FREE Athena
 6 23.77 Natalia Orbach-M 2020  50 FREE Athena
 7 23.77 Suzia Starzyk    2020  50 FREE Athena
 8 23.87 Katie Bilotti    2010  50 FREE Athena
 9 23.93 Jenni Rinker     2011  50 FREE Athena
10 24.02 Annika Sharma    2023  50 FREE Athena
# ℹ 430 more rows

But we have no column names now. Setting col_names = will fix that. Use c() to combine 4 names, e.g., col_names = c("name1", "name2", "name3", "name4").

readr::read_csv(here::here("data", file_csv), 
                skip = 1,
                col_names = c("time", "name", "year", "event")
                )
Rows: 440 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): time, name, year, event, X5

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 440 × 5
   time  name             year  event   X5    
   <chr> <chr>            <chr> <chr>   <chr> 
 1 23.29 Jocelyn Crawford 2019  50 FREE Athena
 2 23.31 Ava Sealander    2022  50 FREE Athena
 3 23.49 Kelly Ngo        2016  50 FREE Athena
 4 23.71 Helen Liu        2014  50 FREE Athena
 5 23.76 Michele Kee      2014  50 FREE Athena
 6 23.77 Natalia Orbach-M 2020  50 FREE Athena
 7 23.77 Suzia Starzyk    2020  50 FREE Athena
 8 23.87 Katie Bilotti    2010  50 FREE Athena
 9 23.93 Jenni Rinker     2011  50 FREE Athena
10 24.02 Annika Sharma    2023  50 FREE Athena
# ℹ 430 more rows

If you have column names that are on row 1 of the data frame, don’t skip that row and instead set col_names = TRUE to put them in place.

readr::read_csv(here::here("data", file_csv), 
                col_names = T
                )
Rows: 440 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): time, name, year, event, team

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 440 × 5
   time  name             year  event   team  
   <chr> <chr>            <chr> <chr>   <chr> 
 1 23.29 Jocelyn Crawford 2019  50 FREE Athena
 2 23.31 Ava Sealander    2022  50 FREE Athena
 3 23.49 Kelly Ngo        2016  50 FREE Athena
 4 23.71 Helen Liu        2014  50 FREE Athena
 5 23.76 Michele Kee      2014  50 FREE Athena
 6 23.77 Natalia Orbach-M 2020  50 FREE Athena
 7 23.77 Suzia Starzyk    2020  50 FREE Athena
 8 23.87 Katie Bilotti    2010  50 FREE Athena
 9 23.93 Jenni Rinker     2011  50 FREE Athena
10 24.02 Annika Sharma    2023  50 FREE Athena
# ℹ 430 more rows

Luckily, we have both names and data in the file and by default readr::read_csv() does what we intend.

Reading Data from a Library/Package

As mentioned earlier, mtcars is a data set on cars which is also part of base R, meaning you do not need to read it from anyplace. R does this automatically.

mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
# or 
print(mtcars)
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Use names() to read the column names:

names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

Assigning a Data Frame to an Object

Unfortunately, once you read data into R what you see is only the returned content from the functions. This approach will not really allow you to perform other operations on the data frame itself very easily because you would have to read the file over and over every time you wished to perform a different operation.

Object Assignment using <-

You will want to take the data frame object that is returned by the read.csv() function and assign it to an object of some name using the assignment operator <- . Although the concept of assignment will be covered later, for now just understand that we need to make the data more accessible to work with. You could name the object anything you want. Let’s assign it to DAT standing for data frame and let’s make it ALL CAPS.

A note about case: R is a case-sensitive language so object names like DAT, dat, DaT, etc. are possible and can refer to different objects depending on how you assign them. We will use capital letters only because I like to flag objects that are data frame as special and this approach makes them visually identifiable. You could choose your own convention for naming data frames, other objects, variables in data frames, etc. but I don’t recommend being random about it.

Using read.table():

read.table() is a flexible function for reading files because you can specify how the data are separated in rows by setting the sep argument. A common separation is a comma but you might also have tabs or other special characters.

DAT <- read.table(file = here::here("data", file_csv),
                  sep = ",",
                  header = T
                  )

read.csv() is a specific case of read.table() that sets sep = "," for you so there is no need to pass the argument. read.csv() is the more common function you will come across for reading .csv files but read.table() works the same as long as you set the argument.

DAT <- read.csv(here::here("data", file_csv))

Using read_csv() from {readr}:

There are advantages to using readr::read_csv() over read.csv(), which is why we will prefer it. We will assign it to an object named DAT2.

DAT2 <- readr::read_csv(here::here("data", file_csv))
Rows: 440 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): time, name, year, event, team

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(DAT2)
# A tibble: 6 × 5
  time  name             year  event   team  
  <chr> <chr>            <chr> <chr>   <chr> 
1 23.29 Jocelyn Crawford 2019  50 FREE Athena
2 23.31 Ava Sealander    2022  50 FREE Athena
3 23.49 Kelly Ngo        2016  50 FREE Athena
4 23.71 Helen Liu        2014  50 FREE Athena
5 23.76 Michele Kee      2014  50 FREE Athena
6 23.77 Natalia Orbach-M 2020  50 FREE Athena

We can test whether DAT and DAT2 are the same using a logical test ==. Notice the two =. If we use one =, we will actually assign the contents of DAT2 to DAT because a single = in this context (scope) will do the same as <-. A discussion of the differences is beyond the scope here but suffice it so say <- is the common practice except when you are writing custom functions. In most cases, assignment inside functions are done with = because objects created inside a function are not typically needed outside that scope.

Anyway, you can see that the contents are the same even when files are read by different functions. This is wrapped in the all() function, which will return TRUE if everything is TRUE. This is good that the contents are identical.

all(DAT == DAT2)
[1] TRUE

Data as a Data Frame

You should see an object named DAT that contains the data frame with some swim data. If you want to verify this is actually a data frame object, you can pass the DAT object into the is.data.frame() function. The function will return TRUE if it is and FALSE if it is not.

is.data.frame(DAT)
[1] TRUE
is.data.frame(DAT2)  # tibbles are also data frames
[1] TRUE

Are they both tibbles?

Tibbles are different from data frames, see the {tibble} library.

tibble::is_tibble(DAT)
[1] FALSE
tibble::is_tibble(DAT2)
[1] TRUE

Now the you have the data frame, you can examine some of its contents, for example, the first 6 rows using the head() function.

head(DAT)    # hmm, something seems off.
   time             name year   event   team
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31    Ava Sealander 2022 50 FREE Athena
3 23.49        Kelly Ngo 2016 50 FREE Athena
4 23.71        Helen Liu 2014 50 FREE Athena
5 23.76      Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena

Because header rows do exist atop the .csv file, specify that they exist by passing TRUE to the header argument of the function (e.g., header = TRUE or header = T).

DAT <- read.table(here::here("data", file_csv),
                  sep = ",", 
                  header = TRUE
                  )

head(DAT)    # Perfect!
   time             name year   event   team
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31    Ava Sealander 2022 50 FREE Athena
3 23.49        Kelly Ngo 2016 50 FREE Athena
4 23.71        Helen Liu 2014 50 FREE Athena
5 23.76      Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena

Project Organization

For many projects, you will have to perform lots of data cleaning. For those instances, you would likely also want to organize your project in a way that you create dedicated .R used to read and clean the data. Depending on circumstances, you may also wish to create one file for reading data and one for cleaning. You would also have separate files for running analyses or for creating plots. In your report file, you would include a code block for sourcing other scripts as shown here.

source(here::here("r", "read_data.R"))
source(here::here("r", "clean_data.R"))
source(here::here("r", "analyze_data.R"))
source(here::here("r", "create_plots.R"))

The alternative is that you write all of your code in one file and overwhelm yourself because of all of the lines of code. You could, however, think of yourself as members of a team and organize your files as though you were working with others.

When working with teams, you would likely also manage a github repository for your project. Team contributors would be working on separate tasks and separate files. The files might be used by one or all team members. You could imagine how difficult the process would be if there was only one file with all project code. Whether you manage a repo for your project or you do not, be smart and compartmentalize your code into meaningful chunks.

Reading Large Files

Now, sometimes you might work with very large files over 1 GB in size. If so, read.csv() and read.table() will be extremely slow. You’ll want another function. One of the fastest is fread() that is part of a library that’s not built into base R. The library is {data.table}. {vroom} is a new library that may even be better. See {vroom}.