file.exists(here::here("data", "cms-top-all-time-2023-swim.xlsx"))
[1] TRUE
Gabriel I. Cook
October 31, 2024
In this module, you will learn how to read and write data files in different formats. Depending how you access data, the process may change so this topic is discussed in a general way with different approaches described. If you find yourself in a bind with reading data, one of these approaches will likely work. In many instances, however, readr::read_csv()
will be your friend when reading files and readr::write_csv()
will be for writing files.
In addition, concepts related to reading files and managing workflow are discussed. By doing so, my goal is to make you aware of some key issues that could cause some problems with data science projects. When reading files,
.csv
, .tsv
, and .fwf
filesFirst, we need an .xlsx
data file. You can obtain one locally or online from a URL.
.zip
File from a URLGitHub handles Excel files in a way that makes downloading them challenging, so we will first download a .zip
file containing the .xlxs
file.
You can accomplish this in two ways:
Download from here, save in your project /data
directory, and unzip
Use the URL to a zip file and download to the project /data
directory, unzip the contents, and remove the .zip
file using the code below
zip_url <- "https://github.com/slicesofdata/dataviz24/raw/main/data/cms-top-all-time-2023-swim.zip"
download.file(zip_url, destfile = here::here("data", "swim.zip"))
unzip(here::here("data", "swim.zip"),
overwrite = TRUE,
exdir = here::here("data")
)
file.remove(here::here("data", "swim.zip"))
If the process worked correctly, the file should exist in the directory.
OK, great! Let’s move to importing files more generally.
{readxl} 1.4.3 lacks the ability to read the file from online. We can, however, read it using {openxlsx}. The problem is that you will only be able to read a the first sheet. If the first sheet is all you need, this can work. Pass the URL to openxlsx::read.xlsx()
and assign it’s contents to an object named DAT
using the assignment operator <-
.
What does the head of the data file look like?
score name year event team
1 525.35 Maia Presti 2015 1-Meter Diving Athena
2 514.70 Makenna Parkinson 2023 1-Meter Diving Athena
3 512.05 Emma Ng Pack 2023 1-Meter Diving Athena
4 494.95 Izzy Doud 2023 1-Meter Diving Athena
5 462.15 Carli Lessard 2015 1-Meter Diving Athena
6 447.70 Alexis Romero 2023 1-Meter Diving Athena
Just an FYI, when you want a different worksheet you will need to pass a sheet name to the sheet
argument. In this case, we saved it as part of the download process. Let’s pass sheet = "swim"
.
time name year event team
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31 Ava Sealander 2022 50 FREE Athena
3 23.49 Kelly Ngo 2016 50 FREE Athena
4 23.71 Helen Liu 2014 50 FREE Athena
5 23.76 Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena
{readxl}
We will use the {readxl} library to handing reading of Excel files. Because Excel files can contain multiple sheets, one goal would be to find out the sheet names using readxl::excel_sheets
(see ?readxl::excel_sheets
). This function takes one argument, which is the path to the file. Passing the path will return the sheet names in that file. We can pass the path string directly into the function or if the file path is already saved as a object, pass that.
In order to read an Excel spreadsheet file, you will need to specify at very least file
and if you want to read a specific sheet other than the first one, then you will need to specify sheet
.
file
: a path to the file, including the file namesheet
: the sheet name to readFirst, let’s assign the file path to an object because we will use this path a few times and we don’t want to keep typing it lest we make an error.
We can examine the worksheet names:
Great, we know know the sheet names. The benefit of passing an object is that you you may wish to pass the path to another function, for example, to read a sheet from the file.
In order to read a sheet, we will use readxl::read_excel()
, which takes the file path
as the first argument and the name of the desired sheet
as the second argument. You might get away with passing only the path
as long as your goal is to read the first sheet because this is the default action. Let’s wrap the function in head()
to see the top.
# A tibble: 6 × 5
score name year event team
<chr> <chr> <chr> <chr> <chr>
1 525.35 Maia Presti 2015 1-Meter Diving Athena
2 514.70 Makenna Parkinson 2023 1-Meter Diving Athena
3 512.05 Emma Ng Pack 2023 1-Meter Diving Athena
4 494.95 Izzy Doud 2023 1-Meter Diving Athena
5 462.15 Carli Lessard 2015 1-Meter Diving Athena
6 447.70 Alexis Romero 2023 1-Meter Diving Athena
The function also turns the file content into special object type knows as a data frame
. A data frame is composed of row and column data. Sometimes data frames are messy but luckily we have a fairly clean file. You can verify using R
’s built-in function is.data.frame()
, which will return TRUE
if it’s a data frame or FALSE
if not. We will assign this to an object
But we don’t want the first sheet. Pass sheet = "swim"
to read that sheet. Also, let’s read in the data and assign it to an object called DAT
which will hold the data frame.
Viewing the head of the data frame, we can see that it is composed of 5 column vectors representing variables with names: time, name, year, event, team.
# A tibble: 6 × 5
time name year event team
<chr> <chr> <chr> <chr> <chr>
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31 Ava Sealander 2022 50 FREE Athena
3 23.49 Kelly Ngo 2016 50 FREE Athena
4 23.71 Helen Liu 2014 50 FREE Athena
5 23.76 Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena
OK, that was fun. In order to demonstrate reading local .csv
files, we will take a detour into saving files.
Despite it’s name, {readr} can save data frames too. Interestingly, at least at the time of this writing, {readr} allows you to write files in excel format even though you cannot open them.
Moving forward, you can save the data frame as a .csv
file using readr::write_csv()
. When writing a file using write_csv()
we need to specify key arguments: x
, the data frame object, and file
, the file name. You do not need to set a delimiter because, given its name, .csv
defaults the separation as a comma ","
. Other arguments may be necessary for other files and goals.
If, however, you specify name of the file (e.g., "my file name.csv"
), the data frame will be written to a default location because no file path was provided. Um, so where is that default location? Well it depends and it’s complicated depending on whether files are .R
or .Rmd
. By default, .Rmd
files will assume the directory from which the .Rmd
file, NOT THE DATA FILE, is opened. But you don’t want to save a data file to the directory where your code lives.
If you are organized, you save your .R
and .Rmd
files in code directory and your data files in a data directory. So yeah, the default is a problem. The solution is to specify the file path using file = the file path
.
When downloading the file, you may have noticed using the {here} library. A discussion of the library was delayed at the time. We will now look a little deeper into how the library simplifies working with file paths within the context of the {readr}.
What’s the best way to handle directories and file paths? Undoutedly, that is the {here} library, assuming of course you are smart enough to be using projects in RStudio
. When you open a file from within a project, {here} will make the project directory the working directory. And if you are organized, your data files will be in a /data
directory inside the project directory. When passing "data"
as the first argument to here::here()
( e.g., here::here("data")
, you will see that the function returns a string containing the full path to the project directory plus the data subdirectory.
Data frames can be written, or saved, as .csv
files using readr::write_csv()
, as r data files using readr::write_rds()
, as Excel .csv
files using readr::write_excel_csv()
, as well as other formats.
If you query R
using ?readr::write_excel_csv
or ?write_excel_csv
having already loaded {readr} using library(readr)
, you will see the arguments.
OK, so let’s save the data frame. But remember, you cannot pass the file path and name used to read the .xlsx
file because the data frame you want to save is not in .xlsx
file format.
If you pass that previous path string (cms-top-all-time-2023-swim.xlsx ), you will overwrite that existing Excel file, which likely is not your end goal for at least two reasons. First, the .csv
would be incorrectly named as .xlsx
and second, you would completely clobber your Excel spreadsheet along with all of the other worksheets that you did not read. Yeah, that would be bad.
So we need to change the file name. Specifically, we only need to change the file extension part of the path. We could simply copy and paste the path and change ".xlsx"
to ".csv"
.
Sure, that will work. But I would also like you to think about smart workflow.
You could avoid hard coding the change of the file extension in order to streamline you workflow. Every time to pass the path and the path changes you will need to change this by hand and doing so could be extremely annoying. For example, if you change the save location or the file name, you’ll need to make updates for all code referencing the path. To avoid potential headaches, we can instead use gsub()
to examine a string, look for a pattern, and replace that pattern with another pattern. All we want to do is to change ".xlsx"
or ".xls"
in the string to ".csv"
. And because we will next want to use this new name for reading later, let’s assign the change to a new string object, file_csv
.
First, let’s see what gsub()
is doing.
[1] "cms-top-all-time-2023-swim.csv"
Assign to an object:
Note: Code was simplified because the arguments were passed in the order expected by the gsub()
function.
Second, pass the path object to write_csv()
:
Did it save? Use file.exists()
.
Remember, all we have done is save the data frame. This new file will contain only the data from the spreadsheet that we read earlier. Before opening this new file, we need to take a detour on general handling of reading files with {readr}.
.csv
files with {readr}The base R
functions for reading/importing and writing/exporting data can be slow and may cause issues on different operating systems. By contrast, {readr} functions operate independently of the OS. As a result, we will generally use the {readr} equivalent to read.csv
files is readr::read_csv()
(note the library reference and function name).
If you query R
using ?readr::read_csv
or read_csv
having already loaded the {readr} library using library(readr)
, you will see how to use the function.
In order to read a data file, you will need to specify at very least file
, which could be a physical file or a remote file. We will address a few ways of reading data.
file
: a path to a file, a connection, or literal data (either a single string or a raw vector)delim = " "
: the delimiter used to separate values; this is ","
for .csv
filescol_names = !append
: the names of columns/variablesSome examples:
.csv
file from a website.csv
file.csv
File Stored on a WebsiteFor example, although the mtcars
data is also a built-in data set in R
, if it were a read actual .csv
file save on some website, you can pass the URL path as the file
. This file does exist on the {tidyverse} github for {readr}
.
Rows: 32 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
Because file
if the first argument of the function, you do not need to reference it specifically. Doing so just eliminates ambiguity for more complicated function calls. You will come across a lot of examples of code that do NOT reference the arguments by name.
Rows: 32 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
By default, readr::read_csv()
tries to guess whether column/variable names are present. If you know they exist, you can set col_names = TRUE
.
readr::read_csv("https://github.com/tidyverse/readr/raw/main/inst/extdata/mtcars.csv", col_names = T)
Rows: 32 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
If the names are present and you set col_names = FALSE
, you will get a mess because {readr} will assume the header row is data just as the rest of the file.
readr::read_csv("https://github.com/tidyverse/readr/raw/main/inst/extdata/mtcars.csv", col_names = F)
Rows: 33 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 33 × 11
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 mpg cyl disp hp drat wt qsec vs am gear carb
2 21 6 160 110 3.9 2.62 16.46 0 1 4 4
3 21 6 160 110 3.9 2.875 17.02 0 1 4 4
4 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
5 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
6 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
7 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
8 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
9 24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
10 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
# ℹ 23 more rows
As you can see, the column names are all prefixed with “X” and the first row is now the name of the headers. names()
or colnames()
will return the column names, so we can apply it and see what happens. We will wrap readr::read_csv()
in names()
. See how this is a problem. You can use colnames()
to test this too.
names(
readr::read_csv("https://github.com/tidyverse/readr/raw/main/inst/extdata/mtcars.csv", col_names = T)
)
Rows: 32 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
.csv
File Stored Locally on your ComputerIf a file actually existed on your computer, the file would not be a URL but rather the path location to where the file is stored.
And now we can read the locale file as before except we are not passing the string name but rather an object (e.g., file_csv
) holding the file path and file name. Voilà.
Rows: 440 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): time, name, year, event, team
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 440 × 5
time name year event team
<chr> <chr> <chr> <chr> <chr>
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31 Ava Sealander 2022 50 FREE Athena
3 23.49 Kelly Ngo 2016 50 FREE Athena
4 23.71 Helen Liu 2014 50 FREE Athena
5 23.76 Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena
7 23.77 Suzia Starzyk 2020 50 FREE Athena
8 23.87 Katie Bilotti 2010 50 FREE Athena
9 23.93 Jenni Rinker 2011 50 FREE Athena
10 24.02 Annika Sharma 2023 50 FREE Athena
# ℹ 430 more rows
.csv
)We will file use readr::read_csv()
to read our data file (viz., cms-top-all-time-2023-swim.csv).
Rows: 440 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): time, name, year, event, team
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 440 × 5
time name year event team
<chr> <chr> <chr> <chr> <chr>
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31 Ava Sealander 2022 50 FREE Athena
3 23.49 Kelly Ngo 2016 50 FREE Athena
4 23.71 Helen Liu 2014 50 FREE Athena
5 23.76 Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena
7 23.77 Suzia Starzyk 2020 50 FREE Athena
8 23.87 Katie Bilotti 2010 50 FREE Athena
9 23.93 Jenni Rinker 2011 50 FREE Athena
10 24.02 Annika Sharma 2023 50 FREE Athena
# ℹ 430 more rows
If there were only data in the file and no names representing variables on the first row, the file might look like that below. We can imitate this by skipping the first row (containing names) using skip =
.
Rows: 439 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): 23.29, Jocelyn Crawford, 2019, 50 FREE, Athena
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 439 × 5
`23.29` `Jocelyn Crawford` `2019` `50 FREE` Athena
<chr> <chr> <chr> <chr> <chr>
1 23.31 Ava Sealander 2022 50 FREE Athena
2 23.49 Kelly Ngo 2016 50 FREE Athena
3 23.71 Helen Liu 2014 50 FREE Athena
4 23.76 Michele Kee 2014 50 FREE Athena
5 23.77 Natalia Orbach-M 2020 50 FREE Athena
6 23.77 Suzia Starzyk 2020 50 FREE Athena
7 23.87 Katie Bilotti 2010 50 FREE Athena
8 23.93 Jenni Rinker 2011 50 FREE Athena
9 24.02 Annika Sharma 2023 50 FREE Athena
10 51.05 Kelly Ngo 2016 100 FREE Athena
# ℹ 429 more rows
See how the first row is assumed to be names? Setting col_names = F
will fix the problem. Putting the arguments on separate rows of R
code might improve code legibility.
Rows: 440 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): X1, X2, X3, X4, X5
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 440 × 5
X1 X2 X3 X4 X5
<chr> <chr> <chr> <chr> <chr>
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31 Ava Sealander 2022 50 FREE Athena
3 23.49 Kelly Ngo 2016 50 FREE Athena
4 23.71 Helen Liu 2014 50 FREE Athena
5 23.76 Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena
7 23.77 Suzia Starzyk 2020 50 FREE Athena
8 23.87 Katie Bilotti 2010 50 FREE Athena
9 23.93 Jenni Rinker 2011 50 FREE Athena
10 24.02 Annika Sharma 2023 50 FREE Athena
# ℹ 430 more rows
But we have no column names now. Setting col_names =
will fix that. Use c()
to combine 4 names, e.g., col_names = c("name1", "name2", "name3", "name4")
.
readr::read_csv(here::here("data", file_csv),
skip = 1,
col_names = c("time", "name", "year", "event")
)
Rows: 440 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): time, name, year, event, X5
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 440 × 5
time name year event X5
<chr> <chr> <chr> <chr> <chr>
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31 Ava Sealander 2022 50 FREE Athena
3 23.49 Kelly Ngo 2016 50 FREE Athena
4 23.71 Helen Liu 2014 50 FREE Athena
5 23.76 Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena
7 23.77 Suzia Starzyk 2020 50 FREE Athena
8 23.87 Katie Bilotti 2010 50 FREE Athena
9 23.93 Jenni Rinker 2011 50 FREE Athena
10 24.02 Annika Sharma 2023 50 FREE Athena
# ℹ 430 more rows
If you have column names that are on row 1 of the data frame, don’t skip that row and instead set col_names = TRUE
to put them in place.
Rows: 440 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): time, name, year, event, team
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 440 × 5
time name year event team
<chr> <chr> <chr> <chr> <chr>
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31 Ava Sealander 2022 50 FREE Athena
3 23.49 Kelly Ngo 2016 50 FREE Athena
4 23.71 Helen Liu 2014 50 FREE Athena
5 23.76 Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena
7 23.77 Suzia Starzyk 2020 50 FREE Athena
8 23.87 Katie Bilotti 2010 50 FREE Athena
9 23.93 Jenni Rinker 2011 50 FREE Athena
10 24.02 Annika Sharma 2023 50 FREE Athena
# ℹ 430 more rows
Luckily, we have both names and data in the file and by default readr::read_csv()
does what we intend.
As mentioned earlier, mtcars
is a data set on cars which is also part of base R
, meaning you do not need to read it from anyplace. R
does this automatically.
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Use names()
to read the column names:
Unfortunately, once you read data into R
what you see is only the returned content from the functions. This approach will not really allow you to perform other operations on the data frame itself very easily because you would have to read the file over and over every time you wished to perform a different operation.
<-
You will want to take the data frame object that is returned by the read.csv()
function and assign it to an object of some name using the assignment operator <-
. Although the concept of assignment will be covered later, for now just understand that we need to make the data more accessible to work with. You could name the object anything you want. Let’s assign it to DAT
standing for data frame
and let’s make it ALL CAPS.
A note about case: R
is a case-sensitive language so object names like DAT
, dat
, DaT
, etc. are possible and can refer to different objects depending on how you assign them. We will use capital letters only because I like to flag objects that are data frame as special and this approach makes them visually identifiable. You could choose your own convention for naming data frames, other objects, variables in data frames, etc. but I don’t recommend being random about it.
read.table()
:read.table()
is a flexible function for reading files because you can specify how the data are separated in rows by setting the sep
argument. A common separation is a comma but you might also have tabs or other special characters.
read.csv()
is a specific case of read.table()
that sets sep = ","
for you so there is no need to pass the argument. read.csv()
is the more common function you will come across for reading .csv
files but read.table()
works the same as long as you set the argument.
read_csv()
from {readr}:There are advantages to using readr::read_csv()
over read.csv()
, which is why we will prefer it. We will assign it to an object named DAT2
.
Rows: 440 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): time, name, year, event, team
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 5
time name year event team
<chr> <chr> <chr> <chr> <chr>
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31 Ava Sealander 2022 50 FREE Athena
3 23.49 Kelly Ngo 2016 50 FREE Athena
4 23.71 Helen Liu 2014 50 FREE Athena
5 23.76 Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena
We can test whether DAT
and DAT2
are the same using a logical test ==
. Notice the two =
. If we use one =
, we will actually assign the contents of DAT2
to DAT
because a single =
in this context (scope) will do the same as <-
. A discussion of the differences is beyond the scope here but suffice it so say <-
is the common practice except when you are writing custom functions. In most cases, assignment inside functions are done with =
because objects created inside a function are not typically needed outside that scope.
Anyway, you can see that the contents are the same even when files are read by different functions. This is wrapped in the all()
function, which will return TRUE
if everything is TRUE
. This is good that the contents are identical.
You should see an object named DAT
that contains the data frame with some swim data. If you want to verify this is actually a data frame object, you can pass the DAT
object into the is.data.frame()
function. The function will return TRUE
if it is and FALSE
if it is not.
tibbles
?Tibbles are different from data frames, see the {tibble} library.
Now the you have the data frame, you can examine some of its contents, for example, the first 6 rows using the head()
function.
time name year event team
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31 Ava Sealander 2022 50 FREE Athena
3 23.49 Kelly Ngo 2016 50 FREE Athena
4 23.71 Helen Liu 2014 50 FREE Athena
5 23.76 Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena
Because header rows do exist atop the .csv
file, specify that they exist by passing TRUE
to the header
argument of the function (e.g., header = TRUE
or header = T
).
time name year event team
1 23.29 Jocelyn Crawford 2019 50 FREE Athena
2 23.31 Ava Sealander 2022 50 FREE Athena
3 23.49 Kelly Ngo 2016 50 FREE Athena
4 23.71 Helen Liu 2014 50 FREE Athena
5 23.76 Michele Kee 2014 50 FREE Athena
6 23.77 Natalia Orbach-M 2020 50 FREE Athena
For many projects, you will have to perform lots of data cleaning. For those instances, you would likely also want to organize your project in a way that you create dedicated .R
used to read and clean the data. Depending on circumstances, you may also wish to create one file for reading data and one for cleaning. You would also have separate files for running analyses or for creating plots. In your report file, you would include a code block for sourcing other scripts as shown here.
source(here::here("r", "read_data.R"))
source(here::here("r", "clean_data.R"))
source(here::here("r", "analyze_data.R"))
source(here::here("r", "create_plots.R"))
The alternative is that you write all of your code in one file and overwhelm yourself because of all of the lines of code. You could, however, think of yourself as members of a team and organize your files as though you were working with others.
When working with teams, you would likely also manage a github repository for your project. Team contributors would be working on separate tasks and separate files. The files might be used by one or all team members. You could imagine how difficult the process would be if there was only one file with all project code. Whether you manage a repo for your project or you do not, be smart and compartmentalize your code into meaningful chunks.
Now, sometimes you might work with very large files over 1 GB in size. If so, read.csv()
and read.table()
will be extremely slow. You’ll want another function. One of the fastest is fread()
that is part of a library that’s not built into base R. The library is {data.table}. {vroom} is a new library that may even be better. See {vroom}.