library(magrittr)
Importing many delimited files into a single data frame
Libraries
- {here}: 1.0.1: for path management
- {dplyr} 1.1.2: for selecting, filtering, and mutating
- {magrittr} 2.0.3: for code clarity and piping data frame objects
- {vroom}: 1.6.3: for reading many files
Finding Data Files
The goal of this module is provide some solutions to reading in many data files. to find all the files we want to put in a data frame. You could certainly type all the names and put them into a vector but that would be tedious and if you misspelled anything, you would have to troubleshoot an error.
A solution for this pesky problem is to:
- obtain a list of all files (a vector with file names as elements)
- filter down that list to include only the relevant files
- and combine them all
First, use list.files()
to list all files in a particular directory. If you do not define any argument, the function will return files and directories in a the working directory, which you know by now should be your project directory. Passing a directory to path
(the first argument) will return files and directories in that path. If you pass an argument to pattern
, you will get returned files and directories that contain that string in their name.
We will look for .csv
files and assign them to an object.
<- list.files(here::here("data", "tfrrs"), pattern = ".csv") csv_files
list.files()
returns an alphabetized character vector. Because there are many files, there are many elements in the vector. We can take a look at elements 1 through 10 using bracket notation []
on the vector.
1:10] csv_files[
[1] "2010_CMS_Athena_Outdoor_100.csv" "2010_CMS_Athena_Outdoor_10000.csv"
[3] "2010_CMS_Athena_Outdoor_100H.csv" "2010_CMS_Athena_Outdoor_1500.csv"
[5] "2010_CMS_Athena_Outdoor_200.csv" "2010_CMS_Athena_Outdoor_3000.csv"
[7] "2010_CMS_Athena_Outdoor_3000S.csv" "2010_CMS_Athena_Outdoor_400.csv"
[9] "2010_CMS_Athena_Outdoor_400H.csv" "2010_CMS_Athena_Outdoor_4x100.csv"
Looking at the naming convention, you will see files are named by year and school and team and location and event. Underscores separate the names. You might wish to find all files containing another pattern. Modifying the pattern will change the returned file names.
<- list.files(here::here("data", "tfrrs"), pattern = "_100.csv")
csv_files
1:10] csv_files[
[1] "2010_CMS_Athena_Outdoor_100.csv" "2010_CMS_Stag_Outdoor_100.csv"
[3] "2011_CMS_Athena_Outdoor_100.csv" "2011_CMS_Stag_Outdoor_100.csv"
[5] "2012_CMS_Athena_Outdoor_100.csv" "2012_CMS_Stag_Outdoor_100.csv"
[7] "2013_CMS_Athena_Outdoor_100.csv" "2013_CMS_Stag_Outdoor_100.csv"
[9] "2014_CMS_Athena_Outdoor_100.csv" "2014_CMS_Stag_Outdoor_100.csv"
Combining Data Files For Specific Events
The tfrrs data are not all that clean. You can use the logic provided below to:
- Get files by pattern match for a particular event.
- Read them and combine them into a single data frame
- Write out that event uber event data frame.
- Combine these event .csv’s to an uber data frame
- Write out the uber data frame (use
saveRDS()
if you make any changes to variable type (e.g., factor))
The more events you can put in a single data frame, the easier your plotting will be in {ggplot} but remember you can also pass separate data frames at the geom_*()
layer. That said, combining as much as you can might be most helpful. I would recommend dealing first with the field events before dealing with the track events because of the time variable.
Get the File Names by Pattern
Looking at the pattern of file names for field events, you see .csv file names containing patterns like “DT”, “HT”, “JT”, “LJ”, “PV”, “LJ”, “SP”, and “TJ”. Double check that you have all event name types. I would also recommend keeping events measured on a time scale separate from a distance scale.
I will demonstrate the process with two field events. You can apply the logic for all events. We will use vroom::vroom()
to take the file names (full names mind you), open them all up and combine them into a single data frame.
Let’s start with field events like discus:
We will use some regular expressions, or regex, to help us. Because the data are named by year, we can limit the search to files that contain a 4-digit year \\d{4}
so that you don’t also match any aggregated uber files that do not contain a year. Then, the dot star .*
will help with filler patterns. In particular, .
refers to any character (e.g,. digit, alpha character, or any other special character) and *
means zero or more times, so this pattern will search for all files that start with 4 digits followed by anything in the name. Then, the _HT.csv
pattern will restrict the search to files containing that exact character string. All together, you have a search for pattern = "\\d{4}.*_DT.csv"
. This will not return the file without the year.
<- list.files(here::here("data", "tfrrs"),
DT pattern = "\\d{4}.*_DT.csv",
ignore.case = T,
full.names = T
%>%
) ::vroom(.) vroom
Rows: 1743 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Athlete, Year, Mark, Conv, Meet, Meet_Date, Team, Event, Location
dbl (1): Season
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
saveRDS(DT, here::here("data", "tfrrs", "DT.Rds"))
Now how about hammer:
<- list.files(here::here("data", "tfrrs"),
HT pattern = "\\d{4}.*_HT.csv",
full.names = T
%>%
) ::vroom(.) vroom
Rows: 1557 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Athlete, Year, Mark, Conv, Meet, Meet_Date, Team, Event, Location
dbl (1): Season
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
saveRDS(HT, here::here("data", "tfrrs", "HT.Rds"))
Get the Event Files and Combine
For some events, the columns may differ, so you might get an error. If you review the names of columns and locate the pesky dplyr::select()
columns that you do want. If you have not removed the data frame objects just pass those to dplyr::bind_rows()
, which will add the rows of the second data frame below the first data frame so that you have a single data frame.
<- dplyr::bind_rows(DT, HT) FIELD
Save Data
And then combine and write to an uber field event .csv
or .Rds
file.
saveRDS(FIELD, here::here("data", "tfrrs", "TFRRS_FIELD.Rds"))
Session Information
sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] magrittr_2.0.3
loaded via a namespace (and not attached):
[1] vctrs_0.6.3 crayon_1.5.2 cli_3.6.1 knitr_1.43
[5] rlang_1.1.1 xfun_0.40 generics_0.1.3 jsonlite_1.8.7
[9] glue_1.6.2 bit_4.0.5 rprojroot_2.0.3 htmltools_0.5.6
[13] fansi_1.0.4 rmarkdown_2.24 evaluate_0.21 tibble_3.2.1
[17] tzdb_0.4.0 fastmap_1.1.1 yaml_2.3.7 lifecycle_1.0.3
[21] compiler_4.3.1 dplyr_1.1.2 pkgconfig_2.0.3 htmlwidgets_1.6.2
[25] here_1.0.1 rstudioapi_0.15.0 digest_0.6.33 R6_2.5.1
[29] tidyselect_1.2.0 utf8_1.2.3 parallel_4.3.1 vroom_1.6.3
[33] pillar_1.9.0 tools_4.3.1 bit64_4.0.5 archive_1.1.5