Importing many delimited files into a single data frame

Author

Gabriel I. Cook

Published

November 7, 2023

Libraries

{here}: 1.0.1: for path management
{dplyr} 1.1.2: for selecting, filtering, and mutating
{magrittr} 2.0.3: for code clarity and piping data frame objects
{vroom}: 1.6.3: for reading many files

library(magrittr)

Finding Data Files

The goal of this module is provide some solutions to reading in many data files. to find all the files we want to put in a data frame. You could certainly type all the names and put them into a vector but that would be tedious and if you misspelled anything, you would have to troubleshoot an error.

A solution for this pesky problem is to:

obtain a list of all files (a vector with file names as elements)
filter down that list to include only the relevant files
and combine them all

First, use list.files() to list all files in a particular directory. If you do not define any argument, the function will return files and directories in a the working directory, which you know by now should be your project directory. Passing a directory to path (the first argument) will return files and directories in that path. If you pass an argument to pattern, you will get returned files and directories that contain that string in their name.

We will look for .csv files and assign them to an object.

csv_files <- list.files(here::here("data", "tfrrs"), pattern = ".csv")

list.files() returns an alphabetized character vector. Because there are many files, there are many elements in the vector. We can take a look at elements 1 through 10 using bracket notation [] on the vector.

csv_files[1:10]

 [1] "2010_CMS_Athena_Outdoor_100.csv"   "2010_CMS_Athena_Outdoor_10000.csv"
 [3] "2010_CMS_Athena_Outdoor_100H.csv"  "2010_CMS_Athena_Outdoor_1500.csv" 
 [5] "2010_CMS_Athena_Outdoor_200.csv"   "2010_CMS_Athena_Outdoor_3000.csv" 
 [7] "2010_CMS_Athena_Outdoor_3000S.csv" "2010_CMS_Athena_Outdoor_400.csv"  
 [9] "2010_CMS_Athena_Outdoor_400H.csv"  "2010_CMS_Athena_Outdoor_4x100.csv"

Looking at the naming convention, you will see files are named by year and school and team and location and event. Underscores separate the names. You might wish to find all files containing another pattern. Modifying the pattern will change the returned file names.

csv_files <- list.files(here::here("data", "tfrrs"), pattern = "_100.csv")


csv_files[1:10]

 [1] "2010_CMS_Athena_Outdoor_100.csv" "2010_CMS_Stag_Outdoor_100.csv"  
 [3] "2011_CMS_Athena_Outdoor_100.csv" "2011_CMS_Stag_Outdoor_100.csv"  
 [5] "2012_CMS_Athena_Outdoor_100.csv" "2012_CMS_Stag_Outdoor_100.csv"  
 [7] "2013_CMS_Athena_Outdoor_100.csv" "2013_CMS_Stag_Outdoor_100.csv"  
 [9] "2014_CMS_Athena_Outdoor_100.csv" "2014_CMS_Stag_Outdoor_100.csv"

Combining Data Files For Specific Events

The tfrrs data are not all that clean. You can use the logic provided below to:

Get files by pattern match for a particular event.
Read them and combine them into a single data frame
Write out that event uber event data frame.
Combine these event .csv’s to an uber data frame
Write out the uber data frame (use saveRDS() if you make any changes to variable type (e.g., factor))

The more events you can put in a single data frame, the easier your plotting will be in {ggplot} but remember you can also pass separate data frames at the geom_*() layer. That said, combining as much as you can might be most helpful. I would recommend dealing first with the field events before dealing with the track events because of the time variable.

Get the File Names by Pattern

Looking at the pattern of file names for field events, you see .csv file names containing patterns like “DT”, “HT”, “JT”, “LJ”, “PV”, “LJ”, “SP”, and “TJ”. Double check that you have all event name types. I would also recommend keeping events measured on a time scale separate from a distance scale.

I will demonstrate the process with two field events. You can apply the logic for all events. We will use vroom::vroom() to take the file names (full names mind you), open them all up and combine them into a single data frame.

Let’s start with field events like discus:

We will use some regular expressions, or regex, to help us. Because the data are named by year, we can limit the search to files that contain a 4-digit year \\d{4} so that you don’t also match any aggregated uber files that do not contain a year. Then, the dot star .* will help with filler patterns. In particular, . refers to any character (e.g,. digit, alpha character, or any other special character) and * means zero or more times, so this pattern will search for all files that start with 4 digits followed by anything in the name. Then, the _HT.csv pattern will restrict the search to files containing that exact character string. All together, you have a search for pattern = "\\d{4}.*_DT.csv". This will not return the file without the year.

DT <- list.files(here::here("data", "tfrrs"), 
           pattern = "\\d{4}.*_DT.csv",
           ignore.case = T,
           full.names = T
           ) %>% 
  vroom::vroom(.)

Rows: 1743 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Athlete, Year, Mark, Conv, Meet, Meet_Date, Team, Event, Location
dbl (1): Season

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

saveRDS(DT, here::here("data", "tfrrs", "DT.Rds"))

Now how about hammer:

HT <- list.files(here::here("data", "tfrrs"), 
           pattern = "\\d{4}.*_HT.csv",
           full.names = T
           ) %>% 
  vroom::vroom(.)

Rows: 1557 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Athlete, Year, Mark, Conv, Meet, Meet_Date, Team, Event, Location
dbl (1): Season

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

saveRDS(HT, here::here("data", "tfrrs", "HT.Rds"))

Get the Event Files and Combine

For some events, the columns may differ, so you might get an error. If you review the names of columns and locate the pesky dplyr::select() columns that you do want. If you have not removed the data frame objects just pass those to dplyr::bind_rows(), which will add the rows of the second data frame below the first data frame so that you have a single data frame.

FIELD <- dplyr::bind_rows(DT, HT)

Save Data

And then combine and write to an uber field event .csv or .Rds file.

saveRDS(FIELD, here::here("data", "tfrrs", "TFRRS_FIELD.Rds"))

Session Information

sessionInfo()

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_2.0.3

loaded via a namespace (and not attached):
 [1] vctrs_0.6.3       crayon_1.5.2      cli_3.6.1         knitr_1.43       
 [5] rlang_1.1.1       xfun_0.40         generics_0.1.3    jsonlite_1.8.7   
 [9] glue_1.6.2        bit_4.0.5         rprojroot_2.0.3   htmltools_0.5.6  
[13] fansi_1.0.4       rmarkdown_2.24    evaluate_0.21     tibble_3.2.1     
[17] tzdb_0.4.0        fastmap_1.1.1     yaml_2.3.7        lifecycle_1.0.3  
[21] compiler_4.3.1    dplyr_1.1.2       pkgconfig_2.0.3   htmlwidgets_1.6.2
[25] here_1.0.1        rstudioapi_0.15.0 digest_0.6.33     R6_2.5.1         
[29] tidyselect_1.2.0  utf8_1.2.3        parallel_4.3.1    vroom_1.6.3      
[33] pillar_1.9.0      tools_4.3.1       bit64_4.0.5       archive_1.1.5