Project Management 02: R Projects, Git, and GitHub

Author

Gabriel I. Cook

Published

February 21, 2025

Overview

This module focuses on getting organized. Rather than saving files haphazardly, which only adds stress to your life, we will focus on creating order. There are four main ways to create order and stay organized:

  • 1) create projects in RStudio,
  • 2) create directories and sub-directories that leave no ambiguity about where your files are,
  • 3) manage all directory paths and file paths simply using the {here} library, and
  • 4) use lowercase for file names and replace spaces with hyphens or underscores
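
The naming rule in the last item can be applied mechanically. The following is a minimal sketch in base R, not a function from any package; the helper name `clean_file_name()` is hypothetical, and it assumes you want hyphens rather than underscores and want the file extension left untouched.

```r
# Hypothetical helper (not from any package): lowercase the file name and
# replace runs of whitespace with a single hyphen, keeping the extension as is.
clean_file_name <- function(x) {
  stem <- tools::file_path_sans_ext(x)  # name without the extension
  ext  <- tools::file_ext(x)            # extension without the dot ("" if none)
  stem <- tolower(gsub("\\s+", "-", trimws(stem)))
  ifelse(nzchar(ext), paste0(stem, ".", ext), stem)
}

clean_file_name("My Final Report.Rmd")
```

Swapping `"-"` for `"_"` in the `gsub()` call gives the underscore variant.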

Another way is to connect that project with a remote repository saved someplace like GitHub for collaboration. You will use Git to interact with remote repositories connected to Projects in RStudio.

To keep data projects organized, you will set up a folder (aka directory) on your computer, ideally somewhere you will always know to look. You will then create an RStudio project and connect it to a remote private repository associated with your GitHub account. The repository is private because of the data used in certain exercises.

You will use this RStudio project for all exercises so that there is no ambiguity about where your files are saved. Finally, you will create directories within your new project directory so that you have an organized directory structure for storing your files. System paths for project files and directories will be managed using the {here} library. This process will also ensure that each student’s computer is configured in the same manner.

Reading through these steps will make it easier to apply the concepts and run the associated functions. All RAs will thus gain some basic experience with Git commands and with communicating with a remote repository. RAs will be collaborators on a repository for certain projects.

Libraries Used

  • {usethis}: 2.2.3: for project workflow automation
  • {gitcreds}: 0.1.2: for querying Git credentials
  • {gh}: 1.4.1: for querying the GitHub API
  • {gert}: 2.0.1: optional R library approach for Git commands

Warning

Do not try to cheat the system and jump ahead. If you do, just like playing the Monopoly board game, your chance card may read “Go to jail. Go directly to jail. Do not pass go. Do not collect $200.” In other words, you cannot complete these steps without ensuring that your credentials are set. You will run into errors and try to contact me. If the following code does not return information for your login, your GitHub account, scopes, and a token, you will be unable to proceed. If it does but your token is expired, you also cannot proceed. Ensure you have set your credentials.

gh::gh_whoami()
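
If you prefer a yes/no answer before continuing, the check can be wrapped in a small function. This is only a sketch: `credentials_ok()` is a hypothetical helper name, it assumes a network connection, and it reports `FALSE` both when {gh} is missing and when no usable token is found.

```r
# Hypothetical helper: TRUE only if the GitHub API responds with a login.
credentials_ok <- function() {
  if (!requireNamespace("gh", quietly = TRUE)) return(FALSE)
  # gh_whoami() errors or returns NULL when no valid token is available
  who <- tryCatch(gh::gh_whoami(), error = function(e) NULL)
  !is.null(who) && !is.null(who$login)
}

if (!credentials_ok()) {
  message(
    "Credentials are not set. One common route is ",
    "usethis::create_github_token() followed by gitcreds::gitcreds_set()."
  )
}
```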

To Do: Steps of the Task

Following the sections below, you will:

  1. Create a Version-Control Project with RStudio
    • Name it cdvlab-exercises (for exercises and practice)
  2. Make file edits, stage those edits, and commit them
  3. Push commits to GitHub

In workshops or for data projects, we will use RStudio along with some simple Git commands for adding, committing, and pushing files.
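
For those who would rather stay in the R console, the add, commit, and push cycle can also be sketched with the optional {gert} library listed above. The example below works on a throwaway repository in a temporary directory so that it does not touch your real project; the author name and email are placeholders, and the push is commented out because the demo repository has no remote.

```r
# Sketch of the edit > stage > commit cycle with {gert} on a temporary repo.
stage_and_commit_demo <- function() {
  repo <- tempfile("gert-demo-")
  dir.create(repo)
  gert::git_init(repo)

  # 1. Make a file edit
  writeLines("# Demo", file.path(repo, "README.md"))

  # 2. Stage the edit (like `git add README.md`)
  gert::git_add("README.md", repo = repo)

  # 3. Commit the staged edit (like `git commit -m "Add README"`)
  me <- gert::git_signature("Example User", "user@example.com")  # placeholder
  gert::git_commit("Add README", author = me, repo = repo)

  # 4. In a real project with a remote, push the commit (like `git push`):
  # gert::git_push(repo = repo)

  gert::git_log(repo = repo)
}

if (requireNamespace("gert", quietly = TRUE)) {
  commit_log <- stage_and_commit_demo()
  print(commit_log$message)
}
```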

Creating a Local Directory for all Data Projects

I recommend creating a folder (aka directory) on your computer for managing all of your data-related activities. Such a directory makes finding your projects easy. Name it "data" (yes, all lowercase). I recommend creating the directory someplace where you will not accidentally delete it. Create only one so as not to confuse yourself.
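
You can create this folder through your file browser, or from R. The sketch below assumes your home directory as the parent location, which is only one reasonable choice; `make_data_dir()` is a hypothetical helper name.

```r
# Hypothetical helper: create one "data" directory under a parent folder,
# doing nothing if it already exists.
make_data_dir <- function(parent = path.expand("~")) {
  data_dir <- file.path(parent, "data")
  if (!dir.exists(data_dir)) dir.create(data_dir, recursive = TRUE)
  data_dir
}

# Not run automatically, because it writes to your home directory:
# make_data_dir()
```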

Connecting the Repository to an RStudio Project

You should already have a repository on GitHub named “cdvlab-exercises” which you created from this template repository. You will now create an RStudio project and connect it to that remote repository on your GitHub account.

When you create the project inside your data directory, your directory structure will look like this:

data/
└── cdvlab-exercises/

  1. In RStudio, File > New Project > Version Control > Git.

  2. In the pop-up, you will see a request for the “repository URL”. Paste the URL of the GitHub repository. This URL will be the same as what you see on your GitHub account. However, we need to add .git to the end of it.

    https://github.com/<your_github_username>/cdvlab-exercises.git
  3. When you create the project, a directory will be created as a sub-directory of your main /data directory. Thus, you will see /data/cdvlab-exercises.

WARNING: Do not create the project inside of an existing project’s directory.

Note: I recommend that you also select “Open in new session” in order to compartmentalize projects. When you work on the team project, open the project. When you work on your homework or other class exercises, open your homework project.

  4. Click “Create Project” to create the new project directory, which will create:
    • a project directory on your computer
    • a project file with file extension .Rproj
    • a Git repository or link to the remote GitHub repository for the project (also an RStudio Project)

If the repository already exists on GitHub (and it does in this instance) you should see RStudio flash a connection to GitHub and likely pull the repo contents down to your newly-created project directory. In this case, however, your local Git repository on RStudio will contain few files.
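
The repository URL requested in the dialog follows a fixed pattern, so it can be assembled in code. This is a sketch only: `repo_url()` is a hypothetical helper and "octocat" is a placeholder username. The commented line shows a scripted alternative to the RStudio dialog using {usethis}, which is an assumption about your preferred workflow rather than a required step.

```r
# Hypothetical helper: build the clone URL and make sure it ends in .git
repo_url <- function(user, repo = "cdvlab-exercises") {
  url <- paste0("https://github.com/", user, "/", repo)
  if (!grepl("\\.git$", url)) url <- paste0(url, ".git")
  url
}

repo_url("octocat")  # replace "octocat" with your GitHub username

# Scripted alternative to File > New Project (not run here):
# usethis::create_from_github("octocat/cdvlab-exercises", destdir = "~/data")
```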

Understanding the Directory Structure

Directory structures are used for organization. Each directory and sub-directory has a purpose, which is to contain files of a certain type. As long as you know what the goal of a file is, you know where to save it. When working with teams, this common language avoids many problems.

Although there are different ways to create project directory structures and different ways to name those directories, we will use the following structure. Not all directories will be used for all types of projects.

Inside your /data/cdvlab-exercises directory, your full project directory structure should look like the one below.

cdvlab-exercises/
├── data/
│   ├── interim/
│   ├── processed/
│   └── raw/
├── cdvlab-exercises.Rproj  (the R project file)
├── docs/
├── .gitignore              (a version-control gitignore file)
├── README.md               (a read me file)
├── refs/
├── requirements.R
├── requirements.txt
├── report/
│   ├── figs/
│   └── images/
└── src/
    ├── data/
    ├── figs/
    ├── functions/
    └── utils/
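
If any of these directories are missing from your copy of the project, they can be created in one pass. The sketch below uses base R only; `make_project_dirs()` is a hypothetical helper, and it uses the directory name report/ as it appears in the descriptions that follow. Pass your project root (for example, the folder containing the .Rproj file) as `root`.

```r
# Hypothetical helper: create the project sub-directories under `root`.
# Existing directories are left alone.
make_project_dirs <- function(root) {
  dirs <- c(
    "data/interim", "data/processed", "data/raw",
    "docs", "refs",
    "report/figs", "report/images",
    "src/data", "src/figs", "src/functions", "src/utils"
  )
  paths <- file.path(root, dirs)
  for (p in paths) dir.create(p, recursive = TRUE, showWarnings = FALSE)
  invisible(paths)
}

# Demonstration in a temporary location rather than a real project:
demo_root <- tempfile("cdvlab-exercises-")
make_project_dirs(demo_root)
list.dirs(demo_root, full.names = FALSE, recursive = FALSE)
```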

Directory and Sub-Directory Purpose

The purpose of each top-level directory is explained below.

  • data/: for raw (unmodified) data files and modified data files
  • docs/: for document files like the project description, any dictionary of variable names, etc.
  • refs/: for references, papers, reading materials, and other documents
  • report/: for R Markdown (e.g., .Rmd) report files and their output file types (e.g., .docx, .pdf, .html)
  • src/: for all source-code files (e.g., .R scripts, functions, .py files, etc.). General scripts can be saved in the top level of src/, but most of your script files will be saved in src/figs/ because you will create figures.

More directory descriptions are provided below.

Data Files

Inside data/, add the following sub-directories:

  • raw/, for data/raw/: containing raw data files obtained from sources (e.g., .csv, .tsv, .xlsx)
  • interim/, for data/interim/: .Rds (highly recommended) files containing intermediate transformed data; cleaned, merged, etc. but not processed fully to be in final form
  • processed/, for data/processed/: .Rds (highly recommended) files containing finalized data (e.g., aggregated data, summaries, and data frames ready for plotting)

NOTE: For the lab, you will see me write data as .Rds files using the saveRDS() function because this file format will preserve variable formatting and reduce redoing work later. There are other ways of handling this with {dplyr} functions but I find saveRDS() is the most straightforward.

WARNING: If you process and save those data files as .csv, .xlsx, or similar, you will likely find yourself working harder by recoding solutions you have already performed. I do not recommend this except for final versions that no longer require processing.
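
The difference between the two formats is easy to see with a small round trip. The data frame below is made up purely for illustration, and the files are written to a temporary location. The behavior noted in the comments assumes the default `read.csv()` settings in R 4.0 or later.

```r
# Made-up example data with a factor (custom level order) and a Date column
df <- data.frame(
  condition = factor(c("low", "high", "low"), levels = c("low", "high")),
  date = as.Date(c("2025-01-01", "2025-01-02", "2025-01-03"))
)

rds_file <- tempfile(fileext = ".Rds")
csv_file <- tempfile(fileext = ".csv")

saveRDS(df, rds_file)                       # keeps column types
write.csv(df, csv_file, row.names = FALSE)  # stores plain text only

from_rds <- readRDS(rds_file)
from_csv <- read.csv(csv_file)

class(from_rds$condition)  # still a factor, with the original level order
class(from_csv$condition)  # character: the factor would need to be rebuilt
class(from_csv$date)       # character: the dates would need to be re-parsed
```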

Source/Code Files

Inside src/, add the following sub-directories:

  • data/, for src/data/: containing .R scripts needed to download or generate data
  • figs/, for src/figs/: containing .R scripts needed to create visualizations
  • functions/, for src/functions/: containing all .R functions needed that do not belong to libraries

Files for Reports

Inside report/, add the following sub-directories:

  • figs/, for report/figs/: containing visualization files (e.g., .png) for the report
  • images/, for report/images/: containing image files (e.g., .png) for the report

When testing your plots, you may wish to add notes or other written content to use in conjunction with your plots. In such cases, I recommend creating R Markdown files with meaningful names for taking notes. You can save these reports in the top level of /report and then source your .R figure script from within them.

Below are examples of an .R script for creating your visualizations and an .Rmd file that reads the .R script and renders the .png file within it. These files are also located under the Example Files & Other course tab. Your team report will use this same structure, though details and files will also be located under the Project course tab.
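
A minimal sketch of what such a figure script might look like follows. It is an illustration under stated assumptions, not the course's own example file: the script name (e.g., src/figs/mtcars-scatter.R), the function `save_scatter()`, and the use of the built-in `mtcars` data with base graphics are all hypothetical choices, and the demo writes to a temporary folder instead of your project.

```r
# Hypothetical figure script, e.g. saved as src/figs/mtcars-scatter.R
# Assumption: the built-in mtcars data stands in for your project data.
save_scatter <- function(out_file) {
  # Make sure the target folder (e.g. report/figs) exists
  dir.create(dirname(out_file), recursive = TRUE, showWarnings = FALSE)
  png(out_file, width = 800, height = 600)
  on.exit(dev.off())  # always close the graphics device
  plot(mtcars$wt, mtcars$mpg,
       xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
  invisible(out_file)
}

# Demo: write the figure under a temporary report/figs folder
fig_file <- save_scatter(
  file.path(tempdir(), "report", "figs", "mtcars-scatter.png")
)
```

In an .Rmd file saved in /report, a code chunk could then call `source()` on the script and display the saved file with `knitr::include_graphics()`.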

Moving forward, save all data to their relevant sub-directories within /data; create all .R code files and scripts in /src, including scripts used to create your visualizations and .png plot files; create all exercise or homework R Markdown files (e.g., .Rmd) in /report. Finally, any readings or references can be saved in /refs and any other document files can be saved in /docs. Reserve /report/figs for writing/saving plots or figures. All paths to directories and files for reading and writing files will be managed using the {here} library.
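
As a brief illustration of how {here} builds those paths, the sketch below assumes the library is installed and that you are working inside the project; the file names `survey.csv` and `plot.png` are hypothetical. `here::here()` joins its arguments onto the project root, so the same code works on every computer regardless of where the project folder lives.

```r
if (requireNamespace("here", quietly = TRUE)) {
  # Paths are built relative to the project root, not the working directory
  raw_file <- here::here("data", "raw", "survey.csv")   # hypothetical file
  fig_file_here <- here::here("report", "figs", "plot.png")  # hypothetical file

  print(raw_file)
  # Typical use (not run, since the files are hypothetical):
  # dat <- read.csv(raw_file)
  # saveRDS(dat, here::here("data", "interim", "survey.Rds"))
}
```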

Summary

You now understand how to create projects in R, how to connect projects to remote GitHub repositories, and how to use directories intentionally.

Other Resources

  1. Git Client:

Git clients work like the RStudio GUI option described above but likely much better. One client is GitKraken.

    • If you find the Terminal command line daunting or limiting, I recommend using a Git client, as I am not a big fan of the RStudio interface.
    • GitKraken is a good option, and its website has many tutorials. GitKraken is seamless to set up: install it, connect your GitHub account, select your repo to add, and voilà. You can stage, commit, and push from there.

  2. happygitwithr

Session Info

sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] htmltools_0.5.8.1 DT_0.33           openxlsx_4.2.5.2  vroom_1.6.5      
 [5] lubridate_1.9.3   forcats_1.0.0     stringr_1.5.1     dplyr_1.1.4      
 [9] purrr_1.0.2       readr_2.1.5       tidyr_1.3.1       tibble_3.2.1     
[13] ggplot2_3.5.1     tidyverse_2.0.0  

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        generics_0.1.3    stringi_1.8.4     hms_1.1.3        
 [5] digest_0.6.36     magrittr_2.0.3    evaluate_0.24.0   grid_4.4.1       
 [9] timechange_0.3.0  fastmap_1.2.0     R.oo_1.26.0       rprojroot_2.0.4  
[13] jsonlite_1.8.8    zip_2.3.1         R.utils_2.12.3    fansi_1.0.6      
[17] scales_1.3.0      cli_3.6.3         rlang_1.1.4       crayon_1.5.3     
[21] R.methodsS3_1.8.2 bit64_4.0.5       munsell_0.5.1     withr_3.0.1      
[25] yaml_2.3.10       tools_4.4.1       tzdb_0.4.0        colorspace_2.1-0 
[29] pacman_0.5.1      here_1.0.1        vctrs_0.6.5       R6_2.5.1         
[33] lifecycle_1.0.4   htmlwidgets_1.6.4 bit_4.0.5         pkgconfig_2.0.3  
[37] pillar_1.9.0      gtable_0.3.5      Rcpp_1.0.12       glue_1.7.0       
[41] xfun_0.45         tidyselect_1.2.1  rstudioapi_0.16.0 knitr_1.47       
[45] rmarkdown_2.27    compiler_4.4.1