Project Management 02: R Projects, Git, and GitHub
Author
Gabriel I. Cook
Published
October 30, 2024
Overview
This module focuses on getting organized. Rather than save files in a haphazard way that will just introduce stress to your life, we will focus on creating order. The best way to create order and stay organized is to 1) create projects in RStudio, 2) create directories and sub-directories that leave no ambiguity about where your files are, and 3) manage all directory paths and file paths simply using the {here} library. Another way is to connect that project with a remote repository saved someplace like GitHub for collaboration. In certain classes (and for your team project), you will use Git to interact with remote repositories connected to Projects in RStudio.
In order to maintain organization for this class and the project, you will set up a class folder (aka directory) on your computer. You will then create an RStudio project and connect it to a remote private repository on your GitHub account. The reason for its privacy is because of data related to certain exercises.
You will use this RStudio project for all class exercises and homework so that there is no ambiguity about where your files are saved. Finally, you will create directories within your new project directory so that you have an organized directory structure for storing your files. Systems paths for project files and directories will be manage using the {here} library. This process will also ensure that each student’s computer is configured in the same manner.
Reading through these steps, however, will facilitate your ability to apply the concepts and run the associated functions in class. Thus, all students will gain some basic experience with Git commands and a remote repository. Students will be collaborators of a repository for their team project. Coding leads will carry the responsibility of maintaining the organization of the team’s private repository.
Libraries Used
{usethis}: 2.2.3: for project workflow automation
{gitcreds}: 0.1.2: for querying git credentials
{gh}: 1.4.1: for querying the github api
{gert}: 2.0.1: optional R library approach for git commands
Readings and Preparation
Before Class: First, watch course videos (and/or read) to familiarize yourself with the concepts rather than master them. I will assume that you attend class with some level of basic understanding of concepts and working of functions. The goal of reading should be to understand and implement code functions as well as support your understanding and help your troubleshooting of problems. This cannot happen if you just read the content without interacting with it, however reading is absolutely essential to being successful during class time.
Complete the items in the To Do: Steps of the Task section.
Class: In class, some functions and concepts will be introduced and we will practice implementing code through exercises.
Warning
Do not try to cheat the system and jump ahead. If you do, just like playing the Monopoly board game, your chance card may read “Go to jail. Go directly to jail. Do not pass go. Do not collect $200.” In other words, you cannot complete these steps without ensuring that your credentials are set. You will run into errors and try to contact me. If the following code does not return information for your login, your github account, scopes, and a token, you will be unable to proceed. If it does but your token is expired, you cannot proceed. Ensure you have set your credentials.
gh::gh_whoami()
To Do: Steps of the Task
Following the sections below, you will:
Create a Version-Control Project with RStudio
Name it dataviz-exercises (for class exercises and your homework)
Make file edits, stage those edits, and commit them
Push commits to GitHub
In class, we will practice using RStudio along with some simple Git commands for adding, committing, and pushing files.
Creating a Local Directory for Class
Create a folder (aka directory) named "dataviz" (yes, all lowercase) on your computer. I recommend creating the directory someplace where you might not accidentally delete it. Create only one so as not to confuse yourself.
Connecting the Repository to an RStudio Project
You should already have a repository on GitHub named “dataviz-exercises” which you created from this template repository. You will now create an RStudio project and connect it to that remote repository on your GitHub account.
When you create the project inside your class directory, your directory structure will look like this:
└── dataviz
│ └── dataviz-exercises
In RStudio, File > New Project > Version Control > Git.
In the pop-up, you will see a request for the “repository URL”. Paste the URL of the GitHub repository. This URL will be the same as what you see on your GitHub account. However, we need to add .git to the end of it.
When you create the project, a directory will be created as a sub-directory of your main /dataviz directory. Thus, you will see /dataviz/dataviz-exercises.
WARNING:Do not create the project inside of an existing project’s directory.
Note: I recommend that you also select “Open in new session” in order to compartmentalize projects. When you work on the team project, open the project. When you work on your homework or other class exercises, open your homework project.
Click “Create Project” to create the new project directory, which will create:
a project directory on your computer
a project file with file extension .Rproj
a Git repository or link to the remote GitHub repository for the project (also an RStudio Project)
If the repository already exists on GitHub (and it does in this instance) you should see RStudio flash a connection to GitHub and likely pull the repo contents down to your newly-created project directory. In this case, however, your local Git repository on RStudio will contain few files.
Unzipping the Directory Structure
Although building the directory structure on your own is possible, the repository has a .zip file which contains all of the directories needed for the project. You can unzip them manually or do this in RStudio with the "unzip_compressed_file.R" script. The script has not been tested on a Mac so there may be an error.
Understanding the Directory Structure
Directory structures are used for organization. Each directory and sub directory has a purpose, which is to contain files of a certain type. As long as you know what the goal of the file is, you know where to save it. When working with teams, this common language avoids many problems.
Although there are different ways to create project directory structures and different ways to name those directories, we will use the following structure. Not all directories will be used for all types of projects.
Inside your /dataviz/dataviz-exercises directory your full project directory structure should look like the one below.
└── data
│ └── interim
│ ├── processed
│ └── raw
├── dataviz-exercises.Rproj (the R project file)
├── docs
├── .gitignore (a version-control gitignore file)
├── README.md (a read me file)
├── refs
└── reports
│ ├── figs
│ └── images
└── src
│ ├── data
│ ├── figs
│ ├── functions
│ └── utils
Directory and Sub-Directory Purpose
The purpose of each directory and sub-directory is explained following the structure.
/data: for raw/virgin data files and modified data files
/docs: for document files like the project description, any dictionary of variable names, etc.
/refs: for references, papers, reading materials, and other document
/report: for R Markdown (e.g., .Rmd) report files and their output file types (e.g., .docx, .pdf, .html)
/src: for all source code related files (e.g., .R scripts, functions, .py files, etc.). General scripts can be saved in the top level /src but most of your script files will be saved in /src/figs because you will create figures
More directory descriptions are provided below.
Data Files
Inside /data, add the following sub-directories:
/raw, for /data/raw/: containing raw data files obtained from sources (e.g., .csv, .tsv, .xlxs)
/interim, for /data/interim/: .Rds (highly recommended) files containing intermediate transformed data; cleaned, merged, etc. but not processed fully to be in final form
/processed, for /data/processed/; .Rds (highly recommended) files containing finalized data (e.g., aggregated, summaries, and data frames ready for plotting
NOTE: For this course, you will see me write data as .Rds files using the saveRDS() function because this format will preserve variable formatting which will affect plots.
WARNING: If you process and save those data files as .csv, .xlsx, or similar, you will likely find yourself working harder by recoding solutions you have already performed. I do not recommend this except for final versions that no longer require processing.
Source/Code Files
Inside /src, add the following sub-directories:
/data, for /src/data/: containing .R scripts needed to download or generate data
/figs, for /src/figs/: containing .R scripts needed to create visualizations
/functions, for /src/functions/: containing all .R functions needed that do not belong to libraries
Files for Reports
Inside /report, add the following sub-directories:
/figs for /report/figs/: containing visualization files (e.g., .png) for the report
/images for /report/images/: containing image files (e.g., .png) for the report
When testing your plots, you may wish to add notes or other written content that you can use in conjunction with your plots. In such cases, I recommend creating R Markdown files with meaningful names for taking notes. You can save these reports in the top-level of /report and then source your .R figure script
Below are examples of an .R script for creating your visualizations and an .Rmd file that reads the .R script and renders the .png file within it. These files are also located under the Example Files & Other course tab. Your team report will utilize this same structure, though details and files will be also located under the Project course tab.
Moving forward, save all data to their relevant sub-directories within /data; create all .R code files and scripts in files in /src, including scripts use to create your visualizations and .png plot files; create all exercise or homework R Markdown files (e.g., .Rmd) in /report. Finally, any readings or references can can saved in /refs and any other document files can be saved in /docs. Reserve /report/figs for writing/saving plots or figures. All paths to directories and files for reading and writing files will be managed using the {here} library.
Summary
You now understand how to create projects in R, how to connect projects to remote GitHub repositories, and how to use directories intentionally.
Other Resources
Git Client:
Git clients work like the RStudio Gui option described above but likely much better. One client is GitKraken. * If you find the Terminal command line daunting or limiting, I might recommend a Git Client to use as I am not a big fan of the RStudio interface. * GitKraken is a good option and they have lots of tutorials on their website. GitKraken is seamless to set up. Install, connect your GitHub account, select your repo to add, and voilà. You can stage, commit, and push from there.