Best Practices in R

Adrian Zetner

2024-01-31

Best Good Enough Practices in R

Introduction

Seminar Goals

What to take away from this seminar
- Ideas, resources, methods to improve future work
What not to take away from this seminar
- Any rush to apply these ideas retroactively to all previous projects
Inspiration and content from sources listed at the end

Project Oriented Workflow
Readability
Reproducibility

Project Oriented Workflow

Why?

Work on More Than One Thing at a Time
Team Collaboration
- Easier concurrent work
- Easy distribution
Start and Stop
- Flexible work schedule
- Checkpoint to save progress
Documentation for Continuity
- Easy resumption
- Context and guidelines for collaborators

Work on More Than One Thing at a Time

While waiting for input from a colleague or for an analysis to finish, you can start another part of the project, or another project entirely, because each one is self-contained.

Collaborate, Communicate, Distribute

Team Collaboration:
- Foster collaboration within a team by structuring the project to enable concurrent contributions.
- Keeping projects self-contained facilitates easy distribution to team members and reporting.

Start and Stop

Flexible Work Sessions:
- Embrace a flexible work schedule, allowing for starting and stopping based on availability and focus.
- Regularly checkpoint the project to save progress.
Documentation for Continuity:
- Utilize thorough documentation for easy resumption and provide context and guidelines for others taking over or collaborating.
- This includes future you!

How?

Standardized organization of files per project
Consistent actions

Organization of Project Directories 📂

Project Organization:
- Folder per project
- Top-level advertisement
  - RStudio/Git/{here} characteristic files
Path Construction with here():
- Utilize here() function
- here package
- Paths relative to top-level

{here} Package 📂

here() displays top-level folder location

library(here)
here()

[1] "C:/Users/azetner/Documents/quarto-presentations"

{here} Package 📂

Build a path to a file in a subdirectory and use it

here("presentations", "20240131-RBP_images", "analysisworkflow.png")

[1] "C:/Users/azetner/Documents/quarto-presentations/presentations/20240131-RBP_images/analysisworkflow.png"

here("presentations/20240131-RBP_images/")

[1] "C:/Users/azetner/Documents/quarto-presentations/presentations/20240131-RBP_images"

arrow.file <- here("presentations/20240124-BWG_images/arrow_dataset.png")

file.info(arrow.file)["size"]

                                                                                                     size
C:/Users/azetner/Documents/quarto-presentations/presentations/20240124-BWG_images/arrow_dataset.png 79441

Folder Structure 📂

Folder Structure:
- Data
- Code
- Documentation
- External scripts
- Outputs
Start project from root
- Console or IDE
- here() for paths
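
As a rough sketch of the layout above, the folders can be created from R in one step. The folder names (data, code, docs, scripts, output) and the temporary project path are assumptions for illustration, not names the seminar prescribes.

```r
# Hypothetical project root; a temporary directory keeps the sketch self-contained
project_root <- file.path(tempdir(), "my-analysis")

# Assumed folder names mirroring the list above
folders <- c("data", "code", "docs", "scripts", "output")

# Create each folder under the project root
for (folder in folders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# List the top-level folders that now exist
created <- list.dirs(project_root, full.names = FALSE, recursive = FALSE)
print(created)
```

In a real project the root would be the folder holding the .Rproj file, and paths into these folders would be built with here() rather than file.path() on a temporary directory.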

RStudio Projects 📂

Maintain project separation
Settings stored in <NAME>.Rproj.
Open Project in RStudio:
- Dedicated R process
- File browser points to Project directory.
- Working directory set to Project.

Version: 1.0
RestoreWorkspace: No
SaveWorkspace: No
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX
AutoAppendNewline: Yes
StripTrailingWhitespace: Yes
LineEndingConversion: Native
BuildType: Package
PackageUseDevtools: Yes
PackageInstallArgs: --no-multiarch --with-keep.source
PackageRoxygenize: rd,collate,namespace

Consistent Actions 💥

Everything that matters should be achieved through saved code

Save Source not the Workspace / Environment 💥

Livestock vs. Pets Analogy from Cloud Computing
- Livestock: managed in herds, disposable
- Pets: unique, precious
Treat R processes like livestock
- Workspace disposability
- Non-reproducible workflows lead to heartache
Explicitly save important objects
Checkpoints for long generation time objects
Design away fear of reproducibility

“Livestock is managed in herds and there is little fuss when individuals are lost or must be sacrificed. A pet, on the other hand, is unique and precious.”
Cultivate a workflow in which you treat R processes (a.k.a. “sessions”) like livestock. Any individual R process and the associated workspace is disposable.
If your workspace is a pet, i.e. it holds precious objects and you aren’t 100% sure you can reproduce them, you are guaranteeing heartache.
Explicitly save important objects generated during analysis
Checkpoints for long generation time objects:
- Isolate each computational or time demanding step in its own script and write the precious object to file
- Can simply reload the object from file while developing downstream scripts
Design your code to explicitly ensure the environment is reproducible
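
A minimal sketch of the checkpoint idea, assuming a slow step that can be wrapped in a function. The function name fit_slow_model, the file name model_fit.rds, and the use of the built-in mtcars data are all placeholders.

```r
# Hypothetical checkpoint location; in a project this might sit in an output folder
checkpoint_file <- file.path(tempdir(), "model_fit.rds")

# Stand-in for a slow step (assumption: replace with the real computation)
fit_slow_model <- function() {
  Sys.sleep(0.1)
  lm(mpg ~ wt, data = mtcars)
}

# Recompute only when no checkpoint exists, otherwise reload from file
if (file.exists(checkpoint_file)) {
  model_fit <- readRDS(checkpoint_file)
} else {
  model_fit <- fit_slow_model()
  saveRDS(model_fit, checkpoint_file)
}

# Downstream scripts can start from the saved object instead of the workspace
reloaded <- readRDS(checkpoint_file)
print(coef(reloaded))
```

Because the precious object lives in a file written by saved code, restarting R costs only the time it takes to read the file back.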
What’s the best way to ensure you’re not too precious about your workspaces?

Use a Blank Slate 💥

R --no-save --no-restore-data

Restart R often 💥

Restart R to wipe environment
Save code not workspaces
Pressure to reinforce correct behaviours
- Ensuring source code recreates important artefacts

Analysis Project Practices ⚡

Use an IDE ⚡

Use an integrated development environment (IDE) to smooth this workflow; I recommend RStudio
RStudio, VSCode, Vim, whatever
Take advantage of a code-aware editor that helps you notice simple errors like typos and unclosed brackets and offers many ways to send code to a running R process. Using an IDE encourages saving code in source files (R/Rmd) rather than typing commands directly into the console
Use it to organize your workflow and guide you towards best practices
“Sometimes people resist the advice to use an IDE because it’s hard to incorporate into their current workflow and dismiss it as something ‘for experts only’. But this gets the direction of causality backwards: long-time and professional coders don’t have good habits because they use an IDE. They use an IDE because it makes it so much easier to follow best practices and build good habits.”
An IDE only helps you stay organized in how you work on software; there is much more to managing software than that

Software Management ⚡

Research Code and Software:
- Varied forms and sizes
- Includes code processing research data, scripts, and workflows
- Scriptable languages like R, Python, shell, etc
- Standalone programs for specific research tasks

Software Management ⚡

What can go wrong with research code?
- What does code do?
- Why did we do it this way?
- No longer works
- Accuracy in question
All software projects can benefit from modular code:
- Readable
- Reusable
- Testable

Software Management ⚡

Comment brief explanations
Functions first
- Clear inputs and outputs
- Meaningful names
- One main task
Ruthlessly eliminate duplication
- Functions
- Data Structures
- Effort

A function is a reusable section of software that can be treated as a black box by the rest of the program. This is like the way we combine actions in everyday life. Suppose that it is teatime. You could get a teabag, put the teabag in a mug, boil the kettle, pour the boiling water into the mug, wait 3 minutes for the tea to brew, remove the teabag, and add milk if desired. It is much easier to think of this as a single function, “make a cup of tea”.
When we’re writing scripts built of functions we should think of them in much the same way and apply certain rules to each
Give each function a brief description. Short is fine, but always include at least one example of how it is used; a good example is worth a thousand words. Where possible, the description should also indicate reasonable values for parameters, dependencies, etc.
Build programs out of short, single-purpose functions with:
- clearly-defined inputs and outputs
- meaningful names (this applies to variables too), taking tab completion and scope into account (a counter can be i; a major object should have a name that describes its purpose)
Functions should have one main task. They can be combined into larger functions or workflows and, if they get too complex, broken down again into single-task functions
- “Make a cup of tea” shouldn’t include “take out the garbage”, but it should contain smaller functions such as “boil kettle” and “prepare cup with teabag”
Eliminate duplication:
Function Usage:
- Write and re-use functions.
- Avoid copy-pasting code.
Data Structure Utilization:
- Use data structures like lists where possible over multiple related objects
- Prefer creating a single structure (e.g., score <- c(1, 2, 3)) over multiple closely-related variables (e.g., score1, score2, score3).
Effort:
- Look for well-maintained libraries before writing your own
- Seek existing code for specific functions.
- Explore language-specific library catalogs (e.g., CRAN for R, PyPI for Python).
- Test libraries before relying on them.
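
The points above can be sketched together: one vector instead of several numbered variables, and one short, documented function instead of copy-pasted arithmetic. The function name rescale_to_unit and the example values are hypothetical.

```r
# Copy-paste style to avoid: one variable and one line of arithmetic per score
# score1 <- 1; score2 <- 2; score3 <- 3

# One data structure instead of several closely related variables
scores <- c(1, 2, 3)

#' Rescale a numeric vector to the range 0 to 1
#'
#' @param x A numeric vector with at least two distinct values.
#' @returns A numeric vector the same length as `x`.
#' @examples
#' rescale_to_unit(c(1, 2, 3))
rescale_to_unit <- function(x) {
  value_range <- range(x, na.rm = TRUE)
  (x - value_range[1]) / (value_range[2] - value_range[1])
}

# The same function serves every vector that needs rescaling
scaled_scores <- rescale_to_unit(scores)
print(scaled_scores)
```

The function has one task, a name that says what it does, and a description with an example, so it can be treated as a black box by the rest of a script.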

Data Management 💽

Why Data Management?
- Data loss / corruption
- Confusion about provenance
- Version

Data Management 💽

Data Management
- Save the raw data
- Ensure raw data is backed up
- Create analysis-friendly data
  - Create the data you wish to see in the world
    - Self explanatory naming
    - Open formats
    - Machine readability
  - Export cleaned data that you wish you’d received
- Record all the steps used to process data

Save data in its original form to ensure faithful retention for rerunning analyses, recovering from mishaps, and experimenting fearlessly. Consider making raw data read-only and resist the temptation to overwrite it with cleaned results
When data is impractical to store this way, as with large, stable databases, document the exact procedure used to obtain it, version details, and other pertinent information.
Back up raw data in multiple places: if it’s not backed up it doesn’t matter
Analysis friendly data
- Replace inscrutable variable and column names with self explanatory, machine-readable alternatives
- Convert data from closed, proprietary formats to open, non-proprietary formats like CSV for tabular data, JSON, YAML, or XML for non-tabular data
- Create an ideal dataset by focusing on improving machine and human readability without extensive filtering or adding external information. Prioritize machine readability for easy reuse
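
A small sketch of turning received data into analysis-friendly data, assuming a table with cryptic column names. The column names, values, and output file name are invented for illustration.

```r
# Hypothetical raw data with inscrutable column names
raw_data <- data.frame(
  PID = c("p01", "p02"),
  `Exp (d)` = c(12, 30),
  GRP = c("control", "treatment"),
  check.names = FALSE
)

# Work on a copy so the raw data stays untouched
clean_data <- raw_data
names(clean_data) <- c("patient_id", "exposure_days", "study_group")

# Export in an open, plain-text format
clean_file <- file.path(tempdir(), "20240131_patient-exposure_clean.csv")
write.csv(clean_data, clean_file, row.names = FALSE)

# Reading the file back confirms it is machine readable
round_trip <- read.csv(clean_file)
print(round_trip)
```

Keeping the renaming in a script, rather than editing the file by hand, also records the processing step.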

Data Management 💽

Each variable must have its own column
Each observation must have its own row
Each value must have its own cell

Consistency
Vectorization

One common type of analysis friendly data is Tidy data
There are three interrelated rules which make a dataset tidy
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
Why bother? There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.

There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. Most built-in R functions, including those typically used inside mutate() and summarise(), work with vectors of values. That makes transforming tidy data feel particularly natural.

There are two main reasons to use other data structures:
- Alternative representations may have substantial performance or space advantages.
- Specialised fields have evolved their own conventions for storing data that may be quite different to the conventions of tidy data.
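
To make the three rules concrete, here is a sketch in base R that rebuilds a small wide table in tidy form. The table and its column names are made up; packages such as {tidyr} offer pivot_longer() for the same reshaping.

```r
# Hypothetical untidy table: the variable "visit" is hidden in the column names
wide_scores <- data.frame(
  patient = c("p01", "p02"),
  visit1 = c(5, 3),
  visit2 = c(6, 2)
)

# Tidy version: one row per patient-visit observation, one column per variable
tidy_scores <- data.frame(
  patient = rep(wide_scores$patient, times = 2),
  visit = rep(c("visit1", "visit2"), each = nrow(wide_scores)),
  score = c(wide_scores$visit1, wide_scores$visit2)
)

# Vectorised operations act on a whole column at once
tidy_scores$score_doubled <- tidy_scores$score * 2

# Grouped summaries need no special handling of column names
mean_by_visit <- tapply(tidy_scores$score, tidy_scores$visit, mean)
print(tidy_scores)
print(mean_by_visit)
```

Adding a third visit to the tidy table means adding rows, so the summary code keeps working unchanged.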

Readability 📑

Naming Conventions 📑

File names should be:
- Machine readable
- Human readable
- Optional: Consistent
- Optional: Play well with default ordering

Machine Readable 📑

Regex/Globbing Friendly
- Avoid spaces, punctuation, and accented characters
- Keep case consistent; do not rely on case alone to tell files apart
Easy Computation
- Use delimiters deliberately so names are easy to split programmatically
  - Dashes to separate words within a chunk
  - Underscores to separate chunks

20220120_patient-exposure_control.csv
20220120_patient-exposure_treatment.csv
20220215_patient-exposure_control.csv
20220215_patient-exposure_treatment.csv
20220215_patient-info_control.csv
20220310_patient-info_control.csv
20220520_patient-info_treatment.csv
20220805_patient-info_control.csv
20230120_patient-info_treatment.csv
20230215_patient-info_treatment.csv
20230310_patient-exposure_control.csv
20230310_patient-exposure_treatment.csv
20230405_patient-exposure_control.csv
20230405_patient-exposure_treatment.csv
20230405_patient-info_control.csv
20230615_patient-info_treatment.csv
20230710_patient-info_control.csv
20230710_patient-info_treatment.csv
20230805_patient-info_treatment.csv

❯ ls -1 2022*
20220120_patient-exposure_control.csv
20220120_patient-exposure_treatment.csv
20220215_patient-exposure_control.csv
20220215_patient-exposure_treatment.csv
20220215_patient-info_control.csv
20220310_patient-info_control.csv
20220520_patient-info_treatment.csv
20220805_patient-info_control.csv

❯ ls -1 *info*
20220215_patient-info_control.csv
20220310_patient-info_control.csv
20220520_patient-info_treatment.csv
20220805_patient-info_control.csv
20230120_patient-info_treatment.csv
20230215_patient-info_treatment.csv
20230405_patient-info_control.csv
20230615_patient-info_treatment.csv
20230710_patient-info_control.csv
20230710_patient-info_treatment.csv
20230805_patient-info_treatment.csv

Human Readable 📑

Informative File Names:
- Include content information in file names
- Anticipate usage context
Slug:
- User-friendly and descriptive filenames
- 20230710_*patient-info_control*.csv

filedir <- here("presentations/20240131-RBP_images/fakedat/")
flist <- list.files(filedir, pattern = "info")

stringr::str_split_fixed(flist, "[_\\.]", 4)

      [,1]       [,2]           [,3]        [,4] 
 [1,] "20220215" "patient-info" "control"   "csv"
 [2,] "20220310" "patient-info" "control"   "csv"
 [3,] "20220520" "patient-info" "treatment" "csv"
 [4,] "20220805" "patient-info" "control"   "csv"
 [5,] "20230120" "patient-info" "treatment" "csv"
 [6,] "20230215" "patient-info" "treatment" "csv"
 [7,] "20230405" "patient-info" "control"   "csv"
 [8,] "20230615" "patient-info" "treatment" "csv"
 [9,] "20230710" "patient-info" "control"   "csv"
[10,] "20230710" "patient-info" "treatment" "csv"
[11,] "20230805" "patient-info" "treatment" "csv"

Easy Sorting 📑

Numeric Inclusion:
- Often for code
- Include a numeric element for effective sorting
- Left-pad with zeros for consistent width and visual sorting.
- eg 01_import.R
Dates:
- Utilize the ISO 8601 standard for date formatting: YYYYMMDD
- Ensures chronological sorting in file names by default
- eg 20220820_wedding-photos.zip
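
Both conventions are easy to generate in code rather than type by hand. A minimal sketch, with invented step names:

```r
# Hypothetical analysis steps
step_numbers <- c(1L, 2L, 10L)
step_names <- c("import", "clean", "report")

# Left-pad the step number to two digits so text sorting matches numeric order
script_files <- sprintf("%02d_%s.R", step_numbers, step_names)

# ISO 8601 basic format (YYYYMMDD) as a date prefix
event_date <- as.Date("2022-08-20")
archive_file <- paste0(format(event_date, "%Y%m%d"), "_wedding-photos.zip")

# method = "radix" sorts in the C locale, so the order does not depend on
# local language settings
sorted_files <- sort(rev(script_files), method = "radix")
print(sorted_files)
print(archive_file)
```

Without the zero padding, a name starting with 10 would be compared character by character against names starting with 1 or 2, and the sorted order would no longer follow the step order.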

Naming Conventions 📑

Avoid:
- Internal sequential numbers: result1.csv, result2.csv
- Manuscript locations: fig_3_a.png

Writing Meaningful Comments 💬

Programs must be written for people to read, and only incidentally for machines to execute.

Meaningful Comments 💬

Rule 1: Comments should not duplicate the code

if (x > 3) {
   …
} # close if

i = i + 1 # Add one to i

Meaningful Comments 💬

Rule 2: Good comments do not excuse unclear code.

getBestChildNode <- function(node) {
  n <- NULL  # best child node candidate
  for (child_node in node$children) {
    # update n if the current state is better
    if (is.null(n) || utility(child_node) > utility(n)) {
      n <- child_node
    }
  }
  return(n)
}

getBestChildNode <- function(node) {
  bestNode <- NULL
  for (currentNode in node$children) {
    if (is.null(bestNode) || utility(currentNode) > utility(bestNode)) {
      bestNode <- currentNode
    }
  }
  return(bestNode)
}

- Generic variable names (x, y, etc.) that need a comment to explain them are a warning sign
- The need for comments is reduced if variables are named properly

Meaningful Comments 💬

Rule 3: If you can’t write a clear comment, there may be a problem with the code.

“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”

Roxygen2 Comments 💬

In-line documentation
Rich, dynamic .Rd file generation
Standard markdown

#' Add together two numbers
#' 
#' @param x A number.
#' @param y A number.
#' @returns A numeric vector.
#' @examples
#' add(1, 1)
#' add(10, 1)
add <- function(x, y) {
  x + y
}

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/add.R
\name{add}
\alias{add}
\title{Add together two numbers}
\usage{
add(x, y)
}
\arguments{
\item{x}{A number.}

\item{y}{A number.}
}
\value{
A numeric vector.
}
\description{
Add together two numbers
}
\examples{
add(1, 1)
add(10, 1)
}

Reproducibility 🐣

Modular and Reusable Code 🐣

Start programs/functions with explanatory comments.
Use meaningful names for functions and variables.
Explicitly state dependencies and requirements.
Avoid commenting/uncommenting code sections for program control.
Include a simple example or test data set for clarity.
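
One way to read these points together is sketched below: the requirement is stated up front, behaviour is switched with an argument instead of by commenting lines in and out, and a tiny data set doubles as the example. The function name, argument name, and minimum R version are assumptions for illustration.

```r
# State the requirement up front (assumed minimum version, for illustration)
stopifnot(getRversion() >= "3.0.2")

# Summarise exposure values, optionally dropping missing values.
# The drop_missing argument replaces commenting code in and out.
summarise_exposure <- function(exposure_days, drop_missing = TRUE) {
  c(
    n = length(exposure_days),
    mean = mean(exposure_days, na.rm = drop_missing)
  )
}

# Small test data set that also shows how the function is called
example_exposure <- c(12, 30, NA, 18)
with_drop <- summarise_exposure(example_exposure)
without_drop <- summarise_exposure(example_exposure, drop_missing = FALSE)
print(with_drop)
print(without_drop)
```

Both behaviours stay in the saved source, so rerunning the script never depends on which lines happened to be commented out.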

Make an R Package 🐣

Why make an R package?
- Code Organization
- Consistent documentation
- Code Distribution
- Dependency handling

Creating R packages for code reuse and distribution

Code Organization: I am always trying to figure out where that “function” I wrote months, weeks, or even days ago is. Oftentimes, I end up just re-writing it because it is faster than searching all my .R files. An R package would help in organizing where my functions go.
Consistent documentation: I can barely remember what half of my functions do let alone the inputs and outputs. An R package provides a great consistent documentation structure and actually encourages you to document your functions.
Code Distribution: No more emailing .R scripts! An R package gives an easy way to distribute your code for others. Especially if you put it on GitHub.
Dependency handling: No more failing scripts because library calls failed on load. Install the package and all of its dependencies are installed alongside it!

Even if it’s just for yourself

Make an R Package 🐣

Package directory structure
- Scripts in R
- Documentation in man
- README.md
- DESCRIPTION
- NAMESPACE
{Roxygen2} for documentation
{devtools} for process

.
├── DESCRIPTION
├── NAMESPACE
├── R
│   ├── cat_function.R
│   ├── cat_images.R
│   └── cats-package.R
├── README.md
└── man
    ├── add_cat.Rd
    ├── cat_function.Rd
    ├── cats.Rd
    ├── get_cat.Rd
    └── here_kitty.Rd
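
To see a comparable structure appear on disk without extra packages, base R's utils::package.skeleton() can be used as a stand-in for the {devtools} and {roxygen2} workflow named above. This is a swapped-in technique, and the function body and temporary location are hypothetical.

```r
# A function to package (hypothetical body; the name mirrors the tree above)
here_kitty <- function() {
  "meow"
}

# Keep the function in its own environment so only it is packaged
package_env <- new.env()
assign("here_kitty", here_kitty, envir = package_env)

# Write the skeleton into a temporary directory
package_parent <- tempdir()
suppressMessages(
  utils::package.skeleton(
    name = "cats",
    list = "here_kitty",
    environment = package_env,
    path = package_parent,
    force = TRUE
  )
)

# Inspect the generated files
package_root <- file.path(package_parent, "cats")
print(list.files(package_root, recursive = TRUE))
```

The skeleton contains DESCRIPTION, NAMESPACE, R and man, but no README.md. In the {roxygen2} workflow the files in man and the NAMESPACE are generated from the comments in R rather than edited by hand.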

Make an R Package 🐣

DESCRIPTION file
- Dependencies
- Package metadata
- Contact information

Package: cats
Title: Cats
Version: 0.1
Author: Hilary Parker <hilary@etsy.com> [aut, cre]
Maintainer: Hilary Parker <hilary@etsy.com>
Authors@R: c( person("Hilary", "Parker", email = "hilary@etsy.com", role =
    c("aut", "cre")))
Description: Mew.
Depends:
    R (>= 3.0.2)
Imports:
    httr,
    ggplot2,
    jpeg
License: MIT
LazyData: true
Suggests:
    testthat

Make an R Package 🐣

NAMESPACE file
- Functions from your package
- Imported functions from others


# Generated by roxygen2 (4.0.1.99): do not edit by hand

export(add_cat)
export(cat_function)
export(get_cat)
export(here_kitty)
import(ggplot2)
import(httr)
import(jpeg)

Version Control 🧭

Why version control in data analysis projects?

File Versioning
Figure Recreation
Code Modification Impact
Directory Copying Fear
Shared File Row Duplication
Lost Data Files
Analysis Choice

Manuscript Collaboration Merge
Accidental Deletion
Project Status Recall
Experiment Mistake Identification
Directory Pollution
Selective Changes
Ad Infinitum

- I have fifteen versions of this file and I don’t know which is which
- I can’t remake this figure from last year
- I modified my code and something apparently unrelated does not work anymore
- I have several copies of the same directory because I’m worried about breaking something
- Somebody duplicated a record in a shared file with samples
- I remember seeing a data file but cannot find it anymore: was it deleted? Moved away?
- I tried multiple analyses and I don’t remember which one I chose to generate my output data
- I have to merge changes to a paper from emails with collaborators
- I accidentally deleted a part of my work
- I came back to an old project and forgot where I left it
- I have trouble finding the source of a mistake in an experiment
- My directory is polluted with a lot of unused/temporary/old folders because I’m afraid of losing something important
- I made a lot of changes to my paper but only want to bring back one paragraph

Many Reasons

Version Control 🧭

What to Save:
- Human-created content.
- Plain text files.
- Avoid binary files.

Version Control: Manual 🧭

CHANGELOG.txt

## 2016-04-08

* Switched to cubic interpolation as default.
* Moved question about family's TB history to end of questionnaire.

## 2016-04-06

* Added option for cubic interpolation.
* Removed question about staph exposure (can be inferred from blood test results).

Version Control: Manual 🧭

Backup entire project folder

.
|-- project_name
|   -- current
|       -- ...project content as described earlier...
|   -- 2016-03-01
|       -- ...content of 'current' on Mar 1, 2016
|   -- 2016-02-19
|       -- ...content of 'current' on Feb 19, 2016

… please don’t do this

Version Control: VCS 🧭

Version Control System
Efficient Backup Storage
Automated Timestamps
Automated Changelog and Accuracy
Conflict Resolution and Merging
Examples: Git, Subversion (SVN), Mercurial, etc

VCS
Much better than manual
Efficient Backup Storage:
- Version control eliminates the need for users to create backup copies of the entire project.
- Safely stores sufficient information to recreate old versions of files on demand.
Automated Timestamps:
- Timestamps all saved changes automatically, eliminating the reliance on users to choose sensible names for backup copies.
Automated Changelog and Accuracy:
- Version control systems prompt users for a comment whenever a change is saved, encouraging a disciplined record without relying on manual changelogs.
- Maintains a 100% accurate record of actual changes made, valuable for troubleshooting later.
Conflict Resolution and Merging:
- Instead of blindly copying files to remote storage, version control checks for potential overwrites of others’ work.
- Facilitates conflict identification and merging changes, ensuring collaboration without loss of data.
Links to very good tutorials in the resources section of this talk

Managing Project Dependencies 📦

The only way to be certain that code written by someone else will run on your machine and will produce the same results is to replicate their environment

Managing Project Dependencies 📦

What is the environment?
1. The version of the software used
  - R 3.6.1 vs R 4.3.2
2. The versions of the packages/extensions used
  - {dplyr} 1.0.10 vs 1.1.4
3. External software dependencies for packages
  - pandoc

Challenges associated with package versions and dependencies in R projects
- Often your code will run without any hiccups even when environments differ. Software engineers try to make sure your code will not stop working if you use slightly different versions.
- But with all the software and all the dependencies, eventually something will give and things will break. Even if the code runs successfully, there is a chance that the results could differ.
Manually comparing environments is difficult to impossible, and using containers to replicate another’s computer is complex, so we recommend a midpoint: ensuring package and R versions are the same each time the analysis is run
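
A first step towards that midpoint is recording which versions an analysis actually ran with. The sketch below uses only base R; the file name and the minimum version are assumptions. Packages such as {renv} automate recording and restoring package versions, which is a suggestion here and not something the text above names.

```r
# Record the R version and the versions of loaded packages alongside the results
session_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(sessionInfo()), session_file)

# Versions can also be checked in code before an analysis runs
r_version <- getRversion()
stats_version <- packageVersion("stats")

# Assumed minimum version, for illustration only
if (r_version < "3.6.1") {
  stop("This analysis assumes R 3.6.1 or later")
}

print(R.version.string)
print(stats_version)
```

Keeping the recorded file with the project outputs gives collaborators something concrete to compare against when results differ.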

Resources

http://tinyurl.com/COGseminar

Seminar Content

Moodle Courses

Readability

Packages

Project Dependencies

Useful Links