Tools for Managing and Organizing Computational Analysis Projects

Cecilia Noecker
May 7, 2020

What are features of a good computational biology project?

xkcd

“The design question that you will face most often as you formulate and execute a series of computational experiments is how much effort to put into software engineering. Depending upon your temperament, you may be tempted to execute a quick series of commands in order to test your hypothesis immediately, or you may be tempted to over-engineer your programs to carry out your experiment in a pleasingly automatic fashion. In practice, I find that a happy medium between these two often involves iterative improvement of scripts.”

“A quick guide to organizing computational biology projects”, William Noble, PLOS Comp Bio 2009 doi.org/10.1371/journal.pcbi.1000424

Key objectives to keep in mind

  • Organization: locate and manage code and data across file systems

  • Documentation: future you and others can understand what you did and why

  • Validation/Testing: your code produces the correct results

  • Reproducibility/Provenance: all results and output can be unambiguously linked with the specific versions of the data and code that produced them

“Good enough practices in scientific computing”, Greg Wilson et al, PLOS Comp Bio 2017 10.1371/journal.pcbi.1005510

Tools for organization and documentation in R/RStudio

Organizing repeatedly used code in functions

In your R Markdown document:

library(tidyverse)
library(Biostrings)
source("functions_for_this_project.R")

#Define input file
input_data_file <- "my_seq_file.fasta"

#Read in data
data <- read_16s_seqs(input_data_file)

#Plot data
my_heatmap <- make_heatmap(data, filter_low_abundance = TRUE)

#Fit models of data
model_results <- fit_diffAbund_models(data, test_interactions = TRUE)

In functions_for_this_project.R:

# Read in 16S rRNA sequences and convert to tibble with associated information
read_16s_seqs <- function(data_file){
  seqs <- readDNAStringSet(data_file)
  seqs <- data.frame(FeatureID = names(seqs), Sequence = seqs, SeqLength = width(seqs)) %>% 
    as_tibble()
  return(seqs)
}

# Make heatmap of sequence abundances with rows in a particular order
make_heatmap <- function(seq_data, filter_low_abundance = TRUE){
  seq_order = seq_data %>% arrange(value) %>% select(SeqID) 
  ggplot()
  ## etc
}

# Fit series of linear models of sequence abundances
fit_diffAbund_models <- function(seq_abundances, test_interactions = FALSE){
  ## etc
}

Tools for staying organized across multiple machines

  • Make sure your project has a master home for all raw data! Never edit or move around raw data.
  • Write code locally, run it on a cluster or server: edit locally with a text editor like BBEdit, save remotely, submit
  • Run code locally, import data from server: remote connect
    • Filezilla, Mac remote, rsync, sftp/scp
  • Run RStudio Server on a Wynton interactive node from your browser

Tools for keeping track of complicated workflows and avoiding repeating long computations

Problems we haven't talked about yet

  • How to figure out what's going on when a new version of your analysis is generating different results from a previous version?

  • How to avoid accidentally deleting something important?

  • How to keep track of multiple versions of a script (e.g, on the cluster and on your own computer)?

  • How can multiple people collaborate on the same code without overwriting each other?

Version Control (Git and GitHub)

What is Version Control and Why?

Editing your code: want to track changes over time without saving many separate versions.

git_edit

Collaborating on your code: want to merge changes from multiple sources.

git_collab

Git: software to track and merge changes to text documents

git_merge

  • Can be run from the command line, from an RStudio Project, or with a graphical interface tool

    • GitHub Desktop, Git Kraken, Git For Windows
  • Creates a permanent (hard to delete) record of changes to your project code over time

Frequently used git commands

  • git init: Create a git “repository”, or project you want to track
  • git add, git commit: Record a set of code edits as a “commit” to your repository
  • git status: See what files in your project have been commited and which ones haven't
  • git diff: See what changes you've made to your script since your last commit
  • git push: Propagate your changes to a remote repository (i.e. the online version of your project on GitHub)
  • git pull: Retrieve changes made elsewhere in a remote repository and merge them with the latest version on your computer
  • git checkout: Retrieve a previously commited version of your code, or a different branch (chain) of versions

What does this look like in RStudio?

GitHub

  • Git is the software, GitHub is the sharing platform
  • Can comment, discuss, collaborate on open-source code
  • Can have public or private repositories

Additional Version Control Resources

Summary

  • Have a system to keep your projects organized, shareable, testable, and reproducible
  • RStudio/RMarkdown suite of tools are great for this purpose
  • Use remote connect tools to share data across machines
  • Use version control (Git) to track and save versions of your code

What other questions do you have?