Tools for Managing and Organizing Computational Analysis Projects

Cecilia Noecker
May 7, 2020

What are features of a good computational biology project?

[Figure: xkcd comic]

“The design question that you will face most often as you formulate and execute a series of computational experiments is how much effort to put into software engineering. Depending upon your temperament, you may be tempted to execute a quick series of commands in order to test your hypothesis immediately, or you may be tempted to over-engineer your programs to carry out your experiment in a pleasingly automatic fashion. In practice, I find that a happy medium between these two often involves iterative improvement of scripts.”

“A quick guide to organizing computational biology projects”, William Noble, PLOS Computational Biology 2009, https://doi.org/10.1371/journal.pcbi.1000424

Key objectives to keep in mind

Organization: locate and manage code and data across file systems
Documentation: future you and others can understand what you did and why
Validation/Testing: your code produces the correct results
Reproducibility/Provenance: all results and output can be unambiguously linked with the specific versions of the data and code that produced them

“Good enough practices in scientific computing”, Greg Wilson et al., PLOS Computational Biology 2017, https://doi.org/10.1371/journal.pcbi.1005510
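
As one possible illustration of the reproducibility/provenance objective, the base R sketch below writes a small provenance record for an input file: a checksum of the data, the R version, and a timestamp. The `record_provenance()` helper and the file names are hypothetical and are not taken from the papers cited above.

```r
# Hypothetical helper: save a record of what went into a result
record_provenance <- function(input_file, record_file) {
  record <- c(
    paste("input_file:", basename(input_file)),
    # The md5 checksum changes whenever the file's contents change
    paste("input_md5:", unname(tools::md5sum(input_file))),
    paste("r_version:", R.version.string),
    paste("recorded_at:", format(Sys.time(), "%Y-%m-%d %H:%M:%S"))
  )
  writeLines(record, record_file)
  invisible(record)
}

# Demo on a throwaway file so the example is self-contained
input_file <- tempfile(fileext = ".csv")
write.csv(data.frame(sample = c("A", "B"), count = c(10, 20)),
          input_file, row.names = FALSE)
record_file <- tempfile(fileext = ".txt")
provenance <- record_provenance(input_file, record_file)
print(provenance)
```

In a real project the record could also include the output of `sessionInfo()` and an identifier for the version of the code that was run.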

Tools for organization and documentation in R/RStudio

Whole project: R projects
Single analysis: R Markdown and Rpres analysis notebooks
- https://rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf
Frequently used code bits: R packages or other function libraries

Organizing repeatedly used code in functions

In your R Markdown document:

library(tidyverse)
library(Biostrings)
source("functions_for_this_project.R")

#Define input file
input_data_file <- "my_seq_file.fasta"

#Read in data
data <- read_16s_seqs(input_data_file)

#Plot data
my_heatmap <- make_heatmap(data, filter_low_abundance = TRUE)

#Fit models of data
model_results <- fit_diffAbund_models(data, test_interactions = TRUE)

In functions_for_this_project.R:

# Read in 16S rRNA sequences and convert to a tibble with associated information
# (requires Biostrings and tidyverse, loaded in the R Markdown document)
read_16s_seqs <- function(data_file){
  seqs <- readDNAStringSet(data_file)
  seqs <- tibble(FeatureID = names(seqs),
                 Sequence = unname(as.character(seqs)),
                 SeqLength = width(seqs))
  return(seqs)
}

# Make heatmap of sequence abundances with rows in a particular order
# (assumes seq_data also has an abundance column named value)
make_heatmap <- function(seq_data, filter_low_abundance = TRUE){
  seq_order <- seq_data %>% arrange(value) %>% pull(FeatureID)
  ggplot()
  ## etc
}

# Fit series of linear models of sequence abundances
fit_diffAbund_models <- function(seq_abundances, test_interactions = FALSE){
  ## etc
}
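
Keeping code in functions also makes it easier to check that the code produces correct results. A minimal sketch of that idea, assuming a hypothetical helper `filter_low_abundance_seqs()` and a hypothetical abundance column `value` (neither is defined in the slides), with the checks written using base R's `stopifnot()`:

```r
# Hypothetical helper: keep sequences whose abundance is at least min_value
filter_low_abundance_seqs <- function(seq_data, min_value = 5) {
  stopifnot(is.data.frame(seq_data), "value" %in% names(seq_data))
  seq_data[seq_data$value >= min_value, , drop = FALSE]
}

# Tiny made-up data set for which the right answer is known in advance
toy <- data.frame(FeatureID = c("seq1", "seq2", "seq3"),
                  value = c(1, 5, 100),
                  stringsAsFactors = FALSE)
filtered <- filter_low_abundance_seqs(toy, min_value = 5)

# Each check stops with an error if the function misbehaves
stopifnot(nrow(filtered) == 2)
stopifnot(identical(filtered$FeatureID, c("seq2", "seq3")))
stopifnot(all(filtered$value >= 5))
```

Checks like these can live in their own script and be rerun whenever the function file changes. The testthat package offers a fuller framework for the same idea.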

Tools for staying organized across multiple machines

Make sure your project has a master home for all raw data! Never edit or move around raw data.
Write code locally, run it on a cluster or server: edit locally with a text editor like BBEdit, save to the remote machine, then submit the job
Run code locally, import data from server: remote connect
- FileZilla, Mac remote, rsync, sftp/scp
Run RStudio Server on a Wynton interactive node from your browser
- New feature, support/docs still under construction. Instructions here: https://docs.google.com/document/d/13MPXPdtGy4J80Wu9ZZp534fQEJQOc67Xox1zB379t6M/

Tools for keeping track of complicated workflows and avoiding repeating long computations

make
drake (R): https://docs.ropensci.org/drake/
Snakemake (Python)
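
To give a feel for how such a tool works, here is a minimal sketch of a drake workflow, assuming the drake package is installed. The target names (`raw_data`, `fit`, `slope`) and the toy data are made up for illustration. Each target is declared once, drake works out which targets depend on which, and a later `make()` call rebuilds only what is out of date.

```r
library(drake)

# drake stores its cache in the working directory, so use a throwaway one here
old_wd <- setwd(tempdir())

# Declare the workflow: each target is a name plus the code that builds it
plan <- drake_plan(
  raw_data = data.frame(dose = 1:10, response = 2 * (1:10) + 1),
  fit = lm(response ~ dose, data = raw_data),   # depends on raw_data
  slope = coef(fit)[["dose"]]                   # depends on fit
)

make(plan)   # first run: builds all three targets
make(plan)   # second run: nothing has changed, so nothing should be rebuilt

# Retrieve a finished target from the cache
fitted_slope <- readd(slope)
print(fitted_slope)

setwd(old_wd)
```

If the command for `fit` were edited, the next `make(plan)` would rebuild `fit` and `slope` but reuse the cached `raw_data`, which is how these tools avoid repeating long computations.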

Problems we haven't talked about yet

How to figure out what's going on when a new version of your analysis is generating different results from a previous version?
How to avoid accidentally deleting something important?
How to keep track of multiple versions of a script (e.g., on the cluster and on your own computer)?
How can multiple people collaborate on the same code without overwriting each other?

Version Control (Git and GitHub)

What is Version Control and Why?

Editing your code: want to track changes over time without saving many separate versions.

Collaborating on your code: want to merge changes from multiple sources.

Git: software to track and merge changes to text documents

Can be run from the command line, from an RStudio Project, or with a graphical interface tool
- GitHub Desktop, GitKraken, Git for Windows
Creates a permanent (hard to delete) record of changes to your project code over time

Frequently used git commands

git init: Create a git “repository”, or project you want to track
git add, git commit: Record a set of code edits as a “commit” to your repository
git status: See which files in your project have been committed and which ones haven't
git diff: See what changes you've made to your script since your last commit
git push: Propagate your changes to a remote repository (i.e. the online version of your project on GitHub)
git pull: Retrieve changes made elsewhere in a remote repository and merge them with the latest version on your computer
git checkout: Retrieve a previously committed version of your code, or a different branch (chain) of versions
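
The sketch below strings several of these commands together on a throwaway repository. The commands would normally be typed in a terminal; here they are driven from R with `system2()` only to match the language of the other examples. It assumes the `git` command-line tool is installed, and the file name, commit messages, user identity, and the small `git()` wrapper are all hypothetical.

```r
# Work in a throwaway directory so no real project is touched
repo <- file.path(tempdir(), "demo_project")
dir.create(repo, showWarnings = FALSE)
old_wd <- setwd(repo)

# Small wrapper: run a git subcommand and return what it prints
git <- function(...) system2("git", c(...), stdout = TRUE, stderr = TRUE)

git("init")                                        # create the repository
git("config", "user.name", shQuote("Demo User"))   # identity for this repo only
git("config", "user.email", "demo@example.com")

writeLines("x <- 1", "analysis.R")
print(git("status", "--short"))    # analysis.R is listed as untracked ("??")
git("add", "analysis.R")
git("commit", "-m", shQuote("Add first version of analysis script"))

writeLines("x <- 2", "analysis.R")
print(git("diff"))                 # shows "-x <- 1" and "+x <- 2"
git("commit", "-a", "-m", shQuote("Change x to 2"))

history <- git("log", "--oneline") # one line per commit, newest first
print(history)

# Bring back the file as it was one commit ago
git("checkout", "HEAD~1", "--", "analysis.R")
restored <- readLines("analysis.R")
print(restored)

setwd(old_wd)
```

`git push` and `git pull` are left out because they need a remote repository, such as one hosted on GitHub.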

What does this look like in RStudio?

GitHub

Git is the software, GitHub is the sharing platform
Can comment, discuss, collaborate on open-source code
Can have public or private repositories

Additional Version Control Resources

Happy Git and GitHub for the useR (eBook): https://happygitwithr.com
Software Carpentry Intro to Git: https://swcarpentry.github.io/git-novice/
Official Git Documentation: https://git-scm.com/doc
Visualization of Git commit tracking and branching: https://git-school.github.io/visualizing-git/

Summary

Have a system to keep your projects organized, shareable, testable, and reproducible
The RStudio/R Markdown suite of tools is great for this purpose
Use remote-connection tools to share data across machines
Use version control (Git) to track and save versions of your code