Cecilia Noecker
May 7, 2020
“The design question that you will face most often as you formulate and execute a series of computational experiments is how much effort to put into software engineering. Depending upon your temperament, you may be tempted to execute a quick series of commands in order to test your hypothesis immediately, or you may be tempted to over-engineer your programs to carry out your experiment in a pleasingly automatic fashion. In practice, I find that a happy medium between these two often involves iterative improvement of scripts.”
“A quick guide to organizing computational biology projects”, William Noble, PLOS Comp Bio 2009 doi.org/10.1371/journal.pcbi.1000424
Organization: locate and manage code and data across file systems
Documentation: future you and others can understand what you did and why
Validation/Testing: your code produces the correct results
Reproducibility/Provenance: all results and output can be unambiguously linked with the specific versions of the data and code that produced them
“Good enough practices in scientific computing”, Greg Wilson et al, PLOS Comp Bio 2017 10.1371/journal.pcbi.1005510
In your R Markdown document:
library(tidyverse)
library(Biostrings)
source("functions_for_this_project.R")
#Define input file
input_data_file <- "my_seq_file.fasta"
#Read in data
data <- read_16s_seqs(input_data_file)
#Plot data
my_heatmap <- make_heatmap(data, filter_low_abundance = TRUE)
#Fit models of data
model_results <- fit_diffAbund_models(data, test_interactions = TRUE)
In functions_for_this_project.R
:
# Read in 16S rRNA sequences and convert to tibble with associated information
read_16s_seqs <- function(data_file){
seqs <- readDNAStringSet(data_file)
seqs <- data.frame(FeatureID = names(seqs), Sequence = seqs, SeqLength = width(seqs)) %>%
as_tibble()
return(seqs)
}
# Make heatmap of sequence abundances with rows in a particular order
make_heatmap <- function(seq_data, filter_low_abundance = TRUE){
seq_order = seq_data %>% arrange(value) %>% select(SeqID)
ggplot()
## etc
}
# Fit series of linear models of sequence abundances
fit_diffAbund_models <- function(seq_abundances, test_interactions = FALSE){
## etc
}
How to figure out what's going on when a new version of your analysis is generating different results from a previous version?
How to avoid accidentally deleting something important?
How to keep track of multiple versions of a script (e.g, on the cluster and on your own computer)?
How can multiple people collaborate on the same code without overwriting each other?
Version Control (Git and GitHub)
Editing your code: want to track changes over time without saving many separate versions.
Collaborating on your code: want to merge changes from multiple sources.
Can be run from the command line, from an RStudio Project, or with a graphical interface tool
Creates a permanent (hard to delete) record of changes to your project code over time