This is the hands-on segment of the multi-omics analysis tutorial. The data analyzed below are described in Lloyd-Price et al., Nature Microbiology, 2019. Please visit the Setup page for some initial steps to complete prior to the tutorial.

1 Descriptive analysis

What is the overall landscape of our dataset?

Typically, you would start the analysis of a new dataset (after data cleaning and QC) with some initial descriptive analyses to summarize the overall trends in the data across all features (taxa/genes/metabolites). The tools most commonly used for this task are similar to those you’ve already seen in Jordan’s previous session on 16S rRNA sequencing: dimensionality reduction and ordination (PCA/PCoA/NMDS), visualization of feature abundances across samples (heatmaps, line plots, etc.), and statistical modeling to assess which features are differentially abundant between experimental groups. In the interest of time, we won’t work through these analyses together, but you can see ordination plots based on microbiome taxa (PCoA, Bray-Curtis dissimilarity) and metabolite features (PCA) below. Some additional questions and suggestions for performing these analyses are included at the end of the document if you want to continue to explore this dataset.

library(tidyverse)
library(tidymodels)
library(vegan)
library(ape)

## Import metadata
metadata_cinc <- read_tsv("data/subject_metadata_Cincinnati.tsv") %>% 
  filter(data_type == "metabolomics") %>% 
  mutate(diagnosis = factor(diagnosis, levels = c("nonIBD", "UC", "CD")))
metadata_mgh <- read_tsv("data/subject_metadata_MGH.tsv") %>% 
  filter(data_type == "metabolomics") %>% 
  mutate(diagnosis = factor(diagnosis, levels = c("nonIBD", "UC", "CD")))

## Import metabolite data
metabolites_cinc <- read_tsv("data/metabolite_profile_Cincinnati.tsv") %>% 
  rename_with( ~ gsub("_\\(.*", "", gsub(" |\\/", "_", .x))) %>% 
  pivot_longer(names_to = "SampleID", cols = contains("HSM")) %>% 
  left_join(select(metadata_cinc, SampleID, SubjectID, diagnosis), by = "SampleID") %>% 
  mutate(value_zeros = ifelse(is.na(value), 0, value))
metabolites_mgh <- read_tsv("data/metabolite_profile_MGH.tsv") %>% 
  rename_with( ~ gsub("_\\(.*", "", gsub(" |\\/", "_", .x))) %>% 
  pivot_longer(names_to = "SampleID", cols = contains("MSM")) %>% 
  left_join(select(metadata_mgh, SampleID, SubjectID, diagnosis), by = "SampleID") %>% 
  mutate(value_zeros = ifelse(is.na(value), 0, value))

metabolites_all <- rbind(mutate(metabolites_cinc, study_site = "Cincinnati"), mutate(metabolites_mgh, study_site = "MGH"))
## Remove features with fewer than 5 unique values (i.e., detected in very few samples), then log-transform values
bad_feats <- metabolites_all %>% group_by(Compound) %>% 
  summarize(numVar = length(unique(value_zeros))) %>% 
  filter(numVar < 5) %>% pull(Compound)
metabolites_all <- filter(metabolites_all, !Compound %in% bad_feats) %>% mutate(logValueZeros = log10(value_zeros+1))

## Format for PCA, calculate PCA, plot
met_pr <- pivot_wider(metabolites_all, names_from = Compound, values_from = logValueZeros, 
                      id_cols = c(SampleID, study_site, diagnosis), values_fill = 0)
met_pca <- met_pr %>% 
  column_to_rownames(var = "SampleID") %>% 
  select(-study_site, -diagnosis) %>% 
  as.matrix() %>% 
  prcomp()
met_summary <- as_tibble(met_pca$x, rownames = NA) %>% 
  rownames_to_column(var = "SampleID") %>% 
  select(SampleID, PC1, PC2, PC3, PC4) %>% 
  left_join(distinct(select(metabolites_all, SampleID, diagnosis, study_site)))
met_plot <- ggplot(met_summary, aes(x = PC1, y = PC2, color = diagnosis, shape = study_site)) + geom_point(size = 1.5) + ggtitle("Metabolite abundances") + theme_classic()
met_plot

met_vars <- met_pca %>% tidy(matrix = "pcs")  # variance explained by each principal component
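## Optional: a quick scree-style plot of the variance explained per PC.
## This is an illustrative sketch only; it uses the `percent` column returned
## by broom's tidy(matrix = "pcs") above.
ggplot(filter(met_vars, PC <= 10), aes(x = PC, y = percent)) +
  geom_col() +
  labs(x = "Principal component", y = "Proportion of variance explained") +
  theme_classic()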

# Remove large datasets
rm(metabolites_all, metabolites_cinc, metabolites_mgh)

## Import taxa abundance data
taxa_cinc <- read_tsv("data/taxa_profile_Cincinnati.tsv") %>% 
  pivot_longer(names_to = "SampleID", cols = -Feature) %>% 
  left_join(select(metadata_cinc, site_name, SampleID, SubjectID, diagnosis), by = "SampleID")

taxa_mgh <- read_tsv("data/taxa_profile_MGH.tsv") %>% 
  pivot_longer(names_to = "SampleID", cols = -Feature) %>% 
  left_join(select(metadata_mgh, site_name, SampleID, SubjectID, diagnosis), by = "SampleID")
taxa_all <- rbind(taxa_cinc, taxa_mgh)

#Get species-level abundances, calculate Bray-Curtis dissimilarities between samples
bray_dist <- taxa_all %>% select(-diagnosis,-SubjectID, -site_name) %>% 
  filter(grepl("s__.*", Feature)|Feature=="UNKNOWN") %>% 
  pivot_wider(names_from = "Feature", values_from = "value", values_fill = 0) %>% 
  column_to_rownames("SampleID") %>% 
  as.matrix() %>% 
  vegdist(method = "bray")

#Perform principal coordinates analysis
pcoa1 <- pcoa(bray_dist)
pcoa_dat <- pcoa1$vectors %>% 
  as_tibble(rownames = NA) %>% 
  rownames_to_column("SampleID") %>% 
  left_join(distinct(select(taxa_all, SampleID, SubjectID, diagnosis, site_name)))

#Visualize PCoA
species_plot <- ggplot(pcoa_dat, aes(x = Axis.1, y = Axis.2, color = diagnosis, shape = site_name)) + geom_point(size = 1.5) + ggtitle("Species abundances") + theme_classic()
species_plot

3 Predictive analysis

An alternative (and complementary!) approach to multi-omic analysis is to instead treat our dataset as just a “bag of features”, in which we don’t assume anything about any specific feature’s meaning but simply apply high-dimensional analysis techniques to describe data structure and identify features possibly linked to our outcomes of interest. We’ll explore this now, using a machine learning analysis to evaluate which features best predict disease status.

3.1 Classification

In this analysis, we will investigate two questions:

  • How well do metabolomic and/or metagenomic features predict IBD diagnosis?

  • Can we identify biomarkers that are signatures of diagnosis?

These are classification questions - to answer them, we want to obtain a predictive model that can accurately predict what category (e.g. control or IBD) any sample belongs to, on the basis of its high-dimensional data. Then we will inspect the models to evaluate which features in our large dataset were most informative for those predictions.

We will use a commonly used type of machine learning model called a random forest. “Random forest” describes an algorithm that combines predictions from many decision trees, each built from a bootstrapped sample of the observations and considering a random subset of features at each split. Averaged together, these trees make better predictions than any single tree.

Machine learning workflows can get quite convoluted, in large part to limit the risk of overparameterization and overfitting to dataset idiosyncrasies. (Overfitting is very easy to do when you have thousands of possible predictors.) In this analysis, we’ll work through a relatively standard workflow, using the tidymodels suite of packages:

  1. Define the model choice and how we will split up our data for training and testing the model.

  2. Use cross-validation to select hyperparameters for the model and get an initial idea of model accuracy (more on this below).

  3. Fit a final model and examine which features are most predictive.

  4. Test the model’s predictive ability on a new set of data.

For these models, we will use a 5-fold cross-validation scheme to select the best model hyperparameters, and then we will calculate a final testing accuracy in an independent dataset - in this case, the samples from the other study site cohort. k-fold cross-validation works by fitting k different models, each trained on (k-1)/k of the training data (80% when k = 5) and tested on the remaining fold, and then summarizing performance across all k folds.

Step 1: Set up training data scheme and model specification

Our first random forest will learn a model to predict IBD diagnosis (to start, just IBD or control) on the basis of taxonomic abundances.
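The setup below is a minimal sketch of this step using tidymodels. The object names (taxa_wide, taxa_train, taxa_folds, rf_workflow) and the choice of a binary IBD/nonIBD label are illustrative assumptions, not the exact code used to generate the results shown later.

library(tidymodels)

set.seed(42)
## Assume taxa_wide is a samples-by-taxa table with a binary ibd_status column
## (UC and CD collapsed to "IBD") plus a site_name column
taxa_train <- filter(taxa_wide, site_name == "Cincinnati") %>% select(-site_name)
taxa_folds <- vfold_cv(taxa_train, v = 5, strata = ibd_status)  # 5-fold CV splits

rf_spec <- rand_forest(mtry = tune(), trees = tune()) %>%  # hyperparameters tuned in Step 2
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_workflow <- workflow() %>%
  add_formula(ibd_status ~ .) %>%
  add_model(rf_spec)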

Step 2: Use cross-validation to select hyperparameters for the model and get an initial idea of model quality

We will use cross-validation performance to choose parameters for our model. Random forests have two main tuning parameters (there are also others that we will leave at their defaults): the number of decision trees making up the whole model (num.trees), and the number of variables randomly sampled as candidates at each split (mtry). The optimal values for these depend on dataset characteristics. We will pick the values that perform best from a small grid of candidates.

We’ll evaluate our predictive models using 2 common metrics:

  • Accuracy: the number of correct predictions divided by the total number of samples.

  • Area under the ROC curve (AUC): a commonly used metric in classification analyses. The predictions produced by the model are actually probabilities: the estimated probability that a particular sample belongs to each class. An ROC curve summarizes performance across all of the possible probability cutoffs we could use to assign a sample to a class; the area under it is 1 for a perfect classifier and 0.5 for random guessing.
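A hedged sketch of the tuning step is below. It reuses the assumed objects from the Step 1 sketch (rf_workflow, taxa_folds), and the grid values simply mirror the parameter combinations summarized in the results that follow.

rf_grid <- expand_grid(mtry = c(1, 100, 200), trees = c(100, 450, 800))

rf_tuned <- tune_grid(
  rf_workflow,
  resamples = taxa_folds,
  grid = rf_grid,
  metrics = metric_set(accuracy, roc_auc)
)

collect_metrics(rf_tuned)                # mean performance per parameter combination
show_best(rf_tuned, metric = "roc_auc")  # top combinations by AUC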

##           Truth
## Prediction  IBD nonIBD
##     IBD    2861    520
##     nonIBD   10    830

We’ll compare performance across all of our different parameter combinations to choose what to use for our final model:

## # A tibble: 9 x 5
##    mtry trees .config accuracy roc_auc
##   <int> <int> <fct>      <dbl>   <dbl>
## 1     1   100 Model1     0.680   0.888
## 2   100   100 Model2     0.974   0.991
## 3   200   100 Model3     0.966   0.990
## 4     1   450 Model4     0.680   0.927
## 5   100   450 Model5     0.979   0.990
## 6   200   450 Model6     0.972   0.991
## 7     1   800 Model7     0.680   0.925
## 8   100   800 Model8     0.972   0.990
## 9   200   800 Model9     0.966   0.989

## # A tibble: 5 x 8
##    mtry trees .metric .estimator  mean     n std_err .config
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <fct>  
## 1   100   100 roc_auc binary     0.991     5 0.00594 Model2 
## 2   200   450 roc_auc binary     0.991     5 0.00581 Model6 
## 3   100   450 roc_auc binary     0.990     5 0.00634 Model5 
## 4   200   100 roc_auc binary     0.990     5 0.00520 Model3 
## 5   100   800 roc_auc binary     0.990     5 0.00612 Model8

Step 3: Fit final model and examine which features are most predictive

Now that we’ve selected reasonable parameters, we can fit our final training model and examine which features appear to be most informative for predicting IBD status in this dataset.
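One way to do this with tidymodels is sketched below, again assuming the objects from the earlier sketches (rf_tuned, rf_workflow, taxa_train); it is illustrative rather than the exact code behind the output that follows.

best_params <- select_best(rf_tuned, metric = "roc_auc")

final_rf <- rf_workflow %>%
  finalize_workflow(best_params) %>%  # plug the chosen mtry/trees into the workflow
  fit(data = taxa_train)              # fit once on the full training cohort

## Evaluate on the training data itself (expect optimistic results)
train_preds <- augment(final_rf, taxa_train)
conf_mat(train_preds, truth = ibd_status, estimate = .pred_class)
roc_auc(train_preds, truth = ibd_status, .pred_IBD)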

##           Truth
## Prediction IBD nonIBD
##     IBD    319      1
##     nonIBD   0    149

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary             1

Looks like a pretty strong model! We can also look at which particular taxa tended to be most informative of IBD status in this final model. This is quantified here using variable importance based on Gini impurity (the mean decrease in node impurity attributable to splits on each feature).
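If the model was fit with importance = "impurity" in the ranger engine (as in the Step 1 sketch), one convenient way to pull these scores out is the vip package; this is an assumption of the sketch, not necessarily how the original tutorial extracted importances.

library(vip)

final_rf %>%
  extract_fit_parsnip() %>%  # pull the underlying ranger fit out of the workflow
  vip(num_features = 15)     # plot the 15 most important taxa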

Step 4: Test final model on a new set of data

Even though our model seems to perform really well, this is almost always true on a training dataset (the model is optimized for that after all). To really get a sense of whether these particular combinations of taxa are informative for predicting IBD status, we need to apply our model to a totally new dataset. In this case, we will try predicting IBD status with the same model (same parameters, same final fit decision trees) on a new, independent dataset - the samples from the MGH study site.
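A sketch of this final evaluation is below; taxa_test stands in for a wide-format table of the MGH samples built the same way as the training table, so the objects and column names here are assumptions.

## Assume taxa_test holds the MGH samples, formatted identically to taxa_train
taxa_test <- filter(taxa_wide, site_name == "MGH") %>% select(-site_name)

test_preds <- augment(final_rf, taxa_test)  # apply the already-fitted model, no re-training
roc_auc(test_preds, truth = ibd_status, .pred_IBD)
conf_mat(test_preds, truth = ibd_status, estimate = .pred_class)
accuracy(test_preds, truth = ibd_status, estimate = .pred_class)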

## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.616
##           Truth
## Prediction IBD nonIBD
##     IBD    236    196
##     nonIBD  60     56
## # A tibble: 1 x 1
##   Accuracy
##      <dbl>
## 1    0.533

3.2 Follow up questions

  • Try this with another data type: Does the metabolite data predict disease subtype more accurately than the taxonomic data? What about both datasets together, and/or some of the additional metadata such as sex and age?

  • Try another formulation of the problem: what about a multi-class classifier to distinguish UC, CD, and nonIBD?

  • How well do you think a model would perform based solely on the 10 or 15 most important taxa?

  • Do you think this training-testing setup was a good approach to prevent overfitting and over-interpretation? How could we refine this strategy?

  • Do you think the random forest algorithm is a good choice of classification tool for this dataset? Why or why not? Look into support vector machines and elastic net logistic regression as alternative approaches.