Schedule for: 19w5221 - Emerging Statistical Challenges & Methods for Analysis of Human Microbiome Data

Beginning on Sunday, September 15 and ending Friday September 20, 2019

All times in Banff, Alberta time, MDT (UTC-6).

Sunday, September 15
16:00 - 17:30	Check-in begins at 16:00 on Sunday and is open 24 hours (Front Desk - Professional Development Centre)
17:30 - 19:30	Dinner ↓ A buffet dinner is served daily between 5:30pm and 7:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building. (Vistas Dining Room)
20:00 - 22:00	Informal gathering (Corbett Hall Lounge (CH 2110))

Monday, September 16
07:00 - 08:45	Breakfast ↓ Breakfast is served daily between 7 and 9am in the Vistas Dining Room, the top floor of the Sally Borden Building. (Vistas Dining Room)
08:45 - 09:00	Introduction and Welcome by BIRS Staff ↓ A brief introduction to BIRS with important logistical information, technology instruction, and opportunity for participants to ask questions. (TCPL 201)
09:00 - 09:40	Yijuan Hu: Analyzing matched sets of microbiome data using the LDM and PERMANOVA (Presenter: Glen Satten) ↓ Matched data arise frequently in microbiome studies. For example, we may collect samples pre and post treatment from a set of subjects, or matched case-control subjects who were matched on important confounding factors. However, there is a lack of methods to provide both a global test of microbiome effect and tests of individual operational taxonomic units (OTUs) in a unified manner, while accommodating complex data such as those with unbalanced sample sizes per set, confounders varying within a set, and continuous traits of interest. PERMANOVA is a commonly used distance-based method for testing the global hypotheses of any microbiome effect. We have also developed the linear decomposition model (LDM) that includes the global test and tests of individual OTU effects while controlling the false discovery rate (FDR). Here we present a strategy that can be used in the LDM and PERMANOVA for analyzing matched-set data. We propose to include set indicators as covariates so as to constrain comparisons between samples within a set. We also propose to permute covariates within each set which can account for exchangeable sample correlations. Additionally, the flexible nature of the LDM and PERMANOVA allows discrete or continuous variables (e.g., clinical outcomes) to be tested, within-set confounders to be adjusted, and unbalanced data to be fully exploited. Our simulations indicate that the proposed strategy outperformed alternative strategies in a wide range of scenarios. Using simulation, we also explored optimal designs for matched-set studies. The flexibility of the LDM and PERMANOVA for a variety of matched-set microbiome data is illustrated by the analysis of data from two microbiome studies. (TCPL 201)
09:40 - 10:20	Zhengzheng Tang: Robust and powerful differential composition tests on clustered microbiome data ↓ Clustered microbiome data have become prevalent in recent years from designs such as longitudinal studies, family studies, and matched case-control studies. The within-cluster dependence compounds the challenge of the microbiome data analysis. Methods that properly accommodate intra-cluster correlation and features of the microbiome data are needed. We develop robust and powerful differential composition tests for clustered microbiome data. The methods do not rely on any distributional assumptions on the microbial compositions, which provides flexibility to model various correlation structures among taxa and among samples within a cluster. By leveraging the adjusted sandwich covariance estimate, the methods properly accommodate sample dependence within a cluster. Different types of confounding variables can be easily adjusted for in the methods. We perform extensive simulation studies under commonly-adopted clustered data designs to evaluate the methods. The usefulness of the proposed methods is further demonstrated with a real dataset from a longitudinal microbiome study on pregnant women. (TCPL 201)
10:20 - 10:50	Coffee Break (TCPL Foyer)
10:50 - 11:30	Jun Chen: A permutation framework for robust and powerful differential abundance analysis of microbiome sequencing data (Cancelled) ↓ One central theme of microbiome studies is to identify bacterial taxa/functions associated with some clinical or biological outcome (a.k.a. microbiome biomarker discovery). The discovered microbiome biomarkers can be used for disease diagnosis, prognosis, and treatment selection. Many methods have been proposed for this task, ranging from the simple Wilcoxon rank sum test to sophisticated zero-inflated parametric models. Due to the excessive zeros, outliers, compositionality and phylogenetic structure in microbiome data, the existing methods are still far from optimal: parametric methods tend to be less robust while non-parametric methods are less powerful. To address the limitations of current approaches, we propose an efficient permutation framework (ZicoSeq) for robust and powerful microbiome biomarker discovery. The method is based on the traditional F-statistic for linear models with a novel posterior sampling step to address zero inflation and sampling variability. A multiple-stage normalization strategy is implemented to control the compositional effects. The framework takes into account the full characteristics of microbiome sequencing data including variable library sizes, the correlations among taxa, the inherent compositionality, and the phylogenetic relatedness of the taxa. An omnibus test is developed to capture various biological effects. By simulations and real data applications, we demonstrate good power and false positive control for the proposed method. (TCPL 201)
10:50 - 11:30	Michael Wu: Testing Associations Between Microbiome and Other Omics Data Types ↓ Joint analysis of microbiome and other genomic data types offers to simultaneously improve power to identify novel associations and elucidate the mechanisms underlying established relationships with outcomes. However, microbiome data are subject to high dimensionality, compositionality, sparsity, phylogenetic constraints, and complexity of relationships among taxa. Combined with the myriad challenges specific to other omics data types, how to conduct integrative analysis continues to pose a grand challenge. To move towards joint analysis, we propose development of methods for identifying individual genomic features and groups of genomic features related to microbiome community structure. Specifically, using kernels to capture microbiome community structure, we develop approaches for rapidly screening genomic features that collectively, marginally, or conditionally affect beta diversity. (TCPL 201)
11:30 - 13:00	Lunch ↓ Lunch is served daily between 11:30am and 1:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building. (Vistas Dining Room)
13:00 - 14:00	Guided Tour of The Banff Centre ↓ Meet in the Corbett Hall Lounge for a guided tour of The Banff Centre campus. (Corbett Hall Lounge (CH 2110))
14:00 - 14:20	Group Photo ↓ Meet in the foyer of TCPL to participate in the BIRS group photo. The photograph will be taken outdoors, so dress appropriately for the weather. Please don't be late, or you might not be in the official group photo! (TCPL 201)
14:20 - 15:00	Toby Kenney: Using Stochastic Differential Equations to Model Microbial Dynamics ↓ Most of the research to date on analysis of microbiome data has focussed on the microbial states associated with various conditions. However, understanding the temporal dynamics of the microbiome is also extremely important. Stochastic differential equations (SDEs) are widely used in ecology to describe the dynamics of ecological systems. In this talk, we present two preliminary approaches to modelling microbial dynamics using SDEs. The first approach looks at modelling the temporal dynamics of an individual OTU using an Ornstein-Uhlenbeck (OU) process, which is based on Brownian motion with mean reversion, meaning that the abundance of the OTU, while fluctuating randomly, is drawn towards a stable level. By comparing the fit of the OU process with Brownian motion, we are able to provide evidence confirming the tendency towards mean-reversion. By studying the Fisher information matrix for the parameters of the OU process, we are able to determine the accuracy of our modelling for various sampling schemes, and study the best sampling frequency for various systems. The second approach looks at modelling inter-species interaction between OTUs, using a stochastic version of the generalised Lotka-Volterra equation (GLV). The deterministic GLV equation is widely used in ecology as a simple model for various forms of inter-species interaction. In this work, we study the equation with the addition of a Brownian motion stochastic term. We prove existence of a solution to this equation, and show that the stochastic process has a stationary distribution with the ergodic property. We show that the use of approximate maximum likelihood to estimate parameters of the equation is consistent, and empirically performs better than using a deterministic differential equation with measurement error. We apply this approach to real data to identify interactions between the most abundant families. (TCPL 202)
15:00 - 15:30	Coffee Break (TCPL Foyer)
15:30 - 16:10	Robert Beiko: Has anyone seen my plasmid? Probing the dark corners of metagenome-assembled genomes ↓ Metagenomic analyses typically produce millions of short reads, sampled from the entire diversity of genomes present in a particular sample. While direct analysis of these reads can yield useful information about the diversity of microorganisms and functions present, a great deal of information can be learned by merging short reads into longer assemblies. Algorithms to reconstruct metagenome-assembled genomes (MAGs) draw from different types of evidence, including the relative abundance of particular reads in a sample, and the similarity of “words” of length k (known as k-mers). Reconstruction of MAGs has shed new light on heretofore unknown deep lineages of bacteria, and revealed the degree of diversity of closely related organisms in different habitats. MAGs can also be very useful for the reconstruction of entire metabolic pathways and networks. However, the effectiveness of MAG assembly is not uniform, and stretches of DNA that deviate from the expected frequency or k-mer distribution can be difficult or impossible to correctly assign. This problem is especially acute in unusual constituents of the genome such as plasmids and genomic islands (GIs); since these elements often harbour useful information about antimicrobial resistance and other important pathways, their absence from a MAG can lead to underestimation of their abundance. We assessed the extent of the problem using a simulated 250 base-pair paired-end metagenome of 30 genomes displaying a broad range of GI abundance and numbers of plasmids. Across a range of methods, a median of 66.2% of all chromosomal sequence was binned into MAGs; however, only 23.1% of plasmids and 31.7% of GIs were similarly present in any bin. When assessing the percentage of GIs and plasmids that were correctly assigned to the same bin as the rest of their source genome, this performance is even worse (median 32.5% of GIs and 6.9% of plasmids). These results on a relatively simple simulated community point to (possibly fundamental) limitations of existing methods in assigning exotic elements to their correct source genome. Although further improvements will undoubtedly be realized through better algorithms and statistics, high accuracy may depend on the integration of additional DNA sequencing data, and better use of known reference genomes. (TCPL 202)
17:30 - 19:30	Dinner ↓ A buffet dinner is served daily between 5:30pm and 7:30pm in the Vistas Dining Room, the top floor of the Sally Borden Building. (Vistas Dining Room)

Tuesday, September 17
07:00 - 09:00	Breakfast (Vistas Dining Room)
09:00 - 09:40	Hong Gu: Principal Component Analysis for microbiome data by correcting the measurement errors and sequencing depths ↓ Exploratory data methods, such as Principal Component Analysis (PCA), cannot properly be applied directly to microbiome data due to the issues of sampling errors and sequencing depths. Under the assumption of Poisson sampling errors, we study the problem of computing a PCA of the underlying Poisson means or a nonlinear transformation of the latent Poisson means. We develop a semiparametric approach to correct the bias of variance estimators, both for untransformed and transformed (with particular attention to log-transformation) Poisson means without any assumptions on the underlying distribution of these means. Furthermore, we incorporate methods for correcting different exposure or sequencing depth in the data. In addition to identifying the principal components, we also address the non-trivial problem of computing the principal scores in this semiparametric framework. Most previous approaches tend to take a more parametric line, for example the Poisson-log-normal (PLN) model approach. We compare our method with the PLN approach and find that our method is better at identifying the main principal components of the latent log-transformed Poisson means, and as a further major advantage, takes far less time to compute. Comparing methods on real data, we see that our method also appears to be more robust to outliers than the parametric method. (TCPL 201)
09:40 - 10:20	Glen Satten (TCPL 201)
10:20 - 10:50	Coffee Break (TCPL Foyer)
10:50 - 11:30	Benjamin Callahan: Modeling and Correcting Bias in Metagenomic Sequencing Measurements (Cancelled) ↓ Marker-gene and metagenomic sequencing measurements differ from the truth, often dramatically, because these measurement methods are biased towards detecting some taxa over others. This experimental bias makes the taxon or gene abundances measured by different protocols quantitatively incomparable and can lead to spurious biological conclusions. We propose a mathematical model for how bias distorts community measurements based on the properties of real experiments. We validate this model with 16S rRNA gene and shotgun metagenomics data from defined bacterial communities. Our model better fits the experimental data despite being simpler than previous models. We illustrate how our model can be used to evaluate protocols, to understand the effect of bias on downstream statistical analyses, and to measure and correct bias given suitable calibration controls. I will further discuss the practical challenges in measuring and mitigating the effects of bias in real experiments and analyses. (TCPL 201)
11:30 - 13:30	Lunch (Vistas Dining Room)
13:30 - 14:10	Shyamal Peddada: Differential (Relative) Abundance Analysis – Some Recent Developments and Challenges ↓ Increasingly, researchers are conducting microbiome studies to ask a wide range of questions of scientific interest. However, as we learn from Morton et al. (Nature Comm., 2019), the question of “who are there?” is still an important basic question before we can answer questions such as “What are they doing?”, “How are they doing?”, etc. It is well documented in the literature that the observed microbiome data are relative abundances (compositional) with lots of zeros (Gloor et al., 2016, 2017). Consequently, the method of analysis is not necessarily routine. Numerous methods have been proposed in the literature and there have been misunderstandings and controversies, in part because there is a lack of clarity on what parameters are to be tested and what hypotheses a given method/statistic is really testing. In this talk we summarize some existing methods and also describe some recent developments in the area. We shall illustrate the methods using simulations and the global gut data of Yatsunenko et al. (Nature, 2012). (TCPL 202)
14:10 - 14:50	Gregory Gloor: Finding the centre: correcting for compositional asymmetry in high-throughput sequencing datasets ↓ An under-appreciated pathology of microbiome and other high-throughput sequencing data is their often unbalanced nature; i.e., there is often systematic variation between groups simply due to presence or absence of features, and this variation is important to the biological interpretation of the data. We demonstrate the pathology in modelled and real unbalanced experimental designs to show how this causes both false negative and false positive inference. We then introduce several approaches to demonstrate how the pathologies can be recognized and addressed. The transformations are implemented as an extension to a general compositional data analysis tool known as ALDEx2, which is available on Bioconductor. (TCPL 202)
14:50 - 15:20	Coffee Break (TCPL Foyer)
17:30 - 19:30	Dinner (Vistas Dining Room)

Wednesday, September 18
07:00 - 09:00	Breakfast (Vistas Dining Room)
09:50 - 10:30	Ni Zhao: A Benchmark Project for Differential Abundance Testing in Microbiome Studies ↓ In human microbiome studies, it is essential to evaluate the association between microbial group (e.g., community or clade) composition and a host phenotype of interest. In response, a number of microbial group association tests have been proposed, which take into account the unique features of the microbiome data (e.g., high-dimensionality, compositionality, phylogenetic relationship). These tests generally fall in the class of aggregation tests which amplify the overall group association by combining all the underlying microbial association signals; as such, they are powerful when many microbial species are associated (i.e., low sparsity). However, in practice, the microbial association signals can be highly sparse, and this is precisely the situation in which the microbial group association is difficult to discover. Hence, here we introduce a powerful microbial group association test for sparse microbial association signals, namely, microbiome higher criticism analysis (MiHC). MiHC is a data-driven optimal test taken in a search space spanned by tailoring the higher criticism test to incorporate phylogenetic information and/or modulate sparsity levels. Our simulations show that MiHC maintains a high power at different phylogenetic relevance and sparsity levels with correct type I error controls. We also demonstrate the use of MiHC with three real data applications. (TCPL 201)
10:20 - 10:50	Coffee Break (TCPL Foyer)
10:50 - 11:30	Huilin Li: Microbial causal mediation inference ↓ Recent microbiome association studies have revealed important associations between microbiome and disease/health status. Such findings encourage scientists to dive deeper to uncover the causal role of microbiome in the underlying biological mechanism, and have led to applying statistical models to quantify causal microbiome effects and to identify the specific microbial agents. However, there are no existing causal mediation methods specifically designed to handle high dimensional and compositional microbiome data. We propose a rigorous Sparse Microbial Causal Mediation Model (SparseMCMM) specifically designed for the high dimensional and compositional microbiome data in a typical three-factor (treatment, microbiome and outcome) causal study design. In particular, a linear log-contrast regression model and a Dirichlet regression model are proposed to estimate the causal direct effect of treatment and the causal mediation effects of microbiome at both the community and individual taxon levels. Regularization techniques are used to perform the variable selection in the proposed model framework to identify signature causal microbes. Hypothesis tests on overall and component-wise mediation effects are proposed and their statistical significance is estimated by permutation procedures. Extensive simulated scenarios show that SparseMCMM has excellent performance in estimation and hypothesis testing. Finally, we showcase the utility of the proposed SparseMCMM method in a study in which the murine microbiome has been manipulated, providing a clear and sensible causal path among antibiotic treatment, microbiome composition and mouse weight. (TCPL 201)
11:30 - 13:30	Lunch (Vistas Dining Room)
13:30 - 17:30	Free Afternoon (Banff National Park)
17:30 - 19:30	Dinner (Vistas Dining Room)

Thursday, September 19
07:00 - 09:00	Breakfast (Vistas Dining Room)
09:00 - 09:40	Hongzhe Li: Hypothesis Testing for Phylogenetic Composition: A Minimum-cost Flow Perspective ↓ Quantitative comparison of microbial composition from different populations is a fundamental task in various microbiome studies. We consider two-sample testing for microbial compositional data by leveraging the phylogenetic tree information. Motivated by existing phylogenetic distances, we take a minimum-cost flow perspective to study such testing problems. Our investigation shows that multivariate analysis of variance with permutation (PERMANOVA) using phylogenetic distances, one of the most commonly used methods in practice, is essentially a sum-of-squares type test and has better power for dense alternatives. However, empirical evidence from real data sets suggests that the phylogenetic microbial composition difference between two populations is usually sparse. Motivated by this observation, we propose a new maximum type test, Detector of Active Flow on a Tree (DAFOT). It is shown that DAFOT is particularly powerful against sparse phylogenetic composition difference and enjoys certain optimality. The practical merit of the proposed method is demonstrated by simulation studies and an application to a human intestinal biopsy microbiome data set for patients with ulcerative colitis. (TCPL 201)
09:40 - 10:20	Snehalata Huzurbazar: Visualizations to guide dimension reduction for sparse high-dimensional data ↓ Dimension reduction for high-dimensional data is necessary for descriptive data analysis. Most visualization options are restricted to 2 or 3 dimensions; however, more dimensions are needed to capture relationships among variables (or observations) in high-dimensional data. Using 16S rRNA microbiome data, we develop intensity plots to highlight the changing contributions of taxa (or subjects) as the number of principal components of the dimension reduction or ordination method is changed. The plots provide a quick visualization of taxa/subjects that are close to the 'center' or that contribute to dissimilarity. They also allow for exploration of patterns among related subjects or taxa not seen in other visualizations. In addition, we use Andrews curves and explore the data using the tourr package in R. (TCPL 201)
10:20 - 10:50	Coffee Break (TCPL Foyer)
10:50 - 11:30	Natalie Knox: Bias and variability in microbiome research: Can we uncover the ground truth? ↓ Despite the exponential increase in microbiome research, it remains challenging to compare results across studies due to several factors that contribute to bias and variability. In this session, I will present the basic principles of microbiome study design and data generation to help mitigate these factors. The prevailing microbiome data processing methods have also brought new challenges to microbiome research. I will demonstrate how different approaches can play a role in the analytical results and subsequent statistical inferences that are used to draw conclusions about the study population. At the end of this session, attendees should have a greater understanding of the complexities associated with microbiome research from data generation to data analysis and interpretation. (TCPL 201)
11:30 - 13:30	Lunch (Vistas Dining Room)
13:30 - 14:10	Myung Hee Lee: Quantile regression with microbiome compositional data ↓ When the distribution of the outcome variable is skewed, the association between covariates and the center of the outcome distribution may be weak, and meaningful associations may be uncovered in parts of the distribution other than the central area. We consider the quantile regression problem where compositional data are used as covariates. A likelihood-based framework for estimation of regression quantiles will be introduced using the Asymmetric Laplace Density. An empirical Bayes, model-based approach is used to facilitate variable selection when there is a large number of candidate covariates that have weak to no effect on the outcome. (TCPL 202)
14:10 - 14:50	Paul McMurdie: Statistics at a Microbiome Start-up, and Survey of our Biggest Challenges (TCPL 202)
14:50 - 15:20	Coffee Break (TCPL Foyer)
17:30 - 19:30	Dinner (Vistas Dining Room)

Friday, September 20
07:00 - 09:00	Breakfast (Vistas Dining Room)
09:00 - 09:40	Ekaterina Smirnova: HMP2Data: Integrative Human Microbiome Data R Bioconductor package and analysis workflow ↓ The integrative Human Microbiome project (iHMP) generated longitudinal datasets from three different cohorts to study the association between microbiome and (1) pregnancy and preterm birth; (2) inflammatory bowel disease; and (3) type 2 diabetes. However, working with these data is daunting due to complex processing steps to (1) access, import and merge different components in formats suitable for ecological and statistical analysis; and (2) visualize and combine the data to analyze the longitudinal, multi-omics and multi-body site host-microbiota interactions. We present the community-resource package HMP2Data that allows researchers to easily access the iHMP data deposited at the coordinating center (DACC). The data for each cohort were harmonized using the MultiAssayExperiment and phyloseq packages to allow easy data management and analysis. We concentrate on the vaginal microbiome data from the pregnancy and preterm birth study and discuss recent analyses of similarities across -omics data, here cytokines and 16S, illustrated using co-inertia techniques. (TCPL 201)
09:40 - 10:20	Jennifer Fettweis: An integrated framework for multi-omic studies of the vaginal microbiome ↓ The lack of harmonization in taxonomic and functional assignment across multi-omic microbiome studies often makes it difficult to accurately interpret and fully integrate results. Multi-omic studies typically use a combination of methods and databases that have not been harmonized to derive taxonomic and functional features from 16S rRNA surveys, metagenomics, metatranscriptomics, and metaproteomics assays. Thus, the differences observed between assays can often be attributable to the disjointed nature of underlying databases and computational methods rather than biological or technical variation. Here, we propose an integrated framework for vaginal microbiome omics analysis that leverages integrated vaginal databases for 16S rRNA, metagenomics, metatranscriptomics and metaproteomics assays. The framework permits direct comparison of taxonomic and functional feature calls across omics assays. (TCPL 201)
10:20 - 10:50	Coffee Break (TCPL Foyer)
10:50 - 11:30	Vicki Hertzberg: Design Issues in Human Studies of Disease and the Microbiome ↓ Variation in human-associated microbial communities across individuals may be due to genetics, environment, diet, and age. Although studies of the relationships between disease and the microbiome in humans often use control groups, a prevailing attitude seems to be that just about any old control group will do. Often descriptive characteristics of control groups are not provided other than the number of controls and that they are “healthy”. Occasionally investigators will control for age, race, and sex. In the last several years, four major studies have found significant similarities between genetically unrelated individuals who share households, with the closest similarities being between adult coupled partners. The implications of these findings for microbiome study designs are important: optimal study designs should include partners/caregivers as controls to meet the standard of equipoise for testing a null hypothesis. Here we propose a set of guidelines for control groups for microbiome studies. We give suggested effect sizes to guide power calculations. We illustrate these guidelines with data from studies of the role of the gut microbiome in neurodegenerative diseases. (TCPL 201)
11:30 - 12:00	Checkout by Noon ↓ 5-day workshop participants are welcome to use BIRS facilities (BIRS Coffee Lounge, TCPL and Reading Room) until 3 pm on Friday, although participants are still required to check out of the guest rooms by 12 noon. (Front Desk - Professional Development Centre)
12:00 - 13:30	Lunch from 11:30 to 13:30 (Vistas Dining Room)

©2025 Banff International Research Station for Mathematical Innovation and Discovery. All Rights Reserved.