Statistical modelling for biology, genetics & *omics
The group has expertise in developing statistical and machine learning methods for phylogenetics, regulatory genomics, epigenetics, proteomics and metabolomics.
Statistical Modelling for Biology, Genetics and *omics - Example Research Projects
Information about postgraduate research opportunities and how to apply can be found on the Postgraduate Research Study page. Below is a selection of projects that could be undertaken with our group.
Modelling genetic variation (MSc/PhD)
Supervisors: Vincent Macaulay
Relevant research groups: Bayesian Modelling and Inference, Statistical Modelling for Biology, Genetics and *omics
Variation in the distribution of different DNA sequences across individuals has been shaped by many processes that can be modelled probabilistically, including demographic factors, such as prehistoric population movements, and natural selection. This project involves developing new techniques for teasing out information on those processes from the wealth of raw data that is now being generated by high-throughput genetic assays, and is likely to involve computationally-intensive sampling techniques to approximate the posterior distribution of parameters of interest. The characterization of the amount of population structure on different geographical scales will influence the design of experiments to identify the genetic variants that increase risk of complex diseases, such as diabetes or heart disease.
The evolution of shape (PhD)
Supervisors: Vincent Macaulay
Relevant research groups: Bayesian Modelling and Inference, Modelling in Space and Time, Statistical Modelling for Biology, Genetics and *omics
Shapes of objects change in time. Organisms evolve and in the process change form: humans and chimpanzees derive from some common ancestor presumably different from either in shape. Designed objects are no different: an Art Deco tea pot from the 1920s might share some features with one from Ikea in 2010, but they are different. Mathematical models of evolution for certain data types, like the strings of As, Gs, Cs and Ts in our evolving DNA, are quite mature and allow us to learn about the relationships of the objects (their phylogeny or family tree), about the changes that happen to them in time (the evolutionary process) and about the ways objects were configured in the past (the ancestral states), by statistical techniques like phylogenetic analysis. Such techniques for shape data are still in their infancy. This project will develop novel statistical inference approaches (in a Bayesian context) for complex data objects, like functions, surfaces and shapes, using Gaussian-process models, with potential application in fields as diverse as language evolution, morphometrics and industrial design.
Bayesian variable selection for genetic and genomic studies (PhD)
Supervisors: Mayetri Gupta
Relevant research groups: Bayesian Modelling and Inference, Computational Statistics, Statistical Modelling for Biology, Genetics and *omics
An important issue in high-dimensional regression problems is the accurate and efficient estimation of models when the number of potential predictors is substantially larger than the number of data points. Further complications arise from correlated predictors, which lead to the breakdown of standard statistical models for inference, and from the uncertain definition of the outcome variable, which is often a varying composition of several different observable traits. Examples of such problems arise in many scenarios in genomics: in determining expression patterns of genes that may be responsible for a type of cancer, and in determining which genetic mutations lead to higher risks for occurrence of a disease. This project involves developing broad and improved Bayesian methodologies for efficient inference in high-dimensional regression-type problems with complex multivariate outcomes, with a focus on genetic data applications.
The successful candidate should have a strong background in methodological and applied Statistics, expert skills in relevant statistical software or programming languages (such as R, C/C++/Python), and also have a deep interest in developing knowledge in cross-disciplinary topics in genomics. The candidate will be expected to consolidate and master an extensive range of topics in modern Statistical theory and applications during their PhD, including advanced Bayesian modelling and computation, latent variable models, machine learning, and methods for Big Data. The successful candidate will be considered for funding to cover domestic tuition fees, as well as paying a stipend at the Research Council rate for four years.
Bayesian statistical data integration of single-cell and bulk “OMICS” datasets with clinical parameters for accurate prediction of treatment outcomes in Rheumatoid Arthritis (PhD)
Supervisors: Mayetri Gupta
Relevant research groups: Bayesian Modelling and Inference, Computational Statistics, Statistical Modelling for Biology, Genetics and *omics, Biostatistics, Epidemiology and Health Applications
In recent years, many different computational methods have been established to analyse biological data, including DNA (genomics), RNA (transcriptomics), proteins (proteomics) and metabolites (metabolomics), the last of which captures more dynamic events. These methods were refined by the advent of single-cell technology, where it is now possible to capture the transcriptomic profile of single cells, and spatial arrangements of cells from flow methods or imaging methods like functional magnetic resonance imaging. At the same time, these OMICS data can be complemented with clinical data: measurements on patients, like age, smoking status, phenotype of disease or drug treatment. It is an interesting and important open statistical question how to combine data from different “modalities” (like transcriptome with clinical data or imaging data) in a statistically valid way, to compare different datasets and make justifiable statistical inferences. This PhD project will be jointly supervised with Dr. Thomas Otto and Prof. Stefan Siebert from the Institute of Infection, Immunity & Inflammation. You will explore how to combine different datasets using Bayesian latent variable modelling, focusing on clinical datasets from Rheumatoid Arthritis.
The successful candidate will be considered for funding to cover domestic tuition fees, as well as paying a stipend at the Research Council rate for four years.
Scalable Bayesian models for inferring evolutionary traits of plants (PhD)
Supervisors: Vinny Davies, Richard Reeve, Claire Harris (BIOSS)
Relevant research groups: Bayesian Modelling and Inference, Computational Statistics, Environmental, Ecological Sciences and Sustainability, Statistical Modelling for Biology, Genetics and *omics
The functional traits and environmental preferences of plant species determine how they will react to changes resulting from global warming. The main global biodiversity repositories, such as the Global Biodiversity Information Facility (GBIF), contain hundreds of millions of records from hundreds of thousands of species in the plant kingdom alone, and the spatiotemporal data in these records can be associated with soil, climate or other environmental data from other databases. Combining these records allows us to identify environmental preferences, especially for common species where many records exist. Furthermore, in a previous PhD studentship we showed that these traits are highly evolutionarily conserved (Harris et al., 2022), so it is possible to impute the preferences for rare species where little data exists using phylogenetic inference techniques.
The aim of this PhD project is to investigate the application of Bayesian variable selection methods to identify these evolutionarily conserved traits more effectively, and to quantify these traits and their associated uncertainty for all plant species for use in a plant ecosystem digital twin that we are developing separately to forecast the impact of climate change on biodiversity. In another PhD studentship, we previously developed similar methods for trait inference in viral evolution (Davies et al., 2017; Davies et al., 2019), but due to the scale of the data here, these methods will need to be significantly enhanced. We therefore propose a project to investigate extensions to methods for phylogenetic trait inference to handle datasets involving hundreds of millions of records in phylogenies with hundreds of thousands of tips, potentially through either sub-sampling (Quiroz et al., 2018) or modelling splitting and recombination (Nemeth & Sherlock, 2018).
Estimating false discovery rates in metabolite identification using generative AI (PhD)
Supervisors: Vinny Davies, Andrew Elliott, Justin J.J. van der Hooft (Wageningen University)
Relevant research groups: Machine Learning and AI, Emulation and Uncertainty Quantification, Statistical Modelling for Biology, Genetics and *omics, Statistics in Chemistry/Physics
Metabolomics is the study field that aims to map all molecules that are part of an organism, which can help us understand its metabolism and how it can be affected by disease, stress, age, or other factors. During metabolomics experiments, mass spectra of the metabolites are collected and then annotated by comparison against spectral databases such as METLIN (Smith et al., 2005) or GNPS (Wang et al., 2016). Generally, however, these spectral databases do not contain the mass spectra of a large proportion of metabolites, so the best matching spectrum from the database is not always the correct identification. Matches can be scored using cosine similarity, or more advanced methods such as Spec2Vec (Huber et al., 2021), but these scores do not provide any statement about the statistical accuracy of the match. Creating decoy spectral libraries, specifically a large database of fake spectra, is one potential way of estimating False Discovery Rates (FDRs), allowing us to quantify the probability of a spectrum match being correct (Scheubert et al., 2017). However, these methods are not widely used, suggesting there is significant scope to improve their performance and ease of use. In this project, we will use the code framework from our recently developed Virtual Metabolomics Mass Spectrometer (ViMMS) (Wandy et al., 2019, 2022) to systematically evaluate existing methods and identify possible improvements. We will then explore how we can use generative AI, e.g., Generative Adversarial Networks or Variational Autoencoders, to train a deep neural network that can create more realistic decoy spectra, and thus improve our estimation of FDRs.
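The scoring and decoy ideas above can be illustrated with a minimal sketch. It assumes spectra are given as lists of (m/z, intensity) pairs, that peaks are compared after binning to a fixed m/z width, and that the target and decoy libraries are the same size, so the number of decoy matches above a score threshold estimates the number of false target matches. The function names and the binning scheme are hypothetical choices for illustration, not the method of any of the cited papers or of ViMMS.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins: {bin_index: intensity}."""
    binned = {}
    for mz, intensity in peaks:
        idx = int(mz // bin_width)
        binned[idx] = binned.get(idx, 0.0) + intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Cosine similarity between two binned spectra (0 = no shared bins, 1 = same direction)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def estimate_fdr(target_scores, decoy_scores, threshold):
    """Simple target-decoy FDR estimate at a score threshold.

    Assumes equally sized target and decoy libraries (an assumption of this sketch),
    so decoy hits above the threshold approximate the false target hits.
    """
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0
    return min(1.0, n_decoy / n_target)


if __name__ == "__main__":
    query = [(100.1, 50.0), (150.2, 100.0), (200.3, 25.0)]
    library_hit = [(100.0, 45.0), (150.1, 90.0), (250.0, 10.0)]
    print("cosine score:", round(cosine_similarity(query, library_hit), 3))
    print("estimated FDR:", estimate_fdr([0.9, 0.8, 0.7, 0.4], [0.75, 0.3, 0.2, 0.1], 0.7))
```

How trustworthy such an estimate is depends on how closely the decoy spectra resemble real ones, which is the part the project proposes to improve with generative models.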
Multi objective Bayesian optimisation for in silico to real metabolomics experiments (PhD/MSc)
Supervisors: Vinny Davies, Craig Alexander
Relevant research groups: Computational Statistics, Machine Learning and AI, Emulation and Uncertainty Quantification, Statistical Modelling for Biology, Genetics and *omics, Statistics in Chemistry/Physics
Untargeted metabolomics experiments aim to identify the small molecules that make up a particular sample (e.g., blood), allowing us to identify biomarkers, discover new chemicals, or understand the metabolism (Smith et al., 2014). Data Dependent Acquisition (DDA) methods are used to collect the information needed to identify the metabolites, and various more advanced DDA methods have recently been designed to improve this process (Davies et al., 2021; McBride et al., 2023). Each of these methods, however, has parameters that must be chosen in order to maximise the amount of relevant data (metabolite spectra) that is collected. Our recent work led to the design of a Virtual Metabolomics Mass Spectrometer (ViMMS) in which we can run computer simulations of experiments and test different parameter settings (Wandy et al., 2019, 2022). Previously this has involved running a pre-determined set of parameters as part of a grid search in ViMMS, and then choosing the best parameter settings based on a single measure of performance. The proposed M.Res. (or Ph.D.) will extend this approach by using multi objective Bayesian Optimisation to adapt simulations and optimise over multiple different measurements of quality. By optimising parameters in this manner, we can help improve real experiments currently underway at the University of Glasgow and beyond.
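To make the move from a single performance measure to several more concrete, here is a small sketch of how grid-search results could be compared on more than one objective by keeping only the Pareto-optimal settings, that is, those not beaten on every objective by some other setting. It is only an illustration of the multi-objective idea: the setting names and objective values are made up, higher is assumed to be better for every objective, and no Bayesian optimisation or ViMMS call is involved.

```python
def dominates(a, b):
    """True if objective vector a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(results):
    """Keep the settings whose objective vectors are not dominated by any other setting.

    results: dict mapping a setting label to a tuple of objective values
    (higher is better for every objective; an assumption of this sketch).
    """
    return {
        setting: objectives
        for setting, objectives in results.items()
        if not any(dominates(other, objectives) for other in results.values())
    }


if __name__ == "__main__":
    # Hypothetical grid-search output: (coverage, spectral quality) per parameter setting.
    grid_results = {
        "setting_a": (0.9, 0.2),
        "setting_b": (0.5, 0.5),
        "setting_c": (0.4, 0.4),
        "setting_d": (0.2, 0.9),
    }
    for setting, objectives in sorted(pareto_front(grid_results).items()):
        print(setting, objectives)
```

A multi-objective Bayesian optimiser would aim to approximate this front with far fewer simulations than an exhaustive grid, by choosing each new parameter setting from a surrogate model of the objectives.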
Modality of mixtures of distributions (PhD)
Supervisors: Surajit Ray
Relevant research groups: Nonparametric and Semi-parametric Statistics, Applied Probability and Stochastic Processes, Statistical Modelling for Biology, Genetics and *omics, Biostatistics, Epidemiology and Health Applications
Finite mixtures provide a flexible and powerful tool for fitting univariate and multivariate distributions that cannot be captured by standard statistical distributions. In particular, multivariate mixtures have been widely used to perform modeling and cluster analysis of high-dimensional data in a wide range of applications. Modes of mixture densities have been used with great success for organizing mixture components into homogeneous groups, but the results are limited to normal mixtures. Beyond the clustering application, existing research in this area has provided fundamental results regarding the upper bound of the number of modes, but they too are limited to normal mixtures. In this project, we wish to explore the modality of non-normal distributions and their application to real-life problems.
Implementing a biology-empowered statistical framework to detect rare variant risk factors for complex diseases in whole genome sequence cohorts (PhD)
Supervisors: Vincent Macaulay, Luísa Pereira (Geneticist, i3s)
Relevant research groups: Statistical Modelling for Biology, Genetics and *omics, Biostatistics, Epidemiology and Health Applications
Traditional genome-wide association studies (GWAS) to detect candidate genetic risk factors for complex diseases/phenotypes rely largely on microarray technology, genotyping thousands or millions of variants regularly spaced across the genome at once. These microarrays include mostly common variants (minor allele frequency, MAF>5%), missing candidate rare variants, which are the more likely to be deleterious. Currently, the best strategy to genotype low-frequency (1%<MAF<5%) and rare (MAF<1%) variants is through next generation sequencing, and the increasing availability of whole genome sequences (WGS) places us on the brink of detecting rare variants associated with complex diseases. Statistically, this detection constitutes a challenge, as the massive number of rare variants in genomes (for example, 64.7M in 150 Iberian WGSs) would imply genotyping millions/billions of individuals to attain statistical power. In the last couple of years, several statistical methods have been tested in the context of association of rare variants with complex traits [2, 3, 4], largely testing strategies to aggregate the rare variants. These works have not yet tested the gain in statistical power that can be obtained by incorporating reliable biological evidence on the aggregation of rare variants in the most probable functional regions, such as non-coding regulatory regions that control the expression of genes. In fact, it has been demonstrated that even for common candidate variants, most of these variants (around 88%) are located in non-coding regions. If this is true for the common variants detected by traditional GWAS, it is highly probable that it is also true for rare variants.
In this work, we will implement a biology-empowered statistical framework to detect rare variant risk factors for complex diseases in WGS cohorts. We will draw on the 200,000 WGSs from the UK Biobank database, which will be available to scientists before the end of 2023. Access to clinical information on these UK residents, all aged over 40, is also provided. We will build our framework around type-2 diabetes (T2D), a common complex disease for which thousands of common variant candidates have been found. Also, the mapping of regulatory elements is well known for the pancreatic beta cells that play a leading role in T2D. We will use this mapping to guide the aggregation of rare variants and test it against a random aggregation across the genome. Of course, the framework rationale will be applicable to any other complex disease. We will browse the literature for aggregation methods available at the beginning of this work, but we have already selected the method SKAT (sequence kernel association test) to be tested. SKAT fits a random-effects model to the set of variants within a genomic interval or biologically-meaningful region (such as a coding or regulatory region) and computes variant-set level p-values, while permitting correction for covariates (such as principal components that can account for population stratification between cases and controls).
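As a rough illustration of what a variant-set test computes, the sketch below aggregates per-variant score statistics over a region into a single weighted sum of squares and obtains a p-value by permutation. It is a simplified stand-in for SKAT, not the SKAT implementation: it assumes a quantitative phenotype, uses an intercept-only null model (no covariate correction), takes the variant weights as given, and replaces SKAT's analytic p-value with a permutation one. All function names are hypothetical.

```python
import random


def variant_set_q(genotypes, residuals, weights):
    """Weighted sum of squared per-variant scores over a region.

    genotypes: one list of allele counts (0, 1 or 2) per individual, one entry per variant.
    residuals: phenotype residuals under the null model, one per individual.
    weights:   one non-negative weight per variant (for example, larger for rarer variants).
    """
    q = 0.0
    for j, weight in enumerate(weights):
        score = sum(row[j] * r for row, r in zip(genotypes, residuals))
        q += weight * score ** 2
    return q


def variant_set_test(genotypes, phenotype, weights=None, n_perm=999, seed=0):
    """Return (Q, permutation p-value) for the variants in one region.

    Assumption of this sketch: an intercept-only null model, so residuals are the
    phenotype minus its mean, and shuffling them across individuals mimics the null.
    """
    rng = random.Random(seed)
    n_variants = len(genotypes[0])
    if weights is None:
        weights = [1.0] * n_variants
    mean_y = sum(phenotype) / len(phenotype)
    residuals = [y - mean_y for y in phenotype]
    q_obs = variant_set_q(genotypes, residuals, weights)

    exceed = 0
    shuffled = residuals[:]
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if variant_set_q(genotypes, shuffled, weights) >= q_obs:
            exceed += 1
    return q_obs, (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy data: 8 individuals, 3 rare variants in one hypothetical regulatory region.
    G = [[1, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0],
         [0, 0, 1], [0, 0, 0], [1, 0, 0], [0, 0, 0]]
    y = [2.1, 0.3, 1.8, 0.1, 2.4, 0.2, 1.9, 0.4]
    q, p = variant_set_test(G, y, n_perm=499)
    print("Q =", round(q, 3), "permutation p =", p)
```

Comparing a biologically guided aggregation with a random one, as proposed above, would amount to running such a test on regions defined by the regulatory map and on randomly chosen sets of variants, then comparing the resulting p-value distributions.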
Regular seminars relevant to the group are held as part of the Statistics seminar series. The seminars cover various aspects across the AI3 initiative and usually span multiple groups. You can find more information on the Statistics seminar series page, where you can also subscribe to the seminar series calendar.
Analysis and inference from heterogeneous forms of biological data can reveal new insights into intracellular biological mechanisms, dynamic biological networks, evolutionary processes, the genetic basis of disease, and more. The Statistical Modelling for Biology, Genetics and *omics group develops and applies statistical models and methods to analyse large collections of biological data, for example, DNA sequence or structural information, gene expression levels, cell molecule concentrations or protein samples.
The group has expertise in phylogenetics, regulatory genomics, epigenetics, proteomics and metabolomics with active collaborations with researchers in other colleges within the University of Glasgow and beyond.