Statistical modelling for biology, genetics & *omics

The group has expertise in developing statistical and machine learning methods for phylogenetics, regulatory genomics, epigenetics, proteomics and metabolomics.

Statistical Modelling for Biology, Genetics and *omics - Example Research Projects

Information about postgraduate research opportunities and how to apply can be found on the Postgraduate Research Study page. Below is a selection of projects that could be undertaken with our group.

Modelling genetic variation (MSc/PhD)

Supervisors: Vincent Macaulay
Relevant research groups: Bayesian Modelling and Inference, Statistical Modelling for Biology, Genetics and *omics

Variation in the distribution of different DNA sequences across individuals has been shaped by many processes that can be modelled probabilistically, such as demographic factors (for example, prehistoric population movements) and natural selection. This project involves developing new techniques for teasing out

Modality of mixtures of distributions (PhD)

Supervisors: Surajit Ray
Relevant research groups: Nonparametric and Semi-parametric Statistics, Applied Probability and Stochastic Processes, Statistical Modelling for Biology, Genetics and *omics, Biostatistics, Epidemiology and Health Applications

Finite mixtures provide a flexible and powerful tool for fitting univariate and multivariate distributions that cannot be captured by standard statistical distributions. In particular, multivariate mixtures have been widely used to perform modelling and cluster analysis of high-dimensional data in a wide range of applications. Modes of mixture densities have been used with great success for organizing mixture components into homogeneous groups, but the results are limited to normal mixtures. Beyond the clustering application, existing research in this area has provided fundamental results regarding the upper bound of the number of modes, but these too are limited to normal mixtures. In this project, we wish to explore the modality of non-normal mixtures and their application to real-life problems.
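As a rough illustration of what "modality" means here, the sketch below counts the modes of a univariate normal mixture by evaluating its density on a grid and counting local maxima. It is only a numerical illustration under stated assumptions: the function names, the grid resolution and the four-standard-deviation search range are hypothetical choices made for this example, not methods from the project.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, sds):
    """Density of a finite normal mixture at x."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))


def count_modes(weights, means, sds, grid_size=2001):
    """Count local maxima of the mixture density on an evenly spaced grid.

    Assumption: all modes lie within four standard deviations of some
    component mean and are separated by more than one grid step.
    """
    lo = min(m - 4.0 * s for m, s in zip(means, sds))
    hi = max(m + 4.0 * s for m, s in zip(means, sds))
    step = (hi - lo) / (grid_size - 1)
    density = [mixture_pdf(lo + i * step, weights, means, sds)
               for i in range(grid_size)]
    modes = 0
    for i in range(1, grid_size - 1):
        # Strictly above the left neighbour, not below the right neighbour.
        if density[i] > density[i - 1] and density[i] >= density[i + 1]:
            modes += 1
    return modes


# Well-separated components give two modes; close components merge into one.
print(count_modes([0.5, 0.5], [0.0, 4.0], [1.0, 1.0]))
print(count_modes([0.5, 0.5], [0.0, 1.0], [1.0, 1.0]))
```

A grid search of this kind does not extend easily to high dimensions, which is one reason analytical results on the number of modes are of interest.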

Implementing a biology-empowered statistical framework to detect rare variant risk factors for complex diseases in whole genome sequence cohorts (PhD)

Supervisors: Vincent Macaulay, Luísa Pereira (Geneticist, i3s)
Relevant research groups: Statistical Modelling for Biology, Genetics and *omics, Biostatistics, Epidemiology and Health Applications

Traditional genome-wide association studies (GWAS), which detect candidate genetic risk factors for complex diseases or phenotypes, rely largely on microarray technology, genotyping thousands or millions of variants regularly spaced across the genome at once. These microarrays include mostly common variants (minor allele frequency, MAF>5%) and so miss candidate rare variants, which are more likely to be deleterious [1]. Currently, the best strategy to genotype low-frequency (1%<MAF<5%) and rare (MAF<1%) variants is next generation sequencing, and the increasing availability of whole genome sequences (WGS) places us on the brink of detecting rare variants associated with complex diseases [2]. Statistically, this detection is a challenge, as the massive number of rare variants in genomes (for example, 64.7M in 150 Iberian WGSs) would imply genotyping millions or billions of individuals to attain statistical power. In the last couple of years, several statistical methods have been tested in the context of association of rare variants with complex traits [2, 3, 4], largely testing strategies to aggregate the rare variants. These works have not yet tested the gain in statistical power that can be obtained by incorporating reliable biological evidence on the aggregation of rare variants in the most probable functional regions, such as non-coding regulatory regions that control the expression of genes [4]. In fact, it has been demonstrated that even among common candidate variants, most (around 88%; [5]) are located in non-coding regions. If this is true for the common variants detected by traditional GWAS, it is highly likely to be true for rare variants as well.
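The frequency classes used above can be made concrete with a small sketch that computes the minor allele frequency of a biallelic variant from genotype counts and assigns it to a class. The function names are hypothetical, and the treatment of the exact boundaries (1% and 5%), which the text leaves open, is an assumption of this example.

```python
def minor_allele_frequency(genotypes):
    """MAF of a biallelic variant in diploid individuals.

    genotypes: alternate-allele counts per individual (0, 1 or 2).
    """
    genotypes = list(genotypes)
    if not genotypes:
        raise ValueError("at least one genotype is required")
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1 - alt_freq)


def frequency_class(maf):
    """Classify a variant by MAF.

    Assumption: a MAF of exactly 1% or exactly 5% is counted as low-frequency.
    """
    if maf > 0.05:
        return "common"
    if maf >= 0.01:
        return "low-frequency"
    return "rare"


# One heterozygous carrier among 100 individuals: MAF = 1/200.
singleton = [1] + [0] * 99
print(minor_allele_frequency(singleton), frequency_class(minor_allele_frequency(singleton)))
```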

In this work, we will implement a biology-empowered statistical framework to detect rare variant risk factors for complex diseases in WGS cohorts. We will draw on the 200,000 WGSs from the UK Biobank database [6], which will be available to scientists before the end of 2023. Access to clinical information on these UK residents, all aged over 40, is also provided. We will build our framework around type-2 diabetes (T2D), a common complex disease for which thousands of common variant candidates have been found [7]. In addition, the mapping of regulatory elements is well known for the pancreatic beta cells that play a leading role in T2D [8]. We will use this mapping to guide the aggregation of rare variants and test it against a random aggregation across the genome. The framework rationale will, of course, be applicable to any other complex disease. We will browse the literature for aggregation methods available at the beginning of this work, but we have already selected the method SKAT (sequence kernel association test; [3]) to be tested. SKAT fits a random-effects model to the set of variants within a genomic interval or biologically meaningful region (such as a coding or regulatory region) and computes variant-set level p-values, while permitting correction for covariates (such as principal components that can account for population stratification between cases and controls).
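To give a feel for variant-set testing, the sketch below computes a SKAT-style statistic for one region: a weighted sum of squared per-variant score statistics, with weights that favour rarer variants. It is a simplified illustration, not the SKAT implementation. Two assumptions are made loudly: the null model contains only an intercept (no covariate adjustment), and the p-value is obtained by permuting phenotypes rather than from the analytical null distribution used by SKAT. The Beta(1, 25) weight is the default described for SKAT; all function names are hypothetical.

```python
import random


def beta_1_25_weight(maf):
    """Beta(1, 25) density at the MAF, which up-weights rarer variants."""
    return 25.0 * (1.0 - maf) ** 24


def skat_like_statistic(genotypes, phenotypes):
    """Weighted sum of squared score statistics over the variants in a region.

    genotypes: one row per individual, one alternate-allele count per variant.
    phenotypes: one numeric value per individual (for example 0/1 disease status).
    Assumption: intercept-only null model, so residuals are y minus its mean.
    """
    n = len(phenotypes)
    mean_y = sum(phenotypes) / n
    residuals = [y - mean_y for y in phenotypes]
    q = 0.0
    for j in range(len(genotypes[0])):
        column = [row[j] for row in genotypes]
        alt_freq = sum(column) / (2.0 * n)
        maf = min(alt_freq, 1.0 - alt_freq)
        score = sum(g * r for g, r in zip(column, residuals))
        q += (beta_1_25_weight(maf) * score) ** 2
    return q


def permutation_p_value(genotypes, phenotypes, n_permutations=999, seed=0):
    """Permutation p-value (a stand-in for SKAT's analytical p-value)."""
    rng = random.Random(seed)
    observed = skat_like_statistic(genotypes, phenotypes)
    shuffled = list(phenotypes)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if skat_like_statistic(genotypes, shuffled) >= observed:
            exceed += 1
    # Adding one to numerator and denominator keeps the p-value above zero.
    return (1 + exceed) / (1 + n_permutations)


# Toy region with three variants in eight individuals (made-up data).
toy_genotypes = [
    [1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 0, 0],
    [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0],
]
toy_phenotypes = [1, 1, 1, 0, 0, 0, 0, 0]
print(skat_like_statistic(toy_genotypes, toy_phenotypes))
print(permutation_p_value(toy_genotypes, toy_phenotypes))
```

In the proposed framework, the region passed to such a test would be defined by the regulatory mapping rather than by an arbitrary genomic window, and the comparison of interest is against regions aggregated at random.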

Seminars

Regular seminars relevant to the group are held as part of the Statistics seminar series. The seminars cover various aspects across the AI3 initiative and usually span multiple groups. You can find more information on the Statistics seminar series page, where you can also subscribe to the seminar series calendar.

Analysis and inference from heterogeneous forms of biological data can reveal new insights into intracellular biological mechanisms, dynamic biological networks, evolutionary processes, the genetic basis of disease, and more. The Statistical Modelling for Biology, Genetics and *omics group develops and applies statistical models and methods to analyse large collections of biological data, for example, DNA sequence or structural information, gene expression levels, cell molecule concentrations or protein samples.

The group has expertise in phylogenetics, regulatory genomics, epigenetics, proteomics and metabolomics with active collaborations with researchers in other colleges within the University of Glasgow and beyond.