Biostatistics, epidemiology, & health applications

Our group has a wide range of interests across health and epidemiology, developing methods for a variety of applications including disease risk, prevalence and animal health, and continues to be at the forefront of many aspects of Covid research.


Postgraduate research students

Biostatistics, Epidemiology and Health Applications - Example Research Projects

Information about postgraduate research opportunities and how to apply can be found on the Postgraduate Research Study page. Below is a selection of projects that could be undertaken with our group.

Bayesian statistical data integration of single-cell and bulk “OMICS” datasets with clinical parameters for accurate prediction of treatment outcomes in Rheumatoid Arthritis (PhD)

Supervisors: Mayetri Gupta
Relevant research groups: Bayesian Modelling and Inference, Computational Statistics, Vincent Macaulay, Biostatistics, Epidemiology and Health Applications

In recent years, many computational methods have been established to analyse different kinds of biological data, including DNA (genomics), RNA (transcriptomics), proteins (proteomics) and metabolites (metabolomics), the last of which captures more dynamic events. These methods were refined by the advent of single-cell technology, which makes it possible to capture the transcriptomic profile of single cells, and the spatial arrangement of cells from flow methods or imaging methods like functional magnetic resonance imaging. At the same time, these OMICS data can be complemented with clinical data: measurements on patients, such as age, smoking status, disease phenotype or drug treatment. How to combine data from different “modalities” (such as transcriptome data with clinical or imaging data) in a statistically valid way, so that different datasets can be compared and justifiable statistical inferences made, is an interesting and important open statistical question. This PhD project will be jointly supervised with Dr. Thomas Otto and Prof. Stefan Siebert from the Institute of Infection, Immunity & Inflammation. You will explore how to combine different datasets using Bayesian latent variable modelling, focusing on clinical datasets from Rheumatoid Arthritis.

Funding Notes

The successful candidate will be considered for funding that covers domestic tuition fees and pays a stipend at the Research Council rate for four years.

Modality of mixtures of distributions (PhD)

Supervisors: Surajit Ray
Relevant research groups: Nonparametric and Semi-parametric Statistics, Applied Probability and Stochastic Processes, Statistical Modelling for Biology, Genetics and *omics, Biostatistics, Epidemiology and Health Applications

Finite mixtures provide a flexible and powerful tool for fitting univariate and multivariate distributions that cannot be captured by standard statistical distributions. In particular, multivariate mixtures have been widely used to perform modelling and cluster analysis of high-dimensional data in a wide range of applications. Modes of mixture densities have been used with great success for organizing mixture components into homogeneous groups, but the results are limited to normal mixtures. Beyond the clustering application, existing research in this area has provided fundamental results regarding the upper bound on the number of modes, but these too are limited to normal mixtures. In this project, we wish to explore the modality of mixtures of non-normal distributions and their application to real-life problems.
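As a rough illustration of what counting modes involves in the simplest setting (a univariate normal mixture, where existing results apply), the sketch below evaluates the mixture density on a fine grid and counts its local maxima. The function names, the grid-based approach and the example parameters are assumptions made for illustration; they are not taken from the project itself.

```python
import numpy as np


def mixture_density(x, weights, means, sds):
    """Density of a univariate normal mixture evaluated at the points x."""
    x = np.asarray(x, dtype=float)[:, None]
    w = np.asarray(weights, dtype=float)
    mu = np.asarray(means, dtype=float)
    sd = np.asarray(sds, dtype=float)
    # One column per component, one row per evaluation point.
    comp = np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    return comp @ w


def count_modes(weights, means, sds, grid_size=20001):
    """Count local maxima of the mixture density on a fine grid.

    This is a numerical approximation: modes closer together than the
    grid spacing would be merged.
    """
    lo = min(means) - 5.0 * max(sds)
    hi = max(means) + 5.0 * max(sds)
    grid = np.linspace(lo, hi, grid_size)
    dens = mixture_density(grid, weights, means, sds)
    interior = dens[1:-1]
    # A grid point is a peak if it rises above its left neighbour and
    # does not fall below its right neighbour.
    is_peak = (interior > dens[:-2]) & (interior >= dens[2:])
    return int(is_peak.sum())


if __name__ == "__main__":
    # Two equally weighted unit-variance components, close together.
    print(count_modes([0.5, 0.5], [0.0, 1.0], [1.0, 1.0]))
    # The same components, well separated.
    print(count_modes([0.5, 0.5], [0.0, 4.0], [1.0, 1.0]))
```

Replacing the normal component density with a non-normal one is straightforward in code like this, which is part of why theoretical results on the number of modes beyond the normal case are of interest.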

Estimating the effects of air pollution on human health (PhD)

Supervisors: Duncan Lee
Relevant research groups: Modelling in Space and Time, Biostatistics, Epidemiology and Health Applications

Exposure to air pollution is thought to reduce average life expectancy by six months, with an estimated equivalent health cost of £19 billion each year (from DEFRA). These effects have been estimated using statistical models, which quantify the impact on human health of exposure in both the short and the long term. However, the estimation of such effects is challenging, because individual-level measures of health and pollution exposure are not available. Therefore, the majority of studies are conducted at the population level, and the resulting inference can only be made about the effects of pollution on overall population health. Moreover, the data used in such studies are spatially misaligned, as the health data relate to extended areas such as cities or electoral wards, while the pollution concentrations are measured at individual locations. Furthermore, pollution monitors are typically located where concentrations are thought to be highest, a practice known as preferential sampling, which is likely to result in overly high measurements being recorded. This project aims to develop statistical methodology to address these problems, and thus provide less biased estimates of the effects of pollution on health than are currently produced.
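The effect of preferential sampling can be seen in a small simulation. The sketch below is purely illustrative: the lognormal concentrations, the number of sites and monitors, and the rule that a site's chance of receiving a monitor is proportional to the cube of its concentration are all assumptions chosen to make the effect visible, not features of any real monitoring network.

```python
import numpy as np


def simulate_monitor_bias(n_sites=1000, n_monitors=50, seed=0):
    """Compare random and preferential monitor placement on simulated data."""
    rng = np.random.default_rng(seed)

    # Hypothetical pollution concentrations at every candidate site.
    concentrations = rng.lognormal(mean=2.0, sigma=0.5, size=n_sites)
    true_mean = concentrations.mean()

    # Monitors placed completely at random.
    random_idx = rng.choice(n_sites, size=n_monitors, replace=False)

    # Monitors placed preferentially: more polluted sites are more likely
    # to be chosen (probability proportional to concentration cubed; an
    # arbitrary illustrative choice).
    probs = concentrations**3 / np.sum(concentrations**3)
    pref_idx = rng.choice(n_sites, size=n_monitors, replace=False, p=probs)

    return {
        "true_mean": float(true_mean),
        "random_mean": float(concentrations[random_idx].mean()),
        "preferential_mean": float(concentrations[pref_idx].mean()),
    }


if __name__ == "__main__":
    for name, value in simulate_monitor_bias().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the preferentially placed monitors overstate the area-wide average concentration, which is the kind of bias the methodology in this project would need to correct for.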

Mapping disease risk in space and time (PhD)

Supervisors: Duncan Lee
Relevant research groups: Modelling in Space and Time, Biostatistics, Epidemiology and Health Applications

Disease risk varies over space and time, due to similar variation in environmental exposures such as air pollution and risk-inducing behaviours such as smoking. Modelling the spatio-temporal pattern in disease risk is known as disease mapping, and the aims are to: quantify the spatial pattern in disease risk to determine the extent of health inequalities, determine whether there has been any increase or reduction in the risk over time, identify the locations of clusters of areas at elevated risk, and quantify the impact of exposures, such as air pollution, on disease risk. I am working on all these related problems at present, and I have PhD projects in all these areas.

Generating deep fake left ventricles: a step towards personalised heart treatments (PhD)

Supervisors: Andrew Elliott, Vinny Davies, Hao Gao
Relevant research groups: Machine Learning and AI, Emulation and Uncertainty Quantification, Biostatistics, Epidemiology and Health Applications, Statistical Modelling for Biology, Genetics and *omics

Personalised medicine is an exciting avenue in the field of cardiac healthcare where an understanding of patient-specific mechanisms can lead to improved treatments (Gao et al., 2017). The use of mathematical models to link the underlying properties of the heart with cardiac imaging offers the possibility of obtaining important parameters of heart function non-invasively (Gao et al., 2015). Unfortunately, current estimation methods rely on complex mathematical forward simulations, so that a solution takes hours, a time frame not suitable for real-time treatment decisions. To increase the applicability of these methods, statistical emulation methods have been proposed as an efficient way of estimating the parameters (Davies et al., 2019; Noè et al., 2019). In this approach, simulations of the mathematical model are run in advance and then machine learning based methods are used to estimate the relationship between the cardiac imaging and the parameters of interest. These methods are, however, limited by our ability to understand how cardiac geometry varies across patients, which is in turn limited by the amount of data available (Romaszko et al., 2019). In this project we will look at AI based methods for generating fake cardiac geometries which can be used to increase the amount of data (Qiao et al., 2023). We will explore different types of AI generation, including Generative Adversarial Networks and Variational Autoencoders, to understand how we can generate better 3D and 4D models of fake left ventricles and create an improved emulation strategy that can make use of them.

Bayesian Mixture Models for Spatio-Temporal Data (PhD)

Supervisors: Craig Anderson
Relevant research groups: Modelling in Space and Time, Bayesian Modelling and Inference, Biostatistics, Epidemiology and Health Applications

The prevalence of disease is typically not constant across space; instead, the risk tends to vary from one region to another. Some of this variability may be down to environmental conditions, but much of it is driven by socio-economic differences between regions, with poorer regions tending to have worse health than wealthier regions. For example, within the Greater Glasgow and Clyde region, the World Health Organisation noted that life expectancy ranges from 54 in Calton to 82 in Lenzie, despite these areas being less than 10 miles apart. There is substantial value to health professionals and policymakers in identifying some of the causes behind these localised health inequalities.

Disease mapping is a field of statistical epidemiology which focuses on estimating the patterns of disease risk across a geographical region. The main goal of such mapping is typically to identify regions of high disease risk so that relevant public health interventions can be made. This project involves the development of statistical models which will enhance our understanding of regional differences in the risk of suffering from major diseases by focusing on these localised health inequalities.

Standard Bayesian hierarchical models with a conditional autoregressive prior are frequently used for risk estimation in this context, but these models assume a smooth risk surface which is often not appropriate in practice. In reality, it will often be the case that different regions have vastly different risk profiles and require different data generating functions as a result.

In this work we propose a mixture model based approach which allows different sub-populations to be represented by different underlying statistical distributions within a single modelling framework. By integrating CAR models into mixture models, researchers can simultaneously account for spatial dependencies and identify distinct disease patterns within subpopulations.
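To make the CAR ingredient concrete, the sketch below builds the precision matrix of one common CAR specification (the Leroux form, Q = [rho (D - W) + (1 - rho) I] / tau2) from a neighbourhood matrix and draws a set of spatially correlated random effects from it. The choice of the Leroux form, the function names and the four-region example are assumptions for illustration; the project description does not specify which CAR prior or which mixture structure will be used, and the mixture component itself is not shown.

```python
import numpy as np


def leroux_car_precision(adjacency, rho, tau2=1.0):
    """Precision matrix Q = (rho * (D - W) + (1 - rho) * I) / tau2.

    W is a symmetric 0/1 neighbourhood matrix and D is the diagonal
    matrix holding each region's number of neighbours.
    """
    W = np.asarray(adjacency, dtype=float)
    if W.ndim != 2 or W.shape[0] != W.shape[1] or not np.allclose(W, W.T):
        raise ValueError("adjacency must be a square symmetric matrix")
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1) for a proper prior")
    n = W.shape[0]
    D = np.diag(W.sum(axis=1))
    return (rho * (D - W) + (1.0 - rho) * np.eye(n)) / tau2


def sample_car_effects(precision, rng):
    """Draw one vector of random effects from N(0, Q^{-1})."""
    L = np.linalg.cholesky(precision)          # Q = L L^T
    z = rng.standard_normal(precision.shape[0])
    return np.linalg.solve(L.T, z)             # covariance (L L^T)^{-1}


if __name__ == "__main__":
    # Four hypothetical regions arranged in a line: 1-2-3-4.
    W = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
    Q = leroux_car_precision(W, rho=0.9)
    phi = sample_car_effects(Q, np.random.default_rng(1))
    print(np.round(Q, 2))
    print(np.round(phi, 2))
```

In a mixture formulation of the kind described above, effects like these could enter one or more components, but how best to do that is the research question rather than something this sketch settles.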

Implementing a biology-empowered statistical framework to detect rare variant risk factors for complex diseases in whole genome sequence cohorts (PhD)

Supervisors: Vincent Macaulay, Luísa Pereira (Geneticist, i3s)
Relevant research groups: Statistical Modelling for Biology, Genetics and *omics, Biostatistics, Epidemiology and Health Applications

Traditional genome-wide association studies (GWAS), which detect candidate genetic risk factors for complex diseases or phenotypes, rely largely on microarray technology, genotyping at once thousands or millions of variants regularly spaced across the genome. These microarrays include mostly common variants (minor allele frequency, MAF>5%), and so miss candidate rare variants, which are the more likely to be deleterious [1]. Currently, the best strategy to genotype low-frequency (1%<MAF<5%) and rare (MAF<1%) variants is next generation sequencing, and the increasing availability of whole genome sequences (WGS) places us on the brink of detecting rare variants associated with complex diseases [2]. Statistically, this detection is a challenge, as the massive number of rare variants in genomes (for example, 64.7M in 150 Iberian WGSs) would imply genotyping millions or billions of individuals to attain statistical power. In the last couple of years, several statistical methods have been tested in the context of association of rare variants with complex traits [2, 3, 4], largely testing strategies to aggregate the rare variants. These works have not yet tested the statistical empowerment that can be gained by incorporating reliable biological evidence on the aggregation of rare variants in the most probable functional regions, such as non-coding regulatory regions that control the expression of genes [4]. In fact, it has been demonstrated that even for common candidate variants, most (around 88%; [5]) are located in non-coding regions. If this is true for the common variants detected by traditional GWAS, it is highly probable that it is also true for rare variants.

In this work, we will implement a biology-empowered statistical framework to detect rare variant risk factors for complex diseases in WGS cohorts. We will draw on the 200,000 WGSs from the UK Biobank database [6], which will be available to scientists before the end of 2023. Access to clinical information on these UK residents, all aged over 40, is also provided. We will build our framework around type-2 diabetes (T2D), a common complex disease for which thousands of common variant candidates have been found [7]. Also, the mapping of regulatory elements is well known for the pancreatic beta cells that play a leading role in T2D [8]. We will use this mapping to guide the aggregation of rare variants and test it against a random aggregation across the genome. Of course, the framework rationale will be applicable to any other complex disease. We will browse the literature for aggregation methods available at the beginning of this work, but we have already selected the method SKAT (sequence kernel association test; [3]) to be tested. SKAT fits a random-effects model to the set of variants within a genomic interval or biologically meaningful region (such as a coding or regulatory region) and computes variant-set level p-values, while permitting correction for covariates (such as principal components, which can account for population stratification between cases and controls).
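To give a flavour of what a variant-set test of this kind computes, the sketch below evaluates a SKAT-style statistic with a weighted linear kernel, Q = sum_j w_j (G_j' r)^2, where r are residuals from a covariate-only regression and sqrt(w_j) is the Beta(1, 25) density at the allele frequency of variant j. Several simplifications are assumptions made here for brevity: the trait is treated as continuous, the function names are hypothetical, and the p-value comes from permuting residuals rather than from the mixture-of-chi-squares null distribution used by the actual SKAT method. It is a toy sketch, not the SKAT software.

```python
import numpy as np


def skat_statistic(residuals, genotypes, maf):
    """SKAT-style statistic: sum over variants of weighted squared scores."""
    # sqrt(w_j) is the Beta(1, 25) density at maf_j, i.e. 25 * (1 - maf_j)^24,
    # which up-weights rarer variants.
    sqrt_w = 25.0 * (1.0 - np.asarray(maf, dtype=float)) ** 24
    scores = np.asarray(genotypes, dtype=float).T @ residuals
    return float(np.sum((sqrt_w * scores) ** 2))


def skat_permutation_test(y, genotypes, covariates=None, n_perm=999, seed=0):
    """Return (Q, p-value) for one variant set; the p-value is approximate."""
    y = np.asarray(y, dtype=float)
    G = np.asarray(genotypes, dtype=float)     # n individuals x p variants
    n = y.shape[0]

    # Null model: regress the trait on an intercept plus any covariates.
    X = np.ones((n, 1))
    if covariates is not None:
        X = np.column_stack([X, np.asarray(covariates, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta

    # Allele frequency of the counted allele (assumed to be the minor one).
    maf = G.mean(axis=0) / 2.0
    q_obs = skat_statistic(resid, G, maf)

    # Permutation null: a stand-in for SKAT's analytic null distribution.
    rng = np.random.default_rng(seed)
    q_perm = np.array([
        skat_statistic(rng.permutation(resid), G, maf) for _ in range(n_perm)
    ])
    p_value = (1 + np.sum(q_perm >= q_obs)) / (n_perm + 1)
    return q_obs, float(p_value)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    G = rng.binomial(2, 0.05, size=(300, 8))   # simulated rare-ish variants
    y = 3.0 * G.sum(axis=1) + rng.standard_normal(300)
    print(skat_permutation_test(y, G, n_perm=199))
```

In the framework described above, the columns of the genotype matrix would be chosen using the regulatory-element mapping, and the same test would be run on randomly aggregated sets for comparison.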



Regular seminars relevant to the group are held as part of the Statistics seminar series. The seminars cover various aspects across the AI3 initiative and usually span multiple groups. You can find more information on the Statistics seminar series page, where you can also subscribe to the seminar series calendar.

What causes human disease and wellness? Why are some diseases prevalent in certain communities, but not others? What can be done to prevent the occurrence or spread of disease? How big are the inequalities in disease risk? These are all questions that have become increasingly important with the recent Covid-19 pandemic.

Staff in the Biostatistics, Epidemiology and Health Applications group were at the forefront of Covid research, from research into the immunology of Covid to assessing its prevalence through nowcasting and detecting hotspots for Covid outbreaks. People in the group also work on a variety of health-related applications including air pollution, cardiac mechanics, assessing genetic risk of heart disease and cancer, and animal health.

The group, in collaboration with the Scottish Environment Protection Agency and Public Health Scotland, has helped to quantify the effects of air pollution on health in Scotland. Others in the group have worked on clinical trials looking at pain assessment in cats and dogs.