Statistics and Data Analytics
Supervisor: Dirk Husmeier
Research student: Peter Radvanyi
Environmental statistics; species distribution modelling; spatial ecology; analysis of citizen science data; application of Bayesian methods to characterize biological communities in changing environments
Supervisor: Dirk Husmeier
Population dynamics of ecological systems; spatial ecology; evolutionary ecology in changing environments
Member of other research groups: Mathematical Biology
Research student: Renato Andrade
Supervised and unsupervised learning; mixture models; variable selection; educational testing data; dynamic treatment regime estimation
Research students: Shuhrah Alghamdi, Riham Ismail, Sebastian Martinez Bustos, Robin Muegge (PGR), Aldawarsi Bashayr, Alastair Gemmell
Prof Gemmell is chief executive of the Environment Protection Agency of South Australia.
Research students: Flynn Gewirtz-O'Reilly, Lanxin Li, Kannat Na Bangchang
Postgraduate opportunities: Bayesian statistical data integration of single-cell and bulk “OMICS” datasets with clinical parameters for accurate prediction of treatment outcomes in Rheumatoid Arthritis, Bayesian variable selection for genetic and genomic studies
Machine learning and Bayesian statistics applied to systems biology and bioinformatics; Bayesian networks; statistical phylogenetics
Research staff: Andrej Aderhold, Agnieszka Borowska, Alan Lazarus, Benn Macdonald, Mihaela Paun
Research students: Shaykah Aldossari, Aldawarsi Bashayr, Dalton David, Campioni Nazareno, Ionut Paun, Yalei Yang
My work focuses on spatial point process methodology, in particular the development of modern, realistically complex spatial statistical methodology that is both computationally feasible and relevant to end-users. During my career I have been enthusiastic about taking spatial point processes from the theoretical literature into the real world, and I encourage statistical development by fostering strong relationships with the user community. My work has had an impact on spatial modelling and biodiversity research in the context of ecological studies across many species, taxa and ecosystems. I also have a keen interest in applying realistically complex spatial models in other contexts, including crime modelling, earthquake forecasting, environmental modelling, epidemiology and terrorism studies.
Research staff: Andrew Seaton
Research students: Erin Bryce, Stephen Jun Villejo
Postgraduate opportunities: Integrated spatio-temporal modelling for environmental data, New methods for analysis of migratory navigation
Research student: Robin Muegge (PGR)
Spatiotemporal modelling; Bayesian methods; environmental epidemiology and disease mapping
Research students: George Gerogiannis, Kamol Sanittham, Michael Waltenberger, Robin Muegge (PGR), Yoana Napier, Xueqing Yin
Postgraduate opportunities: Mapping disease risk in space and time, Estimating the effects of air pollution on human health, Forecasting Local Net-electricity Demand at Scale
Research student: Peter Radvanyi
Statistical genetics; population genetics; Bayesian methods; phylogenetics; GPs
Research student: Laura Stewart
Environmental and ecological modelling; nonparametric smoothing; time series analysis; functional data analysis
Research staff: Craig Wilkie, Jafet Belmont Osuna
Research students: Peter Radvanyi, Michael Currie
Postgraduate opportunities: Funded PhD project: Data analytics for urban environmental planning
Forensic statistics; quantile regression; semiparametric models; biostatistics applications
Research student: Wenhui Zhang
Bayesian statistics; MCMC and other Monte Carlo methods; mixture models; discrete choice models
Supervisor: Dirk Husmeier
COVID research; functional data analysis; analysis of mixture models; high-dimensional data; medical image analysis; analysis of earth systems data; immunoinformatics
Research students: Salihah Alghamdi, Yangsong Cheng, Alastair Gemmell, Bader Lafi Q Alruwaili, Wenhui Zhang, Flynn Gewirtz-O'Reilly
Postgraduate opportunities: Modality of mixtures of distributions, Analysis of Spatially correlated functional data objects.
Radio-carbon and cosmogenic dating - design and analysis of proficiency trials; environmental radioactivity; sensitivity and uncertainty analysis applied to complex environmental models; spatial and spatiotemporal modeling of water quality; flood risk modeling; environmental indicators; developing the evidence base for environmental policy and regulation
Supervisor: Janine Illian
Supervisor: Xiaoyu Luo
Bayesian statistical inference; Markov chain Monte Carlo (MCMC) methods; data integration; model selection; stochastic processes
Statistical analysis of mixture distributions; latent structure analysis; pattern recognition; machine learning; smoothing and nonparametric statistics; optimum design of experiments
Non-parametric inference; optimisation; optimal experimental design; sampling theory; applications in economics; multiple comparisons
Research student: Lida Mavrogonatou
Supervisor: Claire Miller (née Ferguson)
Supervised learning; distance metric learning; hyperspectral image analysis
Bayesian data analysis, Ecological statistics, Statistical computing
Member of other research groups: Continuum Mechanics - Modelling and Analysis of Material Systems
Research Topic: Analysis of Spatially correlated functional data objects.
Supervisor: Surajit Ray
Research Topic: Computing, Inference and Applications of Hierarchical Mode
Supervisor: Surajit Ray
Research Topic: Clustering and Cluster Inference of complex data structures
Supervisor: Surajit Ray
Supervisor: Adrian Bowman
Supervisor: Vlad Vyshemirsky
Research Topic: Developing novel ways to represent spatial patterns in disease
Supervisor: Craig Anderson
Research Topic: Spatiotemporal models for environmental data
Supervisor: Adrian Bowman
Supervisor: Ludger Evers
Estimating the effects of air pollution on human health (PhD)
Exposure to air pollution is thought to reduce average life expectancy by six months, with an estimated equivalent health cost of 19 billion each year (from DEFRA). These effects have been estimated using statistical models, which quantify the impact on human health of exposure in both the short and the long term. However, the estimation of such effects is challenging, because individual-level measures of health and pollution exposure are not available. Therefore, the majority of studies are conducted at the population level, and the resulting inference can only be made about the effects of pollution on overall population health. However, the data used in such studies are spatially misaligned, as the health data relate to extended areas such as cities or electoral wards, while the pollution concentrations are measured at individual locations. Furthermore, pollution monitors are typically located where concentrations are thought to be highest, a practice known as preferential sampling, which is likely to result in overly high measurements being recorded. This project aims to develop statistical methodology to address these problems, and thus provide less biased estimates of the effects of pollution on health than are currently produced.
Bayesian variable selection for genetic and genomic studies (PhD)
An important issue in high-dimensional regression problems is the accurate and efficient estimation of models when the number of potential predictors is substantially larger than the number of data points. Further complications arise with correlated predictors, which lead to the breakdown of standard statistical models for inference, and with the uncertain definition of the outcome variable, which is often a varying composition of several different observable traits. Examples of such problems arise in many scenarios in genomics: in determining expression patterns of genes that may be responsible for a type of cancer, and in determining which genetic mutations lead to higher risks for occurrence of a disease. This project involves developing broad and improved Bayesian methodologies for efficient inference in high-dimensional regression-type problems with complex multivariate outcomes, with a focus on genetic data applications.
The successful candidate should have a strong background in methodological and applied Statistics, expert skills in relevant statistical software or programming languages (such as R, C/C++/Python), and also have a deep interest in developing knowledge in cross-disciplinary topics in genomics. The candidate will be expected to consolidate and master an extensive range of topics in modern Statistical theory and applications during their PhD, including advanced Bayesian modelling and computation, latent variable models, machine learning, and methods for Big Data. The successful candidate will be considered for funding to cover domestic tuition fees, as well as paying a stipend at the Research Council rate for four years.
Analysis of Spatially correlated functional data objects. (PhD)
Historically, functional data analysis (FDA) techniques have widely been used to analyze traditional time series data, albeit from a different perspective. Of late, FDA techniques are increasingly being used in domains such as environmental science, where the data are spatio-temporal in nature and hence it is typical to consider such data as functional data where the functions are correlated in time or space. An example where modeling the dependencies is crucial is in analyzing remotely sensed data observed over a number of years across the surface of the earth, where each year forms a single functional data object. One might be interested in decomposing the overall variation across space and time and attributing it to covariates of interest. Another interesting class of data with dependence structure consists of weather data on several variables collected from balloons, where the domain of the functions is a vertical strip in the atmosphere and the data are spatially correlated. One of the challenges with this type of data is the problem of missingness, to address which one needs to develop appropriate spatial smoothing techniques for spatially dependent functional data. There are also interesting design of experiment issues, as well as questions of data calibration to account for the variability in sensing instruments. In spite of the research initiative in analyzing dependent functional data, there are several unresolved problems, which the student will work on:
- robust statistical models for incorporating temporal and spatial dependencies in functional data
- developing reliable prediction and interpolation techniques for dependent functional data
- developing inferential framework for testing hypotheses related to simplified dependent structures
- analysing sparsely observed functional data by borrowing information from neighbours
- visualisation of data summaries associated with dependent functional data
- clustering of functional data
Mapping disease risk in space and time (PhD)
Disease risk varies over space and time, due to similar variation in environmental exposures such as air pollution and risk inducing behaviours such as smoking. Modelling the spatio-temporal pattern in disease risk is known as disease mapping, and the aims are to: quantify the spatial pattern in disease risk to determine the extent of health inequalities, determine whether there has been any increase or reduction in the risk over time, identify the locations of clusters of areas at elevated risk, and quantify the impact of exposures, such as air pollution, on disease risk. I am working on all these related problems at present, and I have PhD projects in all these areas.
Modality of mixtures of distributions (PhD)
Finite mixtures provide a flexible and powerful tool for fitting univariate and multivariate distributions that cannot be captured by standard statistical distributions. In particular, multivariate mixtures have been widely used to perform modeling and cluster analysis of high-dimensional data in a wide range of applications. Modes of mixture densities have been used with great success for organizing mixture components into homogeneous groups, but the results are limited to normal mixtures. Beyond the clustering application, existing research in this area has provided fundamental results regarding the upper bound of the number of modes, but they too are limited to normal mixtures. In this project, we wish to explore the modality of non-normal distributions and their application to real-life problems.
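As a purely illustrative sketch of what modality means here (not part of the project itself), the Python snippet below counts the modes of a univariate normal mixture by evaluating its density on a fine grid. The function names, the grid size and the example parameters are all assumptions chosen for illustration. For two equally weighted normal components with a common standard deviation, the mixture is bimodal only when the means are more than two standard deviations apart, which the two example calls reflect.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def mixture_density(x, weights, means, sds):
    """Density of a finite normal mixture evaluated at the points x."""
    x = np.asarray(x, dtype=float)
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))


def count_modes(weights, means, sds, grid_size=20001):
    """Count strict local maxima of the mixture density on a fine grid.

    This is a numerical approximation: modes closer together than the
    grid spacing would not be separated.
    """
    lo = min(m - 5 * s for m, s in zip(means, sds))
    hi = max(m + 5 * s for m, s in zip(means, sds))
    grid = np.linspace(lo, hi, grid_size)
    dens = mixture_density(grid, weights, means, sds)
    interior = dens[1:-1]
    # A grid point is a peak if it exceeds both of its neighbours.
    is_peak = (interior > dens[:-2]) & (interior > dens[2:])
    return int(is_peak.sum())


# Means one standard deviation apart: a single mode.
print(count_modes([0.5, 0.5], [0.0, 1.0], [1.0, 1.0]))
# Means four standard deviations apart: two modes.
print(count_modes([0.5, 0.5], [0.0, 4.0], [1.0, 1.0]))
```

A grid search of this kind only works in one dimension and says nothing about non-normal components, which is where the open questions described above lie.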
Bayesian statistical data integration of single-cell and bulk “OMICS” datasets with clinical parameters for accurate prediction of treatment outcomes in Rheumatoid Arthritis (PhD)
In recent years, many different computational methods to analyse biological data have been established, covering DNA (genomics), RNA (transcriptomics), proteins (proteomics) and metabolomics, which captures more dynamic events. These methods were refined by the advent of single-cell technology, where it is now possible to capture the transcriptomic profile of single cells and the spatial arrangements of cells from flow methods or imaging methods like functional magnetic resonance imaging. At the same time, these OMICS data can be complemented with clinical data, that is, measurements on patients such as age, smoking status, phenotype of disease or drug treatment. It is an interesting and important open statistical question how to combine data from different “modalities” (like transcriptome with clinical data or imaging data) in a statistically valid way, to compare different datasets and make justifiable statistical inferences. This PhD project will be jointly supervised with Dr. Thomas Otto and Prof. Stefan Siebert from the Institute of Infection, Immunity & Inflammation. You will explore how to combine different datasets using Bayesian latent variable modelling, focusing on clinical datasets from Rheumatoid Arthritis.
The successful candidate will be considered for funding to cover domestic tuition fees, as well as paying a stipend at the Research Council rate for four years.
Funded PhD project: Data analytics for urban environmental planning (PhD)
The transition to a sustainable society is one of the key challenges facing researchers, policy makers and communities today. Key to future city planning for sustainable solutions is an understanding of what data are available and required to inform effective decision making. Novel data analytics and data visualisations are essential tools in this process.
This PhD is suitable for someone from a mathematical/computational sciences background with a strong interest in data analytics and data visualisation. This studentship is an opportunity to develop expertise in data-driven analytics/modelling for connecting quantitative and qualitative spatial (and temporal) data streams and investigating questions arising in urban environmental planning. The successful candidate will play a key role within a large, multi-disciplinary project, GALLANT, supporting Glasgow’s sustainable transformation.
You can find further information here:
New methods for analysis of migratory navigation (PhD)
Joint project with Dr Urška Demšar (University of St Andrews)
Migratory birds travel annually across vast expanses of oceans and continents to reach their destination with incredible accuracy. How they are able to do this using only locally available cues is still not fully understood. Migratory navigation consists of two processes: birds either identify the direction in which to fly (compass orientation) or the location where they are at a specific moment in time (geographic positioning). One of the possible ways they do this is to use information from the Earth’s magnetic field, in so-called geomagnetic navigation (Mouritsen 2018). While there is substantial evidence (both physiological and behavioural) that they do sense the magnetic field (Deutschlander and Beason 2014), we still do not know exactly which components of the field they use for orientation or positioning. We also do not understand how rapid changes in the field affect movement behaviour.
There is a possibility that birds can sense these rapid large changes and that this may affect their navigational process. To study this, we need to link accurate data on Earth’s magnetic field with animal tracking data. This has only become possible very recently through new spatial data science advances: we developed the MagGeo tool, which links contemporaneous geomagnetic data from Swarm satellites of the European Space Agency with animal tracking data (Benitez Paez et al. 2021).
Linking geomagnetic data to animal tracking data, however, creates a high-dimensional data set, which is difficult to explore. Typical analyses of contextual environmental information in ecology include representing contextual variables as covariates in relatively simple statistical models (Brum Bastos et al. 2021), but this is not sufficient for studying detailed navigational behaviour. This project will analyse complex spatio-temporal data using computationally efficient statistical model fitting approaches in a Bayesian context.
This project is fully based on open data to support reproducibility and open science. We will test our new methods by annotating publicly available bird tracking data (e.g. from repositories such as Movebank.org), using the open MagGeo tool and implementing our new methods as Free and Open Source Software (R/Python).
Benitez Paez F, Brum Bastos VdS, Beggan CD, Long JA and Demšar U, 2021. Fusion of wildlife tracking and satellite geomagnetic data for the study of animal migration. Movement Ecology, 9:31. https://doi.org/10.1186/s40462-021-00268-4
Brum Bastos VdS, Łos M, Long JA, Nelson T and Demšar U, 2021. Context-aware movement analysis in ecology: a systematic review. International Journal of Geographical Information Science. https://doi.org/10.1080/13658816.2021.1962528
Deutschlander ME and Beason RC, 2014. Avian navigation and geographic positioning. Journal of Field Ornithology, 85(2):111–133. https://doi.org/10.1111/jofo.12055
Scalable Bayesian Models for Inferring Evolutionary Traits of Plants (PhD)
Supervisors: Vinny Davies
Relevant research groups: Statistics and Data Analytics
The functional traits and environmental preferences of plant species determine how they will react to changes resulting from global warming. The main global biodiversity repositories, such as the Global Biodiversity Information Facility (GBIF), contain hundreds of millions of records from hundreds of thousands of species in the plant kingdom alone, and the spatiotemporal data in these records can be associated with soil, climate or other environmental data from other databases. Combining these records allows us to identify environmental preferences, especially for common species where many records exist. Furthermore, in a previous PhD studentship we showed that these traits are highly evolutionarily conserved (Harris et al., 2022), so it is possible to impute the preferences for rare species where few data exist using phylogenetic inference techniques.
The aim of this PhD project is to investigate the application of Bayesian variable selection methods to identify these evolutionarily conserved traits more effectively, and to quantify these traits and their associated uncertainty for all plant species. The results will be used in a plant ecosystem digital twin that we are developing separately to forecast the impact of climate change on biodiversity. In another PhD studentship, we previously developed similar methods for trait inference in viral evolution (Davies et al., 2017; Davies et al., 2019), but due to the scale of the data here, these methods will need to be significantly enhanced. We therefore propose a project to investigate extensions to methods for phylogenetic trait inference to handle datasets involving hundreds of millions of records in phylogenies with hundreds of thousands of tips, potentially through either sub-sampling (Quiroz et al., 2018) or modelling splitting and recombination (Nemeth & Sherlock, 2018).
Forecasting Local Net-electricity Demand at Scale (PhD)
Electricity supply and demand must balance in real-time, which is increasingly challenging as low-carbon technologies revolutionise energy production (wind, solar) and consumption (electric vehicles, heat pumps). Short-term forecasts are therefore essential to maintain an economic and reliable supply of electricity. Such forecasts are widely used in the energy sector, but forecasters face emerging challenges from new consumer behaviours, small scale generation and storage, as well as data quality, privacy, and security issues. This PhD project will give you the opportunity to develop statistical models to forecast electricity demand at regional and local levels of our continuously evolving energy system. Research themes include:
- Computationally efficient modelling and forecasting of 100s or 1000s of regions (or potentially millions of smart meters!).
- Adaptive modelling and forecasting in the presence of structural breaks.
- Probabilistic forecasting accounting for spatial and temporal dependencies and hierarchies.
The project provides an excellent opportunity to conduct cutting edge methodological development complemented by a practical application of societal importance. The successful candidate will need to be comfortable with interfacing with other disciplines and industry partners and be passionate about their research.
Metabolomics DIA Resolver (PhD)
Supervisors: Vinny Davies
Relevant research groups: Statistics and Data Analytics
In metabolomics we take a sample (blood, urine, etc.) and put it through a mass spectrometer. The mass spectrometer scans the sample in multiple ways to help us work out what metabolites can be found in the sample. Identifying these metabolites can be useful for clinical trials, disease diagnosis and progression, and various other medical applications. There are various ways of choosing the scans, but in one particular method (DIA) we often see multiple fragments from multiple metabolites in a single scan. In order to identify the metabolites we need to work out which fragments belong to which metabolites. The project will use our recently developed virtual mass spectrometer, ViMMS (Wandy et al., 2019; Wandy et al., 2022), to continue the development of our new metabolomics DIA resolver, MSdeconvolve. We will expand MSdeconvolve to work across multiple repeated samples collected in different ways and then extend it to work for completely different samples. Initially this will be done using standard statistical and machine learning methods, but we will look to extend this into a Bayesian modelling framework.
Integrated spatio-temporal modelling for environmental data (PhD)
(Jointly supervised by Peter Henrys, CEH)
The last decade has seen a proliferation of environmental data with vast quantities of information available from various sources. This has been due to a number of different factors including: the advent of sensor technologies; the provision of remotely sensed data from both drones and satellites; and the explosion in citizen science initiatives. These data represent a step change in the resolution of available data across space and time - sensors can be streaming data at a resolution of seconds whereas citizen science observations can be in the hundreds of thousands.
Over the same period, the resources available for traditional field surveys have decreased dramatically whilst logistical issues (such as access to sites) have increased. This has severely impacted the ability of field survey campaigns to collect data at high spatial and temporal resolutions. It is exactly this sort of information that is required to fit models that can quantify and predict the spread of invasive species, for example.
Whilst we have seen an explosion of data across various sources, there is no single source that provides both the spatial and temporal intensity that may be required when fitting complex spatio-temporal models (cf. the invasive species example); each has its own advantages and benefits in terms of information content. There is therefore potentially huge benefit in bringing together data from these different sources within a consistent framework, to exploit the benefits each offers and to understand processes at unprecedented resolutions and scales that would otherwise be impossible to monitor.
Current approaches to combining data in this way are typically very bespoke and involve complex model structures that are not reusable outside of the particular application area. What is needed is an overarching generic methodological framework and associated software solutions to implement such analyses. Not only would such a framework provide the methodological basis to enable researchers to benefit from this big data revolution, but it would also provide the capability to change such analyses from being stand-alone research projects in their own right to more operational, standard analytical routines.
Finally, such dynamic, integrated analyses could feed back into data collection initiatives to ensure optimal allocation of effort for traditional surveys or optimal power management for sensor networks. The major step change is that this optimal allocation of effort is conditional on the other data that are available. So, for example, given the coverage and intensity of the citizen science data, where should we optimally send our paid surveyors? The idea is that information is collected at times and locations that provide the greatest benefit in understanding the underpinning stochastic processes. These two major issues, integrated analyses and adaptive sampling, ensure that environmental monitoring is fit for purpose and that scientists, policy and industry can benefit from the big data revolution.
This project will develop an integrated statistical modelling strategy that provides a single modelling framework for enabling quantification of ecosystem goods and services while accounting for the fundamental differences in different data streams. Data collected at different spatial resolutions can be used within the same model through projecting it into continuous space and projecting it back into the landscape level of interest. As a result, decisions can be made at the relevant spatial scale and uncertainty is propagated through, facilitating appropriate decision making.