Statistics and Data Analytics

Staff

Dr Andrej Aderhold : Research Associate

Supervisor: Dirk Husmeier

  • Publications
  • Dr Craig Alexander : Lecturer

     

    Research student: Peter Radvanyi

  • Personal Website
  • Dr Linda Altieri : Environmental Research Associate

  • Dr Craig Anderson : Lecturer

    Research students: Alison Smith, Xueqing Yin, Riham Ismail, Kamol Sanittham, Michael Waltenberger

  • Personal Website
  • Dr Jafet Belmont Osuna : Research Associate

    Environmental statistics; species distributions modelling; spatial ecology; analysis of citizen science data; application of Bayesian methods to characterize biological communities in changing environments

    Supervisors: Marian Scott OBE, Claire Miller (née Ferguson)

  • Dr Mitchum Bock : Lecturer

  • Publications
  • Dr Agnieszka Borowska : Research Assistant

    Supervisor: Dirk Husmeier

  • Prof Adrian Bowman : Professor of Statistics

    Research students: Yinuo Liu, George Vazanellis

  • Personal Website
  • Publications
  • Dr Daniela Castro-Camilo : Lecturer

    Research students: Erin Bryce, Daniela Cuba, Chenglei Hu

  • Personal Website
  • Dr Christina A Cobbold : Reader

    Population dynamics of ecological systems; spatial ecology; evolutionary ecology in changing environments

    Member of other research groups: Mathematical Biology
    Research student: Renato Andrade

  • Personal Website
  • Publications
  • Dr Nema Dean : Lecturer

    Supervised and unsupervised learning; mixture models; variable selection; educational testing data; dynamic treatment regime estimation

    Research students: Shuhrah Alghamdi, Riham Ismail, Sebastian Martinez Bustos, Robin Muegge, Aldawarsi Bashayr, Alastair Gemmell

  • Personal Website
  • Publications
  • Dr Amira Elayouty : Lecturer

  • Ludger Evers : Lecturer (part-time)

    Research students: Benjamin Szili, Ivona Voroneckaja, Shuhrah Alghamdi, Dimitra Eleftheriou

  • Publications
  • Prof James Campbell Gemmell : Honorary Professor

    Prof Gemmell is chief executive of the Environment Protection Agency of South Australia.

  • Personal Website
  • Dr Mayetri Gupta : Reader

    Research students: Flynn Gewirtz-O'Reilly, Lanxin Li, Kannat Na Bangchang
    Postgraduate opportunities: Bayesian statistical data integration of single-cell and bulk “OMICS” datasets with clinical parameters for accurate prediction of treatment outcomes in Rheumatoid Arthritis, Bayesian variable selection for genetic and genomic studies

  • Personal Website
  • Publications
  • Prof Dirk Husmeier : Chair of Statistics

    Machine learning and Bayesian statistics applied to systems biology and bioinformatics; Bayesian networks; statistical phylogenetics

    Research staff: Andrej Aderhold, Agnieszka Borowska, Alan Lazarus, Benn Macdonald, Mihaela Paun, Ionut Paun
    Research students: Shaykah Aldossari, Aldawarsi Bashayr, Dalton David, Campioni Nazareno, Yalei Yang

  • Personal Website
  • Publications
  • Prof Janine Illian : Chair/Professor in Statistical Science

    My work focuses on spatial point process methodology, in particular the development of modern, realistically complex spatial statistical methodology that is both computationally feasible and relevant to end-users. During my career I have been enthusiastic about taking spatial point processes from the theoretical literature into the real world, and I encourage statistical development by fostering strong relationships with the user community. My work has had an impact on spatial modelling and biodiversity research in the context of ecological studies across many species, taxa and ecosystems. I also have a keen interest in applying realistically complex spatial models in other contexts, including crime modelling, earthquake forecasting, environmental modelling, epidemiology and terrorism studies.

    Research staff: Andrew Seaton
    Research students: Erin Bryce, Stephen Jun Villejo
    Postgraduate opportunities: Statistical methodology for assessing the impacts of offshore renewable developments on marine wildlife, Integrated spatio-temporal modelling for environmental data, New methods for analysis of migratory navigation

  • Dr Eilidh Jack : Lecturer

    Research student: Robin Muegge

  • Prof Duncan Lee : Professor

    Spatiotemporal modelling; Bayesian methods; environmental epidemiology and disease mapping

    Research students: George Gerogiannis, Kamol Sanittham, Michael Waltenberger, Robin Muegge, Yoana Napier, Xueqing Yin
    Postgraduate opportunities: Mapping disease risk in space and time, Estimating the effects of air pollution on human health

  • Personal Website
  • Publications
  • Dr Marnie Low : Lecturer

    Research student: Peter Radvanyi

  • Personal Website
  • Publications
  • Dr Vincent Macaulay : Reader

    Statistical genetics; population genetics; Bayesian methods; phylogenetics; Gaussian processes (GPs)

    Research student: Laura Stewart
    Postgraduate opportunities: The evolution of shape, Modelling genetic variation

  • Personal Website
  • Publications
  • Dr Benn Macdonald : Research Assistant

    Member of other research groups: Mathematical Biology
    Research student: Hanadi Alzahrani
    Supervisor: Dirk Husmeier

  • Dr Colette Mair : Lecturer

  • Prof Claire Miller (née Ferguson): Professor

    Environmental and ecological modelling; nonparametric smoothing; time series analysis; functional data analysis

    Research staff: Craig Wilkie, Jafet Belmont Osuna
    Research student: Peter Radvanyi

  • Personal Website
  • Publications
  • Dr Gary Napier : Lecturer

    Research students: Catherine Holland, Michael Waltenberger

  • Publications
  • Dr Tereza Neocleous : Lecturer

    Forensic statistics; quantile regression; semiparametric models; biostatistics applications

    Research students: Dimitra Eleftheriou, Catherine Holland

  • Personal Website
  • Publications
  • Dr Mu Niu : Lecturer

    Research student: Wenhui Zhang

  • Dr Agostino Nobile : Honorary Research Fellow

    Bayesian statistics; MCMC and other Monte Carlo methods; mixture models; discrete choice models

  • Personal Website
  • Publications
  • Dr Ruth O'Donnell : Lecturer

  • Publications
  • Dr Theo Papamarkou : Lecturer

    Research students: Benjamin Szili, Dimitra Eleftheriou

  • Dr Mihaela Paun : Research Associate

    Supervisor: Dirk Husmeier

  • Dr Surajit Ray : Senior Lecturer

    COVID research; functional data analysis; analysis of mixture models; high-dimensional data; medical image analysis; analysis of earth systems data; immunoinformatics

    Research students: Salihah Alghamdi, Yangsong Cheng, Alastair Gemmell, Bader Lafi Q Alruwaili, Wenhui Zhang, Flynn Gewirtz-O'Reilly
    Postgraduate opportunities: Modality of mixtures of distributions, Analysis of Spatially correlated functional data objects.

  • Personal Website
  • Publications
  • Prof Marian Scott OBE: Professor of Environmental Statistics

    Radiocarbon and cosmogenic dating; design and analysis of proficiency trials; environmental radioactivity; sensitivity and uncertainty analysis applied to complex environmental models; spatial and spatiotemporal modeling of water quality; flood risk modeling; environmental indicators; developing the evidence base for environmental policy and regulation

    Research staff: Jafet Belmont Osuna
    Research students: Yoana Napier, Daniela Cuba

  • Personal Website
  • Publications
  • Dr Andrew Seaton : Research Associate

    Supervisor: Janine Illian

  • Qingying Shu : Postdoctoral Research Fellow

    Supervisor: Xiaoyu Luo

  • Dr Ron Smith : Honorary Senior Research Fellow

  • Personal Website
  • Publications
  • Dr Ben Swallow : Lecturer

    Bayesian statistical inference; Markov chain Monte Carlo (MCMC) methods; data integration; model selection; stochastic processes

    Member of other research groups: Mathematical Biology
    Research students: Stephen Jun Villejo, Chenglei Hu

  • Personal Website
  • Prof Michael Titterington : Honorary Senior Research Fellow

    Statistical analysis of mixture distributions; latent structure analysis; pattern recognition; machine learning; smoothing and nonparametric statistics; optimum design of experiments

  • Personal Website
  • Publications
  • Dr Bernard Torsney : Honorary Research Fellow

    Non-parametric inference; optimisation; optimal experimental design; sampling theory; applications in economics; multiple comparisons

  • Personal Website
  • Publications
  • Dr Liberty Vittert : Mitchell Lecturer

  • Personal Website
  • Dr Vlad Vyshemirsky : Lecturer

    Research student: Lida Mavrogonatou

  • Publications
  • Dr Craig Wilkie : Research Associate

    Supervisor: Claire Miller (née Ferguson)

  • Dr Xiaochen Yang : Lecturer

    Supervised learning; distance metric learning; hyperspectral image analysis

  • Personal Website
  • Dr Wei Zhang : Lecturer

    Bayesian data analysis; ecological statistics; statistical computing

    Member of other research groups: Continuum Mechanics - Modelling and Analysis of Material Systems


Postgraduates

  • Salihah Alghamdi : PhD Student

    Research Topic: Analysis of Spatially correlated functional data objects.
    Supervisor: Surajit Ray

  • Erin Bryce : PhD Student

    Research Topic: Statistical landslide hazard modelling with a view towards medium to long term territorial planning
    Supervisors: Daniela Castro-Camilo, Janine Illian

  • Yangsong Cheng : PhD Student

    Research Topic: Computing, Inference and Applications of Hierarchical Mode Association Clustering
    Supervisor: Surajit Ray

  • Daniela Cuba : PhD Student

    Research Topic: Statistical tools to interpret soil variation
    Supervisors: Daniela Castro-Camilo, Marian Scott OBE

  • Dimitra Eleftheriou : PhD Student

    Supervisors: Tereza Neocleous, Ludger Evers, Theo Papamarkou

  • Flynn Gewirtz-O'Reilly : PhD Student

    Supervisors: Mayetri Gupta, Surajit Ray

  • Catherine Holland : PhD Student

    Research Topic: Bayesian approaches to compositional data with structural zeros
    Supervisors: Gary Napier, Tereza Neocleous

  • Chenglei Hu : PhD Student

    Research Topic: Natural hazard risk estimation using Multivariate Extreme-Value Mixture Models (MEVMM)
    Supervisors: Daniela Castro-Camilo, Ben Swallow

  • Bader Lafi Q Alruwaili : PhD Student

    Research Topic: Clustering and Cluster Inference of complex data structures
    Supervisor: Surajit Ray

  • Yinuo Liu : PhD Student

    Supervisor: Adrian Bowman

  • Lida Mavrogonatou : PhD Student

    Supervisor: Vlad Vyshemirsky

  • Robin Muegge : PhD Student

    Research Topic: Estimating the effects of air pollution on human health
    Supervisors: Nema Dean, Duncan Lee, Eilidh Jack

  • Kannat Na Bangchang : PhD Student

    Supervisors: Mayetri Gupta, Manuele Leonelli

  • Yoana Napier : MSc Student

    Supervisors: Marian Scott OBE, Duncan Lee

  • Peter Radvanyi : PhD Student

    Research Topic: Groundwater monitoring design
    Supervisors: Claire Miller (née Ferguson), Craig Alexander, Marnie Low

  • Kamol Sanittham : PhD Student

    Supervisors: Duncan Lee, Craig Anderson

  • Alison Smith : PhD Student

    Research Topic: Developing novel ways to represent spatial patterns in disease risk
    Supervisor: Craig Anderson

  • Laura Stewart : PhD Student

    Research Topic: Development and application of stochastic models of agglomeration
    Supervisors: Vincent Macaulay, Alexey Lindo

  • Benjamin Szili : PhD Student

    Supervisors: Ludger Evers, Theo Papamarkou

  • George Vazanellis : PhD Student

    Research Topic: Spatiotemporal models for environmental data
    Supervisor: Adrian Bowman

  • Stephen Jun Villejo : PhD Student

    Research Topic: A Bayesian Spatio-Temporal Model to Test for Stability of Risks for Spatially Misaligned Data
    Supervisors: Ben Swallow, Janine Illian

  • Ivona Voroneckaja : PhD Student

    Supervisor: Ludger Evers

  • Michael Waltenberger : PhD Student

    Supervisors: Duncan Lee, Craig Anderson, Gary Napier

  • Yalei Yang : PhD Student

    Supervisors: Hao Gao, Dirk Husmeier

  • Xueqing Yin : PhD Student

    Research Topic: Mapping disease risk in space and time
    Supervisors: Craig Anderson, Duncan Lee

  • Wenhui Zhang : PhD Student

    Research Topic: Analysis of Positron Emission Tomography data for tumour detection and delineation
    Supervisors: Surajit Ray, Mu Niu


Statistics and Data Analytics sample thesis topics

    Modelling genetic variation (MSc/PhD)

    Supervisors: Vincent Macaulay
    Relevant research groups: Statistics and Data Analytics

    Variation in the distribution of different DNA sequences across individuals has been shaped by many processes that can be modelled probabilistically, including demographic events such as prehistoric population movements, and natural selection. This project involves developing new techniques for teasing out information on those processes from the wealth of raw data that is now being generated by high-throughput genetic assays, and is likely to involve computationally-intensive sampling techniques to approximate the posterior distribution of parameters of interest. The characterization of the amount of population structure on different geographical scales will influence the design of experiments to identify the genetic variants that increase risk of complex diseases, such as diabetes or heart disease.

    The evolution of shape (PhD)

    Supervisors: Vincent Macaulay
    Relevant research groups: Statistics and Data Analytics

    Shapes of objects change in time. Organisms evolve and in the process change form: humans and chimpanzees derive from some common ancestor presumably different from either in shape. Designed objects are no different: an Art Deco tea pot from the 1920s might share some features with one from Ikea in 2010, but they are different. Mathematical models of evolution for certain data types, like the strings of As, Gs, Cs and Ts in our evolving DNA, are quite mature and allow us to learn about the relationships of the objects (their phylogeny or family tree), about the changes that happen to them in time (the evolutionary process) and about the ways objects were configured in the past (the ancestral states), by statistical techniques like phylogenetic analysis. Such techniques for shape data are still in their infancy. This project will develop novel statistical inference approaches (in a Bayesian context) for complex data objects, like functions, surfaces and shapes, using Gaussian-process models, with potential application in fields as diverse as language evolution, morphometrics and industrial design.

    Estimating the effects of air pollution on human health (PhD)

    Supervisors: Duncan Lee
    Relevant research groups: Statistics and Data Analytics

    Exposure to air pollution is thought to reduce average life expectancy by six months, with an estimated equivalent health cost of 19 billion each year (from DEFRA). These effects have been estimated using statistical models, which quantify the impact on human health of exposure in both the short and the long term. However, the estimation of such effects is challenging, because individual-level measures of health and pollution exposure are not available. Therefore, the majority of studies are conducted at the population level, and the resulting inference can only be made about the effects of pollution on overall population health. However, the data used in such studies are spatially misaligned, as the health data relate to extended areas such as cities or electoral wards, while the pollution concentrations are measured at individual locations. Furthermore, pollution monitors are typically located where concentrations are thought to be highest, a practice known as preferential sampling, which is likely to result in overly high measurements being recorded. This project aims to develop statistical methodology to address these problems, and thus provide less biased estimates of the effects of pollution on health than are currently produced.

    Bayesian variable selection for genetic and genomic studies (PhD)

    Supervisors: Mayetri Gupta
    Relevant research groups: Statistics and Data Analytics

    An important issue in high-dimensional regression problems is the accurate and efficient estimation of models when, compared to the number of data points, a substantially larger number of potential predictors are present. Further complications arise with correlated predictors, leading to the breakdown of standard statistical models for inference; and the uncertain definition of the outcome variable, which is often a varying composition of several different observable traits. Examples of such problems arise in many scenarios in genomics: in determining expression patterns of genes that may be responsible for a type of cancer, and in determining which genetic mutations lead to higher risks for occurrence of a disease. This project involves developing broad and improved Bayesian methodologies for efficient inference in high-dimensional regression-type problems with complex multivariate outcomes, with a focus on genetic data applications.

    The successful candidate should have a strong background in methodological and applied Statistics, expert skills in relevant statistical software or programming languages (such as R, C/C++/Python), and also have a deep interest in developing knowledge in cross-disciplinary topics in genomics. The candidate will be expected to consolidate and master an extensive range of topics in modern Statistical theory and applications during their PhD, including advanced Bayesian modelling and computation, latent variable models, machine learning, and methods for Big Data. The successful candidate will be considered for funding to cover domestic tuition fees, as well as paying a stipend at the Research Council rate for four years.

    Analysis of Spatially correlated functional data objects. (PhD)

    Supervisors: Surajit Ray
    Relevant research groups: Statistics and Data Analytics

    Historically, functional data analysis techniques have widely been used to analyze traditional time series data, albeit from a different perspective. Of late, FDA techniques are increasingly being used in domains such as environmental science, where the data are spatio-temporal in nature and hence it is typical to consider such data as functional data where the functions are correlated in time or space. An example where modeling the dependencies is crucial is in analyzing remotely sensed data observed over a number of years across the surface of the earth, where each year forms a single functional data object. One might be interested in decomposing the overall variation across space and time and attributing it to covariates of interest. Another interesting class of data with dependence structure consists of weather data on several variables collected from balloons, where the domain of the functions is a vertical strip in the atmosphere and the data are spatially correlated. One of the challenges with this type of data is the problem of missingness, to address which one needs to develop appropriate spatial smoothing techniques for spatially dependent functional data. There are also interesting design of experiment issues, as well as questions of data calibration to account for the variability in sensing instruments. In spite of the research initiative in analyzing dependent functional data, there are several unresolved problems, which the student will work on:

    • robust statistical models for incorporating temporal and spatial dependencies in functional data
    • developing reliable prediction and interpolation techniques for dependent functional data
    • developing inferential framework for testing hypotheses related to simplified dependent structures
    • analysing sparsely observed functional data by borrowing information from neighbours
    • visualisation of data summaries associated with dependent functional data
    • clustering of functional data

    Mapping disease risk in space and time (PhD)

    Supervisors: Duncan Lee
    Relevant research groups: Statistics and Data Analytics

    Disease risk varies over space and time, due to similar variation in environmental exposures such as air pollution and risk-inducing behaviours such as smoking. Modelling the spatio-temporal pattern in disease risk is known as disease mapping, and the aims are to: quantify the spatial pattern in disease risk to determine the extent of health inequalities, determine whether there has been any increase or reduction in the risk over time, identify the locations of clusters of areas at elevated risk, and quantify the impact of exposures, such as air pollution, on disease risk. I am working on all these related problems at present, and I have PhD projects in all these areas.

    Modality of mixtures of distributions (PhD)

    Supervisors: Surajit Ray
    Relevant research groups: Statistics and Data Analytics

    Finite mixtures provide a flexible and powerful tool for fitting univariate and multivariate distributions that cannot be captured by standard statistical distributions. In particular, multivariate mixtures have been widely used to perform modeling and cluster analysis of high-dimensional data in a wide range of applications. Modes of mixture densities have been used with great success for organizing mixture components into homogeneous groups, but the results are limited to normal mixtures. Beyond the clustering application, existing research in this area has provided fundamental results regarding the upper bound of the number of modes, but they too are limited to normal mixtures. In this project, we wish to explore the modality of non-normal distributions and their application to real-life problems.

    Bayesian statistical data integration of single-cell and bulk “OMICS” datasets with clinical parameters for accurate prediction of treatment outcomes in Rheumatoid Arthritis (PhD)

    Supervisors: Mayetri Gupta
    Relevant research groups: Statistics and Data Analytics

    In recent years, many different computational methods to analyse biological data have been established, covering DNA (genomics), RNA (transcriptomics), proteins (proteomics) and metabolomics, which captures more dynamic events. These methods were refined by the advent of single-cell technology, where it is now possible to capture the transcriptomics profile of single cells, and spatial arrangements of cells from flow methods or imaging methods like functional magnetic resonance imaging. At the same time, these OMICS data can be complemented with clinical data: measurements of patients, like age, smoking status, phenotype of disease or drug treatment. It is an interesting and important open statistical question how to combine data from different “modalities” (like transcriptome with clinical data or imaging data) in a statistically valid way, to compare different datasets and make justifiable statistical inferences. This PhD project will be jointly supervised with Dr. Thomas Otto and Prof. Stefan Siebert (from the Institute of Infection, Immunity & Inflammation). You will explore how to combine different datasets using Bayesian latent variable modelling, focusing on clinical datasets from Rheumatoid Arthritis.

    Funding Notes

    The successful candidate will be considered for funding to cover domestic tuition fees, as well as paying a stipend at the Research Council rate for four years.

    New methods for analysis of migratory navigation (PhD)

    Supervisors: Janine Illian
    Relevant research groups: Statistics and Data Analytics

    Joint project with Dr Urška Demšar (University of St Andrews)

    Migratory birds travel annually across vast expanses of oceans and continents to reach their destination with incredible accuracy. How they are able to do this using only locally available cues is still not fully understood. Migratory navigation consists of two processes: birds either identify the direction in which to fly (compass orientation) or the location where they are at a specific moment in time (geographic positioning). One of the possible ways they do this is to use information from the Earth’s magnetic field in the so-called geomagnetic navigation (Mouritsen 2018). While there is substantial evidence (both physiological and behavioural) that they do sense the magnetic field (Deutschlander and Beason 2014), we still do not know exactly which of the components of the field they use for orientation or positioning. We also do not understand how rapid changes in the field affect movement behaviour.

    There is a possibility that birds can sense these rapid large changes and that this may affect their navigational process. To study this, we need to link accurate data on Earth’s magnetic field with animal tracking data. This has only become possible very recently through new spatial data science advances: we developed the MagGeo tool, which links contemporaneous geomagnetic data from Swarm satellites of the European Space Agency with animal tracking data (Benitez Paez et al. 2021).

    Linking geomagnetic data to animal tracking data, however, creates a high-dimensional data set, which is difficult to explore. Typical analyses of contextual environmental information in ecology include representing contextual variables as covariates in relatively simple statistical models (Brum Bastos et al. 2021), but this is not sufficient for studying detailed navigational behaviour. This project will analyse complex spatio-temporal data using computationally efficient statistical model fitting approaches in a Bayesian context.

    This project is fully based on open data to support reproducibility and open science. We will test our new methods by annotating publicly available bird tracking data (e.g. from repositories such as Movebank.org), using the open MagGeo tool and implementing our new methods as Free and Open Source Software (R/Python).

    References

    Benitez Paez F, Brum Bastos VdS, Beggan CD, Long JA and Demšar U, 2021. Fusion of wildlife tracking and satellite geomagnetic data for the study of animal migration. Movement Ecology, 9:31. https://doi.org/10.1186/s40462-021-00268-4

    Brum Bastos VdS, Łos M, Long JA, Nelson T and Demšar U, 2021. Context-aware movement analysis in ecology: a systematic review. International Journal of Geographical Information Science. https://doi.org/10.1080/13658816.2021.1962528

    Deutschlander ME and Beason RC, 2014. Avian navigation and geographic positioning. Journal of Field Ornithology, 85(2):111–133. https://doi.org/10.1111/jofo.12055

    Scalable Bayesian Models for Inferring Evolutionary Traits of Plants (PhD)

    Supervisors: Vinny Davies
    Relevant research groups: Statistics and Data Analytics

    The functional traits and environmental preferences of plant species determine how they will react to changes resulting from global warming. The main global biodiversity repositories, such as the Global Biodiversity Information Facility (GBIF), contain hundreds of millions of records from hundreds of thousands of species in the plant kingdom alone, and the spatiotemporal data in these records can be associated with soil, climate or other environmental data from other databases. Combining these records allows us to identify environmental preferences, especially for common species where many records exist. Furthermore, in a previous PhD studentship we showed that these traits are highly evolutionarily conserved (Harris et al., 2022), so it is possible to impute the preferences for rare species where little data exists using phylogenetic inference techniques.

    The aim of this PhD project is to investigate the application of Bayesian variable selection methods to identify these evolutionarily conserved traits more effectively, and to quantify these traits and their associated uncertainty for all plant species for use in a plant ecosystem digital twin that we are developing separately to forecast the impact of climate change on biodiversity. In another PhD studentship, we previously developed similar methods for trait inference in viral evolution (Davies et al., 2017; Davies et al., 2019), but due to the scale of the data here, these methods will need to be significantly enhanced. We therefore propose a project to investigate extensions to methods for phylogenetic trait inference to handle datasets involving hundreds of millions of records in phylogenies with hundreds of thousands of tips, potentially through either sub-sampling (Quiroz et al., 2018) or modelling splitting and recombination (Nemeth & Sherlock, 2018).

    Metabolomics DIA Resolver (PhD)

    Supervisors: Vinny Davies
    Relevant research groups: Statistics and Data Analytics

    In metabolomics we take a sample (blood, urine, etc.) and put it through a mass spectrometer. The mass spectrometer scans the sample in multiple ways to help us work out what metabolites can be found in the sample. Identifying these metabolites can be useful for clinical trials, disease diagnosis and progression and various other medical applications. There are various ways of choosing the scans, but in one particular method (DIA) we often see multiple fragments from multiple metabolites in a single scan. In order to identify the metabolites we need to work out which fragments belong to which metabolites. The project will use our recently developed virtual mass spectrometer, ViMMS (Wandy et al., 2019; Wandy et al., 2022), to continue the development of our new metabolomics DIA resolver, MSdeconvolve. We will expand MSdeconvolve to work across multiple repeated samples collected in different ways and then extend it to work for completely different samples. Initially this will be done using standard statistical and machine learning methods, but we will look to extend this into a Bayesian modelling framework.

    Integrated spatio-temporal modelling for environmental data (PhD)

    Supervisors: Janine Illian
    Relevant research groups: Statistics and Data Analytics

    (Jointly supervised by Peter Henrys, CEH)

    The last decade has seen a proliferation of environmental data with vast quantities of information available from various sources. This has been due to a number of different factors including: the advent of sensor technologies; the provision of remotely sensed data from both drones and satellites; and the explosion in citizen science initiatives. These data represent a step change in the resolution of available data across space and time - sensors can be streaming data at a resolution of seconds whereas citizen science observations can be in the hundreds of thousands.  

    Over the same period, the resources available for traditional field surveys have decreased dramatically whilst logistical issues (such as access to sites) have increased. This has severely impacted the ability of field survey campaigns to collect data at high spatial and temporal resolutions. It is exactly this sort of information that is required to fit models that can quantify and predict the spread of invasive species, for example.

    Whilst we have seen an explosion of data across various sources, there is no single source that provides both the spatial and temporal intensity that may be required when fitting complex spatio-temporal models (cf. the invasive species example); each has its own advantages and benefits in terms of information content. There is therefore potentially huge benefit in bringing together data from these different sources within a consistent framework to exploit the benefits each offers and to understand processes at unprecedented resolutions/scales that would otherwise be impossible to monitor.

    Current approaches to combining data in this way are typically very bespoke and involve complex model structures that are not reusable outside of the particular application area. What is needed is an overarching generic methodological framework and associated software solutions to implement such analyses. Not only would such a framework provide the methodological basis to enable researchers to benefit from this big data revolution, but it would also provide the capability to change such analyses from being stand-alone research projects in their own right to more operational, standard analytical routines.

    Finally, such dynamic, integrated analyses could feed back into data collection initiatives to ensure optimal allocation of effort for traditional surveys or optimal power management for sensor networks. The major step change is that this optimal allocation of effort is conditional on the other data that are available. So, for example, given the coverage and intensity of the citizen science data, where should we optimally send our paid surveyors? The idea is that information is collected at times and locations that provide the greatest benefit in understanding the underpinning stochastic processes. These two major issues, integrated analyses and adaptive sampling, ensure that environmental monitoring is fit for purpose and that scientists, policymakers and industry can benefit from the big data revolution.

    This project will develop an integrated statistical modelling strategy that provides a single modelling framework for quantifying ecosystem goods and services while accounting for the fundamental differences between data streams. Data collected at different spatial resolutions can be used within the same model by projecting them into continuous space and then projecting back to the landscape level of interest. As a result, decisions can be made at the relevant spatial scale and uncertainty is propagated through, facilitating appropriate decision making.

    Evaluating probabilistic forecasts in high-dimensional settings (PhD)

    Supervisors: Jethro Browell
    Relevant research groups: Statistics and Data Analytics

    Many decisions are informed by forecasts, and almost all forecasts are uncertain to some degree. Probabilistic forecasts quantify uncertainty to help improve decision-making and are playing an important role in fields including weather forecasting, economics, energy, and public policy. Evaluating the quality of past forecasts is essential to give forecasters and forecast users confidence in their current predictions, and to compare the performance of forecasting systems.

    While the principles of probabilistic forecast evaluation have been established over the past 15 years, most notably that of “sharpness subject to calibration/reliability”, we lack a complete toolkit for applying these principles in many situations, especially those that arise in high-dimensional settings. Furthermore, forecast evaluation must be interpretable by forecast users as well as expert forecasters, and assigning value to marginal improvements in forecast quality remains a challenge in many sectors.

    This PhD will develop new statistical methods for probabilistic forecast evaluation considering some of the following issues:

    • Verifying probabilistic calibration conditional on relevant covariates
    • Skill scores for multivariate probabilistic forecasts where “ideal” performance is unknowable
    • Assigning value to marginal forecast improvement through the convolution of utility functions and Murphy diagrams
    • Development of the concepts of “anticipated verification” and “predicting the uncertainty of future forecasts”
    • Decomposing forecast misspecification (e.g. into spatial and temporal components)
    • Evaluation of Conformal Predictions

    Good knowledge of multivariate statistics is essential; prior knowledge of probabilistic forecasting and forecast evaluation would be an advantage.
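
    As a rough illustration of “sharpness subject to calibration”, the sketch below compares two Gaussian forecasts of the same simulated observations: one calibrated, one too wide. It computes probability integral transform (PIT) histograms to check calibration and the continuous ranked probability score (CRPS) to compare overall quality. The simulated data, the Gaussian forecast form and all function names are assumptions made for this example only; they are not part of the project description.

```python
import math
import numpy as np

def norm_cdf(z):
    # Standard normal CDF via the error function (element-wise).
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def norm_pdf(z):
    # Standard normal density.
    return np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)

def pit_values(y, mu, sigma):
    # PIT: the forecast CDF evaluated at the observation.
    # For a calibrated forecast these values are uniform on [0, 1].
    return norm_cdf((y - mu) / sigma)

def crps_gaussian(y, mu, sigma):
    # Closed-form CRPS of a Gaussian forecast N(mu, sigma^2); lower is better.
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * norm_cdf(z) - 1.0)
                    + 2.0 * norm_pdf(z)
                    - 1.0 / math.sqrt(math.pi))

# Hypothetical data: observations drawn from N(0, 1).
rng = np.random.default_rng(0)
n = 20000
y = rng.normal(0.0, 1.0, size=n)

# Forecast A is calibrated (sigma = 1); forecast B is too wide (sigma = 2).
pit_good = pit_values(y, 0.0, 1.0)
pit_wide = pit_values(y, 0.0, 2.0)

# Relative frequencies in 10 equal-width PIT bins.
hist_good = np.histogram(pit_good, bins=10, range=(0.0, 1.0))[0] / n
hist_wide = np.histogram(pit_wide, bins=10, range=(0.0, 1.0))[0] / n

# Mean CRPS for each forecast.
crps_good = crps_gaussian(y, 0.0, 1.0).mean()
crps_wide = crps_gaussian(y, 0.0, 2.0).mean()

print("PIT histogram (calibrated):", np.round(hist_good, 3))
print("PIT histogram (too wide):  ", np.round(hist_wide, 3))
print("Mean CRPS:", round(float(crps_good), 3), "vs", round(float(crps_wide), 3))
```

    The calibrated forecast should give a roughly flat PIT histogram, while the over-wide forecast piles PIT values up in the central bins and receives a worse (larger) mean CRPS. This covers only the univariate case; the multivariate and conditional settings listed above are where the open questions lie.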

    Adaptive Probabilistic Forecasting (PhD)

    Supervisors: Jethro Browell
    Relevant research groups: Statistics and Data Analytics

    Data-driven predictive models depend on the representativeness of the data used in model selection and estimation. However, many processes change over time, meaning that recent data are more representative than old data. In this situation, predictive models should track these changes, which is the aim of “online” or “adaptive” algorithms. Furthermore, many users of forecasts require probabilistic forecasts, which quantify uncertainty, to inform their decision-making. Existing adaptive methods such as Recursive Least Squares and the Kalman Filter have been very successful for adaptive point forecasting, but adaptive probabilistic forecasting has received little attention. This PhD will develop methods for adaptive probabilistic forecasting from a theoretical perspective and with a view to applying these methods to problems in at least one application area to be determined.

    In the context of adaptive probabilistic forecasting, this PhD may consider:

    • Online estimation of Generalised Additive Models for Location Scale and Shape
    • Online/adaptive (multivariate) time series prediction
    • Online aggregation (of experts, or hierarchies)

    A good knowledge of methods for time series analysis and regression is essential; familiarity with flexible regression (GAMs) and distributional regression (GAMLSS/quantile regression) would be an advantage.

    Statistical methodology for assessing the impacts of offshore renewable developments on marine wildlife (PhD)

    Supervisors: Janine Illian
    Relevant research groups: Statistics and Data Analytics

    (jointly supervised by Esther Jones and Adam Butler, BIOSS)

    Assessing the impacts of offshore renewable developments on marine wildlife is a critical component of the consenting process. A NERC-funded project, ECOWINGS, will provide a step-change in analysing predator-prey dynamics in the marine environment, collecting data across trophic levels against a backdrop of developing wind farms and climate change. Aerial survey and GPS data from multiple species of seabirds will be collected contemporaneously alongside prey data available over the whole water column from an automated surface vehicle and underwater drone.

    These methods of data collection will generate 3D space and time profiles of predators and prey, creating a rich source of information and enormous potential for modelling and interrogation. The data present a unique opportunity for experimental design across a dynamic and changing marine ecosystem, which is heavily influenced by local and global anthropogenic activities. However, these data have complex intrinsic spatio-temporal properties, which are challenging to analyse. Significant statistical methods development could be achieved using this system as a case study, contributing to the scientific knowledge base not only in offshore renewables but more generally in the many circumstances where patchy ecological spatio-temporal data are available. 

    This PhD project will develop spatio-temporal modelling methodology that will allow users to analyse these exciting, and complex, data sets and help inform our knowledge of the impact of offshore renewables on wildlife.