Statistics and Data Analytics
Staff
Dr Andrej Aderhold : Research Associate
Supervisor: Dirk Husmeier
Dr Craig Alexander : Lecturer
Research student: Peter Radvanyi
Dr Linda Altieri : Environmental Research Associate
Dr Craig Anderson : Lecturer
Research students: Alison Smith, Xueqing Yin, Riham Ismail, Kamol Sanittham, Michael Waltenberger
Dr Jafet Belmont Osuna : Research Associate
Environmental statistics; species distributions modelling; spatial ecology; analysis of citizen science data; application of Bayesian methods to characterize biological communities in changing environments
Supervisors: Marian Scott OBE, Claire Miller (née Ferguson)
Dr Mitchum Bock : Lecturer
Dr Agnieszka Borowska : Research Assistant
Supervisor: Dirk Husmeier
Prof Adrian Bowman : Professor of Statistics
Research students: Yinuo Liu, George Vazanellis
Dr Daniela Castro-Camilo : Lecturer
Research students: Erin Bryce, Daniela Cuba, Chenglei Hu
Dr Christina A Cobbold : Reader
Population dynamics of ecological systems; spatial ecology; evolutionary ecology in changing environments
Member of other research groups: Mathematical Biology
Research student: Renato Andrade
Dr Nema Dean : Lecturer
Supervised and unsupervised learning; mixture models; variable selection; educational testing data; dynamic treatment regime estimation
Research students: Shuhrah Alghamdi, Riham Ismail, Sebastian Martinez Bustos, Robin Muegge, Aldawarsi Bashayr, Alastair Gemmell
Dr Amira Elayouty : Lecturer
Ludger Evers : Lecturer (part-time)
Research students: Benjamin Szili, Ivona Voroneckaja, Shuhrah Alghamdi, Dimitra Eleftheriou
Prof James Campbell Gemmell : Honorary Professor
Prof Gemmell is chief executive of the Environment Protection Agency of South Australia.
Dr Mayetri Gupta : Reader
Research students: Flynn Gewirtz-O'Reilly, Lanxin Li, Kannat Na Bangchang
Postgraduate opportunities: Bayesian statistical data integration of single-cell and bulk “OMICS” datasets with clinical parameters for accurate prediction of treatment outcomes in Rheumatoid Arthritis, Bayesian variable selection for genetic and genomic studies
Prof Dirk Husmeier : Chair of Statistics
Machine learning and Bayesian statistics applied to systems biology and bioinformatics; Bayesian networks; statistical phylogenetics
Research staff: Andrej Aderhold, Agnieszka Borowska, Alan Lazarus, Benn Macdonald, Mihaela Paun, Ionut Paun
Research students: Shaykah Aldossari, Aldawarsi Bashayr, Dalton David, Campioni Nazareno, Yalei Yang
Prof Janine Illian : Chair/Professor in Statistical Science
My work focuses on spatial point process methodology, in particular the development of modern, realistically complex spatial statistical methodology that is both computationally feasible and relevant to end-users. Throughout my career I have been enthusiastic about taking spatial point processes from the theoretical literature into the real world and about encouraging statistical development by fostering strong relationships with the user community. My work has had an impact on spatial modelling and biodiversity research in ecological studies across many species, taxa and ecosystems. I also have a keen interest in applying realistically complex spatial models in other contexts, including crime modelling, earthquake forecasting, environmental modelling, epidemiology and terrorism studies.
Research staff: Andrew Seaton
Research students: Erin Bryce, Stephen Jun Villejo
Postgraduate opportunities: Statistical methodology for assessing the impacts of offshore renewable developments on marine wildlife, Integrated spatio-temporal modelling for environmental data, New methods for analysis of migratory navigation
Dr Eilidh Jack : Lecturer
Research student: Robin Muegge
Prof Duncan Lee : Professor
Spatiotemporal modelling; Bayesian methods; environmental epidemiology and disease mapping
Research students: George Gerogiannis, Kamol Sanittham, Michael Waltenberger, Robin Muegge, Yoana Napier, Xueqing Yin
Postgraduate opportunities: Mapping disease risk in space and time, Estimating the effects of air pollution on human health
Dr Marnie Low : Lecturer
Research student: Peter Radvanyi
Dr Vincent Macaulay : Reader
Statistical genetics; population genetics; Bayesian methods; phylogenetics; GPs
Research student: Laura Stewart
Postgraduate opportunities: The evolution of shape, Modelling genetic variation
Dr Benn Macdonald : Research Assistant
Member of other research groups: Mathematical Biology
Research student: Hanadi Alzahrani
Supervisor: Dirk Husmeier
Dr Colette Mair : Lecturer
Prof Claire Miller (née Ferguson) : Professor
Environmental and ecological modelling; nonparametric smoothing; time series analysis; functional data analysis
Research staff: Craig Wilkie, Jafet Belmont Osuna
Research student: Peter Radvanyi
Dr Gary Napier : Lecturer
Research students: Catherine Holland, Michael Waltenberger
Dr Tereza Neocleous : Lecturer
Forensic statistics; quantile regression; semiparametric models; biostatistics applications
Research students: Dimitra Eleftheriou, Catherine Holland
Dr Mu Niu : Lecturer
Research student: Wenhui Zhang
Dr Agostino Nobile : Honorary Research Fellow
Bayesian statistics; MCMC and other Monte Carlo methods; mixture models; discrete choice models
Dr Ruth O'Donnell : Lecturer
Dr Theo Papamarkou : Lecturer
Research students: Benjamin Szili, Dimitra Eleftheriou
Dr Mihaela Paun : Research Associate
Supervisor: Dirk Husmeier
Dr Surajit Ray : Senior lecturer
COVID research; functional data analysis; analysis of mixture models; high-dimensional data; medical image analysis; analysis of earth systems data; immunoinformatics
Research students: Salihah Alghamdi, Yangsong Cheng, Alastair Gemmell, Bader Lafi Q Alruwaili, Wenhui Zhang, Flynn Gewirtz-O'Reilly
Postgraduate opportunities: Modality of mixtures of distributions, Analysis of Spatially correlated functional data objects.
Prof Marian Scott OBE : Professor of Environmental Statistics
Radio-carbon and cosmogenic dating-design and analysis of proficiency trials; environmental radioactivity; sensitivity and uncertainty analysis applied to complex environmental models; spatial and spatiotemporal modeling of water quality; flood risk modeling; environmental indicators; developing the evidence base for environmental policy and regulation
Research staff: Jafet Belmont Osuna
Research students: Yoana Napier, Daniela Cuba
Dr Andrew Seaton : Research Associate
Supervisor: Janine Illian
Qingying Shu : Postdoctoral Research Fellow
Supervisor: Xiaoyu Luo
Dr Ron Smith : Honorary Senior Research Fellow
Dr Ben Swallow : Lecturer
Bayesian statistical inference; Markov chain Monte Carlo (MCMC) methods; data integration; model selection; stochastic processes
Member of other research groups: Mathematical Biology
Research students: Stephen Jun Villejo, Chenglei Hu
Prof Michael Titterington : Honorary Senior Research Fellow
Statistical analysis of mixture distributions; latent structure analysis; pattern recognition; machine learning; smoothing and nonparametric statistics; optimum design of experiments
Dr Bernard Torsney : Honorary Research Fellow
Non-parametric inference; optimisation; optimal experimental design; sampling theory; applications in economics; multiple comparisons
Dr Liberty Vittert : Mitchell Lecturer
Dr Vlad Vyshemirsky : Lecturer
Research student: Lida Mavrogonatou
Dr Craig Wilkie : Research Associate
Supervisor: Claire Miller (née Ferguson)
Dr Xiaochen Yang : Lecturer
Supervised learning; distance metric learning; hyperspectral image analysis
Dr Wei Zhang : Lecturer
Bayesian data analysis; ecological statistics; statistical computing
Member of other research groups: Continuum Mechanics - Modelling and Analysis of Material Systems
Postgraduates
Salihah Alghamdi : PhD Student
Research Topic: Analysis of Spatially correlated functional data objects.
Supervisor: Surajit Ray
Erin Bryce : PhD Student
Research Topic: Statistical landslide hazard modelling with a view towards medium to long term territorial planning
Supervisors: Daniela Castro-Camilo, Janine Illian
Yangsong Cheng : PhD Student
Research Topic: Computing, Inference and Applications of Hierarchical Mode Association Clustering
Supervisor: Surajit Ray
Daniela Cuba : PhD Student
Research Topic: Statistical tools to interpret soil variation
Supervisors: Daniela Castro-Camilo, Marian Scott OBE
Dimitra Eleftheriou : PhD Student
Supervisors: Tereza Neocleous, Ludger Evers, Theo Papamarkou
Flynn Gewirtz-O'Reilly : PhD Student
Supervisors: Mayetri Gupta, Surajit Ray
Catherine Holland : PhD Student
Research Topic: Bayesian approaches to compositional data with structural zeros
Supervisors: Gary Napier, Tereza Neocleous
Chenglei Hu : PhD Student
Research Topic: Natural hazard risk estimation using Multivariate Extreme-Value Mixture Models (MEVMM)
Supervisors: Daniela Castro-Camilo, Ben Swallow
Bader Lafi Q Alruwaili : PhD Student
Research Topic: Clustering and Cluster Inference of complex data structures
Supervisor: Surajit Ray
Yinuo Liu : PhD Student
Supervisor: Adrian Bowman
Lida Mavrogonatou : PhD Student
Supervisor: Vlad Vyshemirsky
Robin Muegge : PhD Student
Research Topic: Estimating the effects of air pollution on human health
Supervisors: Nema Dean, Duncan Lee, Eilidh Jack
Kannat Na Bangchang : PhD Student
Supervisors: Mayetri Gupta, Manuele Leonelli
Yoana Napier : MSc Student
Supervisors: Marian Scott OBE, Duncan Lee
Peter Radvanyi : PhD Student
Research Topic: Groundwater monitoring design
Supervisors: Claire Miller (née Ferguson), Craig Alexander, Marnie Low
Kamol Sanittham : PhD Student
Supervisors: Duncan Lee, Craig Anderson
Alison Smith : PhD Student
Research Topic: Developing novel ways to represent spatial patterns in disease risk
Supervisor: Craig Anderson
Laura Stewart : PhD Student
Research Topic: Development and application of stochastic models of agglomeration
Supervisors: Vincent Macaulay, Alexey Lindo
Benjamin Szili : PhD Student
Supervisors: Ludger Evers, Theo Papamarkou
George Vazanellis : PhD Student
Research Topic: Spatiotemporal models for environmental data
Supervisor: Adrian Bowman
Stephen Jun Villejo : PhD Student
Research Topic: A Bayesian Spatio-Temporal Model to Test for Stability of Risks for Spatially Misaligned Data
Supervisors: Ben Swallow, Janine Illian
Ivona Voroneckaja : PhD Student
Supervisor: Ludger Evers
Michael Waltenberger : PhD Student
Supervisors: Duncan Lee, Craig Anderson, Gary Napier
Yalei Yang : PhD Student
Supervisors: Hao Gao, Dirk Husmeier
Xueqing Yin : PhD Student
Research Topic: Mapping disease risk in space and time
Supervisors: Craig Anderson, Duncan Lee
Wenhui Zhang : PhD Student
Research Topic: Analysis of Positron Emission Tomography data for tumour detection and delineation
Supervisors: Surajit Ray, Mu Niu
Statistics and Data Analytics sample thesis topics
Modelling genetic variation (MSc/PhD)
Supervisors: Vincent Macaulay
Relevant research groups: Statistics and Data Analytics
Variation in the distribution of different DNA sequences across individuals has been shaped by many processes which can be modelled probabilistically, processes such as demographic factors like prehistoric population movements, or natural selection. This project involves developing new techniques for teasing out information on those processes from the wealth of raw data that is now being generated by high-throughput genetic assays, and is likely to involve computationally-intensive sampling techniques to approximate the posterior distribution of parameters of interest. The characterization of the amount of population structure on different geographical scales will influence the design of experiments to identify the genetic variants that increase risk of complex diseases, such as diabetes or heart disease.
The evolution of shape (PhD)
Supervisors: Vincent Macaulay
Relevant research groups: Statistics and Data Analytics
Shapes of objects change in time. Organisms evolve and in the process change form: humans and chimpanzees derive from some common ancestor presumably different from either in shape. Designed objects are no different: an Art Deco tea pot from the 1920s might share some features with one from Ikea in 2010, but they are different. Mathematical models of evolution for certain data types, like the strings of As, Gs, Cs and Ts in our evolving DNA, are quite mature. Through statistical techniques like phylogenetic analysis, they allow us to learn about the relationships of the objects (their phylogeny or family tree), about the changes that happen to them in time (the evolutionary process) and about the ways objects were configured in the past (the ancestral states). Such techniques for shape data are still in their infancy. This project will develop novel statistical inference approaches (in a Bayesian context) for complex data objects, like functions, surfaces and shapes, using Gaussian-process models, with potential application in fields as diverse as language evolution, morphometrics and industrial design.
Estimating the effects of air pollution on human health (PhD)
Supervisors: Duncan Lee
Relevant research groups: Statistics and Data Analytics
Exposure to air pollution is thought to reduce average life expectancy by six months, with an estimated equivalent health cost of £19 billion each year (from DEFRA). These effects have been estimated using statistical models, which quantify the impact on human health of exposure in both the short and the long term. However, the estimation of such effects is challenging, because individual-level measures of health and pollution exposure are not available. Therefore, the majority of studies are conducted at the population level, and the resulting inference can only be made about the effects of pollution on overall population health. Moreover, the data used in such studies are spatially misaligned, as the health data relate to extended areas such as cities or electoral wards, while the pollution concentrations are measured at individual locations. Furthermore, pollution monitors are typically located where concentrations are thought to be highest, a practice known as preferential sampling, which is likely to result in overly high measurements being recorded. This project aims to develop statistical methodology to address these problems, and thus provide less biased estimates of the effects of pollution on health than are currently produced.
Bayesian variable selection for genetic and genomic studies (PhD)
Supervisors: Mayetri Gupta
Relevant research groups: Statistics and Data Analytics
An important issue in high-dimensional regression problems is the accurate and efficient estimation of models when the number of potential predictors is substantially larger than the number of data points. Further complications arise from correlated predictors, which lead to the breakdown of standard statistical models for inference, and from the uncertain definition of the outcome variable, which is often a varying composition of several different observable traits. Examples of such problems arise in many scenarios in genomics: in determining expression patterns of genes that may be responsible for a type of cancer, and in determining which genetic mutations lead to higher risks for occurrence of a disease. This project involves developing broad and improved Bayesian methodologies for efficient inference in high-dimensional regression-type problems with complex multivariate outcomes, with a focus on genetic data applications.
The successful candidate should have a strong background in methodological and applied Statistics, expert skills in relevant statistical software or programming languages (such as R, C/C++/Python), and also have a deep interest in developing knowledge in cross-disciplinary topics in genomics. The candidate will be expected to consolidate and master an extensive range of topics in modern Statistical theory and applications during their PhD, including advanced Bayesian modelling and computation, latent variable models, machine learning, and methods for Big Data. The successful candidate will be considered for funding to cover domestic tuition fees, as well as paying a stipend at the Research Council rate for four years.
Analysis of Spatially correlated functional data objects. (PhD)
Supervisors: Surajit Ray
Relevant research groups: Statistics and Data Analytics
Historically, functional data analysis (FDA) techniques have widely been used to analyze traditional time series data, albeit from a different perspective. Of late, FDA techniques are increasingly being used in domains such as environmental science, where the data are spatio-temporal in nature, and hence it is typical to consider such data as functional data where the functions are correlated in time or space. An example where modeling the dependencies is crucial is in analyzing remotely sensed data observed over a number of years across the surface of the earth, where each year forms a single functional data object. One might be interested in decomposing the overall variation across space and time and attributing it to covariates of interest. Another interesting class of data with dependence structure consists of weather data on several variables collected from balloons, where the domain of the functions is a vertical strip in the atmosphere and the data are spatially correlated. One of the challenges with this type of data is the problem of missingness, to address which one needs to develop appropriate spatial smoothing techniques for spatially dependent functional data. There are also interesting design of experiment issues, as well as questions of data calibration to account for the variability in sensing instruments. In spite of the research initiative in analyzing dependent functional data, there are several unresolved problems, which the student will work on:
- robust statistical models for incorporating temporal and spatial dependencies in functional data
- developing reliable prediction and interpolation techniques for dependent functional data
- developing inferential framework for testing hypotheses related to simplified dependent structures
- analysing sparsely observed functional data by borrowing information from neighbours
- visualisation of data summaries associated with dependent functional data
- clustering of functional data
Mapping disease risk in space and time (PhD)
Supervisors: Duncan Lee
Relevant research groups: Statistics and Data Analytics
Disease risk varies over space and time, due to similar variation in environmental exposures such as air pollution and risk inducing behaviours such as smoking. Modelling the spatio-temporal pattern in disease risk is known as disease mapping, and the aims are to: quantify the spatial pattern in disease risk to determine the extent of health inequalities, determine whether there has been any increase or reduction in the risk over time, identify the locations of clusters of areas at elevated risk, and quantify the impact of exposures, such as air pollution, on disease risk. I am working on all these related problems at present, and I have PhD projects in all these areas.
Modality of mixtures of distributions (PhD)
Supervisors: Surajit Ray
Relevant research groups: Statistics and Data Analytics
Finite mixtures provide a flexible and powerful tool for fitting univariate and multivariate distributions that cannot be captured by standard statistical distributions. In particular, multivariate mixtures have been widely used to perform modeling and cluster analysis of high-dimensional data in a wide range of applications. Modes of mixture densities have been used with great success for organizing mixture components into homogeneous groups, but the results are limited to normal mixtures. Beyond the clustering application, existing research in this area has provided fundamental results regarding the upper bound of the number of modes, but they too are limited to normal mixtures. In this project, we wish to explore the modality of non-normal distributions and their application to real-life problems.
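As a small illustration of what "modality" means for a mixture, the sketch below counts the modes of a univariate normal mixture by evaluating its density on a fine grid and counting strict local maxima. It is only a numerical sketch under stated assumptions: the function names (`normal_pdf`, `mixture_density`, `count_modes`), the grid range and the grid size are illustrative choices, not part of the project description, and a grid search is not a substitute for the analytical results mentioned above.

```python
import numpy as np


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def mixture_density(x, weights, means, sds):
    """Density of a finite mixture of univariate normals."""
    x = np.asarray(x, dtype=float)
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))


def count_modes(weights, means, sds, grid_size=20001):
    """Count strict local maxima of the mixture density on a grid.

    The grid spans four of the largest standard deviations beyond the
    extreme component means (an arbitrary, illustrative choice).
    """
    lo = min(means) - 4 * max(sds)
    hi = max(means) + 4 * max(sds)
    grid = np.linspace(lo, hi, grid_size)
    dens = mixture_density(grid, weights, means, sds)
    interior = dens[1:-1]
    is_peak = (interior > dens[:-2]) & (interior > dens[2:])
    return int(is_peak.sum())


# Two equally weighted unit-variance components: close means merge into
# one mode, well-separated means give two.
close = count_modes([0.5, 0.5], [0.0, 1.0], [1.0, 1.0])
far = count_modes([0.5, 0.5], [0.0, 4.0], [1.0, 1.0])
```

For non-normal components, only `normal_pdf` would need to be swapped for another density; whether the number of modes can still be bounded is exactly the kind of question the project raises.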
Bayesian statistical data integration of single-cell and bulk “OMICS” datasets with clinical parameters for accurate prediction of treatment outcomes in Rheumatoid Arthritis (PhD)
Supervisors: Mayetri Gupta
Relevant research groups: Statistics and Data Analytics
In recent years, many different computational methods have been established to analyse biological data, including DNA (genomics), RNA (transcriptomics), proteins (proteomics) and metabolites (metabolomics), the last of which captures more dynamic events. These methods were refined by the advent of single-cell technology, which makes it possible to capture the transcriptomic profile of single cells and the spatial arrangement of cells from flow methods or imaging methods like functional magnetic resonance imaging. At the same time, these OMICS data can be complemented with clinical data: measurements on patients such as age, smoking status, phenotype of disease or drug treatment. It is an interesting and important open statistical question how to combine data from different “modalities” (like transcriptome with clinical data or imaging data) in a statistically valid way, to compare different datasets and make justifiable statistical inferences. This PhD project will be jointly supervised with Dr. Thomas Otto and Prof. Stefan Siebert (from the Institute of Infection, Immunity & Inflammation). You will explore how to combine different datasets using Bayesian latent variable modelling, focusing on clinical datasets from Rheumatoid Arthritis.
Funding Notes
The successful candidate will be considered for funding to cover domestic tuition fees, as well as paying a stipend at the Research Council rate for four years.
New methods for analysis of migratory navigation (PhD)
Supervisors: Janine Illian
Relevant research groups: Statistics and Data Analytics
Joint project with Dr Urška Demšar (University of St Andrews)
Migratory birds travel annually across vast expanses of oceans and continents to reach their destination with incredible accuracy. How they are able to do this using only locally available cues is still not fully understood. Migratory navigation consists of two processes: birds either identify the direction in which to fly (compass orientation) or the location where they are at a specific moment in time (geographic positioning). One of the possible ways they do this is to use information from the Earth’s magnetic field, in so-called geomagnetic navigation (Mouritsen 2018). While there is substantial evidence (both physiological and behavioural) that they do sense the magnetic field (Deutschlander and Beason 2014), we still do not know exactly which components of the field they use for orientation or positioning. We also do not understand how rapid changes in the field affect movement behaviour.
There is a possibility that birds can sense these rapid large changes and that this may affect their navigational process. To study this, we need to link accurate data on Earth’s magnetic field with animal tracking data. This has only become possible very recently through new spatial data science advances: we developed the MagGeo tool, which links contemporaneous geomagnetic data from Swarm satellites of the European Space Agency with animal tracking data (Benitez Paez et al. 2021).
Linking geomagnetic data to animal tracking data, however, creates a high-dimensional data set, which is difficult to explore. Typical analyses of contextual environmental information in ecology include representing contextual variables as covariates in relatively simple statistical models (Brum Bastos et al. 2021), but this is not sufficient for studying detailed navigational behaviour. This project will analyse complex spatio-temporal data using computationally efficient statistical model fitting approaches in a Bayesian context.
This project is fully based on open data to support reproducibility and open science. We will test our new methods by annotating publicly available bird tracking data (e.g. from repositories such as Movebank.org), using the open MagGeo tool and implementing our new methods as Free and Open Source Software (R/Python).
References
Benitez Paez F, Brum Bastos VdS, Beggan CD, Long JA and Demšar U, 2021. Fusion of wildlife tracking and satellite geomagnetic data for the study of animal migration. Movement Ecology, 9:31. https://doi.org/10.1186/s40462-021-00268-4
Brum Bastos VdS, Łos M, Long JA, Nelson T and Demšar U, 2021. Context-aware movement analysis in ecology: a systematic review. International Journal of Geographical Information Science. https://doi.org/10.1080/13658816.2021.1962528
Deutschlander ME and Beason RC, 2014. Avian navigation and geographic positioning. Journal of Field Ornithology, 85(2):111–133. https://doi.org/10.1111/jofo.12055
Scalable Bayesian Models for Inferring Evolutionary Traits of Plants (PhD)
Supervisors: Vinny Davies
Relevant research groups: Statistics and Data Analytics
The functional traits and environmental preferences of plant species determine how they will react to changes resulting from global warming. The main global biodiversity repositories, such as the Global Biodiversity Information Facility (GBIF), contain hundreds of millions of records from hundreds of thousands of species in the plant kingdom alone, and the spatiotemporal data in these records can be associated with soil, climate or other environmental data from other databases. Combining these records allows us to identify environmental preferences, especially for common species where many records exist. Furthermore, in a previous PhD studentship we showed that these traits are highly evolutionarily conserved (Harris et al., 2022), so it is possible to impute the preferences for rare species where little data exists using phylogenetic inference techniques.
The aim of this PhD project is to investigate the application of Bayesian variable selection methods to identify these evolutionarily conserved traits more effectively, and to quantify these traits and their associated uncertainty for all plant species for use in a plant ecosystem digital twin that we are developing separately to forecast the impact of climate change on biodiversity. In another PhD studentship, we previously developed similar methods for trait inference in viral evolution (Davies et al., 2017; Davies et al., 2019), but due to the scale of the data here, these methods will need to be significantly enhanced. We therefore propose a project to investigate extensions to methods for phylogenetic trait inference to handle datasets involving hundreds of millions of records in phylogenies with hundreds of thousands of tips, potentially through either sub-sampling (Quiroz et al., 2018) or modelling splitting and recombination (Nemeth & Sherlock, 2018).
Metabolomics DIA Resolver (PhD)
Supervisors: Vinny Davies
Relevant research groups: Statistics and Data Analytics
In metabolomics we take a sample (blood, urine, etc.) and put it through a mass spectrometer. The mass spectrometer scans the sample in multiple ways to help us work out what metabolites can be found in the sample. Identifying these metabolites can be useful for clinical trials, disease diagnosis and progression, and various other medical applications. There are various ways of choosing the scans, but in one particular method (DIA) we often see multiple fragments from multiple metabolites in a single scan. In order to identify the metabolites we need to work out which fragments belong to which metabolites. The project will use our recently developed virtual mass spectrometer, ViMMS (Wandy et al., 2019; Wandy et al., 2022), to continue the development of our new metabolomics DIA resolver, MSdeconvolve. We will expand MSdeconvolve to work across multiple repeated samples collected in different ways and then extend it to work for completely different samples. Initially this will be done using standard statistical and machine learning methods, but we will look to extend this into a Bayesian modelling framework.
Integrated spatio-temporal modelling for environmental data (PhD)
Supervisors: Janine Illian
Relevant research groups: Statistics and Data Analytics
(Jointly supervised by Peter Henrys, CEH)
The last decade has seen a proliferation of environmental data with vast quantities of information available from various sources. This has been due to a number of different factors including: the advent of sensor technologies; the provision of remotely sensed data from both drones and satellites; and the explosion in citizen science initiatives. These data represent a step change in the resolution of available data across space and time - sensors can be streaming data at a resolution of seconds whereas citizen science observations can be in the hundreds of thousands.
Over the same period, the resources available for traditional field surveys have decreased dramatically whilst logistical issues (such as access to sites) have increased. This has severely impacted the ability of field survey campaigns to collect data at high spatial and temporal resolutions. It is exactly this sort of information that is required to fit models that can quantify and predict the spread of invasive species, for example.
Whilst we have seen an explosion of data across various sources, there is no single source that provides both the spatial and temporal intensity that may be required when fitting complex spatio-temporal models (cf. the invasive species example); each has its own advantages in terms of information content. There is therefore potentially huge benefit in bringing together data from these different sources within a consistent framework to exploit the benefits each offers and to understand processes at unprecedented resolutions/scales that would otherwise be impossible to monitor.
Current approaches to combining data in this way are typically very bespoke and involve complex model structures that are not reusable outside of the particular application area. What is needed is an overarching generic methodological framework and associated software solutions to implement such analyses. Not only would such a framework provide the methodological basis to enable researchers to benefit from this big data revolution, but also the capability to change such analyses from being stand alone research projects in their own right, to more operational, standard analytical routines.
Finally, such dynamic, integrated analyses could feed back into data collection initiatives to ensure optimal allocation of effort for traditional surveys or optimal power management for sensor networks. The major step change is that this optimal allocation of effort is conditional on other data that is available. So, for example, given the coverage and intensity of the citizen science data, where should we optimally send our paid surveyors? The idea is that information is collected at times and locations that provide the greatest benefit in understanding the underpinning stochastic processes. Addressing these two major issues, integrated analyses and adaptive sampling, ensures that environmental monitoring is fit for purpose and that scientists, policy and industry can benefit from the big data revolution.
This project will develop an integrated statistical modelling strategy that provides a single modelling framework for enabling quantification of ecosystem goods and services while accounting for the fundamental differences in different data streams. Data collected at different spatial resolutions can be used within the same model through projecting it into continuous space and projecting it back into the landscape level of interest. As a result, decisions can be made at the relevant spatial scale and uncertainty is propagated through, facilitating appropriate decision making.
Evaluating probabilistic forecasts in high-dimensional settings (PhD)
Supervisors: Jethro Browell
Relevant research groups: Statistics and Data Analytics
Many decisions are informed by forecasts, and almost all forecasts are uncertain to some degree. Probabilistic forecasts quantify uncertainty to help improve decision-making and are playing an important role in fields including weather forecasting, economics, energy, and public policy. Evaluating the quality of past forecasts is essential to give forecasters and forecast users confidence in their current predictions, and to compare the performance of forecasting systems.
While the principles of probabilistic forecast evaluation have been established over the past 15 years, most notably that of “sharpness subject to calibration/reliability”, we lack a complete toolkit for applying these principles in many situations, especially those that arise in high-dimensional settings. Furthermore, forecast evaluation must be interpretable by forecast users as well as expert forecasters, and assigning value to marginal improvements in forecast quality remains a challenge in many sectors.
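As a minimal illustration of “sharpness subject to calibration” in the simplest univariate case, the sketch below computes probability integral transform (PIT) values for Gaussian predictive distributions and compares a calibrated forecaster with an over-dispersed one. The simulated data, the function names and the choice of a 90% interval as the sharpness measure are all assumptions made for illustration; they are not part of the project description.

```python
import math
import numpy as np

def normal_cdf(x, mean, sd):
    """CDF of a normal distribution, evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def pit_values(obs, means, sds):
    """PIT: the predictive CDF evaluated at each observation."""
    return np.array([normal_cdf(y, m, s) for y, m, s in zip(obs, means, sds)])

def pit_histogram(pit, bins=10):
    """Relative frequencies of PIT values; roughly flat if calibrated."""
    counts, _ = np.histogram(pit, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum()

def mean_interval_width(sds, z=1.6449):
    """Sharpness: average width of the central 90% prediction interval."""
    return float(np.mean(2.0 * z * np.asarray(sds)))

# Simulated example (assumption): the true predictive law is N(mean, 1).
rng = np.random.default_rng(0)
n = 5000
means = rng.normal(size=n)
obs = means + rng.normal(size=n)

sd_calibrated = np.ones(n)           # matches the true spread
sd_overdispersed = 2.0 * np.ones(n)  # too wide: under-confident forecasts

hist_calibrated = pit_histogram(pit_values(obs, means, sd_calibrated))
hist_overdispersed = pit_histogram(pit_values(obs, means, sd_overdispersed))

print("calibrated PIT histogram:   ", np.round(hist_calibrated, 3))
print("overdispersed PIT histogram:", np.round(hist_overdispersed, 3))
print("90% interval widths:", mean_interval_width(sd_calibrated),
      mean_interval_width(sd_overdispersed))
```

The calibrated forecaster gives a roughly flat PIT histogram, while the over-dispersed one gives a hump-shaped histogram and wider intervals. Diagnostics of this kind do not extend directly to multivariate forecasts, which is part of the gap described above.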
This PhD will develop new statistical methods for probabilistic forecast evaluation considering some of the following issues:
- Verifying probabilistic calibration conditional on relevant covariates
- Skill scores for multivariate probabilistic forecasts where “ideal” performance is unknowable
- Assigning value to marginal forecast improvement through the convolution of utility functions and Murphy diagrams
- Development of the concepts of “anticipated verification” and “predicting the uncertainty of future forecasts”
- Decomposing forecast misspecification (e.g. into spatial and temporal components)
- Evaluation of Conformal Predictions
Good knowledge of multivariate statistics is essential; prior knowledge of probabilistic forecasting and forecast evaluation would be an advantage.
Adaptive Probabilistic Forecasting (PhD)
Supervisors: Jethro Browell
Relevant research groups: Statistics and Data Analytics
Data-driven predictive models depend on the representativeness of the data used in model selection and estimation. However, many processes change over time, meaning that recent data are more representative than old data. In this situation, predictive models should track these changes, which is the aim of “online” or “adaptive” algorithms. Furthermore, many users of forecasts require probabilistic forecasts, which quantify uncertainty, to inform their decision-making. Existing adaptive methods such as Recursive Least Squares and the Kalman Filter have been very successful for adaptive point forecasting, but adaptive probabilistic forecasting has received little attention. This PhD will develop methods for adaptive probabilistic forecasting from a theoretical perspective, with a view to applying these methods to problems in at least one application area to be determined.
In the context of adaptive probabilistic forecasting, this PhD may consider:
- Online estimation of Generalised Additive Models for Location Scale and Shape
- Online/adaptive (multivariate) time series prediction
- Online aggregation (of experts, or hierarchies)
A good knowledge of methods for time series analysis and regression is essential; familiarity with flexible regression (GAMs) and distributional regression (GAMLSS/quantile regression) would be an advantage.
Statistical methodology for assessing the impacts of offshore renewable developments on marine wildlife (PhD)
Supervisors: Janine Illian
Relevant research groups: Statistics and Data Analytics
(jointly supervised by Esther Jones and Adam Butler, BIOSS)
Assessing the impacts of offshore renewable developments on marine wildlife is a critical component of the consenting process. A NERC-funded project, ECOWINGS, will provide a step-change in analysing predator-prey dynamics in the marine environment, collecting data across trophic levels against a backdrop of developing wind farms and climate change. Aerial survey and GPS data from multiple species of seabirds will be collected contemporaneously alongside prey data available over the whole water column from an automated surface vehicle and underwater drone.
These methods of data collection will generate 3D space and time profiles of predators and prey, creating a rich source of information and enormous potential for modelling and interrogation. The data present a unique opportunity for experimental design across a dynamic and changing marine ecosystem, which is heavily influenced by local and global anthropogenic activities. However, these data have complex intrinsic spatio-temporal properties, which are challenging to analyse. Significant statistical methods development could be achieved using this system as a case study, contributing to the scientific knowledge base not only in offshore renewables but more generally in the many circumstances where patchy ecological spatio-temporal data are available.
This PhD project will develop spatio-temporal modelling methodology that will allow users to analyse these exciting - and complex - data sets and help inform our knowledge of the impact of offshore renewables on wildlife.