# Dr Vinny Davies

**Lecturer** (Statistics)

## Biography

Dr Davies is a lecturer in Statistics in the School of Mathematics and Statistics, specialising in computational biology and computational methods for statistics and machine learning. He completed his Ph.D. within the School of Mathematics and Statistics, where he focused on variable selection models for selecting antigenic sites in virus evolution. He then completed several postdoctoral research positions in both the schools of Statistics and Computing Science, as well as spending time as a Biostatistician at the University of Leeds. He returned to the School of Mathematics and Statistics in 2021, where his research focuses on methods at the interface between Statistics and Machine Learning. He has a particular interest in Computational Metabolomics and Digital Twins, and a general interest in applying statistical and machine learning methods to any biological, chemical or health problem.

If you are interested in doing a Ph.D., please take a look at the Additional Information section or email me directly.

## Supervision

**Current PhD Students**

- Ross McBride - Tackling Scheduling and Uncertainty in Mass Spectrometry Fragmentation Strategies
- Emily Davison - Designing advanced statistical inference methods for learning the parameters of a mathematical biodiversity model

**Current MRes Students**

- Cara MacBride - Spatial prediction for areal unit data: Are machine learning methods or spatial statistical models best?

## Teaching

This year I am teaching Python and GLMs, as well as supervising a number of undergraduate and master's projects. Previously I have taught on the Large Scale Computing course (neural networks in TensorFlow).

## Additional information

I am looking for potential PhD students across a range of subjects, and a number of available projects are listed below. Please contact me if you wish to discuss these or any other projects further.

**Scalable Bayesian Models for Inferring Evolutionary Traits of Plants**

Supervised jointly with Richard Reeve and Claire Harris

The functional traits and environmental preferences of plant species determine how they will react to changes resulting from global warming. The main global biodiversity repositories, such as the Global Biodiversity Information Facility (GBIF), contain hundreds of millions of records from hundreds of thousands of species in the plant kingdom alone, and the spatiotemporal data in these records can be associated with soil, climate or other environmental data from other databases. Combining these records allows us to identify environmental preferences, especially for common species where many records exist. Furthermore, in a previous PhD studentship we showed that these traits are highly evolutionarily conserved (Harris et al., 2022), so it is possible to impute the preferences for rare species where little data exists using phylogenetic inference techniques.

The aim of this PhD project is to investigate the application of Bayesian variable selection methods to identify these evolutionarily conserved traits more effectively, and to quantify these traits and their associated uncertainty for all plant species for use in a plant ecosystem digital twin that we are developing separately to forecast the impact of climate change on biodiversity. In another PhD studentship, we previously developed similar methods for trait inference in viral evolution (Davies et al., 2017; Davies et al., 2019), but due to the scale of the data here, these methods will need to be significantly enhanced. We therefore propose a project to investigate extensions to methods for phylogenetic trait inference to handle datasets involving hundreds of millions of records in phylogenies with hundreds of thousands of tips, potentially through either sub-sampling (Quiroz et al., 2018) or modelling splitting and recombination (Nemeth & Sherlock, 2018).

**Multi-Objective Bayesian Optimisation for In Silico to Real Metabolomics Experiments**

Supervised jointly with Craig Alexander (note: this project can be undertaken as either an M.Res. or a Ph.D.)

Untargeted metabolomics experiments aim to identify the small molecules that make up a particular sample (e.g., blood), allowing us to identify biomarkers, discover new chemicals, or understand the metabolism (Smith et al., 2014). Data Dependent Acquisition (DDA) methods are used to collect the information needed to identify the metabolites, and various more advanced DDA methods have recently been designed to improve this process (Davies et al., 2021; McBride et al., 2023). Each of these methods, however, has parameters that must be chosen in order to maximise the amount of relevant data (metabolite spectra) that is collected. Our recent work led to the design of a Virtual Metabolomics Mass Spectrometer (ViMMS) in which we can run computer simulations of experiments and test different parameter settings (Wandy et al., 2019, 2022). Previously this has involved running a pre-determined set of parameters as part of a grid search in ViMMS, and then choosing the best parameter settings based on a single measure of performance. The proposed M.Res. (or Ph.D.) will extend this approach by using multi-objective Bayesian optimisation to adapt simulations and optimise over multiple different measurements of quality. By optimising parameters in this manner, we can help improve real experiments currently underway at the University of Glasgow and beyond.
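
To illustrate what changes when a grid search is judged on more than one measure of performance, here is a minimal Python sketch. It is not the ViMMS API and it is not Bayesian optimisation itself: `toy_simulator`, the parameter names `top_n` and `exclusion_time`, and the two scores are hypothetical stand-ins. With several objectives there is usually no single best setting, so the sketch keeps every setting that no other setting beats on all objectives at once (the Pareto front).

```python
from itertools import product


def dominates(a, b):
    """True if score tuple a is no worse than b on every objective
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(results):
    """Keep the (params, scores) pairs whose scores no other pair dominates."""
    return [
        (params, scores)
        for params, scores in results
        if not any(dominates(other, scores) for _, other in results)
    ]


def toy_simulator(top_n, exclusion_time):
    # Hypothetical stand-in for one simulated experiment.
    # Returns two made-up quality measures: (coverage, intensity).
    coverage = 1.0 - 1.0 / (1 + top_n * exclusion_time / 50.0)
    intensity = 1.0 / (1 + top_n / 10.0)
    return (round(coverage, 3), round(intensity, 3))


# Pre-determined grid of parameter settings, as in a plain grid search
grid = list(product([5, 10, 20], [15, 30]))
results = [((n, t), toy_simulator(n, t)) for n, t in grid]

# Settings that represent the best available trade-offs
front = pareto_front(results)
for params, scores in front:
    print(params, scores)
```

A multi-objective Bayesian optimiser would replace the fixed grid with a model that proposes the next setting to simulate, but it would still report a set of trade-off settings of this kind.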

**Estimating false discovery rates in metabolite identification using generative AI**

Supervised jointly with Andrew Elliott and Justin J.J. van der Hooft (Wageningen University)

Metabolomics is the field of study that aims to map all molecules that are part of an organism, which can help us understand its metabolism and how it can be affected by disease, stress, age, or other factors. During metabolomics experiments, mass spectra of the metabolites are collected and then annotated by comparison against spectral databases such as METLIN (Smith et al., 2005) or GNPS (Wang et al., 2016). Generally, however, these spectral databases do not contain the mass spectra of a large proportion of metabolites, so the best matching spectrum from the database is not always the correct identification. Matches can be scored using cosine similarity, or more advanced methods such as Spec2Vec (Huber et al., 2021), but these scores do not provide any statement about the statistical accuracy of the match. Creating decoy spectral libraries, specifically a large database of fake spectra, is one potential way of estimating False Discovery Rates (FDRs), allowing us to quantify the probability of a spectrum match being correct (Scheubert et al., 2017). However, these methods are not widely used, suggesting there is significant scope to improve their performance and ease of use. In this project, we will use the code framework from our recently developed Virtual Metabolomics Mass Spectrometer (ViMMS) (Wandy et al., 2019, 2022) to systematically evaluate existing methods and identify possible improvements. We will then explore how we can use generative AI, e.g., Generative Adversarial Networks or Variational Autoencoders, to train a deep neural network that can create more realistic decoy spectra, and thus improve our estimation of FDRs.
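
The two ideas in this description, cosine scoring and decoy-based FDR estimation, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: spectra are represented as hypothetical `{binned m/z: intensity}` dictionaries, and the FDR estimate is the simple target-decoy ratio for a decoy library of the same size as the real one, which may differ from the estimators evaluated in the cited work.

```python
import math


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two spectra given as
    {binned m/z: intensity} dictionaries (an assumed representation)."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def estimate_fdr(target_scores, decoy_scores, threshold):
    """Simple target-decoy estimate: the number of decoy matches scoring
    at or above the threshold, divided by the number of target matches
    at or above it. Assumes target and decoy libraries of equal size."""
    n_target = sum(s >= threshold for s in target_scores)
    n_decoy = sum(s >= threshold for s in decoy_scores)
    if n_target == 0:
        return 0.0
    return min(1.0, n_decoy / n_target)


# Toy query and library spectra (made-up peaks)
query = {100: 1.0, 150: 0.5, 200: 2.0}
library_hit = {100: 0.9, 150: 0.4, 200: 2.1}
print(cosine_similarity(query, library_hit))

# Best-match scores of each query against the real and decoy libraries
target_scores = [0.9, 0.8, 0.7, 0.4]
decoy_scores = [0.75, 0.3, 0.2, 0.1]
print(estimate_fdr(target_scores, decoy_scores, threshold=0.7))
```

The estimate is only as good as the decoys: if fake spectra are easy to tell apart from real ones, few decoys score highly and the FDR is underestimated, which is the motivation for generating more realistic decoys.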

**Generating deep fake left ventricles: a step towards personalised heart treatments**

Supervised jointly with Andrew Elliott and Hao Gao

Personalised medicine is an exciting avenue in the field of cardiac healthcare where an understanding of patient-specific mechanisms can lead to improved treatments (Gao et al., 2017). The use of mathematical models to link the underlying properties of the heart with cardiac imaging offers the possibility of obtaining important parameters of heart function non-invasively (Gao et al., 2015). Unfortunately, current estimation methods rely on complex mathematical forward simulations, resulting in a solution taking hours, a time frame not suitable for real-time treatment decisions. To increase the applicability of these methods, statistical emulation methods have been proposed as an efficient way of estimating the parameters (Davies et al., 2019; Noè et al., 2019). In this approach, simulations of the mathematical model are run in advance and then machine learning based methods are used to estimate the relationship between the cardiac imaging and the parameters of interest. These methods are, however, limited by our ability to understand how cardiac geometry varies across patients, which is in turn limited by the amount of data available (Romaszko et al., 2019). In this project we will look at AI based methods for generating fake cardiac geometries which can be used to increase the amount of data (Qiao et al., 2023). We will explore different types of AI generation, including Generative Adversarial Networks and Variational Autoencoders, to understand how we can generate better 3D and 4D models of the fake left ventricles and create an improved emulation strategy that can make use of them.