Latent variable modelling for massive and complex datasets

Massive advances in genomic sequencing-based and imaging technologies over the last two decades have created the potential for novel biological discoveries at extremely high resolution, but have also raised challenging problems in how to analyse the resulting datasets sensibly and accurately. These data are typically of large dimension, which creates computational obstacles; they are subject to various technical artefacts; and their distributions exhibit complex features, such as long-range correlations, non-ellipsoidal shapes, skewness and multimodality, that make successful inference with classical statistical models difficult. We have been developing latent variable-based Bayesian hierarchical modelling approaches for clustering complex datasets, which lead to efficient and powerful computational methods for inference and biological discovery. One example is the clustering of high-volume genotyping data: finding subgroups with common features is often a necessary first step towards the downstream goal of detecting genetic variants associated with specific health outcomes.
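
To make the idea of latent variable clustering concrete, the following is a minimal sketch. It is not the Bayesian hierarchical approach described above: as a simpler stand-in, it fits a two-component, one-dimensional Gaussian mixture by expectation-maximisation (EM), where the latent variable is each observation's unobserved cluster label. The function names, the simulated data and all parameter values are assumptions made purely for illustration.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_component_mixture(data, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture by EM (illustrative only).

    The latent variable is each point's unobserved cluster label; the
    E-step replaces it with a posterior membership probability.
    """
    # Crude initialisation (an assumption of this sketch): start the
    # component means at the smallest and largest observations.
    means = [min(data), max(data)]
    sds = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []

    for _ in range(n_iter):
        # E-step: posterior probability of each latent label given the data.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = p[0] + p[1]
            if total == 0.0:
                # Numerical underflow: fall back to equal membership.
                resp.append([0.5, 0.5])
            else:
                resp.append([p[0] / total, p[1] / total])

        # M-step: re-estimate each component from its weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp) + 1e-12  # guard against division by zero
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # floor to stop a component collapsing

    # Hard cluster assignment: the most probable latent label for each point.
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return means, sds, weights, labels


# Simulated data (hypothetical): two well-separated subgroups of 200 points each.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)]
data += [rng.gauss(6.0, 1.0) for _ in range(200)]

means, sds, weights, labels = fit_two_component_mixture(data)
```

In a fully Bayesian hierarchical treatment, the point estimates above would be replaced by prior distributions over the component parameters and posterior inference over both parameters and labels. Data with skewness or non-ellipsoidal shapes would also call for more flexible component distributions than the Gaussian used here.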