EXTERNAL: Bayesian model-based clustering in high dimensions
Paul Kirk (MRC Biostatistics Unit)
Thursday 19th November 2020, 12:00-13:00, Zoom
Although the challenges that high-dimensional data pose for regression are well known and the subject of much current research, comparatively little work has addressed these challenges in the context of clustering. In this setting, the key challenge is that often only a small subset of the features provides a relevant stratification of the population. Identifying relevant strata can be particularly difficult in high-dimensional datasets, in which there may be many features that provide no information whatsoever about population structure, or, perhaps worse, in which there may be (potentially large) feature subsets that define irrelevant stratifications. For example, when dealing with genetic data, some genetic variants may allow us to group patients in terms of disease risk, while others would provide completely irrelevant stratifications (e.g. grouping patients together on the basis of eye or hair colour). Bayesian profile regression is an outcome-guided, model-based clustering approach that uses a response variable to guide the clustering toward relevant stratifications. Here we consider how this approach can be extended to the “multiview” setting, in which different groups of features (“views”) define different stratifications. We present some results in the context of breast cancer subtyping to illustrate how the approach can be used to perform integrative clustering of multiple ’omics datasets.