Principal component analysis in the space of phylogenetic trees
Tom Nye (Newcastle University)
Friday 17th March, 2017 15:00-16:00 Maths 203
Phylogenetic analysis of DNA or other data commonly gives rise to a sample of inferred evolutionary trees. Principal Component Analysis (PCA) cannot be applied directly to samples of trees, since the space of evolutionary trees on a fixed set of taxa is not a Euclidean vector space. Instead, PCA must be reformulated in the geometry of tree-space, which is a metric space with a unique geodesic between each pair of trees. The analogue of a Euclidean first principal component is a principal geodesic in tree-space, which can be estimated by minimizing the sum of squared projected distances to the data. However, the construction of higher-order principal components remained elusive for several years. In this talk I propose a solution: the k-th order principal component is the locus of the weighted Fréchet mean of k+1 points in tree-space, where the weights vary over the standard k-dimensional simplex. I will describe basic properties of these objects, in particular that locally they generically have dimension k, and propose an efficient algorithm for projection onto these surfaces. Combined with a stochastic optimization algorithm, this gives a procedure for constructing a principal component of arbitrary order in tree-space. These methods enable visualizations of slices of tree-space, revealing structure within these complex data sets.
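As a rough illustration of the "locus of the weighted Fréchet mean" idea, the sketch below works in ordinary Euclidean space rather than tree-space. The weighted Fréchet mean of points x_0, ..., x_k with weights w on the simplex is the minimizer of the sum of w_i d(x, x_i)^2. In the Euclidean case this minimizer is simply the weighted average, so the locus for k = 2 is the triangle spanned by the three points. This is not the algorithm from the talk: in tree-space the distance d would be the geodesic distance and no closed form is available. All function names here (`weighted_frechet_mean`, `simplex_grid`, and so on) are hypothetical and chosen only for this example.

```python
from typing import Iterator, List, Sequence, Tuple


def sq_dist(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Euclidean distance (stand-in for squared geodesic distance)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def frechet_objective(x, points, weights) -> float:
    """Weighted Frechet function F(x) = sum_i w_i * d(x, x_i)^2."""
    return sum(w * sq_dist(x, p) for w, p in zip(weights, points))


def weighted_frechet_mean(points, weights, steps: int = 200, lr: float = 0.25) -> List[float]:
    """Minimize F by plain gradient descent (Euclidean assumption).

    The gradient is 2 * sum_i w_i * (x - x_i). The default step size
    assumes the weights sum to 1.
    """
    dim = len(points[0])
    x = list(points[0])
    for _ in range(steps):
        grad = [2.0 * sum(w * (x[j] - p[j]) for w, p in zip(weights, points))
                for j in range(dim)]
        x = [xj - lr * gj for xj, gj in zip(x, grad)]
    return x


def simplex_grid(k: int, n: int) -> List[Tuple[float, ...]]:
    """All weight vectors on the k-simplex whose entries are multiples of 1/n."""
    def compositions(total: int, parts: int) -> Iterator[Tuple[int, ...]]:
        if parts == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest
    return [tuple(c / n for c in comp) for comp in compositions(n, k + 1)]


# Three "vertex" points define a second-order (k = 2) surface.
vertices = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]

# Sweep the weights over a grid on the 2-simplex and collect the means.
locus = [weighted_frechet_mean(vertices, w) for w in simplex_grid(2, 4)]
```

Here `locus` samples the surface traced out as the weights vary. In a non-Euclidean setting such as tree-space, only `sq_dist` and the minimization step would need to change, and the resulting surface would in general no longer be flat.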