Mathematical and Simulation-Based Analysis of the Behavior of Admixed Taxa in the Neighbor-Joining Algorithm
Kim J, Disanto F, Kopelman NM, Rosenberg NA (2019). Bulletin of Mathematical Biology 81(2):452-493.

Abstract

The neighbor-joining algorithm for phylogenetic inference (NJ) has been seen to have three specific properties when applied to distance matrices that contain an admixed taxon: (1) antecedence of clustering, in which the admixed taxon agglomerates with one of its source taxa before the two source taxa agglomerate with each other; (2) intermediacy of distances, in which the distance on an inferred NJ tree between an admixed taxon and either of its source taxa is smaller than the distance between the two source taxa; and (3) intermediacy of path lengths, in which the number of edges separating the admixed taxon and either of its source taxa is less than or equal to the number of edges between the source taxa. We examine the behavior of neighbor-joining on distance matrices containing an admixed group, investigating the occurrence of antecedence of clustering, intermediacy of distances, and intermediacy of path lengths. We first mathematically predict the frequency with which the properties are satisfied for a labeled unrooted binary tree selected uniformly at random in the absence of admixture. We then introduce a taxon constructed by a linear admixture of distances from two source taxa, examining three admixture scenarios by simulation: a model in which distance matrices are chosen at random, a model in which an admixed taxon is added to a set of taxa that reflect treelike evolution, and a model that introduces a perturbation of the treelike scenario. In contrast to previous conjectures, we observe that the three properties are sometimes violated by distance matrices that include an admixed taxon. However, we also find that they are satisfied more often than is expected by chance when the distance matrix contains an admixed taxon, especially when evolution among the non-admixed taxa is treelike. The results contribute to a deeper understanding of the nature of evolutionary trees constructed from data that do not necessarily reflect a treelike evolutionary process.
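
Illustrative sketch (not the authors' code). The abstract describes a taxon "constructed by a linear admixture of distances from two source taxa". One plausible reading, assumed here, is that each distance from the admixed taxon is a weighted average of the corresponding distances from the two sources, with a hypothetical admixture fraction alpha. The function name add_admixed_taxon, the parameter names, and the choice to apply the same weighted average to the distances between the admixed taxon and the sources themselves are all assumptions made for illustration. The sketch only builds the augmented distance matrix; it does not run neighbor-joining.

```python
def add_admixed_taxon(dist, a, b, alpha):
    """Return a copy of `dist` with one admixed taxon appended as the last row/column.

    dist  : symmetric list-of-lists distance matrix with a zero diagonal
    a, b  : indices of the two source taxa
    alpha : assumed admixture fraction contributed by source `a` (0 <= alpha <= 1)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if a == b:
        raise ValueError("the two source taxa must differ")

    n = len(dist)
    # Assumed linear admixture: the distance from the new taxon to taxon k is
    # alpha * d(a, k) + (1 - alpha) * d(b, k). Because d(a, a) = d(b, b) = 0,
    # this gives (1 - alpha) * d(a, b) to source a and alpha * d(a, b) to source b.
    row = [alpha * dist[a][k] + (1.0 - alpha) * dist[b][k] for k in range(n)]

    # Append the new column to every existing row, then append the new row.
    augmented = [list(r) + [row[i]] for i, r in enumerate(dist)]
    augmented.append(row + [0.0])
    return augmented


if __name__ == "__main__":
    # Toy 3-taxon matrix; taxa 0 and 1 act as the sources.
    d = [
        [0.0, 4.0, 6.0],
        [4.0, 0.0, 8.0],
        [6.0, 8.0, 0.0],
    ]
    for r in add_admixed_taxon(d, a=0, b=1, alpha=0.25):
        print(r)
```

Under this assumed construction, the input distances from the admixed taxon to its two sources sum to the distance between the sources, so the admixed taxon is intermediate in the input matrix by design. Whether the same holds on the inferred NJ tree is the question the paper investigates, and it would need an NJ implementation to check.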