JLT
A formal test of the theory of universal common ancestry
Douglas L. Theobald
Nature 465: 219–222, doi:10.1038/nature09014
Quote:
In the conclusion of On the Origin of Species, Darwin proposed that “all the organic beings which have ever lived on this earth have descended from some one primordial form”. This theory of UCA—the proposition that all extant life is genetically related—is perhaps the most fundamental premise of modern evolutionary theory, providing a unifying foundation for all life sciences. UCA is now supported by a wealth of evidence from many independent sources, including: (1) the agreement between phylogeny and biogeography; (2) the correspondence between phylogeny and the palaeontological record; (3) the existence of numerous predicted transitional fossils; (4) the hierarchical classification of morphological characteristics; (5) the marked similarities of biological structures with different functions (that is, homologies); and (6) the congruence of morphological and molecular phylogenies. Although the consilience of these classic arguments provides strong evidence for the common ancestry of higher taxa such as the chordates or metazoans, none expressly address questions such as whether bacteria, yeast and humans are all genetically related. However, the ‘universal’ in universal common ancestry is primarily supported by two further lines of evidence: various key commonalities at the molecular level (including fundamental biological polymers, nucleic acid genetic material, l-amino acids, and core metabolism) and the near universality of the genetic code. Notably, these two traditional arguments for UCA are largely qualitative, and typical presentations of the evidence do not assess quantitative measures of support for competing hypotheses, such as the probability of evolution from multiple, independent ancestors. The inference from biological similarities to evolutionary homology is a feature shared by several of the lines of evidence for common ancestry.
For instance, it is widely assumed that high sequence resemblance, often gauged by an E value from a BLAST search, indicates genetic kinship. However, a small E value directly demonstrates only that two biological sequences are more similar than would be expected by chance. [...] Sequence similarity is an empirical observation, whereas the conclusion of homology is a hypothesis proposed to explain the similarity. Statistically significant sequence similarity can arise from factors other than common ancestry, such as convergent evolution due to selection, structural constraints on sequence identity, mutation bias, chance, or artefact manufacture. For these reasons, a sceptic who rejects the common ancestry of all life might nevertheless accept that universally conserved proteins have similar sequences and are ‘homologous’ in the original pre-Darwinian sense of the term (homology here being similarity of structure due to “fidelity to archetype”). Consequently, it would be advantageous to have a method that is able to objectively quantify the support from sequence data for common-ancestry versus competing multiple-ancestry hypotheses. Here I report tests of the theory of UCA using model selection theory, without assuming that sequence similarity indicates a genealogical relationship. [...] The theory of UCA allows for the possibility of multiple independent origins of life. If life began multiple times, UCA requires a ‘bottleneck’ in evolution in which descendants of only one of the independent origins have survived exclusively until the present (and the rest have become extinct), or, multiple populations with independent, separate origins convergently gained the ability to exchange essential genetic material (in effect, to become one species). 
All of the models examined here are compatible with multiple origins in both the above schemes, and therefore the tests reported here are designed to discriminate specifically between UCA and multiple ancestry, rather than between single and multiple origins of life. Furthermore, UCA does not demand that the last universal common ancestor was a single organism, in accord with the traditional evolutionary view that common ancestors of species are groups, not individuals. Rather, the last universal common ancestor may have comprised a population of organisms with different genotypes that lived in different places at different times. The data set consists of a subset of the protein alignment data from ref. 27, containing 23 universally conserved proteins for 12 taxa from all three domains of life, including nine proteins thought to have been horizontally transferred early in evolution. The conserved proteins in this data set were identified based on significant sequence similarity using BLAST searches, and they have consequently been postulated to be orthologues. The first class of models I considered (presented in Table 1 and Fig. 1) constrains all the universally conserved proteins in a given set of taxa to evolve by the same tree, and hence these models do not account for possible horizontal gene transfer (HGT) or symbiotic fusion events during the evolution of the three domains of life. Hereafter I refer to this set of models as ‘class I’. The class I model ABE, representing universal common ancestry of all taxa in the three domains of life and shown in Fig. 1a, can be considered to represent the classic three-domain ‘tree of life’ model of evolution. Among the class I models, all criteria select the UCA tree by an extremely large margin (score differences ranging from 6,569 to 14,057), even though nearly half of the proteins in the analysis probably have evolutionary histories complicated by HGT. 
For all model selection criteria, by statistical convention a score difference of 5 or greater is viewed as very strong empirical evidence for the hypothesis with the better score (in this work higher scores are better). All scores shown are also highly statistically significant (the estimated variance for each score is approximately 2–3). According to a standard objective Bayesian interpretation of the model selection criteria, the scores are the log odds of the hypotheses. Therefore, UCA is at least 10^2,860 times more probable than the closest competing hypothesis. Notably, UCA is the most accurate and the most parsimonious hypothesis. Compared to the multiple-ancestry hypotheses, UCA provides a much better fit to the data (as seen from its higher likelihood), and it is also the least complex (as judged by the number of parameters). The extraordinary strength of these results in the face of suspected HGT events suggests that the preference for the UCA model is robust to the extent of HGT. To test this possibility, the analysis was expanded to include models that allow each protein to have a distinct, independent evolutionary history. I refer to this set of models, which rejects a single tree metaphor for genealogically related taxa, as ‘class II’. Representative class II models are shown in Fig. 2. Within each set of genealogically related taxa, each of the 23 universally conserved proteins is allowed to evolve on its own separate phylogeny, in which both branch lengths and tree topology are free parameters. [...] Overall, the model selection tests show that the class II models are greatly preferred to the class I models. For instance, the class II UCA hypothesis ([ABE]II) versus the class I UCA hypothesis (ABE) gives a highly significant LLR of 3,557, a ΔAIC of 2,633 and an LBF of 2,875.
The optimal class II models represent an upper limit to the degree of HGT, as many of the apparent reticulations are probably due to incomplete lineage sorting, hidden paralogy, recombination, or inaccuracies in the evolutionary models. Nonetheless, as with the class I non-HGT hypotheses, all model selection criteria unequivocally support a single common genetic ancestry for all taxa. Also similar to the class I models, the class II UCA model has the greatest explanatory power and is the most parsimonious. Several hypotheses have been proposed to explain the origin of eukaryotes and the early evolution of life by endosymbiotic fusion of an early archaeon and bacterium. A key commonality of these hypotheses is the rejection of a single, bifurcating tree as a proper model for the ancestry of Eukarya. For instance, in these biological hypotheses certain eukaryotic genes are derived from Archaea whereas others are derived from Bacteria. The class II models freely allow eukaryotic genes to be either archaeal-derived or bacterial-derived, as the data dictate, and hence class II hypotheses can model several endosymbiotic ‘rings’ and HGT events. [...] In all cases, these bounds show that multiple-ancestry versions of the constrained class II models are overwhelmingly rejected by the tests (model selection scores of several thousands), indicating that common ancestry is also preferred for all specific HGT and endosymbiotic fusion models. In terms of a fusion hypothesis for the origin of Eukarya, the data conclusively support a UCA model in which Eukarya share an ancestor with Bacteria and another independently with Archaea, and in which Bacteria and Archaea are also genetically related independently of Eukarya (see Table 3). [...] What property of the sequence data supports common ancestry so decisively? 
When two related taxa are separated into two trees, the strong correlations that exist between the sequences are no longer modelled, which results in a large decrease in the likelihood. Consequently, when comparing a common-ancestry model to a multiple-ancestry model, the large test scores are a direct measure of the increase in our ability to accurately predict the sequence of a genealogically related protein relative to an unrelated protein. The sequence correlations between a given clade of taxa and the rest of the tree would be eliminated if the columns in the sequence alignment for that clade were randomly shuffled. In such a case, these model-based selection tests should prefer the multiple-ancestry model. In fact, in actual tests with randomly shuffled data, the optimal estimate of the unified tree (for both maximum likelihood and Bayesian analyses) contains an extremely large internal branch separating the shuffled taxa from the rest. In all cases tried, with a wide variety of evolutionary models (from the simplest to the most parameter rich), the multiple-ancestry models for shuffled data sets are preferred by a large margin over common ancestry models (LLR on the order of a thousand), even with the large internal branches. Hence, the large test scores in favour of UCA models reflect the immense power of a tree structure, coupled with a gradual Markovian mechanism of residue substitution, to accurately and precisely explain the particular patterns of sequence correlations found among genealogically related biological macromolecules.
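To make the shuffling argument in the quote concrete, here is a toy sketch and not Theobald's actual analysis. It assumes two made-up nucleotide sequences and the simple Jukes-Cantor (JC69) substitution model, where the paper used 23 protein alignments and much richer models. The function names `jc69_llr` and `shuffle_columns` are hypothetical. The sketch compares a "one shared tree" model against a "two unrelated sequences" model, before and after the columns of one sequence are shuffled.

```python
import math
import random


def jc69_llr(seq_a: str, seq_b: str) -> float:
    """Log-likelihood ratio (in nats) of 'related' vs 'independent' under JC69.

    Related model: per-site probability is (1/4) * P_xy(d), with the distance d
    set to its maximum-likelihood value. Independent model: (1/4) * (1/4).
    At the ML distance, P_same = 1 - p and P_diff = p / 3, where p is the
    observed fraction of differing sites (valid for p < 3/4).
    """
    n = len(seq_a)
    n_diff = sum(a != b for a, b in zip(seq_a, seq_b))
    p = n_diff / n
    if p >= 0.75:
        # ML distance is infinite: the related model collapses to independence.
        return 0.0
    llr = (n - n_diff) * math.log(4 * (1 - p))
    if n_diff:
        llr += n_diff * math.log(4 * p / 3)
    return llr


def shuffle_columns(seq: str, rng: random.Random) -> str:
    """Randomly permute the alignment columns of a single sequence."""
    cols = list(seq)
    rng.shuffle(cols)
    return "".join(cols)


rng = random.Random(0)  # fixed seed so the toy data are reproducible

# Made-up data: a random sequence and a "relative" differing at every 10th site.
reference = "".join(rng.choice("ACGT") for _ in range(300))
relative = "".join(
    rng.choice([b for b in "ACGT" if b != base]) if i % 10 == 0 else base
    for i, base in enumerate(reference)
)
shuffled = shuffle_columns(relative, rng)

llr_related = jc69_llr(reference, relative)
llr_shuffled = jc69_llr(reference, shuffled)
print(f"LLR, original alignment: {llr_related:.1f}")
print(f"LLR, shuffled columns:   {llr_shuffled:.1f}")
```

Shuffling keeps the base composition of the sequence but destroys the site-by-site correlation with the other sequence, so the advantage of the shared-tree model should largely vanish. With a single sequence standing in for the "clade", shuffling its columns is the same as permuting the sequence itself.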
If IDists were actually interested in testing ID, they'd write articles like this.
ETA: Nick Matzke at PT about this article
ETA: That's 10^2,860: Therefore, UCA is at least 10^2,860 times more probable than the closest competing hypothesis.
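For anyone who wants to check the arithmetic behind that exponent, here is a rough sketch. It assumes the scores are natural-log odds (the quote only says "log odds"), and it assumes the AIC is scaled as log-likelihood minus number of parameters so that higher is better, which is a guess on my part that the excerpt does not spell out. The helper names are made up for illustration.

```python
import math

LN10 = math.log(10)


def nats_to_log10(score_nats: float) -> float:
    """Convert a natural-log odds score into a base-10 exponent."""
    return score_nats / LN10


def log10_to_nats(exponent: float) -> float:
    """Convert a base-10 exponent into the equivalent natural-log odds score."""
    return exponent * LN10


# Natural-log score that corresponds to odds of 10^2,860 (assumes nats).
print(round(log10_to_nats(2860), 1))  # 6585.4

# Class II vs class I UCA figures from the quote.
# ASSUMPTION: AIC = lnL - k, so (LLR - delta_AIC) is the extra parameter count.
llr = 3557
delta_aic = 2633
extra_params = llr - delta_aic
print(extra_params)  # 924
```

If the AIC scaling assumption holds, the class II model would be paying for roughly 924 additional free parameters and still coming out well ahead.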
-------------- "Random mutations, if they are truly random, will affect, and potentially damage, any aspect of the organism, [...] Thus, a realistic [computer] simulation [of evolution] would allow the program, OS, and hardware to be affected in a random fashion." GilDodgen, Frilly shirt owner