Thursday, 7 June 2012

How to root a phylogenetic tree

When teaching phylogenetics, I find that one of the things that causes a surprising amount of confusion is rooting the tree - defining the position on the tree of the (hypothetical) ancestor. Here, then, is my basic introduction to rooting a phylogenetic tree and why it is important.

Unless you are modelling recombination or horizontal transfer, a phylogenetic tree is explicitly based on a model of a single common ancestral lineage that splits over time into multiple lineages. For species, these splits are speciation events. For molecules (e.g. genes or protein sequences), these splits can also be duplication events. But how do you know where that common ancestral lineage is?

With respect to rooting a phylogenetic tree, there are three main strategies that are routinely employed. In approximate order of confidence in the ancestry, these are: (1) No rooting (leave the tree unrooted), (2) Midpoint rooting, and (3) Outgroup (not outlier!) rooting.

Unrooted. I'll start with the unrooted tree because most methods produce an unrooted tree. Note that, although phylogenetics as a general approach assumes there's a common ancestor somewhere, many of the actual methods for constructing trees from data make no direction/ancestry assumptions. (The obvious exception is UPGMA, which is not used that much for serious phylogenetics.)

Leaving the tree unrooted is also the easiest option because it essentially involves doing nothing! In the figures below, the left-hand image shows the unrooted tree. The key thing here is that there is no assumption about ancestry and therefore no statement about the direction of evolution. If you trust the approximate branch lengths, it is still possible to say things about the topology, such as that A and B are closer to each other than they are to C and D (with an important caveat covered below), but you cannot make any direct inferences about the common ancestor. Whilst leaving the tree unrooted is easier, therefore, you are limited in terms of analysis and interpretation.

This kind of representation is best used in situations where you have three or more well-defined clades (such as different members of a gene family) that radiated a long time ago over a relatively short period, such that the placement of the root and the order of branching is not known. You might still be able to say something with confidence about some of the individual radiations but you avoid making unsupported claims about the precise history and relationships of those clades. (Another option here is to "collapse" the root into a multifurcation (one-to-many split) but I will not deal with that here.)

The alternative, when the actual root placement is unknown, is to use midpoint rooting:



Midpoint Rooting. As its name suggests, midpoint rooting attempts to root the tree at its middle point. It does this by calculating all of the tip-to-tip distances and selecting the longest - A to E in the tree above. The root is then placed exactly half-way between these two tips. If the tree is well behaved and the rates of evolution are fairly constant throughout, this point should represent the ancestral point. It is therefore useful in situations where the actual root is not known but the assumption of a reasonably constant "clock-like" rate of evolution is quite sound. The other situation it generally works well for is when a tree is fairly balanced with some closely-related clades separated by a long branch in the middle - if midpoint rooting places the root far away from any nodes, it is less likely to be wrong (i.e. moving it a little due to rate discrepancies would not make any difference).
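The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five-tip tree, its internal node names (n1, n2, n3) and all of its branch lengths are made up for the example (they are not taken from the figure), and the function names are hypothetical rather than part of any phylogenetics package.

```python
from collections import defaultdict

# Hypothetical unrooted tree as (node, node, branch length) edges.
# Tips are A-E; n1-n3 are invented internal nodes.
EDGES = [
    ("A", "n1", 3.0), ("B", "n1", 1.0), ("n1", "n2", 1.0),
    ("C", "n2", 1.0), ("n2", "n3", 2.0),
    ("D", "n3", 1.0), ("E", "n3", 3.0),
]

def build_adjacency(edges):
    """Turn the edge list into {node: {neighbour: branch length}}."""
    adj = defaultdict(dict)
    for a, b, length in edges:
        adj[a][b] = length
        adj[b][a] = length
    return adj

def paths_from(adj, start):
    """Return {node: (distance from start, path from start)}."""
    found = {start: (0.0, [start])}
    stack = [start]
    while stack:
        node = stack.pop()
        dist, path = found[node]
        for nbr, length in adj[node].items():
            if nbr not in found:
                found[nbr] = (dist + length, path + [nbr])
                stack.append(nbr)
    return found

def midpoint_root(edges):
    """Find the branch holding the midpoint of the longest tip-to-tip path.

    Returns (node_a, node_b, distance of the root from node_a, path length).
    """
    adj = build_adjacency(edges)
    tips = [n for n in adj if len(adj[n]) == 1]
    longest, best_path = -1.0, None
    for tip in tips:
        for other, (dist, path) in paths_from(adj, tip).items():
            if other in tips and dist > longest:
                longest, best_path = dist, path
    half = longest / 2.0
    walked = 0.0
    # Walk along the longest path until the half-way point is reached.
    for a, b in zip(best_path, best_path[1:]):
        length = adj[a][b]
        if walked + length >= half:
            return a, b, half - walked, longest
        walked += length

print(midpoint_root(EDGES))
```

With these invented lengths the longest path is A to E (9.0 units), so the root falls 4.5 units from A, which is 0.5 units along the branch from n2 to n3. Real tree-handling libraries provide their own midpoint rooting functions and should be preferred for actual analyses.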

The main problem with midpoint rooting is that it is very susceptible to large deviations from a constant evolutionary rate, especially if these are not "balanced", i.e. they only occur on one side of the actual root. The other time midpoint rooting tends to go wrong is when it places the root in amongst a rather dense set of short branches, where quite small deviations will place the root on another branch. For these reasons, whenever possible, outgroup rooting is generally the method of choice.



Outgroup Rooting. Unlike midpoint rooting, in which features of the tree itself specify the root, in outgroup rooting existing knowledge is utilised to place the root in the right place. This is done by using an "Outgroup" - a species or molecule that is known to be more distantly related to everything else in the tree than those other members are to each other. In the example above, kangaroo is used to outgroup root the tree, as it is known that marsupial mammals diverged from the lineage leading to the placental mammals before the placental mammals diverged from one another.
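As a rough sketch of what outgroup rooting does mechanically, the Python below takes an unrooted tree, places the root on the branch that connects the outgroup to everything else, and writes the result as a Newick string. The branch lengths and the internal node names (p, r, c) are hypothetical, chosen only so that the rodent branches are longer than the others; the function name is also made up for this example. Splitting the outgroup branch in half is an arbitrary choice made here: outgroup rooting tells you which branch the root is on, not where along that branch it sits.

```python
from collections import defaultdict

# Hypothetical unrooted mammal tree: p joins the primates, r joins the
# rodents, and c is the central node where kangaroo attaches.
MAMMALS = [
    ("human", "p", 1.0), ("lemur", "p", 1.0),
    ("mouse", "r", 2.0), ("rat", "r", 2.0),
    ("p", "c", 1.0), ("r", "c", 3.0), ("kangaroo", "c", 4.0),
]

def root_on_outgroup(edges, outgroup):
    """Return a Newick string rooted on the branch leading to `outgroup`."""
    adj = defaultdict(dict)
    for a, b, length in edges:
        adj[a][b] = length
        adj[b][a] = length

    if len(adj[outgroup]) != 1:
        raise ValueError("outgroup must be a single tip in this sketch")
    # A tip has exactly one neighbour: the node it hangs from.
    (neighbour, length), = adj[outgroup].items()

    def subtree(node, parent):
        """Newick text for everything below `node`, moving away from `parent`."""
        children = [
            f"{subtree(child, node)}:{adj[node][child]}"
            for child in sorted(adj[node])
            if child != parent
        ]
        return f"({','.join(children)})" if children else node

    # Assumption: put the root half-way along the outgroup branch.
    half = length / 2.0
    return f"({outgroup}:{half},{subtree(neighbour, outgroup)}:{half});"

print(root_on_outgroup(MAMMALS, "kangaroo"))
```

The rooted output groups human, lemur, mouse and rat together on one side of the root, with kangaroo alone on the other, whatever the relative branch lengths happen to be.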

This tree emphasises the point made above about evolutionary rates - the midpoint root was wrong because the rodent lineages (mouse, rat and their ancestor) are evolving faster than the rest of the tree, probably due to relatively short generation times. (This pattern is often seen with real trees.) A problem of this kind can sometimes be spotted if deleting one of the tips used to midpoint root the tree changes the position of the root: for a perfect clock-like tree, it should make no difference (as long as you do not delete the outgroup). This will not always work, though - in the example above, it would not make a difference.

Why does the root matter? There are a couple of reasons why correct rooting is important. The first just comes down to interpretation and understanding - it would be wrong to get too obsessive about the superficial differences between the three trees in the above figure - they are all essentially the same tree (same topology) and the differences are all down to rooting. The second is more important and comes down to the direction of evolution and conclusions about relatedness. If you want to infer anything about ancestry, you obviously need to have the right root. From the midpoint rooted mammalian tree above, it might not be obvious that placental mammals form a monophyletic clade. Coming back to my earlier caveat when interpreting an unrooted tree, you might determine that the kangaroo, lemur and human were all more closely related to each other than to the mouse and rat. In terms of pure sequence divergence (on which the tree was built) this might be right but, in evolutionary terms, it is wrong: all placental mammals are equally distant from kangaroos due to the shared common ancestor.
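The difference between sequence divergence and evolutionary relatedness can be made concrete with a small Python sketch that sums branch lengths between tips. The tree and every branch length here are hypothetical (the rodent branches are simply made longer to mimic a faster rate), and the function name is made up for the example.

```python
from collections import defaultdict

# Hypothetical unrooted mammal tree with long (fast-evolving) rodent branches.
MAMMALS = [
    ("human", "p", 1.0), ("lemur", "p", 1.0),
    ("mouse", "r", 2.0), ("rat", "r", 2.0),
    ("p", "c", 1.0), ("r", "c", 3.0), ("kangaroo", "c", 4.0),
]

def tip_distances(edges, start):
    """Sum branch lengths from `start` to every other tip in the tree."""
    adj = defaultdict(dict)
    for a, b, length in edges:
        adj[a][b] = length
        adj[b][a] = length
    dist = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for nbr, length in adj[node].items():
            if nbr not in dist:
                dist[nbr] = dist[node] + length
                stack.append(nbr)
    # Keep only the tips (nodes with a single neighbour), minus the start.
    return {n: d for n, d in dist.items() if len(adj[n]) == 1 and n != start}

print(tip_distances(MAMMALS, "human"))
print(tip_distances(MAMMALS, "kangaroo"))
```

With these made-up numbers, human is 6.0 units from kangaroo but 7.0 units from mouse, so divergence alone would suggest human is "closer" to kangaroo. Likewise kangaroo is 6.0 units from the primates and 9.0 from the rodents, even though the time back to the common ancestor is the same for all four placental mammals.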

Related post: How to read a phylogenetic tree.

7 comments:

  1. Thanks, I found this very useful. I have been debating how important it is to root a tree when looking at the phylogenetics of MLST sequences. I had built trees using both the unrooted and outgroup methods. Now that I have read this, I will be using the outgroup method so that I can put the ancestry of the different groups into perspective.

  2. Thanks so much! But I am still unsure about outgroup rooting. If it is not known whether the kangaroo is distantly related, what should I do?

    1. If you don't have any way of determining what the outgroup should be then you cannot use outgroup rooting. If the tree is looking well-behaved, with a fairly constant rate of evolution and reasonably long ancestral branches, you can use midpoint rooting. Otherwise, it is probably safest to leave your tree unrooted. You should always work within the limits and uncertainties of your data - it is much better to say "I don't know", or "there is insufficient signal in the data to determine", than to force the issue.

  3. Hi there,

    With out-group rooting don't you have the same confounding problem of differing evolutionary rates as you would with mid-point rooting?

    Cheers,

    James

    1. You can do. If your phylogenetic inference method does not correctly handle rate variation, or the variation is too extreme, you will still have problems getting the right tree and the outgroup might be stuck in the wrong place. You can get a sense of this by looking at the bootstrap or likelihood values of the two descendant branches where the outgroup joins the rest of the tree. A particular problem arises when you have very divergent taxa for which the time between divergence events was very small relative to the time since divergence; there is then little signal to correctly resolve the branching order.

