Sequence Alignment:

  • Answers to at least four student questions
  • 1st student presentation on a step-by-step illustration of the Needleman-Wunsch algorithm

What is in a tree?

 

Trees can be either rooted or unrooted (at least the ones calculated from molecular data :-)).

The assumption of a molecular clock is usually not justified a priori.

Therefore, trees from molecular data are usually calculated as unrooted trees (which is the reason for some of the complications with JALVIEW and other "stupid" programs). To root a tree you can either assume a molecular clock (substitutions occur at a constant rate; again, this assumption is usually not warranted), or you can use an outgroup (i.e., something that you know forms the deepest branch). For example, to root a phylogeny of birds, you could use the homologous characters from a reptile as outgroup; to root a phylogeny of alpha hemoglobins, you could use a beta hemoglobin sequence or a myoglobin sequence as outgroup.

Trees have a branching pattern (also called the topology) and branch lengths. Often the branch lengths are ignored in depicting trees (such trees are also referred to as cladograms - note that cladograms should be considered rooted*). You can swap branches attached to a node, and you can depict an unrooted tree as rooted in any branch you like, without changing the tree.
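One way to check that two drawings show the same unrooted topology is to compare the bipartitions (splits) of the taxa that each internal branch induces. Swapping branches at a node or re-rooting the drawing does not change this set. The following is a minimal sketch, assuming trees are written as nested Python tuples of taxon names; the function names are made up for this illustration and do not come from any particular program.

```python
def leaves(tree):
    """Return the set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Return the non-trivial splits (bipartitions) of the tree, read as unrooted."""
    all_taxa = leaves(tree)
    anchor = min(all_taxa)  # arbitrary but fixed reference taxon
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        other = all_taxa - clade
        # Branches leading to a single taxon (or to nothing) carry no information.
        if len(clade) > 1 and len(other) > 1:
            # Store the side WITHOUT the anchor taxon, so that each split
            # has exactly one representation regardless of the rooting.
            found.add(other if anchor in clade else clade)
        for child in node:
            visit(child)

    visit(tree)
    return found


def same_unrooted_topology(tree_a, tree_b):
    """True if both trees have the same taxa and the same set of splits."""
    return leaves(tree_a) == leaves(tree_b) and splits(tree_a) == splits(tree_b)


# The same tree drawn three ways: original, branches swapped and re-rooted,
# and rooted on the branch leading to A.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = (("E", "D"), ("C", ("B", "A")))
t3 = ("A", ("B", ("C", ("D", "E"))))
# A genuinely different topology (A groups with C instead of B).
t4 = ((("A", "C"), "B"), ("D", "E"))

print(same_unrooted_topology(t1, t2))
print(same_unrooted_topology(t1, t3))
print(same_unrooted_topology(t1, t4))
```

The sketch ignores branch lengths entirely, which matches the "topology only" view of the exercise.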

Tree exercise: Which of these trees are identical, when you consider them as unrooted and only consider the topology? here

While many trees have identical topologies, there is an enormous number of possible different tree topologies even for a rather small number of terminal taxa. An illustrative table is here.
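To get a feeling for how fast the number of topologies grows, one can use the standard counting formulas for strictly bifurcating trees with labeled tips: (2n-5)!! unrooted topologies for n >= 3 taxa and (2n-3)!! rooted topologies for n >= 2 taxa, where k!! is the product of the odd integers up to k. The sketch below simply evaluates these formulas; the function names are chosen for this illustration.

```python
def odd_double_factorial(k):
    """Product of the odd integers 1 * 3 * 5 * ... * k (1 if k <= 1)."""
    result = 1
    for i in range(3, k + 1, 2):
        result *= i
    return result


def count_unrooted_trees(n_taxa):
    """Number of unrooted, bifurcating topologies for n_taxa labeled tips."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    return odd_double_factorial(2 * n_taxa - 5)


def count_rooted_trees(n_taxa):
    """Number of rooted, bifurcating topologies for n_taxa labeled tips."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa for a rooted tree")
    return odd_double_factorial(2 * n_taxa - 3)


# Print a small table: taxa, unrooted topologies, rooted topologies.
for n in (3, 4, 5, 10, 20):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

An unrooted tree with n taxa has 2n-3 branches, and placing the root on any one of them gives a distinct rooted tree, which is why the rooted count for n taxa equals the unrooted count for n+1 taxa.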

Reminders:

Evolution of protein families:
      Homology (shared ancestry) versus  Analogy (convergent evolution)

Types of homology  (orthology, paralogy, xenology, synology)

Discussion and examples from Fitch's article (TIG 2000, see handout). See also globin trees above.

How many different groups of homologous proteins are there?
 Problems:  homology and detection of homology are two different things. 

Paradox (?): If all genes evolved through duplication and diversification from the same first self replicating RNA molecule, aren't all genes homologs?

At present there are about 500 known types of protein folds in the PDB data bank.  How many of these folds can be joined into a single class?
(See the earlier example of helicase and F1-ATPase: both form hexamers with something rotating in the middle (either the gamma subunit or the DNA; D. Crampton, pers. communication). The monomers have the same type of nucleotide binding fold (picture).)

Comparative genome analyses

Background

How can genes get duplicated (see here for a review, pdf-file is here):

  • whole genome duplication
  • partial genome duplication
  • duplication of single genes (tandem repeats)

Whole genome duplication: a frequent event in plants, also speculated to have occurred at least twice in the early evolution of vertebrates.  15% of the yeast genome is present in duplicated form; the currently accepted idea is that there was an ancient duplication followed by rearrangement and gene loss.   The idea of genome duplications in early vertebrate evolution has become very popular, but the phylogeny of regulatory proteins does not support this idea (see here and here for pro, and here for contra; see Table 2 in particular).

Parts of chromosomes get duplicated: traces of this are seen in Arabidopsis and Caenorhabditis.

Single genes get duplicated -> gene families that were originally tandemly duplicated (see the Caenorhabditis paper above)

Aside: How many different genes are necessary in an organism?

Surprisingly few - but usually many more present:
Minimum: prokaryotes 500-1000, eukaryotes 5000-10000

In prokaryotes gene duplications often (or exclusively) occur via horizontal transfer and illegitimate recombination.

Genome dot plots

Genome dot plots allow one to compare two genomes (or rather the ORFs encoded in these genomes).  In contrast to a normal dot plot, one does not move a window through the sequence; rather, one takes one ORF at a time and compares it to the other genome.
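The idea can be sketched in a few lines. The sketch assumes that the ORFs of each genome are listed in the order in which they occur along the chromosome, and that a similarity search (for example BLASTP) has already produced a list of significant ORF-to-ORF matches; the ORF names and the hit list below are invented for illustration, and the function names are hypothetical.

```python
def genome_dot_plot(orfs_a, orfs_b, hits):
    """Turn ORF-to-ORF hits into (i, j) points.

    orfs_a, orfs_b: ORF names in chromosomal order for genomes A and B.
    hits: iterable of (orf_in_a, orf_in_b) pairs judged significant.
    """
    index_a = {name: i for i, name in enumerate(orfs_a)}
    index_b = {name: j for j, name in enumerate(orfs_b)}
    return sorted(
        {(index_a[a], index_b[b]) for a, b in hits if a in index_a and b in index_b}
    )


def render(points, n_a, n_b):
    """Draw the plot as text: columns follow genome A, rows follow genome B
    (first ORF of genome B at the bottom)."""
    occupied = set(points)
    rows = []
    for j in range(n_b - 1, -1, -1):
        rows.append("".join("*" if (i, j) in occupied else "." for i in range(n_a)))
    return "\n".join(rows)


# Invented example: genome B carries an inversion of the middle three ORFs.
orfs_a = ["a1", "a2", "a3", "a4", "a5", "a6"]
orfs_b = ["b1", "b2", "b3", "b4", "b5", "b6"]
hits = [("a1", "b1"), ("a2", "b2"), ("a3", "b5"),
        ("a4", "b4"), ("a5", "b3"), ("a6", "b6")]

points = genome_dot_plot(orfs_a, orfs_b, hits)
print(render(points, len(orfs_a), len(orfs_b)))
```

In such a plot, conserved gene order shows up as a diagonal, an inversion as a stretch running perpendicular to the main diagonal, and a duplication as a second, parallel diagonal.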

Robert L. Charlebois' genome and bioinformatics site performed these and other analyses. Sadly, this site is at present not functioning.

For example, the BLASTP-based dot plot of Pyrococcus abyssi vs Pyrococcus horikoshii depicted below clearly reveals inversions and a duplication (two parallel diagonals); the latter can also be detected by comparing a genome to itself.

 

See this paper from Tillier and Collins for a discussion of this and similar patterns.

The picture below is a comparison of the yeast proteome with itself (the diagonal is removed). It clearly shows many small regions of duplication.

Only in case you don't have any questions:

HORIZONTAL GENE TRANSFER versus VERTICAL INHERITANCE

In prokaryotes, molecular phylogenies are often in conflict with each other.  One reason for this is horizontal gene transfer.  Prokaryotes do not have sex, in which whole genomes have a chance to recombine with one another; instead, they rely on the transfer of smaller fragments.  Horizontal gene transfer is postulated to have significantly shaped prokaryotic genomes.  An example of the interdomain HGT of ATPases is here.

An intriguing question is the following: 

How big a problem is HGT for reconstructing the evolutionary history of organisms? 

 

Two extreme answers:

 

A)        No problem at all

The majority consensus of conserved genes, or of genes involved in information processing, or … provides a reliable backbone; long-distance HGT provides a means to correlate evolution in different parts of the tree of life.

 

B)        HGT is it! (Forget the rest.)

HGT frequency is the signal that is seen in prokaryotic classification: 

Bacteria (purple bacteria, cyanobacteria …) appear more similar to other Bacteria (purple bacteria, cyanobacteria …) (and are therefore classified as such) because they more frequently exchange genes with other Bacteria (purple bacteria, cyanobacteria …) than with less “related” groups.  

 

For illustrations go here (at present works only with IE).

 

 

Note:
The term cladogram refers to a strictly bifurcating diagram, where each clade is defined by a common ancestor that only gives rise to members of this clade. I.e., a clade is monophyletic (derived from one ancestor) as opposed to polyphyletic (derived from many ancestors). A clade is recognized and defined by shared derived characters (= synapomorphies). Shared primitive characters (= symplesiomorphies) do not define a clade.

To use these terms you need to have polarized characters; for most molecular characters you don't know which state is primitive and which is derived (exceptions:....).

Related terms:
autapomorphy = a derived character that is only present in one group; an autapomorphic character does not tell us anything about the relationship of the group that has this character to other groups.

homoplasy = a derived character that arose twice independently (convergent evolution). Note that the characters in question might still be homologous (e.g., a position in a sequence alignment, or forelimbs turned into wings in birds and bats).

paraphyletic = a taxonomic group that is defined by a common ancestor; however, the common ancestor of this group also has descendants that do not belong to this taxonomic group. Many systematists despise paraphyletic groups (and consider them to be polyphyletic). Examples of paraphyletic groups are reptiles and protists. Many consider the archaea to be paraphyletic as well.

holophyletic = same as above, but the common ancestor gave rise only to members of the group.
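Given a rooted tree, the difference between a holophyletic group and one that leaves out some descendants can be checked mechanically: find the smallest clade that contains all members of the group and see whether it contains anything else. The sketch below assumes a rooted tree written as nested Python tuples; the deliberately simplified amniote tree and the function names are illustrative assumptions, not taken from the notes.

```python
def leaf_set(tree):
    """Return the set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def smallest_clade(tree, group):
    """Taxa of the smallest clade of a rooted tree that contains the whole group."""
    group = frozenset(group)
    if not group <= leaf_set(tree):
        raise ValueError("group contains taxa that are not in the tree")
    node = tree
    while not isinstance(node, str):
        # Descend as long as one single child still holds every group member.
        for child in node:
            if group <= leaf_set(child):
                node = child
                break
        else:
            break  # the group is spread over several children: stop here
    return leaf_set(node)


def is_holophyletic(tree, group):
    """True if the group consists of ALL descendants of one common ancestor."""
    return smallest_clade(tree, group) == frozenset(group)


# Simplified, rooted example tree (an assumption made for illustration).
amniotes = ("mammal", (("lizard", "snake"), ("crocodile", "bird")))

print(is_holophyletic(amniotes, {"crocodile", "bird"}))
print(is_holophyletic(amniotes, {"lizard", "snake", "crocodile"}))
print(sorted(smallest_clade(amniotes, {"lizard", "snake", "crocodile"})))
```

A group that fails this test is either paraphyletic or polyphyletic; telling those two apart requires a statement about the common ancestor (or about polarized characters), which the tree shape alone does not provide.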