Addendum to Wednesday's class:

An easy way to stay up to date is to use services (agents) that search the web for new and interesting publications or sequences.  Many companies offer this; some that are available to everyone are:

Also: while Medline is incorporating more and more non-medical literature, there are still gaps in its coverage.  Alternatives are Spaceline and other databanks available through the National Library of Medicine (here) and the local services offered at the UConn libraries. Homer (where you can sometimes select different databanks), and especially Current Contents and Agricola, nicely complement PubMed.  The best way to access them is through “SilverPlatter” (go to the library homepage, select databases, then science, and click on any of the ones that mention Current Contents or Science Citation Index).

 

 

Why might you care about molecular evolution? 
Molecular Evolution in the News

Molecular epidemiology -- pathogen tracking. To an epidemiologist studying infectious diseases, it is very useful to know how or where a person became infected with the disease. This information is perhaps the most basic fact we can use in preventing the further spread of a disease. For over a decade now, epidemiologists have been using DNA sequences of viruses to make phylogenetic trees and thereby track the sources of infections. Some of these examples are spectacular.

1) Law: A case of intentional HIV injection? In a highly publicized case in Lafayette, Louisiana in 1998, a woman claimed that her ex-lover (a physician) deliberately injected her with HIV-tainted blood (HIV is the virus that causes AIDS). She did not know whose tainted blood it was nor did she realize she had been injected with blood until she became sick with viral infections months later. Records showed that the physician had indeed drawn blood from an HIV+ patient on the day she was injected. There were no records of her injection and no witnesses. So how could her story be tested?

Evolutionary trees provide the best scientific evidence in a case like this. HIV picks up mutations very fast, even within a single individual. If one person gives the virus to another, there are few differences between the virus in the donor and the virus in the recipient. As the virus goes from person to person, it keeps changing and gets more and more different over time. Thus, the HIV sequences in two individuals who got the virus from two different people will be very different. Thus, if the woman's story were true, her virus should be very similar to the virus in the person whose blood was drawn but should be very different from viruses taken from other people in Lafayette. That was exactly what the evolutionary trees showed; her virus appeared to have come from the patient's virus but was unlike the virus taken from other people in town. Since there was no way to explain how she would have gotten THAT patient's virus on her own, the evolutionary analysis supported her story. (Incidentally, this case was the first use of phylogenetics in U.S. criminal court.)

2) Did a Florida dentist with AIDS transmit the virus to his patients?
Kimberly Bergalis made national headlines and testified in congressional hearings as a heterosexual young woman who got AIDS. The only known potential source of her virus was her dentist, and over half a dozen of his other patients also had the disease. In this case, the initial evidence implicating the dentist was merely the statistical association of several people with AIDS whose only known exposure was the dentist. Again, evolutionary trees were created to see if the patients' viruses appeared to have descended from the dentist virus. The dentist virus did appear to be closely related to many of the patient viruses, as if it was the source. However, two patients appeared to have gotten their virus elsewhere, and those two patients were the only two infected patients with other risk factors. So again, the evolutionary analysis provided a critical means of understanding HIV transmission. (see next box)

3) Other cases. Evolutionary trees have been used in many other cases of infectious disease transmission. They were used to identify deer mice as the source of hantavirus infections in the Four-Corners area in the early 1990s. They are routinely used to determine the source of rabies viruses in human cases, and they led to the discovery of a case in which rabies virus took at least 7 years to kill a person (a length of time far in excess of anything known previously). And trees have been used to determine whether recent cases of polio in North America were relict strains from the New World, were vaccine strains, or were introduced from Asia.

(From http://www.indiana.edu/~ensiweb/pap.apld.html, Applied Evolution: Technology for the 21st Century, James Bull, PhD, University of Texas at Austin, For the Symposium Presented, by the Society for the Study of Evolution, "Building the Web of Life: Evolution in Action" NABT Ft. Worth, 10/99)

Confidence measures are important.

Hillis, D. M., and J. P. Huelsenbeck.  1994.  Support for dental HIV transmission.  Nature 369:24-25.

SIR -- On the basis of a phylogenetic analysis of HIV sequences, Ou et al. concluded that a Florida dentist infected five of his eight known HIV-1 seropositive patients.  These authors used bootstrap resampling to test the reliability of their finding and found that the HIV sequences from the dentist and infected patients formed a monophyletic group in 79% of the replicates in parsimony analysis.  DeBry et al. in Scientific Correspondence questioned the conclusion of dental transmission, however, because a bootstrap analysis (based on threshold parsimony) of independently sequenced HIV variants clustered only one of the patient sequences with a dental sequence in the majority-rule consensus tree.  DeBry et al. concluded that their analyses "...show that the available data are consistent with both the dental transmission hypothesis and the null hypothesis (the patients were independently infected from the local community) and do not distinguish between the two."  But both studies used an analysis of the bootstrap results that may not be the most appropriate method for this case.  We have reanalysed the two datasets, as well as sequences from new patients and new local controls, and find strong support for trees consistent with HIV transmission between the dentist and six of ten of his seropositive patients.

News piece from UCIrvine

 

 

A talk by Walter Fitch (slides and sound) is here

Professor Walter M. Fitch and assistant research biologist Robin M. Bush of UCI's Department of Ecology and Evolutionary Biology, working with researchers at the Centers for Disease Control and Prevention, studied the evolution of a prevalent form of the influenza A virus during an 11-year period from 1986 to 1997. They discovered that viruses having mutations in certain parts of an important viral surface protein were more likely than other strains to spawn future influenza lineages. Human susceptibility to infection depends on immunity gained during past bouts of influenza; thus, new viral mutations are required for new epidemics to occur. Knowing which currently circulating mutant strains are more likely to have successful offspring potentially may help in vaccine strain selection. The researchers' findings appear in the Dec. 3 issue of Science magazine.

……

Fitch and his fellow researchers followed the evolutionary pattern of the influenza virus, one that involves a never-ending battle between the virus and its host. The human body fights the invading virus by making antibodies against it. The antibodies recognize the shape of proteins on the viral surface. Previous infections only prepare the body to fight viruses with recognizable shapes. Thus, only those viruses that have undergone mutations that change their shape can cause disease. Over time, new strains of the virus continually emerge, spread and produce offspring lineages that undergo further mutations. This process is called antigenic drift. "The cycle goes on and on—new antibodies, new mutants," Fitch said.

The research into the virus' genetic data focused on the evolution of the hemagglutinin gene—the gene that codes for the major influenza surface protein. Fitch and fellow researchers constructed "family trees" for viral strains from 11 consecutive flu seasons. Each branch on the tree represents a new mutant strain of the virus. They found that the viral strains undergoing the greatest number of amino acid changes in specified positions of the hemagglutinin gene were most closely related to future influenza lineages in nine of the 11 flu seasons tested.

By studying the family trees of various flu strains, Fitch said, researchers can attempt to predict the evolution of an influenza virus and thus potentially aid in the development of more effective influenza vaccines.

The research team is currently expanding its work to include all three groups of circulating influenza viruses, hoping that contrasting their evolutionary strategies may lend more insight into the evolution of influenza.

Along with Fitch and Bush, Catherine A. Bender, Kanta Subbarao and Nancy J. Cox of the Centers for Disease Control and Prevention participated in the study.

 

First principles do not work to determine protein function

See here

Homology determination and phylogenetic reconstruction are currently the main pillars of bioinformatics.

Experience shows that
if two sequences show significant similarity in their primary sequence, they have shared ancestry, and probably similar function (although some proteins acquired radically new functional assignments, lysozyme -> lens crystallin).
To date there are no examples where convergent evolution has led to significant similarity of the primary sequence (there are convergent trends in lysozymes, see Caro-Beth Stewart's work, but these are convergent trends within one group of homologs).

Why phylogenetic reconstruction of molecular evolution?

A.   Systematic classification of organisms

e.g.:   Who were the first angiosperms? (i.e. where are the first angiosperms located relative to present day angiosperms?)
        Where in the tree of life is the last common ancestor located?

B.   Evolution of molecules

e.g.: domain shuffling, reassignment of function, gene duplications, horizontal gene transfer

By which organisms and from what precursors was eukaryotic DNA packaging invented? 

What are the origins of the eukaryotic cytoskeleton?

What is the function of a protein encoded by an uncharacterized DNA sequence?

C.   Mechanisms of evolution

e.g.: What is the role of horizontal gene transfer in microbial evolution?

Are speciation events correlated with or caused by major genome rearrangements? 

Did duplications of whole genomes provide the material that allowed more complex developmental pathways to evolve?

How:

1) Obtain sequences

          Sequencing

          Databank Searches -> ncbi a) entrez, b) BLAST, c) blast of pre-release data

          Friends

2) Determine homology (in practice you can only determine the non-randomness of a match, but at present most believe that this is a sufficient demonstration of homology)

Definition of Homology:

Two sequences are homologous if there was a molecule in the past that is an ancestor of both sequences

Types of homology:

          Orthology:   bifurcation in molecular tree reflects speciation
          Paralogy:     bifurcation in molecular tree reflects gene duplication
          Xenology:    gene was obtained by organism through horizontal transfer
          Synology:    genes ended up in one organism through fusion of lineages.

Orthologues: bifurcation in molecular tree reflects speciation
– these are the molecules people interested in the taxonomic classification of organisms want to study

Paralogues: bifurcation in molecular tree reflects gene duplication
          The study of paralogues and their distribution in genomes provides clues on the way genomes evolved.  Gene and genome duplication have emerged as the most important pathway to molecular innovation, including the evolution of developmental pathways.

Xenologues: gene was obtained by organism through horizontal transfer.  The classic examples of xenologues are antibiotic resistance genes, but the history of many other molecules also fits into this category:
inteins, self-splicing introns, transposable elements, ion pumps, other transporters

Synologues: genes ended up in one organism through fusion of lineages.  The paradigm cases are genes that were transferred into the eukaryotic cell together with the endosymbionts that evolved into mitochondria and plastids

 

3) Align sequences

(most algorithms used for phylogenetic reconstruction require a global (multiple, as opposed to pairwise) alignment; an exception is
statalign from Thorne JL, and Kishino H, 1992, Freeing phylogenies from artifacts of
alignment. Mol Biol Evol 9:1148-1162)

a.     algorithms doing a global alignment: clustalw 1.7, or pile_up (GCG)         

b.     local alignments (MACAW)

4) Reconstruct evolutionary history

a.    Distance Analyses

      1. calculate pairwise distances        
        (different distance measures, correction for multiple hits, correction for codon bias)
      2. make distance matrix (table of pairwise corrected distances)
      3. calculate tree from distance matrix
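Steps 1 and 2 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sequences are made-up, already aligned nucleotide sequences, the distance measure is the simple proportion of differing sites, and the correction for multiple hits is the Jukes-Cantor formula (one of many possible corrections; no codon-bias correction is attempted). The function names are hypothetical.

```python
import math

def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences (gapped sites skipped)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def jukes_cantor(p):
    """Jukes-Cantor correction for multiple hits (nucleotides only)."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        return float("inf")  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(seqs):
    """Table of pairwise corrected distances, as a dict of dicts."""
    names = list(seqs)
    return {a: {b: jukes_cantor(p_distance(seqs[a], seqs[b])) for b in names}
            for a in names}

# Hypothetical aligned toy sequences, not real data
toy = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ACGAACGTTT",
}
matrix = distance_matrix(toy)
for name in toy:
    print(name, ["%.3f" % matrix[name][other] for other in toy])
```

Note that the corrected distance is always at least as large as the uncorrected one; the difference grows as the sequences become more divergent.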

       

b.    Parsimony Analyses

find that tree that explains sequence data with minimum number of substitutions

(the tree includes a hypothesis of the sequence at each of the nodes (acctrans, deltrans))
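To make the parsimony criterion concrete, here is a minimal sketch that counts the minimum number of substitutions a given tree requires, using Fitch's algorithm on each alignment column. The tree encoding (nested pairs of leaf names), the function names, and the toy alignment are assumptions made for this illustration. The state sets computed at the internal nodes correspond to the hypotheses of the ancestral sequences; resolving ambiguous sets (what acctrans and deltrans do) is not implemented here.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum substitutions) for one column.

    tree:   a leaf name (str) or a pair (left_subtree, right_subtree)
    states: dict mapping leaf name -> character in this column
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    lset, lcost = fitch_site(left, states)
    rset, rcost = fitch_site(right, states)
    common = lset & rset
    if common:                                # children agree: no extra change
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1     # disagreement: one substitution

def parsimony_score(tree, alignment):
    """Sum of the per-column minimum substitution counts."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

# Hypothetical toy alignment of four taxa
alignment = {"W": "AAGT", "X": "AAGC", "Y": "CTGC", "Z": "CTGT"}

# The three possible unrooted topologies for four taxa
candidates = [
    (("W", "X"), ("Y", "Z")),
    (("W", "Y"), ("X", "Z")),
    (("W", "Z"), ("X", "Y")),
]
for tree in candidates:
    print(tree, parsimony_score(tree, alignment))
best = min(candidates, key=lambda t: parsimony_score(t, alignment))
```

With only four taxa all topologies can be scored; for realistic numbers of sequences the number of trees explodes, and programs rely on heuristic searches instead.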

 

c.    Maximum Likelihood Analyses

given a model for sequence evolution, find the tree that has the highest probability under this model.

This approach can also be used to successively refine the model.
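The simplest possible illustration of the likelihood idea is a "tree" with only two sequences, where the only thing to estimate is the distance between them. The sketch below assumes the Jukes-Cantor model (all substitutions equally likely) and made-up counts of sites and differences; it scans a grid of candidate distances and keeps the one with the highest likelihood. A real maximum likelihood program does the same kind of optimization over tree topologies, all branch lengths, and often the model parameters as well. The function names are hypothetical.

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences that are d substitutions per site apart."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * decay   # probability of one specific different base
    # 0.25 is the equilibrium frequency of the base in the first sequence
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))

def ml_distance(n_sites, n_diff, step=0.001, max_d=3.0):
    """Grid search for the distance with the highest likelihood."""
    grid = [i * step for i in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))

# Hypothetical data: 100 aligned sites, 10 of which differ
estimate = ml_distance(100, 10)
analytic = -0.75 * math.log(1.0 - 4.0 * 0.10 / 3.0)  # Jukes-Cantor formula
print(round(estimate, 3), round(analytic, 3))
```

For this model the likelihood maximum coincides with the familiar Jukes-Cantor corrected distance, which makes the sketch easy to check.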

Else:
spectral analyses (i.e., look at patterns of substitutions): evolutionary parsimony, Hadamard conjugation

Another way to categorize methods of phylogenetic reconstruction is to ask if they are using

(i) an optimality criterion (e.g.: smallest error between distance matrix and distances in tree, least number of steps), or
(ii) algorithmic approaches (UPGMA or neighbor joining)
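As an example of an algorithmic approach, the following is a minimal sketch of UPGMA: repeatedly join the two closest clusters and replace them by one cluster whose distance to every other cluster is the size-weighted average of the two. The input format (a dict keyed by pairs of names), the function name, and the toy distances are assumptions for this illustration; branch lengths are not computed, only the branching order.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster by UPGMA; return the tree as nested pairs of names.

    names: list of sequence names
    dist:  dict mapping (name_a, name_b) -> distance, one entry per pair
    """
    sizes = {name: 1 for name in names}            # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # pick the closest pair of current clusters
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        for c in sizes:
            if c != a and c != b:
                # average distance, weighted by the number of leaves in a and b
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))]
                    + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes[a] + sizes[b]
        del sizes[a], sizes[b]
    return next(iter(sizes))

# Hypothetical toy distance matrix
toy_dist = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
tree = upgma(["A", "B", "C", "D"], toy_dist)
print(tree)
```

There is no optimality criterion here: the procedure always returns exactly one tree, and it implicitly assumes that all lineages evolve at the same rate.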

5) Interpret the result.

It is especially important to consider artifacts that might originate in phylogenetic reconstruction, and
to assess the reliability of your results.

BREAK ?

 

Blast, PSI–blast, and Homology

Are two similar sequences homologous (i.e., is their similarity due to shared ancestry) ?

One way to quantify the similarity between two sequences is to

1.    compare the actual sequences and calculate alignment score

2.    randomize (scramble) one (or both) of the sequences and calculate the alignment score for the randomized sequences.

3.    repeat step 2 at least 100 times

4.    describe distribution of randomized alignment scores

5.    do a statistical test to determine if the score obtained for the real sequences is significantly better than the score for the randomized sequences
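The five steps above can be sketched as follows. This is not how PRSS is implemented; it is a minimal illustration that assumes a simple local alignment score (Smith-Waterman with a linear gap penalty and made-up match/mismatch values instead of a substitution matrix), made-up sequences, and the number of standard deviations above the shuffled mean as the "statistical test". All function names are hypothetical.

```python
import random
import statistics

def alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            s = max(0,
                    prev[j - 1] + (match if x == y else mismatch),
                    prev[j] + gap,
                    cur[j - 1] + gap)
            cur.append(s)
            best = max(best, s)
        prev = cur
    return best

def shuffle_test(a, b, n=100, seed=0):
    """Compare the real score with scores obtained after scrambling b."""
    rng = random.Random(seed)
    real = alignment_score(a, b)                     # step 1
    letters = list(b)
    scores = []
    for _ in range(n):                               # steps 2 and 3
        rng.shuffle(letters)
        scores.append(alignment_score(a, "".join(letters)))
    mean = statistics.mean(scores)                   # step 4
    sd = statistics.stdev(scores)
    z = (real - mean) / sd if sd > 0 else float("inf")   # step 5
    return real, mean, sd, z

# Hypothetical sequences: b is a with three substitutions
a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
b = "MKTAYLAKQRQISFVKAHFSRQLEDRLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
real, mean, sd, z = shuffle_test(a, b)
print("real score", real, "shuffled mean %.1f" % mean, "sd %.2f" % sd, "z %.1f" % z)
```

Shuffling keeps the length and composition of the sequence but destroys the order of the residues, so the shuffled scores show what similarity to expect from composition alone.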

To illustrate the assessment of similarity/homology we will use a program from Pearson's FASTA package called PRSS. 
This and many other programs by Bill Pearson are available from his web page at ftp://ftp.virginia.edu/pub/fasta/

A version for PCs is here, a web version is available here.

Go through example. Sequences are here (fl), here (B), here (A) and here (A2)

There are many other alignment programs.  BLAST is a program that is widely used and offered through the NCBI (go here for more info).  It also offers pairwise comparisons (go here to do the example).

To force the program to report an alignment, increase the E-value threshold.

Rules of thumb:

If you can demonstrate significant similarity using either randomization or an unweighted blast search, your sequences are homologous (i.e. related by common ancestry).  Convergent evolution has not been shown to lead to sequence similarities detectable by these means (see above - this might not be true for scores in PSI-blast)

If the actual alignment score is more than three standard deviations (of the randomized sequences) better than the mean for the randomized sequences, the two sequences are homologous (i.e. related by common ancestry).  PRSS and many other programs use more accurate distributions to describe the distribution of random hits.  The expectation value for the alignment score of the actual sequences is based on these statistics.

Usually E-values (in a blast search or through randomization) smaller than 10^-4 are convincing.

 

Terminology

E-values give the expected number of matches with an alignment score this good or better,

P-values give the probability of finding a match of this quality or better.

P-values lie in [0,1], E-values in [0,infinity).  For small values E and P are approximately equal.

z-values give the distance between the actual alignment score and the mean of the scores for the randomized sequences, expressed as multiples of the standard deviation calculated for the randomized scores.  For example: a z-value of 3 means that the actual alignment score is 3 standard deviations better than the average for the randomized sequences.  z-values > 3 are usually considered as suggestive of homology, z-values > 5 are considered as sufficient demonstration (see the BUT below).

A somewhat readable description of E, P, HSP and other values is here.
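Why E and P agree for small values: if the number of chance matches with a given score or better is assumed to follow a Poisson distribution with mean E (the assumption behind the BLAST statistics), then the probability of finding at least one such match is P = 1 - exp(-E). The short sketch below only evaluates this formula; the function names are hypothetical.

```python
import math

def p_from_e(e):
    """Probability of at least one chance match, given the expected number e."""
    return -math.expm1(-e)      # 1 - exp(-e), accurate for very small e

def e_from_p(p):
    """Inverse of p_from_e: expected number of chance matches for probability p."""
    return -math.log1p(-p)      # -ln(1 - p)

for e in (1e-10, 1e-4, 0.1, 1.0, 10.0):
    print("E = %g  ->  P = %.6g" % (e, p_from_e(e)))
```

For E well below 0.01 the two numbers are practically identical; for large E the P-value approaches 1 while E keeps growing.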

BUT:
Failure to detect significant similarity only shows our inability to detect homology; it does not prove that the sequences are not homologous.

Examples:

Jim Knox (MCB-UConn) has studied many proteins involved in bacterial cell wall biosynthesis and antibiotic binding, synthesis or destruction. Many of these proteins have identical 3-D structures, and therefore can be assumed to be homologous; however, the above tests fail to detect these homologies (for example, enzymes with GRASP nucleotide binding sites are depicted here).

DNA replication involves many different enzymes. Some of the proteins do the same thing in bacteria, archaea and eukaryotes; they have similar 3-D structures (e.g.: sliding clamp, E. coli dnaN and eukaryotic PCNA, see Edgell and Doolittle, Cell 89, 995-998), but again, the above tests fail to detect homology.

Helicase and F1-ATPase. Both form hexamers with something rotating in the middle (either the gamma subunit or the DNA; D. Crampton, pers. communication). The monomers have the same type of nucleotide binding fold (picture)

BLAST and PSI BLAST

Run a blast trial with  this sequence

A tutorial for standard blast search is here

An easy way to force the program to report less significant matches is to increase the expect value in the advanced blast page. 

PSI-blast provides an enormous advantage over normal blast in the detection of distantly related sequences.  It only works if some closely related sequences are already available, but if this is the case it finds a lot of other distantly related sequences. 

The NCBI page describes PSI blast as follows:
Position-Specific Iterated BLAST (PSI-BLAST) provides an automated, easy-to-use version of a "profile" search, which is a sensitive way to look for sequence homologues. The program first performs a gapped BLAST database search. The PSI-BLAST program uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching. PSI-BLAST may be iterated until no new significant alignments are found. At this time PSI-BLAST may be used only for comparing protein queries with protein databases. 

A diagram giving an overview of the PSI-blast procedure is here.

The results of a normal blast search are aligned and a pattern of conserved residues is extracted from the alignment.  This pattern is used for the next iteration.  An important parameter to adjust is the E-value threshold down to which matches are included in the alignment and pattern extraction. 
At higher iterations a PSI blast profile can be corrupted and false positives are identified with “significant” E-values.   I.e., in a traditional blast search one can be quite certain that a match with an E-value of 10^-13 represents a homologue; this is not clear with PSI blast.  Test studies indicate that profile corruptions are likely after more than 5 iterations. On the positive side: there are many fewer false negatives with PSI blast than with normal blast.
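The "pattern" extracted from the aligned hits is a position-specific score matrix. A much simplified sketch of the idea, and of why the matrix is more sensitive than a single query sequence, is given below. It is not the PSI-BLAST algorithm: it assumes an ungapped toy alignment, a uniform background frequency for all 20 amino acids, and a flat pseudocount, whereas the real program weights the sequences and derives pseudocounts from a substitution matrix. All names and sequences are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Log-odds score for every amino acid at every alignment column."""
    background = 1.0 / len(ALPHABET)          # assumption: uniform background
    pssm = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm

def score(pssm, segment):
    """Score an ungapped segment of the same length as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

# Hypothetical hits from a first search round, already aligned
hits = ["GKSTL", "GKTTL", "GKSSL", "GRSTL"]
pssm = build_pssm(hits)
for candidate in ("GKSTL", "GRTSL", "WWWWW"):
    print(candidate, round(score(pssm, candidate), 2))
```

The sketch also suggests how a profile gets corrupted: once a false positive is included among the hits, its residues raise the scores of similar unrelated sequences in the next iteration.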