Introduction

Grading Criteria:

1. 25% participation and assignments
2. 25% Quiz
3. 50% Exam (June 14, in class)

There is no required textbook; however, there are recommended books.

Bioinformatics is an area of science that uses computational approaches to answer biological questions.

Why is it important to do bioinformatics in an evolutionary context?

Problem: Application of first principles does not (yet) work:
The following chain is believed to be determined mainly by the DNA sequence (plus other components of the cell, which in turn are encoded by other parts of the genome), yet at present it cannot be simulated in a computer:

DNA sequence -> transcription -> translation -> protein folding -> protein function (catalytic and other properties) -> properties of the organism(s) -> ecology (also taking the non-biological environment into account) -> ...

Most scientists believe that the principle of reductionism (plus new laws and relations emerging on each level) is true for this chain; however, this is clearly "in principle" only. Biology relies on this sequence to work more or less unambiguously, but:

At several steps along the way from DNA to function our understanding of the chemical and physical processes involved is so incomplete that prediction of protein function based on only a single DNA sequence is at present impossible (at least for a protein of reasonable size).
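
To make the contrast concrete: the first two steps of the chain (transcription and translation) can be computed directly from the standard genetic code, whereas no comparably simple rule exists for folding or function. The following is a minimal Python sketch, not part of the original notes. The function names are made up for illustration, and it assumes the input is the coding strand, read in frame from its first base.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """Return the mRNA for a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA in frame from its first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGAAATGGTGA")
    print(mrna, "->", translate(mrna))
```

The sketch stops at the amino acid sequence. Everything after that point in the chain is where prediction from first principles breaks down.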

Solution:
Use evolutionary context. As Theodosius Dobzhansky put it: "Nothing in biology makes sense except in the light of evolution."

Present-day proteins evolved through substitution and selection from ancestral proteins. Related proteins have similar sequence AND similar structure AND similar function.

Functional similarity can refer to:

identical function,

similar function, e.g.:
identical reactions catalyzed in different organisms; or the same catalytic mechanism but a different substrate (malic and lactic acid dehydrogenases); or similar subunits and domains that are brought together through a (hypothetical) process called domain shuffling, e.g. nucleotide-binding domains in hexokinase, myosin, HSP70, and ATP synthases.

Experience shows that protein sequence space is so big that similar sequences do not arise through convergent evolution (at least if significant similarity is detectable through pairwise comparison). Here is a simple estimate:

There are 20⁶⁰⁰ ≈ 4*10⁷⁸⁰ possible proteins with a length of 600 amino acids. For comparison, the universe contains only about 10⁸⁹ protons and has an age of about 5*10¹⁷ seconds, or 5*10²⁹ picoseconds. If every proton in the universe were a computer that explored one possible protein sequence per picosecond, we would have explored only 5*10¹¹⁸ sequences so far, i.e. a negligible fraction of the possible sequences of length 600 (one in about 10⁶⁶²).
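
The arithmetic in this estimate can be checked with exact integers. The following Python sketch simply recomputes the figures quoted above; the proton count and the age of the universe are taken from the text as given, and the function name is chosen here for illustration.

```python
import math

ALPHABET_SIZE = 20               # standard amino acids
PROTEIN_LENGTH = 600             # residues
PROTONS = 10**89                 # figure quoted in the text
AGE_PICOSECONDS = 5 * 10**29     # 5*10^17 s times 10^12 ps per second


def sequence_space_estimate():
    """Return log10 of the possible, explored, and unexplored-ratio counts."""
    possible = ALPHABET_SIZE ** PROTEIN_LENGTH   # exact big integer
    explored = PROTONS * AGE_PICOSECONDS         # one sequence per proton per ps
    log_possible = math.log10(possible)
    log_explored = math.log10(explored)
    return {
        "possible": possible,
        "explored": explored,
        "log10_possible": log_possible,
        "log10_explored": log_explored,
        # explored fraction is about one in 10**log10_ratio
        "log10_ratio": log_possible - log_explored,
    }


if __name__ == "__main__":
    result = sequence_space_estimate()
    print(f"possible sequences: about 10^{result['log10_possible']:.1f}")
    print(f"explored sequences: about 10^{result['log10_explored']:.1f}")
    print(f"explored fraction:  about one in 10^{result['log10_ratio']:.0f}")
```

Because the gap spans hundreds of orders of magnitude, the conclusion does not depend on the exact values assumed for the number of protons or the age of the universe.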

The following is based on observation and not on an a priori truth:

If two sequences show significant similarity in their primary sequence, they share ancestry and probably have similar function (although some proteins acquired radically new functional assignments, e.g. lysozyme -> lens crystallin).

To date there is no example known where convergent evolution has led to significant similarity of the primary sequence.
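
What "significant" means here can be illustrated with a shuffling test: the observed similarity of two sequences is compared with the similarity obtained after randomly shuffling one of them. The Python sketch below is a toy illustration only and is not taken from the notes. It assumes the two sequences are already aligned and of equal length, it scores plain identity, and its function names are made up. Real search tools instead compute alignments with gaps and substitution matrices and assess them statistically.

```python
import random


def identity(a, b):
    """Fraction of identical positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def shuffle_p_value(a, b, n_shuffles=1000, seed=0):
    """Empirical probability that a shuffled b matches a at least as well as b does."""
    rng = random.Random(seed)      # fixed seed so the result is reproducible
    observed = identity(a, b)
    residues = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)      # keeps composition, destroys residue order
        if identity(a, residues) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    seq_a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # arbitrary example sequence
    seq_b = "MKTAYIAKQRNISFVKSHFSRQLDERLGLIEVQ"   # same sequence with two substitutions
    print(identity(seq_a, seq_b), shuffle_p_value(seq_a, seq_b))
```

Shuffling preserves amino acid composition, so a small p-value indicates that the match depends on the order of the residues and not merely on the two sequences having similar composition.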

THE REVERSE IS NOT TRUE: DOMAINS WITH THE SAME OR SIMILAR FUNCTION DO NOT ALWAYS SHOW SIGNIFICANT SIMILARITY for one of two reasons:

a) they evolved independently (e.g. different types of nucleotide binding sites); or

b) they underwent so many substitution events that there is no readily detectable similarity remaining.

In particular, DOMAINS WITH SHARED ANCESTRY DO NOT ALWAYS SHOW SIGNIFICANT SIMILARITY (reason: see b above). Many recent advances concern the improved detection of similarity.