CLASS 10. Looking for distant homologs.

HOMEWORK READING:

Ch.6 (textbook);

Sections 6.1, 6.2 (supp. textbook) [OPTIONAL]
Quaerendo Invenietis


Position-Specific Scoring Matrices [PSSM]



Fig. BLOSUM62 matrix. [Source: Wikipedia]


Example of proteins with no detectable significant sequence similarity but similar tertiary structure: here. Try to align this and this sequence in PRSS.

Regular substitution matrices (such as the BLOSUM62 matrix shown above) are general scoring schemes: the score for a pair of amino acids is the same wherever the pair occurs. PSSMs can be viewed as an extension of substitution matrices that takes into account the probability of a given amino acid occurring at a specific position in the protein. To generate a PSSM, the sequences must first be aligned.
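As a rough illustration of how aligned sequences become position-specific scores, here is a minimal sketch. It assumes an ungapped toy alignment, a uniform background frequency for all 20 amino acids, and a simple pseudocount scheme; none of these choices come from the textbook, and the names build_pssm and ALPHABET are made up for this example.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per alignment column, mapping residue -> log2-odds score.

    Assumptions (illustrative only): no gap characters in the alignment,
    and a uniform background frequency of 1/20 for every residue.
    """
    background = 1.0 / len(ALPHABET)
    n_seqs = len(alignment)
    pssm = []
    for column in zip(*alignment):  # iterate over alignment columns
        counts = Counter(column)
        scores = {}
        for residue in ALPHABET:
            # Observed frequency, smoothed so unseen residues do not get -infinity.
            freq = (counts[residue] + pseudocount * background) / (n_seqs + pseudocount)
            scores[residue] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


# Invented toy alignment of four short peptides.
alignment = ["MKVL", "MKIL", "MRVL", "MKVF"]
pssm = build_pssm(alignment)

for position, scores in enumerate(pssm, start=1):
    best = max(scores, key=scores.get)
    print(position, best, round(scores[best], 2))
```

Column 1 contains only M, so M gets a high positive score there and every other residue a negative one. Column 2 gives positive scores to both K and R, which is the position-specific information a general substitution matrix cannot express.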



Fig. Example of a profile (PSSM + info on how to treat gaps). The entry in each cell is the log-odds score for finding that amino acid at that position in a target sequence. [Source]. For an example of how the entries in the matrix can be calculated, refer to Fig. 6.1 and 6.2 in the textbook.
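Scoring a target sequence against a PSSM amounts to summing, column by column, the entry for the residue found at each position. The sketch below slides a three-column PSSM along a target and reports the best ungapped window. The score values, the DEFAULT penalty, and the function names are invented for illustration, and the gap-handling part of a real profile is left out.

```python
# Hypothetical 3-column PSSM; the numbers are invented log-odds scores.
# Residues not listed in a column receive DEFAULT.
DEFAULT = -2.0
PSSM = [
    {"M": 3.0, "L": 1.0},
    {"K": 2.5, "R": 1.5},
    {"V": 2.0, "I": 1.5, "L": 1.0},
]


def score_window(pssm, window):
    """Sum the position-specific scores for one ungapped window."""
    return sum(column.get(residue, DEFAULT) for column, residue in zip(pssm, window))


def best_match(pssm, target):
    """Return (score, start index) of the best-scoring window in target."""
    width = len(pssm)
    scored = [
        (score_window(pssm, target[i:i + width]), i)
        for i in range(len(target) - width + 1)
    ]
    return max(scored)


best_score, best_start = best_match(PSSM, "GAMRIWT")
print(best_score, best_start)  # window "MRI" starting at index 2 scores 3.0 + 1.5 + 1.5
```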


NCBI offers a PSSM viewer, in which a PSSM can be viewed color-coded, is cross-linked to the Amino Acid Explorer, and can be re-sorted by different features (e.g., examine PSSM-id 110296).

PSI-BLAST

Position-Specific Iterative BLAST (PSI-BLAST) is an enhanced BLAST program intended to help find more distant homologs of a query sequence. A prerequisite for it to work is the presence of more closely related sequences in the database, since these are used to build the profile.


Fig. Overview of PSI-BLAST.




Fig. 6.4 [Source]. Illustration of how PSI-BLAST constructs PSSMs from local pairwise alignments.


E-values from PSI-BLAST are not the same as in "regular" BLAST: from the second iteration onward, the E-value reported in a PSI-BLAST search describes the match to the profile, not to the original query sequence.


Inclusion of unrelated sequences can corrupt the PSSM, and every later iteration then searches with the corrupted profile. Therefore it is not recommended to exceed three to five PSI-BLAST iterations.
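The iterate-and-rebuild idea can be sketched with a toy model. This is not how PSI-BLAST is implemented: the sketch assumes equal-length ungapped sequences, a raw score threshold standing in for the E-value cutoff, a PSSM built from the query alone standing in for the first-round BLOSUM62 search, and a uniform background. All sequences and names are invented.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BG = 1.0 / len(ALPHABET)  # assumed uniform background frequency


def pssm_from(seqs, pseudocount=1.0):
    """Log2-odds PSSM from equal-length, ungapped sequences."""
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)
        pssm.append({
            a: math.log2((counts[a] + pseudocount * BG) / (len(seqs) + pseudocount) / BG)
            for a in ALPHABET
        })
    return pssm


def score(pssm, seq):
    return sum(column[residue] for column, residue in zip(pssm, seq))


def iterative_search(query, database, threshold, max_iterations=5):
    """Toy profile iteration: search, rebuild the PSSM from the hits, repeat."""
    included = {query}
    history = []  # number of hits found in each round
    for _ in range(max_iterations):
        pssm = pssm_from(sorted(included))
        hits = {s for s in database if score(pssm, s) >= threshold}
        history.append(len(hits))
        if hits | {query} == included:  # converged: no change in the hit set
            break
        included = hits | {query}
    return included, history


database = ["MKVLAS", "MRILAS", "MRIFAS", "WWWWWW"]
included, history = iterative_search("MKVLAT", database, threshold=10.0)
print(history, sorted(included))
```

In this toy run each round recruits one more distant sequence until the hit set stops changing, while the unrelated sequence is never included. The max_iterations cap of 5 mirrors the recommendation above; with a looser threshold, an unrelated sequence admitted in one round would shape the PSSM used in all later rounds.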

The PSI-BLAST interactive tutorial shows step by step how PSI-BLAST works and how to use its web-based interface.

Constructed PSSMs can be saved, and a query sequence can then be searched against a database of PSSMs. The RPS-BLAST (Reverse Position-Specific BLAST) program is used for this task. Example: Conserved Domains search.

Hidden Markov Models

In general, a process has the Markov property if the probability of its future states depends only upon its present state, not on the states that preceded it. (In a higher-order Markov model this is relaxed to the present state plus a fixed number of past states.) A Markov chain is a discrete random process with the Markov property. Markov chains are often represented as directed graphs. For example, a DNA sequence can be represented by this Markov model:



Fig. Markov Model of a DNA sequence. Any path through the model from the start state to the end state will produce (emit) a DNA sequence. B: start state; E: end state.


Components of a Markov Model:

  • States (nodes on the graph)
  • Transition probabilities (how one state changes to another state, shown as arrows on the graph)
  • Emission probabilities (how likely each state is to produce a given symbol)
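The DNA model in the figure can be sketched as follows. The transition probabilities are invented for illustration (the figure's actual values are not reproduced here), and each nucleotide state is assumed to emit its own letter with probability 1, so emission probabilities do not appear explicitly.

```python
import random

NUCLEOTIDES = "ACGT"

# Hypothetical transition probabilities; every row sums to 1.
# "B" is the start state and "E" the end state.
TRANSITIONS = {"B": {n: 0.25 for n in NUCLEOTIDES}}
for n in NUCLEOTIDES:
    TRANSITIONS[n] = {m: 0.225 for m in NUCLEOTIDES}
    TRANSITIONS[n]["E"] = 0.1


def sequence_probability(seq):
    """Probability of the path B -> seq[0] -> ... -> seq[-1] -> E."""
    p, state = 1.0, "B"
    for symbol in list(seq) + ["E"]:
        p *= TRANSITIONS[state].get(symbol, 0.0)
        state = symbol
    return p


def generate(rng):
    """Walk from B to E, emitting the nucleotide of each state visited."""
    state, emitted = "B", []
    while True:
        choices = TRANSITIONS[state]
        state = rng.choices(list(choices), weights=list(choices.values()))[0]
        if state == "E":
            return "".join(emitted)
        emitted.append(state)


print(sequence_probability("ACG"))  # 0.25 * 0.225 * 0.225 * 0.1
print(generate(random.Random(0)))
```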

Profile HMMs are only one possible use of HMMs in bioinformatics applications.
Similar to the DNA sequence model above, it is possible to define a Markov model that summarizes the information in a multiple alignment. Such a model is referred to as a profile HMM and is a generalization of a PSSM, in particular because it allows position-specific treatment of indels (gaps).



Fig. A complete profile HMM.
[Source]

The i-th residue of a protein can be emitted from multiple alternative states. Since the residue itself does not carry information about which state it originated from, the model is referred to as a "hidden" Markov model.
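To make the "hidden" part concrete, here is a small sketch that uses the forward algorithm on a two-state toy HMM. It is not a full profile HMM: there are only a match-like state "M" and an insert-like state "I", a two-letter alphabet, and no delete or end states. All probabilities are invented.

```python
STATES = ("M", "I")  # hypothetical match-like and insert-like states
START = {"M": 0.8, "I": 0.2}
TRANS = {"M": {"M": 0.9, "I": 0.1}, "I": {"M": 0.5, "I": 0.5}}
# Both states can emit both residues, so a residue alone does not reveal its state.
EMIT = {"M": {"L": 0.9, "G": 0.1}, "I": {"L": 0.5, "G": 0.5}}


def forward(seq):
    """Forward algorithm: P(seq, last state = s) for each state s."""
    f = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for symbol in seq[1:]:
        f = {
            s: EMIT[s][symbol] * sum(f[prev] * TRANS[prev][s] for prev in STATES)
            for s in STATES
        }
    return f


def sequence_probability(seq):
    """Total probability of seq, summed over all hidden state paths."""
    return sum(forward(seq).values())


joint = forward("L")
posterior_match = joint["M"] / sum(joint.values())
print(joint, round(posterior_match, 3))
print(sequence_probability("LLG"))
```

A single observed L has joint probability 0.72 with state M and 0.10 with state I, so the state can only be inferred probabilistically. Summing over all possible state paths, as the forward algorithm does, is what gives the probability of the sequence under the model.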

HMM applications in bioinformatics

  • Modeling Protein Families (e.g., Superfamily DB)
  • Characterization of Protein Domains (e.g., Pfam DB)
  • Multiple Sequence Alignment (e.g., ProbCons program)
  • Gene Finding (e.g., GENSCAN)