Different Formats for DNA and Protein Sequences

There are many different representations (formats) of DNA and protein sequences, often combined with ancillary annotations. Almost every program introduces its own "native" format optimized for that program's needs. Some programs include converters from one format to another, but others read sequences in only one or a few formats. Being able to recognize formats and to convert files between them is therefore very important for successful data analysis.

Standard sets of symbols are used to abbreviate nucleotides and amino acids. The characters can be either upper or lower case. For nucleic acids, the symbols listed below allow a sequence to be entered with full account of any ambiguities in it.

          Symbol   Meaning
          ------   -------
            A       Adenine
            G       Guanine
            C       Cytosine
            T       Thymine
            U       Uracil
            Y       pYrimidine  (C or T)
            R       puRine      (A or G)
            W       "Weak"      (A or T)
            S       "Strong"    (C or G)
            K       "Keto"      (T or G)
            M       "aMino"     (C or A)
            B       not A       (C or G or T)
            D       not C       (A or G or T)
            H       not G       (A or C or T)
            V       not T       (A or C or G)
          X,N,?     unknown     (A or C or G or T)
            O       deletion
            -       deletion
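
The table above can be turned into a small lookup, which makes it easy to test whether two symbols could stand for the same base. This is only a minimal sketch in Python: the names IUPAC_BASES, possible_bases and symbols_compatible are illustrative and do not come from any particular package, and treating T and U as distinct bases is an assumption made here for simplicity.

```python
# Nucleotide symbols from the table above, mapped to the bases they can stand for.
IUPAC_BASES = {
    "A": "A", "G": "G", "C": "C", "T": "T", "U": "U",
    "Y": "CT", "R": "AG", "W": "AT", "S": "CG", "K": "TG", "M": "CA",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "X": "ACGT", "N": "ACGT", "?": "ACGT",
}
# Symbols that mark a deletion rather than a base.
DELETION_SYMBOLS = {"O", "-"}


def possible_bases(symbol):
    """Return the set of bases a symbol can stand for (empty set for a deletion)."""
    symbol = symbol.upper()  # the codes may be given in upper or lower case
    if symbol in DELETION_SYMBOLS:
        return set()
    return set(IUPAC_BASES[symbol])


def symbols_compatible(a, b):
    """Return True if the two symbols share at least one possible base.

    Assumption of this sketch: T and U are treated as different bases.
    """
    return bool(possible_bases(a) & possible_bases(b))
```

For example, symbols_compatible("R", "A") is true because a purine may be adenine, while symbols_compatible("Y", "R") is false because no base is both a pyrimidine and a purine.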

Protein sequences are written in the one-letter code used by the late Margaret Dayhoff's group in the Atlas of Protein Sequences. The codes are as follows:

               Symbol               Stands for
               ------               ----------
                 A                     ala
                 B                     asx
                 C                     cys
                 D                     asp
                 E                     glu
                 F                     phe
                 G                     gly
                 H                     his
                 I                     ileu
                 J                  (not used)
                 K                     lys
                 L                     leu
                 M                     met
                 N                     asn
                 O                  (not used)
                 P                     pro
                 Q                     gln
                 R                     arg
                 S                     ser
                 T                     thr
                 U                  (not used)
                 V                     val
                 W                     trp
                 X             unknown amino acid
                 Y                     tyr
                 Z                     glx
                 *                nonsense (stop)
                 ?        unknown amino acid or deletion
                 -                   deletion

Here "nonsense" means a nonsense (chain termination) codon, and "unknown" means an amino acid whose identity has not been determined. The state "asx" means "either asn or asp", and the state "glx" means "either gln or glu". The state "deletion" means that alignment studies indicate a deletion has happened in the ancestry of this position, so that it is no longer present.

Here are the same one-letter codes tabulated the other way 'round:

              Amino acid               One-letter code
              ----------               ---------------
                ala                           A
                arg                           R
                asn                           N
                asp                           D
                asx                           B
                cys                           C
                gln                           Q
                glu                           E
                gly                           G
                glx                           Z
                his                           H
                ileu                          I
                leu                           L
                lys                           K
                met                           M
                phe                           F
                pro                           P
                ser                           S
                thr                           T
                trp                           W
                tyr                           Y
                val                           V
                deletion                      -
                nonsense (stop)               *
                unknown amino acid            X
                unknown (incl. deletion)      ?
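
The two tables are inverses of each other, so in a program only one of them needs to be typed in. The following Python sketch builds the reverse table from the forward one and spells out a one-letter sequence. The names ONE_TO_THREE, THREE_TO_ONE, SPECIAL and spell_out are illustrative, and the short labels used for the special symbols are an assumption of this example.

```python
# One-letter amino acid codes from the first table above.
ONE_TO_THREE = {
    "A": "ala", "B": "asx", "C": "cys", "D": "asp", "E": "glu",
    "F": "phe", "G": "gly", "H": "his", "I": "ileu", "K": "lys",
    "L": "leu", "M": "met", "N": "asn", "P": "pro", "Q": "gln",
    "R": "arg", "S": "ser", "T": "thr", "V": "val", "W": "trp",
    "Y": "tyr", "Z": "glx",
}
# The second table is obtained by inverting the first one.
THREE_TO_ONE = {name: letter for letter, name in ONE_TO_THREE.items()}

# Special symbols; the labels on the right are chosen for this example only.
SPECIAL = {"X": "unknown", "*": "stop", "?": "unknown/deletion", "-": "deletion"}


def spell_out(protein):
    """Translate a one-letter protein string into space-separated names."""
    names = []
    for letter in protein.upper():
        if letter in ONE_TO_THREE:
            names.append(ONE_TO_THREE[letter])
        elif letter in SPECIAL:
            names.append(SPECIAL[letter])
        else:
            # J, O and U are not used in this code
            raise ValueError("symbol not used in the one-letter code: " + letter)
    return " ".join(names)
```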

The simplest format for DNA/protein sequences is FASTA format (also called Pearson format). It is used in a variety of molecular biology software. Every sequence in the file starts with a "greater than" character (>). That character is followed by an identifier of the sequence (e.g. name, description, gi number) and a carriage return (a line break, displayed as a paragraph sign in some word processors). This line is called the definition line. Everything after the carriage return, up to the next definition line or the end of the file, is considered to be the sequence. An example of a sequence in FASTA format is shown below:

>gi|2978501|gb|AAC06133.1| vacuolar ATPase proteolipid subunit [Giardia intestinalis]
MSSIDSPVAVEKCPAGASFWSMLGQVVAVVFSSIGAAYGTAKAGSGLGV
AGLINPAPVTKLTLPVIMAGILSIYGLITSLLINSRVRSYTNGMPLYVS
YAHFGAGLCCGLAALAAGLAIGVSGSAAVKAVAKQPSLFVVMLIVLIFS
EALALYGLIIALILSTKSADSNFCVNNVNQ
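
Because the format is so simple, a reader for it takes only a few lines. The Python sketch below follows the rules just described: a line starting with ">" opens a new record, and all following lines are joined into the sequence. The function name parse_fasta is illustrative, and the sketch assumes that the whole file has already been read into a string.

```python
def parse_fasta(text):
    """Yield (definition_line, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        if line.startswith(">"):
            # a new definition line closes the previous record, if any
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # sequence lines are concatenated without the line breaks
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Lines that appear before the first definition line are silently skipped in this sketch; a stricter reader might report them as an error.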

Practical Tip1: The first line may be long and may appear wrapped in a text editor or web browser. Web browsers often introduce carriage returns at the end of every displayed line of text. If you cut and paste a sequence from a web browser, check its contents in a text editor (such as MS Word) and remove any carriage returns from within the definition line. Otherwise, part of the definition line will be interpreted as sequence.

Practical Tip2: It is often convenient to put the species name at the beginning of the definition line, because programs often take the first several characters of the definition line as the identifier of the sequence (see examples below).
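
One way to see why this matters is to check whether the definition lines are still distinct after truncation. The Python sketch below assumes a width of ten characters, which matches the name length of the PHYLIP format described later; other programs may use a different width. The function names are illustrative.

```python
from collections import Counter


def truncated_ids(definition_lines, width=10):
    """Return the first `width` characters of each definition line."""
    return [line[:width] for line in definition_lines]


def colliding_ids(definition_lines, width=10):
    """Return the truncated identifiers that occur more than once."""
    counts = Counter(truncated_ids(definition_lines, width))
    return sorted(name for name, n in counts.items() if n > 1)
```

For example, colliding_ids(["Escherichia coli K12", "Escherichia coli O157"]) returns ["Escherichi"], which shows that these two sequences would receive the same identifier.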

GenBank format (also called GenBank flat file format) is one of the formats that show the complete information on a sequence entry. Every record in GenBank format consists of three parts: the header, the features that describe the annotations on the record, and the sequence itself. A very detailed description of every field in a GenBank record is available here.

An example of a GenBank record is shown below:


LOCUS       AF123456     1512 bp    mRNA            VRT       23-MAR-1999
DEFINITION  Gallus gallus testis-specific mRNA sequence.
ACCESSION   AF123456
VERSION     AF123456.1  GI:4454562
KEYWORDS    .
SOURCE      chicken.
  ORGANISM  Gallus gallus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Archosauria;
            Aves; Neognathae; Galliformes; Phasianidae; Phasianinae; Gallus.
REFERENCE   1  (bases 1 to 1512)
  AUTHORS   Nanda,I., Shan,Z.H., Schartl,M., Burt,D.V., Koehler,M.,
            Nothwang,H.-G., Gruetzner,F., Paton,I.R., Windsor,D., Dunn,I.,
            Engel,W., Staeheli,P., Mizuno,S., Haaf,T. and Schmid,M.
  TITLE     300 million years of conserved synteny between chicken Z and human
            chromosome 9
  JOURNAL   Nat. Genet. 21 (3), 258-259 (1999)
  MEDLINE   99178258
REFERENCE   2  (bases 1 to 1512)
  AUTHORS   Shan,Z.H. and Haaf,T.
  TITLE     Isolation of the Z-linked copy of human DMT1 in chicken and
            analyzing its role in developing gonads, further evidence for
            evolutionary conservation of sex determining genes
  JOURNAL   Unpublished
REFERENCE   3  (bases 1 to 1512)
  AUTHORS   Haaf,T. and Shan,Z.H.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JAN-1999) Max-Planck Institute for Molecular
            Genetics, Ihnestr. 73, Berlin 14195, Germany
COMMENT     [WARNING] On Dec 23, 1999 this sequence was replaced by a newer
            version gi:6633795.
FEATURES             Location/Qualifiers
     source          1..1512
                     /organism="Gallus gallus"
                     /db_xref="taxon:9031"
                     /chromosome="Z"
                     /map="Zp21"
                     /note="testis-specific mRNA"
BASE COUNT      320 a    434 c    473 g    285 t
ORIGIN
        1 cccggcgcgg gcaagaagct gccgcgtctg cccaagtgtg cccgctgccg caaccacggc
       61 tactcctcgc cgctgaaggg gcacaagcgg ttctgcatgt ggcgggactg ccagtgcaag
      121 aagtgcagcc tgatccgccg agcggcaggg gtgatggccg tgcaggttgc actgaggagg
      181 cagcaagccc aggaagagga gctggggatc agcgcaccct gtacccctgc ccagtgcccc
      241 tgagccagtt gtcaagaaga gcagcagcag cagctcctgt ctcctgcagg acagcagcag
      301 cccctgctca ctccacgagc acggtggcag cagcagcagc gagcgcacca ccagagggac
      361 ggatgctcat tcaggacatc ccttccatcc ccagcagagg gcacttggag agcacgtctg
      421 atttggttgt ggactccacc tactacagca gtttttacca gcatccctgt atccttacta
      481 taacaacctg tacaactact cccagtacca aatggcagtg gccactgagt cttcctcaag
      541 tgagacaggg ggtacgtttg tagggtcagc catgaaaaac agccttcgaa gcctcccagc
      601 aacatacatg tcaagccagt caggaaaaca gtggcagatg aaagggatgg agaaccgcca
      661 tgccatgagc tcccagtacc ggatgtgctc ctactacccg cccacctcat acctgggcca
      721 gggggttggc agtcccacct gcgtcacaca gatactggcc tcggaggaca ccccctccta
      781 ctcagagtcg aaagcgagag tgttttcgcc gcccagcagc caggactcgg gcctggggtg
      841 cctgtcgagc agcgagagca ccaagggaga cctggagtgc gagccccacc aagagcccgg
      901 cgccttcgcg gtgagcccgg ttcttgaggg cgagtaggcg cggcgtcggg cggctgctgc
      961 gcggcgttca ctgttgcctt gttctgttgg gggttgcggg ggggcgttgg gtttcttctt
     1021 tccggggcgg ggggggcacg gcggggccgc ggccgggccg gcggggcggg gcggggcggg
     1081 acggggcggg gcggagccgc gcgggggccg cagtccgggc cggggccgcc gtcgggtctc
     1141 ggcccgctcc cgtcggggcg gagcgtccga cgatcggcct ccacgaaacg cggtgccgtg
     1201 atgtgtttgt agtggttcct cgtaggctcc agacgttttc tcctcgtatc gccaaattaa
     1261 cgcgttttgc atattacagt tgagtgcctc gacttagatt gcaatataag cggccagcaa
     1321 acaagtctca aaaaaaagtt acgtgcgttt ctgcgagtgt tattttgtta agaacggctc
     1381 acagtgtcct cttcctgtgt tacagaagcc aacctgaaat gaaactagtc tggaaaaatt
     1441 cattgttctc tgtagttgca gctgtacctg aaataaaaat gttattgatg actgaaaaaa
     1501 aaaaaaaaaa aa
//
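
For many analyses only the sequence part of such a record is needed. The following Python sketch pulls the accession, the definition and the sequence out of a single record and writes them as FASTA. It is a simplification, not a complete GenBank reader: it assumes one record per input, reads only the first line of the DEFINITION field, and ignores the features. The function name genbank_to_fasta is illustrative.

```python
def genbank_to_fasta(record_text):
    """Convert one GenBank flat file record to a FASTA-formatted string."""
    accession = ""
    definition = ""
    chunks = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # after ORIGIN: drop the position numbers and blanks, keep letters
            chunks.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ACCESSION"):
            parts = line.split()
            accession = parts[1] if len(parts) > 1 else ""
        elif line.startswith("DEFINITION"):
            # assumption: only the first line of the definition is kept
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
    return ">{} {}\n{}\n".format(accession, definition, "".join(chunks))
```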

The following formats are used for aligned sequences.

One commonly used format is PHYLIP format (which comes from the PHYLIP package):

   6   13
Archaeopt CGATGCTTAC CGC
HesperorniCGTTACTCGT TGT
BaluchitheTAATGTTAAT TGT
B. virginiTAATGTTCGT TGT
BrontosaurCAAAACCCAT CAT
B.subtilisGGCAGCCAAT CAC

The first line of the input file contains the number of species and the number of characters, in free format, separated by blanks (not by commas). The information for each species follows, starting with a ten-character species name (which can include punctuation marks and blanks), and continuing with the characters for that species.
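
The layout just described, with a fixed ten-character name followed by the characters, can be read as in the Python sketch below. It covers only the simple case in which every sequence fits on a single line, and it removes the blanks that split the characters into groups of ten. The function name parse_phylip_simple is illustrative.

```python
def parse_phylip_simple(text):
    """Read a PHYLIP file in which each sequence occupies exactly one line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # first line: number of species and number of characters, blank-separated
    n_species, n_chars = (int(x) for x in lines[0].split()[:2])
    if len(lines) - 1 < n_species:
        raise ValueError("fewer sequence lines than announced species")
    alignment = {}
    for line in lines[1:1 + n_species]:
        name = line[:10].strip()         # the name is the first ten characters
        seq = line[10:].replace(" ", "")  # blanks inside the data are ignored
        if len(seq) != n_chars:
            raise ValueError("wrong number of characters for " + name)
        alignment[name] = seq
    return alignment
```

Note that the name is cut by position, not at the first blank, so a line such as "HesperorniCGTTACTCGT TGT" is split correctly into the name "Hesperorni" and the sequence.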

The sequences can continue over multiple lines; as a consequence, there are two flavors of the format: interleaved and sequential. In sequential format all of one sequence is given, possibly on multiple lines, before the next starts. In interleaved format the first part of the file should contain the first part of each of the sequences, then possibly a line containing nothing but a carriage-return character, then the second part of each sequence, and so on. Only the first parts of the sequences should be preceded by names. Here is a hypothetical example of interleaved format:

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA

while in sequential format the same sequences would be:

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT
ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT
AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT
AAACCATTGC CGGTACGCTT AA
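
The interleaved flavor needs slightly more bookkeeping, because the pieces of each sequence are scattered over several blocks. The Python sketch below assumes that every block lists the species in the same order and that only the first block carries names, as in the example above. The function name parse_phylip_interleaved is illustrative.

```python
def parse_phylip_interleaved(text):
    """Read interleaved PHYLIP text into a list of (name, sequence) pairs."""
    lines = text.splitlines()
    n_species, n_chars = (int(x) for x in lines[0].split()[:2])
    # blank lines between blocks carry no information
    body = [ln for ln in lines[1:] if ln.strip()]
    if not body or len(body) % n_species != 0:
        raise ValueError("number of data lines is not a multiple of the species count")
    names = []
    parts = []
    for i, line in enumerate(body):
        if i < n_species:
            # first block: ten-character name, then the first part of the data
            names.append(line[:10].strip())
            parts.append([line[10:].replace(" ", "")])
        else:
            # later blocks: data only, in the same species order
            parts[i % n_species].append(line.replace(" ", ""))
    sequences = ["".join(p) for p in parts]
    for name, seq in zip(names, sequences):
        if len(seq) != n_chars:
            raise ValueError("wrong number of characters for " + name)
    return list(zip(names, sequences))
```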

CLUSTAL (.aln) format originated in the alignment program CLUSTAL. The file starts with the word CLUSTAL. The alignment is written in blocks of 60 residues. In every block, each line starts with the sequence name, and the line under each block marks with an asterisk the columns that are identical in all sequences. An example of an alignment in CLUSTAL format is shown below:

CLUSTAL X (1.8) multiple sequence alignment


R.sodomens      ---------------------------------CAACCUGA-GAGUU-U-GA-U-CCU-G
R.rubrum4       --------------------------------UUCCCUGAA-GAGUU-U-GA-U---U-G
Ag.tumefac      -------------------------------CUCAACUUGA-GAGUU-U-GA-U-CCU-G
Ag.rhizog2      --------------------------------UUCCCUGAA-GAGUU-U-GA-U-CCU-G
Rhb.legum4      --------------------------------UUCCCUGAA-GAGUU-U-GA-U-CCU-G
Rhb.legum6      --------------------------------UUCCCUGAA-GAGUU-U-GA-U-CCU-G
Bdr.japon8      --------------------------------UUCCCUGAA-GAGUU-U-GA-U---U-G
                                                    *   * ***** * ** *   * *

R.sodomens      GCUC-A-G-AAC-GAAC-GC--U-GGC-GGC-A-GG-C-CU--AACACA-UGCAA---G-
R.rubrum4       GCUC-A-G-GAC-GAAC-GC--U-GGC-GGC-A-GG-C-CU--AACACA-UGCAA---G-
Ag.tumefac      GCUC-A-G-AAC-GAAC-GC--U-GGC-GGC-A-GG-C-UU--AACACA-UGCAA---G-
Ag.rhizog2      GCUC-A-G-AAC-GAAC-GC--U-GGC-GGC-A-GG-C-UU--AACACA-UGCAA---G-
Rhb.legum4      GCUC-A-G-AAC-GAAC-GC--U-GGC-GGC-A-GG-C-UU--AACACA-UGCAA---G-
Rhb.legum6      GCUC-A-G-AAC-GAAC-GC--U-GGC-GGC-A-GG-C-UU--AACACA-UGCAA---G-
Bdr.japon8      GCUC-A-G-AGC-GAAC-GC--U-GGC-GGC-A-GG-C-UU--AACACA-UGCAA---G-
                **** * *   * **** **  * *** *** * ** *  *  ****** *****   *
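
A CLUSTAL file can be read block by block, joining the pieces that belong to the same name. The Python sketch below assumes that sequence names contain no blanks and that the marker line under each block starts with a blank, as in the example above. The function name parse_clustal is illustrative.

```python
def parse_clustal(text):
    """Read CLUSTAL-formatted text into a dict mapping name to aligned sequence."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("the file does not start with the word CLUSTAL")
    sequences = {}
    for line in lines[1:]:
        # skip empty lines and the marker line, which begins with a blank
        if not line.strip() or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue  # a name without data; nothing to add
        name, chunk = parts[0], parts[1]
        # append this block's piece to what was collected so far
        sequences[name] = sequences.get(name, "") + chunk
    return sequences
```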

Standalone programs are available that convert between different formats. One such program is ReadSeq, which is also available as an online version here.
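
To illustrate what such a conversion involves, the Python sketch below turns aligned sequences in FASTA format into sequential PHYLIP format. It is not how ReadSeq works internally; it is only a minimal example, and it assumes that the sequences are already aligned (all of equal length). Names are cut or padded to ten characters. The function name fasta_to_phylip is illustrative.

```python
def fasta_to_phylip(fasta_text):
    """Convert aligned FASTA text to sequential PHYLIP text."""
    records = []  # list of [name, sequence] pairs
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif line and records:
            records[-1][1] += line
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned (all of equal length)")
    # first line: number of species and number of characters
    out = [" {} {}".format(len(records), lengths.pop())]
    for name, seq in records:
        # the name is padded or truncated to exactly ten characters
        out.append("{:<10.10}{}".format(name, seq))
    return "\n".join(out) + "\n"
```

Because the names are truncated, two long definition lines that share their first ten characters would end up with the same PHYLIP name, which is another reason to put the species name first.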

References:

Felsenstein, J. (1993) Phylogeny Inference Package Manual, version 3.5c. Distributed by the author. Dept. of Genetics, Univ. of Washington, Seattle.