ASTRAL RAF Sequence Maps

When Astral 1.55 was built, the Protein Data Bank provided CIF files produced by the pdb2cif program. These files include a mapping between the PDB-format SEQRES records (representing the sequence of the molecule used in an experiment) and ATOM records (representing the atoms experimentally observed).

Because the CIF files contain known errors, the Astral RAF sequence maps were generated directly from the PDB files.

The RAF maps summarize the SEQRES-to-ATOM relationship in a form that can be parsed rapidly in most programming languages. Errors in the mappings are corrected manually, with human interpretation of the original PDB file serving as the final arbiter in case of difficulties or discrepancies in machine translation.

To download the RAF sequence maps used to generate Astral 1.55, click here (39.2 MB)

Description of Format

The RAF file contains one line per PDB chain. Each line contains two parts: the header and the body.

Here is an example of the header format:

101m_ 0.01 38 010301 111011    0  153
^1    ^2   ^3 ^4     ^5     ^6   ^7   ^8

  1. PDB ID followed by the chain ID. A '_' for the chain ID indicates a blank chain ID. The chain ID is case sensitive. Most chains currently in the PDB have an upper case chain ID.
  2. version number of the RAF format, 0.01 for Astral 1.55.
  3. header length (38 here, i.e. the body starts in position 39, counting from 1). The header length is constant for every entry from a given version of RAF; however, this length may change in future versions.
  4. PDB datestamp (date file was obtained from the PDB)
  5. set of six 1-bit flags (if set: 1->mapped, 2->active, 3->checked, 4->manually edited, 5->ok, 6->one-to-one mapping).
    NOTE: the one-to-one mapping flag only means that there is a one-to-one mapping between the SEQRES and ATOM sequences. It does not mean the sequences are the same, only that all the residues are seen.
  6. first non-blank residue identifier (PDB format: 4 characters, plus 1 for the insertion code)
  7. last non-blank residue identifier (PDB format: 4 characters, plus 1 for the insertion code)
  8. body starts here
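
The header layout above can be sketched in Python as a fixed-width parser. This is only an illustration, not part of Astral: the names RafHeader, FLAG_NAMES and parse_raf_header are hypothetical, the column offsets are read off the marker line of the example (valid for RAF version 0.01 only), and the assumption that flag 1 is the leftmost bit is inferred from the numbering in item 5.

```python
from dataclasses import dataclass
from typing import Dict

# Flag names in the order given in item 5 (assumed: flag 1 is the leftmost bit).
FLAG_NAMES = ("mapped", "active", "checked", "manually_edited", "ok", "one_to_one")


@dataclass
class RafHeader:
    pdb_id: str
    chain_id: str          # "" when the file uses '_' for a blank chain ID
    version: str
    header_length: int
    datestamp: str
    flags: Dict[str, bool]
    first_residue: str
    last_residue: str


def parse_raf_header(line: str) -> RafHeader:
    """Parse the fixed-width header of one RAF line (offsets for version 0.01)."""
    chain = line[4]
    return RafHeader(
        pdb_id=line[0:4],
        chain_id="" if chain == "_" else chain,
        version=line[6:10],
        header_length=int(line[11:13]),
        datestamp=line[14:20],
        flags={name: bit == "1" for name, bit in zip(FLAG_NAMES, line[21:27])},
        first_residue=line[28:33].strip(),  # 4 characters + insertion code
        last_residue=line[33:38].strip(),   # 4 characters + insertion code
    )


# The example header, including the trailing space that pads it to 38 characters.
example = "101m_ 0.01 38 010301 111011    0  153 "
header = parse_raf_header(example)
print(header.pdb_id, repr(header.chain_id), header.first_residue, header.last_residue)
print(header.flags)
```

Reading the header length from the line itself, rather than hard-coding 38, lets a caller find the start of the body as line[header.header_length:] even if a later RAF version changes the length.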

The body contains one field per residue in the protein. Each field is of fixed length, 7 characters. Here is an example containing 6 residues:

   B .a   1 rr   M .i   3Acc   5 de   6 t.
---- residue identifier (B|M|E if missing, 4 ch)
    _ insertion code (' ' if missing)
     - aa one-letter code from ATOM ('.' if missing)
      - aa one-letter code from SEQRES ('.' if missing)

The meaning of the characters in each field is as follows:
  • First 4 characters - residue identifier. These are normally derived from ATOM records, except in the case of bibliographic entries. Warning: these identifiers do not always increase monotonically. If the ATOM records for a residue are absent because the residue is not observed in the structure, its residue identifier can be found in the corresponding SEQRES entries of the XML/mmCIF PDB files (these identifiers do not appear in files in the older PDB format). For bibliographic entries, no PDB entry is available, so consecutive residue identifiers beginning with '1' are assigned.
  • 5th character - insertion code, or ' ' if missing.
  • 6th character - amino acid one-letter code from ATOM records, or '.' if missing.
  • 7th character - amino acid one-letter code from SEQRES records, or '.' if missing.

In the above example, the protein sequence from the SEQRES records is ALA ARG ILE CYS GLU, and the protein sequence listed in the ATOM records is ARG 1, CYS 3, ASP 5, THR 6. The CYS has insertion code 'A'. There are no ATOM records corresponding to the SEQRES records for the ALA or the ILE, so the residue identifier is replaced by a B (in the case of ALA, at the beginning of the chain) or an M (in the case of ILE, which comes after the first identified residue). By the same convention, an E marks a residue with no ATOM records after the last identified residue. ASP 5 is mysteriously mutated to a GLU in the SEQRES records, and THR 6 is missing from the SEQRES records.
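
As a rough illustration of the field layout, the body can be cut into 7-character fields and the two sequences read back out. The sketch below is an assumption-laden example rather than Astral code: RafResidue and parse_raf_body are hypothetical names, and the body string is the six-residue example written with exactly 7 characters per field.

```python
from typing import List, NamedTuple

FIELD_WIDTH = 7  # fixed field length stated in the format description


class RafResidue(NamedTuple):
    resid: str   # residue identifier, or B/M/E when there is no ATOM record
    icode: str   # insertion code, "" when blank
    atom: str    # one-letter code from ATOM, "." if missing
    seqres: str  # one-letter code from SEQRES, "." if missing


def parse_raf_body(body: str) -> List[RafResidue]:
    """Split a RAF body into one RafResidue per 7-character field."""
    body = body.rstrip("\r\n")
    if len(body) % FIELD_WIDTH != 0:
        raise ValueError("body length is not a multiple of 7 characters")
    residues = []
    for start in range(0, len(body), FIELD_WIDTH):
        field = body[start:start + FIELD_WIDTH]
        residues.append(
            RafResidue(field[0:4].strip(), field[4].strip(), field[5], field[6])
        )
    return residues


body = "   B .a   1 rr   M .i   3Acc   5 de   6 t."
residues = parse_raf_body(body)

# Rebuild each sequence by skipping the '.' placeholders.
seqres_seq = "".join(r.seqres for r in residues if r.seqres != ".")
atom_seq = "".join(r.atom for r in residues if r.atom != ".")
print(seqres_seq)  # arice
print(atom_seq)    # rcdt
```

The two printed strings correspond to ALA ARG ILE CYS GLU and ARG CYS ASP THR in the worked example.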

Translation Table

Since Astral 1.55, chemically modified residues have been included in our translation table, which maps the 3-letter codes found in PDB files to the one-letter codes used in our sequences. The complete table is shown below, with each one-letter code to the right of the corresponding 3-letter code.
  ala a val v phe f pro p met m ile i leu l asp d glu e lys k
  arg r ser s thr t tyr y his h cys c asn n gln q trp w gly g
  2as d 3ah h 5hp e acl r aib a alm a alo t aly k arm r asa d
  asb d ask d asl d asq d aya a bcs c bhd d bmt t bnn a buc c
  bug l c5c c c6c c ccs c cea c chg a cle l cme c csd a cso c
  csp c css c csw c cxm m cy1 c cy3 c cyg c cym c cyq c dah f
  dal a dar r das d dcy c dgl e dgn q dha a dhi h dil i div v
  dle l dly k dnp a dpn f dpr p dsn s dsp d dth t dtr w dty y
  dva v efc c fla a fme m ggl e glz g gma e gsc g hac a har r
  hic h hip h hmr r hpq f htr w hyp p iil i iyr y kcx k llp k
  lly k ltr w lym k lyz k maa a men n mhs h mis s mle l mpq g
  msa g mse m mva v nem h nep h nle l nln l nlp l nmc g oas s
  ocs c omt m paq y pca e pec c phi f phl f pr3 c prr a ptr y
  sac s sar g sch c scs c scy c sel s sep s set s shc c shr k
  soc c sty y sva s tih a tpl w tpo t tpq a trg k tro w tyb y
  tyq y tys y tyy y agm r gl3 g smc c asx b cgu e csx c glx z
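
Because the table is a flat list of code pairs, it can be loaded into a lookup with very little code. The following Python sketch assumes the table text is available as a string; only four rows are copied here to keep the example short, and the function names and the 'x' fallback for codes absent from the table are illustrative assumptions, not something the table specifies.

```python
from typing import Dict

# Four rows copied from the table above (an excerpt, not the full table).
TABLE_EXCERPT = """
ala a val v phe f pro p met m ile i leu l asp d glu e lys k
arg r ser s thr t tyr y his h cys c asn n gln q trp w gly g
msa g mse m mva v nem h nep h nle l nln l nlp l nmc g oas s
tyq y tys y tyy y agm r gl3 g smc c asx b cgu e csx c glx z
"""


def build_translation_table(text: str) -> Dict[str, str]:
    """Pair up alternating 3-letter and one-letter tokens."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("table must contain an even number of tokens")
    return dict(zip(tokens[0::2], tokens[1::2]))


def to_one_letter(res_name: str, table: Dict[str, str], unknown: str = "x") -> str:
    """Translate a PDB residue name; 'unknown' is a hypothetical fallback."""
    return table.get(res_name.strip().lower(), unknown)


table = build_translation_table(TABLE_EXCERPT)
sequence = "".join(to_one_letter(n, table) for n in ["ALA", "ARG", "ILE", "CYS", "GLU"])
print(sequence)                     # arice
print(to_one_letter("MSE", table))  # m
```

Lower-casing the residue name before the lookup matches the lower-case codes in the table, since PDB files normally give residue names in upper case.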