ASTRAL RAF Sequence Maps
The Protein Data Bank now provides XML files that include a mapping between the PDB-format records SEQRES (representing the sequence of the molecule used in an experiment) and ATOM (representing the atoms experimentally observed). These XML files (along with the chemical dictionary) also provide information on the original identity of most residues prior to any post-translational modifications.
Starting with Astral 1.73, Astral RAF sequence maps are generated from the XML files. The RAF maps summarize the SEQRES-to-ATOM relationship in a form that can be rapidly parsed in most computer languages. Errors in the mappings are corrected manually, with human interpretation of the original PDB file serving as the final arbiter in case of difficulties or discrepancies in machine translation.
Since Astral 1.75, manual edits have not been required, as XML2RAF was able to generate sequences for all chains using the "parent residue" records in the PDB's chemical dictionary, as well as other records in the PDBML (XML) files. XML2RAF is free software, and is available from the Downloads > Parseable Files & Software page.
To download the RAF sequence maps used to generate Astral 2.01, click here (185 MB).
Description of Format
The RAF file contains one line per PDB chain. Each line contains two parts: the header and the body.
Here is an example of the header format:
101m_ 0.02 38 010301 111011    0  153 
^1    ^2   ^3 ^4     ^5     ^6   ^7   ^8
1. PDB+chain ID. A '_' for the chain ID indicates a blank chain ID. The chain ID is case sensitive. Most chains currently in the PDB have an upper case chain ID.
2. Version number of the RAF format, 0.02. See below for a description of changes in the format from 0.01 to 0.02.
3. Header length (i.e. the body starts in position 39, counting from 1). The header length will always be constant for every entry from a given version of RAF; however, this length may change in future versions.
4. PDB datestamp (last modification time of the PDB file).
5. Set of 1-bit flags (if set: 1->mapped, 2->active, 3->checked, 4->manually edited, 5->ok, 6->one-to-one mapping). NOTE: one-to-one mapping only means that there is a one-to-one mapping between SEQRES and ATOM sequences. It does not mean the sequences are the same, only that all the residues are seen.
6. First non-blank residue identifier (PDB format, 4ch+1 for the insertion code).
7. Last non-blank residue identifier (PDB format, 4ch+1 for the insertion code).
8. Body starts here.
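The header fields listed above can be read by slicing fixed columns. The following is a minimal Python sketch, not the XML2RAF implementation: the names `RafHeader`, `parse_raf_header` and `FLAG_NAMES` are hypothetical, the column offsets are inferred from the 38-character header length and the single example line, and treating the leftmost flag character as flag 1 is an assumption.

```python
from typing import Dict, NamedTuple

# Flag names in the order listed in the text; treating the leftmost
# character as flag 1 is an assumption.
FLAG_NAMES = ("mapped", "active", "checked", "manually_edited", "ok", "one_to_one")


class RafHeader(NamedTuple):
    pdb_id: str
    chain_id: str
    version: str
    header_len: int
    datestamp: str
    flags: Dict[str, bool]
    first_residue: str
    last_residue: str


def parse_raf_header(line: str) -> RafHeader:
    """Slice the fixed-width header of one RAF line (offsets are assumed)."""
    return RafHeader(
        pdb_id=line[0:4],
        chain_id=line[4:5],                 # '_' means a blank chain ID
        version=line[6:10],
        header_len=int(line[11:13]),
        datestamp=line[14:20],              # kept as a string, not interpreted
        flags={name: bit == "1" for name, bit in zip(FLAG_NAMES, line[21:27])},
        first_residue=line[28:33].strip(),  # 4 characters + insertion code
        last_residue=line[33:38].strip(),
    )


# Header from the example, padded back out to 38 characters.
sample_header = "101m_ 0.02 38 010301 111011 " + "   0 " + " 153 "
header = parse_raf_header(sample_header)
body_start = header.header_len  # 0-based index of the first body field
print(header)
```

Because the header length may change in future versions of the format, reading `header_len` from the line itself is probably safer than hard-coding 38 when locating the body.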
The body contains one field per residue in the protein. Each field is of fixed length, 7 characters. Here is an example containing 6 residues:
   B .a   1 rr   M .i   3Acc   5 de   6 t.
----    residue identifier (B|M|E if missing, 4 ch)
    _   insertion code (' ' if missing)
     -  aa one-letter code from ATOM ('.' if missing)
      - aa one-letter code from SEQRES ('.' if missing)
The meaning of the characters in each field is as follows:
- First 4 characters - residue identifier. These are normally derived from ATOM records, except in the case of bibliographic entries. Warning: these identifiers do not always monotonically increase. Note that if ATOM records for a residue are not present because the residue is not observed in the structure, residue identifiers can be found in the corresponding SEQRES entries in the XML/mmCIF PDB files (they do not appear in PDB files formatted in the older PDB format). For bibliographic entries, no PDB entry is available, so consecutive residue identifiers beginning with '1' are assigned.
- 5th character - insertion code, or ' ' if missing.
- 6th character - amino acid one-letter code from ATOM records, or '.' if missing.
- 7th character - amino acid one-letter code from SEQRES records, or '.' if missing.
In the above example, the protein sequence from the SEQRES records is ALA ARG ILE CYS GLU, and the protein sequence listed in the ATOM records is ARG 1, CYS 3, ASP 5, THR 6. The CYS has insertion code 'A'. There are no ATOM records corresponding to the SEQRES records for the ALA or the ILE, so the residue identifier is replaced by a B (in the case of ALA at the beginning of the chain) or an M (in the case of ILE, which comes after the first identified residue). ASP 5 is mysteriously mutated to a GLU in the SEQRES records, and THR 6 is missing from the SEQRES records.
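The walk-through above can be checked mechanically by splitting the body into 7-character fields and collecting the two sequences. This is a hedged Python sketch: `Residue` and `parse_body` are hypothetical names, and the sample body is rebuilt from the prose description (six padded fields, THR 6 with a blank insertion code), not copied from a real RAF file.

```python
from typing import List, NamedTuple, Optional

FIELD_LEN = 7  # each body field is 7 characters


class Residue(NamedTuple):
    res_id: str            # residue identifier, or B/M/E when not observed
    insertion_code: str    # empty string when blank
    atom: Optional[str]    # one-letter code from ATOM, None if '.'
    seqres: Optional[str]  # one-letter code from SEQRES, None if '.'


def parse_body(body: str) -> List[Residue]:
    """Split a RAF body into one Residue per 7-character field."""
    if len(body) % FIELD_LEN != 0:
        raise ValueError("body length is not a multiple of 7")
    residues = []
    for start in range(0, len(body), FIELD_LEN):
        field = body[start:start + FIELD_LEN]
        residues.append(Residue(
            res_id=field[0:4].strip(),
            insertion_code=field[4].strip(),
            atom=None if field[5] == "." else field[5],
            seqres=None if field[6] == "." else field[6],
        ))
    return residues


# The six residues of the example, written as padded 7-character fields.
sample_body = "".join(["   B .a", "   1 rr", "   M .i", "   3Acc", "   5 de", "   6 t."])
residues = parse_body(sample_body)

seqres_seq = "".join(r.seqres for r in residues if r.seqres)   # "arice"
atom_seq = "".join(r.atom for r in residues if r.atom)         # "rcdt"
unobserved = [r.seqres for r in residues if r.atom is None]    # ["a", "i"]
mismatched = [r.res_id for r in residues
              if r.atom and r.seqres and r.atom != r.seqres]   # ["5"]
print(seqres_seq, atom_seq, unobserved, mismatched)
```

When reading from a file, only the trailing newline should be removed before calling `parse_body`; stripping spaces could truncate a field, since blanks are significant in this fixed-width layout.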
Changes to the format from 0.01 to 0.02
- The PDB datestamp field now reflects the modification time of the PDB file, rather than the date the file was obtained from the PDB.
- Some bibliographic PDB entries (entries with sequences but not coordinates) have been added to the RAF. The PDB code for these domains begins with '0'. Because residue identifiers of residues in the RAF file are normally based on the ATOM records, and the bibliographic PDB entries have no ATOM records, each residue in the RAF entries for bibliographic chains has been numbered starting at 1.
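Since bibliographic entries are marked only by a leading '0' in the PDB code, a reader may want to separate them before relying on residue identifiers. A small Python sketch follows; the helper names are hypothetical, and the code `0xyz` is an invented placeholder, not a real entry.

```python
from typing import Sequence


def is_bibliographic(raf_line: str) -> bool:
    """True when the PDB code at the start of the line begins with '0'."""
    return raf_line[:1] == "0"


def numbered_from_one(res_ids: Sequence[str]) -> bool:
    """Check that residue identifiers run 1, 2, 3, ... with no gaps."""
    try:
        numbers = [int(r) for r in res_ids]
    except ValueError:  # e.g. B/M/E placeholders are not plain numbers
        return False
    return numbers == list(range(1, len(numbers) + 1))


print(is_bibliographic("101m_ 0.02 38"))      # regular PDB entry
print(is_bibliographic("0xyz_ 0.02 38"))      # invented bibliographic code
print(numbered_from_one(["1", "2", "3"]))
```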
Older RAF maps for stable releases