The Protein Data Bank now provides XML files that include a mapping between the PDB-format records SEQRES (representing the sequence of the molecule used in an experiment) and ATOM (representing the atoms experimentally observed). These XML files (along with the chemical dictionary) also provide information on the original identity of most residues prior to any post-translational modifications.
Starting with Astral 1.73, Astral RAF sequence maps are generated from the XML files.
The RAF maps summarize the SEQRES—ATOM relationship in a form that can be rapidly parsed in most computer languages. Errors in the mappings are corrected manually, with human interpretation of the original PDB file serving as the final arbiter in case of difficulties or discrepancies in machine translation.
Since Astral 1.75, manual edits have not been required, as XML2RAF was able to successfully generate sequences for all chains using the "parent residue" records in the PDB's chemical dictionary, as well as other records in the PDBML (XML) files. XML2RAF is free software, and is available from the Downloads > Parseable Files & Software page.
To download the RAF sequence maps used to generate Astral 2.04, click here (267.9 MB)
To download the RAF sequence maps used to generate the Astral 2.04 update on 2014-12-18, click here (275.2 MB)
The RAF file contains one line per PDB chain. Each line contains two parts: the header and the body.
Here is an example of the header format:
101m_ 0.02 38 010301 111011 0 153
^1 ^2 ^3 ^4 ^5 ^6 ^7 ^8
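Because the file holds one line per chain, it can be indexed by the first header token. The following is a minimal sketch, not part of the Astral tools. It assumes that the first whitespace-delimited token (101m_ in the example) identifies the chain, and it treats the other header tokens as opaque strings. The function name index_raf_lines is hypothetical, and the sample line is the header example above with no body attached.

```python
def index_raf_lines(lines):
    """Return a dict mapping each line's first token to the whole line."""
    index = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        key = line.split(None, 1)[0]  # first whitespace-delimited token
        index[key] = line
    return index


# Header example from the text, with whitespace as shown there and no body.
sample_lines = ["101m_ 0.02 38 010301 111011 0 153\n", "\n"]
index = index_raf_lines(sample_lines)
header_tokens = index["101m_"].split()
print(header_tokens)
```

For a real RAF file, an open file handle can be passed in place of the list. The files are several hundred megabytes, so storing byte offsets rather than whole lines may be preferable.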
The body contains one field per residue in the protein. Each field is of fixed length, 7 characters. Here is an example containing 6 residues:
B .a 1 rr M .i 3Acc 5 de 6A t.
The meaning of the characters in each field is as follows:
---- residue identifier (B|M|E if missing, 4 ch)
    _ insertion code (' ' if missing)
     - aa one-letter code from ATOM ('.' if missing)
      - aa one-letter code from SEQRES ('.' if missing)
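The field layout described above can be sketched as a fixed-width parser. This is an illustration under stated assumptions rather than the XML2RAF implementation. The body string is reconstructed from the example by padding each residue identifier with leading spaces to 4 characters, and the sixth field is written as "   6 t." (THR 6 with no insertion code) to match the prose description of the example. The names Residue and parse_body are hypothetical.

```python
from typing import List, NamedTuple

FIELD_WIDTH = 7  # each body field is 7 characters, as stated in the text


class Residue(NamedTuple):
    resid: str   # residue identifier, or B/M/E if missing from ATOM
    icode: str   # insertion code, ' ' if missing
    atom: str    # one-letter code from ATOM, '.' if missing
    seqres: str  # one-letter code from SEQRES, '.' if missing


def parse_body(body: str) -> List[Residue]:
    """Split a RAF body into 7-character fields and unpack each one."""
    if len(body) % FIELD_WIDTH != 0:
        raise ValueError("body length is not a multiple of 7")
    residues = []
    for start in range(0, len(body), FIELD_WIDTH):
        field = body[start:start + FIELD_WIDTH]
        residues.append(
            Residue(field[0:4].strip(), field[4], field[5], field[6])
        )
    return residues


# Assumed reconstruction of the six-residue example (padding is a guess).
example_body = "   B .a   1 rr   M .i   3Acc   5 de   6 t."
for res in parse_body(example_body):
    print(res)
```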
In the above example, the protein sequence from the SEQRES records is ALA ARG ILE CYS GLU, and the protein sequence listed in the ATOM records is ARG 1, CYS 3, ASP 5, THR 6. The CYS has insertion code 'A'. There are no ATOM records corresponding to the SEQRES records for the ALA or the ILE, so the residue identifier is replaced by a B (in the case of ALA, at the beginning of the chain) or an M (in the case of ILE, which comes after the first identified residue). ASP 5 is mysteriously mutated to a GLU in the SEQRES records, and THR 6 is missing from the SEQRES records.
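The reading of the example given above can be checked with a short script that walks the fields and collects the two sequences. This is a sketch under assumptions. The body string pads each residue identifier to 4 characters and writes the sixth field as "   6 t.", following the description of THR 6 in the text. The function name summarize_body is hypothetical.

```python
def summarize_body(body, width=7):
    """Return (SEQRES sequence, ATOM residues, SEQRES/ATOM mismatches)."""
    seqres = []
    atom = []
    mismatches = []
    for start in range(0, len(body), width):
        field = body[start:start + width]
        resid = field[0:4].strip() + field[4].strip()  # identifier + insertion code
        atom_aa, seqres_aa = field[5], field[6]
        if seqres_aa != ".":
            seqres.append(seqres_aa.upper())
        if atom_aa != ".":
            atom.append((resid, atom_aa.upper()))
        if "." not in (atom_aa, seqres_aa) and atom_aa != seqres_aa:
            mismatches.append((resid, atom_aa.upper(), seqres_aa.upper()))
    return "".join(seqres), atom, mismatches


# Assumed reconstruction of the six-residue example (padding is a guess).
example_body = "   B .a   1 rr   M .i   3Acc   5 de   6 t."
seqres_seq, atom_residues, mismatches = summarize_body(example_body)
print(seqres_seq)     # ARICE
print(atom_residues)  # [('1', 'R'), ('3A', 'C'), ('5', 'D'), ('6', 'T')]
print(mismatches)     # [('5', 'D', 'E')]
```

The output corresponds to the description: SEQRES reads ALA ARG ILE CYS GLU, ATOM lists ARG 1, CYS 3A, ASP 5 and THR 6, and the only disagreement between the two is at residue 5.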