Pdb-hla-docs/Resnums

= Residue Numbers in PDB: Design/Testing Notes =

These are automated tests that explore residue numbering in various PDB formats and document our design.

Contents


 * Residue Numbers in PDB: Design/Testing Notes
 * The PDB coordinate system: 1JWS coordinates from Jmol
 * The Sequential Coordinate System: PDB XML for 1JWS
 * Residue Level Nomenclature Mapping
 * Reporting Sequential Coordinates with Jmol PDB coordinates
 * Insertion Codes
 * @@ISSUE: Impossibly High Residue Numbers

The PDB coordinate system: 1JWS coordinates from Jmol
In PDB ATOM records, the sixth column is the residue number; it typically starts at 1 (for each chain), but for the C chain in pdb1jws.ent.gz, it starts at 306:

ATOM  3015  N   PRO C 306     -17.332  37.136  20.939  1.00 51.85           N ATOM   3016  CA  PRO C 306     -17.492  38.525  20.447  1.00 51.22           C ATOM   3017  C   PRO C 306     -18.233  38.517  19.119  1.00 48.08           C ATOM   3018  O   PRO C 306     -17.896  37.740  18.221  1.00 49.73           O ATOM   3019  CB  PRO C 306     -16.092  39.099  20.300  1.00 52.24           C ATOM   3020  CG  PRO C 306     -15.354  38.324  21.380  1.00 52.61           C ATOM   3021  CD  PRO C 306     -15.919  36.891  21.275  1.00 52.64           C ATOM   3022  N   LYS C 307     -19.231  39.392  18.996  1.00 39.54           N ATOM   3023  CA  LYS C 307     -20.038  39.444  17.786  1.00 35.12           C ATOM   3024  C   LYS C 307     -20.046  40.767  17.012  1.00 33.31           C

Coordinates returned from Jmol use this coordinate system. To look at the Jmol output for 1JWS, first we have the Makefile tell us where the cache and work directories (in the sense of the Neurocommons Package Makefile conventions) are:

>>> pdbid = '1JWS' >>> import os # see http://docs.python.org/library/os.html >>> cache = os.popen('make print-cache').readline.strip >>> work = os.popen('make print-work').readline.strip

Then we look in the file of contacts for this PDB structure:

>>> from os import path >>> from pprint import pprint >>> contsv = path.join(work, pdbid.lower+".tsv") >>> pprint(list([line.strip.split("\t") ...             for line in open(contsv)])[:12]) 'pdbid', 'chain1', 'res1', 'n1', 'chain2', 'res2', 'n2'], ['1JWS', 'C', 'PRO', '306', 'A', 'PHE', '51'],  ['1JWS', 'C', 'PRO', '306', 'A', 'ALA', '52'],  ['1JWS', 'C', 'PRO', '306', 'A', 'SER', '53'],  ['1JWS', 'C', 'PRO', '306', 'B', 'VAL', '85'],  ['1JWS', 'C', 'LYS', '307', 'A', 'SER', '53'],  ['1JWS', 'C', 'LYS', '307', 'A', 'PHE', '54'],  ['1JWS', 'C', 'LYS', '307', 'A', 'GLU', '55'],  ['1JWS', 'C', 'LYS', '307', 'B', 'HIS', '81'],  ['1JWS', 'C', 'LYS', '307', 'B', 'ASN', '82'],  ['1JWS', 'C', 'LYS', '307', 'B', 'VAL', '85'],  ['1JWS', 'C', 'TYR', '308', 'A', 'PHE', '24'

The Sequential Coordinate System: PDB XML for 1JWS
Now let’s look at our cached copy of 1jws.xml.gz, starting with the root element, using lxml

>>> import gzip >>> from lxml import etree >>> xmlgzfn = path.join(cache, "hla-xml", pdbid.lower+".xml.gz") >>> pdbx = etree.parse(gzip.open(xmlgzfn)) >>> pdbx.getroot.tag '{http://pdbml.pdb.org/schema/pdbx-v32.xsd}datablock'

The pdbparts.Parts class uses the same namespace name:

>>> import pdbparts >>> NS = pdbparts.Parts.NS >>> NS['PDBx'] 'http://pdbml.pdb.org/schema/pdbx-v32.xsd'

Residues are enumerated in entity_poly_seq elements:

>>> reselts = pdbx.xpath("./PDBx:entity_poly_seqCategory" ...                     "/PDBx:entity_poly_seq", namespaces=NS) >>> len(reselts) 624

The entity_poly_seq elements are indexed by “entity” numbers rather than chain ids; while it’s cheating a bit to skip ahead to the pdbparts.Parts class, let alone peek inside, let’s do it anyway; once we enumerate the polymers (i.e. chains), we can peek at the mapping from entities to chains:

>>> parts = pdbparts.Parts(pdbx, contsv) >>> pprint([(chain, name, type_) ...        for chain, seq, u, x, y, name, type_ in parts.polymers]) [('A',  'HLA class II histocompatibility antigen, DR alpha chain',   'polypeptide(L)'), ('B',  'HLA class II histocompatibility antigen, DR-1 beta chain',   'polypeptide(L)'), ('C', 'HA peptide', 'polypeptide(L)'), ('D', 'Enterotoxin type C-3', 'polypeptide(L)')] >>> parts._smap {'1': ['A'], '3': ['C'], '2': ['B'], '4': ['D']}

Now let’s pick out just the elements for chain C using entity_id 3:

>>> reselts = pdbx.xpath("./PDBx:entity_poly_seqCategory" ...                     "/PDBx:entity_poly_seq[@entity_id='3']", ...                      namespaces=NS) >>> len(reselts) 13 >>> [e.attrib['num'] for e in reselts] ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13']

Those are the sequential coordinates, which are reported by the residues method:

>>> chain_c = [(nnn, num, chain) ...           for nnn, num, num2, chain in parts.residues ...           if chain == 'C'] >>> pprint(chain_c) [('PRO', 1, 'C'), ('LYS', 2, 'C'), ('TYR', 3, 'C'), ('VAL', 4, 'C'), ('LYS', 5, 'C'), ('GLN', 6, 'C'), ('ASN', 7, 'C'), ('THR', 8, 'C'), ('LEU', 9, 'C'), ('LYS', 10, 'C'), ('LEU', 11, 'C'), ('ALA', 12, 'C'), ('THR', 13, 'C')]

Residue Level Nomenclature Mapping
The docs for pdbx_poly_seq_scheme say, “The PDBX_POLY_SEQ_SCHEME category provides residue level nomenclature mapping for polymer entities.” The seq_id docs say “Pointer to _entity_poly_seq.num”. Together, this seems to allow us to map from 306 back to 1:

>>> mapelts = pdbx.xpath("./PDBx:pdbx_poly_seq_schemeCategory" ...                     "/PDBx:pdbx_poly_seq_scheme[" ...                      "PDBx:pdb_strand_id='C']", ...                      namespaces=NS) >>> pprint([(e.attrib['seq_id'], ...         e.xpath("PDBx:pdb_seq_num", namespaces=NS)[0].text, ...         e.xpath("PDBx:pdb_ins_code", namespaces=NS)[0].text) ...         for e in mapelts]) [('1', '306', None), ('2', '307', None), ('3', '308', None), ('4', '309', None), ('5', '310', None), ('6', '311', None), ('7', '312', None), ('8', '313', None), ('9', '314', None), ('10', '315', None), ('11', '316', None), ('12', '317', None), ('13', '318', None)]

Keep in mind (as I forgot to, leading to an earlier bug) that there are data on other chains here too:

>>> elts = pdbx.xpath("./PDBx:pdbx_poly_seq_schemeCategory" ...                  "/PDBx:pdbx_poly_seq_scheme", ...                   namespaces = NS) >>> pprint(set([pdbparts.field(e, 'pdb_strand_id') for e in elts])) set(['A', 'B', 'C', 'D'])

The insmap function captures the mapping:

>>> mapx = pdbparts.insmap(pdbx) >>> mapx[('A', '1')] 1 >>> mapx[('B', '190')] 190 >>> mapx[('C', '306')] 1 >>> mapx[('D', '1')] 1

Reporting Sequential Coordinates with Jmol PDB coordinates
OK, let’s use that data in the contacts API:

>>> pprint(list(parts.contacts)[:12]) [('PRO', 1, '306', 'C', 'PHE', 51, '51', 'A'), ('PRO', 1, '306', 'C', 'ALA', 52, '52', 'A'), ('PRO', 1, '306', 'C', 'SER', 53, '53', 'A'), ('PRO', 1, '306', 'C', 'VAL', 85, '85', 'B'), ('LYS', 2, '307', 'C', 'SER', 53, '53', 'A'), ('LYS', 2, '307', 'C', 'PHE', 54, '54', 'A'), ('LYS', 2, '307', 'C', 'GLU', 55, '55', 'A'), ('LYS', 2, '307', 'C', 'HIS', 81, '81', 'B'), ('LYS', 2, '307', 'C', 'ASN', 82, '82', 'B'), ('LYS', 2, '307', 'C', 'VAL', 85, '85', 'B'), ('TYR', 3, '308', 'C', 'PHE', 24, '24', 'A'), ('TYR', 3, '308', 'C', 'ILE', 31, '31', 'A')]

Insertion Codes
Not only do PDB numbers not always start from 1, they aren’t always integers at all. For example, 52^A in 1W72:

>>> pdbid = '1W72' >>> contsv = path.join(work, pdbid.lower+".tsv") >>> pprint(list([line.strip.split("\t") ...             for line in open(contsv)])[20:30]) '1W72', 'F', 'ASP', '3', 'D', 'TYR', '99'], ['1W72', 'F', 'ASP', '3', 'D', 'ARG', '114'],  ['1W72', 'F', 'ASP', '3', 'D', 'ARG', '156'],  ['1W72', 'F', 'ASP', '3', 'D', 'TYR', '159'],  ['1W72', 'F', 'PRO', '4', 'D', 'ASN', '66'],  ['1W72', 'F', 'PRO', '4', 'D', 'TYR', '159'],  ['1W72', 'F', 'PRO', '4', 'I', 'TRP', '52^A'],  ['1W72', 'F', 'PRO', '4', 'I', 'ASN', '53'],  ['1W72', 'F', 'THR', '5', 'D', 'ARG', '114'],  ['1W72', 'F', 'THR', '5', 'D', 'GLN', '155'

What are they? The PDB format docs say only:

"“The insertion code is commonly used in sequence numbering.”"

And the data dictionary used for the XML schema is no better:

"“PDB insertion code.”"

But Frances C. Bernstein explained Mar 2 2004 in what is insertion code.

Let’s test the handling of insertion codes:

>>> xmlgzfn = path.join(cache, "hla-xml", pdbid.lower+".xml.gz") >>> pdbx = etree.parse(gzip.open(xmlgzfn)) >>> parts = pdbparts.Parts(pdbx, contsv) >>> pprint([row for row in parts.contacts ...        if int(row[1]) in (4, 6, 7, 8)]) [('PRO', 4, '4', 'F', 'ASN', 66, '66', 'D'), ('PRO', 4, '4', 'F', 'TYR', 159, '159', 'D'), ('PRO', 4, '4', 'F', 'TRP', 53, '52^A', 'I'), ('PRO', 4, '4', 'F', 'ASN', 54, '53', 'I'), ('GLY', 6, '6', 'F', 'THR', 73, '73', 'D'), ('GLY', 6, '6', 'F', 'ARG', 156, '156', 'D'), ('GLY', 6, '6', 'F', 'SER', 52, '52', 'I'), ('GLY', 6, '6', 'F', 'GLY', 56, '55', 'I'), ('GLY', 6, '6', 'F', 'SER', 57, '56', 'I'), ('GLY', 6, '6', 'F', 'HIS', 103, '99', 'I'), ('GLY', 6, '6', 'F', 'TYR', 105, '100^A', 'I'), ('HIS', 7, '7', 'F', 'THR', 73, '73', 'D'), ('HIS', 7, '7', 'F', 'ASN', 77, '77', 'D'), ('HIS', 7, '7', 'F', 'TRP', 147, '147', 'D'), ('HIS', 7, '7', 'F', 'ALA', 152, '152', 'D'), ('HIS', 7, '7', 'F', 'GLN', 155, '155', 'D'), ('HIS', 7, '7', 'F', 'ARG', 156, '156', 'D'), ('HIS', 7, '7', 'F', 'GLY', 56, '55', 'I'), ('HIS', 7, '7', 'F', 'SER', 57, '56', 'I'), ('HIS', 7, '7', 'F', 'TYR', 105, '100^A', 'I'), ('SER', 8, '8', 'F', 'THR', 73, '73', 'D'), ('SER', 8, '8', 'F', 'ASN', 77, '77', 'D'), ('SER', 8, '8', 'F', 'THR', 80, '80', 'D'), ('SER', 8, '8', 'F', 'THR', 143, '143', 'D'), ('SER', 8, '8', 'F', 'LYS', 146, '146', 'D'), ('SER', 8, '8', 'F', 'TRP', 147, '147', 'D'), ('SER', 8, '8', 'F', 'TYR', 105, '100^A', 'I'), ('SER', 8, '8', 'F', 'TRP', 90, '91', 'M'), ('SER', 8, '8', 'F', 'SER', 92, '93', 'M')]

@@ISSUE: Impossibly High Residue Numbers
Cases such as 2P5E have impossibly high (909) residue numbers:

>>> pdbid = '2P5E' >>> contsv = path.join(work, pdbid.lower+".tsv") >>> pprint(list([line.strip.split("\t") ...             for line in open(contsv)][-15:])) '2P5E', 'C', 'CYS', '9', 'A', 'ASP', '77'], ['2P5E', 'C', 'CYS', '9', 'A', 'THR', '80'],  ['2P5E', 'C', 'CYS', '9', 'A', 'LEU', '81'],  ['2P5E', 'C', 'CYS', '9', 'A', 'TYR', '84'],  ['2P5E', 'C', 'CYS', '9', 'A', 'TYR', '116'],  ['2P5E', 'C', 'CYS', '9', 'A', 'TYR', '123'],  ['2P5E', 'C', 'CYS', '9', 'A', 'THR', '142'],  ['2P5E', 'C', 'CYS', '9', 'A', 'THR', '143'],  ['2P5E', 'C', 'CYS', '9', 'A', 'LYS', '146'],  ['2P5E', 'C', 'CYS', '9', 'A', 'TRP', '147'],  ['2P5E', 'C', 'IPA', '909', 'A', 'GLN', '155'],  ['2P5E', 'C', 'IPA', '909', 'A', 'LEU', '156'],  ['2P5E', 'C', 'IPA', '909', 'A', 'TYR', '159'],  ['2P5E', 'C', 'IPA', '909', 'D', 'TYR', '31'],  ['2P5E', 'C', 'IPA', '909', 'D', 'LEU', '95'

The normalized residue numbers are reported as None, which causes them to get skipped when writing out RDF:

>>> xmlgzfn = path.join(cache, "hla-xml", pdbid.lower+".xml.gz") >>> pdbx = etree.parse(gzip.open(xmlgzfn)) >>> parts = pdbparts.Parts(pdbx, contsv) >>> pprint([row for row in parts.contacts ...        if row[1] is None or row[5] is None]) [('SER', 1, '1', 'C', 'GOL', None, '902', 'A'), ('IPA', None, '909', 'C', 'GLN', 155, '155', 'A'), ('IPA', None, '909', 'C', 'LEU', 156, '156', 'A'), ('IPA', None, '909', 'C', 'TYR', 159, '159', 'A'), ('IPA', None, '909', 'C', 'TYR', 32, '31', 'D'), ('IPA', None, '909', 'C', 'LEU', 96, '95', 'D')]