Pdb-hla-docs/Sifts

= Structural features from SIFTS (in progress) =

Our goal is to marshall SCOP/CATH/Pfam features as well as secondary structure. SIFTS seems to provide just the data we need:

The “Structure integration with function, taxonomy and sequence (SIFTS) initiative” aims to work towards the integration of various bioinformatics resources.

...

This information is vital for the reliable integration of the sequence family databases such as Pfam and Interpro with the structure-oriented databases of SCOP and CATH.

—SIFTS

Usage
To marshal data from sifts.xml.gz:

$ python sifts_parse.py sifts.xml.gz pdbml.xml.gz tpl.py rdfout.nt


 * sifts.xml.gz describes features of a PDB structure in eFamily format, compressed.
 * pdbml.xml.gz gives coordinate transformation information of that PDB structure in compressed PDBML format.
 * tpl.py is a python module with a run function that produces RDF from SIFTS data. The choice of how the SIFTS data are modelled in RDF is up to the template.
 * rdfout.nt is where the results will be written in N-Triples format.


 * class sifts_parse.Sifts ( tree, insmap=None )
 * Provide access to information from SIFTS/eFamily XML documents.{| frame="void" rules="none" frame="void" rules="none"

! Parameters:
 * tree – lxml etree node for sifts element
 * insmap – coordinate mapping a la pdbparts.insmap</tt>
 * insmap – coordinate mapping a la pdbparts.insmap</tt>


 * }
 * cath</tt>
 * Get SCOP features.{| frame="void" rules="none" frame="void" rules="none"

! Returns:
 * an iterator of (cathid, chain, start, end) tuples where chain, start, and end identify a range of residues and cathid is, e.g. ‘1ymmA01’
 * }
 * pdbid</tt>
 * Get PDB id (in upper case).


 * pfam</tt>
 * {| frame="void" rules="none" frame="void" rules="none"

! Returns:
 * iterator of tuples: (pfamid, chain, s0, e0, spdb, epdb) where s0, e0 are in sequential coordinates and spdb, epdb are in PDB coordinates.
 * }
 * scop</tt>
 * Get SCOP features.{| frame="void" rules="none" frame="void" rules="none"

! Returns:
 * an iterator of (scopid, chain, start, end) tuples where chain, start, and end identify a range of residues and scopid is, e.g. ‘123702’
 * }
 * secondary</tt>
 * Get secondary structure.{| frame="void" rules="none" frame="void" rules="none"

! Returns: an iterator of ((code, feature), chain, start, end) tuples where chain, start, and end identify a range of residues and (code, feature) give the type of feature as a 1 letter DSSP code and a longer feature name; e.g. (‘B’, ‘BULGE’).
 * }


 * sifts_parse.</tt>field</tt> ( e, n, ns='http://pdbml.pdb.org/schema/pdbx-v32.xsd' )
 * Get text content of a child of an element.We experimented with DOM APIs and lxml during development; this isolates one of the differences.


 * sifts_parse.</tt>frompdb</tt> ( chain, n, coordmap )
 * translate from PDB coordinates to sequential coordinatesSee resnums.doctest for discussion of sequential coordinates.prints a diagnostic when raising KeyError #@@ doctest


 * sifts_parse.</tt>main</tt> ( argv )
 * See Usage above.


 * sifts_parse.</tt>progress</tt> ( *s )
 * Report progress/diagnostics to stderr.


 * sifts_parse.</tt><tt>run</tt> ( sifts, streamkb )
 * Run a template to produce RDF from parts of PDB structures.Note this function is for documentation only; the client-provided template supplies the actual funtion.{| frame="void" rules="none" frame="void" rules="none"

! Parameters:
 * parts – a <tt>Sifts</tt> object that provides access to SIFTS data about a PDB structure.
 * streamkb – <tt>streamrdf.StreamKB</tt> that the template is to use to write RDF.
 * streamkb – <tt>streamrdf.StreamKB</tt> that the template is to use to write RDF.


 * }

sifts_tpl.py – streaming RDF template for PDB feature info
The run function is called with a Parts object (@@related to but different from the one documented in PDB template interface ).

Note

See bug 203 re grounding DSSP classes in URI space.


 * <tt>sifts_tpl.</tt><tt>feature_sym</tt> ( pdbid, chain, sort, key, start=None, end=None )
 * Make a URI/term for SCOP/CATH/Pfam/secondary a sequence feature. >>> feature_sym('1K5N', 'A', "CATH", '1k5nA01')

'<http://purl.org/science/pdb/SF/CATH/1K5N/A/1k5nA01>' >>> feature_sym('1K5N', 'B', 'secondary', 'T', 97, 98) '<http://purl.org/science/pdb/SF/secondary/1K5N/B/T,97-98>'


 * <tt>sifts_tpl.</tt><tt>run</tt> ( sifts, out )
 * See <tt>sifts_parse.run</tt>.

Unit tests
>>> import sifts_tpl, sifts_parse >>> import doctest >>> quiet = doctest.testmod(sifts_tpl)

Ew... crummy test coverage here:

>>> quiet = doctest.testmod(sifts_parse)

SIFTS data in eFamily format: 1K5N
SCOP/CATH/Pfam features as well as secondary structure are shown, for example, on the 1k5n page.

SIFTS provides the corresponding data in a 1k5n xml file.

Let’s put a copy of that file in the cache and parse it using lxml:

>>> pdbid = '1K5N' >>> from os import path, popen >>> dummy = popen('make prepare').read # never mind output >>> dummy = popen('make PDB_FILES=%s cache' % pdbid.lower).read >>> cache = popen('make print-cache').readline.strip >>> import gzip >>> xmlgzfn = path.join(cache, "hla-sifts", pdbid.lower+".xml.gz") >>> from lxml import etree >>> xml = etree.parse(gzip.open(xmlgzfn)).getroot

Next, let’s check that the root namespace is as expected:

>>> xml.tag '{http://www.efamily.org.uk/xml/efamily/2004/08/14/eFamily.xsd}entry' >>> from sifts_parse import Sifts >>> xml.tag == '{%s}entry' % Sifts.NS['e'] True

This is the namespace documented in the eFamily schema documentation, which goes along with a high level diagram.

A SCOP feature: 1k5nA01
We can see CATH feature 1k5nA01 represented as a <tt>crossRefDb</tt> element under each of 181 <tt>residue</tt> elements:

<crossRefDb dbSource="CATH" dbVersion="3.2" dbCoordSys="PDBresnum" dbAccessionId="1k5nA01" dbResNum="1" dbResName="GLY" dbChainId="A"> </crossRefDb> ... <crossRefDb dbSource="CATH" dbVersion="3.2" dbCoordSys="PDBresnum" dbAccessionId="1k5nA01" dbResNum="181" dbResName="ARG" dbChainId="A"> </crossRefDb>

Note

Hmm... it also seems to be represented as a <tt>mapRegion</tt> element:

<mapRegion start="1" end="181"> <db dbSource="CATH" dbVersion="3.2" dbCoordSys="PDBresnum" dbAccessionId="1k5nA01"> </db> </mapRegion>

We’re using the <tt>crossRefDb</tt> elements because that’s closer to how secondary structure features are represented, allowing us to share code.

Note

The pprint function helps normalize test output:

>>> from pprint import pprint

We can grab the elements for <tt>1k5nA01</tt> using this XPath expression:

>>> q=xml.xpath(./*/*/*/*/e:crossRefDb[@dbAccessionId="1k5nA01" ... and @dbSource="CATH" and @dbCoordSys="PDBresnum"], ...            namespaces=Sifts.NS) >>> len(q) 181

Note

Wow... <tt>dbVersion</tt> went from 23 to 24 during development; the web page still shows 184, but the SIFTS data says 186.

Perhaps it’s worth a local checked-in copy of some test data after all?

>>> def maskv(attrs): ...    del attrs['dbVersion'] ...    return attrs

For the sake of space, let’s just look at the first and last 3:

>>> pprint([maskv(e.attrib) for e in q[:3]]) [{'dbChainId': 'A', 'dbResNum': '1', 'dbAccessionId': '1k5nA01', 'dbResName': 'GLY', 'dbSource': 'CATH', 'dbCoordSys': 'PDBresnum'}, {'dbChainId': 'A', 'dbResNum': '2', 'dbAccessionId': '1k5nA01', 'dbResName': 'SER', 'dbSource': 'CATH', 'dbCoordSys': 'PDBresnum'}, {'dbChainId': 'A', 'dbResNum': '3', 'dbAccessionId': '1k5nA01', 'dbResName': 'HIS', 'dbSource': 'CATH', 'dbCoordSys': 'PDBresnum'}] >>> pprint([maskv(e.attrib) for e in q[-3:]]) [{'dbChainId': 'A', 'dbResNum': '179', 'dbAccessionId': '1k5nA01', 'dbResName': 'LEU', 'dbSource': 'CATH', 'dbCoordSys': 'PDBresnum'}, {'dbChainId': 'A', 'dbResNum': '180', 'dbAccessionId': '1k5nA01', 'dbResName': 'GLN', 'dbSource': 'CATH', 'dbCoordSys': 'PDBresnum'}, {'dbChainId': 'A', 'dbResNum': '181', 'dbAccessionId': '1k5nA01', 'dbResName': 'ARG', 'dbSource': 'CATH', 'dbCoordSys': 'PDBresnum'}]

Note

We have to be careful about <tt>crossRefDb</tt> elements in different <tt>segment</tt> elements, but that was easy enough that we’re not bothering to test it so far.

Mapping from PDB coordinates: 1k5nB00
The <tt>dbResNum</tt> is not as simple as that example suggests. The documentation for the residue element just says:

"The database residue number."

as if residues have only one number; the first residue in <tt>1k5nB00</tt> would be numbered 1 if that were the case, but it is numbered 0:

>>> q=xml.xpath('./*/*/*/*/e:crossRefDb[@dbAccessionId="1k5nB00" and @dbSource="CATH" and @dbCoordSys="PDBresnum"]', ...            namespaces=Sifts.NS) >>> pprint([maskv(e.attrib) for e in q[:3]]) [{'dbChainId': 'B', 'dbResNum': '0', 'dbAccessionId': '1k5nB00', 'dbResName': 'MET', 'dbSource': 'CATH', 'dbCoordSys': 'PDBresnum'}, {'dbChainId': 'B', 'dbResNum': '1', 'dbAccessionId': '1k5nB00', 'dbResName': 'ILE', 'dbSource': 'CATH', 'dbCoordSys': 'PDBresnum'}, {'dbChainId': 'B', 'dbResNum': '2', 'dbAccessionId': '1k5nB00', 'dbResName': 'GLN', 'dbSource': 'CATH', 'dbCoordSys': 'PDBresnum'}]

The feature ends at the 100th residue, which is numbered 99:

>>> pprint([e.attrib['dbResNum'] for e in q[-3:]]) ['97', '98', '99']

The trick is the <tt>dbCoordSys</tt> attribute; the value <tt>PDBresnum</tt> indicates that we need to map back from PDB coordinates to ordinal numbers. This mapping is available in the corresponding PDBML file; let’s grab it from the cache. The setup function will help when we want to look at other structures:

>>> def setup(pdbid): ...    dummy = popen('make PDB_FILES=%s cache' % pdbid.lower).read ...    xmlgzfn = path.join(cache, "hla-sifts", pdbid.lower+".xml.gz") ...    xml = etree.parse(gzip.open(xmlgzfn)).getroot ...    xmlgzfn = path.join(cache, "hla-xml", pdbid.lower+".xml.gz") ...    pdbx = etree.parse(gzip.open(xmlgzfn)).getroot ...    return pdbid, xml, pdbx >>> pdbid, xml, pdbx = setup('1K5N')

In particular, we need the <tt>insmap</tt>, i.e. the mapping from PDB codes (that may include insertion codes) to sequential numbers:

>>> from pdbparts import insmap >>> im = insmap(pdbx, '') >>> im[('B', '0')] 1 >>> im[('B', '1')] 2 >>> im[('B', '99')] 100

Now we have what we need to instantiate the <tt>Sifts</tt> class:

>>> s = Sifts(xml, im)

Template access to SIFTS data
The <tt>Sifts</tt> class is designed to separate parsing from RDF modelling. A trivial example is accessing the PDBID from a SIFTS file:

>>> s.pdbid '1K5N'

Methods such as <tt>cath</tt> abstract the XML/XPath operations, returning an iterator:

>>> pprint(list(s.cath)) [('1k5nA01', 'A', 1, 181), ('1k5nA02', 'A', 182, 276), ('1k5nB00', 'B', 1, 1), ('1k5nB00', 'B', 2, 100)]

We can unpack that a bit more slowly:

>>> each = s.cath >>> id_, chain, start, end = each.next >>> id_ '1k5nA01' >>> chain, start, end ('A', 1, 181)

And how about SCOP:

>>> pprint(list(s.scop)) [('77267', 'A', 1, 181), ('77266', 'A', 182, 276), ('77268', 'B', 1, 1), ('77268', 'B', 2, 100)]

Good.

Secondary structure: misc. design notes
Let’s find a concise XPath for the secondary structure.

Where is the loop at residue 2 recorded? Which elements?

>>> q=xml.xpath('.//*[@dbResNum="2"]', ...            namespaces=Sifts.NS) >>> sorted(list(set([e.tag for e in q]))) ['{http://www.efamily.org.uk/xml/efamily/2004/08/14/eFamily.xsd}crossRefDb', '{http://www.efamily.org.uk/xml/efamily/2004/08/14/eFamily.xsd}residue']

Hmm... the crossRefDb elements aren’t relevant; they’re all from other sources:

>>> q=xml.xpath('.//e:crossRefDb[@dbResNum="2"]', ...            namespaces=Sifts.NS) >>> sorted(list(set([e.attrib['dbSource'] for e in q]))) ['CATH', 'PDB', 'SCOP']

Let’s look at those residue elements:

>>> q=xml.xpath('.//e:residue[@dbResNum="2"]', ...            namespaces=Sifts.NS) >>> pprint([e.attrib for e in q]) [{'dbResName': 'SER', 'dbResNum': '2'}, {'dbResName': 'ILE', 'dbResNum': '2'}, {'dbResName': 'ARG', 'dbResNum': '2'}]

Yes... that looks promising... let’s find the rest of the relevant data:

>>> q=xml.xpath('.//e:segment//e:residue[' ...            '    e:residueDetail/@property="codeSecondaryStructure"' ...             'and e:residueDetail/@property="nameSecondaryStructure"' ...             'and e:crossRefDb/@dbCoordSys="PDBresnum"' ...             ']', ...             namespaces=Sifts.NS) >>> pprint([(e.xpath('e:crossRefDb[@dbCoordSys="PDBresnum"]', ...                            namespaces=Sifts.NS)[0].attrib['dbChainId'], ...         e.attrib, ...         [ee.text.strip ...          for ee in e.xpath("e:residueDetail", ...                             namespaces=Sifts.NS)] ...         ) ...         for e in q[:12]]) [('A', {'dbResName': 'HIS', 'dbResNum': '3'}, ['E', 'STRAND']), ('A', {'dbResName': 'SER', 'dbResNum': '4'}, ['E', 'STRAND']), ('A', {'dbResName': 'MET', 'dbResNum': '5'}, ['E', 'STRAND']), ('A', {'dbResName': 'ARG', 'dbResNum': '6'}, ['E', 'STRAND']), ('A', {'dbResName': 'TYR', 'dbResNum': '7'}, ['E', 'STRAND']), ('A', {'dbResName': 'PHE', 'dbResNum': '8'}, ['E', 'STRAND']), ('A', {'dbResName': 'HIS', 'dbResNum': '9'}, ['E', 'STRAND']), ('A', {'dbResName': 'THR', 'dbResNum': '10'}, ['E', 'STRAND']), ('A', {'dbResName': 'SER', 'dbResNum': '11'}, ['E', 'STRAND']), ('A', {'dbResName': 'VAL', 'dbResNum': '12'}, ['E', 'STRAND']), ('A', {'dbResName': 'ARG', 'dbResNum': '21'}, ['E', 'STRAND']), ('A', {'dbResName': 'PHE', 'dbResNum': '22'}, ['E', 'STRAND'])]

OK... that helped get the kinks out of <tt>secondary</tt>:

>>> pprint(list(s.secondary)[:5]) [(('E', 'STRAND'), 'A', 3, 12), (('E', 'STRAND'), 'A', 21, 25), (('E', 'STRAND'), 'A', 27, 28), (('T', 'BETA-TURN'), 'A', 29, 30), (('B', 'BULGE'), 'A', 31, 31)] >>> pprint(list(s.secondary)[-5:]) [(('E', 'STRAND'), 'B', 79, 84), (('T', 'BETA-TURN'), 'B', 86, 87), (('S', 'BEND'), 'B', 89, 90), (('E', 'STRAND'), 'B', 92, 95), (('T', 'BETA-TURN'), 'B', 98, 99)]

The single-letter codes are DSSP codes:

There are eight types of secondary structure that DSSP defines:


 * G = 3-turn helix (310 helix). Min length 3 residues.
 * H = 4-turn helix (α helix). Min length 4 residues.
 * I = 5-turn helix (π helix). Min length 5 residues.
 * T = hydrogen bonded turn (3, 4 or 5 turn)
 * E = extended strand in parallel and/or anti-parallel β-sheet conformation. Min length 2 residues.
 * B = residue in isolated β-bridge (single pair β-sheet hydrogen bond formation)
 * S = bend (the only non-hydrogen-bond based assignment)”

See also: DSSP (protien) in wikipedia; I’m not sure which is a more relevant explanation.

Pfam and mapRegion
Pfam data seems to be available only in <tt>mapRegion</tt> elements:

>>> q=xml.xpath('.//e:db[@dbSource="Pfam"]', ...            namespaces=Sifts.NS) >>> pprint([maskv(e.attrib) for e in q]) [{'dbAccessionId': 'PF00129', 'dbSource': 'Pfam', 'dbCoordSys': 'UniProt'}, {'dbAccessionId': 'PF07654', 'dbSource': 'Pfam', 'dbCoordSys': 'UniProt'}, {'dbAccessionId': 'PF07654', 'dbSource': 'Pfam', 'dbCoordSys': 'UniProt'}] >>> q=xml.xpath('.//e:mapRegion[e:db/@dbSource="Pfam"]', ...            namespaces=Sifts.NS) >>> pprint([e.attrib for e in q]) [{'start': '1', 'end': '179'}, {'start': '186', 'end': '271'}, {'start': '12', 'end': '93'}]

<tt>pfam</tt> extracts this data:

>>> pprint(list(s.pfam)) [('PF00129', 'A', 1, 179), ('PF07654', 'A', 186, 271), ('PF07654', 'B', 12, 93)]

1KPR debugging
This example pointed out that frompdb(x1-x2) wasn’t the way to go but rather x1-frompdb(x2).

>>> pdbid, xml, pdbx = setup('1KPR') >>> s = Sifts(xml, insmap(pdbx, '')) >>> pprint(list(s.pfam)) [('PF00129', 'A', 1, 179), ('PF07654', 'A', 186, 271), ('PF07654', 'B', 12, 93), ('PF00129', 'C', 1, 179), ('PF07654', 'C', 186, 271), ('PF07654', 'D', 12, 93)]

Pfam offsets: 1YMM
Let’s check another PDB structure that has non-trivial offsets in its Pfam stuff:

>>> pdbid, xml, pdbx = setup('1YMM') >>> im = insmap(pdbx, '')

Let’s have a close look at the insertion map for C:

>>> pprint(sorted([(pos, c, n) for (c, n), pos in im.iteritems ...               if c == 'C'])) [(1, 'C', '85'), (2, 'C', '86'), (3, 'C', '87'), (4, 'C', '88'), (5, 'C', '89'), (6, 'C', '90'), (7, 'C', '91'), (8, 'C', '92'), (9, 'C', '93'), (10, 'C', '94'), (11, 'C', '95'), (12, 'C', '96'), (13, 'C', '97'), (14, 'C', '98'), (15, 'C', '99'), (16, 'C', '100'), (17, 'C', '101'), (18, 'C', '102'), (19, 'C', '103'), (20, 'C', '104'), (21, 'C', '105'), (22, 'C', '106'), (23, 'C', '107')]

... and the XML data regarding Pfam for chain C:

>>> q=xml.xpath('.//e:mapRegion[e:db/@dbSource="Pfam" and ../../../@entityId="C"]', ...            namespaces=Sifts.NS) >>> pprint([e.attrib for e in q]) [{'start': '1', 'end': '15'}]

The results are as expected:

>>> s = Sifts(xml, im) >>> pprint(list(s.pfam)) [('PF00993', 'A', 4, 84), ('PF07654', 'A', 93, 175), ('PF00969', 'B', 13, 87), ('PF07654', 'B', 103, 185), ('PF01669', 'C', 1, 15), ('PF09291', 'D', 119, 202), ('PF07654', 'E', 133, 226)] >>> pprint(list(s.scop)) [('123702', 'A', 13, 81), ('123701', 'A', 83, 179), ('123704', 'B', 3, 92), ('123703', 'B', 93, 189), ('144702', 'D', 9, 104), ('123705', 'E', 1, 118), ('123705', 'E', 119, 119), ('123706', 'E', 123, 246)] >>> pprint(list(s.cath)) [('1ymmA01', 'A', 2, 81), ('1ymmA02', 'A', 82, 183), ('1ymmB01', 'B', 3, 91), ('1ymmB02', 'B', 92, 189), ('1ymmD00', 'D', 9, 104), ('1ymmE01', 'E', 3, 118), ('1ymmE01', 'E', 119, 119), ('1ymmE02', 'E', 120, 246)] >>> len(list(s.secondary)) 130

3HAE (regression test)
>>> pdbid, xml, pdbx = setup('3HAE') >>> im = insmap(pdbx, '') >>> s = Sifts(xml, im) >>> pprint(list(s.pfam)) [('PF00129', 'A', 1, 179), ('PF07654', 'A', 188, 271), ('PF07654', 'B', 12, 93), ('PF00129', 'D', 1, 179), ('PF07654', 'D', 188, 271), ('PF07654', 'E', 12, 93), ('PF00129', 'J', 1, 179), ('PF07654', 'J', 188, 271), ('PF07654', 'K', 12, 93), ('PF00129', 'P', 1, 179), ('PF07654', 'P', 188, 271), ('PF07654', 'Q', 12, 93)]

1JWS numbering (regression test)
1JWS gave me trouble earlier.

>>> pdbid, xml, pdbx = setup('1JWS') >>> s = Sifts(xml, insmap(pdbx, '')) >>> pprint(list(s.scop)) [('84243', 'A', 3, 81), ('84242', 'A', 82, 182), ('84245', 'B', 1, 92), ('84244', 'B', 93, 190), ('84246', 'D', 1, 121), ('84247', 'D', 122, 239)] >>> pprint(list(s.cath)) [('1jwsA01', 'A', 3, 81), ('1jwsA02', 'A', 82, 181), ('1jwsB01', 'B', 1, 92), ('1jwsB02', 'B', 93, 190), ('1jwsD02', 'D', 1, 31), ('1jwsD01', 'D', 32, 119), ('1jwsD02', 'D', 120, 237)] >>> pprint(list(s.pfam)) [('PF00993', 'A', 4, 84), ('PF07654', 'A', 93, 175), ('PF00969', 'B', 13, 87), ('PF07654', 'B', 103, 185), ('PF01123', 'D', 23, 119), ('PF02876', 'D', 129, 235)]