Pdb-hla-docs/Modules

= Protein Databank (pdb package) =

Selecting PDB structures and caching PDB data
The protein data bank (PDB) includes thousands of structures, of which only a few hundred are relevant to our study of HLA structure variation.

Keyword search is implemented in keyfetch.py; it finds finds around 270 structures for the keyword ‘HLA’; some, such as 1FFK, are false positives; they are not relevant to our investigation; a list of false positives to filter out is maintained in the code.

Usage
To build a makefile definition for PDB_FILES:

$ python keyfetch.py --makefile HLA >out.mk

To download data for matching structures:

$ python keyfetch.py HLA data_dir/


 * keyfetch.fetch ( i, fmt='.xml.gz, dir='. )
 * Fetch PDBML data for PDB structure i and store in dir.


 * keyfetch.main ( argv )
 * See Usage above.


 * keyfetch.pdb_uri ( i, fmt='.xml.gz, pdb_files='http://www.rcsb.org/pdb/files/ )
 * build URI for .xml.gz fileWe assume i needs no encoding. >>> pdb_uri("1N8R")

'http://www.rcsb.org/pdb/files/1N8R.xml.gz'


 * keyfetch.</tt>progress</tt> ( *args )
 * Report progress/diagnostic to stderr


 * keyfetch.</tt>structureIdList</tt> ( kw, svc='http://www.rcsb.org/pdb/rest/search, exc={'HLA': ('1B6U, 1BBG, 1FFK, 1IIE, 1JJ2, 1K73, 1K8A, 1K9M, 1L3H, 1LDS, 1W2B, 1WX9, 1XTI, 1XTJ, 1XTK, 2BBG, 2VB5, 2WA0, 3BBG, '3G71')} )
 * Search for PDB structures using a keyword.{| frame="void" rules="none" frame="void" rules="none"

! Parameters:
 * kw – keyword to search for in PDB structure descriptions; e.g. ‘HLA’
 * svc – endpoint supporting the RCSB PDB RESTful Web Services interface.
 * exc – dictionary mapping keywords to lists of PDBIDs to exclude, e.g. to eliminate known false positives.
 * exc – dictionary mapping keywords to lists of PDBIDs to exclude, e.g. to eliminate known false positives.

! Raises: urllib2.URLError</tt> ! Returns: a list of pdbid strings in lower case.
 * }

Using Jmol to compute contacts between residues
Usage:

$ jython pdbcontacts.py out_dir/ pdb1.pdb pdb2.pdb pdb3.pdb ...

For each pdbid, it writes to out_dir/pdbid.tsv a tab-separated file with a one line header plus one line per contact:

print >>out, "\t".join([pdbid, ch1, res1, n1, ch2, res2, n2])

Such a line indicates that the n1th residue of chain ch1 in pdb structure named pdbid has an atom that is within some tolerance (e.g. 5 angstroms) of an atom in the n2th residue of chain ch2. Water and hydrogen are exluded. Residue number here are given in PDB coordinates. Chain names are, e.g. ‘A’, ‘B’, ‘C’.

Template-based conversion of PDBML to RDF
The code to convert PDB data from XML to RDF is split in two parts:


 * code inspect the XML and pick out various parts, pdbparts</tt>.
 * a streaming RDF template, pdb_tpl</tt>, to write out the data as RDF.

In this way details such as choice of namespace URIs in the RDF output can be edited without opening up the code to extract XML data.

Usage
To extract data from PDBID.xml.gz</tt>:

pdbparts.py template.py PDBID.xml.gz XXXX.tsv rdfout.nt


 * The template is a python module with a run</tt> function. The choice of how the data from PDBID.xml.gz are modelled in RDF is up to the template.
 * The PDBID.xml.gz</tt> argument names a compressed PDBML file, e.g. a copy of 1k5n.xml.gz from the PDB website.


 * The XXXX.tsv</tt> file is the result of contact calculations from Jmol; see pdbcontacts</tt> for details.

PDBML: Protein Data Bank (PDB) XML format
The PDBML design was published in:


 * PDBML: the representation of archival macromolecular structure data in XML. John Wesbrook, Nobutoshi Ito, Haruki Nakamura, Kim Henrick and Helen M. Berman, Bioinformatics, 21(7), 988-992, 2005.

The namespace of these documents is:

>>> Parts.NS['PDBx'] 'http://pdbml.pdb.org/schema/pdbx-v32.xsd'

The schema for this namespace is derived from the mmCIF format PDB Exchange Data Dictionary, which includes fields such as pdbx_seq_one_letter_code.

See also design notes and tests on residue numbering.


 * class pdbparts.</tt>Parts</tt> ( tree, tsvfn )
 * Provide access to parts of a PDB structure, given a parsed PDB XML document and a Jmol filename.Note that many of these methods return table-like structures in the form of iterators of tuples of strings.NotePerhaps named tuples would work better? No need so far.{| frame="void" rules="none" frame="void" rules="none"

! Parameters:
 * tree –
 * tree –

a parsed PDBML file, as returned from lxml.etree.parse</tt>. See also: The lxml.etree Tutorial.
 * tsvfn – filename of a tab-separated-file with one line per contact, as per pdbcontacts</tt>.


 * }
 * <tt>citations</tt>
 * Get citations, per PDB documentation on citations{| frame="void" rules="none" frame="void" rules="none"

! Returns: an iterator of (pubmed, title, year) tulples, one for each of the structure’s citations containing a pubmed identifier.
 * pubmed is an integer numeral, e.g. “12627512”
 * year is an integer numeral, e.g. “1984”


 * }
 * <tt>contacts</tt>
 * Get contacts between an HLA chain and a peptide in a PDB structure.Each contact is between some residue on the peptide and some residue on an HLA chain.A residue is represented as (res, seq, seq_auth, chain) where:
 * res is a 3 letter amino acid code (e.g. “THR” or “ASN”),
 * seq is a sequence coordinate in sequential coordinates
 * seq_auth is a sequence coordinate in PDB coordinates
 * chain is, as usual, ‘A’, or ‘B’, or ‘C’ {| frame="void" rules="none" frame="void" rules="none"

! Returns:
 * an iterator of (resp, seqp, seq_authp, chainp, res, seq, seq_auth, chain) where resp, seqp etc. identify the residue on the peptide and res, seq, etc. identify the residue on the HLA chain.
 * }
 * <tt>keywords</tt>
 * Get keywords, as per PDB documentation on keywords{| frame="void" rules="none" frame="void" rules="none"

! Returns:
 * a pair of strings (keywords, text).
 * }
 * <tt>neighbors</tt>
 * Get a PDB structure’s related structures, as per PDBML documentation on related structures.{| frame="void" rules="none" frame="void" rules="none"

! Returns:
 * an iterator over tuples of strings (id, details) for each of the structure’s related structures.
 * }
 * <tt>non_polymers</tt>
 * Get non-polymer entities of a PDB structure.{| frame="void" rules="none" frame="void" rules="none"

! Returns: an iterator of tuples of strings (entity_id, name, comp_id, comp_type, formula, formula_weight) for each non-polymer entity in the structure. cf. PDB documentation on non-polymer entities and on chemical compounds for comp_type, formula, and formula_weight. It seems that align_id corresponds to an entity id.
 * }
 * <tt>pdbid</tt>
 * {| frame="void" rules="none" frame="void" rules="none"

! Returns:
 * the PDB identifier; e.g. “1D5M”.
 * }
 * <tt>polymers</tt>
 * Enumerate polymer entities in a PDB structure.{| frame="void" rules="none" frame="void" rules="none"

! Returns: an interator over tuples of (chain, seqcodes, uniprot, allele, details, mutation, description, type) for each polymer entity in the structure.
 * chain is, e.g. ‘A’, ‘B’, ‘C’
 * seqcodes gives the sequence in one-letter-per-peptide format (with whitespace removed)
 * uniprot is a uniprot accession number if there is a related uniprot record or else None
 * allele is an allele name (e.g. ‘B*3501’) if one is availble for this polymer or else None

See PDB documentation on polymers regarding details, mutation, description, and type.
 * }NoteAn earlier version of this interface had entities and strands methods, analagous to PDBML entities and PDBML strands, but a single polymers method seems more straightforward. The only difference in the total information passed across the interface is regarding water entities, and we don’t seem to be using that information.
 * <tt>residues</tt>
 * Get residues of a PDB structure, per PDB documentation on monomer sequences.Note<tt>polymers</tt> must be called first in order to build an index of chains by entity.{| frame="void" rules="none" frame="void" rules="none"

! Returns:
 * an iterator of tuples of strings (mon_id, seq, seq_auth, chain) for each of the residues in the structure.
 * }


 * <tt>pdbparts.</tt><tt>attr</tt> ( e, n )
 * Get an attribute from an element.We experimented with DOM APIs and lxml during development; this isolates one of the differences.


 * <tt>pdbparts.</tt><tt>field</tt> ( e, n, ns='http://pdbml.pdb.org/schema/pdbx-v32.xsd' )
 * Get text content of a child of an element.We experimented with DOM APIs and lxml during development; this isolates one of the differences.


 * <tt>pdbparts.</tt><tt>insmap</tt> ( tree, sep='^' )
 * Map back from PDB coordinates with insertion codes.See also details of insertion codes.{| frame="void" rules="none" frame="void" rules="none"

! Parameters:
 * tree –
 * tree –

a parsed PDBML file, as returned from <tt>lxml.etree.parse</tt>. See also: The lxml.etree Tutorial. ""
 * sep – separator used to punctuate insertion code; defaults to ‘^’ as in 52^A. Use ‘’ for 52A.

! Returns: a dictionary that maps (chain, pdbcoord) pairs to sequential coordinates. See also sequential coordinates and PDB coordinates.
 * }NoteThis should probably be renamed; insmap has become a misnomer. Also, it should perhaps be factored out of this module, since sifts_parse uses it as well.


 * <tt>pdbparts.</tt><tt>main</tt> ( argv )
 * See Usage above.


 * <tt>pdbparts.</tt><tt>progress</tt> ( *args )
 * Report progress/diagnostics to stderr.


 * <tt>pdbparts.</tt><tt>run</tt> ( parts, streamkb )
 * Run a template to produce RDF from parts of PDB structures.NoteNote this function is for documentation only; the client-provided template supplies the actual funtion.{| frame="void" rules="none" frame="void" rules="none"

! Parameters:
 * parts – a <tt>Parts</tt> object that provides access to data about a given PDB structure.
 * streamkb – <tt>streamrdf.StreamKB</tt> that the template is to use to write RDF.
 * streamkb – <tt>streamrdf.StreamKB</tt> that the template is to use to write RDF.


 * }


 * <tt>pdbparts.</tt><tt>seq_norm</tt> ( s )
 * Normalize sequence by removing whitespace.

streamrdf.py – quick-n-dirty RDF serialization using n-ntriples
A StreamKB helps write RDF in n-triples format:

>>> import sys >>> s = StreamKB(sys.stdout) >>> PubM = Namespace('http://purl.org/science/article/pmid/') >>> article = PubM.sym('8624812') >>> foaf = Namespace('http://xmlns.com/foaf/0.1/') >>> s.classAssertion(article, foaf.Document) <http://purl.org/science/article/pmid/8624812> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Document>.

Note the distinction between URI strings and n-triples terms:

>>> sym('http://www.w3.org/') '<http://www.w3.org/>'

Namespaces help build terms, either with python identifiers in source code:

>>> dc  = Namespace('http://purl.org/dc/elements/1.1/') >>> dc.title '<http://purl.org/dc/elements/1.1/title>'

or with runtime computed names:

>>> PubM.sym('8624812') '<http://purl.org/science/article/pmid/8624812>'

Take care with the difference beetween python strings and (quoted) turtle/N3 string terms:

>>> s.dataPropertyAssertion(dc.title, article, ...                        txt('Take care with "quotes"')) <http://purl.org/science/article/pmid/8624812> <http://purl.org/dc/elements/1.1/title> "Take care with \"quotes\"".

We allow integers as terms, using a slight extension of n-triples (into into turtle):

>>> s.dataPropertyAssertion(rdf.value, article, 42) <http://purl.org/science/article/pmid/8624812> <http://www.w3.org/1999/02/22-rdf-syntax-ns#value> 42.

Note

Serialization of integers goes beyond strict n-triples into turtle, but the tools we’re using (e.g. virtuoso) seem to grok.


 * class <tt>streamrdf.</tt><tt>Namespace</tt> ( pfx )
 * URI terms with a common prefixThis lets us use python’s obj.name syntax to represent namespaces of N-Triples constant terms.


 * <tt>prefix</tt>
 * Prefix of this namespace.Where is this used? Is it dead code?


 * <tt>sym</tt> ( n )
 * for cases when n is computed at runtime or not a python name


 * class <tt>streamrdf.</tt><tt>StreamKB</tt> ( out )
 * A streaming, that is append-only, RDF/OWL knowledge base.{| frame="void" rules="none" frame="void" rules="none"

! Parameter:
 * out – a file-like object where this object will write its output
 * }
 * <tt>classAssertion</tt> ( i, c )
 * Write a turtle/N3 statement asserting i is in CWe assume i and c are well-formed turtle terms.


 * <tt>dataPropertyAssertion</tt> ( p, s, o )
 * Write a turtle/N3 statement, assuming p, s, and o are well-formed turtle/N3 terms.


 * <tt>streamrdf.</tt><tt>sym</tt> ( name )
 * Produce a turtle/N3 constant term from its name.{| frame="void" rules="none" frame="void" rules="none"

! Parameter:
 * name – an absolute URI (limited to strings not containing ‘>’ to avoid escaping complications).
 * }

pdb_naming.py – URI name conventions for the pdb bundle

 * <tt>pdb_naming.</tt><tt>allele_chain_sym</tt> ( name )
 * Get the turtle term for an HLA allele (AA chain form).This follows the mahco bundle algorithm. >>> allele_chain_sym('DMA*0101')

'<http://purl.org/stemnet/HLA#DMA_0101_Chain>' >>> allele_chain_sym("DRB7*0101") '<http://purl.org/stemnet/HLA#DRB7_0101_Chain>' For silent mutations, the corresponding AA-level allele is returned. >>> allele_chain_sym('DQB1*050101') '<http://purl.org/stemnet/HLA#DQB1_0501_Chain>' >>> allele_chain_sym('A*01010101') '<http://purl.org/stemnet/HLA#A_0101_Chain>'


 * <tt>pdb_naming.</tt><tt>allele_dna_sym</tt> ( name )
 * Get the turtle term for an HLA allele (DNA form).This follows the mahco bundle algorithm. >>> allele_dna_sym('DMA*0101')

'<http://purl.org/stemnet/HLA#DMA_0101>' >>> allele_dna_sym('DQB1*050101') '<http://purl.org/stemnet/HLA#DQB1_050101>' >>> allele_dna_sym('A*01010101') '<http://purl.org/stemnet/HLA#A_01010101>'


 * <tt>pdb_naming.</tt><tt>chain_name</tt> ( uri )
 * Get a chain name back from the URI from a chain symbol/term. >>> sym = chain_sym('1A1M', 'A')

>>> uri = sym[1:-1] >>> chain_name(uri) == 'A' True


 * <tt>pdb_naming.</tt><tt>chain_sym</tt> ( pdbid, chain )
 * Make an n-triples term (URI) for a chain from its pdbid and name. >>> chain_sym('1A1M', 'A')

'<http://purl.org/science/pdb/chain/1A1M/A>'


 * <tt>pdb_naming.</tt><tt>locus_sym</tt> ( name )
 * >>> locus_sym('HLA-A')

'<http://purl.org/stemnet/HLA#HLA-A>' >>> locus_sym('HLA-DRA') '<http://purl.org/stemnet/HLA#HLA-DRA>'


 * <tt>pdb_naming.</tt><tt>residue_sym</tt> ( pdbid, chain, seq )
 * Make an n-triples term (URI) from residue. >>> residue_sym('1A1M', 'A', 23)

'<http://purl.org/science/pdb/residue/1A1M/A/0023>' Sequence numbers are padded with 0s (to 4 places) to facilitate sorting.It’s OK to pass seq as a string: >>> residue_sym('1A1M', 'A', '23') '<http://purl.org/science/pdb/residue/1A1M/A/0023>'

chebi_codes.py – mapping from amino acid short names to CHEBI codes
There are 20 standard amino acids with conventional 3-letter and 1-letter names:

>>> sorted(CodeToLetter.keys) ['ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLN', 'GLU', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL'] >>> len(CodeToLetter.keys) 20 >>> sorted(LetterToCode.keys) ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'] >>> all([LetterToCode[CodeToLetter[code]] == code ...     for code in CodeToLetter.keys]) True

The CHEBI amino acid residue page lists them by name and CHEBI code. To look up a CHEBI code by 3-letter code:

>>> CodeToChebi['ALA'] '32441' >>> sorted(CodeToChebi.keys) == sorted(CodeToLetter.keys) True

To look up a CHEBI code by 1-letter code:

>>> CodeToChebi[LetterToCode['A']] '32441'

pdb_tpl.py – streaming RDF template for PDB structures
The basic idea is to call the various methods on the parts object, coerce the strings there into turtle terms using sym or txt, and call DataPropertyAssertion and/or ClassAssertion.


 * <tt>pdb_tpl.</tt><tt>bond_sym</tt> ( pdbid, chain, seq )
 * Build a term for a bond. >>> bond_sym('1A1M', 'A', '10')

'<http://purl.org/science/pdb/bond/1A1M/A/0010>' Notebug #203: pdb chains, residues etc. are 404


 * <tt>pdb_tpl.</tt><tt>run</tt> ( parts, out )
 * Convert PDB data to RDF.See <tt>pdbparts.run</tt> for parameter info.

See also: pdb_tpl.py source in the subversion repository.

contacts_tpl.py – streaming RDF template for PDB contact info

 * <tt>contacts_tpl.</tt><tt>run</tt> ( parts, out )
 * Convert contact data to RDF.See <tt>pdbparts.run</tt> for parameter info.

See also: contacts_tpl.py source in the subversion repository.