ImmPort/Demo sketch

Work in progress - not ready for prime time

''Originally based on email from Alan to JAR, 2009-05-12; subsequently edited by JAR; and then updated by Dan as development progresses.''

See also: ImmPort/Demo queries

The demo has two parts: associating perturbed sequence features with alleles, and associating diseases with alleles.

This explanation assumes familiarity with ImmPort/PDB.

@@extra credit For a deliverable we try to get [Richard and/or Mark] to have a triple store installation of our content with as much of this as we can accomplish (it's possible we can do all of this) and have them running whatever portions of queries leading to this as we have available.

Contact Points from PDB
done except for alignment bug re PTM

The most important kind of feature is the contact point, where an HLA chain contacts the target peptide. For example, the First Glance in Jmol view of 1AGB: shows contact residues are TYR:7 A, ASP:9 A, ASN 63:A, etc.:



We model each contac from residue ?hla_residue on an HLA chain to ?peptide_residue on a peptide a fact, as the following query demonstrates:

First few results:

ToDo: make this an automated test case for Bundles/pdb.

separate bundle for contacts? @@@

Sequence features
code prototyped but no data bundled built yet.

For present purposes a sequence feature is a structural or functional element connected with a position or span within an amino acid sequence. That is, any given feature is fixed at some particular place along a sequence. The kinds of features we'll look at are:
 * Contact point - derived from PDB - where an HLA molecule contacts a short polypeptide chain
 * Protein family - derived from PFAM or SCOP (JAR's guess is that these will not be very informative)
 * Structural - derived from CATH

@@regions from http://www.uniprot.org/uniprot/P04229 ?

Features are, e.g. "alpha helix", or "domains".

PFAM has domain info, and PDB links to PFAM; e.g. PFAM info for 1CF5 is PF00161, whence Interpro IPR001574 tells us e.g. of IPR016138 Ribosome-inactivating protein, subdomain 1. That page gives us examples of the coordinates of the domain in particular proteins. (I'm having a hard time figuring out how the domain is defined, but that's a different story.) Presumably among all of the listed accessions (sequences) is the appropriate chain from PDB 1CF5.

Sanger's PFAM site has very PFAM/Uniprot/PDB correlations in easy-to-scrape form; e.g. 1K5N. pfam_scrape.py is a prototype implementation that could be integrated with the PDB template mechanism.

EBI seems to have even better structural information; e.g. 1K5N.

A table of sequence feature variance types (SFVTs) is supplied by Richard's group (and we'll find more) (JAR: how?) - these are features specific to one allele (really? which one?). This is work AlanR does.

Alan and Dan need to work out what their interface is around SFVTs as this is likely to be a blocking point in the process.

(Need to figure out the coordinate systems used and relate to PFAM and SVFT.) (I wonder if this information is available in a machine-friendly form? Probably not.)

See also ImmPort/Sequence features, some notes started by jar.

Perturbation and Sequence alignment
done except for bug with primary_allele on DRB and PMT bug

Alleles differ in their sequences; often two related alleles will be almost identical, differing only at a single amino acid. The mutation points are here called perturbations and are of great interest because they may reflect functional changes related to inherited susceptibility to disease.

Perturbations also occur at particular positions, and are thus are like sequence feature. The goal is to correlate features with perturbations: In a particular allele (or set of alleles related to some disease), which features are perturbed (i.e. which features contain positions that are perturbed in that allele and generally not in others)?

For example, consider HLA-B*080101. In order to find corresponding positions on other alleles, we use as a reference the reference sequence of the HLA-B locus assigned by The IMGT in B_prot.txt, B*070201:

HLADB-2.26.0-Jul 2009 HLA-B Protein Sequence Alignments Sequences Aligned: 17 July 2009 Steven G. E. Marsh, Anthony Nolan Research Institute.

Prot. Pos. -30       -20        -10                10         20 B*070201                MLVM APRTVLLLLS AALALTETWA GSHSMRYFYT SVSRPGRGEP ... B*080101                -- -- D- AM                        ^                                ^ 1st/-24th                       31st/7th

Some alleles have insertions, deletions, or an initial unknown segment, but not in this case, so the fully aligned offset is 31. ''Why does B_prot.txt from IMGT start from -24? where does that number come from?'' Seems to be this bit of IMGT/HLA help: "For amino acid-based systems, the start codon of the mature protein is labeled codon 1." ToDo add PMT offset, rebuild pdb, hla, and pdbsc bundles.

The alleles with perturbations at this offset are B*1404 and B*15010102N:

HLA-B Protein Sequence Alignments

Prot. Pos. -30       -20        -10                10         20 B*070201                MLVM APRTVLLLLS AALALTETWA GSHSMRYFYT SVSRPGRGEP ... B*080101                -- -- D- AM                        ^                                ^ 1st                             31st ... B*1404                 **** ********** ********** *-H--- A- ... B*15010102N            -R-T -- G- -ECGVGREMA --G-SEGTAG

See Bundles/hla for a SPARQL query for "perturbations in B*1404".

Richard's group has asked to compute whether sequence features are in areas of variation in specific alleles even if we don't have direct information. ''I (Alan) imagine that JAR would do this portion of the work in consultation with [them?], which would include doing the parsing [of what? IMGT?], translation, and query development demonstrating the capability.'' Example of such a query: By comparing structures of allele X binding to peptides of length 9, which contact points in allele Y (assuming the IMGT alignment used to map to X) have different amino acids than X?

Finding perturbed contact points
Example query: given a PDB structure... relative to contact positions, what alleles have those contact positions deleted... which have a different amino acid?

Each PDB chain was compared against all HLA alleles using blast (using sequence data from PDB and IMGT respectively).

Continuing with the 1AGB example: the blast results show that positions 1-276 of chain A of 1AGB (i.e. all of it) match positions 25 to 300 @@FIX of the HLA-B*080101 allele. So the 7th position on 1AGB A corresponds to the 31st position of B*080101.

As we discovered above, the alleles with perturbations at this offset are B*1404 and B*15010102N.

The corresponding SPARQL query is:

The first few results are as expected:

Notes:


 * sc:blast_match blast_match data should comes from the pdbsc bundle (@@docs)
 * ToDo: make separate relations for exact match and off-by-one match (done? ToDo: test it)
 * ToDo: add these to sciencecommons.owl (via the .lisp source), including comments and, if feasible, domains, ranges, etc.
 * sc:primary_allele is new; it relates a locus/gene (e.g. HLA-A) to the primary/base allele chosen by IMGT; this data has been added to the hla bundle. Its use in this query is just to restrict the offsets to the same coordinate system, i.e. the coordinate system of the base allele for the locus.
 * sc:perturbation relates an allele to a perturbation (either a residue or a deletion); it's somewhat like sc:chain_contact in that it (typically) relates a polypeptide to one of its amino acids
 * sc:chain_contact has replaced util:contact from the pdb contact point calculation
 * sc:aligned_offset relates a residue (or a deletion) to an offset aligned to the base allele for a locus. (@@give an example)

@@jar thinks this would be interesting: 2 alleles of the same locus are contacting the same peptide at different contact points

I wrote some code to find interesting cases, and it found:


 * 1S9V chain A from 1 to 193 matches DQA1*050101 from 24 to 216 (1-based)
 * DQA1*050101 diff has a deletion (.) at 78 (0-based)

yup... that found a bug... residue 1 should be offset by 24 or so. ok... fixed that one... and good... the deletion worked:

  .   24. ...   101.   103.

This one has multiple matching alleles (DQA1_0503, DQA1_0505, 7, 8, 9). hmm... when there are multiple matching alleles, can we be sure the offsets all match? Maybe it's easy to check?

Darn:

ValueError: alignment mismatch: ['1D5M', 'A', 'DRA*0101', 'DRA*010201', [26, 27, 28, ...], [29, 30, 31, ...]]

That seems to be 1 of 46 cases: danc@norbert:~/nc/packages/pdbsc> grep ValueError make.out |wc 46  17217   81344

DanC went over this with Jar and found that the alignment mismatch check was buggy... gotta take another look.

ppoints.py prototype
This was originall prototyped in pdbsc/ppoints.py; I think it's working, including blast offsets:

pdbsc$ PYTHONPATH=../pdb:../hla python ppoints.py 1AGB A 7 9 63 70 contact: 1AGB A 7 @@pos 7 @@blast 1AGB A matches B*080101 which is offset by 0 from its base, B*070201 @@blast coords (1, 276, 25, 300) give offset 24 @@for a total of: 31 where the main allele, B*070201  has res: Y

perturbation: B*1404 31 H perturbation: B*15010102N 31 R

Reference sequences, aka primary alleles
Move this to Bundles/hla?

Show all primary alleles; check that each has an rdfs:label from MaHCO:

A study of IMGT alignment data
Note Entrez distinguishes HLA-DQA1 and HLA-DQA2, while the IMGT .zip file lumps them together into DQ_prot.txt. This resulted in assigning a sc:primary_allele to DRB, but not to DRB1, but MaHCO doesn't have a term for DRB, so of course no alleles are linked to it. Issue. See also: Bundles/hla.

Perhaps we should be using DRB1_prot.msf rather than DRB_prot.txt. The .txt format seems to be undocumented, and though it seems to have alignment info, the IMGT download docs say, "The MSF file format is the only format provided that includes the alignment information."

The MSF format seems to be a proprietary invention of Accelrys, but the IMGT download docs go on to say, "All PIR and MSF files have been generated using "ReadSeq", a freely available sequence format conversion program written by D. Gilbert." After a few broken links (or were they server glitches?) I found the readseq source. But I still don't see how deletions are represented. Perhaps better to just hard-code the list of reference sequences (aka sc:primary_allele) from IMGT/HLA help.

verify: Each allele has a delta
exercise: parse alignment data; check that there's a delta between each allele.

Done in alignment_parser.py r1172; see ImmPort/Check Alignment Deltas

Mapping of disease to allele
We need some mapping of disease to allele. There are implicit mappings already via MeSH -> paper (medline/subject-headings bundle), IMGT(allele) -> paper (hla bundle), our text processing (allele) -> paper (medline/alleles bundle). We can also use the alleles JAR has found in OMIM (omim bundle), the OMIM phenotype annotations from The Human Phenotype Ontology - Downloads (relates OMIM to HP = human phenotype ontology), or an existing similar source if there is one (where to look?). Finally we can also look at the GO annotations (obo bundle) to see whether there are mappings to the Uniprot entries for different alleles to see whether there is biological material there (essentially: what are the Uniprot ids for the HLA alleles. How many annotations are there. Rerun our GO annotations script to get the Uniprot associations too - bundle ncbi/goa is now keyed to Entrez Gene, but could easily be additionally keyed to Uniprot).

(See Bundles/ipi for a Uniprot to Gene mapping.)

Query
With the above we can do a demo query: Start with disease (or anything else we can), follow to allele. For that allele find perturbed contact positions or SFVTs [?]. If more than one allele is associated with the disease, ask which peturbations are common to both.

For example, one might start broadly with Immune System Disease (MeSH D020274), or narrowly with Systemic Scleroderma.

To follow the trail from there to contact points:

This query for Autoimmune Diseases returns some interesting results. ''The hla bundle seems to have some non-HLA alleles captured; note we filter those out with a regex.''

PREFIX dc:  prefix skos:  prefix foaf:  prefix sc:  prefix mesh: 

select ?allele ?article ?title where {  { ?rec sc:has-as-major-mesh mesh:D007154. }  union { ?rec sc:has-as-minor-mesh mesh:D007154. }  union { ?rec sc:has-as-major-mesh [ skos:broader mesh:D007154]. }  union { ?rec sc:has-as-minor-mesh [ skos:broader mesh:D007154]. }

?rec foaf:primaryTopic ?article. [] sc:article ?article; sc:name ?allele. ?article dc:title ?title; dc:date ?year filter(regex(?allele, "^HLA")) }

This gets a few dozen alleles from papers about type 1 Diabetes:

PREFIX dc:  prefix skos:  prefix foaf: <http://xmlns.com/foaf/0.1/> prefix sc: <http://purl.org/science/owl/sciencecommons/> prefix mesh: <http://purl.org/commons/record/mesh/>

select distinct ?allele where {  { ?rec sc:has-as-major-mesh mesh:D003922. }  union { ?rec sc:has-as-minor-mesh mesh:D003922. }  union { ?rec sc:has-as-major-mesh [ skos:broader mesh:D003922]. }  union { ?rec sc:has-as-minor-mesh [ skos:broader mesh:D003922]. }

?rec foaf:primaryTopic ?article. [] sc:article ?article; sc:name ?allele. } order by ?allele

URIs for Alleles
This is largely done; the query to Bjorn is in progress.

I think it would be a good idea to create URIs for alleles; currently they only exist as strings. Just modify the scripts for the medline/alleles, hla, and omim bundles appropriately (make prepare, make cache, make, make export, rdfherd bundle_update - see Bundle development - do not be shy about updating the development instance. A candidate set of URIs is the MHC ontology   http://www.bioinformatics.org/mahco/wiki/ but be sure to ask Bjorn Peters before using it.

''This work is nearing completion. See Bundles/mahco.''

example query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> prefix MaHCO: <http://purl.org/stemnet/HLA#> prefix HLA: <http://purl.org/stemnet/HLA#>

select ?group ?allele where {   ?group rdfs:subClassOf HLA:HLA-A. ?allele rdfs:subClassOf ?group }

''oops... it seems there are 2 hierarchies; we're using the one for dna when we should be using the one for proteins''

also curious: MaHCO is saying that A*68 is a subclass of A*28 This is right; see http://www3.interscience.wiley.com/journal/120148521/abstract

Finding the Entrez gene of a PDB chain via Uniprot
Each PDB chain is correlated to a Uniprot accession, and in the case of HLA chains, these Uniprot accessions are related to an Entrez HLA-XX gene, in Bundles/ipi.

The following query shows that the HLA gene names in Entrez are HLA-A, HLA-B, HLA-C, HLA-DMA, HLA-DMB, HLA-DQA1, HLA-DQA2, HLA-DQB1, HLA-DRA, HLA-E, HLA-G, HLA-H.

browsing with Jmol
see ImmPort/JmolViz

short peptide sequence binding
There is other work in progress to continue, e.g. the databases of peptides that bind to particular HLA variants (IEDB?).

tools for showing off triple store content at a low level
JAR thinks that a rudimentary RDF browser front end would be a good contribution to the demo, just as a way to get them oriented to the nature and diversity of the content.

Appendix: Task List

 * alignment stuff (in progress)
 * get SCOP/CATH/PFAM from SIFTS (in progress)
 * earlier approach: I prototyped a scraping implementation: pfam_scrape.py. Integration with the template stuff should be straightforward. The scraping and fetching should be separated so that the fetched data can be cached for offline scraping.
 * mediawiki query facility
 * I haven't worked with mediawiki much (and I mostly avoid PHP), but I've seen things like this done before, so this looks like a matter of reading documentation, asking around, and gluing things together.
 * help page is the best documentation, according to an aug 2007 blog entry nominated by google. Doesn't look like mediawiki can do the sort of reports I'm thinking of out-of-the-box
 * a Bugzilla Reports extension might make a good starting point; ah... simpler: Feed Import
 * see also Alan's work in Sandbox and http://svn.mumble.net:8080/svn/lsw/trunk/util/mediawiki-bot.lisp
 * note http://wiki.jmol.org/index.php/MediaWiki
 * show query results with Jmol
 * I studied the Jmol/FGiJ javascript enough to think this is pretty straightforward (assuming query results are residues... hmm... maybe I need more information about what queries we're talking about), but javascript is still something of a black art to me, so there may be dragons.