ImmPort/Sequence features

Taking a look now at the spreadsheets supplied by SWMC, to figure out how to reproduce some of this work leveraging our tools and ontologies.

The terminology is sort of painful for someone not immersed in it... I review it here mainly for my own benefit:
 * We have regions of the genome corresponding more or less to single transcribed products; these have come to be called genes (as in 'Entrez Gene') even though the usage differs somewhat from the original meaning in Mendelian genetics. But the spreadsheet uses the more principled term locus, and I'll follow that practice.
 * For any locus (such as HLA-DRB1) there are many alleles - an allele being a particular sequence, found at the locus, that occurs in nature (i.e. in some person's genome).
 * But the word allele is also used to refer to sequences that are only partial sequences at a locus - that is, there are many IMGT/HLA 'allele' records that are only partial sequences - they don't account for the entire protein product. The rest of the allele in the population possessing the partial sequence is presumed to be coincident with what you find in the dominant allele, but in theory it could be something else, or the missing portion could even show variation within that subpopulation, leading to multiple complete alleles corresponding to a single partial allele.  This is confusing.
 * A residue is an amino acid residue within a protein (or, more generally, one monomer unit within any polymer chain).
 * There are many positions within a locus. Since we're working exclusively at the protein level, we'll take a position to be a residue position in a (the) protein product of the dominant allele.  Positions can be applied to protein products of other alleles through alignment with the dominant allele product; this sometimes means the creation of new positions when the new product has an insertion relative to the dominant product.
 * We artificially define a sequence feature (SF) of a locus to be an arbitrary set of (residue) positions in that locus. The usage is justified by the fact that some allele, such as the dominant one, may have some interesting physical or functional feature, such as a turn or loop, at a position or set of positions.  Note that since we define features to be sets, the space of features is closed under union and intersection.  In particular, individual positions, as well as the entire locus, are sequence features.
 * A sequence feature variant type (SFVT) is the sequence of amino acids, one for each position in the SF, found at those positions in some allele.
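To make the last two definitions concrete for myself, here is a minimal sketch in Python. It assumes an allele's protein product is available as a mapping from position (in the reference coordinate system) to one-letter residue code. The names `turn`, `strand`, `sfvt`, and `toy_allele` are invented for illustration, and the residues are made up rather than taken from IMGT/HLA.

```python
# A sequence feature (SF) is modelled as a frozenset of 1-based residue positions.
turn = frozenset({33, 34})
strand = frozenset({34, 35, 36})

# Closure under union and intersection comes for free from set operations.
combined = turn | strand   # positions 33, 34, 35, 36
overlap = turn & strand    # position 34 only


def sfvt(residues, feature):
    """Return the variant type: the residues at the feature's positions, in position order.

    `residues` maps position -> one-letter amino acid code for one allele's
    protein product, already aligned to the reference coordinate system.
    """
    return "".join(residues[p] for p in sorted(feature))


# Hypothetical residues for one made-up allele (not real IMGT/HLA data).
toy_allele = {33: "N", 34: "Q", 35: "E", 36: "Y"}

print(sfvt(toy_allele, turn))      # NQ
print(sfvt(toy_allele, combined))  # NQEY
```

A dict keyed by position (rather than a plain string) is just one possible choice; it leaves room for partial alleles, where some positions are simply absent.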

(Of course our knowledge is incomplete, and "the population" is in practice usually some subset of the entire human population - could be the population whose alleles are known in IMGT/HLA so far, or some study population.)

Example of a sequence feature and its SFVTs (from Nishanth): DRB1 feature {33,34} has four variant types: NQ, NR, KQ, and HQ.

Another example: if the sequence feature is the entire locus, then each (complete) allele is a SFVT.

Reference sequences for the HLA loci are given here: http://www.ebi.ac.uk/imgt/hla/help/align_help.html

Now we can talk about the set of possible SFVTs for a SF - the naturally occurring variation for that SF. If the set has cardinality 1, there is no variation in the population.
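A rough sketch of that computation, assuming the alleles of interest are held in memory as position-to-residue mappings. The allele names and the residue at position 35 are invented; positions 33 and 34 are filled in only so that the result reproduces the four variant types quoted above for DRB1 {33,34}.

```python
def variant_types(alleles, feature):
    """Return the set of SFVTs observed for `feature` across all `alleles`."""
    positions = sorted(feature)
    return {
        "".join(residues[p] for p in positions)
        for residues in alleles.values()
    }


# Made-up allele names and residues (illustration only, not IMGT/HLA data).
toy_alleles = {
    "toy-1": {33: "N", 34: "Q", 35: "E"},
    "toy-2": {33: "N", 34: "R", 35: "E"},
    "toy-3": {33: "K", 34: "Q", 35: "E"},
    "toy-4": {33: "H", 34: "Q", 35: "E"},
    "toy-5": {33: "N", 34: "Q", 35: "E"},  # shares its SFVT with toy-1
}

observed = variant_types(toy_alleles, {33, 34})
print(sorted(observed), len(observed))  # ['HQ', 'KQ', 'NQ', 'NR'] 4

# Cardinality 1 means no variation at this feature in this (toy) population.
print(len(variant_types(toy_alleles, {35})) == 1)  # True
```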

Sources of sequence features - we've already talked about this - they can be functional, structural, or some combination
 * contact points (be sure that positions are in the coordinate system of the canonical allele)
 * PDB annotation (SCOP/CATH/...)
 * UniProt ? (do these annotations come from some other source?)
 * Pfam
 * Nishanth

The tables
Positions (in the spreadsheets) are given as positions in the "mature protein": position 1 is the first residue in the protein after the signal peptide has been trimmed off. This is not the same as position 1 in the DNA, which immediately follows the start codon. ''Problem: How to determine the length of the signal peptide? This length is needed in order to convert between BLAST (hla.dat) coordinates and the spreadsheet coordinates. It's in the *_prot.txt alignment file, but does it occur anywhere else?''
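Once the signal peptide length is known, the protein-level conversion is just an offset. The sketch below assumes that the other coordinate system counts residues from the first residue of the full-length (precursor) protein, which I have not verified for hla.dat. The function names are my own, and `SIGNAL_LENGTH` is an arbitrary placeholder; the real value has to be looked up per locus.

```python
def mature_to_precursor(position, signal_length):
    """Convert a 1-based mature-protein position to a 1-based precursor position."""
    if position < 1:
        raise ValueError("mature-protein positions start at 1")
    return position + signal_length


def precursor_to_mature(position, signal_length):
    """Convert a 1-based precursor position to a 1-based mature-protein position."""
    if position <= signal_length:
        raise ValueError("position lies inside the signal peptide")
    return position - signal_length


SIGNAL_LENGTH = 20  # arbitrary placeholder, for illustration only

print(mature_to_precursor(33, SIGNAL_LENGTH))  # 53
print(precursor_to_mature(53, SIGNAL_LENGTH))  # 33
```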

Two of the spreadsheets (PosSeqFeatures and SeqFeatures, which look mostly the same to me) record these cardinalities for a number of SFs. Each SF is described textually. The SFs include all positions (singleton SFs) that show variation, as well as some SFs that are intersections of structural and functional features (not sure how these are generated, but probably a cross product).

Another spreadsheet (PosSeqFeat_Motif_Vars) has one row per SFVT for the chosen SFs, giving a symbol to designate each one. Column G of each row lists all the alleles that have that SFVT.

There is a huge table (in a .txt file) giving a cross product, with one row per IMGT/HLA allele and one column per SF. Individual entries are SFVTs.
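For orientation, here is how I imagine the big table and the per-SFVT allele lists relate to each other. This is only a sketch: the feature names (`SF1`, `SF2`), allele names, and residues are all invented, and I don't know how SWMC actually generated the files.

```python
from collections import defaultdict

# Hypothetical sequence features: name -> set of mature-protein positions.
features = {"SF1": {33, 34}, "SF2": {35}}

# Hypothetical alleles: name -> {position: residue}.
alleles = {
    "toy-1": {33: "N", 34: "Q", 35: "E"},
    "toy-2": {33: "N", 34: "R", 35: "E"},
    "toy-3": {33: "N", 34: "Q", 35: "D"},
}


def sfvt(residues, positions):
    """Residues at the given positions, concatenated in position order."""
    return "".join(residues[p] for p in sorted(positions))


# The big table: one row per allele, one column per SF, entries are SFVTs.
table = {
    allele: {name: sfvt(residues, positions) for name, positions in features.items()}
    for allele, residues in alleles.items()
}

# Inverting the table gives, for each (SF, SFVT) pair, the alleles carrying it,
# which is the kind of list the per-SFVT spreadsheet keeps in column G.
carriers = defaultdict(list)
for allele, row in table.items():
    for name, variant in row.items():
        carriers[(name, variant)].append(allele)

print(table["toy-2"])            # {'SF1': 'NR', 'SF2': 'E'}
print(carriers[("SF1", "NQ")])   # ['toy-1', 'toy-3']
```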

Using SPARQL to obtain some of this information
TBD

Alan at one point started to encode SFs in OWL classes, to enable the kinds of queries we might want to do, exploiting the set-nature of SFs and therefore the kind of inference done by Pellet and similar reasoners.

But it's not clear this would be of any help. Fantasize that we had a database in RDF that gave the residue occurring at every location of every HLA allele: residue aa particular-aa; allele particular-allele; location particular-location. (or something of that ilk). We could define sequence features as classes of locations, starting with singleton location sets and building them up using union (and down using intersection). But OWL, lacking products (in the mathematical sense), doesn't give us any way to map over aggregates (a SF being an aggregate). For example, how would one write a query to ask: for a given SF, what are all of its SFVTs? Or, for a given SFVT, what alleles have that SFVT? Or, given an allele, by which SFs is it distinguished from the reference allele?

Alternative approach: Precompute SFVT / allele incidences outside of the triple store, then deposit the result into the triple store for access.
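A minimal sketch of the deposit step, assuming the incidences have already been computed as a mapping from (SF name, SFVT) to a list of allele names. It writes N-Triples lines by hand using only the standard library. The namespace `http://example.org/sfvt/` and the predicate names `variantOf`, `residues`, and `hasVariantType` are placeholders I made up, not an agreed vocabulary.

```python
from urllib.parse import quote

BASE = "http://example.org/sfvt/"  # hypothetical namespace


def iri(*parts):
    """Build an N-Triples IRI term from path parts, percent-encoding each part."""
    return "<" + BASE + "/".join(quote(p, safe="") for p in parts) + ">"


def incidence_triples(carriers):
    """Yield N-Triples lines for a {(sf_name, sfvt): [allele, ...]} mapping."""
    for (sf_name, variant), allele_names in sorted(carriers.items()):
        node = iri("variant", sf_name, variant)
        yield f"{node} {iri('variantOf')} {iri('feature', sf_name)} ."
        yield f'{node} {iri("residues")} "{variant}" .'
        for allele in allele_names:
            yield f"{iri('allele', allele)} {iri('hasVariantType')} {node} ."


# Made-up incidences, standing in for the precomputed result.
toy_carriers = {
    ("SF1", "NQ"): ["toy-1", "toy-3"],
    ("SF1", "NR"): ["toy-2"],
}

lines = list(incidence_triples(toy_carriers))
print("\n".join(lines))
```

With triples shaped like this, "what alleles have this SFVT?" reduces to a single triple pattern on the placeholder `hasVariantType` predicate, and "what are the SFVTs of this SF?" to one on `variantOf`.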

It may still be useful to have an "ontology" of SFs as described above as a way to help navigate this information.