ImmPort/PDB alleles

Many of the HLA structures from PDB have an allele associated with one of their chains. ''Should they all have one? Are all the ones without alleles false positive matches for HLA?''

Unmatched chains
As of 2009-08-27, the query for unmatched chains returns 200 results; the first few are:

http://purl.obolibrary.org/pdb/1A6A/crystal
http://purl.obolibrary.org/pdb/1AO7/crystal
http://purl.obolibrary.org/pdb/1BD2/crystal
http://purl.obolibrary.org/pdb/1BXT/crystal
http://purl.obolibrary.org/pdb/1CG9/crystal
http://purl.obolibrary.org/pdb/1D5M/crystal
http://purl.obolibrary.org/pdb/1D5X/crystal
http://purl.obolibrary.org/pdb/1D5Z/crystal
http://purl.obolibrary.org/pdb/1D6E/crystal
...

String Searching
One approach is to look for substring(pdbseq, alleleseq) where pdbseq is the sequence of a PDB chain and alleleseq is the sequence of an HLA allele.
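As a rough illustration of the substring idea, here is a minimal sketch. It is not the code in matchph.py: the function names are hypothetical, the containment test is tried in both directions because the note above does not say which way it goes, and the rule for collapsing several hits to a 4-digit allele or a 2-digit group is an assumption about how the counts below might arise.

```python
def matching_alleles(pdbseq, alleles):
    """Names of alleles whose sequence contains, or is contained in, pdbseq.

    `alleles` maps an allele name such as "B*0801" to its sequence.
    Checking both directions is an assumption, not the original method.
    """
    return sorted(name for name, seq in alleles.items()
                  if seq in pdbseq or pdbseq in seq)


def truncate(name, digits):
    """Cut an allele name down to its first `digits` digits: B*0801 -> B*08."""
    locus, _, number = name.partition("*")
    return f"{locus}*{number[:digits]}"


def classify(pdbseq, alleles):
    """Collapse the hits for one chain to an allele, an allele group, or nothing."""
    hits = matching_alleles(pdbseq, alleles)
    if not hits:
        return ("no match", None)
    for digits, label in ((4, "allele"), (2, "allele group")):
        names = {truncate(hit, digits) for hit in hits}
        if len(names) == 1:
            return (label, names.pop())
    # Hits spread over several groups; report them all.
    return ("ambiguous", hits)
```

With toy sequences, a chain that sits inside two alleles of the same group comes back as a group match, and a chain that contains exactly one allele sequence comes back as a 4-digit match.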

I cleaned up the code to do this a little bit and started a new bundle (packages/pdbsc/matchph.py):

pdbsc$ python matchph.py

HLA alleles: 3412
PDB chains: 271
unmatched chains 45
1A1M A no match
1A1N A no match
1A1O A no match
...

2 digit matches: 73
1A6A A matches allele group: DRA*01
1A9B A matches allele group: B*35
1A9B D matches allele group: B*35
...

4 digit matches: 153
1AGB A matches allele: B*0801
1AGC A matches allele: B*0801
1AGD A matches allele: B*0801
...

The full output is.

There seem to be 91

Scoring
A search for "python multiple sequence alignment" turned up a student project implementing the Needleman-Wunsch algorithm. Further searching found macpy, which has cleaner code but uses a trivial substitution matrix; it should probably be enhanced to use BLOSUM or the like. It is also slow (several minutes to compare one chain against the relevant alleles), and I'm not at all confident about interpreting the scores.
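For reference, the scoring part of Needleman-Wunsch is small enough to sketch directly. This is not the macpy code; the function name, the default match/mismatch scores and the linear gap penalty are illustrative assumptions. The `score` argument is where a BLOSUM lookup could be plugged in instead of the trivial matrix.

```python
def nw_score(a, b, score=lambda x, y: 1 if x == y else -1, gap=-1):
    """Global alignment score of sequences a and b (Needleman-Wunsch).

    `score(x, y)` is the substitution score for aligning residues x and y;
    `gap` is a linear per-residue gap penalty. Only two rows of the dynamic
    programming table are kept, so memory is O(len(b)).
    """
    # Row 0: aligning the empty prefix of a against prefixes of b is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j, y in enumerate(b, 1):
            cur.append(max(prev[j - 1] + score(x, y),  # align x with y
                           prev[j] + gap,              # gap in b
                           cur[j - 1] + gap))          # gap in a
        prev = cur
    return prev[-1]
```

The running time is proportional to len(a) * len(b) per pair, which in pure Python is consistent with comparisons against thousands of alleles being slow.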

Blast
Summary:


 * structures with allele/group matches: 221
 * chains with allele matches: 209
 * chains with allele group matches: 73
 * chains with ungrouped allele matches: 1
   * 2CII A alleles ['HLA-Cw*0825', 'HLA-Cw*0741', 'HLA-Cw*0703']
 * The chain counts total more than 221 because some structures have more than one HLA chain, e.g. 1HDM A and B.
 * structures with no chains matched to alleles: 52
   * These are candidates to add to the false positive list in the HLA keyword search. The first few are 1GZP, 1GZQ; see ImmPort/Blast Report for the full list.

Detail:

I made a blast database of all 3622 HLA alleles in hla.dat (ImmPort/Blast Report shows how packages/pdbsc/Makefile does this).

Then I made 2418 fasta files, one for each chain from the pdb bundle as found by this query:

PREFIX rdfs:
prefix util:
prefix ro:
prefix IAO:

select distinct ?pdbid ?chain ?seq
where {
  ?record rdfs:label ?pdbid;
          IAO:is_about ?struct.
  ?struct util:has_grain ?complex.
  ?complex ro:has_part ?chain.
  ?chain util:seq ?seq
}
order by ?chain

Then I ran blast on each of the 2418 chains against the alleles.
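The per-chain FASTA files could be written along the following lines. This is a sketch only: the function names, the `PDBID_CHAIN.fasta` file naming and the 60-column line width are assumptions, and the actual blast invocation is left to the Makefile described in ImmPort/Blast Report.

```python
import os


def fasta_record(pdbid, chain, seq, width=60):
    """Format one chain as a FASTA record, wrapping the sequence at `width`."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{pdbid}_{chain}\n" + "\n".join(lines) + "\n"


def write_chain_fasta(outdir, pdbid, chain, seq):
    """Write a single-record FASTA file for one chain and return its path.

    The file name is hypothetical; chain IDs that differ only in case would
    collide on a case-insensitive filesystem.
    """
    path = os.path.join(outdir, f"{pdbid}_{chain}.fasta")
    with open(path, "w") as out:
        out.write(fasta_record(pdbid, chain, seq))
    return path
```

Each row of the query result (?pdbid, ?chain, ?seq) would map to one call of `write_chain_fasta`.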

86 of them failed because they appear to contain RNA sequence data (U...) rather than amino acid sequence data. Looking at 1JJ2.xml.gz showed that chain 9 has type polyribonucleotide rather than polypeptide. Oddly, chain 0 is also a polyribonucleotide, but in that case blast gave no hits rather than failing with a warning. Hence it's not surprising that this query shows 97 results, i.e. more than 86:

Then I eliminated various cases:


 * blast results had no hits
 * less than 50% identities (i.e. matching residues)
 * blast bit scores below 100 (eliminates the case of 9/9 identities, i.e. short peptides)
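The three elimination rules above can be sketched as a filter over parsed blast hits. The `Hit` record and its field names are hypothetical (the notes do not say how the blast output was parsed), and the identity fraction is assumed to be identities divided by alignment length, as in blast's "Identities = 9/9" line.

```python
from collections import namedtuple

# Hypothetical parsed hit: allele name, identical residues, alignment length, bit score.
Hit = namedtuple("Hit", "allele identities align_len bits")


def keep(hit, min_identity=0.5, min_bits=100):
    """True if a hit survives the identity and bit-score cutoffs."""
    return (hit.identities / hit.align_len >= min_identity
            and hit.bits >= min_bits)


def filter_hits(hits):
    """Drop weak hits; an empty result covers the 'no hits' case as well."""
    return [hit for hit in hits if keep(hit)]
```

A 9/9 peptide match passes the identity cutoff but is removed by the bit-score cutoff, which is the case the third rule is meant to catch.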

''Hmm... I didn't count those cases to see that they add up to 2418.''

For an actual HLA chain, this would still leave dozens of matches, the first few with 100% or 99% identities. I eliminated any hits below the top number of identities. This led to either a single allele (modulo nucleotide-only variation) or, except for one case, an allele group (XX*NN), as summarized above.
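Keeping only the hits with the top number of identities might look like the sketch below. The function name and the (allele, identities) pair representation are assumptions, and the numbers in the usage example are made up for illustration rather than taken from the blast report.

```python
def top_identity_hits(hits):
    """Return the allele names that share the highest identity count.

    `hits` is a list of (allele_name, identities) pairs for one chain,
    already filtered for weak matches.
    """
    if not hits:
        return []
    best = max(identities for _, identities in hits)
    return [allele for allele, identities in hits if identities == best]


# Illustrative values only.
example = [("B*080101", 276), ("B*080102", 276), ("B*0802", 275)]
survivors = top_identity_hits(example)
```

Here the two survivors differ only after the fourth digit, which is the "modulo nucleotide-only variation" case: they name the same 4-digit allele.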

For detailed enumeration, see ImmPort/Blast Report.