
= Aligning structures and alleles (pdbsc package) =

The pdbsc package integrates data from IMGT/HLA and PDB to align HLA chains in PDB structures with IMGT/HLA alleles.

For further design rationale, see the “Alignment across alleles and structures” section of Semantic Web Technologies Applied to Interpretation of HLA Structure Variation.

== Usage ==
To make a fasta file with all HLA sequences:

$ python pdbmix.py --hla-fasta hla.dat out.fasta

where hla.dat is the sequence data file from IMGT/HLA; see <tt>hla_to_fasta</tt> for details.

To capture alignment across alleles and structures in RDF:

$ python pdbmix.py chain-blast.xml chain.fasta align_dir out.nt

where

 * chain-blast.xml is an XML blast report of matching this chain against all HLA alleles.
 * chain.fasta is a fasta file containing this chain’s sequence, as per <tt>ncquery.pdb_to_fasta</tt>.
 * align_dir is a directory containing IMGT/HLA alignment data, as per <tt>imgtp.parse_all_loci</tt>.
 * out.nt is where results are to be written, as per <tt>pdbmix_tpl.run</tt> and <tt>streamrdf.StreamKB</tt>.



 * class <tt>pdbmix.</tt><tt>PDBMix</tt> ( bt, fasta, diffs, psns )
 * Provide access to alignment data about an HLA chain from PDB.

Parameters:
 * bt – blast output XML tree, as from <tt>lxml.etree.parse</tt>. (See also: The lxml.etree Tutorial.)
 * fasta – contents of PDBID_CHAIN.fasta

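To illustrate the kind of tree <tt>bt</tt> holds, the sketch below parses a small hand-written fragment in the NCBI BLAST XML layout and lists its hits. It uses the standard library's <tt>xml.etree.ElementTree</tt> in place of lxml so that it runs without extra packages. The sample XML, the allele name and the <tt>hits</tt> helper are made up for illustration and are not part of <tt>pdbmix</tt>.

```python
import xml.etree.ElementTree as ET
from io import StringIO

# Hand-written sample; the values are illustrative, not real BLAST output.
SAMPLE = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>A*0201</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_bit-score>570.5</Hsp_bit-score>
              <Hsp_query-from>1</Hsp_query-from>
              <Hsp_query-to>275</Hsp_query-to>
              <Hsp_hit-from>25</Hsp_hit-from>
              <Hsp_hit-to>299</Hsp_hit-to>
              <Hsp_identity>275</Hsp_identity>
              <Hsp_align-len>275</Hsp_align-len>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def hits(tree):
    """Yield (hit_def, bit_score, identities, align_len) for each HSP."""
    for hit in tree.iter("Hit"):
        name = hit.findtext("Hit_def")
        for hsp in hit.iter("Hsp"):
            yield (name,
                   float(hsp.findtext("Hsp_bit-score")),
                   int(hsp.findtext("Hsp_identity")),
                   int(hsp.findtext("Hsp_align-len")))


tree = ET.parse(StringIO(SAMPLE))
parsed_hits = list(hits(tree))
```
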
 * <tt>blast_matches</tt> ( id_misses, bitmark )
 * Enumerate alleles that match this chain.

Parameters:
 * id_misses – a threshold as per <tt>blast_report.each_score</tt>
 * bitmark – a threshold as per <tt>blast_report.each_score</tt>

Returns: an iterator of (chain, seq, allele, start, end, o, misses) tuples where:
 * chain is this chain’s name
 * seq is this chain’s AA sequence
 * allele is the name of the qualifying allele
 * start, end give the part of this chain that matched
 * o is a list where o[i] is the position, on the reference allele for the matching allele’s locus, that corresponds to position start+i in this chain
 * misses is the difference between the length of the matched sequence and the number of identities, with 0 <= misses <= id_misses

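The role of the <tt>o</tt> list can be sketched as a lookup from chain positions to reference-allele positions. This is a minimal illustration, assuming <tt>start</tt> is 1-based as in the tuple description; the helper name <tt>chain_to_reference</tt> and the sample values are hypothetical.

```python
def chain_to_reference(start, o):
    """Map chain positions to reference-allele positions.

    Assumes o[i] is the reference position for chain position start + i,
    as described for the tuples returned by blast_matches.
    """
    return {start + i: ref_pos for i, ref_pos in enumerate(o)}


# Made-up example: the match starts at chain position 3 and the
# reference numbering skips position 3 (a deletion relative to the reference).
mapping = chain_to_reference(3, [1, 2, 4, 5])
```
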
 * <tt>pdbid</tt>
 * Get ID of the PDB structure containing this chain.


 * <tt>pdbmix.</tt><tt>fix_origin</tt> ( pos, o )
 * Transform pre-PTM position with respect to PTM origin. Allele positions are numbered -24, -23, -22, ... -1, 1, 2, 3, i.e. there’s no 0, and they’re 1-based.

 >>> fix_origin(25, -24)
 1
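One reading of this numbering that agrees with the doctest is sketched below. It assumes <tt>o</tt> is the number given to the first pre-PTM position (for example -24); the name <tt>fix_origin_sketch</tt> is hypothetical and the actual <tt>fix_origin</tt> may be implemented differently.

```python
def fix_origin_sketch(pos, o):
    """Renumber a 1-based pre-PTM position relative to origin o, skipping 0."""
    shifted = pos + o - 1  # pos=1 maps to o itself, e.g. -24
    # Positions at or past the origin move up by one because there is no 0.
    return shifted + 1 if shifted >= 0 else shifted
```

With <tt>o = -24</tt>, positions 1, 24, 25 and 26 map to -24, -1, 1 and 2 respectively.
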


 * <tt>pdbmix.</tt><tt>hla_to_fasta</tt> ( fp, dat )
 * Print allele protein sequences in FASTA format.

Parameters:
 * fp – file open for writing where fasta entries are to be written
 * dat – pathname of hla.dat file as per <tt>imgtp.parse_seq_dat</tt>

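The output side of this function can be sketched as follows, assuming the allele names and protein sequences have already been parsed. The helper <tt>write_fasta</tt>, the line width and the sample records are assumptions for illustration; parsing hla.dat itself is left to <tt>imgtp.parse_seq_dat</tt> and is not shown.

```python
import io


def write_fasta(fp, records, width=60):
    """Write (name, sequence) pairs to fp as FASTA entries."""
    for name, seq in records:
        fp.write(">%s\n" % name)
        # Break long sequences into fixed-width lines, as is conventional.
        for i in range(0, len(seq), width):
            fp.write(seq[i:i + width] + "\n")


# Made-up names and sequences, written to an in-memory file.
buf = io.StringIO()
write_fasta(buf, [("A*0101", "MAVMAPRT"), ("B*0702", "MLVMAPRT")], width=4)
fasta_text = buf.getvalue()
```
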


 * <tt>pdbmix.</tt><tt>main</tt> ( argv )
 * See Usage above.


 * <tt>pdbmix.</tt><tt>offsets</tt> ( base, diff, start, end )
 * Convert diff syntax to chain residue offsets.

Parameters:
 * base – diff of base allele, allele0 (mature part only)
 * diff – diff of related allele, allele1 (mature part only)
 * start – 1-based offset into part of allele1 that blast matched
 * end – 1-based end of part of chain that matched

Returns: a list k such that the i’th position in the sequence for this allele corresponds to the k[i]’th position in the base diff.

First, assuming the whole allele is matched. Note i is 0-based and k[i] is 1-based:

 >>> offsets('MAVM', '', 1, 4)
 [1, 2, 3, 4]

Insertion:

 >>> offsets('MA.M', '--T-', 1, 4)
 [1, 2, 3, 4]

Deletion:

 >>> offsets('MAVM', '--.-', 1, 3)
 [1, 2, 4]

Initial unsequenced segment:

 >>> offsets('MAVMAPRT', '****', 1, 4)
 [5, 6, 7, 8]

Initial and final unsequenced segments:

 >>> offsets('MAVMAPRTMAPRT', '*********', 1, 4)
 [5, 6, 7, 8]

Then matching after an initial segment:

 >>> offsets('MAVMR', '-', 2, 4)
 [2, 3, 4, 5]

Deletion:

 >>> offsets('MAVMVMR', '--.', 2, 5)
 [2, 4, 5, 6, 7]

Deletion outside matched part:

 >>> offsets('MAVMVMR', '--.', 4, 5)
 [4, 5, 6, 7]

Then matching before a final segment:

 >>> offsets('MAVM', '', 1, 3)
 [1, 2, 3]

Deletion:

 >>> offsets('MAVMVM', '--.---', 1, 4)
 [1, 2, 4, 5]

Actual usage that had to be debugged earlier:

 >>> o = offsets('MVVMAPRTLFLLLSGALTLTETWAGSHSMRYFSAAVSRPGRGEPRFIAMGYVDDTQFVRFDSDSACPRMEPRAPWVEQEGPEYWEEETRNTKAHAQTDRMNLQTLRGYYNQSEASSHTLQWMIGCDLGSDGRLLRGYEQYAYDGKDYLALNEDLRSWTAADTAAQISKRKCEAANVAEQRRAYLEGTCVEWLHRYLENGKEMLQRADPPKTHVTHHPVFDYEATLRCWALGFYPAEIILTWQRDGEDQTQDVELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEGLPEPLMLRWKQSSLPTIPIMGIVAGLVVLAAVVTGAAVAAVLWRKKSSD', '--', 25, 277)
 >>> len(o)
 277

Note: This function perhaps belongs in alignment_parser.py.


 * <tt>pdbmix.</tt><tt>rdf_for_chain</tt> ( blastxml, chain_fasta, alignments_dir, outf )
 * Run an RDF-building template over data from an HLA chain.

Parameters:
 * blastxml – filename of blast report for the chain
 * chain_fasta – filename of this chain’s sequence in fasta format
 * alignments_dir – directory name as per <tt>imgtp.parse_all_loci</tt>
 * outf – filename where output is to be written


== blast_report.py – get best alignment matches ==

 * <tt>blast_report.</tt><tt>each_score</tt> ( tree, id_misses=0, bitmark=0.0, minbits=100 )
 * Enumerate best-matching alleles for a chain. Recall from BLAST documentation the notion of identities, i.e. the number of identical residues in an alignment, and bit score, a normalized alignment score.

Parameters:
 * tree – a parsed XML blast report
 * id_misses – threshold on the difference between length of the matched sequence and identities
 * bitmark – threshold on the ratio of a qualifying bit score to the first (and hence highest) bit score in the blast report
 * minbits – threshold on the absolute bit score

Returns: an iterator of (pdbid, chain, allele, bitscore, ids, coords) tuples for each qualifying score. <tt>coords</tt> gives the matched sequence in a tuple of (chain-from, chain-to, allele-from, allele-to).
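How the three thresholds might combine can be sketched as below. The direction of each comparison is an assumption: misses at most <tt>id_misses</tt>, bit score at least <tt>minbits</tt>, and bit score at least <tt>bitmark</tt> times the first score. The helper <tt>qualifying</tt> and the sample hits are hypothetical; the real function works on the parsed XML tree rather than on tuples.

```python
def qualifying(hits, id_misses=0, bitmark=0.0, minbits=100):
    """Yield (allele, bitscore, misses) for hits passing all three thresholds.

    hits: list of (allele, bitscore, identities, align_len), best first.
    """
    if not hits:
        return
    top = hits[0][1]  # first score in the report, assumed to be the highest
    for allele, bits, ids, length in hits:
        misses = length - ids
        if misses <= id_misses and bits >= minbits and bits >= bitmark * top:
            yield allele, bits, misses


# Made-up hits, listed best first as in a blast report.
hits = [
    ("A*0201", 570.0, 275, 275),  # perfect match
    ("A*0205", 565.0, 273, 275),  # 2 misses
    ("B*0702", 400.0, 275, 275),  # low relative to the top score
    ("C*0102", 90.0, 50, 50),     # below minbits
]
kept = list(qualifying(hits, id_misses=2, bitmark=0.9, minbits=100))
```

Here only the first two hits survive: the third falls below 0.9 times the top score and the fourth falls below <tt>minbits</tt>.
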


 * <tt>blast_report.</tt><tt>main</tt> ( argv )
 * Run an exploratory report.


 * <tt>blast_report.</tt><tt>matches_for_structure</tt> ( pdbid, chain_scores )
 * Enumerate matches. For design exploration purposes only.


 * <tt>blast_report.</tt><tt>report</tt> ( files )
 * Report certain match info from blast results. For design exploration purposes only.

== pdbmix_tpl.py – template for alignment of PDB with IMGT/HLA ==

 * <tt>pdbmix_tpl.</tt><tt>eachPos</tt> ( start, end, positions )
 * Enumerate chain positions corresponding to reference allele positions. See tests in coords.doctest for further explanation.


 * <tt>pdbmix_tpl.</tt><tt>run</tt> ( parts, out )
 * Template for alignment data in RDF.

Parameters:
 * parts – a <tt>pdbmix.PDBMix</tt> object that provides access to alignment data
 * out – a <tt>streamrdf.StreamKB</tt> where output is to be sent

Todo: Specify results of this template in pdb bundle documentation.
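Since the results of the template are not yet specified, the sketch below only shows the general shape of an N-Triples line such as might be written to out.nt. The URIs, the predicate name and the helper <tt>ntriple</tt> are entirely hypothetical; they are not the vocabulary used by <tt>pdbmix_tpl.run</tt>, and <tt>streamrdf.StreamKB</tt> is not used here.

```python
def ntriple(subject, predicate, obj, literal=False):
    """Format one N-Triples line; obj is a URI unless literal is true."""
    if literal:
        # Escape backslashes, quotes and newlines inside the literal.
        escaped = (str(obj).replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n"))
        o = '"%s"' % escaped
    else:
        o = "<%s>" % obj
    return "<%s> <%s> %s .\n" % (subject, predicate, o)


# Hypothetical URIs, for illustration only.
EX = "http://example.org/hla#"
line = ntriple(EX + "chain/1A1M_A", EX + "matchesAllele", EX + "allele/A0201")
```
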

== ncquery.py usage ==
To print an assignment of a make variable to the list of chains from the pdb bundle:

$ python ncquery.py --host devsparql.neurocommons.org --makevar PDB_CHAINS

To write chain sequences out in fasta format:

$ python ncquery.py --host devsparql.neurocommons.org --fasta /work


 * <tt>ncquery.</tt><tt>chainq</tt> ( web, host )
 * Enumerate HLA-related PDB polypeptide chains in the Neurocommons KB. See pdb bundle documentation for details.

Returns: a list of (pdbid, chain) tuples for each chain of an HLA-related PDB structure; e.g. <tt>('1A1M', 'A')</tt>


 * <tt>ncquery.</tt><tt>main</tt> ( argv )
 * See Usage above.


 * <tt>ncquery.</tt><tt>ncsparql</tt> ( web, query, cols, host )
 * Query the neurocommons SPARQL server.

Parameters:
 * web – an <tt>urllib.URLopener</tt> for access to the web
 * query – a SPARQL query
 * cols – a list of variable/column names from the SPARQL query
 * host – hostname of neurocommons SPARQL server

Returns: a list of rows where each row is an answer to the query, i.e. a value for each of the variables in <tt>cols</tt>
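The mapping from query results to rows ordered by <tt>cols</tt> can be sketched as below. This assumes the server answers in the W3C SPARQL Query Results XML Format, which the text above does not state; the helper <tt>rows</tt> and the sample response are hypothetical, and no network access is made.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/2005/sparql-results#}"

# Hand-written sample response, for illustration only.
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="pdbid"/><variable name="chain"/></head>
  <results>
    <result>
      <binding name="pdbid"><literal>1A1M</literal></binding>
      <binding name="chain"><literal>A</literal></binding>
    </result>
  </results>
</sparql>"""


def rows(xml_text, cols):
    """Return one tuple per result, with values ordered as in cols."""
    root = ET.fromstring(xml_text)
    out = []
    for result in root.iter(NS + "result"):
        values = {}
        for binding in result.findall(NS + "binding"):
            # Each binding wraps a single <uri>, <literal> or <bnode> child.
            values[binding.get("name")] = binding[0].text
        out.append(tuple(values.get(c) for c in cols))
    return out


answer = rows(SAMPLE, ["pdbid", "chain"])
```
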


 * <tt>ncquery.</tt><tt>pdb_to_fasta</tt> ( web, host, dir_ )
 * Write HLA chains in fasta format files. A query against the Neurocommons pdb bundle provides a list of chains and their sequences.

Parameters:
 * dir_ – directory in which to store fasta files

Output filename format is <tt>1A1M_A.fasta</tt>. See <tt>ncsparql</tt> for <tt>web</tt>, <tt>host</tt> params.

Note: As this module is not really about writing files, this function should probably be moved to the <tt>pdbmix</tt> module.
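The file naming can be sketched as follows, given a chain and its sequence already retrieved from the query. The helper <tt>write_chain_fasta</tt>, the header line format and the sample sequence fragment are assumptions for illustration; only the <tt>1A1M_A.fasta</tt> filename pattern comes from the description above.

```python
import os
import tempfile


def write_chain_fasta(dir_, pdbid, chain, seq):
    """Write one chain to dir_/PDBID_CHAIN.fasta and return the path."""
    path = os.path.join(dir_, "%s_%s.fasta" % (pdbid, chain))
    with open(path, "w") as fp:
        # Header format is assumed; the real header may differ.
        fp.write(">%s_%s\n%s\n" % (pdbid, chain, seq))
    return path


# Write into a temporary directory so the example cleans up after itself.
with tempfile.TemporaryDirectory() as work:
    path = write_chain_fasta(work, "1A1M", "A", "GSHSMRYF")
    name = os.path.basename(path)
    with open(path) as fp:
        content = fp.read()
```
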