Pdb-hla-docs/Hla pkg

= IMGT/HLA Alignment data (hla package) =

Selected parts of the IMGT/HLA database are collected in the hla package.

alignment_parser.py – parse IMGT/HLA alignment data
The IMGT/HLA Database enumerates HLA alleles and provides:


 * amino acid residue aligmnent data in text files, one per locus; e.g. a_prot.txt, alignment data for HLA-A.
 * DNA and amino acid sequence data for all alleles in an hla.dat file

Functions for parsing the data in these files are parse and parse_seq_dat, respectively.

The IMGT/HLA FTP Directory documentation says:

A text file version of all the sequences alignments at both the nucleotide and protein level is provided as a zip file. ... The files in the archives use the following naming conventions:


 * locus_nuc.txt - nucleotide CDS alignment
 * locus_gen.txt - genomic nucleotide alignments
 * locus_prot.txt - protein alignment “

The parse_all_loci function aggregates data from all of the locus_prot.txt files.

We expect to find files for the following loci:

>>> FileLoci ('A', 'B', 'C', 'DMA', 'DMB', 'DOA', 'DOB', 'DPA', 'DPB', 'DQA', 'DQB', 'DRA', 'DRB', 'E', 'F', 'G', 'MICA', 'MICB', 'TAP1', 'TAP2')

Note

FileLoci (i.e. knowledge of which files to look in for alignment data) should perhaps be moved to the Makefile.


 * alignment_parser.apply_diff ( seq, diff, copy=True, keep_deletes=False )
 * Apply IMGT/HLA alignment diff to sequence.Each of the IMGT/HLA sequence alignment conventions is illustrated below:In the sequence alignments the following conventions are used.
 * The entry for each allele is displayed in respect to the reference sequences.
 * Where identity to the reference sequence is present the base will be displayed as a hyphen (-): >>> apply_diff('DTWRT', '-')

'DTWRT'
 * Non-identity to the reference sequence is shown by displaying the appropriate base at that position: >>> apply_diff('DTWRT', '--H--')

'DTHRT'
 * Where an insertion or deletion has occurred this will be represented by a period (.): >>> apply_diff('D.W', '-H-')

'DHW' >>> apply_diff('DTW', '-.-') 'DW' >>> apply_diff('DTW', '-.-', keep_deletes=True) 'D.W'
 * If the sequence is unknown at any point in the alignment, this will be represented by an asterisk (*): >>> apply_diff('DTW', '**-')

'DTW' >>> apply_diff('DTW', '**-', copy=False) '**W'
 * In protein alignments for null alleles, the ‘Stop’ codons will be represented by a hash (X): >>> apply_diff('DTWRT', '--X')

'DT'
 * In protein alignments, sequence following the termination codon, will not be marked and will appear blank.
 * These conventions are used for both nucleotide and protein alignments.


 * alignment_parser.parse</tt> ( fp,  pfx= , post=True'' )
 * Parse IMGT/HLA sequence alignment syntax.{| frame="void" rules="none" frame="void" rules="none"

! Parameters:
 * fp – file with alignment data, open for reading
 * pfx – a prefix to add to allele names
 * post – whether to chop parts other than post-translation
 * post – whether to chop parts other than post-translation

! Returns: a tuple (a, d, p) where
 * a</tt> is a list of allele names (e.g. ‘A*010103’, prefixed by pfx</tt>),
 * d</tt> is a mapping from allele names to sequence diffs, and
 * ;*; p</tt> is a mapping from allele names to the position
 * of the first residue in the sequence, according to the position numbering in the heading.

>>> len(a) 965 >>> a[:5] ['A*01010101', 'A*01010102N', 'A*010102', 'A*010103', 'A*010104'] The first allele is the reference sequence: >>> d[a[0]][:20] 'GSHSMRYFFTSVSRPGRGEP' For other alleles, we get an alignment diff reference to the reference sequence: >>> d['A*0102'][:20] 'S---S---' The first 24 residues of A*0102 are a signal sequence (i.e. not part of the mature protein): >>> p['A*0102'] -24
 * }Diffs are normalized to uppercase.Using some test data, we can see the list of allele names: >>> a, d, p = parse(open('a_prot_test.txt'))


 * alignment_parser.</tt>parse_all_loci</tt> ( aligndir )
 * Parse alignments of all HLA alleles, i.e. in all HLA loci.{| frame="void" rules="none" frame="void" rules="none"

! Parameter: ! Returns: a (a, d, p) tuple as per parse</tt>
 * aligndir – name of a directory where the IMGT/HLA alignments zip file has been unpacked.
 * }We use apply_diff</tt> to make up for reference sequences (in the sense of refseq.lookup</tt>) that do no occur as the first allele in a file.


 * alignment_parser.</tt>parse_seq_dat</tt> ( fp )
 * Iterate over alleles in an IMGT hla.dat file, extracting parts.{| frame="void" rules="none" frame="void" rules="none"

! Returns:
 * an iterator of tuples of (id, name, exons, nucseq, protseq)
 * }Sequences are normalized to uppercase.File format is documented in IMGT/HLA Database User Manual.NotePerhaps another bit of data to return: hla.dat says “partial” in some cases; e.g DE HLA-Cw*010202, Human MHC Class I sequence (partial) It doesn’t say which part, alas.


 * alignment_parser.</tt>perturbations</tt> ( diff, pos )
 * Iterate over perturbations in a diff.{| frame="void" rules="none" frame="void" rules="none"

! Parameters:
 * diff – an IMGT/HLA sequence diff
 * pos – position of the first item in the diff
 * pos – position of the first item in the diff

! Returns: a list of (p, x) pairs for each x in the diff that is not a noop (i.e. * or -) [(7, 'M'), (11, 'Y')] >>> list(perturbations('***---M---Y---', -3)) [(4, 'M'), (8, 'Y')] >>> list(perturbations('*RM---Y---', -3)) [(-2, 'R'), (4, 'M'), (8, 'Y')] NoteThis implementation iterates over each character in a python loop; if performance becomes an issue, the loop could be pushed into a regex.
 * } >>> list(perturbations('***---M---Y---', 1))


 * alignment_parser.</tt>seq_find</tt> ( pat, seq )
 * like python’s .find, but * are wildcards >>> seq_find("123abc", "abc")

3 >>> seq_find("abc", "def") -1 >>> seq_find("123a*c", "abc") 3 NoteThis is only used in <tt>alignemnt_check.py</tt>, an exploratory module not used in production; it should probably be moved there.

Usage
Download <tt>Alignments_Rel_*.zip</tt> from EBI ftp directory and unpack to <tt>align_dir</tt>. Then:

$ PYTHONPATH=../pdb python mkalign.py align_dir out.nt

Note

Positions outside mature protein are discarded, since <tt>parse_all_loci</tt> calls <tt>parse</tt> with post defaulted to <tt>True</tt>.


 * class <tt>mkalign.</tt><tt>Alignments</tt> ( alleles, diffs, psns )
 * Provide access to HLA aligment data.{| frame="void" rules="none" frame="void" rules="none"

! Parameters:
 * alleles – a list of allele names, as from <tt>parse_all_loci</tt>
 * diffs – as from <tt>parse_all_loci</tt>
 * psns – as from <tt>parse_all_loci</tt>
 * psns – as from <tt>parse_all_loci</tt>


 * }
 * <tt>base_positions</tt>
 * Iterate over residues in reference sequences.{| frame="void" rules="none" frame="void" rules="none"

! Returns: an iterator of (a0, char, pos) tuples where
 * <tt>a0</tt> is a reference allele name
 * ;;*; <tt>char</tt> is . or X or a one-letter residue code,
 * as per <tt>alignment_parser.apply_diff</tt>
 * ;;*; <tt>pos</tt> is the position of the residue in
 * sequential coordinates


 * }
 * <tt>bases</tt>
 * Iterate over reference sequences.{| frame="void" rules="none" frame="void" rules="none"

! Returns: an iterator of (locus, a0) pairs as per <tt>refseq.data</tt>
 * }
 * <tt>perturbations</tt>
 * Iterate over perturbations in all alleles.{| frame="void" rules="none" frame="void" rules="none"

! Returns: an iterator of (allele, position, diffchar) where <tt>position</tt> and <tt>diffchar</tt> are as per <tt>alignment_parser.perturbations</tt>
 * }


 * <tt>mkalign.</tt><tt>dump_alignments</tt> ( dir_, out )
 * Parse alignments and write as RDF using a template.The data is passed to the template, <tt>align_tpl.run</tt>, in an <tt>Alignments</tt> object.NoteThe template should perhaps be passed in as a command-line argument, but following You Aren’t Gonna Need It, that’s not yet implemented.{| frame="void" rules="none" frame="void" rules="none"

! Parameters:
 * dir – directory where the zip file has been unpacked, for use in <tt>parse_all_loci</tt>
 * out – filename where N-Triples results should be written.
 * out – filename where N-Triples results should be written.


 * }


 * <tt>mkalign.</tt><tt>main</tt> ( argv )
 * See Usage above.

refseq.py – HLA reference sequences
To access data from the IMGT/HLA table of reference sequences, use functions below such as <tt>lookup</tt> and <tt>data</tt>.

HLA-DRB3 is an exception: the table links to DRB3*010101, but there we see:

Deleted - Please note this has allele name has been deleted

Reason For Deletion: Sequence shown to be identical to DRB3*01010201 (Aug 06)


 * <tt>refseq.</tt><tt>check</tt> ( web, pg='http://www.ebi.ac.uk/imgt/hla/help/align_help.html' )
 * Check that the data in this module is consistent with data published in the web.{| frame="void" rules="none" frame="void" rules="none"

! Raises:
 * ValueError if the check fails.
 * }NoteTo access this function from the command line, use: $ python refseq.py --checkWe should probably use the package Makefile cache pattern instead, though.


 * <tt>refseq.</tt><tt>data</tt>
 * Get reference sequence data.{| frame="void" rules="none" frame="void" rules="none"

! Returns:
 * a list of (locus, a0) pairs where <tt>a0</tt> is the name of the reference sequence for <tt>locus</tt>.
 * }


 * <tt>refseq.</tt><tt>lookup</tt> ( allele )
 * Look up the reference sequence for an HLA allele. >>> lookup('DRA*010201')

'DRA*0101'

allele_names.py – allele name utilities

 * <tt>allele_names.</tt><tt>allele_max</tt> ( a1, a2 )
 * Partially order allele names by information content. >>> allele_max('A*0101', 'A*01')

'A*0101' >>> allele_max('B*0202', 'B*020202') 'B*020202' >>> allele_max("A*0101", "B*0202") ... ValueError: incomparable alleles A*0101 and B*0202


 * <tt>allele_names.</tt><tt>locus</tt> ( allele )
 * Extract locus part of allele name. >>> locus('A*01010101')

'A' Strips the HLA- part off too: >>> locus('HLA-A*01010101') 'A'


 * <tt>allele_names.</tt><tt>pick_allele</tt> ( alleles )
 * Pick a unique 4 digit allele, if possible; else choose a 2 digit group.{| frame="void" rules="none" frame="void" rules="none"

! Parameter: ! Returns: ('DRA*0102', None) Variation in the 2nd and 3rd digits means there’s no unique allele at the amino acid level, so return a 2 digit allele group name: >>> pick_allele(['DRA*0101', 'DRA*010201', 'DRA*010202']) (None, 'DRA*01') If there isn’t even a common allele group, punt: >>> pick_allele(['DRA*0101', 'DRA*020201', 'DRA*020202']) (None, None) Regression test: >>> pick_allele(['B*4405']) ('B*4405', None)
 * alleles – non-empty, sorted list of allele names
 * an (allele, group) pair as shown in examples below
 * }Variation after the 1st four digits are silent mutations; i.e. they vary only in nucleotide sequence, not amino acid sequence. So just pick the common 4 digits and let them stand for the amino acid sequence: >>> pick_allele(['DRA*010201', 'DRA*010202'])


 * <tt>allele_names.</tt><tt>strip_prefix</tt> ( name, pfx )
 * Normalize allele name by stripping prefix. >>> strip_prefix('HLA-A*0101', 'HLA-')

'A*0101' >>> strip_prefix('A*0101', 'HLA-') 'A*0101'


 * <tt>allele_names.</tt><tt>trim_allele</tt> ( allele, digits )
 * Trim digits in an allele name. >>> trim_allele('DRA*010201', 4)

'DRA*0102' >>> trim_allele('DRA*0101', 2) 'DRA*01'