Semantic resources project/Antibodies/NIF Antigen Mapping

NIF generously donated a dump of their antibody database table through read-only access to their PostgreSQL database. We offered to help them normalize the protein names (indicating antibody specificity) in their database and to match those names to PRO and UniprotKB identifiers where available using our Lucene index of PRO and the PIR Biothesaurus text dataset.

This page gives an overview of our work: the original dataset, our pre-processing, the Lucene indexing, the query strategy, and the final results.

NIF Database
The NIF antibody database dump is a single relational table containing over 600,000 rows; it was downloaded on June 11th, 2010.

The table contains the following fields:

itemid item vendor price catalognumber quantity applications conjugate form reactivity type antigen isotype antigenspecies clonenumber hostspecies antigensynonyms description product_url gene_target_url specific_info supplier_info original_item_name

After examining the datafile, we determined that one field would be the easiest single point of entry for systematically mapping specificities into Uniprot and PRO identifiers: antigen.

Although they are sometimes left blank, the antigensynonyms and original_item_name fields may also provide useful names or strings which we can use to double-check our results.

Finally, the antigenspecies field may help narrow down species-independent PRO terms to the species-specific subclasses.

Antigen Dataset
From the extracted NIF antibody database, I created a list of all unique entries -- normalized to upper case -- in the antigen field, along with a count of how many times each entry appeared in the original database.



Here is an example section of the file, with a list of antigens that all appear twice in the database:

 2 ACTIVE CASPASE 2
 2 ACTIVE CASPASE 7
 2 ACTIVE CASPASE 8
 2 ACTIVE GHRELIN
 2 ACTIVE<SUP>&#X00AE;</SUP> JNK, (PTPPY)
 2 ACTIVIN B
 2 ACTIVIN BETA
 2 ACTIVIN RECEPTOR IB
 2 ACTIVIN RECEPTOR II
 2 ACTIVIN RECEPTOR IIA
 2 ACTIVIN RECEPTOR IIB
 2 ACTIVIN RECEPTOR TYPE 1 (ACRV1)
 2 ACTIVIN RECEPTOR TYPE 1B (ACVR1B), CENTER
 2 ACTIVIN RECEPTOR TYPE 1C (ACVR1C), CENTER
 2 ACTIVITY-REGULATED CYTOSKELETON-ASSOCIATED PROTEIN
 2 ACTL6A / ACTL6B
 2 ACTR / AIB1
 2 ACTR-IIA
 2 ACTR5
 2 ACTR6
 2 ACVR2A, ACVR2B
 2 ACYL-COA-BINDING DOMAIN-CONTAINING PROTEIN 5 (ACBD5)
 2 ACYLATION STIMULATING PROTEIN
 2 ACYP1 (ISOFORM A)
 2 ACZONIN (PCLO)
 2 ADAD1

Code for reading and parsing the file is in file: CountedAntigens.java
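
The counting step described above can be sketched as follows. This is a minimal illustration only, not the contents of CountedAntigens.java: the class name <tt>AntigenCounter</tt>, its method names, and the decision to skip empty antigen fields are all assumptions made for the example.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: counts unique antigen strings after upper-casing. */
final class AntigenCounter {

    /** Returns upper-cased antigen -> occurrence count, sorted by antigen name. */
    static Map<String, Integer> count(List<String> rawAntigens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String raw : rawAntigens) {
            if (raw == null || raw.isEmpty()) {
                continue; // assumption: empty antigen fields are skipped
            }
            // Upper-casing is the only normalization applied at this stage.
            String key = raw.toUpperCase(Locale.ROOT);
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    /** Formats one "count antigen" line, in the layout of the listing above. */
    static String formatLine(String antigen, int count) {
        return count + " " + antigen;
    }
}
```

A sorted map is used here so that the output comes out in alphabetical order, as in the example listing; any other ordering would work equally well for the later lookup.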

Apart from being converted to upper case, the entries are not cleaned at all; in particular, some of them still contain markup fragments ("<SUP> ... </SUP>") or HTML entities. Other idiosyncrasies were also observed:
 * Greek letter names are not normalized; some are given as text ("ALPHA") and some are listed using Unicode entities ("&#954;")
 * Dashes are used both as protein-name components ("ACAT-1") and as word-separators ("ACID-SENSING").
 * Dashes within protein names are not normalized: "ACAT-1" vs. "ACAT1"
 * "Terminal identifiers" are not normalized: "N-TERM" vs. "N TERMINAL" vs. "N-TERMINUS", along with all permutations of dashes or abbreviating periods.
 * Some antigen identifiers include fragments, such as epitope region indicators, which we don't expect to match any existing standardized protein name, e.g. "AA 408-420"
 * Commas, periods, slashes, and parentheses are all used alternately as qualifiers or simply word-separators, and
 * Protein names are sometimes given in "array" notation, e.g. "<tt>BAGE-1, -2, -3, -4, -5, -6, -7</tt>", indicating multiple proteins with a common prefix.

Some of these problems could be avoided in our work with the AlzForum antibody database, since the antigen records there were (relatively) more normalized. In the case of pulling protein names out of free text, we could rely on a human annotator to manually highlight the relevant protein name section for lookup. In this case, however, we have 100,000+ unique antigen names and we wish to use each one as a complete query to the protein name database. We want to be able to discover accurate Uniprot and PRO entries, even in the presence of un-normalized terms or where only the prefix of the entry is a meaningful protein name.

PRO / Biothesaurus Index
Biothesaurus is a database, created and distributed by PIR, which maps text fragments to Uniprot (and other protein databank) identifiers. We extracted these text fragments and their corresponding UniprotKB identifiers, and created a Lucene index to facilitate text searching for proteins. Each UniprotKB identifier is entered into the index as a single <tt>Document</tt>, with a <tt>Field</tt> corresponding to each associated text fragment.

However, the text in each text field must be analyzed by the Lucene indexer in order to satisfy later queries, and this analysis must match the analysis of the query text. To this end, we implemented a general string rewriting engine, which attempts to fix (through regular-expression matching) a subset of the normalization problems noted above:
 * separating "enjambed" Greek letters (e.g. "<tt>PKBALPHA</tt>" becomes "<tt>PKB-ALPHA</tt>")
 * normalizing indexed protein names (e.g. "<tt>ABIN-2</tt>" becomes "<tt>ABIN2</tt>")
 * recognizing and expanding protein arrays, and
 * normalizing epitope fragment identifiers.

The source code for this rewriter is: RewritingReader.java
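
To make the first three rewriting steps listed above concrete, here is a minimal regular-expression sketch. It is not the contents of RewritingReader.java: the class <tt>AntigenRewriter</tt>, the exact patterns, the short list of Greek letter names, and the order in which the rules are applied are all assumptions chosen for illustration. Epitope fragment normalization is left out, since the text does not say what the normalized form looks like.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical regex rewriter; NOT the contents of RewritingReader.java. */
final class AntigenRewriter {

    // "BAGE-1, -2, -3": a prefixed name followed by one or more bare "-N" items.
    private static final Pattern ARRAY =
            Pattern.compile("\\b([A-Z][A-Z0-9]*)-(\\d+)((?:\\s*,\\s*-\\d+)+)");
    private static final Pattern ARRAY_ITEM = Pattern.compile("\\s*,\\s*-(\\d+)");

    // A Greek letter name glued to the end of a token (illustrative subset only).
    private static final Pattern ENJAMBED_GREEK =
            Pattern.compile("(?<=[A-Z0-9])(ALPHA|BETA|GAMMA|DELTA|KAPPA)\\b");

    // "ABIN-2": a name, a dash, then a numeric index.
    private static final Pattern INDEXED =
            Pattern.compile("\\b([A-Z][A-Z0-9]*)-(\\d+)\\b");

    /** "BAGE-1, -2, -3" becomes "BAGE-1, BAGE-2, BAGE-3". */
    static String expandArrays(String s) {
        Matcher m = ARRAY.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String prefix = m.group(1);
            StringBuilder expanded = new StringBuilder(prefix + "-" + m.group(2));
            Matcher item = ARRAY_ITEM.matcher(m.group(3));
            while (item.find()) {
                expanded.append(", ").append(prefix).append('-').append(item.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(expanded.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** "PKBALPHA" becomes "PKB-ALPHA". */
    static String separateGreek(String s) {
        return ENJAMBED_GREEK.matcher(s).replaceAll("-$1");
    }

    /** "ABIN-2" becomes "ABIN2". */
    static String normalizeIndexed(String s) {
        return INDEXED.matcher(s).replaceAll("$1$2");
    }

    /** Applies the three rules; arrays are expanded before dashes are dropped. */
    static String rewrite(String antigen) {
        return normalizeIndexed(separateGreek(expandArrays(antigen)));
    }
}
```

Whatever the concrete rules are, the same <tt>rewrite</tt> step has to be applied both to the indexed text fragments and to the antigen query strings, so that the two sides produce comparable terms.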

The source code for the Lucene index creation routine is: ProteinIndexer.java; the resulting index takes more than two hours to create and is 2.3 GB in size. It is not posted here, but is available upon request.

Antigen Queries
In order to construct a query against the Lucene index for a particular antigen string, we first rewrite the string using the <tt>RewritingReader.java</tt> methods, as above. This rewritten string is then used to construct two simultaneous queries:
 * 1) an exact term match against any accession numbers stored in the Lucene index, and
 * 2) a weighted prefix phrasal query for the text fields in the index.

The first query ensures that, if the antigen name is simply a known protein database accession number or identifier, we can retrieve the corresponding Uniprot identifier through exact matching.

The second query is more complicated: it is a custom query designed to identify partial matches among the Lucene index text fields for each protein document. Take the antigen "<tt>ACTIVIN RECEPTOR TYPE 1B (ACVR1B), CENTER</tt>" as an example. One simple strategy would be to use a Lucene <tt>PhraseQuery</tt> on all the (normalized) terms from this string -- however, Lucene phrase queries require a complete set of terms to match, and not all the terms in this string may correspond to terms in a protein name. For example, it's likely that the protein names "<tt>ACTIVIN RECEPTOR TYPE 1B</tt>" and "<tt>ACVR1B</tt>" each individually appear in the Biothesaurus, but it is unlikely that they will appear in the same string, and furthermore neither may have the "<tt>CENTER</tt>" fragment in them (which is specific to this description of an antigen). Both of these details will cause the standard Lucene <tt>PhraseQuery</tt> to fail at matching this name to the index.

An alternative strategy would be to simply divide the original antigen name into terms, and combine these terms using the Lucene <tt>BooleanQuery</tt>. However, this suffers from the opposite problem of retrieving too much -- in this case, one of the terms ("<tt>RECEPTOR</tt>") will match thousands of proteins in the index, causing our query to retrieve far too many hits.

Instead, we opt for a third (custom) query method, halfway between the two options. We create a custom <tt>BooleanQuery</tt> with the following features:
 * 1) all individual terms are optional matches, across any text field associated with a protein document,
 * 2) terms which are earlier in the antigen string will be weighted more heavily (so matching "<tt>ACTIVIN RECEPTOR</tt>" will count more than matching "<tt>CENTER</tt>"),
 * 3) the set of terms must match from the beginning of the antigen phrase, so that matching "<tt>RECEPTOR</tt>" counts for nothing unless "<tt>ACTIVIN</tt>" also matches, and
 * 4) Greek-letter terms in the query are required, not optional, matches.

We call the resulting query strategy the "weighted prefix phrasal query", and we use it for all our antigen matching.
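
The four features above can be modelled outside Lucene as a simple scoring function, which may help to see how they interact. The sketch below is only a model under stated assumptions: it does not build a Lucene <tt>BooleanQuery</tt> and it is not the code in ProteinSearcher.java. The class <tt>PrefixPhrasalScorer</tt>, the linear weights, and the short list of Greek letter names are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical scoring model of the "weighted prefix phrasal query". */
final class PrefixPhrasalScorer {

    // Illustrative subset of Greek letter terms that are treated as required.
    private static final Set<String> GREEK =
            new HashSet<>(Arrays.asList("ALPHA", "BETA", "GAMMA", "DELTA", "KAPPA"));

    /**
     * Scores one protein document against the ordered query terms.
     * docTerms is the union of the terms from all text fields of the document
     * (feature 1: a term may match in any field, and no single term is required).
     */
    static double score(List<String> queryTerms, Set<String> docTerms) {
        // Feature 4: every Greek letter term in the query must be present.
        for (String term : queryTerms) {
            if (GREEK.contains(term) && !docTerms.contains(term)) {
                return 0.0;
            }
        }
        double total = 0.0;
        int n = queryTerms.size();
        for (int i = 0; i < n; i++) {
            // Feature 3: only an unbroken run of matches from the start counts.
            if (!docTerms.contains(queryTerms.get(i))) {
                break;
            }
            // Feature 2: earlier terms weigh more (assumed weights n, n-1, ...).
            total += n - i;
        }
        return total;
    }
}
```

Under these assumed weights, the six-term query for "<tt>ACTIVIN RECEPTOR TYPE 1B (ACVR1B), CENTER</tt>" gives a document containing the first four terms a score of 6 + 5 + 4 + 3 = 18, while a document containing "<tt>RECEPTOR</tt>" but not "<tt>ACTIVIN</tt>" scores 0.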

The complete source code for the query construction and index search is contained in the source file: ProteinSearcher.java ; the complete search for all unique antigens required nearly 3 hours on the "<tt>ashby.csail.mit.edu</tt>" server. The results were formatted and are presented in the following section.

Results
The complete set of mapped results is given in this file:



Three example lines from the output file (slightly re-formatted for better view in this wiki):

 ADA2        PRO:000016011   A1C6B2|A1DGY7|A3LQF2|A7ASE6|B3L3T9|B6AC88|B7XHU2|B7XQA0|B8NB21|Q17J15|Q17J16|
                             Q17J17|Q2LKW8|Q2LKW9|Q4UHX0|Q4XMA2|Q4Z6G8|Q5CHZ0|Q5DX44|Q84KH2|Q86TJ2|Q8IJP9|
                             Q8S9F8|Q9P7J7|Q9U6L1
 ADA2 BETA   PRO:000016011   B3KSN0|B3KX99|B5DFL8|Q503N9|Q86TJ2
 ADA2-BETA   PRO:000016011   B3KSN0|B3KX99|B5DFL8|Q503N9|Q86TJ2

The file is tab-separated, and contains three columns:
 * 1) The <tt>antigen</tt> field, corresponding to the antigen list created earlier.  Each unique antigen name occurs once in this <tt>mapped_antigens.txt</tt> output file.
 * 2) The set of PRO identifiers; multiple identifiers are separated by the "|" symbol.
 * 3) The set of Uniprot identifiers; multiple identifiers are separated by the "|" symbol.
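
For anyone consuming the output file, a line in the three-column layout described above could be read along these lines. This is a sketch only: the class <tt>MappedAntigen</tt> is hypothetical, and it assumes that an antigen with no matches still has its (empty) second and third columns.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder for one line of the mapped-antigen output file. */
final class MappedAntigen {
    final String antigen;
    final List<String> proIds;      // ordered as in the file
    final List<String> uniprotIds;  // ordered as in the file

    private MappedAntigen(String antigen, List<String> proIds, List<String> uniprotIds) {
        this.antigen = antigen;
        this.proIds = proIds;
        this.uniprotIds = uniprotIds;
    }

    /** Parses "antigen TAB pro1|pro2 TAB up1|up2|...". */
    static MappedAntigen parse(String line) {
        // A limit of -1 keeps trailing empty columns (assumed for unmapped antigens).
        String[] cols = line.split("\t", -1);
        List<String> pro = splitIds(cols.length > 1 ? cols[1] : "");
        List<String> uniprot = splitIds(cols.length > 2 ? cols[2] : "");
        return new MappedAntigen(cols[0], pro, uniprot);
    }

    private static List<String> splitIds(String column) {
        String trimmed = column.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\|"));
    }
}
```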

There are at most 25 Uniprot entries in the third column -- this was an arbitrary cutoff, determined (by experiment) as sufficiently large to capture most exact PRO matches where available, but small enough to keep the total query time for all antigens under three hours. The existence of exactly 25 Uniprot entries may indicate that more matches would be returned if the threshold were set higher; these antigen records are candidates for further query if necessary.
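
Records that hit the cutoff can be picked out mechanically. The following sketch (the class <tt>CutoffCheck</tt> and its method names are hypothetical) counts the identifiers in the third column and flags exactly 25 as a possible truncation:

```java
/** Hypothetical check for results that reached the 25-entry cutoff. */
final class CutoffCheck {
    static final int MAX_HITS = 25; // the cutoff stated in the text

    /** Counts the "|"-separated identifiers in the Uniprot column. */
    static int countIds(String uniprotColumn) {
        String col = uniprotColumn.trim();
        return col.isEmpty() ? 0 : col.split("\\|").length;
    }

    /** True when the column holds exactly MAX_HITS identifiers. */
    static boolean atCutoff(String uniprotColumn) {
        return countIds(uniprotColumn) == MAX_HITS;
    }
}
```

The "<tt>ADA2</tt>" example line above lists exactly 25 Uniprot identifiers, so it would be flagged, whereas the two "<tt>ADA2 BETA</tt>" variants, with five identifiers each, would not.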

The PRO identifier field contains the set of all PRO identifiers that correspond to any of the Uniprot identifier matches; this correspondence is determined by matching the Uniprot identifiers against the PIR-distributed uniprotmapping.txt file for PRO.
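
The lookup from Uniprot matches to PRO identifiers might look roughly like the sketch below. The layout of uniprotmapping.txt is not described here, so the mapping is passed in as an already-loaded map; the class <tt>ProLookup</tt>, the one-PRO-identifier-per-Uniprot-identifier simplification, and the first-seen ordering are all assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical lookup from ranked Uniprot matches to PRO identifiers. */
final class ProLookup {

    /**
     * Returns the distinct PRO identifiers for the given Uniprot identifiers,
     * in the order in which they are first encountered.
     * uniprotToPro stands in for the contents of uniprotmapping.txt.
     */
    static List<String> proIdsFor(List<String> rankedUniprotIds,
                                  Map<String, String> uniprotToPro) {
        Set<String> seen = new LinkedHashSet<>();
        for (String uniprotId : rankedUniprotIds) {
            String proId = uniprotToPro.get(uniprotId);
            if (proId != null) {
                seen.add(proId); // duplicates are ignored by the set
            }
        }
        return new ArrayList<>(seen);
    }
}
```

Because the Uniprot identifiers arrive ranked by score, keeping first-seen order means the PRO identifiers come out ranked by their best-scoring Uniprot match.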

The terms in both the Uniprot and PRO columns are listed (from left to right) in sorted order according to query result score, from highest to lowest.

Of the 107,456 unique antigens in the input file, 101,870 (94.8%) were mapped to at least one Uniprot identifier; of these identified antigens, 71,526 (70.2%) have at least one corresponding PRO identifier.

 NIF $ gunzip -c mapped_antigens.txt.gz | wc -l
 107456
 NIF $ gunzip -c mapped_antigens.txt.gz | cut -f 2,3 | perl -e 'while(<>) { chomp; if(length($_) > 1) { print length($_), "\n"; } }' | wc -l
 101870
 NIF $ gunzip -c mapped_antigens.txt.gz | cut -f 2 | grep PRO | wc -l
 71526
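
The same three counts, and the percentages quoted above, can be reproduced in Java. This is a sketch that mirrors the shell commands on lines already read into memory; the class <tt>MappingStats</tt> is hypothetical, and it assumes every line contains all three tab-separated columns.

```java
import java.util.List;

/** Hypothetical re-computation of the summary counts from the output lines. */
final class MappingStats {
    final int total;      // all lines (wc -l)
    final int mapped;     // lines with any identifier in columns 2 or 3
    final int withPro;    // lines whose column 2 contains "PRO"

    private MappingStats(int total, int mapped, int withPro) {
        this.total = total;
        this.mapped = mapped;
        this.withPro = withPro;
    }

    static MappingStats of(List<String> lines) {
        int mapped = 0;
        int withPro = 0;
        for (String line : lines) {
            String[] cols = line.split("\t", -1);
            String col2 = cols.length > 1 ? cols[1] : "";
            String col3 = cols.length > 2 ? cols[2] : "";
            // Mirrors `cut -f 2,3` plus length > 1: more than the bare tab.
            if (col2.length() + col3.length() > 0) {
                mapped++;
            }
            // Mirrors `cut -f 2 | grep PRO`.
            if (col2.contains("PRO")) {
                withPro++;
            }
        }
        return new MappingStats(lines.size(), mapped, withPro);
    }

    static double percent(int part, int whole) {
        return 100.0 * part / whole;
    }
}
```

With the counts above, <tt>percent(101870, 107456)</tt> rounds to 94.8 and <tt>percent(71526, 101870)</tt> rounds to 70.2, matching the figures in the text.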