Semantic resources project/Antibodies/ProteinNameMining

The initial data from AlzForum, as well as the datasheets that we can scrape or may receive from commercial antibody suppliers, often contain free-text fields describing the specificity of the antibody, the name of the antigen, or the nature of the epitope.

We need to develop a software infrastructure that lets us mine these text fields, in a high-throughput way, to identify protein names and other identifiers. These can then be organized for additional checking or hand-curation, and finally either assembled into a submission to an ontology (e.g., PRO) or linked to existing terms.

Prior Work
This is not a new problem. There are many tools for accomplishing some, or all, of this task available both commercially and for open access on the web. In particular, this problem bears a resemblance to problems that are sometimes described as "automatic annotation," "tagging," or "entity recognition."


 * "Whatizit" : EBI textmining service
 * Neurocommons interface

Resources

 * IPI : International Protein Index
 * Text Extraction Software (National Center for Text Mining)
 * iProLink : PIR (Georgetown University)
 * Protein Names and Word Lists
 * "Guidelines for Protein Name Tagging"

Outline & Implementation
Our initial implementation builds upon the Lucene text-indexing and search engine, written in Java.

The antibodies from AlzForum (~26k entries) are pre-processed and indexed as documents. The following fields are stored as analyzed, searchable fields in a Lucene index:
 * agName
 * datasheetLinkText
 * epitope
 * specificity
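
The indexing step above can be illustrated without Lucene itself. The following is a minimal, standard-library-only sketch of a positional index over the four fields; it is not the project's Lucene code. The class name AntibodyFieldIndex, its methods, the lower-casing of tokens, and the sample record in main are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for the Lucene index: term -> antibody id -> token positions. */
final class AntibodyFieldIndex {
    // Field names taken from the list above.
    static final List<String> FIELDS =
            List.of("agName", "datasheetLinkText", "epitope", "specificity");

    private final Map<String, Map<String, List<Integer>>> postings = new TreeMap<>();

    /** Indexes one antibody record; missing or empty fields are skipped. */
    void add(String antibodyId, Map<String, String> record) {
        int position = 0; // token position across the concatenated fields
        for (String field : FIELDS) {
            String text = record.get(field);
            if (text == null || text.trim().isEmpty()) {
                continue;
            }
            // Whitespace tokenization; lower-casing is an assumption of this sketch.
            for (String token : text.trim().split("\\s+")) {
                postings.computeIfAbsent(token.toLowerCase(Locale.ROOT), t -> new LinkedHashMap<>())
                        .computeIfAbsent(antibodyId, id -> new ArrayList<>())
                        .add(position++);
            }
        }
    }

    /** Returns antibody id -> positions for a single-token term (empty map if absent). */
    Map<String, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Map.of());
    }

    public static void main(String[] args) {
        AntibodyFieldIndex index = new AntibodyFieldIndex();
        // Invented record, for illustration only.
        index.add("ab-1", Map.of("agName", "Tau protein", "epitope", "not tau"));
        System.out.println(index.lookup("tau"));
    }
}
```

Keeping token positions alongside each posting is what later makes it possible to ask where in the antibody text a search term matched.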

The index is then queried with protein name terms drawn from two data sources, associating protein identifiers with antibody records wherever a name term matches:
 * PIR BioThesaurus
 * IPI XRef tables

The resulting mappings, which associate triples of antibody, search term, and protein identifier, are used to generate clusters of protein identifiers associated with each antibody (two protein identifiers are in the same cluster if they match using the same search term). Each search term is associated with one or more physical locations within the antibody document text.
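
As a rough illustration of the clustering described above, the sketch below groups (antibody, search term, protein identifier) triples so that identifiers matched through the same term fall into the same cluster. The names MatchClusters and Match are hypothetical, the identifiers in main are placeholders rather than real accessions, and the sketch assumes the simplest reading of the rule (one cluster per search term, with no merging across terms).

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class MatchClusters {
    /** One mapping: the antibody, the search term that matched, and the protein identifier. */
    static final class Match {
        final String antibodyId;
        final String searchTerm;
        final String proteinId;

        Match(String antibodyId, String searchTerm, String proteinId) {
            this.antibodyId = antibodyId;
            this.searchTerm = searchTerm;
            this.proteinId = proteinId;
        }
    }

    /** Returns antibody id -> search term -> protein identifiers matched through that term. */
    static Map<String, Map<String, Set<String>>> cluster(List<Match> matches) {
        Map<String, Map<String, Set<String>>> clusters = new TreeMap<>();
        for (Match m : matches) {
            clusters.computeIfAbsent(m.antibodyId, a -> new TreeMap<>())
                    .computeIfAbsent(m.searchTerm, t -> new TreeSet<>())
                    .add(m.proteinId);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Placeholder identifiers, for illustration only.
        List<Match> matches = List.of(
                new Match("ab-1", "tau", "ID-A"),
                new Match("ab-1", "tau", "ID-B"),
                new Match("ab-1", "map2", "ID-C"));
        System.out.println(cluster(matches));
    }
}
```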

Next, negative particles are identified within the text:
 * not
 * neither
 * nor

Proteins that match a record through a search term whose physical location is after the location of a negative particle are flagged as negative matches. Negative matches are likely to indicate the known absence of binding or specificity.
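
The negative-match rule can be sketched as follows. This is only an illustration under stated assumptions: positions are whitespace-token indices, punctuation is stripped before comparing a token with the particle list, and only the first negative particle is considered. The class name NegativeMatches and the example sentence are invented.

```java
import java.util.Locale;
import java.util.Set;

final class NegativeMatches {
    // The negative particles listed above.
    static final Set<String> NEGATIVE_PARTICLES = Set.of("not", "neither", "nor");

    /** Returns the token index of the first negative particle, or -1 if there is none. */
    static int firstNegation(String text) {
        String[] tokens = text.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            // Strip punctuation so that "not," still counts as "not".
            String bare = tokens[i].toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
            if (NEGATIVE_PARTICLES.contains(bare)) {
                return i;
            }
        }
        return -1;
    }

    /** A match is negative when its token position falls after a negative particle. */
    static boolean isNegative(String text, int matchPosition) {
        int negation = firstNegation(text);
        return negation >= 0 && matchPosition > negation;
    }

    public static void main(String[] args) {
        // Invented example sentence; "tau" is token 2 and "MAP2" is token 6.
        String text = "Reacts with tau but not with MAP2";
        System.out.println("tau negative?  " + isNegative(text, 2));
        System.out.println("MAP2 negative? " + isNegative(text, 6));
    }
}
```

Because the rule looks only at position, every match after the first particle is flagged, which is why the results are treated as likely rather than certain indications of non-binding.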

Pre-processing the AlzForum Text

 * Converting Unicode character references into Greek-letter names: &#945; (i.e. α) is converted into alpha
 * Removing HTML tags, and inserting spaces at <BR> and <P> tags.
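
The two pre-processing steps above might look roughly like this. It is a sketch, not the project's actual code: the class name AlzForumTextCleaner is hypothetical, the Greek-letter table covers only four letters for brevity, and the final whitespace collapsing is an extra assumption.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlzForumTextCleaner {
    // Illustrative subset: decimal code points of a few Greek letters and their names.
    static final Map<Integer, String> GREEK =
            Map.of(945, "alpha", 946, "beta", 947, "gamma", 948, "delta");

    private static final Pattern NUMERIC_REF = Pattern.compile("&#(\\d{1,7});");
    private static final Pattern BREAK_TAG = Pattern.compile("(?i)</?(br|p)\\b[^>]*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String clean(String html) {
        // Replace <BR> and <P> tags with a space first, so adjacent words do not fuse.
        String text = BREAK_TAG.matcher(html).replaceAll(" ");
        // Drop every remaining tag without inserting a space.
        text = ANY_TAG.matcher(text).replaceAll("");
        // Swap known numeric character references for Greek-letter names.
        Matcher m = NUMERIC_REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = GREEK.get(Integer.parseInt(m.group(1)));
            // Unknown references are left untouched.
            m.appendReplacement(out, Matcher.quoteReplacement(name != null ? name : m.group()));
        }
        m.appendTail(out);
        // Collapsing runs of whitespace is an assumption of this sketch.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("A&#946;<BR>peptide <b>N</b>-terminus"));
    }
}
```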

Custom Lucene Analyzer
Performance of the Lucene indexing and search features depends on the ability of Lucene to split, clean, and stem text fields into indexable terms.

Right now we use the Lucene WhitespaceAnalyzer. We are working on a protein-specific analyzer to respect some of the typographic conventions of protein naming.