Semantic resources project/PRO/Lucene Index

Lucene Documents for Proteins
Lucene 2.9.0

Lucene creates indices out of Document objects. Each Document object can have a number of named Field objects associated with it (names do not have to be unique within a single Document). The Fields contain text, and also have several associated binary options:
 * Storage
 * Analysis
 * Part of the Term Vector

I use a custom analyzer ("BioAnalyzer") to analyze our description fields.

Indexing pro.obo
There are four fields immediately associated with each term in a pro.obo release which are available for text indexing:
 * name
 * def
 * synonym
 * xref

For example, the following PRO term has all four relevant fields:

 [Term]
 id: PRO:000000003
 name: HLH DNA-binding protein inhibitor
 def: "A protein with a core domain composition consisting of a Helix-loop-helix DNA-binding domain (PF00010) (HLH), common to the basic HLH family of transcription factors, but lacking the DNA binding domain to the consensus E box response element (CANNTG). By binding to basic HLH transcription factors, proteins in this class regulate gene expression." [PRO:CNA]
 comment: Category=family.
 synonym: "DNA-binding protein inhibitor ID" EXACT []
 synonym: "ID protein" RELATED []
 xref: PIRSF:PIRSF005808
 is_a: PRO:000000001 ! protein

Furthermore, a simple regexp UniProtKB:((?:P|Q)\\d+(?:-\\d+)) can be used to pull UniProt references out of the def field of a PRO term.
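A minimal sketch of applying that regexp in Java follows. The class and method names (UniProtRefExtractor, extract) are hypothetical and not part of the original code. Note that, as written, the pattern only matches accessions that carry a hyphenated numeric suffix, so a bare reference without one is skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for illustration.
class UniProtRefExtractor {
    // Pattern copied from the text; the doubled backslashes are Java string escapes.
    static final Pattern UNIPROT_REF =
        Pattern.compile("UniProtKB:((?:P|Q)\\d+(?:-\\d+))");

    /** Returns every accession captured by group 1, in order of appearance. */
    static List<String> extract(String definition) {
        List<String> accessions = new ArrayList<>();
        Matcher m = UNIPROT_REF.matcher(definition);
        while (m.find()) {
            accessions.add(m.group(1));
        }
        return accessions;
    }
}
```

Calling extract on the text of a def field returns the captured accessions without the "UniProtKB:" prefix, ready to be indexed.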

In the simple mapping, I create one document per protein term. I index name, synonym, and def entries as description fields of the corresponding protein document, and I index xref and any UniProt identifiers parsed from the definition as accession fields. The PRO term identifier itself is stored in the protein-id field of the document.
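The mapping can be sketched roughly as below. To keep the example runnable without Lucene on the classpath, a plain map from field name to values stands in for the Lucene Document. The class SimpleProMapping and its methods are hypothetical, and the stanza parsing is deliberately naive: it does not unescape quoted text or handle every OBO tag.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: a Map stands in for a Lucene Document.
class SimpleProMapping {
    static final Pattern UNIPROT_REF =
        Pattern.compile("UniProtKB:((?:P|Q)\\d+(?:-\\d+))");

    /** Maps the lines of one [Term] stanza to field name -> values. */
    static Map<String, List<String>> toFields(List<String> stanzaLines) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (String line : stanzaLines) {
            int colon = line.indexOf(": ");
            if (colon < 0) continue;  // e.g. the "[Term]" header line
            String tag = line.substring(0, colon);
            String value = line.substring(colon + 2).trim();
            if (tag.equals("id")) {
                add(fields, "protein-id", value);
            } else if (tag.equals("name")) {
                add(fields, "description", value);
            } else if (tag.equals("synonym")) {
                add(fields, "description", quoted(value));
            } else if (tag.equals("def")) {
                String text = quoted(value);
                add(fields, "description", text);
                // UniProt references found in the definition become accessions.
                Matcher m = UNIPROT_REF.matcher(text);
                while (m.find()) add(fields, "accession", m.group(1));
            } else if (tag.equals("xref")) {
                add(fields, "accession", value);
            }
            // Other tags (comment, is_a, ...) are not indexed in this mapping.
        }
        return fields;
    }

    /** Text between the first and last double quote, or the input if unquoted. */
    static String quoted(String value) {
        int first = value.indexOf('"');
        int last = value.lastIndexOf('"');
        return (first >= 0 && last > first) ? value.substring(first + 1, last) : value;
    }

    static void add(Map<String, List<String>> fields, String name, String value) {
        List<String> values = fields.get(name);
        if (values == null) {
            values = new ArrayList<>();
            fields.put(name, values);
        }
        values.add(value);
    }
}
```

In an actual indexer, each entry of the resulting map would become one Field added to the protein's Document, so a term with several synonyms ends up with several description fields.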

Indexing PRO using BioThesaurus
BioThesaurus is a PIR-distributed resource linking text fragments with accession numbers into protein databases. We can use BioThesaurus to match text terms against UniProtKB identifiers; the uniprotmapping.txt map file associated with PRO can then be used to tie UniProtKB identifiers to PRO terms.
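The two-step lookup can be sketched as a composition of two maps. The file formats of BioThesaurus and uniprotmapping.txt are not described here, so this sketch assumes both have already been loaded into memory; the class BioThesaurusLookup, the method proTermsFor, and both map parameters are hypothetical names.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; loading the two resources from disk is not shown.
class BioThesaurusLookup {
    /**
     * Resolves a text term to PRO identifiers by going through UniProtKB.
     * textToUniProt is assumed to come from BioThesaurus,
     * uniProtToPro from uniprotmapping.txt.
     */
    static Set<String> proTermsFor(String text,
                                   Map<String, Set<String>> textToUniProt,
                                   Map<String, Set<String>> uniProtToPro) {
        Set<String> result = new TreeSet<>();
        Set<String> accessions = textToUniProt.get(text);
        if (accessions == null) return result;  // text unknown to BioThesaurus
        for (String accession : accessions) {
            Set<String> proTerms = uniProtToPro.get(accession);
            if (proTerms != null) result.addAll(proTerms);  // skip unmapped accessions
        }
        return result;
    }
}
```

A text term with no BioThesaurus entry, or whose accessions have no PRO mapping, simply yields an empty set.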

Searching with Lucene Phrase Queries
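import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// tokenize(String) is a helper that is not shown here.
public Query createPhraseQuery(String phrase) throws IOException {
    // The whole phrase is matched as a single term against the identifier fields.
    Query t1 = new TermQuery(new Term("protein-id", phrase));
    Query t3 = new TermQuery(new Term("accession", phrase));

    // Every token of the phrase must occur in the description fields.
    BooleanQuery t2 = new BooleanQuery();
    for (String token : tokenize(phrase)) {
        t2.add(new TermQuery(new Term("description", token)), BooleanClause.Occur.MUST);
    }

    // A document matches if any of the three sub-queries matches.
    BooleanQuery query = new BooleanQuery();
    query.add(t1, BooleanClause.Occur.SHOULD);
    query.add(t2, BooleanClause.Occur.SHOULD);
    query.add(t3, BooleanClause.Occur.SHOULD);
    return query;
}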
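The tokenize helper called in createPhraseQuery is not shown on this page. As a stand-in, the sketch below assumes tokens are lowercased runs of letters and digits. That is only an assumption: query tokens have to match what the analyzer emitted at index time, so running the phrase through BioAnalyzer itself would be the safer choice in practice. The class name PhraseTokenizer is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the tokenize helper; not the original implementation.
class PhraseTokenizer {
    /** Lowercases the phrase and splits it on runs of non-alphanumeric characters. */
    static String[] tokenize(String phrase) {
        List<String> tokens = new ArrayList<>();
        for (String piece : phrase.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) {  // a leading separator yields an empty first piece
                tokens.add(piece);
            }
        }
        return tokens.toArray(new String[0]);
    }
}
```

With this stand-in, a phrase such as "HLH DNA-binding protein" is split into four tokens, each of which becomes a required description term in the query.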