Text processing pilot

We have run neuroscience-related PubMed abstracts through Temis IDE, a commercial text analysis (text mining) tool, and rendered the resulting annotations as RDF. Although similar annotation sets have been created previously (e.g. iHOP), to our knowledge this is the first time that such annotations have been made available to the public for unrestricted use, and one of the first attempts to link text analysis results to the Semantic Web.

Please use the Google discussion group Neurocommons RDF for your questions, suggestions, and problem reports.

Some resources mentioned on this page may be password protected. Please contact Jonathan Rees to obtain access.

Method
We started with the 2007 baseline MEDLINE/PubMed database files, containing 16,120,074 records. After discarding records without abstracts and articles not classified as related to the central nervous system (according to their MeSH annotation), we had 874,727 abstracts. These were fed into Temis IDE equipped with the "biological entity recognizer" (BER) version 2.3. BER was able to perform some degree of processing on 368,688 of the abstracts.

BER categorizes terms and phrases in the input text in various ways (e.g. as a genetic population, chemical entity, or "therapy and analysis"), but the only controlled vocabulary handled by BER 2.3 is one for proteins and genes. As our interests here are data interoperability and information processing by machine, we discarded annotations not related to proteins/genes and their interactions. (We hope to make use of other annotations in future versions.) This filter left 94,381 abstracts containing protein/gene terms recognized by BER.
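
The filtering steps above form a simple funnel, and the share of records surviving each step can be tallied in a few lines of Python. This is only an illustrative calculation over the counts quoted in the text; the stage labels and the `retention` helper are names made up for this sketch and are not part of the pipeline itself.

```python
# Counts quoted in the Method section; the labels are informal descriptions.
stages = [
    ("MEDLINE/PubMed 2007 baseline records", 16_120_074),
    ("CNS-related records with abstracts", 874_727),
    ("abstracts BER could process to some degree", 368_688),
    ("abstracts with recognized protein/gene terms", 94_381),
]


def retention(stages):
    """Return (label, count, fraction of the previous stage) per stage.

    The fraction is None for the first stage, which has no predecessor.
    """
    rows = []
    previous = None
    for label, count in stages:
        fraction = None if previous is None else count / previous
        rows.append((label, count, fraction))
        previous = count
    return rows


for label, count, fraction in retention(stages):
    share = "" if fraction is None else f" ({fraction:.1%} of previous stage)"
    print(f"{count:>12,}  {label}{share}")
```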

Each concept tree generated by BER was pruned to remove information not related to proteins/genes and their interactions, converted to a canonical format, and then rendered as RDF. Each leaf of the RDF concept tree is a protein/gene substance node, and internal nodes are called 'associations.' When an association derives from a grammatical structure that is identified as process-like (containing a designated verb, subject, or object) it is indicated as such by giving it a specialized type ('process' is a subtype of 'association'), and when a particular interaction type is identified (activation, binding, inhibition, interaction, gene expression, regulation, localization) the association node's type is specialized further. Where it has been identified by BER, the functional role played by each participant in a process (effector or target) is recorded. These roles are RDF subproperties of the 'has-participant' property.

The network records 30,187 associations between genes - that is, phrases that combine or relate different genes to one another. Of these, 2,474 have identified interaction-like verbs and may be called "interactions". About 5,500 different genes (or proteins; BER does not distinguish between the two) are identified in the abstracts.

Beyond processes and other associations, the RDF captures additional annotations that relate the annotations to the originating PubMed record and to other data sources. Protein/gene nodes are linked to their identified Entrez Gene and Swissprot records. Author usage - that is, the name used in the abstract to denote the protein/gene - is recorded. Provenance is recorded so that each putative functional relationship can be traced back to the particular abstract and span of text from which it was derived.

All database identifier annotations refer to the human version of the protein/gene, even though the findings frequently concern other species.

Sample annotation file in RDF/XML format derived from abstract of PubMed 15548600 (download RDF/XML)

Schema (ontology) documentation
Here

Availability

 * Download latest version of annotations (single .tgz file, 21 MB compressed, 528 MB uncompressed)
 * The schema, as an OWL ontology (download RDF/XML)

The network is also available for query using this SPARQL endpoint:

http://sparql.neurocommons.org/nsparql/

(As of 5 May 2007, the version served by this SPARQL endpoint is not up to date. The .tgz file named above is more recent.)

The endpoint has a web page front end and also supports CGI-based queries. The graph is called http://sw.neurocommons.org/2007/2007-03-15/pubmed-annotations, which should be specified either as the default graph URI or in the FROM clause of your query. Some sample queries:

What are the properties used?

    SELECT distinct ?p WHERE {?s ?p ?o}

What are all the CNS-related PubMed abstracts that mention Entrez Gene 5999?

    prefix nc: <http://sw.neurocommons.org/2007/annotations#>
    SELECT distinct ?pmid
    WHERE {
      ?pubmed nc:has-id ?pmid.
      ?pubmed nc:has-abstract ?abstract.
      ?span nc:has-context ?abstract.
      ?phrase nc:has-context ?span.
      ?phrase nc:has-nc0.0-interpretation ?ggp.
      ?ggp nc:if-gene-described-by <http://sw.neurocommons.org/2007/entrez-gene/5999>.
    }
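
To ask the same question about a different gene, only the Entrez Gene identifier at the end of the last triple pattern needs to change. The following is a minimal sketch of that substitution in Python. The property names are copied verbatim from the sample query above; the names `QUERY_TEMPLATE` and `abstracts_mentioning_gene` are hypothetical helpers introduced for illustration only.

```python
# Query text taken from the sample above; only the gene ID is parameterized.
# Literal braces are doubled so that str.format leaves them in place.
QUERY_TEMPLATE = """\
prefix nc: <http://sw.neurocommons.org/2007/annotations#>
SELECT distinct ?pmid
WHERE {{
  ?pubmed nc:has-id ?pmid.
  ?pubmed nc:has-abstract ?abstract.
  ?span nc:has-context ?abstract.
  ?phrase nc:has-context ?span.
  ?phrase nc:has-nc0.0-interpretation ?ggp.
  ?ggp nc:if-gene-described-by <http://sw.neurocommons.org/2007/entrez-gene/{gene_id}>.
}}
"""


def abstracts_mentioning_gene(gene_id):
    """Return the SPARQL text for abstracts mentioning one Entrez Gene ID."""
    # Accept only positive integers so nothing unexpected ends up in the URI.
    if isinstance(gene_id, bool) or not isinstance(gene_id, int) or gene_id <= 0:
        raise ValueError("gene_id must be a positive integer")
    return QUERY_TEMPLATE.format(gene_id=gene_id)


print(abstracts_mentioning_gene(5999))
```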

Here is an example of the first query issued as an HTTP GET request to the SPARQL endpoint. Note the default-graph-uri parameter (see above) and the format parameter, which is text/html in this example. To return XML, use application/sparql-results+xml.

http://sparql.neurocommons.org:8890/sparql/?default-graph-uri=http%3A%2F%2Fsw.neurocommons.org%2F2007/2007-03-15/pubmed-annotations&query=SELECT+distinct+%3Fp++WHERE+%7B%3Fs+%3Fp+%3Fo%7D&format=text%2Fhtml
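
A URL like the one above can also be assembled programmatically so that the percent-encoding is handled for you. The sketch below uses Python's standard `urllib.parse.urlencode`. The endpoint address, graph URI, and the three parameter names are taken from the example URL; `build_query_url` is a hypothetical helper, and the code only builds the URL without contacting the server (which may no longer be reachable at this address).

```python
from urllib.parse import urlencode

# Values taken from the example GET request above.
ENDPOINT = "http://sparql.neurocommons.org:8890/sparql/"
GRAPH = "http://sw.neurocommons.org/2007/2007-03-15/pubmed-annotations"


def build_query_url(query, result_format="text/html",
                    endpoint=ENDPOINT, graph=GRAPH):
    """Return a GET URL that submits `query` against the given graph."""
    params = {
        "default-graph-uri": graph,
        "query": query,
        "format": result_format,
    }
    # urlencode percent-encodes reserved characters and turns spaces into "+".
    return endpoint + "?" + urlencode(params)


url = build_query_url("SELECT distinct ?p WHERE {?s ?p ?o}")
print(url)
```

Passing `result_format="application/sparql-results+xml"` requests XML instead of HTML. The resulting string could then be handed to `urllib.request.urlopen` or any other HTTP client.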

Future work
Our long-term goal is to provide resources that form, under different scenarios, a seed, backbone, or prototype for community efforts to annotate scientific literature and link independent data resources in a way that promotes advanced search and computational analytic methods. The Semantic Web is the best available solution for the kind of cooperation, interoperation, and aggregation that are necessary in order to remove data- and knowledge-related friction points in scientific investigations.

Our text analysis work is at a very early stage. Documentation of the ontology is needed. There are undoubtedly design flaws in the schema and in the processing of the text mining files; these will be fixed as we come across them. Analysis of the network is needed, especially measurement of precision and recall. We need to make use of more of the information that BER is surfacing, such as cell and tissue type. Expansion to a larger set of abstracts, to papers from PubMed Central, to GeneRIFs, and to other open access sources is desirable. Practical interoperation with other Semantic Web data sources and schemas, such as bio-zen and BioPAX, is also important.

In the longer run, we hope to use networks like this one in practical analysis of high-throughput experimental data, and also to explore other means, both mechanical and manual, by which annotation sets like this one can be extended, so that the rich content of scientific publications can begin to become accessible to automated manipulation.

Reformatted September 2008. Previous version: http://sw.neurocommons.org/2007/text-mining.html