OBO Library Pipeline

Current situation
One of the neurocommons servers (norbert) runs an "OBO library triple store" which combines all of the OWL from all of the ontologies in the OBO library.

Actually there are two versions of this triple store, an "old" one, and a development or "dev" one. These differ in a variety of ways.

And actually there are two triple stores in each case, so that one can be refreshing while the other one stays up and running. This is managed using the load balancing feature of Apache.
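For the record, the Apache side of this presumably amounts to a mod_proxy_balancer stanza along the lines of the sketch below; the balancer name, ports, and paths are illustrative assumptions, not copied from norbert's actual configuration. Taking one member out of rotation lets it be reloaded while the other keeps answering queries.

    <Proxy balancer://obo-sparql>
        BalancerMember http://localhost:8890
        BalancerMember http://localhost:8891
    </Proxy>
    ProxyPass /sparql balancer://obo-sparql/sparql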

The refresh step makes sure that the triple stores always reflect the latest versions of all the ontologies. Currently (Dec 2011) all stores are refreshed from http://www.berkeleybop.org/ontologies/ which Chris M calls the "legacy" area. The "old" store gets the ".owl" files found there, while the "dev" one gets the ".owl2" files.
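In shell terms the fetch amounts to something like the following sketch; the ontology ids and cache directories are placeholders, not the real script:

    #!/bin/sh
    BASE=http://www.berkeleybop.org/ontologies
    mkdir -p cache/old cache/dev
    for ont in obi go pato; do                             # placeholder ids
        wget -q -O "cache/old/$ont.owl" "$BASE/$ont.owl"   # feeds the "old" store
        wget -q -O "cache/dev/$ont.owl" "$BASE/$ont.owl2"  # feeds the "dev" store
    done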

This dependence is a problem, because the berkeleybop tree apparently stopped updating as of 2 Oct 2011.

In addition to the berkeleybop files, the old and dev triple stores contain a few extra ontologies obtained in an ad hoc fashion. These are BFO, IAO, and the NCI thesaurus.

Suggestion
Let's keep the 'old' one running for, say, six months, but consider it a lost cause, focussing instead on evolving the "dev" version to use the output of Chris's "new" pipeline. The fact that the "old" one is out of date will act as a stimulus for everyone to switch over to "new".

Ideal pipeline

 * 1) Get master metadata file from http://obo.cvs.sourceforge.net/viewvc/obo/obo/website/cgi-bin/ontologies.txt   -- or perhaps an XML version of the same, if Chris M is so kind as to create one for us
 * 2) Generate .owl files for appropriate (recent) "new" versions of all OBO ontologies (this is a no-op if the ontology is natively .owl; we need a conversion script if it's in some other format, e.g. .obo). This stage of the pipeline will ideally be performed by Chris's system; we could do it ourselves if we really had to, but I think we should plan on picking up files that Chris generates. See http://purl.obolibrary.org/obo/oboformat
 * 3) Cache .owl files locally ('make cache')
 * 4) Do a cleanup post-pass as needed to fix bad links - currently this is done using 'sed' - consider converting to N-triples first (see the sketch after this list)
 * 5) Put RDFherd 'bundles' in staging area
 * 6) Load up the triple store
 * 7) Start it going.
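To make step 4 concrete: converting to N-triples first means the sed fix-ups see one triple per line rather than arbitrary RDF/XML. A minimal sketch of steps 3-4, assuming rapper (from the raptor toolkit) as the converter; the ontology name "foo" and the particular substitution are made up for illustration:

    mkdir -p cache
    wget -q -O cache/foo.owl http://purl.obolibrary.org/obo/foo.owl   # step 3
    rapper -q -i rdfxml -o ntriples cache/foo.owl > cache/foo.nt      # one triple per line
    # step 4: repair a (hypothetical) obsolete URI prefix
    sed 's|http://purl.org/obo/owl/|http://purl.obolibrary.org/obo/|g' \
        cache/foo.nt > cache/foo-fixed.nt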

Worry
We don't want duplicate definitions, so we need to figure out how to suppress MIREOT-ed definitions. We don't want them anyhow, since for the most part they're extracted from the "old" triple store, not the "new" one, right? This still needs to be worked out.

Note that the Neurocommons system is oblivious to owl:import. I think in one case (NIF?) I wrote a script to process them, but that is the exception - it doesn't apply to, say, OBI.

(Actually we're currently loading the "legacy" OBI generated from the OBI .obo file which in turn is a reduction of the original OBI in OWL - really we want the raw OBI in OWL. Another reason to do this upgrade!)

Current setup
The script is run nightly by the cron daemon as user 'oboadmin'. Starting point is /etc/cron.d/norbert-crontab on norbert.csail.mit.edu. This runs the script /raid/not_backed_up/obo-sparql/service/obo-job.sh.
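An /etc/cron.d entry, unlike a per-user crontab, carries a user field, so the norbert entry has roughly this shape (the schedule shown here is a guess):

    # min hour dom mon dow  user      command
    15 3 * * *  oboadmin  /raid/not_backed_up/obo-sparql/service/obo-job.sh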

obo-job.sh in turn runs obo-sparql/packages/update-obo.sh (or update-obo2.sh, as the case may be). This just runs the standard prepare - validate - cache - bundle steps for each of the bundles going into the build.
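Schematically that is a loop of this shape over the bundles listed further down; the make target names are taken from the step names above, and the exact invocation and bundle list are assumptions rather than a transcript of update-obo.sh:

    for b in obo load-obo/all load-obo/master bfo iao nci-thesaurus; do
        ( cd bundlers/$b &&
          make prepare && make validate && make cache && make bundle )
    done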

The output of each run is sent via email to user 'oboadmin', whose mail is redirected (according to /etc/postfix/aliases) to whoever is interested.
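That is the standard aliases mechanism, i.e. a line of this form (the recipient is a placeholder), with newaliases run after any edit:

    oboadmin: someone@example.org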

The big 'obo' (or 'obo2') bundle is not driven off of Chris's master file; it's driven from an XML version on the "legacy" site that is perhaps no longer maintained.

The following are the RDFherd bundles involved - all under http://svn.neurocommons.org/svn/trunk/bundlers/
 * obo - this is the "old" version
 * load-obo/all
 * load-obo/master
 * obo2 - this is the "dev" version
 * load-obo2/all
 * load-obo2/master
 * bfo - we may have to split this into separate "old" and "new" versions? although ideally it's not special-cased at all in "new"
 * iao - ditto
 * nci-thesaurus

Documentation for the 'bundlers' infrastructure is here: Package Makefile conventions ("bundler" and "package" mean the same thing; a "bundler" is a script - a Makefile, actually - that creates a "bundle" that can be loaded by RDFherd)

Complication number 1: One big bundle or lots of small ones?
The current version of the 'obo' and 'obo2' bundles puts all of the OBO files (other than iao and bfo) in a single 'obo' bundle with multiple graphs.

A different way to do this would be to have a script that generates a separate bundler (i.e. Makefile + Config_template.pl) and bundle for each ontology. Not sure why this would be better. It would make incremental triple store updates easier, but the nightly OBO build doesn't take advantage of incremental update (the whole thing is reloaded from scratch each time).
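If we did go the per-ontology route, the generator could be as simple as the sketch below, stamping out one bundler directory per ontology from a pair of templates. The template files, the @ONT@ placeholder, and the assumption that we have a file with one ontology id per line are all hypothetical:

    #!/bin/sh
    while read ont; do
        d="bundlers/obo-$ont"
        mkdir -p "$d"
        sed "s/@ONT@/$ont/g" templates/Makefile.in           > "$d/Makefile"
        sed "s/@ONT@/$ont/g" templates/Config_template.pl.in > "$d/Config_template.pl"
    done < ontology-ids.txt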

Complication number 2: Upgrade to new version of RDFherd ?
This is desirable, since the new RDFherd contains interesting provisions for bundle metadata, which we could make use of, e.g. in querying the source of the loaded RDF. Also, Alan B put in a lot of effort to make this work, and it would be good to upgrade.
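For instance, with per-bundle graph metadata in place one could ask the endpoint which named graph (and hence which bundle) a given term was loaded from. A hedged sketch; the endpoint URL and the example term are assumptions:

    curl -s http://localhost:8890/sparql \
        --data-urlencode 'query=SELECT ?g WHERE { GRAPH ?g { <http://purl.obolibrary.org/obo/OBI_0000070> ?p ?o } } LIMIT 10'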

But for the OBO build this is more of a luxury than a necessity.

Complication number 3: Error recovery
The current download and build script is very brittle: if there is an error at any stage of processing any ontology, the whole run fails. Really it ought to keep the previous version of the failing ontology and plow ahead updating the others.
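A minimal version of the desired behaviour, sketched for the download step only (ids and URL layout are placeholders as before): fetch into a temporary file and only replace the cached copy on success, logging a warning otherwise.

    BASE=http://www.berkeleybop.org/ontologies
    for ont in obi go pato; do
        if wget -q -O "cache/$ont.owl.new" "$BASE/$ont.owl"; then
            mv "cache/$ont.owl.new" "cache/$ont.owl"
        else
            rm -f "cache/$ont.owl.new"
            echo "WARNING: $ont download failed; keeping previous version" >&2
        fi
    done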

Fortunately the brittleness has not been a problem in practice, but it's something to think about as the code evolves. It's probably not easy to fix without a pretty deep knowledge of how the download system works.