Porting

Nearby: Port Makefile conventions

We are working towards a practice of scientific "knowledge" management in which a large number of information artifacts are available in forms that permit them to combine gracefully. Combination means both merging (pooling similar information to broaden the resources available for query) and joining (bulk linking that enables queries spanning different kinds of information, connected via entities mentioned in both).

Any community practice that would make such mashups routine will need common ways to communicate: shared syntax, names, semantics, and packaging conventions. Such practices constitute a "platform", similar in many ways to an operating system or programming language. In our prototype platform (Neurocommons) we use RDF and OWL as the common syntax, and community-maintained namespaces, such as those from OBO and the Common Naming Project, for names and their semantics.

How to port to the Neurocommons platform
Information can come from any kind of source (see Bundles). Sometimes almost no work needs to be done, as when the information is already rendered in RDF using ontologies compatible with the platform; in other cases the port can be quite involved, requiring the processing of large amounts of data and quite a bit of cleaning and normalization.

Generally speaking, the steps are:
 * 1) Locate and obtain source materials (XML, CSV, RDF, ASN, etc.)
 * 2) Check terms of use
 * 3) Identify desired information elements for capture
 * 4) Extract information
 * 5) Design a "model"
 * 6) Deposit information into model-derived RDF templates
 * 7) Emit RDF files
 * 8) Debug by loading into a triple store and running queries (see the sketch after this list)
 * 9) Create and publish a 'bundle'
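
As an illustration of step 8, here is a minimal debugging sketch in Python using rdflib; the file name and the query are hypothetical placeholders, and a real port would run such queries against a full triple store:

    import rdflib

    # Load an emitted RDF file into an in-memory graph and run a
    # quick sanity-check query. The file name is a placeholder.
    g = rdflib.Graph()
    g.parse("output/records.rdf")

    # Count distinct subjects as a rough check that the expected
    # number of entities was emitted.
    query = """
        SELECT (COUNT(DISTINCT ?s) AS ?n)
        WHERE { ?s ?p ?o }
    """
    for row in g.query(query):
        print("distinct subjects:", row.n)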

Mechanics
For concrete instructions on putting together packages that implement ports, as opposed to general ideas, see Port Makefile conventions.

Getting the source material
Often scripts must be written to copy source material locally for processing. In the simplest case this means copying a single file using wget or curl, specifying the file's URI. In other cases scripts are required to wrangle web interfaces not meant for use by automated processes.
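
As a minimal sketch, the Python equivalent of such a fetch script might look like this (the URL and local paths are hypothetical placeholders):

    import os
    import urllib.request

    # Copy a single source file into a local working directory.
    # The source URL and local path are hypothetical.
    os.makedirs("source", exist_ok=True)
    urllib.request.urlretrieve("http://example.org/data/source.xml",
                               "source/source.xml")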

Terms of use should be checked to ensure that any intended use, such as redistribution, is permitted. We prefer to work with material that is public domain or licensed for redistribution, so that we can add it to our RDF distribution.

Extraction
We generally convert source material to RDF suitable for the Neurocommons platform in two steps. The first step ("extraction") locates interesting information within the source (we often deal with subsets for expediency or size reasons), capturing it either in data structures internal to the porting scripts or, more commonly, as intermediate files in S-expression, XML, or RDF format. The second step is a "modeling" step which applies simple rules to render the captured information in RDF or OWL that makes use of vocabulary and modeling techniques that will promote integration with other ports.

Extraction may involve some amount of text processing, data cleaning, heuristic disambiguation, and so on.
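
For illustration, here is a minimal extraction sketch in Python; it assumes a hypothetical XML source whose record elements carry id and name attributes, and writes a simple CSV intermediate for the modeling step:

    import csv
    import os
    import xml.etree.ElementTree as ET

    # Extraction: locate the interesting elements in the source and
    # capture them in an intermediate file. The element and attribute
    # names are hypothetical.
    os.makedirs("intermediate", exist_ok=True)
    tree = ET.parse("source/source.xml")
    with open("intermediate/records.csv", "w", newline="") as out:
        w = csv.writer(out)
        w.writerow(["id", "name"])
        for rec in tree.iter("record"):
            # Light data cleaning: trim whitespace, drop incomplete rows.
            rid = (rec.get("id") or "").strip()
            name = (rec.get("name") or "").strip()
            if rid and name:
                w.writerow([rid, name])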

We have used a variety of technologies for extraction, including XQuery, LSW, Scheme, and sed. LSW is a very productive programming environment for this purpose, while XQuery is a cleaner language in some ways and is easier to learn. One advantage of using XQuery, especially with XML sources, is the principled treatment of non-ASCII character encodings it inherits from XML.

Modeling
Given a particular modeling approach, it is easy to generate RDF by filling in templates from extracted information. Often the modeling has to be redone as we fix bugs or learn better ways to model the information. An update to the modeling strategy may require reprocessing the extracted information, but shouldn't necessarily trigger re-extraction. Thus a two-stage extract/model process may be more manageable and efficient than going straight to well-modeled RDF in one step.
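
A minimal modeling sketch, continuing the extraction example above; the namespace and the class and property URIs are hypothetical stand-ins for whatever vocabulary a real port would adopt:

    import csv
    import os
    from rdflib import Graph, Literal, Namespace, RDF

    # Modeling: apply a fixed template to each extracted record.
    # The namespace below is a hypothetical placeholder.
    EX = Namespace("http://purl.org/example/record/")
    g = Graph()
    with open("intermediate/records.csv", newline="") as f:
        for row in csv.DictReader(f):
            subject = EX[row["id"]]
            g.add((subject, RDF.type, EX.Record))
            g.add((subject, EX.name, Literal(row["name"])))
    # Rerunning only this step suffices when the model changes.
    os.makedirs("output", exist_ok=True)
    g.serialize(destination="output/records.rdf", format="xml")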

Designing a good representation is difficult, so sometimes we don't. That is, for different ports, we employ any of several general strategies:
 * 1) quick and dirty
 * 2) someone else's (reusing an existing model)
 * 3) methodical

By 'methodical' we mean integrated with other ports in a methodical way - namely, by employing OBOF (OBO Foundry) tactics. Because of the emphasis on community process and design for reuse, the results can be quite good, but there is a price to pay. For expedience we also employ non-OBOF modeling (see for example Bundles/mesh/mesh-skos), with some models coming from outside sources and some from inside the Neurocommons project.

Creating a bundle
Material that is to be used with our RDF installer (including everything that goes into the Neurocommons distribution) needs to be placed in its own directory, with an appropriate configuration file named Config.pl. The directory then becomes what we call a "bundle", or data package.

Documentation on bundle configuration files needs to be written. Here is the configuration file for the Gene Ontology annotations bundle (Bundles/ncbi/goa):

    {
      class   => "RDF Bundle",
      name    => "ncbi/goa",
      version => 4,
      graph   => "http://purl.org/science/graph/ncbi/goa",
    }

RDF files may be organized into subdirectories, and may be provided compressed or uncompressed, in RDF, OWL, or Turtle.

Sometimes the terms of use of the source material require that some kind of notice be presented to those using the material. Such a notice may be placed in a NOTICE file in the bundle directory.
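
For illustration, a complete bundle directory for a hypothetical port might look like this:

    ncbi/example/
      Config.pl
      NOTICE
      rdf/
        part1.rdf.gz
        part2.rdf.gz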

Refresh
All steps in the process should be automated, so that as sources are updated, new bundle versions may be issued.
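
A minimal refresh driver, assuming the steps above live in separate scripts (the script names are hypothetical):

    import subprocess

    # Rerun the whole port end to end so a new bundle version can be
    # issued when the source is updated. Script names are hypothetical.
    for step in ["fetch.py", "extract.py", "model.py"]:
        subprocess.run(["python", step], check=True)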

Comparison with Linked Open Data
Neurocommons places an emphasis on porting to a platform, which includes coordinated choices of URIs and a common "world view". That is, the work required for data integration - bringing about name and model consistency - is supposed to be done up front. The idea is to do the work once and preserve that investment, relieving the integration burden on users of the material (those who would build triple stores and issue SPARQL queries). A wide variety of RDF sources could be, and in fact are, used as inputs to a Neurocommons port, but such a port may involve URI normalization and remodeling, since syntactic conversion to RDF is only part of the effort required for integration.
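
As a minimal sketch of what URI normalization can involve, consider rewriting alternate identifiers for the same entity to a single canonical form (all of the prefixes here are hypothetical placeholders):

    # Rewrite alternate URIs for an entity to the platform's canonical
    # form. The prefixes are hypothetical placeholders.
    CANONICAL_PREFIX = "http://purl.org/example/gene/"
    ALTERNATE_PREFIXES = [
        "http://example.org/gene#",
        "http://example.net/genes/",
    ]

    def normalize(uri: str) -> str:
        for prefix in ALTERNATE_PREFIXES:
            if uri.startswith(prefix):
                return CANONICAL_PREFIX + uri[len(prefix):]
        return uri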