2007 prototype queries

This page needs to be updated to be compatible with the September 2008 Neurocommons RDF distribution.

This page is a descendant of the page on the ESW wiki. It originally captured useful queries and other information related to the triple store in the lead-up to the 2007 Banff demo. The intention is that these queries continue to run on the latest version of the Neurocommons, and that the information be kept current with the version of the store we use. Please report any discrepancies to our [mailto:neurocommons-rdf@googlegroups.com mailing list].

We are using Virtuoso. In some cases these notes document useful Virtuoso extensions so that we don't forget how to use them.

Our model is that we load valid (as best we can determine) OWL-DL into the triple store. However, we have limited inference capabilities, so we will be adding relations, including non-OWL-DL relations, so that the queries become less verbose, or even possible at all, given the store's current capabilities.

Creating class-level relations for easier querying of the GO
4/17/2007 (AlanRuttenberg)

I had loaded


 * http://www.berkeleybop.org/ontologies/obo-all/biological_process/biological_process.owl
 * http://www.berkeleybop.org/ontologies/obo-all/cellular_component/cellular_component.owl
 * http://www.berkeleybop.org/ontologies/obo-all/molecular_function/molecular_function.owl

into the graph http://purl.org/commons/hcls/

The intention was to take the OWL restrictions relating classes via part_of links and create direct links between the classes, so that they would be easier to query. This uses a Virtuoso extension that lets one do an insert into a named graph.

I put the results into http://purl.org/commons/hcls/classrelations

Now I can do the following which queries across graphs.
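A cross-graph query of this shape (a sketch; the go: and ro: prefix expansions and the example class URI are assumptions, not taken from this page):

```sparql
PREFIX go:   <http://purl.org/obo/owl/GO#>
PREFIX ro:   <http://www.obofoundry.org/ro/ro.owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# labels come from the ontology graph; the direct part_of links
# come from the generated class-relations graph
SELECT ?part ?label
FROM <http://purl.org/commons/hcls/>
FROM <http://purl.org/commons/hcls/classrelations>
WHERE {
  ?part ro:part_of go:GO_0005634 .   # hypothetical example class
  ?part rdfs:label ?label .
}
```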

Note that the class-level relations wouldn't be valid OWL-DL, though they would be valid OWL 1.1 given the current spec, because of punning.

Here is how you can count how many results a query would return. (Virtuoso extension)
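A sketch of the counting idiom (at the time, count(*) was a Virtuoso extension; aggregates were not yet standard SPARQL):

```sparql
SELECT count(*)
FROM <http://purl.org/commons/hcls/classrelations>
WHERE { ?s ?p ?o }
```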

(btw, the purl.org/commons URLs are not real - just seeing what they look like)

Information about the GO evidence codes
Not all the codes have definitions, hence the OPTIONAL clause. This sort of thing is a common gotcha: you don't realize that a property doesn't always have a value (as you might expect), so you don't make it optional and miss the rest of the information. Note that the FROM clause, once I followed the Virtuoso instructions to allow it to go to the net to retrieve files, sucked the file into the triple store the first time the query was run.
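A sketch of the shape of such a query; the source document URL and the definition property are placeholders, not the actual ones used:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?code ?label ?def
FROM <http://example.org/go-evidence-codes.rdf>   # placeholder source document
WHERE {
  ?code rdfs:label ?label .
  # not every code has a definition, so this pattern must be OPTIONAL
  OPTIONAL { ?code skos:definition ?def }
}
```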

List all named graphs
This one isn't SPARQL proper, but can be useful for debugging. It is a SQL procedure that can be run from isql. Credit: Orri Erling of Virtuoso.

create procedure all_graphs ()
{
  declare i any;
  declare g varchar;
  result_names (g);
  i := iri_id_from_num (0);
  while (1)
    {
      i := (select top 1 g from rdf_quad where g > i);
      if (i is null)
        return;
      result (id_to_iri (i));
    }
}

I think this ought to be available from SPARQL directly...

Chimezie 2007/04/20 I couldn't possibly agree any more. It leads to problems that I've mentioned before:

1. Declarative GRAPH evaluation / operators?
2. No way to specify an RDF dataset of all the known named graphs

Information about a specific GO term
Here I'm asking for the label, synonyms, definition, and comment for GO_0000118. It's annoying that SPARQL has no mechanism to bind a variable, so we have to repeat go:GO_0000118 four times in this query.
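A sketch of the verbose form; the annotation property names are assumptions, based on the oboInOwl vocabulary commonly used in OBO OWL conversions:

```sparql
PREFIX go:       <http://purl.org/obo/owl/GO#>
PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>

SELECT ?label ?synonym ?def ?comment
WHERE {
  go:GO_0000118 rdfs:label ?label .
  OPTIONAL { go:GO_0000118 oboInOwl:hasExactSynonym ?synonym }  # assumed property
  OPTIONAL { go:GO_0000118 oboInOwl:hasDefinition ?def }        # assumed property
  OPTIONAL { go:GO_0000118 rdfs:comment ?comment }
}
```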

... or { ?go rdfs:label ?name. ...  FILTER (?go = go:GO_0000118) } ... though the engine may naively apply the FILTER late and therefore examine more than it needs to.

In an OWL-aware reasoner, such as Pellet, one could write ?class owl:sameAs go:GO_0000118

And then use ?class the other 3 times. I've asked the Virtuoso folks if it is possible to implement this behavior, but I don't know if it would conform to the SPARQL spec which is based on RDF semantics.

Looking for something in an OBO ontology
The places to look would be the URI, label, comment, and definition (perhaps also the synonyms). Here I'm looking for anything that mentions "locat".
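A sketch of such a search over two of those places; the URI, definition, and synonyms could be added as further UNION branches:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?thing ?text
WHERE {
  { ?thing rdfs:label ?text }
  UNION { ?thing rdfs:comment ?text }
  # case-insensitive substring match
  FILTER regex(str(?text), "locat", "i")
}
```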

First query against MESH
We loaded up the skos version of MESH from http://thesauri.cs.vu.nl/eswc06/, as well as terms generated by JonathanRees for the qualified narrower terms (all terms x all allowable qualifiers for each). This finds information about the MESH term Phenylethylmalonamide. The union is handy here, since it lets us get both things that point to this record as well as things it points to without getting the cross product in the result.
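A sketch of the union idiom; the record identifier here is a placeholder, not the actual MESH id for Phenylethylmalonamide:

```sparql
PREFIX mesh: <http://purl.org/commons/record/mesh/>

# both directions, without forming a cross product in the result
SELECT ?p ?other
WHERE {
  { mesh:XXXXXXX ?p ?other }   # outgoing links -- placeholder id
  UNION
  { ?other ?p mesh:XXXXXXX }   # incoming links
}
```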

Some points to note:
 * Jonathan generated the relationship between the qualified term and the base term using "narrower", but the MESH translation uses "broader". With no inference support for inverses, we need to either a) regenerate to use broader, b) use an "insert into" to add the inverses, or c) remember to query both ways.
 * Broader and narrower are transitive. I need to try the code to compute transitivity. That will be next.
 * There are not yet prefLabel or scopeNote for the generated terms.
 * The mesh *properties* are in the same namespace as the records, which is wrong. We need to adjust that.
 * This takes the purl.org naming scheme out for a ride. The namespace for MESH is http://purl.org/commons/record/mesh/ . No redirects are set up yet. The base names in the files we downloaded from thesauri.cs.vu.nl were adjusted (they used an NCBI base, which is a domain they don't own).

Pubmed to MESH
Just loaded some test data to check the conversion. This query finds a pubmed record's associated MESH terms. The pubmed URL does not use a qname because qnames can't start with a number. I've sent a note to some SPARQL types asking if SPARQL could relax the rules for qnames in queries. There shouldn't be any issues - there is no ambiguity that I can see.
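A sketch of the shape; the sc: prefix expansion and the property name are assumptions, formed by analogy with the sc:has-as-minor-mesh property mentioned later on this page:

```sparql
PREFIX sc: <http://purl.org/science/owl/sciencecommons/>

SELECT ?mesh
WHERE {
  # full URI instead of a qname, since a qname can't start with a digit
  <http://purl.org/commons/record/pmid/10548451> sc:has-as-major-mesh ?mesh .
}
```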

Distinction between an article and a record about an article
The article is not the record about it. There can be many records about the same article. We define a class called journal article, instances of which can be identified by records, as in the following:

We made up the name http://purl.org/science/article/pmid/10548451 for the article. Jonathan points out that DOIs and SICIs also name articles, so we need to think more about what we name these. For now the recommendation is to search for articles as above, by reference to the pmid, rather than by explicitly using our names for the articles. The alternative is to use owl:sameAs when those names start being of interest in our system.

Performance
4/22/2007 AlanRuttenberg

We're trying to load the mapping of MESH to Pubmed, which Jonathan estimated at 200M triples (actual: 206987720). The total Turtle is 8+ GB, divided into 500 chunks. Performance wasn't great, and I've been working with Ivan Mikhailov at Virtuoso, who has helped speed it up a lot. So this is to record what we've done in case someone else runs into the same problem. Note that we started out with their demo database, which wasn't tuned at all.


 * Use TTLP_MT instead of TTLP to load. The former is multithreaded and has other optimizations.
 * Change parameters in demo.ini. Our machine has 4G RAM, so they recommended:
 * NumberOfBuffers = 250000
 * MaxDirtyBuffers = 200000
 * MaxCheckpointRemap = 250000
 * If we have further performance problems, he asks that we send a note with the following information:
 * output of 'top'.
 * output of 'iostat -k'
 * The result of query select top 10 * from SYS_D_STAT order by TOUCHES desc;. This will give you top 10 of most popular indexes to write. The column TOUCHES in row where KEY_TABLE='DB.DBA.RDF_QUAD' will give you the number of inserted/deleted triples. If you run the query twice with 10-second interval then the difference of TOUCHES will give you performance (triples added/deleted during this 10 seconds).
 * The result-set of procedure call status('');. This will provide general statistics in lines between 'Database Status' and 'Current backup timestamp'.
 * While loading a lot of stuff, consider dropping the index RDF_QUAD_PGOS. It can be recreated later with create bitmap index RDF_QUAD_PGOS on DB.DBA.RDF_QUAD (P, G, O, S);. Might be important if we do queries like ?s ro:part_of ?o
 * Total speedup was about 4-5X for loading time, plus a substantial reduction in pauses for checkpointing. We now appear to be IO limited. The recommendation is to get a second disk and use striping to increase throughput. Probably worth getting a machine with 2 or 4 fast disks and striping across them (Virtuoso striping, not RAID) if you were buying a machine solely for doing this sort of thing.
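A sketch of a loading call from isql under the settings above; the file path and graph name are placeholders:

```sql
-- multithreaded Turtle load of one chunk into a named graph
DB.DBA.TTLP_MT (file_to_string_output ('/data/pubmesh/chunk-001.ttl'),
                '', 'http://purl.org/hcls/pubmesh');
checkpoint;
```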

Adding the inverse of a property
When loading up the SKOS version of MESH, we ran into a little blip. The relationships between the unqualified headings used skos:broader, but the relationships between the unqualified and the qualified headings that Jonathan created used skos:narrower. If we had a reasoner on the RDF store, this wouldn't matter, since the properties are inverses of each other. But we would prefer to have at least one property that we can use consistently in our queries. We'll choose one, and then use a SPARUL update to invert the other. But which one?

Since there are so many more skos:narrower relationships, we'll choose that one, choosing to add another 32K triples rather than another 500K.
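In Virtuoso's SPARUL dialect of the day, the inversion could be sketched like this (the graph name is an assumption):

```sparql
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# add the skos:narrower inverse of every skos:broader triple
INSERT INTO GRAPH <http://purl.org/commons/hcls/20070416>
  { ?b skos:narrower ?a }
FROM <http://purl.org/commons/hcls/20070416>
WHERE { ?a skos:broader ?b }
```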

Now when we think about doing a query which includes the pattern ?a skos:broader ?b, we will instead write it as ?b skos:narrower ?a. Of course this isn't complete: if we forget and use skos:broader we'll miss results, and if there were a super-property of skos:broader, then that triple wouldn't be inferred.

546922 is a lot of headings! (The number of headings is equal to the number of relations since all we loaded initially were the relations). I was wondering whether they are all used. Since we have now loaded the mapping of MESH terms to Pubmed abstract, I can ask that question:
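A sketch of that question; the property name and its sc: prefix expansion are assumptions:

```sparql
PREFIX sc: <http://purl.org/science/owl/sciencecommons/>

# how many distinct MESH terms annotate at least one Pubmed record?
SELECT count(distinct ?mesh)
WHERE {
  ?record sc:has-as-major-mesh ?mesh .
}
```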

Pretty impressive - almost 400,000 of them have been used to mark up at least one paper!

Gene Ontology annotations
Human, Mouse and Rat annotations are loaded now - 12154410 triples - too many! My first query discovered why (and took a little while to understand)

To step back: I am representing a protein by reference to a protein record (they are different things, after all). The protein record is well defined. The protein isn't. So we have to provide a definition, which I will take to be something like "polypeptides that have an amino acid sequence that is described by the sequence field in such-and-such protein record, or are derived from such a polypeptide". This probably isn't quite what is mostly thought of as that protein - for instance, proteins with some variations that don't affect function might also be considered in the class. With that caveat, I now have to find some way to say this sort of thing in OWL. The choice was to model it as a restriction:

Class(protseq:uniprotkb.P04637 partial
    Restriction(sc:has_peptide_sequence_described_by hasValue(uniprot:P04637)))

This says every instance of protseq:uniprotkb.P04637 (the protein) sc:has_peptide_sequence_described_by the record uniprot:P04637 (this should probably be "complete" rather than "partial"). Note that the protseq: namespace is intended to indicate that these protein classes are defined in terms of sequence (not that they *are* protein sequences).

But I also added a triple to say the type.

Class(protseq:uniprotkb.P04637 partial sc:protein)

Here's where my tool took some liberties. It grouped the two expressions together and made a single intersection of them, since that's what the semantics are in OWL. The problem with that is that a) I didn't realize it, and b) intersectionOf uses an rdf:List to hold the elements, forcing me to learn the Turtle for expressing lists (parentheses around them), and complicating what would otherwise be simpler queries. One way to "degunk" this would be to do the intersection inference: ?a rdfs:subClassOf (intersection ?c ?d) -> ?a rdfs:subClassOf ?c, ?a rdfs:subClassOf ?d.

A second issue is that I created the expression for the protein each time it was cited in a GO annotation. Because they only differ in the blank nodes, they are effectively merged by a reasoner (think of the RDF entailment rules). However since this store doesn't do RDF entailment they are repeated, increasing the number of triples. There are two possible solutions. The easy one would be for me to generate them only once. That's what I'll do in the next version. The harder one, since it requires support of the loader, is to consider each expression with blank nodes, when loading, as first a query with the blank nodes as variables. Only if the query fails do you load the triple. However I can't quite figure out how to define the bounds of the expression.

Anyways, we can still do interesting queries until then: How many proteins are annotated?
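Given the restriction-based modeling above, the count might be sketched as follows (prefix expansions assumed; this ignores the intersectionOf complication discussed above):

```sparql
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sc:   <http://purl.org/science/owl/sciencecommons/>

# count protein classes that are subclasses of a hasValue restriction
# on sc:has_peptide_sequence_described_by
SELECT count(distinct ?protein)
WHERE {
  ?protein rdfs:subClassOf ?r .
  ?r owl:onProperty sc:has_peptide_sequence_described_by .
  ?r owl:hasValue ?record .
}
```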

All in all I think I'll just regenerate the OWL.

Gene Ontology annotations reworked
I reworked the generation of the protein records in two ways.
 * Only wrote the protein definitions once.
 * Where possible flattened owl:intersectionOf into separate axioms for each of the conjuncts.

With these changes, the number of triples drops to 6.7M and the query becomes much simpler:
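The simplified query might be sketched as follows; with the flattened axioms, the restriction hangs directly off the class (prefix expansions assumed):

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX sc:  <http://purl.org/science/owl/sciencecommons/>

SELECT count(distinct ?protein)
WHERE {
  ?protein owl:equivalentClass ?r .
  ?r owl:onProperty sc:has_peptide_sequence_described_by ;
     owl:hasValue ?record .
}
```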

I changed subClassOf to equivalentClass because that more accurately reflects the definition I chose (discussed above).

Deleting a large graph
So it turns out I had loaded part of the Pubmed to MESH mapping (22M triples) into a graph with a different name, so I wanted to get rid of them. I tried the SPARUL drop graph <http://purl.org/hcls/pubmesh> method, but sensibly, I suppose, it wouldn't let me do that - I didn't have the right permissions. When I tried it in isql as sparql drop graph <http://purl.org/hcls/pubmesh> I got a message that suggested that it didn't understand what I meant. So looking in the documentation I found the following, to execute from isql:

delete from DB.DBA.RDF_QUAD where G = DB.DBA.RDF_MAKE_IID_OF_QNAME ('http://purl.org/hcls/pubmesh');

However, I didn't think through the idea of having a transaction that large. The bottom line is that it's taking quite a while to execute, and is consuming disk space and virtual memory (the latter is more of an issue). I think there should be a lighter-weight way of dropping a graph, so I sent a note off. We'll see. In the meantime, it's useful to know not to do this if you plan to use the database in the next few hours.

Note that the Virtuoso Performance Tuning documentation section discusses these issues, saying, among other things, ''Always break batch updates into multiple transactions. Update a few thousand or tens of thousands of rows per transaction, never more than that. Failing to do this makes for prohibitive cost of retry with deadlocks and can cause swapping by keeping tens or hundreds of megabytes in rollback state, locks and other transaction temporary structures. This happens if one inserts, updates, deletes several million rows in a single transaction.''. It then goes on to say what you should do. Oops.

Actually, it is worse than this. Apparently a checkpoint will cause a rollback, and start the whole thing over. So we really do need a different way to do this. (It was still running in the morning.) We killed the server and restarted. Useful for future cases like this: txn_killall(1) will kill all current transactions by having them return a timeout.

How many things are highly crossreferenced?
This was an exploratory query to see how many IPI protein records had, in this case, 4 links to other records. We were trying to get an estimate of how many triples would be generated by transitively and symmetrically closing the protein-record-to-protein-record relation. Unfortunately, the query doesn't have a reasonable plan, and we killed it after a little while using txn_killall(1). It works for N=2... Asked EricP and the Virtuoso folks whether there is a better way to express this query...

Upgrading hardware
We upgraded our machine to add 4G RAM, for a total of 8G, and two 10K 150G disks, total cost $800. We striped the database in order to increase performance. This entailed backing up the database (started Virtuoso with the -b option), then editing the ini file to set Striping=1 and add a bunch of "segments". Virtuoso said that smaller segments were better, so we added a bunch of lines that look like this:

Segment1 = 1G, /click/virtuoso/demo-db-seg1-1.db q1, /clack/virtuoso/demo-db-seg1-2.db q2
Segment2 = 1G, /click/virtuoso/demo-db-seg2-1.db q1, /clack/virtuoso/demo-db-seg2-2.db q2
Segment3 = 1G, /click/virtuoso/demo-db-seg3-1.db q1, /clack/virtuoso/demo-db-seg3-2.db q2
...
Segment40 = 1G, /click/virtuoso/demo-db-seg40-1.db q1, /clack/virtuoso/demo-db-seg40-2.db q2

This gives us a total of 40G to grow into - we are currently at 17G. We can add more segments as we need more space. We increased the amount of RAM used by Virtuoso by setting:

NumberOfBuffers    = 450000
MaxCheckpointRemap = 250000
MaxDirtyBuffers    = 300000

This corresponds to about 3.5G of RAM. We initially set it to 6G, but something weird happened, so we dialed it down. We'll revisit these settings.

Finally, we needed to restart Virtuoso with the +restore-crash-dump option. Don't forget; we forgot, and the backup was corrupted, so we had to start again (after deleting all the segments).

Orri says we should have done this instead, which would be more efficient

Start with the old database, and an ini without the stripes.

SQL> checkpoint;
SQL> backup_online ('med', 1000000000);
SQL> shutdown;

This makes a series of backup files called med1.bp, med2.bp... in the server's working directory. The expected total size of the backup files is 1/3 or so of the database file.


 * Delete the stripes that may exist.
 * Start the server with the ini with the stripes
 * virtuoso-iodbc-t +restore-backup med

The file med1.bp must be in the working directory; this is where the backup_online command puts it. The server will make the stripes, restore the backup, and exit. Then restart with the normal flags.

The database is A LOT faster now. No precise timings yet, but a query that took over half an hour before now takes a few minutes.

Virtuoso documentation on backup/restore

More queries on Demo script page
Demo script (ESW wiki)

Some queries on NeuronDB and Brainpharm
Mixed reasoner and SPARQL queries on the most recent version of NeuronDB.

Senselab Notes (ESW wiki)

Queries involving transitive properties
The first test of computing transitive properties using transitive.sql

Doing the computation involves setting up a named graph which contains OWL. Only the subproperty relations and transitive properties are extracted. Then one executes a procedure in interactive SQL to compute the transitive closure. For the test I chose to have only 3 transitive properties: rdfs:subClassOf, skos:narrower, ro:part_of. These are in the graph http://purl.org/commons/hcls/transitives.

The graph http://purl.org/commons/hcls/20070416/classrelations will hold the results. We execute (in isql):

DB.DBA.MATINF_OWL_BASED_TRANSITIVE_CLOSURE (
    'http://purl.org/commons/hcls/transitives',
    'http://purl.org/commons/hcls/20070416',
    'http://purl.org/commons/hcls/20070416/classrelations');

The graph http://purl.org/commons/hcls/20070416 contains, among other things, the SKOS version of MESH and the Gene Ontology. About 15 minutes later, about 3.5 million triples have been added to http://purl.org/commons/hcls/20070416/classrelations.

A smoke test. Superclasses of a GO term we use in the demo: go:GO_0005250, A-type (transient outward) potassium channel activity. First using the OWL, as is, and second with inferred transitive subclasses:
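The two smoke-test queries might be sketched as (the go: prefix expansion is an assumption):

```sparql
PREFIX go:   <http://purl.org/obo/owl/GO#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# 1. asserted superclasses only (the ontology as loaded)
SELECT ?super
FROM <http://purl.org/commons/hcls/20070416>
WHERE { go:GO_0005250 rdfs:subClassOf ?super }

# 2. with the inferred transitive subclass relations
SELECT ?super
FROM <http://purl.org/commons/hcls/20070416/classrelations>
WHERE { go:GO_0005250 rdfs:subClassOf ?super }
```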

There are a couple of things that came up that need to be addressed. First, we got the sense of broader and narrower wrong when generating the qualified MESH terms. skos:broader means "has broader" or "is narrower than", and skos:narrower means "has narrower" or "is broader than". I think this is a good reason to err on the side of naming properties a bit more verbosely.

Second, and more interesting, is that the part_of inference needed for GO isn't only transitivity. If

Every A is part of some B
B is a subclass of C (every B is a C)

then

Every A is part of some C

So imagine a cascade: a part_of b, b is_a c, c part_of d, ... which is what we would have put into the classrelations graph. We won't get the transitivity of part_of without doing some more work. The options are to a) extend the Virtuoso transitive procedure to handle this case, or b) use Pellet to infer the part_of relations and insert those inferred triples into the Virtuoso store.

I chose (b), see discussion at 2007_prototype_queries/Part_of_inference

Cleaning out the mistaken MESH "narrower" triples using SPARUL
Taking a bit of a chance here, as this is currently a single transaction, and there are a few hundred thousand triples. I'm counting on our new hardware setup to handle this. Soon we'll have a new server that lets this happen without wrapping a transaction and the associated overhead around it. (Took around half an hour.)
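The shape of the cleanup, in Virtuoso's SPARUL dialect of the day, might be sketched as follows; the WHERE pattern here is schematic, and the real one has to pick out just the mistaken triples:

```sparql
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

DELETE FROM GRAPH <http://purl.org/commons/hcls/20070416>
  { ?a skos:narrower ?b }
FROM <http://purl.org/commons/hcls/20070416>
WHERE { ?a skos:narrower ?b }
```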

Part of Hierarchy in GO
This uses the result of having Pellet compute the part_of relations between classes. It finds all GO terms that have "nerve" in their name and the things they are part of (upward in the partonomy).

The next one is a downward query. All the things that are part of "neurological process".

Note: I automatically include the FROM for each graph I query with an explicit GRAPH <> {} scope, but I'm not sure I need to. Need to read the spec.

Look up genes by name
What are the genes that are called "Acy3"? What species are they in? What are their other synonyms?
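A sketch; all the sc: property names here are hypothetical placeholders for the gene-record vocabulary, as is the prefix expansion:

```sparql
PREFIX sc: <http://purl.org/science/owl/sciencecommons/>

SELECT ?gene ?taxon ?synonym
WHERE {
  ?gene sc:symbol "Acy3" .                 # placeholder property
  ?gene sc:from-species ?taxon .           # placeholder property
  OPTIONAL { ?gene sc:synonym ?synonym }   # placeholder property
}
```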

Typical mysterious coding
The chromosome encoding includes items like 1|Un, X|Y, etc.

Map Uniprot to Entrez Gene
The EBI gene ontology annotations come with a cross-reference of database identifiers from IPI. Although I'm now using the NCBI direct-to-Entrez-Gene GO annotations, this version is still in the store and might come in handy in the short run. That's because NeuronDB is currently mapped to Uniprot ids. (A note from a bit later: apparently the xrefs are quite incomplete.)

Molecular functions of proteins mentioned in the Senselab databases
As of today, NeuronDB and BrainPharm mention proteins defined by Uniprot records. The version of the GO annotations in the goa4 graph has these. Here we use that link to find the molecular functions associated with those proteins. The filter "isBlank" filters out biological process annotations, which are also represented using has_function, but with a restriction: (has_function (some (realized_as some biological_process)))

Biological processes that proteins mentioned in the Senselab databases may participate in
In this query we follow the path ruled out by the "isBlank" filter in the above query.

Biological Processes about Dendrites. Banff Demo #1
Described in slides

Genes, CA1 Pyramidal Neurons and signal transduction processes. Banff Demo #2
Described in slides

Loading large RDF/XML files
I was trying to load a version of FMA and got this error:

DBD::ODBC::st execute failed: [unixODBC][OpenLink][Virtuoso iODBC Driver][Virtuoso Server]SR325: Transaction aborted because it's log after image size went above the limit (SQL-40005)(DBD: st_execute/SQLExecute err=-1) at /home/alanr/neuro/product/load-rdf-directory.pl line 71.

So I looked at the docs and tripled the default setting of TransactionAfterImageLimit to 150000000. The docs say that memory usage could be triple that. The RDF/XML file size was ~140M.

-rw-rw-r-- 1 alanr alanr 142592013 Feb 16 09:04 fmaOwlDlComponent_1_4_0.owl

(Just upped it to 300000000, as the previous setting wasn't adequate for taxon.owl and DOID.owl from the obofoundry site.)

Loading a normal RDF/XML file via the web
Here is an example of how to load an RDF/XML file into a specific graph in Virtuoso. Simple, essential, but still a bit under-documented in the Virtuoso manuals.

iSQL command:

DB.DBA.RDF_LOAD_RDFXML (
    xml_uri_get ('http://neuroweb.med.yale.edu/senselab/model-db.owl',
                 'http://neuroweb.med.yale.edu/senselab/model-db.owl'),
    'http://neuroweb.med.yale.edu/senselab/graph1',
    'http://neuroweb.med.yale.edu/senselab/graph1');

About SPARQL

 * SPARQLing Queries - a nice tutorial by Bijan Parsia around SPARQL issues
 * SPARQL-DL: SPARQL Query for OWL-DL - more from Bijan and Evren Sirin.

(put more useful SPARQL links here...)

Articles indexed by a MESH term
Returns the articles and titles for the MESH term D017966 (Pyramidal Neurons). You can also use the property sc:has-as-minor-mesh.
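A sketch; the major-mesh property name (formed by analogy with sc:has-as-minor-mesh), the sc: prefix expansion, and the use of dc:title are assumptions:

```sparql
PREFIX mesh: <http://purl.org/commons/record/mesh/>
PREFIX sc:   <http://purl.org/science/owl/sciencecommons/>
PREFIX dc:   <http://purl.org/dc/elements/1.1/>

SELECT ?record ?title
WHERE {
  ?record sc:has-as-major-mesh mesh:D017966 .
  ?record dc:title ?title .
}
```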

NCI Thesaurus
Got from http://ncicb.nci.nih.gov/download/evsportal.jsp

Loaded using:

perl ~/neuro/product/load-rdf-directory.pl `pwd`/Thesaurus.owl \
    http://purl.org/commons/nci/thesaurus707c/ http://purl.org/commons/nci/thesaurus707c/ log

Subclasses transitively computed using

SQL> DB.DBA.MATINF_OWL_BASED_TRANSITIVE_CLOSURE (
    'http://purl.org/commons/hcls/transitives',
    'http://purl.org/commons/nci/thesaurus707c/',
    'http://purl.org/commons/nci/thesaurus707c/');

took about a minute.

some queries: (go to http://sw.neurocommons.org:8890/nsparql/ and try them)

About properties

Body Regions

Some comments on the URLs. I'd rather see http://purl.org/commons/ncit/C12680 than http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Body_Region

The C12680 is the nci:code for the term. First, the # is gone, which means that we can send a request for the full name to a server, and second, the code is more robust to typos and nomenclature changes.

Or perhaps it should just be http://purl.org/commons/umls/C0005898 if the UMLS concept code is really equivalent.

Also, there are a number of datatype properties that should be object properties filled with URIs, or perhaps a primary URI for the concept, or linked with sameAs, if appropriate.
 * nci:Locus_ID
 * nci:NCI_META_CUI
 * nci:Swiss_Prot
 * nci:OMIM_Number
 * nci:PubMedID_Primary_Reference
 * lots more, you get the idea

Drugs and targets

What else do we know about a particular drug?

It looks like there is some stuffing of fields going on. If you look at nci:FULL_SYN values, it looks like there is a suffix (the source?).

(note that there is an older version of the thesaurus in the graph http://purl.org/hcls/potluck where all the obo graphs went)

Plasmids
Find plasmids against genes associated with Dementia.

Having Virtuoso follow and load owl:imports
Q: When loading a file into the database, or when using a FROM clause with sponging on, it would be nice to be able to say that owl:imports statements should be followed and the imported files also loaded.

A:

define input:grab-depth 10
define input:grab-seealso <http://www.w3.org/2002/07/owl#imports>

The above should do it, with 10 designating the traversal depth along this predicate during the dereferencing activity.