Virtuoso performance/One graph experiment

Experiment: Copy all the graphs to a single graph in a new rdf_quad table.

For the current Neurocommons, with 25G of data, 8 x 100M/s disks, and about 4 or 5G for the unused index, a full scan should take some multiple of 30 seconds. At the moment, the scan takes 7' 20", and the actual amount read is somewhere between 9G and 11G.
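The estimate above can be checked with a quick back-of-envelope calculation. This is only a sketch: it assumes "G" and "M" are decimal units and that all eight disks stream in parallel at their rated speed, neither of which the notes confirm.

```python
# Back-of-envelope scan time, using the figures quoted in the text.
# Assumption: "G" and "M" are decimal units (10**9 and 10**6 bytes).
GB = 10**9
MB = 10**6


def scan_seconds(bytes_to_read, n_disks, bytes_per_sec_per_disk):
    """Ideal sequential scan time with all disks streaming in parallel."""
    return bytes_to_read / (n_disks * bytes_per_sec_per_disk)


ideal = scan_seconds(25 * GB, 8, 100 * MB)  # whole 25G database
observed_s = 7 * 60 + 20                    # the measured 7' 20"

# Effective throughput implied by reading 9G to 11G in the observed time.
low_mb_s = 9 * GB / observed_s / MB
high_mb_s = 11 * GB / observed_s / MB

print(f"ideal full scan: about {ideal:.0f} s")
print(f"observed throughput: {low_mb_s:.1f} to {high_mb_s:.1f} MB/s")
```

Under these assumptions the observed scan moves data at roughly 20 to 25 MB/s in aggregate, far below the 800 MB/s the disks could deliver together.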

Space before:

insert soft rdf_quad2 (g, s, p, o) select iri_id_from_num(2) as g, s, p, o from rdf_quad table option (index rdf_quad);
Done. -- 2817602 msec.

select ISS_KEY_TABLE, ISS_KEY_NAME, ISS_NROWS, ISS_ROW_BYTES, ISS_ROW_PAGES from DB.DBA.SYS_INDEX_SPACE_STATS where ISS_KEY_TABLE like '%rdf_quad2';
DB.DBA.rdf_quad2  DB.DBA.rdf_quad2  320469708  8607367807  1233790
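To make the SYS_INDEX_SPACE_STATS row above easier to read, here is a small sketch that derives per-row and per-page figures from it. The 8 KB page size is an assumption, not something stated in these notes.

```python
# Figures copied from the SYS_INDEX_SPACE_STATS row for rdf_quad2.
n_rows = 320_469_708       # ISS_NROWS
row_bytes = 8_607_367_807  # ISS_ROW_BYTES
row_pages = 1_233_790      # ISS_ROW_PAGES

PAGE_SIZE = 8192  # assumption: 8 KB database pages

bytes_per_row = row_bytes / n_rows
rows_per_page = n_rows / row_pages
page_bytes = row_pages * PAGE_SIZE
fill = row_bytes / page_bytes  # fraction of allocated page space holding rows

print(f"{bytes_per_row:.1f} bytes/row, {rows_per_page:.0f} rows/page")
print(f"{page_bytes / 10**9:.1f} GB in pages, {fill:.0%} filled")
```

With that page size the primary key occupies about 10.1 GB on disk, which is in line with the 9G to 11G read by the scan.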

status('');
OpenLink Virtuoso Server
Version 05.00.3026-pthreads for Linux as of Mar 3 2008
Started on: 2008/07/26 22:26 GMT-240
Database Status:
File size 0, 5242880 pages, 905621 free.
1280000 buffers, 1279741 used, 1 dirty 0 wired down, repl age 4094890 0 w. io 0 w/crsr.
Disk Usage: 74242143 reads avg 0 msec, 13% r 0% w last 18129 s, 1242131 writes, 5866 read ahead, batch = 377.
Autocompact 93753 in 57673 out, 38% saved.
Gate: 74964 2nd in reads, 0 gate write waits, 0 in while read 0 busy scrap.
Log = virtuoso.trx, 665 bytes
4190814 pages have been changed since last backup (in checkpoint state)
Current backup timestamp: 0x0000-0x00-0x00
Last backup date: unknown
Clients: 2 connects, max 2 concurrent
RPC: 131 calls, 0 pending, 2 max until now, 0 queued, 20 burst reads (15%), 3 second brk=665206784
Checkpoint Remap 146192 pages, 0 mapped back. 17 s atomic time.
DB master 5242880 total 905621 free 146192 remap 1 mapped back
temp 200 total 196 free
Lock Status: 0 deadlocks of which 0 2r1w, 0 waits,
Currently 1 threads running 0 threads waiting 0 threads in vdb.
Pending:

create bitmap index RDF_QUAD_OGPS2 on DB.DBA.RDF_QUAD2 (O, G, P, S);
Done. -- 2928907 msec.
create bitmap index RDF_QUAD_PGOS on DB.DBA.RDF_QUAD2 (P, G, O, S);
(timing lost)

alter table RDF_QUAD rename RDF_QUAD1;
alter table RDF_QUAD2 rename RDF_QUAD;

So, what graph did we put everything in?

select id_to_iri(iri_id_from_num(2));
http://www.openlinksw.com/schemas/virtrdf#

Ah well, I was annoyed with figuring out the conversion - that's what you get.

load "/wd/test3/transitives.sql";
load "/wd/test3/load-transitives.sql";
load "/wd/test3/run-transitives.sql";
Done. -- 4117716 msec.
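The isql timings so far are all in milliseconds, which is hard to read at this scale. A minimal sketch to convert them (the step labels are mine, the numbers are the ones reported above):

```python
def fmt_msec(msec):
    """Render an isql millisecond timing as minutes and seconds."""
    minutes, rem = divmod(msec, 60_000)
    return f"{minutes} min {rem / 1000:.1f} s"


# Timings reported above; the labels are only descriptive.
timings = {
    "copy into rdf_quad2": 2_817_602,
    "create RDF_QUAD_OGPS2": 2_928_907,
    "load and run transitives": 4_117_716,
}

for step, msec in timings.items():
    print(f"{step}: {fmt_msec(msec)}")

total_msec = sum(timings.values())
print(f"total (excluding the PGOS index, whose timing was lost): {fmt_msec(total_msec)}")
```

So each of the first two steps takes a bit under 50 minutes, the transitives take a little over an hour, and the three together come to about 2 hours 45 minutes.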

Space, after all this is done:

select ISS_KEY_TABLE,ISS_KEY_NAME,ISS_NROWS,ISS_ROW_BYTES,ISS_ROW_PAGES from DB.DBA.SYS_INDEX_SPACE_STATS where ISS_KEY_TABLE like 'DB.DBA.RDF_QUAD'

First Banff query (degraphed) run in deployed instance

PREFIX go: <http://purl.org/obo/owl/GO#>
PREFIX obo: 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?name  ?class ?definition from  where {  ?class rdfs:subClassOf go:GO_0008150. ?class rdfs:label ?name. ?class obo:hasDefinition ?def. ?def rdfs:label ?definition filter(regex(?name,"[Dd]endrite")) }

Query runs reasonably, if not zippy.

Second Banff query (degraphed) [http://sparql.neurocommons.org/?query=%23%20Banff%202007%20Demo%20Query%202%3A%20CA1%20Pyramidal%20Neuron%20related%20genes%20involved%20in%20signal%20transduction%20processes%0Aprefix%20go%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fobo%2Fowl%2FGO%23%3E%0Aprefix%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0Aprefix%20owl%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E%0Aprefix%20mesh%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fcommons%2Frecord%2Fmesh%2F%3E%0Aprefix%20sc%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fscience%2Fowl%2Fsciencecommons%2F%3E%0Aprefix%20ro%3A%20%3Chttp%3A%2F%2Fwww.obofoundry.org%2Fro%2Fro.owl%23%3E%0A%0Aselect%20distinct%20%3Fgenename%20%3Fprocessname%0A%0Awhere%0A%7B%20%20graph%20%3Chttp%3A%2F%2Fpurl.org%2Fcommons%2Fhcls%2Fpubmesh%3E%0A%20%20%20%20%7B%20%3Fpaper%20%3Fp%20mesh%3AD017966%20.%0A%20%20%20%20%20%20%3Farticle%20sc%3Aidentified_by_pmid%20%3Fpaper.%0A%20%20%20%20%20%20%3Fgene%20sc%3Adescribes_gene_or_gene_product_mentioned_by%20%3Farticle.%0A%20%20%20%20%7D%0A%20%20graph%20%3Chttp%3A%2F%2Fpurl.org%2Fcommons%2Fhcls%2Fgoa%3E%0A%20%20%20%20%7B%20%3Fprotein%20rdfs%3AsubClassOf%20%3Fres.%0A%20%20%20%20%20%20%3Fres%20owl%3AonProperty%20ro%3Ahas_function.%0A%20%20%20%20%20%20%3Fres%20owl%3AsomeValuesFrom%20%3Fres2.%0A%20%20%20%20%20%20%3Fres2%20owl%3AonProperty%20ro%3Arealized_as.%0A%20%20%20%20%20%20%3Fres2%20owl%3AsomeValuesFrom%20%3Fprocess.%0A%20%20%20%20%20%20graph%20%3Chttp%3A%2F%2Fpurl.org%2Fcommons%2Fhcls%2F20070416%2Fclassrelations%3E%0A%20%20%20%20%20%20%20%20%7B%7B%3Fprocess%20%3Chttp%3A%2F%2Fpurl.org%2Fobo%2Fowl%2Fobo%23part_of%3E%20go%3AGO_0007166%7D%20%0A%20%20%20%20%20%20%20%20%20union%0A%20%20%20%20%20%20%20%20%7B%3Fprocess%20rdfs%3AsubClassOf%20go%3AGO_0007166%20%7D%7D%20%0A%20%20%20%20%20%20%3Fprotein%20rdfs%3AsubClassOf%20%3Fparent.%0A%20%20%20%20%20%20%3Fparent%20owl%3AequivalentClass%20%3Fres3.%0A%20%20%20%20%20%20%3Fres3%20owl%3AhasValue%20%3Fgene.%0A%20%20%20%20%20%7D%0A%20%20graph%20%3C
http%3A%2F%2Fpurl.org%2Fcommons%2Fhcls%2Fgene%3E%0A%20%20%20%20%7B%20%3Fgene%20rdfs%3Alabel%20%3Fgenename%20%7D%0A%20%20graph%20%3Chttp%3A%2F%2Fpurl.org%2Fcommons%2Fhcls%2F20070416%3E%0A%20%20%20%20%7B%20%3Fprocess%20rdfs%3Alabel%20%3Fprocessname%7D%0A%7D&format=&maxrows=100&go=1 run in deployed instance]

# Banff 2007 Demo Query 2: CA1 Pyramidal Neuron related genes involved in signal transduction processes
prefix go: <http://purl.org/obo/owl/GO#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>
prefix mesh: <http://purl.org/commons/record/mesh/>
prefix sc: <http://purl.org/science/owl/sciencecommons/>
prefix ro: <http://www.obofoundry.org/ro/ro.owl#>

select distinct ?genename ?processname from  where {
  ?paper ?p mesh:D017966.
  ?article sc:identified_by_pmid ?paper.
  ?gene sc:describes_gene_or_gene_product_mentioned_by ?article.
  ?protein rdfs:subClassOf ?res.
  ?res owl:onProperty ro:has_function.
  ?res owl:someValuesFrom ?res2.
  ?res2 owl:onProperty ro:realized_as.
  ?res2 owl:someValuesFrom ?process.
  {?process <http://purl.org/obo/owl/obo#part_of> go:GO_0007166} union {?process rdfs:subClassOf go:GO_0007166 }
  ?protein rdfs:subClassOf ?parent.
  ?parent owl:equivalentClass ?res3.
  ?res3 owl:hasValue ?gene.
  ?gene rdfs:label ?genename.
  ?process rdfs:label ?processname
}

Query does *not* run - takes longer than I am willing to wait. Decide to create PGOS index (or should it be POGS or POSG?).
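The PGOS versus POGS versus POSG question comes down to which columns a triple pattern binds and how many leading index columns that covers. The sketch below illustrates only the generic leading-prefix rule for an ordered index; it is not a model of Virtuoso's optimizer or of bitmap indices, and the patterns and bound-column sets are my reading of Banff query 2.

```python
def usable_prefix(index_columns, bound_columns):
    """Count how many leading index columns are bound by a pattern."""
    n = 0
    for col in index_columns:
        if col not in bound_columns:
            break
        n += 1
    return n


# Columns bound by some patterns of Banff query 2 when no graph is given.
patterns = {
    "?paper ?p mesh:D017966": {"O"},
    "?res owl:onProperty ro:has_function": {"P", "O"},
    "?gene rdfs:label ?genename": {"P"},
}

for order in ("OGPS", "PGOS", "POGS", "POSG"):
    for text, bound in patterns.items():
        without_g = usable_prefix(order, bound)
        with_g = usable_prefix(order, bound | {"G"})  # graph named in the query
        print(f"{order}  {text}: {without_g} leading columns, {with_g} with G bound")
```

By this rule PGOS only gets past P when the graph is named, while POGS and POSG can use both P and O without it. A pattern that binds only P gets no help at all from OGPS.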

An unpleasant circumstance: I started building the PGOS index but had to shut down my computer before it was done. When I came back the index seemed to have been created, but queries didn't return anything. Example follows.

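sparql prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> prefix go: <http://purl.org/obo/owl/GO#> select * from <http://www.openlinksw.com/schemas/virtrdf#> {?x <http://purl.org/obo/owl/obo#part_of> ?y} ;

x               y
VARCHAR         VARCHAR
______________________________________________________________________________

0 Rows of max 10 allowed. -- 12 msec.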

Here's the plan.

explain('sparql prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> prefix go: <http://purl.org/obo/owl/GO#> select * from <http://www.openlinksw.com/schemas/virtrdf#> {?x <http://purl.org/obo/owl/obo#part_of> ?y} ');

REPORT VARCHAR _______________________________________________________________________________

{
Precode:
  0: $25 "rdf#" := Call DB.DBA.RDF_MAKE_IID_OF_QNAME_SAFE (<constant (http://www.openlinksw.com/schemas/virtrdf#)>)
  7: $26 "obo#part_of" := Call DB.DBA.RDF_MAKE_IID_OF_QNAME_SAFE (<constant (http://purl.org/obo/owl/obo#part_of)>)
  14: BReturn 0
from DB.DBA.RDF_QUAD by RDF_QUAD_PGOS
Key RDF_QUAD_PGOS ASC ($29 "s-4-1-t0.s", $28 "s-4-1-t0.o")
  inlined <col=1771 p = $26 "obo#part_of">, <col=1769 g = $25 "rdf#">
Current of: <$31 "<DB.DBA.RDF_QUAD s-4-1-t0>" spec 5>
After code:
  0: $32 "x" := Call id_to_iri ($29 "s-4-1-t0.s")
  5: $33 "y" := Call __rdf_sqlval_of_obj ($28 "s-4-1-t0.o")
  10: BReturn 0
Select ($32 "x", $33 "y", <$31 "<DB.DBA.RDF_QUAD s-4-1-t0>" spec 5>)
}

Here's a force check to make sure there really are some (I checked the Loaded.log to see whether these were loaded first).

select s, p, o from rdf_quad table option (index rdf_quad_ogps2) where p=DB.DBA.RDF_MAKE_IID_OF_QNAME_SAFE ('http://purl.org/obo/owl/obo#part_of');

s                 p                 o
VARCHAR NOT NULL  VARCHAR NOT NULL  VARCHAR NOT NULL
_______________________________________________________________________________

#i1046277         #i1030051         #i1046272
#i1046278         #i1030051         #i1046272
#i1179522         #i1030051         #i1046272
#i1179523         #i1030051         #i1046272
#i1179526         #i1030051         #i1046272
#i1179527         #i1030051         #i1046272
#i1184115         #i1030051         #i1046272
#i1191734         #i1030051         #i1046272
#i1191735         #i1030051         #i1046272
#i1191782         #i1030051         #i1046272

10 Rows of max 10 allowed. -- 160709 msec.

So,

drop index RDF_QUAD_PGOS;
Done. -- 167457 msec.
create bitmap index RDF_QUAD_PGOS2 on RDF_QUAD (P, G, O, S);
...
*** Error 40003: [Virtuoso Driver][Virtuoso Server]SR173: Transaction out of disk at line 139 of Top-Level:

Try again:

log_enable(2);
create bitmap index RDF_QUAD_PGOS2 on RDF_QUAD (P, G, O, S);
Done. -- 2702023 msec.

Darn it. I hate when this happens. The table scan now seems to reliably take 60 seconds or so. Don't know why it used to take 7.5 min. :(

File descriptors:

Delta, after starting virtuoso (2048 before).
 * Normal: 256
 * With DefaultIsolation: 2: 0
 * With FDsPerFile = 2: 512
 * With FDsPerFile = 3: 768, but isql doesn't start, and virtuoso doesn't shut down cleanly. Orri suggests 4 but with many fewer files than what we have.

Unexplained: The database has 8x50 files = 400. Why are the deltas not a multiple of that?
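The mismatch can be made explicit with a little arithmetic. This sketch assumes the "Normal" case corresponds to FDsPerFile = 1, which the notes do not state, and it only restates the puzzle without explaining it.

```python
n_files = 8 * 50  # database files, as stated above

# FDsPerFile setting -> observed descriptor delta after starting virtuoso.
# Assumption: the "Normal" case is FDsPerFile = 1.
observed = {1: 256, 2: 512, 3: 768}

for fds_per_file, delta in observed.items():
    naive = n_files * fds_per_file      # one descriptor per file per unit
    per_unit = delta // fds_per_file    # what each unit of the setting costs
    print(f"FDsPerFile={fds_per_file}: observed {delta}, "
          f"naive {naive}, {per_unit} per unit, "
          f"multiple of {n_files}: {delta % n_files == 0}")
```

Each unit of FDsPerFile costs exactly 256 descriptors, so the delta scales with the setting but not with the 400 files.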

/etc/rdfherd-config.pl

Deleted in the full graph scenario: Extra gene labels in goa4:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

delete from graph <http://purl.org/commons/hcls/goa4> {?rec rdfs:label ?label1 }
where {
  graph <http://purl.org/commons/hcls/goa4> {
    ?rec rdf:type <http://purl.org/science/owl/sciencecommons/ncbi_gene_record>.
    ?rec rdfs:label ?label1 }
  graph <http://purl.org/commons/hcls/gene> {
    ?rec rdf:type <http://purl.org/science/owl/sciencecommons/gene_record>.
    ?rec rdfs:label ?label2 }
  filter (?label1 != ?label2)
}

Progress: With the "g in last column" Virtuoso recipe, Banff query 2 runs as long as a graph is not specified. Performance is good: 70 seconds from cold, 3 seconds warm. However, I was not sure whether the results were correct (answer: I forgot the transitives). Below:

Compare with the [http://sparql.neurocommons.org/?query=%23%20Banff%202007%20Demo%20Query%202%3A%20CA1%20Pyramidal%20Neuron%20related%20genes%20involved%20in%20signal%20transduction%20processes%0Aprefix%20go%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fobo%2Fowl%2FGO%23%3E%0Aprefix%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0Aprefix%20owl%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2002%2F07%2Fowl%23%3E%0Aprefix%20mesh%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fcommons%2Frecord%2Fmesh%2F%3E%0Aprefix%20sc%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fscience%2Fowl%2Fsciencecommons%2F%3E%0Aprefix%20ro%3A%20%3Chttp%3A%2F%2Fwww.obofoundry.org%2Fro%2Fro.owl%23%3E%0A%0Aselect%20distinct%20%3Fgenename%20%3Fprocessname%0A%0Awhere%0A%7B%20%20graph%20%3Chttp%3A%2F%2Fpurl.org%2Fcommons%2Fhcls%2Fpubmesh%3E%0A%20%20%20%20%7B%20%3Fpaper%20%3Fp%20mesh%3AD017966%20.%0A%20%20%20%20%20%20%3Farticle%20sc%3Aidentified_by_pmid%20%3Fpaper.%0A%20%20%20%20%20%20%3Fgene%20sc%3Adescribes_gene_or_gene_product_mentioned_by%20%3Farticle.%0A%20%20%20%20%7D%0A%20%20graph%20%3Chttp%3A%2F%2Fpurl.org%2Fcommons%2Fhcls%2Fgoa%3E%0A%20%20%20%20%7B%20%3Fprotein%20rdfs%3AsubClassOf%20%3Fres.%0A%20%20%20%20%20%20%3Fres%20owl%3AonProperty%20ro%3Ahas_function.%0A%20%20%20%20%20%20%3Fres%20owl%3AsomeValuesFrom%20%3Fres2.%0A%20%20%20%20%20%20%3Fres2%20owl%3AonProperty%20ro%3Arealized_as.%0A%20%20%20%20%20%20%3Fres2%20owl%3AsomeValuesFrom%20%3Fprocess.%0A%20%20%20%20%20%20graph%20%3Chttp%3A%2F%2Fpurl.org%2Fcommons%2Fhcls%2F20070416%2Fclassrelations%3E%0A%20%20%20%20%20%20%20%20%7B%7B%3Fprocess%20%3Chttp%3A%2F%2Fpurl.org%2Fobo%2Fowl%2Fobo%23part_of%3E%20go%3AGO_0007166%7D%20%0A%20%20%20%20%20%20%20%20%20union%0A%20%20%20%20%20%20%20%20%7B%3Fprocess%20rdfs%3AsubClassOf%20go%3AGO_0007166%20%7D%7D%20%0A%20%20%20%20%20%20%3Fprotein%20rdfs%3AsubClassOf%20%3Fparent.%0A%20%20%20%20%20%20%3Fparent%20owl%3AequivalentClass%20%3Fres3.%0A%20%20%20%20%20%20%3Fres3%20owl%3AhasValue%20%3Fgene.%0A%20%20%20%20%20%7D%0A%20%20graph%20%3Chttp%3A%2F%2Fp
url.org%2Fcommons%2Fhcls%2Fgene%3E%0A%20%20%20%20%7B%20%3Fgene%20rdfs%3Alabel%20%3Fgenename%20%7D%0A%20%20graph%20%3Chttp%3A%2F%2Fpurl.org%2Fcommons%2Fhcls%2F20070416%3E%0A%20%20%20%20%7B%20%3Fprocess%20rdfs%3Alabel%20%3Fprocessname%7D%0A%7D%20order%20by%20%3Fgenename&format=text%2Fhtml&maxrows=100&go=1 current neurocommons server]

Space use using the "g in last column" technique:

Note that total space is much larger: ~13G versus ~9G for RDF_QUAD, and indices that are about 1G bigger each.