IAP 2009 notes 2009-01-14


Alan has prepared a presentation. We'll link the powerpoint here... soon...

What semweb does for you
What semantic web does for you:
 * Build on web standards - enable browsing etc
 * Enable computation. False positives (e.g. look at hits for "camera review") are not good for computational agents.
 * Work both intra- and inter- organizationally.

Allen Brain Atlas: exposing information about image file names as RDF, and providing it in a central location, enabled BIRN to solve their problem (of accessing ABA) without having to coordinate either with ABA or with Science Commons.

GO - is-a is not the same as part-of.

E. coli metabolic database unification ("debugging the bug"). Chemical entity mapping is not easy. It could be done manually, or procedurally; but it's quicker and easier to use OWL. Bring all information about names of chemical entities into RDF. Declare :hasCASId to be both an owl:InverseFunctionalProperty and an owl:FunctionalProperty, and watch it go.

(Note violation of unique name assumption. In RDBs, no row can have two associated names (keys).  Not so here.)

Now try it with KEGG ids. Problem in example: X from ecocyc and Y from ucsd would be the same according to CAS, different according to KEGG. This is found easily by an OWL reasoner.
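A rough sketch of the idea in plain Python, not an OWL reasoner: treating a CAS id as inverse-functional means two names with the same CAS id denote the same entity, and treating a KEGG id as functional means one entity cannot have two different KEGG ids. The record layout, the function names and the identifiers below are made up for illustration, and the merge is a single pass (no transitive merging across several keys).

```python
from collections import defaultdict

def merge_by_key(records, key):
    """Group entity names that share a value for `key` (inverse-functional reading)."""
    groups = defaultdict(set)
    for rec in records:
        if rec.get(key) is not None:
            groups[rec[key]].add(rec["name"])
    # Only groups with more than one name represent an actual merge.
    return {value: names for value, names in groups.items() if len(names) > 1}

def conflicts(records, same_by, functional):
    """Entities merged via `same_by` that carry different values of `functional`."""
    by_name = {rec["name"]: rec for rec in records}
    found = []
    for value, names in merge_by_key(records, same_by).items():
        values = {by_name[n][functional] for n in names if by_name[n].get(functional)}
        if len(values) > 1:
            found.append((value, sorted(names), sorted(values)))
    return found

# Made-up records: the identifiers are illustrative, not real CAS or KEGG ids.
records = [
    {"name": "ecocyc:X", "cas": "CAS-1", "kegg": "K-100"},
    {"name": "ucsd:Y", "cas": "CAS-1", "kegg": "K-200"},
    {"name": "ecocyc:Z", "cas": "CAS-2", "kegg": "K-300"},
]

print(merge_by_key(records, "cas"))       # X and Y are merged by CAS id
print(conflicts(records, "cas", "kegg"))  # ...but disagree on KEGG id
```

A reasoner reaches the same conclusion from the property declarations alone, without any of this procedural code, which is the point being made in the session.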

There are 3 or 4 good reasoners available, a mix of open and closed. Alan likes Pellet (by Clark & Parsia).

From natural language to OWL
PeptideReceptorLigand example. What do you need to do if you're looking for peptides? Initially thwarted by presence of three PeptideX names from three sources (ontologies) that don't match. How to cause them to be related to one another?

Every member of PeptideHormone is a member of the class of things that have a hormone activity function.

(Alan is using the word "instance" (of a class). Jonathan prefers "member".)

Restrictions give a way (in OWL) to make phrases - to capture restrictions on the members of classes. Restrictions are classes, and usually one intersects restrictions with some class that you want to make a subclass of.

Q: This is a rich language, and you may be able to write classes that have no members. Can this be detected? A: Yes. The reasoner can distinguish consistent from inconsistent axiom sets; so to find out whether a class could possibly be nonempty, add an axiom that says "x is in C" where x is otherwise unconstrained. If the resulting axiom set is inconsistent, then C cannot have a member.

Q: Is the problem that of representing the data you have, given that you have a model - what corresponds to a schema in the RDB world?

Tim: Well, you can query the schemas. Alan: But not in any standard way - different in each RDB platform.

Alan: The surprise was that we wrote almost no OWL, and still found errors at a rate of 5% in these carefully curated databases. Having standard languages like this - as opposed to nonstandard notations, or APIs such as JDBC - will enable all kinds of things that have not been done.

SPARQL introduction. Standardized protocols (POST) and result formats. SQL doesn't specify these.
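To make the protocol point concrete, here is a minimal sketch of a SPARQL protocol request built with the Python standard library. The endpoint URL and the query are placeholders, not anything used in the session; the request is only constructed, not sent.

```python
import urllib.parse
import urllib.request

def build_sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL protocol POST request."""
    # The protocol carries the query text in a form-encoded 'query' parameter.
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,  # supplying a body makes urllib use POST
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            # Ask for the standard SPARQL XML results format.
            "Accept": "application/sparql-results+xml",
        },
    )

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url)
# To actually run the query: urllib.request.urlopen(request).read()
```

Because both the request shape and the result format are standardized, the same client code should work against any conforming endpoint, which is what SQL over JDBC-style APIs does not give you.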

Why SPARQL instead of some other graph query language? - Because it's standard [and good enough; moderately powerful].

Tim: Have you seen Daniel Jackson's Alloy system?

(Break) discussion of transitive closure performance, approximate queries, etc.

OWL demo
Protege 4. Q: Does it have a SPARQL interface? A: No. Protege 3 does.

Do the generated URIs point to something? No, not yet. That has to be done separately.

Ontology with classes Person, Man, Woman, with Man and Woman disjoint. Now what happens if we make an individual that's both a Man and a Woman? Make an individual that has those types...

The class expression editor has completion. Or, you can select classes by the hierarchy.

Popup error: Cannot do reasoning with inconsistent ontologies! hmm... not very informative...

Now we can assert that the union of Man and Woman is equivalentClass to Person. Make a new class that is a subclass of Person, but disjoint from Man and Woman. Result shows up with the 'neither' class under the class Nothing (which is an OWL built-in class that has no members). Unsatisfiability.
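The unsatisfiability result can be imitated by brute force for this tiny example. The sketch below is not how Pellet or any real reasoner works; it simply enumerates every combination of class memberships for a single individual and keeps those that obey the axioms. It assumes Man and Woman are subclasses of Person, and "Neither" is just a label for the new class.

```python
from itertools import product

CLASSES = ["Person", "Man", "Woman", "Neither"]

def allowed(m):
    """True if one individual's memberships `m` (class -> bool) satisfy every axiom."""
    return (
        (not m["Man"] or m["Person"])                        # Man subClassOf Person
        and (not m["Woman"] or m["Person"])                  # Woman subClassOf Person
        and not (m["Man"] and m["Woman"])                    # Man disjointWith Woman
        and (not m["Person"] or m["Man"] or m["Woman"])      # Person equivalentClass (Man or Woman)
        and (not m["Neither"] or m["Person"])                # Neither subClassOf Person
        and not (m["Neither"] and (m["Man"] or m["Woman"]))  # Neither disjoint from Man and Woman
    )

def satisfiable(cls):
    """A class is satisfiable if some allowed membership pattern includes it."""
    for bits in product([False, True], repeat=len(CLASSES)):
        m = dict(zip(CLASSES, bits))
        if m[cls] and allowed(m):
            return True
    return False

for cls in CLASSES:
    print(cls, satisfiable(cls))
```

Person, Man and Woman come out satisfiable; Neither does not, which corresponds to Protege showing it under Nothing. This is the same "add x is in C and check consistency" test described earlier, done exhaustively.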

New property: holds. New class: Book. Superclass: holds max 1 Book. The class "Person holds exactly 2 Book" is unsatisfiable.

About the state of the art in ...

Continuing Protege overview. Look at View menu. RDF/XML rendering. Think of ... as a phrase.

Define BookHolder as 'holds exactly 1 Thing'. Go to the DL Query tab and ask the query 'holds min 1 Thing'.

Summarize: What is OWL doing for us? constructors for classes and properties; combining them into new ones; say things about classes/properties/individuals; check consistency; classify things by properties. You could express some of this in SQL, but OWL is more expressive.

Twitter data set
http://neurocommons.org/w/images/1/17/Twitter.csv - about 10,000 records of twitter transactions. The problem here is that it's not just a simple CSV file - commas in the text itself (the last column) are not escaped, which will confuse a simple CSV converter or spreadsheet import. Also, every other line is blank. Also, we had to clean up the header line... and there seems to be line noise... and we need to make sure the file is treated as UTF-8 - Apple's 'Numbers' application assumed some other character encoding, with no way to override.

http://twitter.com/statuses/public_timeline.xml -- a sample of the records, already in XML, with locations given as city names, and so on. Updated dynamically.

http://twittervision.com/user/current_status/ is a third party application that fetches locations of specific users and what they're doing.

Liang wrote a small JavaScript program to fetch about 20 records every minute using public_timeline, and then convert city locations to lat/lon using twittervision. The result is the .csv file that we're looking at.

(An alternative to using this .csv file, and using a Python script (Tim, in progress) to convert it to XML, is to modify the JavaScript program so that it either generates XML directly, or generates a new CSV file that has the commas escaped somehow.)

    import sys

    # NOTE: the XML tag names used below (records, record, field) are
    # placeholders; the literal tags of the original script did not
    # survive on this page.

    def findfields(line, count):
        """Split on the first count-1 commas; the last field keeps its commas."""
        lst = []
        pidx = 0
        for i in range(0, count-1):
            idx = line.find(',', pidx)
            lst.append(line[pidx:idx])
            pidx = idx+1
        lst.append(line[pidx:])
        return lst

    def xmloutfield(field, outf):
        # Escape the characters that are special in XML text content.
        field = field.replace('&', '&amp;').replace('<', '&lt;')
        outf.write('    <field>%s</field>\n' % field)

    def xmloutline(line, outf):
        outf.write('  <record>\n')
        for i in range(len(line)):
            field = line[i]
            if i == len(line)-1:
                # Only the free-text last column may contain tabs.
                field = field.replace('\t', ' ')
            xmloutfield(field, outf)
        outf.write('  </record>\n')

    def xmloutall(rows, outf):
        outf.write('<records>\n')
        for line in rows:
            xmloutline(line, outf)
        outf.write('</records>\n')

    inf = open('Twitter.csv', 'r')
    lines = [x.strip() for x in inf.readlines()]
    inf.close()
    # Drop the header line and the blank lines.
    lines = [x for x in lines[1:] if len(x) > 0]
    fields = [findfields(x, 10) for x in lines]
    xmloutall(fields, sys.stdout)
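For the "new CSV file that has the commas escaped" option, a possible sketch using Python's csv module, which quotes any field containing a comma. This is an illustration only, not what Liang's program does; the sample rows and the four-column layout are made up (the real file has more columns).

```python
import csv
import io

def requote(raw_lines, columns):
    """Re-emit loosely formatted lines as CSV with embedded commas quoted."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in raw_lines:
        line = line.strip()
        if not line:  # skip the blank lines between records
            continue
        # Split on the first columns-1 commas only, so the free-text
        # last column keeps its commas and gets quoted by the writer.
        writer.writerow(line.split(",", columns - 1))
    return out.getvalue()

# Made-up sample in the same spirit as the real file.
sample = [
    "1,alice,Boston,hello, world, again\n",
    "\n",
    "2,bob,Paris,plain text\n",
]

fixed = requote(sample, 4)
print(fixed)

# A standard CSV reader now recovers four fields per record.
rows = list(csv.reader(io.StringIO(fixed)))
print(rows)
```

A file written this way should import cleanly into a spreadsheet or a generic CSV converter, since the quoting follows the usual CSV convention.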

Having done this we still have some trouble with character encodings, and the file requires XML boilerplate (first line is special). Rather than push this further, we decided to look at the XML for one of the original sources that fed into the spreadsheet, and use XQuery to convert it from XML to RDF/XML. Here's what we came up with:

    declare namespace rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    declare namespace rdfs="http://www.w3.org/2000/01/rdf-schema#";
    declare namespace xsd="http://www.w3.org/2001/XMLSchema#";
    declare namespace e="http://neurocommons.org/page/IAP 2009/Eavesdrop/";

That's how you declare XML namespaces for use in XQuery.

    declare function local:id_to_URI($id) {
      concat("e:eavesdrop_", $id)
    };

Each 'status' element in the XML file corresponds to the transmission of a single message from someone to (?). Each one has a unique identifier, e.g. 1119259159. The above function converts one of these identifiers to a URI.

    let $result := (
      for $record in /statuses/status
      let $id := $record/id
      return <rdf:Description rdf:about='{local:id_to_URI($id)}'>
               <rdf:type rdf:resource="e:Eavesdropping"/>
               <e:message_heard>{string($record/text)}</e:message_heard>
               <e:is_in_reply_to rdf:resource='{local:id_to_URI($record/in_reply_to_status_id)}' />
             </rdf:Description>
    )

Here we make a list of RDF/XML elements, one for each status element in the original XML file...

    return
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
             xmlns:e="http://neurocommons.org/page/IAP 2009/Eavesdrop/">
    {
      $result
    }
    </rdf:RDF>

and the RDF/XML elements can then be placed inside some boilerplate. (Sadly, the namespace definitions need to be repeated.) The 'e' prefix is not particularly well chosen, and we have done nothing to make the URIs live, but it will work for query purposes (SPARQL) and for use with other tools that don't care about standard URI access.

The above is just a skeleton showing rudimentary processing of three of the fields in the XML. There was some discussion of whether entities should be created for users (described by 'user' elements in the original), and what to do if two records for the same user were inconsistent - e.g. if different locations showed up in different 'user' elements for the same user name.
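Before deciding how to model users, it may help simply to find out how often the inconsistency occurs. A small sketch, assuming the records have already been reduced to (user name, location) pairs; the sample data and the function name are invented.

```python
from collections import defaultdict

def inconsistent_users(records):
    """Return users that appear with more than one distinct location."""
    seen = defaultdict(set)
    for user, location in records:
        seen[user].add(location)
    return {user: sorted(locs) for user, locs in seen.items() if len(locs) > 1}

# Invented (user, location) pairs.
records = [
    ("alice", "Boston"),
    ("bob", "Paris"),
    ("alice", "Cambridge, MA"),
    ("bob", "Paris"),
]

print(inconsistent_users(records))
```

Whether differing locations are an error or just a user who moved is a modelling question this check cannot answer; it only shows where the question arises.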

To do the XQuery processing, use Saxon (from Saxonica; it is not an Apache project). A shell script for invoking it may be found in the notes for Tuesday.

One lesson here is that if you want to avoid character encoding problems, try to stay in the XML world, since it seems to deal with character encodings in a principled way. This is easier than it sounds, since there are XML toolkits available for many different programming languages, and many tools (such as Excel) are happy to export as XML. From XML it is straightforward (using XQuery or any of the XML toolkits) to convert to RDF/XML.
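As an illustration of the "any of the XML toolkits" route, here is a sketch of the same skeleton conversion using Python's standard xml.etree.ElementTree instead of XQuery. It is not what was run in the session. The 'e' namespace URI is a placeholder, the sample input is invented, and full URIs are written out instead of the "e:eavesdrop_" shorthand.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
E_NS = "http://example.org/eavesdrop/"  # placeholder namespace (assumption)

# Make the serializer use readable prefixes instead of ns0, ns1, ...
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("e", E_NS)

def id_to_uri(status_id):
    """Turn a status identifier into a URI in the placeholder namespace."""
    return E_NS + "eavesdrop_" + status_id

def statuses_to_rdf(xml_text):
    """Convert a <statuses> document into an RDF/XML string."""
    root = ET.fromstring(xml_text)
    rdf = ET.Element("{%s}RDF" % RDF_NS)
    for status in root.findall("status"):
        desc = ET.SubElement(
            rdf, "{%s}Description" % RDF_NS,
            {"{%s}about" % RDF_NS: id_to_uri(status.findtext("id"))})
        ET.SubElement(
            desc, "{%s}type" % RDF_NS,
            {"{%s}resource" % RDF_NS: E_NS + "Eavesdropping"})
        heard = ET.SubElement(desc, "{%s}message_heard" % E_NS)
        heard.text = status.findtext("text")  # escaping is handled on output
        reply_to = status.findtext("in_reply_to_status_id")
        if reply_to:  # omit the property when the message is not a reply
            ET.SubElement(
                desc, "{%s}is_in_reply_to" % E_NS,
                {"{%s}resource" % RDF_NS: id_to_uri(reply_to)})
    return ET.tostring(rdf, encoding="unicode")

# Invented sample shaped like the public_timeline XML.
SAMPLE = """<statuses>
  <status><id>1119259159</id><text>Tom &amp; Jerry &lt;3</text>
    <in_reply_to_status_id></in_reply_to_status_id></status>
  <status><id>1119259160</id><text>re: that</text>
    <in_reply_to_status_id>1119259159</in_reply_to_status_id></status>
</statuses>"""

print(statuses_to_rdf(SAMPLE))
```

The toolkit takes care of escaping and of emitting the namespace declarations, so none of the hand-written escaping from the CSV script is needed.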