IAP 2010

= "RDF: The Good Parts" = or, Ontology-based data integration using Semantic Web technologies

Tim Danford, Jonathan Rees

= Schedule & Administration =

January 11-27, six sessions, room 36-153

''OK, we're going to try again in January 2010, learning, we hope, from all the mistakes we made in 2009. One goal is to make it much more hands-on. The ontology stuff makes the most sense if you see what can go wrong.''

Draft of course description: This course will teach students to use RDF, SPARQL, and OWL to model and integrate [scientific] data. Emphasis will be on practical use of existing software tools (logical reasoners, modeling software, and a triple store with a SPARQL interface) and community ontologies. We will also consider social and legal issues surrounding publication and use of ontologies and data.

Things we want students to learn/do:
 * Intro: RDF / OWL value proposition = simple inference (stupider than AI, smarter than XML); see the small example after this list
 * Loading stuff (provided and generated) into a triple store
 * Querying with SPARQL
 * Using Protege for ontology browsing & construction
 * Using reasoners (run from Protege) to check ontologies
 * Data cleaning (but figure out how to limit time spent on this)
 * Converting information sources (spreadsheets, XML) to RDF
 * Orientation to linked data, semantic web, and community ontologies
 * Data and ontology publishing and licensing
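
To make the "simple inference" bullet concrete, here is a minimal sketch using the Jena API from the software list below (the namespace and class names are invented for illustration). We assert only that Enzyme is a subclass of Protein and that trypsin is an Enzyme; the built-in RDFS reasoner supplies the triple saying that trypsin is a Protein.

<pre>
import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.vocabulary.RDF;
import com.hp.hpl.jena.vocabulary.RDFS;

public class SimpleInference {
    public static void main(String[] args) {
        String ns = "http://example.org/demo#";   // made-up namespace

        // Schema: every Enzyme is a Protein
        Model schema = ModelFactory.createDefaultModel();
        Resource protein = schema.createResource(ns + "Protein");
        Resource enzyme  = schema.createResource(ns + "Enzyme");
        schema.add(enzyme, RDFS.subClassOf, protein);

        // Data: trypsin is an Enzyme (nothing says it is a Protein)
        Model data = ModelFactory.createDefaultModel();
        Resource trypsin = data.createResource(ns + "trypsin");
        data.add(trypsin, RDF.type, enzyme);

        // The RDFS reasoner computes the "obvious" consequence for us
        InfModel inf = ModelFactory.createRDFSModel(schema, data);
        System.out.println("trypsin is a Protein? "
            + inf.contains(trypsin, RDF.type, protein));   // prints true
    }
}
</pre>

Nothing here is deep AI; it is exactly the kind of obvious consequence that plain XML processing would force you to code by hand.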

Tentative Schedule:

 * 1) Using RDF, SPARQL, and a triple store
 * 2) OWL, Protege, Pellet, basic reasoning
 * 3) Loading an existing ontology / data
 * 4) Building your own ontology / data
 * 5) Class Project Demonstrations
 * 6) Class Projects

= Software =

== Requirements ==

Please email Tim or Jonathan if you have basic questions about Java on your system -- in particular, if you have never used Java before and are confused by the package system, the <tt>CLASSPATH</tt> environment variable, or anything else.

For this class, you will need to have the following software packages and libraries pre-installed.


 * Java 1.5 (aka Java 5)
 ** We require at least Java 5; later versions (such as Java 6) should be fine too.
 * Kawa: a Scheme interpreter that runs on the Java VM. Download the binary distribution and make sure that the following jar file from it is listed on your <tt>CLASSPATH</tt>:
 ** <tt>kawa-1.9.90.jar</tt>
 * The Jena Semantic Web Framework: download a copy of Jena 2.6.2 and make sure that the following jar files from that distribution are in your <tt>CLASSPATH</tt>:
 ** <tt>jena-2.6.2.jar</tt>
 ** <tt>arq-2.8.1.jar</tt>
 ** <tt>xercesImpl-2.7.1.jar</tt>
 ** <tt>iri-0.7.jar</tt>
 ** <tt>icu4j-3.4.4.jar</tt>
 ** <tt>log4j-1.2.13.jar</tt>
 ** <tt>lucene-core-2.3.1.jar</tt>
 ** <tt>slf4j-api-1.5.6.jar</tt>
 ** <tt>slf4j-log4j12-1.5.6.jar</tt>
 ** (If you want, you can simply add all the JAR files in the <tt>lib</tt> subdirectory of the Jena distribution to your <tt>CLASSPATH</tt> -- but the ones above are those you will definitely need.)
 * The Pellet Reasoner: download the latest version, 2.0.1, although any 2+ version should work for our purposes. Make sure the following jar files are in your <tt>CLASSPATH</tt>:
 ** <tt>pellet-el.jar</tt>
 ** <tt>aterm-java-1.6.jar</tt>
 ** <tt>pellet-core.jar</tt>
 ** <tt>pellet-rules.jar</tt>
 ** <tt>pellet-jena.jar</tt>
 ** <tt>pellet-datatypes.jar</tt>
 ** <tt>xsdlib/xsdlib.jar</tt>
 ** Make sure that the file <tt>pellet-cli.jar</tt> is ''not'' in your <tt>CLASSPATH</tt>, however; it contains an older version of Jena and (if it precedes the Jena libraries in your <tt>CLASSPATH</tt>) will cause hard-to-debug conflicts and errors.
 * The Apache POI library: make sure the following jar files are in your <tt>CLASSPATH</tt> (a spreadsheet-to-RDF sketch follows this list):
 ** <tt>poi-3.5-FINAL-20090928.jar</tt>
 ** <tt>poi-contrib-3.5-FINAL-20090928.jar</tt>
 ** <tt>poi-ooxml-3.5-FINAL-20090928.jar</tt>
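
As promised above, here is a rough sketch of what POI and Jena are for in this course: reading rows out of a spreadsheet and turning them into RDF triples. The file name (<tt>people.xls</tt>), the column layout (id in column 0, name in column 1), and the namespace are all invented for illustration.

<pre>
import java.io.FileInputStream;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class SpreadsheetToRdf {
    public static void main(String[] args) throws Exception {
        // Hypothetical layout: column 0 = person id, column 1 = name
        HSSFWorkbook wb = new HSSFWorkbook(new FileInputStream("people.xls"));
        HSSFSheet sheet = wb.getSheetAt(0);

        String ns = "http://example.org/people#";        // made-up namespace
        Model model = ModelFactory.createDefaultModel();
        Property nameProp = model.createProperty(ns, "name");

        // Start at row 1, skipping the header row; assumes both cells are filled
        for (int r = 1; r <= sheet.getLastRowNum(); r++) {
            HSSFRow row = sheet.getRow(r);
            if (row == null) continue;
            String id   = row.getCell(0).getStringCellValue();
            String name = row.getCell(1).getStringCellValue();
            Resource person = model.createResource(ns + id);
            person.addProperty(nameProp, name);
        }
        model.write(System.out, "N-TRIPLE");             // dump the triples
    }
}
</pre>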

= Datasets =

== New York Times Dataset ==

 * [[Media:NYT-People-1.0.rdf.gz|People.rdf]] : Version 1.0 of the NYT people dataset.
 * [[Media:Raw-people-1.0-fragment.rdf.gz|Usable Fragment]] : An initial fragment of the dataset.
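
A quick way to check that the dataset is usable is to load it into a Jena model and count the triples. This sketch assumes you have gunzipped the download to a local file called <tt>people.rdf</tt> in the current directory (the data is RDF/XML):

<pre>
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class LoadNytPeople {
    public static void main(String[] args) {
        // Assumes the gzipped download has been unpacked to people.rdf
        Model model = ModelFactory.createDefaultModel();
        model.read("file:people.rdf");                   // default format is RDF/XML
        System.out.println("Loaded " + model.size() + " triples");
    }
}
</pre>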

== Dartmouth Atlas of Health Care ==

 * Atlas website
 * Data Download page
 * Data Tarball

== Election Data ==

 * Election Resources : HTML tables of election results from around the world.
 * Psephos : Adam Carr's election archive

== Subversion Access ==

Execute a subversion checkout of the (world-readable) repository:

 svn co http://svn.neurocommons.org/svn/trunk/packages/common common

This will create a new directory under your current working directory, called <tt>common</tt>, which contains all the code (both Java and Scheme) that we will need for this course. You can update this working copy to the latest version from the repository at any time by running <tt>svn update common</tt> (or just <tt>svn update</tt> if the <tt>common</tt> directory is your current working directory).

I (Tim) will be updating the code for the course at least once a night.

== Java ==

All the Java code is located in <tt>common/java</tt>. You should add the full path to <tt>common/java/src</tt> to your <tt>CLASSPATH</tt> environment variable and then make sure the code is compiled.

In the future, I might add an Ant build file or a Makefile, but for now the build process just involves invoking <tt>javac</tt> (the Java compiler distributed as part of the JDK) and producing the class files in place. To compile the code, invoke <tt>javac</tt>:

 javac common/java/src/org/neurocommons/**/*.java

Note that the <tt>CLASSPATH</tt> environment variable will have to be set, as above, for this to complete.

== Scheme ==

All the Scheme code that you will need for this course is located in <tt>common/scheme</tt>.

To ensure that everything is working, start up Kawa from that directory:

 cd common/scheme
 java kawa.repl

and execute the command:

 (load "ontology.sch")

This should finish without errors; email me (Timothy) if it does not.

= External Resources =

== SPARQL Resources ==

There are three core W3C documents that are useful references for SPARQL:
 * SPARQL Query Language for RDF
 * SPARQL Protocol for RDF
 * SPARQL Query Results XML Format

For getting a sense of how SPARQL can be used for some simple problems, you can start with Feigenbaum and Prud'hommeaux's Cambridge Semantics presentation: "SPARQL By Example".
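
As a complement to those documents, here is a minimal sketch of the SELECT pattern we will use throughout the course, expressed through Jena's ARQ API. The toy graph and the <tt>demo:</tt> namespace are made up; the create/execute/iterate structure is the standard one.

<pre>
import com.hp.hpl.jena.query.*;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class SparqlHello {
    public static void main(String[] args) {
        // A toy graph: two people and their names (made-up namespace)
        String ns = "http://example.org/demo#";
        Model model = ModelFactory.createDefaultModel();
        model.createResource(ns + "alice")
             .addProperty(model.createProperty(ns, "name"), "Alice");
        model.createResource(ns + "bob")
             .addProperty(model.createProperty(ns, "name"), "Bob");

        String queryString =
            "PREFIX demo: <" + ns + ">\n" +
            "SELECT ?person ?name WHERE { ?person demo:name ?name }";

        Query query = QueryFactory.create(queryString);
        QueryExecution qe = QueryExecutionFactory.create(query, model);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution soln = results.nextSolution();
                System.out.println(soln.get("person") + "  " + soln.get("name"));
            }
        } finally {
            qe.close();
        }
    }
}
</pre>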

== OWL & DL Reasoner Links ==

 * RDF Semantics
 * OWL2 Language Overview : in particular, look at the Documentation Roadmap, and be sure to read the Primer (if nothing else).

== Software ==

 * Protege : the editor and "knowledge acquisition system".
 * Reasoners
 ** FaCT++
 ** Pellet
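
For a first taste of what a reasoner buys you outside Protege, here is a hedged sketch that builds a three-class toy ontology directly in Jena and lets Pellet answer an entailed subclass question. The class names are invented; <tt>PelletReasonerFactory.THE_SPEC</tt> comes from <tt>pellet-jena.jar</tt>.

<pre>
import com.hp.hpl.jena.ontology.OntClass;
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import org.mindswap.pellet.jena.PelletReasonerFactory;

public class PelletCheck {
    public static void main(String[] args) {
        // An OntModel backed by Pellet: asserted plus inferred statements
        OntModel model =
            ModelFactory.createOntologyModel(PelletReasonerFactory.THE_SPEC);

        String ns = "http://example.org/demo#";          // made-up ontology
        OntClass protein = model.createClass(ns + "Protein");
        OntClass enzyme  = model.createClass(ns + "Enzyme");
        OntClass kinase  = model.createClass(ns + "Kinase");
        enzyme.addSuperClass(protein);
        kinase.addSuperClass(enzyme);

        // Only Kinase-subClassOf-Enzyme and Enzyme-subClassOf-Protein were
        // asserted; Pellet supplies the entailed Kinase-subClassOf-Protein.
        System.out.println("Kinase subclass of Protein? "
            + kinase.hasSuperClass(protein));            // prints true
    }
}
</pre>

Without the Pellet model spec, the same <tt>hasSuperClass</tt> call would only see the asserted subclass link and print false.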

== OWL & DL Papers ==

 * Baader and Sattler, "An Overview of Tableau Algorithms for Description Logics"
 * Baader, Horrocks, and Sattler, "Description Logics as Ontology Languages for the Semantic Web"
 * Baader, Burckert, Hollunder, and Nutt, "Concept Logics"
 * Donini, Lenzerini, Nardi, and Schaerf, "Reasoning in Description Logics"
 * Hollunder and Nutt, "Subsumption Algorithms for Concept Languages"
 * Horrocks and Sattler, "A Description Logic with Transitive and Inverse Roles and Role Hierarchies"
 * Motik, Shearer, and Horrocks, "Hypertableau Reasoning for Description Logics"
 * McGuinness and Borgida, "Explaining Subsumption in Description Logics"
 * Brachman, "An Overview of the KL-ONE Knowledge Representation System"
 * Horrocks, "Optimising Tableaux Decision Procedures for Description Logics" (PhD Thesis)
 * Horrocks, Sattler, and Tobies, "Practical Reasoning for Very Expressive Description Logics"
 * Haarslev, Moller, and Wandelt, "The revival of structural subsumption in tableau-based description logic reasoners"
 * Brachman and Levesque, "The Tractability of Subsumption in Frame-Based Description Languages"

== Data Links ==

 * NYTimes Data Page
 ** We use the Talis-provided SPARQL endpoint in our Scheme examples.
 * Neurocommons SPARQL interface
 ** The SPARQL endpoint used in our Scheme examples.
 * DBPedia has a SPARQL endpoint, described here.
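
Querying a remote endpoint from Jena looks almost the same as querying a local model; ARQ's <tt>sparqlService</tt> call sends the query over the SPARQL protocol. The sketch below uses the public DBPedia endpoint, which may be slow or occasionally unavailable.

<pre>
import com.hp.hpl.jena.query.*;

public class RemoteSparql {
    public static void main(String[] args) {
        // Ask DBPedia's public endpoint for a handful of triples about RDF
        String endpoint = "http://dbpedia.org/sparql";
        String queryString =
            "SELECT ?p ?o WHERE { " +
            "  <http://dbpedia.org/resource/Resource_Description_Framework> ?p ?o " +
            "} LIMIT 10";

        Query query = QueryFactory.create(queryString);
        QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query);
        try {
            ResultSet results = qe.execSelect();
            ResultSetFormatter.out(System.out, results, query);   // print a table
        } finally {
            qe.close();
        }
    }
}
</pre>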

== Other pointers ==

 * "Debugging the Bug"

Some pointers, from David Booth, that might be helpful:
 * N3: http://www.w3.org/2000/10/swap/Primer
 * N-triples: http://www.w3.org/TR/rdf-testcases/#ntriples
 * Converter that transforms RDF/XML to N3 or N-triples: http://www.mindswap.org/2002/rdfconvert/

"Yet the core problem remains. We still get questions like 'Why cannot I just use XML instead of RDF?' which demonstrate the fundamental misunderstanding and wrong focus; people are principally focused on syntax. I think, generally, it easy to operate in terms of something you can see and write. Perhaps that's also the reason why Semantic Web technologies, in a broader sense, are hard to adopt mentally: So much of the benefit of these technologies depends on reasoning and there, ultimately, one is dealing with something one cannot see. Let's just take RDF as an example: Applications should deal with the deductive closure of the RDF graph they process, not the (syntactic) graph itself. If all you do is process the graph that was input, you might as well use XML." "It's one thing to notice, post-hoc, that two previuosly created properties are inverses, but it seems costly to purposefully coin two properties for each relationship. While there are some things that are awkward to say in RDF/XML syntax without explicit inverse properties, the cost of writing these awkward expressions seems lower than dealing with these aliases: which one should be used in queries? Both? While owl:inverseOf standardizes the relevant inferences, they are still not without cost."
 * "Closed World vs. Open World: the First Semantic Web Battle" (Stefano Mazzochi) -- four years old, now, but probably (still) a good introduction to the two different world views which aren't always apparent.
 * "XML Considered Harmful, and other things" : Ora Lassila
 * "HasPropertyOf" on ESW Wiki : Dan Connolly
 * W.V.O. Quine, "On What There Is"

== Related Projects ==

 * SPARQL_in_Scheme : Jonathan's outline of what we're trying to accomplish with the SPARQL framework in Scheme.
 * Linked Data Validator Project