IAP 2009

Scientific Data Integration on the Semantic Web

MIT IAP non-credit course

January 12-16, 2009

Lectures 10-12 each morning, MIT room 2-143 (main building)

Clinic (lab) each afternoon: M/T/R 2-5 Stata Center 32-346, W 3-5 Stata 32-346, F Stata 32-G451

Instructors: Jonathan Rees, Alan Ruttenberg, Thinh Nguyen, and a cast of dozens

Clinic is not required, but is recommended

Prerequisites: For lectures: general knowledge of databases and/or information processing technologies. For clinic: simple scripting; regular expressions

The goal is for each student to understand how Semantic Web technologies (RDF, SPARQL) can be used for data integration in science. Each student participating in the clinic should bring a data set (spreadsheet, database) to combine with public Semantic Web content, and aim to get something interesting out of the exercise.

The content and techniques covered should apply equally to any kind of scholarly data integration, although the emphasis will be on biomedicine and the content of the Neurocommons data distribution. Students working in other subject areas should look to DBpedia and other Linking Open Data (LOD) RDF sources. Before the course starts, try to identify some candidate public data sets represented in RDF or OWL.

To participate in the clinic, please bring a data set to work with or some other idea for an integration project. This could be, for example, a spreadsheet, an XML file, a relational database, RDF, or OWL. The data set should contain data elements that can be related to some existing RDF data source, for example through shared identifiers or names.
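As a rough illustration of what "relating" a spreadsheet to an RDF source can mean, the sketch below turns rows of a CSV file into N-Triples using only the Python standard library. It is not part of the course materials. The column names (`sample_id`, `gene_id`, `note`), both `example.org` namespaces, and the property names are hypothetical placeholders; in a real project the gene URIs would follow whatever scheme the chosen public RDF source uses.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespaces: LOCAL for terms minted for your own data,
# PUBLIC standing in for the URI scheme of an existing RDF source.
LOCAL = "http://example.org/lab/"
PUBLIC = "http://example.org/public/"


def escape_literal(text):
    """Escape a string for use inside an N-Triples double-quoted literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(csv_text):
    """Convert CSV rows (sample_id, gene_id, note) into N-Triples lines."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Percent-encode identifiers so they are safe inside a URI.
        subject = "<%ssample/%s>" % (LOCAL, quote(row["sample_id"], safe=""))
        gene = "<%sgene/%s>" % (PUBLIC, quote(row["gene_id"], safe=""))
        note = escape_literal(row["note"])
        # The link to the public gene URI is what makes integration possible.
        triples.append("%s <%smeasuresGene> %s ." % (subject, LOCAL, gene))
        triples.append('%s <%snote> "%s" .' % (subject, LOCAL, note))
    return triples


if __name__ == "__main__":
    sample = "sample_id,gene_id,note\ns1,1234,high expression\n"
    print("\n".join(rows_to_ntriples(sample)))
```

The resulting file can be loaded into a triple store alongside the public data, after which a SPARQL query can join the two on the shared gene URIs.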

Each lecture will consist of approximately 80 minutes of general explanation of the social and software ecology of the Semantic Web for science, followed by 30 minutes oriented toward hands-on work. In the afternoon clinics (labs) we will work together to marshal sources and pose queries against them.

Thinh Nguyen of Science Commons will be speaking on Thursday about copyright, licensing, and trademark as they apply to data, ontologies, and "knowledge bases". We anticipate having a few other guest speakers as well.

Tentative schedule

Activity listing on MIT IAP web site

Reading

 * Open Source Knowledge Management (presentation, Jonathan Rees, 2008)
 * Semantic Web Boot Camp (course web site, Jim Hendler and Tim Berners-Lee, 2007)
 * Semantic Web: The Big Picture (lecture(s), Jos de Bruijn, 2005)
 * A no-nonsense introduction to "semantic web" technologies (presentation, Stefano Mazzocchi, 2007)

Technical tools for data integration

 * Pellet - an OWL-DL reasoner.
 * Jena - Java platform for RDF, including in-memory triple store.
 * OpenLink Virtuoso - open source persistent triple store.
 * Protégé - an ontology editor.
 * Saxon - an XSLT and XQuery processor; XQuery serves as a scripting language for transforming XML files.

Resources

 * Google group
 * IAP 2009 outline - notes prepared before sessions
 * Notes prepared during and after: Mon, Tue, Wed, Thu, Fri