IAP 2009 outline


Monday
Introduction and overview.


 * Introductions: Why are you here?
 * About Science Commons. - An effort to revolutionize the practice of science by changing the way scientific materials (physical materials, information, "knowledge") are provided, accessed, and used.
 * About this course. - Semantic web looks like a way to make data integration more efficient. Thanks for helping us to improve the way we tell this story.
 * What is data integration? Why do we care?  - Examples of integrations (from participants). Combining similar information from different sources; combining different information linked via a shared entity; forging relationships so that combination becomes possible.
 * Why is it hard? - One can be thwarted in many ways: discovery, permission, access, technology, syntax, semantics. These problems are addressed to varying degrees in the worlds of prose and software, but much less so for data. Of them, semantics (understanding what a piece of information means) is the most complex and subtle.
 * Why is semantics so important? - Examples (Alan's exercise: reverse engineering a spreadsheet)


 * Porting to a "semantic web" is proposed as an alternative to artisanal data integration. - We integrate data by porting it to a platform (e.g. a custom database). Rather than create a new platform for each application, the semantic-web approach is to port to a common open platform. This way the ports are reusable (by anyone using this same platform).
 * (Compare to "porting" and platform commonality in software and prose worlds.)
 * What do we require of a "semantic web"? - Interpretable (meaningful) statements about reality, expressed using documented names and linked together by virtue of appropriate reuse of those names.


 * How do we deploy a semantic web? - URIs are used as logical names for things. One can consult the web to find out what a URI means. Definitions combine formal and informal elements. Statements are written in RDF and OWL, collected in documents, and aggregated for query purposes in "triple stores". Questions are posed in SPARQL.
 * History, promise, and critique of "the semantic web." - TimBL's vision. Tools, applications, market, hype, and uptake. The coordination debate.

Clinic prep: (a) Talk about data brought by, or suggested by, participants; finding related RDF/OWL on the web. (b) Simple SPARQL queries related to same.

Tuesday
Tuesday is precedent day: What has been done, and is being done, in the way of semantic web for science.

How "the semantic web" works
In this introduction we give just enough explanation of the mechanics of "the semantic web" (URIs, RDF, OWL) to make sense of what we'll be talking about today.

URIs (~ URLs) are used as names for things, relations (properties), and classes. When you browse to a semantic web URI, you should get an explanation of what is named. (In practice, the explanation is often absent or is missing crucial information such as units. In the best case, the explanation is provided in machine-readable form, but presented in a human-readable fashion when viewed in a browser.)
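As an illustrative sketch (using the real FOAF vocabulary, but with the description excerpted and simplified), dereferencing a class URI might return machine-readable RDF along these lines:

```turtle
# Illustrative excerpt of what a client might get back when
# dereferencing http://xmlns.com/foaf/0.1/Person
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

foaf:Person a rdfs:Class ;
    rdfs:label "Person" ;
    rdfs:comment "A person." .
```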

Data is couched as declarative assertions about the world. That is, instead of just putting a number in some syntactic position as in a spreadsheet or XML, one writes a statement that something is "related" to the number in a particular way (or has that number as some particular property).

Simple RDF statements have the form S P O., where S is a URI naming the subject, P is a URI naming the property (relation), and O is the object (the thing to which the subject is related), given either as a URI or as a literal value such as a number or string.
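A minimal sketch in Turtle (the `ex:` names here are hypothetical, chosen only for illustration):

```turtle
@prefix ex: <http://example.org/> .

ex:sampleA ex:madeOf ex:gold .                 # object given as a URI
ex:sampleA ex:meltingPointCelsius "1064.18" .  # object given as a literal
```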

RDF has two common "serializations" - RDF/XML, which is the most standard and is useful for systems that already handle XML, and Turtle, which is much more concise and human-readable than RDF/XML.
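For instance, a single statement giving `http://example.org/gold` the label "gold" (a hypothetical example) looks like this in RDF/XML:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <rdf:Description rdf:about="http://example.org/gold">
    <rdfs:label>gold</rdfs:label>
  </rdf:Description>
</rdf:RDF>
```

and like this in Turtle:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
<http://example.org/gold> rdfs:label "gold" .
```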

OWL is an ontology language - in most cases used to define classes and properties, and their interrelations (such as subclass). It provides a simple algebra of classes - operators for building new classes out of existing ones. OWL itself is serialized as RDF, making it possible to pose queries against OWL ontologies in the same way one would query any other data carried as RDF.
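A small sketch of the class algebra, with hypothetical `ex:` names: a subclass assertion, and a new class defined as the intersection of two existing ones.

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

ex:Neuron rdfs:subClassOf ex:Cell .

# A class built from existing classes using the OWL class algebra:
ex:InhibitoryNeuron a owl:Class ;
    owl:intersectionOf ( ex:Neuron ex:InhibitoryCell ) .
```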

OWL has a logic, which means that it defines a formal notion of logical inference. Inference is implemented by software packages ("reasoners"), and can be used for a variety of purposes.
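A minimal example of the kind of inference a reasoner performs (hypothetical names): from a subclass axiom and a type assertion, it derives a type that was never explicitly asserted.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

# Asserted:
ex:Neuron rdfs:subClassOf ex:Cell .
ex:purkinje1 a ex:Neuron .

# Inferred by a reasoner (asserted nowhere in the data):
#   ex:purkinje1 a ex:Cell .
```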

The query language for RDF (and therefore OWL as well) is called SPARQL.
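A sketch of a simple SPARQL query over data like the examples above (the `ex:` vocabulary is hypothetical): find everything typed as a Neuron, together with its label.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/>

SELECT ?thing ?label
WHERE {
  ?thing a ex:Neuron ;
         rdfs:label ?label .
}
```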

What URIs to choose
When you model data in RDF or OWL, you need to assign URIs to individuals, classes, and properties (relations). When possible, choose URIs likely to match those used in other RDF artifacts with which your data may need to integrate, now or in the future.
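For example, rather than minting a new URI for "person" or "title", one can reuse the widely shared FOAF and Dublin Core URIs (the `ex:` resources and literal values here are invented for illustration):

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dc:   <http://purl.org/dc/elements/1.1/> .
@prefix ex:   <http://example.org/> .

ex:dataset1 dc:title "Example dataset" ;
    dc:creator ex:jane .
ex:jane a foaf:Person ;
    foaf:name "Jane Doe" .
```

Because `foaf:Person` and `dc:title` are already used in many RDF artifacts, anyone aggregating this data can join it with other sources without translation.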

Collections of classes and relations are called "ontologies" (compare "controlled vocabulary" and "taxonomy"). There are also collections of URIs for individuals (such as cities). We'll talk about how to find ontologies you might be able to use.

Search tools:
 * Swoogle
 * Sindice
 * Schemaweb, Umbel, Zitgist, OLS, Bioportal

Collections of ontologies:
 * dbpedia
 * Linked Open Data (LOD), bio2rdf
 * Open Biomedical Ontologies

Dive a bit into OBO and OBO Foundry. Good and bad examples of URI definitions.

Interesting: http://www.rdfabout.com/demo/census/

"Ordinary" URLs (those naming web pages) can be used in RDF, and are supposed to name the documents returned by an HTTP GET, but the underpinnings of this idea are weak both theoretically and from a standards point of view, so we won't talk about this much.

The Neurocommons project
What sources we have chosen, how we model things, technology choices.


 * Architecture: RDF distribution of information ports, Triple store, Common naming project
 * Neurocommons ports of info from NLM (MeSH, Entrez) and other data sources
 * Example: Senselab

Wednesday
Wednesday is semantics day.


 * What this is all about
 * Getting all the answers
 * Why we want semantics to be formal
 * Costs and benefits


 * RDF and RDFS
 * Description logic and OWL 2
 * Reasoning over OWL DL: Pellet
 * Sanity checking and debugging
 * SPARQL used with OWL
 * "Knowledge representation"
 * Tractability tradeoffs
 * Scalability

Clinic preparation: Using Protege and Pellet

[Mostly AR?]

Thursday
Thursday is citizenship day.

Thinh will speak in the first hour on legal issues: copyright, licensing, and public domain as they relate to data, ontologies, and "knowledge".

In the second hour we'll talk about one or more of the following based on class interest:
 * incentives for data publication
 * incentives for porting to a semantic web; who is responsible
 * metadata capture
 * data archiving
 * data discovery
 * responsible management of URIs
 * quality control: GO, GOA, and OBO Foundry process

Sketch of workflow for data-in-RDF on the Web.

Friday
Case studies selected based on interests of participants.


 * Text processing (simple indexing, entity recognition, and/or text mining)
 * Allen Brain Atlas / Google Maps mashup
 * Immunology Portal
 * others...

[Mostly AR?]