Linked Data Validator

Jonathan and others thought that a validator for "linked data" would be a good idea. This would be software that could spider through some portion of the linked data reachable from a document and verify that a certain set of semantic and style conditions on the document were satisfied.

If such a thing already exists - great, we should publicize it. But we are not aware of a tool that does the full range of checks suggested below.

These are just rough notes. With more work they might evolve into a project plan or functional specification.

Other Validators

 * W3C Markup Validation Service
 * W3C RDF Validation Service
 * WonderWeb OWL Ontology Validator
 * vrdfa : RDFa validation from DERI
 * Vapour : a "linked data validator"
 * FOAF Validator
 * Sindice Web Data Inspector
 * RDF:ALERTS : also from DERI
 * Pellet Integrity Constraints: Validating RDF with OWL

Several of these links were culled from the end of this interview with the "Pedantic Web Group".
 * The Pedantic Web Group also has a good tools page.
 * An "Ontology Reviews" quasi-journal would be a good thing; the pedantic-web group already does this after a fashion. This is a pretty good example of a manual review.

Jonathan's wish list
Here's what it would have to do:
 * input: a document D that is an RDF or OWL file (maybe RDFa), or XHTML+RDFa; from a URI or pasted into form
 * locate URIs used in D
 * do HTTP GETs on these URIs (but not transitively - avoid wildfire) to get "supporting RDF documents" (complain if missing)
 * do HTTP-level caching (persistent across invocations)
 * follow redirects (301, 302, 303, 307)
 * do conneg favoring RDF (extra credit: if there is HTML as well, make sure no HTML fragids are used in any RDF, since they denote HTML elements according to media type registrations) (extra credit: the HTML should always contain a link back to the RDF - I hate having to hunt down the RDF using 'curl')
 * whine about non-http[s]: URIs, with careful exclusions (foaf mailboxes)
 * check that all URIs occurring in D are 'defined' somewhere in the supporting documents
 * check for presence of rdfs:label
 * for properties and classes, check that there are nontrivial comments (not just repeating the label or URI)
 * ... and have 'sorts'
 * check for existence of rdf:type assertions (or equivalent: Class, Property, etc.); these determine the sort: individual / property / class
 * whine if no sort or inconsistent sort
 * ... and have 'types'
 * rdf:type gives type of individual; 'type' of a property would be domain and range (or something appropriate in OWL: restriction etc.)
 * (anything to say about classes? subclass?...)
 * whine if no type(s)
 * "sort checking"
 * for s v o. in D, ensure s and o are individuals (many exceptions e.g. rdfs:label)
 * for s v o. in D, ensure v is a property
 * for s rdf:type c. in D, ensure c is a class
 * "type checking"
 * "aggressive disjointness": assume that sibling asserted classes are disjoint (this is not sound but is 'good practice')
 * when x belongs to disjoint classes A and B, whine if x, A, or B occurs in D
 * check domain and range (s v o. in D where s not in domain of v or o not in range of v)
 * (what about type inference? possible? desirable?)
 * check datatype membership: for s v xxx^^yyy. in D, ensure that xxx is in yyy's lexical space and that xxx^^yyy is in the range of v
 * find bogus sameAs assertions
 * be creative. maybe: check D both with and without looking at sameAses; report on relative improvement or degradation (problems repaired by attention to sameAs assertions / problems introduced by attention to sameAs assertions - we're predicting the latter, as in the NYT example)
 * protocol
 * if D was obtained from a 200 without an intervening 303, whine if D's URI is declared to be a non-individual or to belong to a non-information-resource class such as foaf:Agent
 * support Web Linking when it's ready (? LRDD expired 9/24/09)
 * maybe support the HTML <link> element (does anyone use it?)
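The dereferencing steps in the list above (conneg favoring RDF, following redirects, caching across invocations) could be sketched roughly as below. This is only a sketch under assumptions: the Accept header value and the dict-based cache are illustrative, and a real tool would persist the cache to disk and add HTTP-level cache validation. Note that urllib's default redirect handler already follows 301, 302, 303, and 307.

```python
import urllib.request

# Hypothetical conneg preference: RDF/XML first, then Turtle, HTML last.
RDF_ACCEPT = "application/rdf+xml, text/turtle;q=0.9, text/html;q=0.1"

def rdf_request(uri):
    """Build a GET request that asks for RDF in preference to HTML."""
    req = urllib.request.Request(uri)
    req.add_header("Accept", RDF_ACCEPT)
    return req

def fetch_supporting_document(uri, cache):
    """Fetch uri with RDF-favoring conneg, memoizing in `cache` so
    repeated invocations don't refetch (persistence would mean writing
    this dict to disk). Returns (post-redirect URI, body bytes)."""
    if uri not in cache:
        with urllib.request.urlopen(rdf_request(uri)) as resp:
            # geturl() is the final URI after any 301/302/303/307 hops,
            # which the protocol checks below need to look at.
            cache[uri] = (resp.geturl(), resp.read())
    return cache[uri]
```

Comparing the requested URI with the post-redirect URI is what lets the protocol check distinguish a 200-with-303 history from a plain 200.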
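The "sorts" and "sort checking" items above could be sketched as follows. Everything here is an illustrative assumption: the prefixed-name triples, the built-in sort table, the exception list, and the complaint strings; a real checker would work over parsed graphs of full URIs.

```python
RDF_TYPE = "rdf:type"
CLASS_NAMES = {"rdfs:Class", "owl:Class"}
PROPERTY_NAMES = {"rdf:Property", "owl:ObjectProperty", "owl:DatatypeProperty"}
# Sorts of a few well-known terms, so the demo doesn't whine about them.
BUILTIN_SORTS = {"rdf:type": "property", "rdfs:label": "property",
                 "rdfs:Class": "class", "owl:Class": "class"}

def sort_of(uri, triples):
    """Infer the sort of a URI from its rdf:type assertions.
    Returns 'individual', 'property', 'class', 'inconsistent',
    or None when no sort can be determined (both of the last two
    are grounds for whining)."""
    if uri in BUILTIN_SORTS:
        return BUILTIN_SORTS[uri]
    sorts = set()
    for s, v, o in triples:
        if s == uri and v == RDF_TYPE:
            if o in CLASS_NAMES:
                sorts.add("class")
            elif o in PROPERTY_NAMES:
                sorts.add("property")
            else:
                sorts.add("individual")
    if not sorts:
        return None
    return sorts.pop() if len(sorts) == 1 else "inconsistent"

def check_sorts(triples):
    """For each s v o., ensure v is a property, and for s rdf:type c.,
    ensure c is not an individual."""
    complaints = []
    for s, v, o in triples:
        vs = sort_of(v, triples)
        if vs is None:
            complaints.append(f"no sort for predicate {v}")
        elif vs != "property":
            complaints.append(f"{v} used as predicate, sort is {vs}")
        if (v == RDF_TYPE and o not in CLASS_NAMES | PROPERTY_NAMES
                and sort_of(o, triples) == "individual"):
            complaints.append(f"{o} used as class, sort is individual")
    return complaints

# Demo: ex:likes is never declared a property, so the checker whines.
demo = [("ex:Dog", RDF_TYPE, "owl:Class"),
        ("ex:fido", RDF_TYPE, "ex:Dog"),
        ("ex:fido", "ex:likes", "ex:rex")]
print(check_sorts(demo))
```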

Expect that useful checks may require many special-case hacks for common vocabularies. E.g. the domain of dc: terms is not disjoint with "information resource". We may need to impose our own upper-level ontology.
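One cheap way to accommodate such special cases is a per-vocabulary exemption table consulted before each check. The entries below are illustrative assumptions, not a vetted list:

```python
# Hypothetical exemption table: the named check is skipped for any term
# whose URI starts with the given prefix. Entries are examples only.
EXEMPTIONS = {
    # dc: term domains overlap "information resource"
    "http://purl.org/dc/terms/": {"disjointness"},
    # foaf mailboxes are mailto: URIs by design
    "http://xmlns.com/foaf/0.1/mbox": {"non-http-uri"},
}

def exempt(term_uri, check_name):
    """True if check_name should be skipped for term_uri."""
    return any(term_uri.startswith(prefix) and check_name in checks
               for prefix, checks in EXEMPTIONS.items())
```

Each checker would then guard its complaints with `if not exempt(uri, "..."):`, keeping the vocabulary lore in one place.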

Do we need separate style checkers for OWL and for RDF(S)? Let's hope not. Try without for now, revise decision later if forced.

Output
Some thought should be given to the output format the validator will use.

It would be nice to express our discoveries and criticisms in RDF. But how do we write RDF about other RDF graphs - that is, what ontology or vocabulary do we use? Has anyone published a reasonable vocabulary that we might adopt?

Do I want something to do with Named Graphs? (JAR: 'Named graph' is a silly term as it is not an ontological category. There are graphs, and like anything else they may or may not be named.)

Or are XML Literals more along the lines of what we're thinking about? (JAR: You might for convenience say that some XML literal is related via serialization to some RDF graph, as a way of saying what graph you want to talk about. But usually the graph can be specified in some other way, e.g. by using an http: URI that leads to a serialization of the graph.)

(JAR: Need examples here.)
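As a strawman, a report might look like the Turtle below. The check: vocabulary is invented purely for illustration (we know of no published ontology for this), and, as suggested above, the checked graph is named by the http: URI that leads to its serialization rather than by any "named graph" machinery:

```python
def report_turtle(doc_uri, complaints):
    """Render (severity, message) pairs as a Turtle report about
    doc_uri, using a made-up check: vocabulary."""
    lines = ["@prefix check: <http://example.org/check#> ."]
    for i, (severity, message) in enumerate(complaints):
        lines += [
            f"<{doc_uri}> check:hasProblem _:p{i} .",
            f'_:p{i} check:severity "{severity}" ;',
            f'      check:message "{message}" .',
        ]
    return "\n".join(lines)

print(report_turtle("http://example.org/data.rdf",
                    [("warning", "no rdfs:label for ex:Dog")]))
```

The point is only that each complaint becomes a resource with a severity and a message, attached to the document's URI; real output would need escaping and a settled vocabulary.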

Test Cases
It would be nice to have some canonical examples of "bad" Linked Data, along with the output that the validator should produce. A canonical set of error cases would be both (a) a good way to test the implementation of this validator, and (b) a reasonable set of test cases for other validators that other people could write.

Both real world and artificial test cases would be interesting. [TimD] still has the original (flawed) NYT data set. Current version.

Here are some RDF pages picked at random:
 * Dbpedia Akron
 * Dublin Core terms
 * Senselab Brainpharm
 * Dan Connolly FOAF