IAP 2009 notes 2009-01-12

Notes from 12 January 2009.

Participant interests
Deeper understanding of what Science Commons is doing...

Making biological data more accessible and available to researchers (sequence annotation, ChIP-chip). E.g. interactive visualizers, multiple resolutions (residue to gene to genome). Also consumption by programs. Current integration work is done using multiple relational databases.

Census data for urban blocks + traffic reports. Twitter. Vehicle presence detectors. Real time. Connect to semantics.

Where the Neurocommons project ship is going.

Big big corporate data.

Convert registry of biological parts for access on semantic web.

How to help MIT researchers manage and store data.

Putting data sets (robotics) into dspace in a way that they will be useful in the future.

Multiphysics simulation built by composing several independently created models. Multiple scales. Long legacies. How do we represent the consolidated input and output files for the simulation? Maybe drive it from a description of the inputs and outputs, and use the description to drive format and unit conversions. The metadata itself needs to be converted, too, since the separate simulations are driven by it.
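
A minimal sketch of that "drive it from a description" idea, in Python; the field names, units, and conversion table are invented placeholders, not taken from any actual simulation.

    # Hypothetical descriptions of one model's outputs and another model's inputs.
    output_desc = {"temperature": {"unit": "K"},    "pressure": {"unit": "Pa"}}
    input_desc  = {"temperature": {"unit": "degC"}, "pressure": {"unit": "kPa"}}

    # Illustrative unit-conversion table, keyed by (from-unit, to-unit).
    CONVERT = {
        ("K", "degC"): lambda v: v - 273.15,
        ("Pa", "kPa"): lambda v: v / 1000.0,
    }

    def convert_record(record, from_desc, to_desc):
        """Convert each field from the units declared in from_desc
        to the units declared in to_desc."""
        result = {}
        for field, value in record.items():
            src = from_desc[field]["unit"]
            dst = to_desc[field]["unit"]
            result[field] = value if src == dst else CONVERT[(src, dst)](value)
        return result

    print(convert_record({"temperature": 300.0, "pressure": 101325.0},
                         output_desc, input_desc))
    # {'temperature': 26.85..., 'pressure': 101.325}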

Example (Alan R)
http://www.imtech.res.in/raghava/mhcbn/ = curated database of MHC binding. Look at the HLA alleles. How do people choose allele names - are they standard? Allele nomenclature has evolved, in part because of improvements in detection technology. No direct download is offered.

Q: Two options here: either you create an interface to the source, or you collect what's at the source.

Alan: E.g. suppose you want to search on binding strength. How would one do a search, if the original source (web site) doesn't provide this particular search?

JAR: The question of distributed vs. centralized storage has both technical and social/legal aspects. For Neurocommons we are picking our battles - we choose to work with data that will fit into computers of modest abilities (32 GB RAM, 2 TB disk, etc.) - so a centralized setup is great: you can do any kind of query just by consulting RAM. For "big data" (high throughput, genomics, particle physics, astronomy) the technical story will be different and more complex.

Clinic - Monday
1. Registry of parts


 * Relational database + wiki
 * A representation problem (in RDF/OWL).
 * We might integrate this with... what?
 * What questions might we want to ask of it?

2. Genomic / regulation integration


 * Information keyed to genome
 * Integrates public resources:
   * Gene annotations
   * Metabolic networks
   * Protein/protein interactions
 * Extracting data that's encoded in papers:
   * e.g. diagrams / figures
   * blots
   * use semi-automated processes to get this stuff

3. Social history spreadsheet


 * Properties
   * Assessed values
   * Method of construction
   * Area

4. Twitter record


 * Messages: who sent each one; to whom it was sent; when; approximate locations of sender and receiver; what was sent.
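
As a purely hypothetical sketch (nothing here was built in the session), a message record like that could be turned into triples with rdflib; every URI and property name below is made up.

    from rdflib import Graph, Literal, Namespace, RDF
    from rdflib.namespace import XSD

    EX = Namespace("http://example.org/twitter/")   # placeholder namespace

    g = Graph()
    msg = EX["msg1"]
    g.add((msg, RDF.type, EX.Message))
    g.add((msg, EX.sender, EX["user_alice"]))                     # who sent it
    g.add((msg, EX.recipient, EX["user_bob"]))                    # to whom it was sent
    g.add((msg, EX.sentAt, Literal("2009-01-12T14:30:00",
                                   datatype=XSD.dateTime)))       # when
    g.add((msg, EX.senderLocation, Literal("Cambridge, MA")))     # approximate location
    g.add((msg, EX.text, Literal("on my way", lang="en")))        # what was sent

    print(g.serialize(format="turtle"))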

JAR gives a quick overview of the continuum from quick-and-dirty (ad hoc, transliteration) to the principled/standards track.

Data integration: The hard way
Alan gives a brief run-through of a semantic analysis of the Annapolis 1798 properties data set.

Look at the first row of the data set. The owner is Young, Bishop of the Methodist Church.

    @prefix an: <...> .                       # prefix URI not captured in these notes
    an:x1 rdfs:label "Joseph Wiatt" .
    an:x1 rdfs:comment "(explanation of what this URI means)" .
    an:x2 rdfs:label "Occupant in 1798" .
    an:x2 rdfs:subClassOf snap:role .
    an:x2 ro:bears an:x5 .
    an:x3 rdf:type an:x1 .

Example URIs: Annapolis 1798/x1, Annapolis 1798/x2, Annapolis 1798/x3.

http://neurocommons.org/w/images/6/60/Annapolis_1798_v2.xls


 * Give everything a URL using a relatively opaque identifier (e.g. x1, x2, ... above).
 * Use purl.org or make other arrangements for potential stability.
 * Make sure the URL resolves to an explanation of what the URL means / names.
 * The comment and label should be tagged @en if they're in English (see the sketch after this list).
 * Use some sort of upper ontology to structure your thoughts. (requires justification)
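
A small rdflib sketch of the labeling points above; the namespace is just a stand-in for whatever purl.org (or other stable) URL one would actually mint.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDFS

    # Stand-in namespace; in practice use purl.org or some other stable arrangement.
    AN = Namespace("http://example.org/annapolis_1798/")

    g = Graph()
    g.add((AN.x1, RDFS.label, Literal("Joseph Wiatt", lang="en")))        # label tagged @en
    g.add((AN.x1, RDFS.comment,
           Literal("Explanation of what this URI names.", lang="en")))    # comment tagged @en

    print(g.serialize(format="turtle"))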

What about # instead of / ? With # you get back lots of other stuff, not just info about the particular term you wanted to know about.

Data integration: The easy way
Now let's try to convert the .xls (see above). Alan's favorite tool is LSW. (For Python there is FuXi; for PHP there's ARC; there's something for most languages.)

Why Jena instead of Sesame? Sesame has a good in-memory model; it builds on Jena. We (the Neurocommons project) don't use Jena's in-memory triple store, since everything is kept out of memory in Virtuoso (the Neurocommons corpus is large). Pellet projects itself into Jena, and Jena acts as the query processor. Alan recommends OWLAPI over Jena because it's higher level; Jena is good for reading and writing RDF/XML.

Write the Excel file as text so that we can deal with it more easily in Lisp. To convert CRLF to LF, try (in Emacs) "Set coding system for saving this buffer" (M-x set-buffer-file-coding-system).

transliterate.lisp -- the first line is the column headings. For a quick-and-dirty translation (transliteration), just turn the column headings into URLs; these become properties. Each row (case) gets a URL as well, and each occupied cell becomes a triple.
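
The notes do this in Lisp; below is a rough Python equivalent of the same quick-and-dirty scheme, assuming a tab-separated text export of the spreadsheet and an invented base URL.

    import csv
    import urllib.parse

    BASE = "http://example.org/annapolis_1798/"     # invented base URL

    def to_uri(text):
        """Turn a column heading (or a row id) into a URI under BASE."""
        return "<" + BASE + urllib.parse.quote(text.strip().replace(" ", "_")) + ">"

    def transliterate(path):
        """Print one N-Triples line per occupied cell."""
        with open(path, newline="", encoding="utf-8") as f:
            rows = csv.reader(f, delimiter="\t")
            headings = next(rows)                        # first line is column headings
            properties = [to_uri(h) for h in headings]   # headings become properties
            for i, row in enumerate(rows, start=1):
                subject = to_uri("row%d" % i)            # each row (case) gets a URL
                for prop, cell in zip(properties, row):
                    if cell.strip():                     # each occupied cell becomes a triple
                        value = cell.strip().replace('\\', '\\\\').replace('"', '\\"')
                        print(subject, prop, '"%s" .' % value)

    # transliterate("Annapolis_1798_v2.txt")   # a text export of the .xls above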

Tim: Can't you just leave the data in the quick-and-dirty form, and assert some inference rules that implement the modeling? JAR: You would think that using some kind of abstraction - keeping the RDF close to the original, and delaying commitment to a model until the latest possible moment - would be good engineering. You could use RIF, maybe, to apply our modeling to the data. But the quick-and-dirty RDF is already very close to the original source, so you're not gaining much - you probably aren't storing the RDF itself if it's derived, and if you want to change the model you just change the script that generates the RDF.

The fields of a record aren't always about the same thing - so you may start with a single row, and in modeling decide you need to make statements about two individual things. When uniting with another data set, if you don't split the row into two things, you may end up with lies. Example: a PubMed record talks sometimes about the article and sometimes about the PubMed record itself; if another source talks about the article, things said about the record won't apply to it.
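
To make the PubMed point concrete, here is a hedged rdflib sketch that mints separate URIs for the record and for the article it describes; the class and property names are illustrative only.

    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("http://example.org/")        # placeholder namespace

    g = Graph()
    record  = EX["pubmed_record_12345"]          # the PubMed record itself
    article = EX["article_12345"]                # the article the record describes

    g.add((record, RDF.type, EX.PubmedRecord))
    g.add((article, RDF.type, EX.JournalArticle))
    g.add((record, EX.describes, article))       # link the two individuals

    # Statements about the record stay on the record ...
    g.add((record, EX.dateAddedToPubmed, Literal("2009-01-12")))
    # ... and statements about the article (possibly from another source) stay on the article.
    g.add((article, EX.title, Literal("Some article title")))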

We're having character encoding problems with the spreadsheet - there are funny characters that aren't Unicode, and so on.

OK, now we can already see some problems. "Onion" probably isn't a person, and the string "5000 " won't work in range queries. We don't know the units of the money values. The temptation to go in and fix things is great; weigh the cost/benefit.
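
For the range-query problem in particular, one common fix during conversion is to emit typed literals instead of raw strings; a tiny sketch, assuming rdflib and a deliberately naive test for "looks numeric".

    from rdflib import Literal
    from rdflib.namespace import XSD

    def cell_literal(cell):
        """Return an xsd:integer literal if the cell looks numeric,
        otherwise fall back to a plain string literal."""
        text = cell.strip().replace(",", "")     # "5000 " -> "5000"
        if text.isdigit():
            return Literal(int(text), datatype=XSD.integer)
        return Literal(cell)

    print(cell_literal("5000 "))    # typed as xsd:integer, usable in range queries
    print(cell_literal("Onion"))    # left as a plain string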

Commonly, relational databases are assumed to be closed - something that's not stated is assumed to be false. RDF, on the other hand, by design has an open world assumption.

Are there ways to make data cleaning easier?