Semantic resources project/Meeting notes/2009-06-19

= Notes from June 19 meeting =

Taken by: Kaitlin Thaney. Edited by Alan Ruttenberg

Attendees: Alan Ruttenberg, Elizabeth Wu, Sudeshna Das, Jonathan Rees, Tim Clark, Kaitlin Thaney

Review of previous items

 * put in touch with Marco - (stricken off list for now since re: hire)
 * TC, AR to speak with Judy Blake -- haven't done yet - Tim to initiate.
 * EW to see if we can get the AlzForum search logs, if there are any. -- not sure they exist. EW to continue to press.
 * SD to write paragraph re: hiring issue concern to circulate to group - (stricken off list for now since re: hire)
 * SD and TC to talk following June 5 call -- done.
 * PRO discussion between EW and AR -- on the agenda for today.

New and continued action items

 * have a future meeting with AR presentation on IAO
 * TC to initiate call between TC, AR and Judy Blake
 * EW to suggest dates for PRO face to face meeting with Darren Natale. - Alan available July 2-20. EW to check with Gwen.
 * AR to talk to Maryanne Martone to get list of important proteins in neuroscience.
 * EW to talk to community members to help gather list of important proteins in neuroscience.
 * SD to talk to someone at Stem Cell Institute who's knowledgeable about antibodies, see who are their suppliers - see if ABCAM is one of them. See if we have access to computational form, which would allow us to bring under the fold.
 * have conversation w/ Paolo (when he gets back) to discuss SWAN modeling issues/anticipated transition to new models, with mind to moving forward on RDF export/ LOD version.
 * everyone: think about / brainstorm a wishlist of high-leverage, low-resource things to do. Tim proposes first candidate - Jax mouse models.
 * Continue discussion of meaning of "open", "available on the web", "integrated into the Neurocommons"
 * Move forward on getting SCF mirror of the NeuroCommons up

Entity Recognition
TC: preface to the rest of the meeting. Tim says, very important to find common areas where we can agree and make progress on. Had chat with jtw -

Thinks there is discomfort / concern around text recognition. (Alan: should be part of the portfolio, but in a measured amount. JAR: Discomfort on SC's side re: previous experience with text recognition and the Board.)

TC: My proposal - part of the use case - is the essential thing for SCF and SWAN, since grounded in the discourse thru documents.

The proposal is the shared work include

a) Assumes we receive some document with markup by an entity extraction algorithm. A user interface lets the curator accept or reject the found entities, and select text and add markup if they want. The interface includes support, in the way of browsing the ontologies used for entities, to correct or find the appropriate term. It then stores the result; storage and representation are part of this task.

Suggests that this can eliminate the area that there's discomfort about (the entity recognition portion), by focusing on the human annotation.

Note: Two ways to generate associations from span of text to entity: 1. entity recognition (automatic, but verified); 2. tagging (manual).

Discussion of architecture
AR/JAR a bit surprised as they thought *other* part was common.

JAR - automated processes take certain inputs and process them incrementally or in a batch. Automated entity recognition is good - it's replicable; it's the UI portion that's of concern.

JAR suggested to make a framework plan that lets others plug in particular ways to do #1 AND #2... maybe do one example of #1 as a demo
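A minimal sketch of the kind of plug-in framework JAR describes, assuming both ways of generating associations (#1 automatic recognition, #2 manual tagging) can share one record type and one callable signature. Every name here (`Annotation`, `dictionary_recognizer`, `manual_tagger`, `run_annotators`) and every `ex:` identifier is hypothetical and for illustration only; no such design was agreed in the meeting.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Annotation:
    start: int       # character offset where the span begins
    end: int         # character offset just past the span
    entity_id: str   # identifier of the entity (hypothetical "ex:" IDs)
    source: str      # "automatic" (#1) or "manual" (#2)
    verified: bool   # True once a curator has accepted the association


# Any annotator, automatic or manual, is a function from text to annotations.
Annotator = Callable[[str], List[Annotation]]


def dictionary_recognizer(lexicon: Dict[str, str]) -> Annotator:
    """Way #1 (demo only): exact, case-insensitive phrase lookup."""
    def annotate(text: str) -> List[Annotation]:
        found = []
        lowered = text.lower()
        for phrase, entity_id in lexicon.items():
            needle = phrase.lower()
            pos = lowered.find(needle)
            while pos != -1:
                found.append(Annotation(pos, pos + len(needle), entity_id,
                                        "automatic", False))
                pos = lowered.find(needle, pos + 1)
        return found
    return annotate


def manual_tagger(tags: List[Tuple[int, int, str]]) -> Annotator:
    """Way #2: spans selected by a curator, given as (start, end, entity_id)."""
    def annotate(text: str) -> List[Annotation]:
        return [Annotation(s, e, eid, "manual", True)
                for s, e, eid in tags if 0 <= s < e <= len(text)]
    return annotate


def run_annotators(text: str, annotators: List[Annotator]) -> List[Annotation]:
    """Run every plugged-in annotator and merge results in document order."""
    results: List[Annotation] = []
    for annotator in annotators:
        results.extend(annotator(text))
    return sorted(results, key=lambda a: (a.start, a.end))
```

A real recognizer would replace `dictionary_recognizer` without changing `run_annotators`, which is the point of the plug-in arrangement.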

AR: Notes part that's missing - interaction between entity extraction system and synonyms that are part of Neurocommons accessible ontologies.

TC: Discusses curation - curator decides that this annotation given the term and def for term is correct, narrows down set of annotations. Can add extra markup.

Question of where those get stored. It may be elsewhere, but they also go back into the same store from which the documents are taken, so that they can be processed on periodic rerunning of entity extraction.

Tim - doc being reviewed by either human or machine - recognition process that takes place - may involve previous recognition and previous entities.

SD: Question: Assuming we have these documents, and the repository of antibodies, will we be able to find entities in the antibody repository in the documents and then query for documents given antibody?

AR: As long as the entities have identifiers.

Some discussion of machine learning potential. E.g. term choices by the curator could be fed back into the DB and the algorithm could train using them. Potential ability to tune the recognition algorithm.

AR: Notes that if the curator indicates that a particular phrase refers to a term, this phrase can be submitted as a synonym to the ontology where the term is defined.
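One way the synonym feedback could look, as a sketch only. It assumes curator decisions are available as (phrase, term identifier, accepted) triples and that known synonyms are stored in lower case; the function name `propose_synonyms`, the `min_count` threshold, and the `ex:` identifier are hypothetical and were not discussed in the meeting.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

# (phrase the curator marked, term it was linked to, whether it was accepted)
Decision = Tuple[str, str, bool]


def propose_synonyms(decisions: Iterable[Decision],
                     known_synonyms: Dict[str, Set[str]],
                     min_count: int = 2) -> List[Tuple[str, str]]:
    """Return (term_id, phrase) pairs worth submitting to the term's ontology.

    A phrase is proposed when curators accepted it for a term at least
    `min_count` times and it is not already a known synonym of that term.
    """
    counts: Counter = Counter()
    for phrase, term_id, accepted in decisions:
        normalized = phrase.strip().lower()
        if accepted and normalized not in known_synonyms.get(term_id, set()):
            counts[(term_id, normalized)] += 1
    return sorted(pair for pair, n in counts.items() if n >= min_count)
```

The same accepted and rejected decisions could also serve as training data for tuning a recognizer, as raised in the machine learning discussion.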

EW: machine learning - her understanding - suppose doc linked to gene as determined by curator. Note that there's a graph (KB) where gene is linked to coded for protein. At a later point we could use this to mark up the document with the protein, which is more specific and correct for typical context.

We discuss that, as the link between gene and protein is in the graph, we don't need to redundantly associate the protein with the document as well. Instead, query documents by first expanding terms via KB relations. Relates to previously discussed "semantic search".
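The query-time expansion described here can be sketched as follows, assuming the KB is a list of (subject, relation, object) triples and each document carries only the entities a curator attached to it. The relation name `codes_for`, the function names, and all `ex:` identifiers are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def expand_term(term: str, kb: Iterable[Triple],
                relations: FrozenSet[str]) -> Set[str]:
    """Return the term plus entities one hop away via the given relations."""
    expanded = {term}
    for subject, relation, obj in kb:
        if relation not in relations:
            continue
        if subject == term:
            expanded.add(obj)
        elif obj == term:
            expanded.add(subject)
    return expanded


def find_documents(term: str, kb: Iterable[Triple],
                   doc_annotations: Dict[str, Set[str]],
                   relations: FrozenSet[str] = frozenset({"codes_for"})
                   ) -> List[str]:
    """Find documents annotated with the term or with a KB-related entity."""
    terms = expand_term(term, kb, relations)
    return sorted(doc for doc, entities in doc_annotations.items()
                  if entities & terms)
```

With a KB holding a single gene-to-protein triple, a query for the protein also returns a document annotated only with the gene, so the protein never has to be stored on that document.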

More detailed discussion of tasks
SD:

1. Don't want to invest in a UI that's not usable. The UI they're thinking of is like a Firefox plugin, not a Drupal/HTML kind of thing. Very much like a track-changes function in Word.

2. Her assumption is that the documents would be the docs in the SCF communities. Could extend to abstracts from PubMed. SCF documents are all public documents too.

AR: question: what would the work entail, what tasks would be a part.

TC: starting with an archetype of the prototype, planning on having it usable by the end of July.

The model: a web service that calls other entity recognition services. When it finds something, it creates a link between the context in the doc and the entities recognized, and puts it in a store. A human reviewer can then look at the annotations in that store. Includes a reviewer component - new markup.


 * Envisioned tasks:
 * 1. Information model: currently have a JSON representation, but ideas for representation in RDF. Want to work with SC to agree on how to represent in RDF found terms and doc relationships and provenance and capturing editorial actions and.
 * 2. Server: not sure if it should be a special annotation server, or if the server is a triple store - the annotation store. SCF models that as a RESTful web service that you can call to retrieve annotations.
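To make task 1 concrete, here is a hedged sketch of what a JSON annotation record with provenance and editorial actions might contain. It is not SCF's actual JSON representation, which the notes do not describe; the class names, field names, and status values are all assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class EditorialAction:
    curator: str     # who acted (hypothetical identifier)
    action: str      # "accept", "reject", or "add"
    timestamp: str   # ISO 8601 string, kept as text for JSON round-tripping


@dataclass
class AnnotationRecord:
    document: str     # identifier of the annotated document
    start: int        # character offset where the span begins
    end: int          # character offset just past the span
    exact_text: str   # the text of the span itself
    entity_id: str    # identifier of the recognized or tagged entity
    produced_by: str  # provenance: which service or person created it
    actions: List[EditorialAction] = field(default_factory=list)

    def status(self) -> str:
        """Derive review status from the most recent editorial action."""
        if not self.actions:
            return "unreviewed"
        return "rejected" if self.actions[-1].action == "reject" else "accepted"


def to_json(record: AnnotationRecord) -> str:
    """Serialize a record, including nested editorial actions."""
    return json.dumps(asdict(record), sort_keys=True)


def from_json(payload: str) -> AnnotationRecord:
    """Rebuild a record from its JSON form."""
    data = json.loads(payload)
    data["actions"] = [EditorialAction(**a) for a in data["actions"]]
    return AnnotationRecord(**data)
```

Because the record refers to the document only by identifier and offsets, the same fields could later be expressed as RDF statements about the annotation, whichever server option in task 2 is chosen.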

Discussion of open access to annotations
AR: Are all annotations public?

TC: If within SCF's power to make the annotations public, they will be.

SD: What about web communities not knowing if they can share the annotations?

JAR: Annotations can be open without the documents being open

AR: There can be architectural decision to make it more likely that it goes into the commons.

Agreement that the intention is to make the annotations open, short discussion about legal issues re: data sharing

TC: Architecturally, yes. Warning that some convincing work may be needed with web communities. But will try to do that.

SD: But when we license annotations from companies, we can't share those.

JAR: Can prevent access by license, but can't protect the annotation via copyright. licenses and contracts have priority when determining rights.

User Interface component - Tim's View
TC: AR had asked if UI is part of this; thinks that it needs to be. How big of a part is hard to guess - can't imagine it'd be more than 20% of TBH. Could be significantly less. (SD thinks up to 30%.) Want to get to something that can be used in a lot of different contexts. Not tied to Drupal or any other architecture (maybe Firefox).

Will not include Drupal-specific work.

AR: API that needs to be created to use these resources. Such bridging APIs are in the proposal, within scope.

Discussion of what "Resources" are
SD: Definition of semantic resources does not include annotations. Asked for agreement that this is beyond specific aims of the proposal and that we are now working beyond the proposal.

AR: Annotations are resources, explicitly laid out in specific aims. cites SWAN as an example.

AR: It's clear we don't have the same understanding. Item to continue to discuss. Explains that to the extent we deviate from our board's understanding, that's where we get heat. Only when something is not plainly in the proposal and not in line with what we do.

SD: wondering if there are certain things that SCF can do that are not open access as part of this grant.

We share experience with oversight of our research and how constraints arise. There is increased understanding.

EW: 5-6 resources, antibody is one. Is the annotation part here one? SD and EW view the databases as the resources.

SC view: View product of annotations as a resource.

TC: Emphasis that the annotation portion he's laid out is Mission Critical and SD concurs.

TBC.

PRO
AR: Darren Natale (works heavily on PRO) said he'd come to Boston, need to schedule a date/time.

EW: Action item to suggest dates, check her and Gwen's schedule, then reach out to Darren

Antibodies
SD: make some tools to help Don do his work - is that part of the scope of the project?

AR: understanding from AR that he's a very light resource and to be handled delicately. (conclusion: Making tools for Don not on the table)

EW: Had discussion with Don - have an idea how antibody entry was filled over 15 years. Have a good amount of background on the process used by Don (in GoogleGroups notes).

EW/AR: Want to narrow the scope.

EW: in terms of SWAN, have genes and proteins that have already been mentioned - or look at Alzforum.

Ways to narrow scope would be to focus on a selection of proteins, or focus on those antibodies that have papers associated with them in Alzforum. Not everything is linked to a paper - most are in the last 2 years according to Don.

AR: Had meeting with ABCAM - appealing to them to narrow to neuroscience.

AR: Also found out from Don - Reagent company practice is to write a spec. sheet which never gets updated again. Notes shared interest in that ABCAM would like to update more frequently, and they have very detailed spec sheets, and that NC/Semweb may be opportunity to make that possible/inexpensive.

 Proposed way to move forward / Next steps:

AR proposes that we pick limited set of genes and see how we do. Next level of expansion to ones annotated with papers.

Several strategies for selecting the gene are discussed.

 What would be needed to create the semantic resource?


 * Get list from Elizabeth
 * Get dump from Alzforum, can assign some list to Don (who expressed some interest), some for hire.
 * Alan (+hire?) resuscitate old code.
 * Find some percentage of matches with automated, part manual curation.
 * Cleaning / curation - SD can help (one time thing) with writing code to find synonyms.
 * Cleaning and curation - anticipate that hire would do that partially.
 * Create representation for antibodies

SD: What about proteins for neuroscience outside of Alzforum, how would we get the representations?

EW: Alzforum is fairly comprehensive, even outside Alzheimer's disease.

SD: maybe hire students to help.

AR: Plan to calibrate once have first 100 or so, see what's needed.

Discourse
We had some discussion of IAO and LOD aspects. Tim thinks it useful if Alan presents IAO in future meeting.

AR: What's needed to move forward? - to get a dump from SWAN.

TC: Not that simple. The current SWAN ontology model is a step ahead of where the internal representation of SWAN's triple store is. At some point, we will work to make that happen, but not right now. Long term objective: have to upgrade the ontology; have triples stored in an older format that need to be transformed.

AR: One thing to think about is that there will be some transformation work as part of this. Instead of working on new model then export independently, perhaps combine into one task.

TC: We will lay out requirements when Paolo gets back from travel. Help lay out any changes needed to SWAN to lay out in Linked Open Data format.

Discussion of meaning of "open" in context of the project
A discussion ensued about what the scope of work for project was, and whether it included making SCF resources available, and how.

We noted that we may not have a shared understanding of what we mean by "open", "the commons", and "making available" on the Web / on the Semantic Web.

AR says that included in scope is making SCF content available openly. Question of how open communities will be in SCF deployments.

Proposal has in section 1: "Providing means to link and publish knowledge authored by members of the community back to the Semantic Web"

TC agrees that making available would be open and on the Web, SD disagrees.

SD: sticking point from proposal about making content "available" - doesn't say that means it's a part of NeuroCommons. can keep it as "linked data".

SC disagrees.

AR: If interoperable, making part of the Neurocommons is trivial.

SD: Says not part of any of the specific aims - Tim disagrees, says it was. AR refers to language in the proposal.

Project plan item 4.
 * Help make SWAN and SCF data available on the web, fully integrated with the Neurocommons KB and Semantic Web.
 * Identify appropriate data elements for export, based on potential usage scenarios.
 * Model SWAN and SCF data to be exported following Neurocommons conventions and interfaces.
 * Establish protocols for packaging and disseminating data modules.
 * Establish intellectual property expectations and norms related to published content.

AR: It is in our interest, and our expectation and goal, that not all but a substantial portion of SCF is open and contributed to the commons.

SD: Saying "open" and "contributing to the commons" are not synonymous.

AR/KT: Point out that they are in the way we speak about them, in our organization.

All recognize need for future discussion.

Adjourned