Semantic resources project/Meeting notes/2009-06-02

Notes from the June 2 meeting (rescheduled from May 29)
Taken by: Kaitlin Thaney

Attendees: Kaitlin Thaney, Alan Ruttenberg, Elizabeth Wu, Sudeshna Das, Jonathan Rees, Tim Clark, Paolo Ciccarese

Agenda:
I. SCF annotation framework presentation by TC
II. Structure of NeuroCommons distribution by JAR
III. Hiring (to be done via telecon this week)
IV. Logistics
V. Action items
 * Background:
 * Belief that multi-disciplinary collaboration is necessary for progress in understanding and curing disease.
 * SCF seeks to address the need to pose the right questions. Building scientific communities on the web can facilitate collaboration in a major way: better synthesis of knowledge enables users to pose better questions.
 * Question: how does one best link together biomedical web communities to accelerate research?
 * Requirements for linking:
 * participation of scientists, and credibility
 * shared terminology (trying to be inclusive of non-computer systems as well)
 * a robust and diverse technology ecosystem around the software supporting the communities
 * an incentive structure - people need a reason to contribute
 * efficient and reliable metadata assignment
 * common interfaces with toolset diversity
 * Drupal conversation:
 * Drupal stores content only in MySQL; in Drupal, RDF does not go to a triple store.
 * AR: How does RDF fit into such systems?
 * SD: RDF/XML describing the content fields can be downloaded from each page. Re: mapping to RDF, one can use RDFa or produce RDF from that mapping.
 * AR: No bridging to NC? Is RDFa used just for naming entities, or for something more complex?
 * TC: Stephane and John Breslin at DERI are working on reading RDF in; no internal RDF, just mappings. One can create a triple store from the mappings and keep RDF there for various communities.
 * AR: The idea of keeping different databases seems fragile.
 * PC: Drupal was created to work on small machines with small amounts of memory; more is needed to move to triple stores, etc.
 * Discussion of RDF CCK
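The field-to-RDF mapping discussed above (Drupal/CCK content fields mapped out to triples kept in a separate store) could be sketched roughly as follows; the node fields, namespaces, and predicate mapping here are hypothetical illustrations, not the actual Drupal module interface:

```python
# Sketch: mapping Drupal-style content fields to RDF triples.
# Field names, the base namespace, and the mapping are hypothetical.

BASE = "http://example.org/node/"           # hypothetical site namespace
DC = "http://purl.org/dc/elements/1.1/"     # Dublin Core element set

# A CCK-style node as it might come out of MySQL
node = {"nid": 42, "title": "Antibody catalogue", "field_author": "E. Wu"}

FIELD_MAP = {  # content field -> RDF predicate
    "title": DC + "title",
    "field_author": DC + "creator",
}

def node_to_triples(node):
    """Yield (subject, predicate, object) triples for one node."""
    subject = BASE + str(node["nid"])
    for field, predicate in FIELD_MAP.items():
        if field in node:
            yield (subject, predicate, node[field])

triples = list(node_to_triples(node))
```

The point of keeping the mapping in one table is that the same node can feed both RDFa markup on the page and a bulk export to a triple store.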
 * Annotation and Text mining / Entity recognition:
 * Linking the communities implies annotation of content (text content: elements in a discussion, papers, etc.)
 * Annotation maps identified entities to text content.
 * Semi-automatic text mining: machines suggest terms, human editors refine the choices.
 * SCF is currently mining for gene names and Gene Ontology terms.
 * Text mining interface and notes:
 * The editor is asked whether to use a term at all in annotating the content as a whole; clicking the RDF icon gives a graph navigation of GO.
 * Text mining (referred to in terms of "ontology-driven entity recognition")
 * SCF's idea is to make a "service of services"
 * annotation repository - common or site specific
 * cross platform browser based
 * Entity recognition: can automatically expand by schema, identify other semantic relationships, and link to other documents; process the metadata, save it in the annotation repository, integrate.
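The ontology-driven entity recognition described above can be pictured, at its simplest, as a dictionary matcher over text; the lexicon below is a hypothetical toy standing in for SCF's gene-name and GO-term vocabularies:

```python
import re

# Toy lexicon: surface form -> entity URI. Identifiers are illustrative.
LEXICON = {
    "APOE": "http://example.org/gene/APOE",
    "apoptosis": "http://purl.org/obo/owl/GO#GO_0006915",
}

def recognize(text):
    """Return (surface form, URI, offset) for each lexicon hit in text."""
    hits = []
    for term, uri in LEXICON.items():
        for m in re.finditer(r"\b%s\b" % re.escape(term), text, re.IGNORECASE):
            hits.append((m.group(0), uri, m.start()))
    return sorted(hits, key=lambda h: h[2])

hits = recognize("APOE variants may modulate apoptosis in neurons.")
```

In the semi-automatic workflow, each hit would become a *suggested* annotation for an editor to accept or reject, not an assertion saved directly.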
 * JAR: curious how it compares to GO annotation project (Judy Blake et al.) - GOA does information extraction, not just entity recognition, but structurally the project seems similar.
 * AR: How integral is it to the SCF model that annotations are between terms and papers? (Example of a GO annotator mapping gene products to processes.)
 * AR suggests making annotations more statement- or association-based, citing the structure of the Gene Ontology DB: gene - qualifier - process makes up a package, and that package is then associated with a paper. A paper can have a number of these little packages attached, which is what the Gene Ontology people think of as "annotations". He suggests this since it has been culturally adopted by GO.
 * TC/SD - (difficult to recreate, what GO is for)
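Alan's statement-based suggestion, where a (gene, qualifier, process) package is associated with a paper rather than a bare term-to-paper link, might be modeled as below; the class names, PMID, and identifiers are hypothetical illustrations:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class Association:
    """One GO-style annotation 'package': gene - qualifier - process."""
    gene: str        # gene product identifier (illustrative)
    qualifier: str   # e.g. "involved_in", "NOT"
    process: str     # GO process term (illustrative)

@dataclass
class PaperAnnotations:
    """A paper with any number of association packages attached."""
    pmid: str
    associations: List[Association] = field(default_factory=list)

paper = PaperAnnotations(pmid="PMID:12345")  # hypothetical PMID
paper.associations.append(Association("APOE", "involved_in", "GO:0006915"))
paper.associations.append(Association("PSEN1", "involved_in", "GO:0007219"))
```

The contrast with plain term-to-document annotation is that each attached unit is itself a small statement that can be queried on its own.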
 * Semantic search component of SCF:
 * AR asks for clarification: not just searching down the tree, but also searching various names and relationships that can be exploited?
 * TC: this is how they want to map, an important part of SCF's use case.
 * AR: Is there a need for relations, or just entities? It may be a matter of providing URIs for known entities; not adding statements, just term -> document?
 * EW provides the example of an antibody database. When annotated with gene -> protein, one can search all antibodies available ... 3 steps removed from going from term -> document.
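Elizabeth's antibody example, where a search term is expanded through gene -> protein -> antibody links, is essentially a short graph traversal; the link table and names below are hypothetical:

```python
from collections import deque

# Hypothetical link table: term -> directly linked terms
LINKS = {
    "GRN": ["progranulin"],                          # gene -> protein
    "progranulin": ["anti-PGRN-1", "anti-PGRN-2"],   # protein -> antibodies
}

def expand(term, max_hops=3):
    """Breadth-first expansion of a search term up to max_hops links away."""
    seen, frontier = {term}, deque([(term, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in LINKS.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

results = expand("GRN")
```

A search for the gene name then also returns documents annotated with the protein or the antibodies, even though none of those strings appear in the query.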
 * What is a "semantic resource"?
 * TC: two parts of the project. First, identifying resources, meaning biologists' resources, particularly various types of reagents, the stuff that's in methods and materials; second, making them available to SCF users.
 * SD: something that you can use in the lab or in research: an antibody, physical material, an electronic resource, a catalogue of antibodies, PubMed, Entrez, datasets, any repository of information. If you express it in triples, then it's a "semantic resource".
 * PC: from software development perspective, SWAN would be considered as a resource, everything you import / reference / access in your system can be considered a "resource"
 * AR: collections of ways to identify things (naming); the "semantic" comes in when you're able to get more out than you put in and do some inference.
 * AR : this details one way that set of semantic resources can be used for a particular annotation.
 * TC: this can be a way to integrate SWAN with SCF, and can greatly expand what you can say about an article.
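Alan's point that "semantic" means getting more out than you put in can be illustrated with a minimal inference step: computing the transitive closure of subclass links. The terms below are hypothetical:

```python
def transitive_closure(pairs):
    """Given (sub, super) pairs, derive all implied (sub, super) pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Two asserted subclass links (hypothetical terms)...
asserted = {("neuron", "cell"), ("cell", "physical_entity")}
inferred = transitive_closure(asserted)
# ...yield a third, unasserted fact: ("neuron", "physical_entity")
```

Only two facts went in, but three come out; that unasserted third fact is the "more out than you put in".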
 * re: Curation
 * AR suggests some curation can be done in semi-automated ways (work Alan did for Alzforum before, with a bit of spot checking).
 * Similar to a Linux distribution ...
 * Start with an unruly bunch of software systems (bash, TeX, Python, etc.)
 * Go through a porting process to get a "package" that satisfies a set of rules.
 * When you aggregate, everything is now uniform; you can choose which packages you want as part of the distribution.
 * SC is advocating that the community follow the same pattern: picture what it would be like to have a package-like distribution of data/knowledge resources (RDF distribution = data; RDFherd, a Perl module that manages processing of RDF packages, by Alan Bawden; also a package manager that will load and set up a triple store).
 * Like the Bio2RDF tools: go to the package manager, say "make"; it runs the composite software, loads the data, and it goes into the triple store automatically.
 * Note: there is a second triple store that loads only OBO and updates every night.
 * TC: asks if he can just get the software if he wants. (Answer: yes)
 * JAR clarifies TC's question about NC being an open source project, and to what extent: enough control is kept to make sure the data coheres, like Linux distributions where some control is exerted but the software is still open source.
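The "make"-style package workflow described above (fetch source data, convert to RDF, load into the triple store) might be sketched as below; the package names and step functions are hypothetical stand-ins, not the actual RDFherd interface:

```python
# Sketch of a "make"-style driver for RDF data packages.
# This is NOT the real RDFherd interface; all names are hypothetical.

PACKAGES = ["obo", "bio2rdf-genes"]  # hypothetical package names

def fetch(pkg):    return "%s.raw" % pkg   # download the source data
def convert(pkg):  return "%s.rdf" % pkg   # translate it to RDF
def load(pkg, log):                        # load it into the triple store
    log.append("loaded %s.rdf" % pkg)

def make(packages):
    """Run the fetch -> convert -> load pipeline for each package."""
    log = []
    for pkg in packages:
        fetch(pkg)
        convert(pkg)
        load(pkg, log)
    return log

log = make(PACKAGES)
```

As with a Linux distribution, the value is that every package passes through the same uniform pipeline, so aggregating a new resource is a configuration choice rather than a one-off porting job.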
 * Languages:
 * Try not to make choices that would prevent someone from using a language.
 * Ones used to date: Common Lisp for the translators (runs on the JVM), Scheme code, XQuery, Perl for the package managers, Python for PDB.
 * TC raises a point re: the translators - the maintainability of the translators is very important for SCF.
 * AR recommends that SCF run its own instance to become familiar with it.
 * Standing meeting on 5 June cancelled, timeslot partially used for hiring telecon.
 * TC, AR to speak with Judy Blake
 * Hiring telecon scheduled for Friday 5 June.