Semantic resources project/Contextual Annotation Cache

Motivation
An annotator identifies words or phrases within the context of a larger document and creates an annotation: a match between the identified word or phrase and an identifier from a pre-existing vocabulary. For the purposes of this document, we ignore all the associated metadata of an annotation, such as the identity of the annotator, the time of the annotation's creation, etc.

For our purposes, we identify the act of creating an annotation as having three components:
 * 1) a context,
 * 2) a source word or phrase, and
 * 3) a target identifier.

We will make three assumptions about the act of annotation.

First, we assume that the annotator is not creating annotations within a single document -- he or she may be creating annotations within hundreds or thousands of documents over a period of time. He or she may not even be human, but perhaps an "annotation software system" operating with or without human guidance.

Second, we assume that the documents to be annotated are themselves structured: there are words or phrases which are reused across documents, and which "mean" the same thing across those documents. (We will avoid any complicated theory of semantics here; for two source phrases to have identical "meaning" in this document suggests only that the same annotator should annotate both phrases with the same target identifier.) However, we do not assume that identical phrases always have identical meaning across all documents -- we allow the context of the phrase to modify its meaning, so that the word "Tau" in a document about ancient languages may mean "the Greek letter 'tau'," while it may indicate a kind of protein in a document concerning the biology of Alzheimer's disease. The contexts of these two phrases will be different.

Finally, we assume that the act of annotation is actually a negotiation between two parties: the annotator and the target-identifying agency. (We leave the nature of this agency undetermined; it is merely a means by which the annotator converts a source-phrase-and-context into a target identifier.) Furthermore, we assume that this negotiation is expensive, either in terms of time, computational power, or other limited resources. An annotator who is searching thousands of similar documents, and finds a source phrase (with identical meaning) within each one, will be forced to go through the process of annotation thousands of times. If the annotator is human and realizes this identity-of-meaning, he or she may elect to duplicate the same target identifier "by hand" over and over into each new document, an operation that may not be easily supported by any annotation assistance software he or she is using. If the annotator is an automatic software component, it may lack the ability to recognize when commonly-used source phrases have identical meaning, and so be forced to repeat the same process of annotation many times.

What we need is an annotation cache, a middle layer which sits between the annotator and the pre-existing source of target identifiers. The cache receives a source phrase and context from the annotator, and (at first) passes them along to the target-identifying agency. However, the cache remembers the context and target identifier for each source phrase before passing the target identifier back to the annotator. At a later time, an identical source phrase in an identical (or "similar") context need not be passed to the target-identifying agency at all; the cache simply returns the previously indicated target identifier.

To the annotator, this reduces the cost and increases the speed of annotation (since the cost of simple matching, together with the memory required to store previous matches, is presumably far less than the cost of talking directly to the target-identifying agency). It removes the requirement that a human annotator be able to remember thousands of previous annotations, and reduces the need for annotation software to have a similar "memory" built in.

It also reduces the demands on the target-identifying agency, which does not have to satisfy thousands of essentially-identical requests for identifiers.

Context
The context captures the identity and provenance of the source phrase to be annotated. The key requirement of the annotation cache is that identical source phrases with identical contexts always have identical meaning; i.e. should always be assigned the same target identifier(s) during annotation. The context may capture the following information:
 * 1) document identifier (DOI, PMID, or other uniquely-assigned expression-level identifier)
 * 2) provenance (URL, timestamp, HTTP headers)
 * 3) annotator name
 * 4) annotation "task" identifier
 * 5) internal-document references (column name, paragraph number, text offset)

The annotation cache does not require or impose the presence of any particular information in the context. The only requirement of the cache is that the set of all contexts be hierarchical, or tree-structured.

A cache context is an ordered array of strings. We write the context in the traditional array notation, S = s[0] ... s[N-1], and we use the 'len' operator to give the length of the context (i.e. len(S) = N).

Cache contexts are given a partial order, based on containment: S < S' iff len(S') <= len(S) and \forall i \in [0, len(S') - 1]. s[i] = s'[i]. In other words, a cache context is a subcontext of another cache context if the second context is a prefix of the first.

There is a unique empty context, called the top context.
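
The containment order can be illustrated with a short Python sketch. This is only an illustration under assumptions: the function name is_subcontext, the constant TOP, and the example context strings are hypothetical and are not part of the cache specification.

```python
from typing import Sequence, Tuple

# The top context is the unique empty context.
TOP: Tuple[str, ...] = ()


def is_subcontext(s: Sequence[str], s_prime: Sequence[str]) -> bool:
    """Return True iff s is a subcontext of s_prime, i.e. s_prime is a prefix of s."""
    n = len(s_prime)
    # len(S') <= len(S), and the first len(S') entries agree.
    return n <= len(s) and tuple(s[:n]) == tuple(s_prime)


# Hypothetical contexts: document, then section, then column.
document = ("doc-1",)
column = ("doc-1", "table-2", "column-3")

print(is_subcontext(column, document))  # the column context lies under the document
print(is_subcontext(document, column))  # but not the other way around
print(is_subcontext(column, TOP))       # every context lies under the top context
```

Representing a context as a tuple of strings keeps it hashable, which is convenient if contexts are later used as dictionary keys.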

Target Identifier
A target identifier is an arbitrary string, intended to be associated with a piece of source text in an annotation. Examples of target identifiers include:
 * 1) ontological term identifiers
 * 2) URIs identifying RDF resources
 * 3) plain text strings
 * 4) database identifiers or references

Optionally, the target identifier may include a provenance attribute, indicating the database or resource from which the identifier was taken (if the target-identifying agent is able to fulfill queries against multiple different resources simultaneously). The cache does not require or restrict provenance or other attributes on the target identifiers; it must only be able to tell when two target identifiers are identical or not.
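
One possible way to carry an optional provenance attribute while still supporting the identity test is sketched below in Python. The class name TargetIdentifier, its field names, and the example values are assumptions made for illustration. In particular, treating provenance as part of a target's identity is one choice among several, and the source does not fix it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TargetIdentifier:
    """A target identifier string with an optional provenance attribute.

    frozen=True makes instances immutable and hashable; the generated
    __eq__ compares both fields, so two targets are identical only if
    both the identifier and the provenance agree (an assumption).
    """
    identifier: str
    provenance: Optional[str] = None


# Hypothetical values, for illustration only.
a = TargetIdentifier("term-42", provenance="vocabulary-A")
b = TargetIdentifier("term-42", provenance="vocabulary-A")
c = TargetIdentifier("term-42")

print(a == b)  # same identifier and provenance
print(a == c)  # same identifier, different provenance
```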

Cache Functions
The annotation cache should support three functions. The description of these functions depends on a few basic types.

 string  := [any sequence of characters, including spaces and escaped characters]
 Source  := string
 Context := string*
 Target  := string

Query
query(Source, Context)

Returns an ordered list of Target objects, appropriate to the given Context. These elements may have been generated by an underlying query to the target-identifying agent, or may have been cached from a previous query, or both.

RegisterMatch
registerMatch(Source, Context, Target)

Signals to the cache that the given Target term is appropriate to the Source string within the given Context. This will result in the Source, Target pair being stored for the given Context, and may result in the pair being cached for additional (unspecified) Context values.

UnregisterMatch
unregisterMatch(Source, Context, Target)

Signals to the cache that the given Target term is inappropriate to the Source text within the indicated Context.

This call raises an exception (indicating an error) if the given Target was not cached for the Source in this Context.

Cache Operation
The cache operates in one of three modes: simple, specifying, and generalizing. The cache's mode determines how it caches Source, Target pairs for Context values other than the context for which the pair was originally registered.

Simple Mode
In the simple mode, the cache operates by simply remembering triples of Context, Source, and Target that are registered with it by the registerMatch function.

If a call to query provides a Context and Source which have been previously registered, then the list of all corresponding Target values is returned. If no matching Context and Source pair has been registered, the cache defaults to returning the results from the target-identifying agent.

In the simple mode, no attempt is made by the cache to take advantage of the hierarchical nature of the contexts.
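
A minimal Python sketch of the simple mode follows. It is an illustration under stated assumptions, not the project's implementation: the class name SimpleAnnotationCache, the snake_case method names, the use of KeyError for the error case, and the stand-in agency function are all hypothetical. The target-identifying agent is modeled as a plain callable that takes a source and a context and returns a list of targets.

```python
from typing import Callable, Dict, List, Sequence, Tuple

ContextKey = Tuple[str, ...]
Agency = Callable[[str, ContextKey], List[str]]


class SimpleAnnotationCache:
    """Simple mode: remember (Context, Source, Target) triples, exact match only."""

    def __init__(self, agency: Agency) -> None:
        self._agency = agency  # fallback when nothing is registered
        self._matches: Dict[Tuple[ContextKey, str], List[str]] = {}

    def query(self, source: str, context: Sequence[str]) -> List[str]:
        key = (tuple(context), source)
        if key in self._matches:
            # Previously registered: answer without contacting the agency.
            return list(self._matches[key])
        return list(self._agency(source, tuple(context)))

    def register_match(self, source: str, context: Sequence[str], target: str) -> None:
        targets = self._matches.setdefault((tuple(context), source), [])
        if target not in targets:
            targets.append(target)

    def unregister_match(self, source: str, context: Sequence[str], target: str) -> None:
        key = (tuple(context), source)
        targets = self._matches.get(key, [])
        if target not in targets:
            # Assumption: KeyError stands in for the unspecified exception.
            raise KeyError(f"{target!r} is not cached for {source!r} in {key[0]!r}")
        targets.remove(target)
        if not targets:
            del self._matches[key]  # fall back to the agency again


# Stand-in agency that records how often it is consulted (hypothetical).
calls: List[Tuple[str, ContextKey]] = []

def fake_agency(source: str, context: ContextKey) -> List[str]:
    calls.append((source, context))
    return ["agency:" + source]

cache = SimpleAnnotationCache(fake_agency)
ctx = ["doc-1", "abstract"]
cache.register_match("Tau", ctx, "protein:tau")
print(cache.query("Tau", ctx))        # served from the cache
print(cache.query("Tau", ["doc-1"]))  # parent context: not matched in simple mode
print(len(calls))                     # the agency was consulted once
```

Because the lookup key is the exact (Context, Source) pair, a match registered for a context is never reused for its parent or child contexts, which is the behavior that distinguishes the simple mode from the other two.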

Interface
Description of an HTTP-accessible interface.

Implementation
Details of the web-service implementation.