Semantic resources project/CandidateResources/Resources/SpatialData

All Experiments

 * 1) Publication
 * 2) PubMed ID
 * 3) Unpublished data flag
 * 4) Provenance
 * 5) Lab
 * 6) Experimenter
 * 7) Date?
 * 8) Accession Number

Microarrays

 * 1) Platform Manufacturer
 * 2) Commercial Design? (Affy, Agilent, Nimblegen, other?)
 * 3) Old "spotted" arrays.
 * 4) Design
 * 5) Probe identities (sequences!)
 * 6) Annotated gene assignments
 * 7) Probe Locations
 * 8) Hybridization Strategy
 * 9) # Channels
 * 10) Assignment of Samples -> Channels
 * 11) Protocol Information (Prep, temperature, etc.)
 * 12) Normalization
 * 13) Spike-in, multi-array, arbitrary baseline?
 * 14) Algorithm, software version

Sequencing

 * 1) Sequencer and Sequencing Parameters
 * 2) Read Length
 * 3) Single vs. Double-ended
 * 4) Sequencer Manufacturer
 * 5) Alignment Method
 * 6) Algorithm, software, and version
 * 7) Parameters
 * 8) Uniqueness constraints
 * 9) Quality Score information (reads or sequencing run)
 * 10) Per-run quality information?
 * 11) Per-read quality scoring?

ChIP-*

 * 1) Antibody
 * 2) Tagging system
 * 3) Protein specificity
 * 4) Sample Description (Cell Line/Species)
 * 5) Peak-Calling Parameters
 * 6) Gene Assignment Parameters

Expression/Transcription

 * 1) Sample
 * 2) Transcript Calling?
 * 3) Differentiate between "spatial" and "non-spatial" techniques?
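
The metadata lists above could be captured as simple record types. The following is a minimal sketch, assuming Python dataclasses; the class names, field names, and defaults are illustrative choices and are not part of any existing schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExperimentMetadata:
    """Fields from the "All Experiments" list (names are hypothetical)."""
    accession_number: str
    lab: str
    experimenter: str
    unpublished: bool = False          # the "unpublished data" flag
    pubmed_id: Optional[str] = None
    publication: Optional[str] = None
    provenance: Optional[str] = None
    date: Optional[str] = None         # the source list marks "Date?" as uncertain


@dataclass
class MicroarrayMetadata(ExperimentMetadata):
    """A few of the microarray-specific fields, layered on the common ones."""
    platform_manufacturer: str = ""
    channels: int = 1
    # Maps channel number -> sample description ("Assignment of Samples -> Channels").
    channel_samples: dict = field(default_factory=dict)
    normalization: Optional[str] = None
```

Sequencing and ChIP records could extend the same base class in the same way, so that the common fields are queried uniformly across experiment types.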

GEO: Gene Expression Omnibus
http://www.ncbi.nlm.nih.gov/geo/

EMBL-EBI ArrayExpress
http://www.ebi.ac.uk/microarray-as/ae/

NCBI Trace Archives
http://www.ncbi.nlm.nih.gov/Traces/home/
The SRA, or Short Read Archive, has a few transcriptome sequencing experiments in its index.

SMD: Stanford Microarray Database
http://smd.stanford.edu/

There are several other university microarray database websites (Texas and UNC, and perhaps Yale) that I believe are based on the SMD system and schemas. A shared schema might make it easier to import data from all of them; however, some of the experiments in these databases will also be present in GEO (perhaps in a slightly modified format), so resolving reference identity between these databases will be important.

File Formats
SOFT, CEL, GPR, Tabular, MAGE-ML, Relational database (?), MINIML

Publication-based Access
The scientist likely keeps a set of publications in which they are interested -- either publications on which they are co-authors, publications that they have personally selected as being of interest, or publications derived from the references of others. A scientist will want the ability to take a list of publications (presented as PubMed IDs, or DOIs, or maybe names) and translate it into a set of URIs for the microarray and sequencing experiments that were published in those articles. Furthermore, the ability to derive Accession Numbers, in bulk, from publication lists would be useful for bulk download.
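
A minimal sketch of the publication-to-accession lookup described above, assuming a hypothetical in-memory index keyed by PubMed ID. The identifiers are made-up placeholders, not real records, and the function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical (accession number, PubMed ID) pairs; placeholder values only.
EXPERIMENT_RECORDS = [
    ("ACC-0001", "PMID-A"),
    ("ACC-0002", "PMID-A"),
    ("ACC-0003", "PMID-B"),
]


def build_publication_index(records):
    """Group accession numbers by the publication that reported them."""
    index = defaultdict(list)
    for accession, pubmed_id in records:
        index[pubmed_id].append(accession)
    return index


def accessions_for_publications(index, pubmed_ids):
    """Return {pubmed_id: [accessions]}; unknown publications map to an empty list."""
    return {pmid: list(index.get(pmid, [])) for pmid in pubmed_ids}


if __name__ == "__main__":
    index = build_publication_index(EXPERIMENT_RECORDS)
    print(accessions_for_publications(index, ["PMID-A", "PMID-C"]))
```

Returning an empty list for an unknown publication, rather than raising an error, keeps bulk lookups over long reading lists from failing on a single article that has no deposited data.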

Automatic Assembly-Based Update
Both microarrays and short-read sequencing experiments depend for their spatial interpretation on the locations of sequences (either probes or short reads) mapped to a genome assembly. However, genome assemblies are periodically updated, and those updates require a user to update his or her experimental sequence mappings in order to continue to interpret the experimental results against the new genome. An important use case for a working scientist would be to take a set of experiment identifiers (experiments that the scientist has produced him or herself, or experiments that he or she has downloaded and is using) and to find the subset of those identifiers whose mappings have been superseded by a "new" genome assembly. This subset indicates which experimental datasets must be re-processed before research can continue.
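
The assembly check described above can be sketched as a comparison between the assembly each experiment was mapped to and the current assembly for its organism. The assembly names, experiment identifiers, and dictionary layout below are assumptions made for illustration.

```python
# Hypothetical: the most recent assembly known for each organism.
CURRENT_ASSEMBLY = {"mouse": "assembly-v2", "human": "assembly-v3"}

# Hypothetical: the assembly each experiment's probes or reads were mapped to.
EXPERIMENTS = {
    "exp-1": {"organism": "mouse", "assembly": "assembly-v1"},
    "exp-2": {"organism": "mouse", "assembly": "assembly-v2"},
    "exp-3": {"organism": "human", "assembly": "assembly-v3"},
}


def experiments_needing_remap(experiment_ids, experiments, current_assembly):
    """Return the identifiers whose mappings predate the current assembly."""
    stale = []
    for exp_id in experiment_ids:
        record = experiments[exp_id]
        if record["assembly"] != current_assembly[record["organism"]]:
            stale.append(exp_id)
    return stale


if __name__ == "__main__":
    print(experiments_needing_remap(["exp-1", "exp-2", "exp-3"],
                                    EXPERIMENTS, CURRENT_ASSEMBLY))
```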

Query by Biological Unit
The most important use case is the ability for a scientist to query the collection of experiments to find which contain results relevant to a particular biological unit of interest:
 * 1) Is a particular gene probed by any microarray experiment?
 * 2) Was the binding of a particular transcription factor measured in any ChIP-based experiment?
 * 3) Was a particular antibody used in any ChIP-based experiment?
 * 4) Which experiments contain information about an arbitrary spatial region of the genome?
Furthermore, these queries can be combined according to sets of genes or proteins -- questions about protein complexes, pathways or metabolic reactions, or arbitrary gene sets.
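
The spatial-region query can be sketched as an interval-overlap test. This is a minimal illustration that assumes half-open coordinates (start inclusive, end exclusive) and a hypothetical mapping from experiment identifiers to the regions they cover; a real index over many experiments would likely use an interval tree or a database index instead of a linear scan.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # exclusive


def overlaps(a, b):
    """True if two regions share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def experiments_covering(query, experiment_regions):
    """Return sorted ids of experiments with any region overlapping the query.

    experiment_regions: {experiment_id: [Region, ...]} (hypothetical layout).
    """
    return sorted(
        exp_id
        for exp_id, regions in experiment_regions.items()
        if any(overlaps(query, region) for region in regions)
    )
```

The gene-based questions reduce to the same test once a gene is translated into the region it occupies on a given assembly.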

Other Experimental Datatypes
ArrayCGH, 4C, 5C, Sequencing, genotyping (SNP detection)

Relevance to Stem Cell Research
Many expression experiments have been performed on stem cells, first for the purposes of biomarker identification and later for following the development of stem cells. ChIP-chip and ChIP-seq have also been used to discover the binding locations of transcription factors associated with "stemness."

Drafts

 * [[Media:Spatial-resource-proposal-09-20-2009.pdf|9/20/2009]]

Ontologies

 * Cell type (CL)
 * Chemical Entities of Biological Interest (CHEBI)
 * Evidence codes (ECO)
 * Ontology for Biomedical Investigations (OBI)
 * Protein Ontology (PRO)
 * Sequence Types and Features (SO)
 * Systems Biology (SBO)
 * Biological Imaging Methods (FBbi)
 * Expressed Sequence Annotation for Humans: eVOC (EV)
 * Gene Regulation Ontology (GRO)
 * Information Artifact Ontology (IAO)
 * Microarray Experimental Conditions (MO)
 * Physico-chemical methods and properties (FIX)
 * Proteomics data and process provenance (ProPreO)


 * "My Experiment" Ontologies
 * SCOVO

Datasets
ChIP-chip experiments:

Antibodies Used:

Miscellaneous
A list of terms that need to be identified or defined in the course of looking at these kinds of experiments:

Probe
A short sequence of DNA (oligonucleotide) that hybridizes to DNA present in an experimental sample. In gel analyses, the probe is labeled (to make it visible, either in a radiograph or through fluorescent light) and applied to a sample which is fixed in place (the "blot"), while in microarray analyses the sample is labeled and then applied to the probes that are fixed in place; in either case, the combination of a labeled nucleic acid that hybridizes to another nucleic acid which has been fixed in place allows the experimenter to read which hybridizations have, or have not, occurred.

Microarray
An experimental platform on which many probes (thousands to hundreds of thousands) are fixed in known locations. Microarrays are a massively parallel way of measuring the preferential hybridization of a labeled sample to different probes. The choice of probes to be placed on the array (and the location of the probes on the array) is known as the array design. Microarrays, after hybridization, are read by a scanner -- a robotic machine that takes the array design and the physical microarray as input, and systematically scans each spot on the array (each probe) with a laser and a camera. If the spot "lights up," then the probe has hybridized to labeled DNA from the sample; the machine quantifies the amount of response for each probe and reports the total probe intensity set as output.

Hybridization
Generically, the binding of one nucleic acid strand to another by matching of complementary base pairs. More specifically, the "hyb" is the experimental step in a microarray protocol where the sample is applied to the microarray and is allowed to specifically hybridize to individual probes. The hybridization step is characterized by the content of the sample (and how it is labeled), the identity of the array, and the conditions under which it is performed (temperature is an important factor). Some arrays can be hybridized multiple times, either in the same experiment or across different experiments.

ChIP: Chromatin Immunoprecipitation
An experimental protocol by which fragments of DNA (chromatin) that were physically near a particular protein in the nuclei of a cell population can be purified using antibodies (immuno-) and then extracted and measured (precipitation). The first step of ChIP is soaking the cells in a fixing agent, often formaldehyde. This "glues" together every piece of the cell to every nearby piece -- including proteins that bind DNA to the DNA that they are binding to. The second step is "sonication," the process of physically shattering the cells into pieces; some of those pieces will be the DNA-binding proteins, still attached to a piece of the DNA they bound to. Finally, the targeted protein is purified out using a purification system involving antibodies specific to that protein (or sometimes to a tag that's been inserted into the protein).

Short-Read Sequencing
Next-generation sequencing machines can determine, in parallel, the sequences (or portions of the sequences) of very large numbers of DNA fragments, and the number of fragments sequenced per run has been growing by orders of magnitude. Sequencers typically differ in the number of reads per run that they return, the length of the reads that they produce (generally, more reads mean fewer bases per read), and the quality score information they provide per read (or per run). Some sequencing technologies generate base-pair reads only from one end of a fragment, while others generate reads from both ends (so-called "paired-end" reads). Brand-name manufacturers for next-generation sequencing technology include Illumina (née Solexa), 454, Applied Biosystems (SOLiD), and Helicos, among others. Most short-read sequencing results must be aligned to a genome assembly (translating read sequence into genomic coordinates) before analysis is possible.

Tiling Microarrays
"Classical" microarrays designated only a single probe, or a few probes, to measure the expression of each annotated gene or sequence feature. As microarray technology improved, however, the density of the arrays increased, allowing more probes to be placed on each array. A "tiling" microarray chooses probe locations to have equal spacing along the length of each chromosome, without regard to the locations of the underlying sequence features. Depending on the organism, the array technology, and the tiled area, the design can either tile the genome at a fixed spacing or use a looser spacing that avoids repetitive sequence. Tiling microarrays are useful for discovering unannotated or ubiquitous transcription: "between" genes, on the antisense strand of coding regions, etc. However, interpreting tiling microarrays requires an additional layer of spatial analysis before transcripts can be identified or characterized.
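
As a rough illustration of the tiling idea, the sketch below places probes at a fixed spacing along a region and skips any probe that would overlap a repetitive interval. The function name, the half-open coordinate convention, and the repeat list are assumptions; real array designs also weigh probe sequence properties that are ignored here.

```python
def tile_probe_starts(region_start, region_end, probe_length, spacing,
                      repeat_intervals=()):
    """Return probe start positions at a fixed spacing within a region.

    Coordinates are half-open. Probes overlapping any (start, end) pair in
    repeat_intervals are skipped, which leaves gaps in the tiling.
    """
    if spacing <= 0 or probe_length <= 0:
        raise ValueError("spacing and probe_length must be positive")
    starts = []
    pos = region_start
    while pos + probe_length <= region_end:
        probe_end = pos + probe_length
        in_repeat = any(pos < r_end and r_start < probe_end
                        for r_start, r_end in repeat_intervals)
        if not in_repeat:
            starts.append(pos)
        pos += spacing
    return starts


if __name__ == "__main__":
    # 25-base probes every 25 bases across a 100-base region, avoiding 30-40.
    print(tile_probe_starts(0, 100, 25, 25, repeat_intervals=[(30, 40)]))
```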

Normalization
In order to correct for common types of technical noise and adjust the values of different arrays to make them comparable, a computational "normalization" needs to be performed on nearly all microarray-based experiments. Normalization depends on the type of microarray, the sorts of corrections that are performed automatically by the array scanner itself, and whether there are built-in corrections in the hybridization scheme (most two-channel hybridizations are reported as ratios of one channel, the signal, to the second channel, which contains a non-specific sample). For sequencing, the primary normalization required is to the total number of reads sequenced by the machine, but this is still an active area of research.
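
Two of the simplest corrections mentioned above can be sketched as follows: a per-probe log ratio for two-channel arrays, and scaling of sequencing counts by the total number of reads. Both functions are illustrative assumptions and do not reproduce the method of any particular software package; the pseudocount in particular is an added convenience to avoid division by zero.

```python
import math


def log2_ratios(signal, control, pseudocount=1.0):
    """Per-probe log2(signal / control) for a two-channel hybridization.

    The pseudocount is an assumption added here to avoid dividing by zero.
    """
    return [math.log2((s + pseudocount) / (c + pseudocount))
            for s, c in zip(signal, control)]


def reads_per_million(counts, total_reads):
    """Scale per-region read counts by the total number of reads in the run."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return [count * 1_000_000 / total_reads for count in counts]
```

Scaling by total reads makes runs of different sequencing depth roughly comparable, but it does not correct for other biases, which is consistent with the note above that sequencing normalization remains an open research area.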

Peak-Calling
ChIP-chip and ChIP-seq experiments must identify "peaks" -- regions of the genome identified by enriched probes or statistically significant counts of sequence hits, and which are indicative of a protein binding event. There are a wide variety of computational peak-calling methods, from those that simply look for statistically enriched probes (on microarrays) or genomic "bins" (with ChIP-seq), to those that take into account the shape and intensity of the peak itself. Once peaks have been "called," they are often then "assigned" to genes or other sequence features. Assignment is usually based on (linear) distance along the chromosome and nearest-neighbor assignment, but there are many variations of this as well. Some peaks may be assigned to no gene at all, or to multiple genes; likewise, some genes may have multiple peaks assigned to them, or none at all.
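
The nearest-neighbor assignment step can be sketched as below. This is one simple variation, assuming each peak and each gene is reduced to a single position (for example, a peak summit and a transcription start site) and that a hypothetical max_distance cutoff decides when a peak is left unassigned.

```python
def assign_peaks_to_genes(peaks, genes, max_distance):
    """Assign each peak to the nearest gene on the same chromosome.

    peaks: {peak_id: (chrom, position)}
    genes: {gene_id: (chrom, position)}
    Returns {peak_id: gene_id or None}; None means no gene lies within
    max_distance of the peak.
    """
    assignments = {}
    for peak_id, (peak_chrom, peak_pos) in peaks.items():
        best_gene, best_dist = None, None
        for gene_id, (gene_chrom, gene_pos) in genes.items():
            if gene_chrom != peak_chrom:
                continue
            dist = abs(gene_pos - peak_pos)
            if dist <= max_distance and (best_dist is None or dist < best_dist):
                best_gene, best_dist = gene_id, dist
        assignments[peak_id] = best_gene
    return assignments


if __name__ == "__main__":
    peaks = {"p1": ("chr1", 1000), "p2": ("chr1", 1200), "p3": ("chr2", 50)}
    genes = {"gA": ("chr1", 1100), "gB": ("chr1", 9000)}
    print(assign_peaks_to_genes(peaks, genes, max_distance=500))
```

In this example one gene receives two peaks, one gene receives none, and one peak is left unassigned, which matches the range of outcomes described above. Assigning a peak to multiple genes would require returning a list per peak instead of a single identifier.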