Semantic resources project/Antibodies/Test data set

AlzForum Examples
Gwen and Elizabeth were asked to prioritize the AlzForum antibodies to work on, based on their experience with the community and the science.

From the AlzForum antibodies database (antibodies to):
 * Amyloid Precursor & Amyloid Precursor-related Proteins (388 antibodies)
 * Presenilin Proteins (598 antibodies)
 * Amyloid-beta related (457 antibodies)
 * Presenilin complex and ubiquilin (275 antibodies)
 * gamma-, beta-, and alpha-secretase (141 antibodies)
 * Tau (466 antibodies)
 * Heat shock proteins (29 antibodies) - [[media:Antibodies_to_Heat_Shock_Proteins.docx‎|Results of Paolo's analysis of which proteins they target]]
 * GSK-3 (29 antibodies)
 * Huntingtin (154 antibodies) [[media:Antibodies_to_Huntingtin.docx‎|Results of Paolo's analysis of which proteins they target]]
 * Huntington's-related (220 antibodies)
 * Synucleins (189 antibodies)
 * SOD (300 antibodies)

AntigenIDs = [ 6, 54, 5, 55, 12, 59, 115, 33, 36, 35, 4, 67 ]

Here are some of the criteria Gwen and Elizabeth used when choosing these antigen classes:
 * 1) they are of interest to several diseases: AD, HD, PD, and ALS
 * 2) they represent surface/membrane and intracellular proteins; some are enzymes or cleavage products
 * 3) they include a group of protein complexes
 * 4) many link to the work we are doing with expanding PRO
 * 5) we have a rich set of SWAN AD research statements/hypotheses linked to some of these antigen classes
 * 6) they link to PDOnline content
 * 7) they link to work Alan is involved in within the HD community

Issues
There are several issues with this scraping process so far:
 * Character encoding issues -- I'm not sure what the character set used here is, and so (in some views of this file) some of the characters, such as the 'alphas', don't render properly.
 * Spacing -- spaces may have been omitted or added in some places.
 * "Next" -- these results cover only the first page of each relevant query. There is a "Next" button on each page, which leads to another set of results for the same query. I will (eventually) spider these links and collect the entire set, but for now this doesn't include those results.
 * Commas -- I'm outputting in a CSV format, so "real" commas have been replaced with semicolons.
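
As an illustration of the comma issue, here is a minimal Python sketch of the semicolon substitution, alongside one possible alternative that keeps the original commas by quoting fields with the standard csv module. The function names and the example row are made up for illustration; this is not the actual scraper code.

```python
import csv
import io


def sanitize_field(value: str) -> str:
    """Replace literal commas with semicolons, as described above."""
    return value.replace(",", ";")


def row_to_line(fields) -> str:
    """Join sanitized fields with commas to form one line of output."""
    return ",".join(sanitize_field(f) for f in fields)


def row_to_quoted_line(fields) -> str:
    """Alternative: let the csv module quote fields that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue().rstrip("\n")


# Made-up row, not taken from the AlzForum data.
example = ["ExampleAntigen (clone1) IgG1", "WB, IHC, IP"]
print(row_to_line(example))         # commas inside fields become semicolons
print(row_to_quoted_line(example))  # commas kept, field wrapped in quotes
```

The quoting approach avoids altering the data, but it requires whatever reads the file to understand quoted CSV fields.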

Formatting
AlzForum's tables have several sections, with varying degrees of formatting.
 * Antigen (Clone) IgG -- the format of this field mirrors the name; there is an antigen name, followed by a clone name in parentheses, followed by an IgG value. The clone and IgG value are optional.
 * Name (Source) -- Name is given, source is inconsistently parenthesized; this is largely a free-form field.
 * Immunogen (Epitope) -- Values in this field seem either to start with the prefix "immunogen =" or to have no prefix; parentheses are not used consistently.
 * Host (Formulation) -- the format of this field follows the name: a host, followed by a formulation in parentheses.
 * Methods -- comma-separated list.
 * Specificity
 * Reactivity
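
The field formats above could be parsed roughly as follows. This is a sketch under several assumptions: that antigen names contain no parentheses, that the IgG value is a single token starting with "Ig", and that method lists may be separated by either commas or the semicolons that replace them in the CSV output. All names and sample values here are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class AntigenField(NamedTuple):
    antigen: str
    clone: Optional[str]
    igg: Optional[str]


# Assumed layout: antigen name, optional "(clone)", optional IgG value.
_ANTIGEN_PATTERN = re.compile(
    r"""^\s*
    (?P<antigen>[^()]+?)        # antigen name, assumed free of parentheses
    \s*
    (?:\((?P<clone>[^)]*)\))?   # optional clone name in parentheses
    \s*
    (?P<igg>Ig[A-Za-z0-9]*)?    # optional value such as IgG1
    \s*$""",
    re.VERBOSE,
)


def parse_antigen_field(value: str) -> Optional[AntigenField]:
    """Split an 'Antigen (Clone) IgG' value; return None if it does not fit."""
    match = _ANTIGEN_PATTERN.match(value)
    if match is None:
        return None
    return AntigenField(match["antigen"], match["clone"], match["igg"])


def strip_immunogen_prefix(value: str) -> str:
    """Drop a leading 'immunogen =' if present; otherwise return the text trimmed."""
    return re.sub(r"^\s*immunogen\s*=\s*", "", value, flags=re.IGNORECASE).strip()


def split_methods(value: str) -> List[str]:
    """Split a methods list on commas or semicolons, dropping empty items."""
    return [item for item in re.split(r"\s*[;,]\s*", value.strip()) if item]


# Made-up values for illustration only.
print(parse_antigen_field("ExampleAntigen (clone1) IgG1"))
print(strip_immunogen_prefix("immunogen = example peptide"))
print(split_methods("WB; IHC, IP"))
```

Because Name (Source) is largely free-form, it is left unparsed here; values that do not fit the assumed antigen pattern return None and can be reviewed by hand.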

Example Output
I did a preliminary scrape of the front page of these two queries, and output the results to a table: [[Media:Tables.txt|Example Table Formatting]]

An example of what these files look like is this:

Title: Antibodies to Amyloid Precursor & Amyloid Precursor-related Proteins