Semantic resources project/PRO/Submission File Formats

Data Submission Formats to PRO

= Term Submission =

Most of the basic work for the submission of new terms to PRO is based on the RACE PRO web form.

One of the things that we need to work out is adding an identifier to each element of the submission itself -- this will help us track which submissions ultimately get which PRO terms, and allow us to develop our own consistent "provisional term" system in the meantime.
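As a rough sketch of what such a "provisional term" system might look like, the Python below hands out sequential identifiers and later records which PRO term each one received. Everything here is an assumption for illustration: the `ProvisionalRegistry` class, the `PROV` prefix, the six-digit width, and the `PR:placeholder` value are hypothetical and not part of any agreed format.

```python
from itertools import count
from typing import Dict, Optional


class ProvisionalRegistry:
    """Hands out provisional IDs and remembers the PRO term each one ends up with."""

    def __init__(self, prefix: str = "PROV", width: int = 6) -> None:
        # The prefix and zero-padding width are placeholders, not an agreed scheme.
        self._prefix = prefix
        self._width = width
        self._counter = count(1)
        self._assigned: Dict[str, Optional[str]] = {}

    def register(self) -> str:
        """Create the next provisional ID for one submitted element."""
        prov_id = f"{self._prefix}:{next(self._counter):0{self._width}d}"
        self._assigned[prov_id] = None  # no PRO term assigned yet
        return prov_id

    def resolve(self, prov_id: str, pro_id: str) -> None:
        """Record the PRO term that a provisional ID was ultimately given."""
        if prov_id not in self._assigned:
            raise KeyError(prov_id)
        self._assigned[prov_id] = pro_id

    def lookup(self, prov_id: str) -> Optional[str]:
        """Return the assigned PRO term, or None while the request is pending."""
        return self._assigned[prov_id]


registry = ProvisionalRegistry()
first = registry.register()
registry.resolve(first, "PR:placeholder")  # placeholder, not a real PRO ID
```

Keeping the mapping in one place means a submission can be cited by its provisional ID until PRO assigns a term.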

RACE PRO Fields
For the submission:


 * 1) Annotator Name
 * 2) Annotator Email
 * 3) Annotator Institution

For each submitted entry:


 * 1) UniProtKB Identifier OR amino acid sequence
 * 2) Organism [specified with an NCBI taxon identifier]
 * 3) Sequence Region ("Full length" or "from-to")
 * 4) PTM : amino acid number + type


 * 1) Protein Object Name(s) (separated by ;)
 * 2) Evidence Source [DB Name & ID]+
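The Sequence Region field above takes either "Full length" or a "from-to" range. A minimal sketch of how a submission reader might interpret it follows. The function name `parse_sequence_region`, the case-insensitive match, and the 1-based range check are our assumptions, not part of the RACE PRO form.

```python
import re
from typing import Optional, Tuple

_RANGE = re.compile(r"(\d+)\s*-\s*(\d+)")


def parse_sequence_region(value: str) -> Optional[Tuple[int, int]]:
    """Return None for "Full length", or a (start, end) pair for "from-to"."""
    text = value.strip()
    if text.lower() == "full length":
        return None
    match = _RANGE.fullmatch(text)
    if match is None:
        raise ValueError(f"expected 'Full length' or 'from-to', got {value!r}")
    start, end = int(match.group(1)), int(match.group(2))
    # Assumption: positions are 1-based and the range runs forward.
    if start < 1 or end < start:
        raise ValueError(f"invalid range {start}-{end}")
    return start, end


print(parse_sequence_region("Full length"))
print(parse_sequence_region("20-135"))
```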

And then there's an Annotation Section:


 * Domain
 * Functional Annotation
 * Sequence Ontology
 * Disease

Perhaps these are best moved down below, into the "Annotation" submission format.

Finally, there's a "Comments" field.

Cecilia's Proposed Submission Format


DB_ID: the accession number for the protein. It can also identify an isoform (Acc-#) or even a variant (VAR_xxxxx).

Protein_region: indicates the protein region if the object is a cleaved form or fragment. The numbering is always relative to the sequence in the DB_ID field.

Modified_residue: indicates the residue(s) that have undergone a post-translational modification, and the type of modification. In the distribution file we use the MOD ID to refer to the modification, but you can use the vocabulary used in RACE-PRO, e.g. "Lys-19, Acetylation" (we can convert this automatically to "Lys-19, MOD:xxx"). The formatting for Modified_residue follows this convention:

Format: three-letter residue code-#, MOD_ID. The residue number refers to the sequence displayed in DB_ID. If more than one residue carries the same modification, separate the residues using / (see smad2 example above). If there is more than one type of modification, separate them with pipes | (see smad2 example in table above). If you know the type of modification but not the exact residue, use a question mark instead of a residue (see last example in table above).
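To check our reading of the Modified_residue convention, here is a minimal parsing sketch. The names `Modification` and `parse_modified_residue` are ours, the residue numbers in the usage line are made-up values, and "MOD:xxx" is the placeholder used above rather than a real MOD ID. The sketch accepts either a MOD ID or a RACE-PRO vocabulary word as the modification label and does not attempt the conversion between them.

```python
import re
from typing import List, NamedTuple, Optional


class Modification(NamedTuple):
    residue: Optional[str]   # three-letter code, or None when the residue is unknown
    position: Optional[int]  # number relative to the DB_ID sequence, or None
    mod: str                 # MOD ID or RACE-PRO vocabulary term, kept as written


_RESIDUE = re.compile(r"([A-Za-z]{3})-(\d+)")


def parse_modified_residue(field: str) -> List[Modification]:
    """Expand a Modified_residue value into one Modification per residue."""
    result: List[Modification] = []
    for group in field.split("|"):            # pipes separate modification types
        residues, sep, mod = group.partition(",")
        mod = mod.strip()
        if not sep or not mod:
            raise ValueError(f"missing modification in {group!r}")
        for token in residues.split("/"):     # slashes separate residues sharing a type
            token = token.strip()
            if token == "?":                  # type known, exact residue unknown
                result.append(Modification(None, None, mod))
                continue
            match = _RESIDUE.fullmatch(token)
            if match is None:
                raise ValueError(f"bad residue {token!r}")
            result.append(Modification(match.group(1), int(match.group(2)), mod))
    return result


for item in parse_modified_residue("Ser-10/Ser-12, MOD:xxx|Lys-19, Acetylation"):
    print(item)
```

Expanding each slash-separated residue into its own record keeps later steps simple, at the cost of losing the original grouping.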

Name: Name as displayed in the source. This field could be empty.

Source: PubMed ID or database source for the evidence.

Assigned_by: provides source attribution for submitters. We usually use the source and the curator initials; in my case it is PRO:CNA. In your case it would be SWAN: followed by initials that still need to be determined (if this is not suitable in your case, we can just leave it as SWAN).
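A small sketch of how Assigned_by could be read, covering both the "source:initials" form and the bare source form. The helper name `parse_assigned_by` is hypothetical. Treating a trailing colon with no initials as a bare source is our assumption.

```python
from typing import Optional, Tuple


def parse_assigned_by(value: str) -> Tuple[str, Optional[str]]:
    """Split an Assigned_by value into (source, curator initials or None)."""
    source, _, initials = value.strip().partition(":")
    source = source.strip()
    if not source:
        raise ValueError(f"missing source in {value!r}")
    initials = initials.strip()
    return source, (initials or None)


print(parse_assigned_by("PRO:CNA"))
print(parse_assigned_by("SWAN"))
```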

Term Submission Format
What we need to represent for a term submission (minimal information only):


 * UniProt ID or PIRSF for family (what about isoforms? Should we put the parent UniProt?)
 * Comment such as "category=gene/sequence/modification"
 * is_a relationship (subclass)
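The three minimal items above could be held in a record like the one sketched below. The class name `TermRequest`, its field names, and the example values ("P12345", "PR:placeholder") are hypothetical. The sketch does not settle the open question about isoforms; it only checks that the category is one of the three values named in the comment example.

```python
from dataclasses import dataclass

# Categories taken from the comment example "category=gene/sequence/modification".
CATEGORIES = {"gene", "sequence", "modification"}


@dataclass(frozen=True)
class TermRequest:
    source_id: str  # UniProt ID, or PIRSF ID for a family
    category: str   # one of CATEGORIES
    is_a: str       # parent term for the is_a (subclass) relationship

    def __post_init__(self) -> None:
        if not self.source_id.strip():
            raise ValueError("source_id is required")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category {self.category!r}")
        if not self.is_a.strip():
            raise ValueError("is_a parent is required")

    def comment(self) -> str:
        """Render the category in the comment form used above."""
        return f"category={self.category}"


request = TermRequest("P12345", "modification", "PR:placeholder")
print(request.comment())
```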

= Annotation Submission =

In the format of the PAF.txt file.

Also, there are guidelines for the PAF.txt file.

There are two ways this can be used -- either to submit new annotations for an existing PRO term, or to submit annotations for a Requested Term (presumably submitted simultaneously). Do these different uses require separate fields or formats?

PAF Fields
This table describes our reading of the PAF file format, given the example available from the PRO website and the published PAF File Guidelines document. (In fact, the content of this table is largely copied and edited from that document.)

Questions

 * 1) Is the "date" field the date of submission, or of entry?

Annotation Submission Fields
This table outlines our own sketch of the required fields in a text submission file to add new "annotations" to the PRO ontology.