Semantic resources project/MouseModels/Nomenclature parsing project

There are two principal nomenclature problems: strain names, and allele parts (Sue Mello's term).

Strain names
Strain nomenclature guide

It's easiest for me (JAR) to think of this as a stepwise process.

We start with a strain name. There are more than 11,000 of these listed in ftp://ftp.informatics.jax.org/pub/reports/MGI_Strain.rpt. This is not an exhaustive list; more strains are named in other places, e.g. the IMSR reports, which list strains for sale.

Peel off serial number + lab codes
From any strain name we can peel off a trailing serial number and list of laboratory codes, e.g. /2ArtMmnc in

BTBRTF/2ArtMmnc

I think the serial number 2 is assigned by the Mmnc lab, not the Art lab.

Lab codes are here (not easily found on ILAR web site).

We need to be a bit careful with the grandfathered slash-containing strain names. Some seem to have serial numbers without lab codes e.g. C57BL/10.

There are some grandfathered strain names with 'serial numbers' that are lower case letters. I believe the only cases are C57BR/a, C57BR/c, and BALB/c. These are followed by lab codes.

Here's a challenging example: DDfC57BL/6J/Tbr (why isn't this DDfC57BL/6JTbr?)

Some serial numbers are parenthesized thingies:

B10.129-H47<b>/(21M)Sn
B6.C-H23<c>/(HW53)ByJ
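The peeling step can be sketched in Python. This is only a sketch under stated assumptions: a lab code is taken to be one upper-case letter followed by zero or more lower-case letters, and a serial number is taken to be either a run of digits or a parenthesized token. The name peel_lab_part is made up for illustration. Names with no recognizable lab codes (C57BL/10) or with a lower-case 'serial number' (BALB/c) are deliberately left unpeeled, since they need the special handling described above.

```python
import re

# Assumed shapes (not taken from the ILAR list itself):
#   lab code      = one upper-case letter + optional lower-case letters
#   serial number = digits, or a parenthesized token such as (21M)
LAB_PART = re.compile(
    r"(?P<stem>.*)/"
    r"(?P<serial>[0-9]+|\([^()]*\))?"
    r"(?P<labs>(?:[A-Z][a-z]*)+)"
)

def peel_lab_part(name):
    """Return (stem, serial_number, lab_codes); leave the name whole if nothing peels."""
    m = LAB_PART.fullmatch(name)
    if m is None:
        return name, None, []
    lab_codes = re.findall(r"[A-Z][a-z]*", m.group("labs"))
    return m.group("stem"), m.group("serial"), lab_codes

print(peel_lab_part("BTBRTF/2ArtMmnc"))
print(peel_lab_part("B10.129-H47<b>/(21M)Sn"))
print(peel_lab_part("DDfC57BL/6J/Tbr"))
```

Because the stem is matched greedily, only the part after the last slash is peeled, so DDfC57BL/6J/Tbr keeps 6J in its stem.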

Separate the background part from the alleles-part
(Coisogenic strains) At this point, the leftmost hyphen, if any, will separate the background strain name from "allele parts". (Need to verify that this is in fact the case.)

Pathological:

BTBR T<+> tf/tf-Fbxl3/Nwu

This doesn't make any sense. The part before the hyphen, BTBR T<+> tf/tf, doesn't look at all like a strain name, and even if you change the first space to a hyphen it's still nonsense. Compare to this one:

C3.Cg-T tf/t<0> +/MsRbrc

which parses OK, as {C3.Cg-{{T tf}/{t<0> +}}}/MsRbrc. Maybe it's a mixed inbred strain (section 4.2), like B6C3Fe a/a-Dh ?
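One way to test the leftmost-hyphen hypothesis is to split every name (with its lab part already peeled off) and list the ones whose left side does not look like a background. The Python sketch below does that. The plausibility test is a crude assumption of mine (letters, digits, '.', '(' and ')' only), and both function names are hypothetical. Grandfathered slash names such as C57BL/6J would be flagged too, so the output is a worklist to inspect, not a verdict.

```python
import re

def split_at_leftmost_hyphen(stem):
    """Split a name without its lab part into (background, alleles_part or None)."""
    background, hyphen, alleles = stem.partition("-")
    return background, (alleles if hyphen else None)

# Crude assumption: a background contains only letters, digits, '.', '(' and ')'.
PLAUSIBLE_BACKGROUND = re.compile(r"[A-Za-z0-9.()]+")

def suspicious_names(stems):
    """Return the names whose pre-hyphen part does not look like a background."""
    return [
        stem for stem in stems
        if not PLAUSIBLE_BACKGROUND.fullmatch(split_at_leftmost_hyphen(stem)[0])
    ]

stems = [
    "B6.129P2-Apoe<tm1Unc>",
    "C3.Cg-T tf/t<0> +",
    "BTBR T<+> tf/tf-Fbxl3",
]
print(suspicious_names(stems))
```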

Background part
The background (probably the wrong word!!) takes one of several forms.


 * Inbred - NAME
 * Congenic - RECIPIENT(OTHER) or RECIPIENT.DONOR or RECIPIENT.DONOR(OTHER)
 * Cross - NAMEXNAME (upper case letter 'X' being used as an operator)
 * Recombinant congenic - NAMEcNAME (lower case 'c' as an operator)

where all of the NAMEs are sequences of upper case letters and digits (where the first isn't a digit).

DONOR can be 'Cg'.

Apparently RECIPIENT can be hairy:

B6-A.Cg-Eda +/+ Ar/J

Perhaps this is pathological... reject for now.

Challenging:

B6.C-H2-K /ByBir-Gusb /BrkJ

Be careful with those hyphens. Here H2-K is a gene name, and the donor is C-H2-K /ByBir.
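The four background forms can be tried in order with regular expressions. The Python sketch below is illustrative only. It follows the NAME rule above but, as an assumption, also admits digit-initial names such as 129P2. It treats 'X' as an operator only when it is internal and not followed by a digit, which is a heuristic. The function name and the test inputs BXD and CcS are my own choices, not taken from the report. Hairy recipients like B6-A.Cg are rejected, as suggested above.

```python
import re

# NAME follows the rule above, plus (an assumption) digit-initial names such as 129P2.
NAME = r"(?:[A-Z][A-Z0-9]*|[0-9]{3,4}[A-Z0-9]*)"
DONOR = rf"(?:Cg|{NAME})"

def classify_background(text):
    """Return a tuple whose first item names the form, or None if nothing fits."""
    m = re.fullmatch(rf"({NAME})c({NAME})", text)
    if m:
        return ("recombinant congenic", m.group(1), m.group(2))
    m = re.fullmatch(rf"({NAME})(?:\.({DONOR}))?(?:\(({DONOR})\))?", text)
    if m and (m.group(2) or m.group(3)):
        # (form, RECIPIENT, DONOR or None, OTHER or None)
        return ("congenic", m.group(1), m.group(2), m.group(3))
    # Heuristic: 'X' is an operator only if internal and not followed by a digit.
    m = re.fullmatch(rf"({NAME})X(?![0-9])({NAME})", text)
    if m:
        return ("cross", m.group(1), m.group(2))
    if re.fullmatch(NAME, text):
        return ("inbred", text)
    return None

for background in ["C57BL", "B6.129P2", "B6.Cg", "B6(129P2)", "BXD", "129X1", "CcS", "B6-A.Cg"]:
    print(background, "->", classify_background(background))
```

The order of the tests matters: a name like BXD also matches the plain NAME pattern, so the cross test has to run before the inbred fallback.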

Allele parts / genetic makeup
In its fullest glory we have A A A/A A A, where each A is an 'allele part' (Sue's term). I assume the slash means heterozygous? (How is that maintained in a stock anyhow?)

Usually there's no / and we just have A A A. Usually there's only one A.

Examples (complete strain names):

B10.L-Slc11a1
AK.L-Igh-1a/CyTyJ
B10.PL/(73NS)Sn-Dst<dt-27J>/J

Igh-1 is a gene. (73NS) is a serial number. Sn is a lab code.

Examples with multiple A's:

B6.Cg-Dock7 Lepr/+ +/J
B6.Cg-Rora + +/+ Myo5a Bmp5/J
B6.Cg-Slc45a2 H2 Tg(Ins2-CD80)3B7Flv

Challenging example (I think this is an outlier that we should ignore):

B6.C3 Pde6b Hps4/+ +-Lmx1a/J

Possibly pathological, not sure what it means (little 'a' as a marker name):

B10.LP-a
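Splitting the alleles part naively on '/' and ' ' breaks as soon as one of those characters sits inside a superscript or a parenthesized group, e.g. Tg(Igh-6/Igh-V125) or Chr X.2<PWD/Ph>. The Python sketch below splits only at nesting depth zero. It assumes angle brackets and parentheses are balanced, and it glues 'Chr' back onto the token that follows it. Both function names are hypothetical.

```python
def split_top_level(text, delimiter):
    """Split text at delimiter characters that lie outside <...> and (...)."""
    pieces, current, depth = [], [], 0
    for ch in text:
        if ch in "<(":
            depth += 1
        elif ch in ">)" and depth > 0:
            depth -= 1
        if ch == delimiter and depth == 0:
            pieces.append("".join(current))
            current = []
        else:
            current.append(ch)
    pieces.append("".join(current))
    return pieces

def split_alleles_part(text):
    """Return one list of allele parts per side of the top-level '/'."""
    halves = []
    for half in split_top_level(text, "/"):
        parts = []
        for token in split_top_level(half, " "):
            if parts and parts[-1] == "Chr":
                parts[-1] = "Chr " + token  # the space after 'Chr' is not a separator
            elif token:
                parts.append(token)
        halves.append(parts)
    return halves

print(split_alleles_part("Dock7 Lepr/+ +"))
print(split_alleles_part("Rora + +/+ Myo5a Bmp5"))
print(split_alleles_part("Chr X.2<PWD/Ph>"))
```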

Allele part
Allele nomenclature guide

An allele part A has one of the following forms:
 * marker-symbol<allele-symbol>, i.e. a marker symbol with the allele symbol as a superscript. In the flat file superscripts are written with angle brackets < >.
 * marker-symbol -- ?
 * Chr NN<strain> -- note that this space character is not the same as the one separating the As.
 * Tg(...) e.g. Tg(HLA-A2.1), Tg(Igh-6/Igh-V125), Tg(SOD1*G93A)

It probably does not pay to look at allele parts too closely. Just look up the strings in one of the allele tables, as in the examples below; the table will then tell you everything you want to know about the allele, with no need to parse it. (You just have to know which table to look it up in, or maybe just look it up in all of them.)
 * Pah<enu2> is MGI:1857272 in MGI_PhenotypicAllele.rpt
 * Mdmg1 is MGI:3615645 in MGI_QTLAllele.rpt.
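The "look it up in all of them" idea amounts to building one index over every allele table. The Python sketch below shows the shape of that. The two-column layout (MGI id, then symbol) is hypothetical and stands in for the real report layouts, which have more columns and would need different column indices. The in-memory tables contain only the two entries quoted above.

```python
import csv
import io

def build_allele_index(tables):
    """Map allele string -> list of (MGI id, table name).

    `tables` maps a table name to an open tab-separated text stream.
    The (id, symbol) column layout is a hypothetical stand-in.
    """
    index = {}
    for table_name, stream in tables.items():
        reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2 or row[0].startswith("#"):
                continue  # skip blank, short, and comment-like lines
            mgi_id, symbol = row[0], row[1]
            index.setdefault(symbol, []).append((mgi_id, table_name))
    return index

tables = {
    "MGI_PhenotypicAllele.rpt": io.StringIO("MGI:1857272\tPah<enu2>\n"),
    "MGI_QTLAllele.rpt": io.StringIO("MGI:3615645\tMdmg1\n"),
}
index = build_allele_index(tables)
print(index.get("Pah<enu2>"))
print(index.get("Mdmg1"))
```

Keeping a list per symbol means a string that shows up in more than one table is reported with all of its sources.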

Marker symbols
Usually a gene name, which is an upper case letter followed by lower case letters and digits. They can contain hyphens, e.g. Igh-J, and dots, e.g. Idd5.1. Examples: Casp1, Pcdh15, Et(cre/ERT2)837Rdav (Et = enhancer trap).
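The usual gene-name shape can be written as a regular expression, which is handy for flagging the symbols that need a closer look. The pattern below is my reading of the rule above, not an official definition. It accepts an upper-case letter, then lower-case letters and digits, then optional pieces introduced by '-' or '.'.

```python
import re

# Assumed reading of the rule: upper-case letter, then lower-case letters and
# digits, with optional '-' or '.' separated pieces (e.g. Igh-J, Idd5.1).
GENE_LIKE = re.compile(r"[A-Z][a-z0-9]*(?:[-.][A-Za-z0-9]+)*")

def looks_like_gene_symbol(symbol):
    """True if the symbol has the usual gene-name shape."""
    return GENE_LIKE.fullmatch(symbol) is not None

for symbol in ["Igh-J", "Idd5.1", "Casp1", "Pcdh15", "Et(cre/ERT2)837Rdav", "sabe"]:
    print(symbol, looks_like_gene_symbol(symbol))
```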

Here are some fun marker symbols: T(14;15) and (D17Mit16-H2-D)

I don't get these strain names (maybe an allele symbol can go where a marker symbol is expected?): B6(129P2)-sabe/J and B6(V)-chtl<2J>/GrsrJ

Allele symbols
This seems to be a free-for-all. Examples: + wobl ? 1Jrt 129S6/SvEvTac ALR/LtJ Ath1-PERA/EiJ Bioz:BP1 Gt(6LSN)6028Gos Gt713Lex Tg(Dct-Tyr)1220Ove Tn(sb-lacZ,GFP)T1.88Jtak av-Tg2742Rpw dmod1-129T2/SvEmsJ tm1(CAG-lacZ,-EGFP)Glh tm1(cre/ERT2)Cle tm1(cre/Esr1*)Htak tm2.1(Ptgs1)Fun

tm = targeted mutation, Gt = gene trap

Some challenging strain names
The following strain names are so peculiar they look like potential mistakes to me.

129T1/Sv-Oca2<+> Tyr<c-ch>-Aft/J         - what is '-' before 'Aft'
B6.129P2-Apoe<tm1Unc>-T<8J>              - two '-'s
C57BL/10Sn-Del(Y)B10.BR-Y<del>/Ms/Ms     - extra '/Ms'
C57BL/6J-p23-ST1                         - is 'p23-ST1' a marker/gene/allele?
FVB/NJ-Tg(Tyr)3412ARpw/Ei-XO             - '-XO' ?
MRL/MpJ-Fas /J-ggld                      - '-ggld' ?

The following names use the space character in a manner I don't get. According to the documentation a space indicates a 'mixed inbred', but in these cases it seems to be something else.

A.B6 Tyr<+>-Cyba /J
A.B6 Tyr<+>-Pde6a /J
B6.129 Rag2<tm1Fwa> Il2rg<tm1Cgn>-Tg(TcraH-Y,TcrbH-Y)1Pas/Pas
B6.129S4 C4b<tm1Crr>/J-Cal6/GrsrJ
B6.C3 Pde6b Hps4<le>/+ +-Lmx1a<dr-8J>/J
B6.Cg H2<g7>-Tg(Ins2-CD80)3B7Flv/LwnJ
BALB/cByJ Agtpbp1<pcd-3J>-Bmp5<cfe-se6J>/GrsrJ
C.C3 Tlr4<Lps-d>/J-ru2l/GrsrJ
C3.MRL Fas -Myo7a<sh1-9J>
C3A Pde6b .O20/A-Prph2<Rd2>/J
C3FeLe.Cg a-Grm1 /GrsrJ
C57BL/6J Tyr<c-2J>-awag/GrsrJ
MEV-O.CAST-A D2Mit57/TyJ
MRL/MpJ Fas -Cal2/J
MRL/MpJ Fas -Foxq1<sa-J>/J
NOD.Cg Prkdc -Tg(CSF2)2Ygy Tg(IL3)1Ygy Tg(KITLG)3Ygy/YgyJ
SJL.Cg Thy1<a>-Noxo1 /J

It's pretty clear what the following are supposed to mean, but they are a challenge to a simple parser:

(C57BL/6JEiJ x C3Sn.BLiA-Pde6b<+>)F1     - NYI
129S-Hba-a1<tm1Led>/J                    - gene with '-'
129S-Hba-x<tm1Led>/J                     - gene with '-'
B6-A<w-J>.CBy-Eda<Ta-By>/J               - complex host
B6-A<w-J>.Cg-Eda<Ta-6J> +/+ Ar<Tfm>/J    - complex host
B6-Pax3<Sp>.Cg-N/J                       - complex host
B6.Cg-Il2rb<tm1Mak> Tg(CD2-Il2rb)A-1Ttg  - 'A-1' is part of Tg notation
B6.Cg-Il2rb<tm1Mak> Tg(CD2-Il2rb)H-1Ttg  - 'H-1' is part of Tg notation
B6.Cg-Tg(Prnp-APP)A-2Dbo/J               - 'A-2' is part of Tg notation
C.Cg-Tcrb-V<a>                           - 'Tcrb-V' is a gene
C57BL/6J-Chr X.2<PWD/Ph>/ForeJ           - chromosome 'X.2'
B6.Cg-Tg(Gt(ROSA)26Sor-EGFP)I1Able/J     - nested parens!
CBy.Cg-Tg(Gt(ROSA)26Sor-EGFP)I1Able/J    - nested parens!
FVB-Tg(Gt(ROSA)26Sor-EGFP)130910Eps/Mmmh - nested parens!
FVB/N-Tg(MMTV-PyVT*Y315F*Y322F)Db-1Mul   - 'Db-1Mul' part of Tg notation
MRL-Fas .129P2(B6)-B2m<tm1Unc>           - complex host
B6.129P2-Tcrd-V1<tm1Kjk>                 - ? gene 'Tcrd-V1'
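The nested-parenthesis cases rule out a regular expression like Tg\([^)]*\), which would stop at the first ')'. A small counting scanner is enough to find the matching close. The Python sketch below does that and then separates the contents of Tg(...) from the trailing designation such as I1Able or A-1Ttg. Both function names are hypothetical.

```python
def matching_paren(text, open_index):
    """Return the index of the ')' that closes the '(' at open_index."""
    depth = 0
    for i in range(open_index, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced parentheses in %r" % text)

def split_transgene(allele_part):
    """Split 'Tg(...)suffix' into (contents, suffix)."""
    if not allele_part.startswith("Tg("):
        raise ValueError("not a Tg(...) allele part: %r" % allele_part)
    close = matching_paren(allele_part, 2)  # index 2 is the '(' of 'Tg('
    return allele_part[3:close], allele_part[close + 1:]

print(split_transgene("Tg(Gt(ROSA)26Sor-EGFP)I1Able"))
print(split_transgene("Tg(CD2-Il2rb)A-1Ttg"))
```

Once the suffix is isolated this way, the hyphen in 'A-1Ttg' or 'Db-1Mul' no longer competes with the hyphens that separate background from alleles.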

Verify hunch that these involve complex donor strain names

B6.C-H2-K /ByBir-Gusb /BrkJ                  - allele is 'H2-K '
B6.C-H2-K /ByBir-Gusb /J                     - donor is 'C-H2-K /ByBir'
MRL.129P2(B6)-B2m<tm1Unc>/Dcr-Dab1<scm-2J>/J - donor is '129.../Dcr'
B10.A-H2<h4>/(4R)SgDvEg-Sh3pxd2b             - complex donor
B10.D1-H2 /SgJ-shmy<2J>/GrsrJ                - ? complex donor
B10.D2-H2<d>/nSnJ-Shh<Hx>                    - ? complex donor
B10.D2-H2<d>/oSn-Shh<Hx>                     - ? complex donor
B6.C-H38<c>/By-Kit<W-56J>/J                  - complex donor?
B6.CAST-Gpi1<a>.Cg-Hba<th-J>                 - complex donor?
C3A.BLiA-Pde6b<+>.O20-Prph2<Rd2>/J           - complex donor?

Here's a funny one: (space after '-')

MGI:4421519	B6.Cg- Bcl2l11<tm1.1Ast> Bmf<tm2.1Rjd>	congenic

Here's a funny one: to what does the lab code apply?

MGI:3617481	B6.Cg/NTac-Foxn1<nu>	congenic

Here's a mixin strain with what appears to be a lab code. We also see (CBy), (C3Rl)

MGI:2164797	B6.RBF(C3Fe)-Nek1 /J	congenic
MGI:2165376	CHa.SWV(C3Fe)-Mbp /J	congenic

Host can also have a lab code.

MGI:3576860	C3fBi.AK-Rb(6.15)1Ald/StmRbrc	congenic
MGI:4358731	CByJ.B6(Cg)-Rag2<tm1Cgn>/J	congenic

Grammar?
strain ::= strain2 '/' lab-part       -- substrain   3.4
         | strain2

strain2 ::= host '.' donor '-' alleles-part                 -- congenic  5.2
          | host '.' donor '(' donor ')' '-' alleles-part   -- 5.2
          | host

host ::= host2 '(' donor ')' '-' alleles-part   -- congenic  5.2
       | host2 '-' alleles-part                 -- coisogenic (single locus) 5.1
       | host2 ' ' alleles-part                 -- mixed inbred  4.2
       | host2

[Can donor have a lab-part? I think so.]

donor ::= host2
        | 'Cg'

host2 ::= strain3 ';' strain3         -- mixed inbred  4.2
        | special                     -- e.g. C57BL/6, BALB/c
        | special lab-codes           -- e.g. C57BL/10ChPr
        | smushed

smushed ::= atomic 'c' atomic         -- recombinant congenic  4.3
          | atomic 'X' atomic         -- recombinant inbred  4.1
          | atomic lab-codes          -- this is ambiguous!!
          | atomic                    -- inbred 3.2

alleles-part ::= allele-half
               | allele-half '/' allele-half     -- 5.4 seg. inbred

allele-half ::= allele-part {* ' ' allele-part}

allele-part ::= marker                                        -- e.g. 'tth' 5.1
              | marker '<' superscript-part '>'               -- 5.1
              | 'Chr ' chromosome-designator '<' strain '>'   -- 5.3 chrom. sub.
              | 'mt' '<' strain '>'                           -- 5.5 conplastic
              | '(' marker '-' marker ')'                     -- 5.2
              | 'Tg(' marker {* ',' marker} ')'               -- 4 transgene
              | 'Gt(' stuff ')' stuff                         -- 5.2
              | 'Tn(' stuff ')' stuff

lab-part ::= {optional serial-number} lab-codes

lab-codes ::= {+ lab-code}

serial-number ::= {as previously analyzed}

atomic ::= upper-case {* {upper-case | numeric}}
         | numeric-strain
         **** EXCEPT not the pre-slash part of a "special". ****
         **** ALSO no internal X unless followed by a digit. ****

special ::= 'C57BL/6' | 'C57BL/10' | 'C57BR/a' | 'C57BR/c' | 'BALB/c' | 'DBA/1' | 'DBA/2'

numeric-strain ::= simple-numeric-strain
                 | simple-numeric-strain upper-case numeric
                 | '7R75M'

simple-numeric-strain ::= numeric numeric numeric {? numeric}

Notes:

B6J is ambiguous - could be a strain that's simply called 'B6J', or it could decompose into B6 + J (i.e. B6/J).

BALB/c is not a substrain of BALB; similarly for the other 'specials'.

'Chr' and 'mt' (conplastic) are not valid markers, treat specially

chromosome-designator can be X, Y, or a number optionally followed by '.' and a number
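A chromosome-substitution allele part can be picked apart with one regular expression. The Python sketch below assumes that the optional '.' plus number suffix is allowed after X and Y as well as after a number, because 'Chr X.2<PWD/Ph>' appears among the examples. The function name is hypothetical, and 'Chr 1<A/J>' is an input I made up for illustration.

```python
import re

# Assumption: the optional '.N' suffix may follow X and Y too (cf. 'Chr X.2').
CHR_PART = re.compile(
    r"Chr (?P<chromosome>(?:X|Y|[0-9]+)(?:\.[0-9]+)?)<(?P<strain>[^<>]+)>"
)

def parse_chromosome_part(allele_part):
    """Return (chromosome_designator, strain), or None if this is not a Chr part."""
    m = CHR_PART.fullmatch(allele_part)
    return (m.group("chromosome"), m.group("strain")) if m else None

print(parse_chromosome_part("Chr X.2<PWD/Ph>"))
print(parse_chromosome_part("Chr 1<A/J>"))
print(parse_chromosome_part("Tyr<c-2J>"))
```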

'Numeric strains' include: 101, 102, 1194, 129, 129P2, 129X1, 201, 615, 7R75M. Note that 129XA parses as 129 X A !
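The numeric-strain production can be checked mechanically against this list. The Python sketch below transcribes the production into a regular expression (three or four digits, optionally followed by one upper-case letter and one digit, plus the literal '7R75M') and reports any listed name it fails to cover.

```python
import re

# Transcription of numeric-strain: 3-4 digits, optionally one upper-case
# letter and one digit, or the literal exception '7R75M'.
NUMERIC_STRAIN = re.compile(r"[0-9]{3,4}(?:[A-Z][0-9])?|7R75M")

listed = ["101", "102", "1194", "129", "129P2", "129X1", "201", "615", "7R75M"]
unmatched = [name for name in listed if not NUMERIC_STRAIN.fullmatch(name)]

print("not covered by the rule:", unmatched)
print("129XA is a numeric strain:", bool(NUMERIC_STRAIN.fullmatch("129XA")))
```

129XA fails the pattern as a whole, which is consistent with it being read as 129 X A.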

The superscripted 'stuff' never contains a space.

TBD: advanced intercross section 4.4