This section describes the rules for encoding the structures of carbohydrates and their derivatives in a single line. You may need this information in three cases:
Residues are described by a sequence of terms of the form <residue name>(<goes_by>-<goes_to>), where goes_by denotes the position (carbon number) through which this residue substitutes another residue (usually 1 or 2), and goes_to denotes the position of the other residue that is substituted. Both goes_by and goes_to can be given as a question mark (?) if unknown. For the reducing-end residue the expression in parentheses is not needed. For example, A(1-3)B(1-4)C is a linear fragment in which residue A substitutes position 3 of residue B through its position 1, residue B substitutes position 4 of residue C through its position 1, and residue C substitutes nothing. Here and below, Latin capitals stand for residue names.
If the structure is polymeric, the leftmost and rightmost residues should have open linkages, e.g. -2)A(1-3)B(1-4)C(1- denotes the same structure as above, with the only difference that it represents repeating units linked to each other by 1-2 linkages.
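The linear notation above can be split mechanically into residues and linkages. The following Python sketch illustrates this for unbranched structures only; the function name parse_linear and the returned tuple layout are hypothetical choices made for this illustration, and bridged linkages such as (1-P-4) are not handled.

```python
import re

LEAD = re.compile(r"^-(\d+|\?)\)")   # open left end of a repeating unit, e.g. "-2)"
TAIL = re.compile(r"\((\d+|\?)-$")   # open right end of a repeating unit, e.g. "(1-"
TERM = re.compile(r"([^()\[\]]+?)\((\d+|\?)-(\d+|\?)\)")  # name(goes_by-goes_to)

def parse_linear(code):
    """Return (residues, linkages, repeat) for an unbranched structure.

    linkages[i] is the (goes_by, goes_to) pair joining residues[i] to
    residues[i + 1]; repeat is the pair joining repeating units, or None.
    """
    repeat = None
    lead = LEAD.match(code)
    tail = TAIL.search(code)
    if lead and tail:
        # The rightmost residue of one unit substitutes the leftmost of the next.
        repeat = (tail.group(1), lead.group(1))
        code = code[lead.end():tail.start()]
    residues, linkages, pos = [], [], 0
    for m in TERM.finditer(code):
        residues.append(m.group(1))
        linkages.append((m.group(2), m.group(3)))
        pos = m.end()
    residues.append(code[pos:])  # the reducing-end residue has no parentheses
    return residues, linkages, repeat

print(parse_linear("A(1-3)B(1-4)C"))
print(parse_linear("-2)A(1-3)B(1-4)C(1-"))
```

For the polymeric example the repeat pair comes out as ("1", "2"), which corresponds to the 1-2 linkage between repeating units.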
If there are branching points, one chain is always considered the main chain and the others are side chains. To determine which chains are main and which are side chains, refer to the corresponding section below. Side chains are enclosed in square brackets together with their linkage parentheses. For example, t)A(1-3)[B(1-4)]C is a branched fragment in which residue A substitutes position 3 of residue C, while residue B substitutes position 4 of residue C. The t) prefix indicates that the leftmost residue of the main chain is unsubstituted. Several side chains attached to one residue are separated by commas. Side chains may themselves be linear or branched, and any nesting of square brackets is allowed, e.g. -4)A(1-3)[D(2-6)B(1-4),F(1-3)[G(1-4)]E(1-2)]C(1- is
 G-(1-4)+
        |
F-(1-3)-E-(1-2)+
               |
   -4)-A-(1-3)-C-(1-
               |
D-(2-6)-B-(1-4)+
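The bracket rules above lend themselves to a small recursive-descent parser. The sketch below is only an illustration under stated assumptions: the names parse_chain and edges and the dictionary-based tree are hypothetical, the terminal markers (t), -4) and (1-) must be removed beforehand, and the colon notation and bridged linkages are not supported.

```python
import re

LINK = re.compile(r"\((\d+|\?)-(\d+|\?)\)")   # e.g. "(1-3)" or "(?-4)"
NAME = re.compile(r"[^()\[\],]+")             # residue name

def parse_chain(s, i=0):
    """Parse one chain from s[i:]; return (node, link, next_index).

    node is the rightmost residue of the chain; link is its
    (goes_by, goes_to) pair, or None for the reducing-end residue.
    """
    prev = None  # (node, goes_by, goes_to) of the residue to the left
    while True:
        children = [prev] if prev else []
        if s.startswith("[", i):
            i += 1
            while True:  # comma-separated side chains
                side, link, i = parse_chain(s, i)
                children.append((side, *link))
                if not s.startswith(",", i):
                    break
                i += 1
            if not s.startswith("]", i):
                raise ValueError(f"expected ']' at index {i}")
            i += 1
        m = NAME.match(s, i)
        if not m:
            raise ValueError(f"expected residue name at index {i}")
        node = {"name": m.group(0), "children": children}
        i = m.end()
        m = LINK.match(s, i)
        if not m:
            return node, None, i
        link, i = m.groups(), m.end()
        if i == len(s) or s[i] in ",]":
            return node, link, i  # end of a side chain
        prev = (node, *link)

def edges(node):
    """Yield (substituent, goes_by, goes_to, substituted residue) tuples."""
    for child, by, to in node["children"]:
        yield (child["name"], by, to, node["name"])
        yield from edges(child)

root, _, _ = parse_chain("A(1-3)[D(2-6)B(1-4),F(1-3)[G(1-4)]E(1-2)]C")
for edge in edges(root):
    print(edge)
```

Each printed tuple corresponds to one linkage of the diagram, for example G substituting position 4 of E and E substituting position 2 of C.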
If a residue substitutes more than one position of another residue, write it twice and separate the two entries by a colon, e.g. ...(1-2)[xRPyr(2-4):xRPyr(2-6)]aDGal(1-... denotes a 4,6-pyruvylated galactose. If such a residue is at the chain terminus, use this notation: xRPyr(2-4)[:xRPyr(2-6)]aDGal(1-...
Besides question marks for linkage positions, anomeric and absolute configurations, and ring sizes, uncertainty at the level of residues and linkages can also be encoded. Two syntactic constructions are allowed: <<A(n-m)|B(p-q)>> for an exclusive combination (XOR) and <A(n-m)|B(p-q)> for an inclusive combination (OR). For example, <<A(1-3)|B(1-4)>>C is a disaccharide in which either C3 of residue C is substituted by residue A or C4 of residue C is substituted by residue B, but not both at once, while <A(1-3)|B(1-4)>C is a structure in which C3 of residue C is substituted by residue A, or C4 of residue C is substituted by residue B, or both positions are substituted by A and B, respectively. Note that the fuzzy residue enclosed in angle brackets can itself be substituted, e.g. D(1-2)<<A(1-3)|B(1-4)>>C means the structure is either D(1-2)A(1-3)C or D(1-2)B(1-4)C.
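The exclusive construction can be expanded into the list of concrete structures it stands for. The following Python sketch shows one possible way to do this; the function name expand_xor is hypothetical, the inclusive <...> form is deliberately left out, and treating several << >> constructs in one string as independent choices is an assumption of this sketch rather than a documented rule.

```python
import re
from itertools import product

XOR = re.compile(r"<<(.*?)>>")  # text between << and >>

def expand_xor(code):
    """Return every concrete structure encoded by <<...|...>> constructs."""
    # re.split with one capturing group puts the bracketed text at odd indices.
    parts = XOR.split(code)
    choices = [p.split("|") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in product(*choices)]

print(expand_xor("D(1-2)<<A(1-3)|B(1-4)>>C"))
# → ['D(1-2)A(1-3)C', 'D(1-2)B(1-4)C']
```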
All monovalent substituents (Ac, Me, Et, Fo and other residues that cannot themselves be substituted) should be described as separate residues, e.g. aDGal(1-3)bDGlcNAc should be recorded as aDGal(1-3)[Ac(1-2)]bDGlcN. If a monovalent residue is an aglycon at the reducing end, write it as follows: aDGlc(1-Me. During a substructure search (but not during data upload or automated exchange) you can specify monovalent residues in the usual way, e.g. bDQuiNAc3NAc4Ac.
Phosphates and sulphates should be included in the linkage parentheses like this: aDGlc(1-P-4)bLFuc (in the chain), P-4)bLFuc (at the non-reducing end) or aDGlc(1-P (at the reducing end).
To keep this encoding unambiguous, take care which chains are encoded as main chains and which as side chains.
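A linkage pattern that also accepts a bridging group inside the parentheses could look like the Python sketch below. It is only an illustration: the function name linkages is hypothetical, the symbol S for sulphate is an assumption (only P appears in the examples above), and the terminal forms P-4) and (1-P are not covered.

```python
import re

# goes_by, optional bridging group, goes_to.
# "P" is taken from the examples; "S" for sulphate is an assumption.
LINK = re.compile(r"\((\d+|\?)-(?:(P|S)-)?(\d+|\?)\)")

def linkages(code):
    """Return (goes_by, bridge, goes_to) for every in-chain linkage."""
    return [m.groups() for m in LINK.finditer(code)]

print(linkages("aDGlc(1-P-4)bLFuc"))   # → [('1', 'P', '4')]
print(linkages("aDGal(1-3)bDGlcN"))    # → [('1', None, '3')]
```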
Each residue name is composed of several fields following each other without any separators:
Examples: aD6dTalA, aXKdo, xLGro, ?DManN-ol, Ac, xRPyr, bDFucN3N.
The naming system for lipid residues matches the general naming system described above. l should be used for the anomeric configuration. For most lipids there are reserved names such as Pam, Ole, Vac, etc. (the complete list is available in the aliphatic acid section of the complete residue list). If there is no reserved base name, the following rules may be used to construct a new base name:
If it is unknown exactly which residue occupies a certain location in the structure, a superclass name can be used instead of a residue name. Superclasses do not require anomeric and absolute configurations or ring size. The following superclasses are supported: