Atoms, Bonds & Branches
(1) Atoms
Atoms are represented by their atomic symbols. For example:
   C is carbon     N is nitrogen      S is sulfur     F is fluorine
   O is oxygen     P is phosphorus    I is iodine     .....
Upper and lower case letters are important. All aliphatic atoms are entered in upper case. All aromatic atoms are entered in lower case. The possible aromatic atoms are carbon, oxygen, sulfur, nitrogen, and selenium. Other potential aromatic atoms are not allowed because the current estimation method cannot use them.
The first letter of the symbol for chlorine or bromine must be entered in upper case. The second letter can be either upper or lower case. The "r" in bromine's symbol is usually entered in lower case. It is suggested that the "l" in chlorine's symbol be entered in upper case ("L") because a lower case "l" is easily confused with the number one ("1"). Therefore, chlorine can be entered as either Cl or CL and bromine can be entered as either Br or BR.
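If a string entered with the upper-case spellings has to be handed to a tool that accepts only the conventional Cl and Br, the two forms can be normalized first. The following is a minimal sketch; the function name normalize_halogens is hypothetical and is not part of EPI Suite.

```python
def normalize_halogens(smiles: str) -> str:
    """Rewrite the all-upper-case spellings CL and BR as Cl and Br."""
    # "L" and "R" are not atomic symbols on their own, so an upper-case
    # "CL" or "BR" in a notation can only be a two-letter halogen symbol.
    return smiles.replace("CL", "Cl").replace("BR", "Br")


print(normalize_halogens("CCBR"))    # bromoethane entered with BR
print(normalize_halogens("CLCCCL"))  # a dichloroethane entered with CL
```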
NOTE: With very rare exceptions, the hydrogen atom is not included in a SMILES notation. Hydrogens are NOT entered by the user UNLESS they are attached to aliphatic or aromatic nitrogen, and this is allowed ONLY for the purpose of designating explicit hydrogen attachment to nitrogens having a valence greater than +3; typical examples are various ammonium compounds. ALL OTHER hydrogen attachments are determined solely by the program. For more information concerning hydrogen, go to the Entering Hydrogen Directly page.
Not entering hydrogens greatly simplifies SMILES. For example:
   Compound       Mol Formula          SMILES
   Methane        CH4                  C
   Ethane         CH3-CH3              CC
   Propane        CH3-CH2-CH3          CCC
   Butane         CH3-CH2-CH2-CH3      CCCC
   Bromoethane    CH3-CH2-Br           CCBr
   Ethanethiol    CH3-CH2-SH           CCS
   Ethanol        CH3-CH2-OH           CCO
   Propylamine    CH3-CH2-CH2-NH2      CCCN
Many more example SMILES are available at the Example SMILES page.
(2) Bonds
The four basic bonds in SMILES notation are single, double, triple, and aromatic bonds. Single bonds do not need to be shown and are usually omitted. A single bond can be designated with the hyphen symbol "-". For example, a correct SMILES notation for propane is C-C-C; however, there is no advantage to entering the single bond, so it is not normally used (the EPI Suite programs automatically remove any hyphens entered in a SMILES string).
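The hyphen removal described above can be imitated in one line. This is only a sketch of the stated behavior, not the EPI Suite code, and the function name strip_single_bonds is hypothetical.

```python
def strip_single_bonds(smiles: str) -> str:
    """Drop every hyphen, since single bonds are implied between adjacent atoms."""
    return smiles.replace("-", "")


print(strip_single_bonds("C-C-C"))  # propane written with explicit single bonds
```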
The double bond is designated by the equal symbol "=" and is required to identify a double bond. The following examples illustrate the double bond:
   Compound       Mol Formula          SMILES
   Ethylene       CH2=CH2              C=C
   Propylene      CH2=CH-CH3           C=CC
   2-Butene       CH3-CH=CH-CH3        CC=CC
The triple bond is designated by the number symbol "#" and is required to identify a triple bond. The following examples illustrate the triple bond:
   Compound         SMILES
   Acetylene        C#C
   Propyne          C#CC
   Butyne           C#CCC
   Acetonitrile     CC#N
   Acrylonitrile    C=CC#N
The aromatic bond has no symbol of its own. It is implied when carbon, nitrogen, oxygen, selenium or sulfur is entered as a lower case letter. For example, a typical SMILES notation for benzene is c1ccccc1 and a typical notation for pyridine is n1ccccc1. The use of the numbers as ring opening and closing positions is discussed in Cyclic Structures.
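The case rule can be illustrated with a small tokenizer that splits a notation into atoms, bond symbols, parentheses and ring digits, and then picks out the lower case (aromatic) atoms. This is a sketch under stated assumptions: the names TOKEN, tokenize and aromatic_atoms are hypothetical, only the atoms named on this page are recognized, and aromatic selenium is left out because the text does not show how it is spelled.

```python
import re

# Two-letter halogens come first so "Cl" is not read as C followed by l.
TOKEN = re.compile(r"Cl|CL|Br|BR|[CNOSPFI]|[cnos]|[-=#()]|\d")


def tokenize(smiles):
    """Split a notation into atom, bond, parenthesis and ring-digit tokens."""
    tokens = []
    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character {smiles[pos]!r} at position {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


def aromatic_atoms(smiles):
    """Return the atoms that were entered in lower case."""
    return [tok for tok in tokenize(smiles) if tok in ("c", "n", "o", "s")]


print(tokenize("n1ccccc1"))          # pyridine
print(len(aromatic_atoms("c1ccccc1")))  # number of aromatic atoms in benzene
```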
(3) Branches
Branches in molecular structures are designated by enclosing them in parentheses. When a structure contains a branch, the SMILES notation must show that branch inside a pair of parentheses. For example, 2-propanol, tert-butanol and isobutyric acid:
       OH
       |
   CH3-CH-CH3
2-Propanol SMILES: CC(O)C
       CH3
       |
   CH3-C-CH3
       |
       OH
tert-Butanol SMILES: CC(C)(O)C
   CH3  OH
   |    |
   CH - C=O
   |
   CH3
Isobutyric acid SMILES: CC(C)C(=O)O
As previously noted, a single structure can have more than one valid SMILES notation. As an example, valid SMILES notations for the isobutyric acid structure shown above include the following:
   CC(C)C(=O)O
   C(C)(C)C(=O)O
   OC(=O)C(C)C
   O=C(O)C(C)C
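A quick sanity check that several notations describe the same compound is that they at least give the same molecular formula once the implied hydrogens are filled in. The sketch below is not the EPI Suite interpreter. It assumes acyclic, aliphatic input with no explicit hydrogens and uses only the usual lowest valences (C 4, N 3, O 2, S 2, P 3, halogens 1); the names VALENCE, BOND_ORDER, TOKEN and molecular_formula are hypothetical. Equal formulas are necessary but not sufficient for two notations to be the same structure.

```python
import re
from collections import Counter

VALENCE = {"C": 4, "N": 3, "O": 2, "S": 2, "P": 3,
           "F": 1, "Cl": 1, "Br": 1, "I": 1}       # assumed normal valences
BOND_ORDER = {"-": 1, "=": 2, "#": 3}
TOKEN = re.compile(r"Cl|Br|[CNOSPFI]|[-=#()]")


def molecular_formula(smiles):
    """Formula of an acyclic aliphatic notation, hydrogens filled in by valence."""
    atoms = []    # atomic symbol of each atom, in order of appearance
    used = []     # total bond order already used by each atom
    stack = []    # atoms to return to when a branch closes
    prev = None   # index of the atom the next atom will bond to
    order = 1     # order of the pending bond (single unless stated)
    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unsupported character {smiles[pos]!r}")
        tok = match.group()
        pos = match.end()
        if tok == "(":
            stack.append(prev)          # remember the branch point
        elif tok == ")":
            prev = stack.pop()          # go back to the branch point
        elif tok in BOND_ORDER:
            order = BOND_ORDER[tok]
        else:
            atoms.append(tok)
            used.append(0)
            cur = len(atoms) - 1
            if prev is not None:        # bond the new atom to the previous one
                used[prev] += order
                used[cur] += order
            prev = cur
            order = 1
    counts = Counter(atoms)
    counts["H"] = sum(VALENCE[a] - u for a, u in zip(atoms, used))
    # Carbon first, hydrogen second, remaining elements alphabetically.
    elements = ["C", "H"] + sorted(e for e in counts if e not in ("C", "H"))
    return "".join(f"{e}{counts[e] if counts[e] > 1 else ''}"
                   for e in elements if counts[e] > 0)


notations = ["CC(C)C(=O)O", "C(C)(C)C(=O)O", "OC(=O)C(C)C", "O=C(O)C(C)C"]
print({molecular_formula(s) for s in notations})  # a single formula is expected
```

Rings, aromatic atoms, charges and higher valences are deliberately out of scope, and malformed input is not checked.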
A branch cannot begin a SMILES notation. For example, (C)CCO is an invalid SMILES notation. A branch must immediately follow the atom to which it is connected.
If an atom has more than one branch, the branches are coded as consecutive pairs of parentheses. The tert-butanol structure shown above is an example. The order of the parentheses is not important; for example, tert-butanol can be either CC(C)(O)C or CC(O)(C)C.
A branch cannot immediately follow a double bond symbol "=" or a triple bond symbol "#"; it must immediately follow the atom. For example, C=(CC)C is invalid. If the double bond is connected to the carbon inside the parentheses, the SMILES should be C(=CC)C; if the double bond is connected to the final carbon, the SMILES should be C(CC)=C.
Nested branches or "branches-within-branches" are allowed (and frequently needed). The following example (3-isopropyl-3-tert-butyl-1-pentene) illustrates nested branches:
         CH3
         |
     CH3-C-CH3
         |
  CH2=CH-C-CH2-CH3
         |
     CH3-CH-CH3
A correct SMILES for this structure is: C=CC(C(C)C)(C(C)(C)C)CC
Dozens of different, valid SMILES notations could be coded for the structure shown above. The notation could begin at any carbon in the structure. For example, if the notation begins at the center-most carbon, the SMILES notation could be: C(C=C)(CC)(C(C)C)(C(C)(C)C)
The SMILES interpreter used in the EPI Suite programs does not allow two or more consecutive left-sided (starting) parentheses such as "((". An example would be: CC((CC))CC. The reason is that two consecutive left-sided parentheses are never needed to represent any structure correctly, and their use promotes poorly coded SMILES notations.
SMILES notations are usually easiest to comprehend when they have the fewest possible branches. Unnecessary branching can complicate a SMILES notation. For example, butane is best coded as CCCC, although it is also valid to code it as C(C(C(C))).
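The branch rules in this section lend themselves to a simple pre-check before a notation is submitted. The following is a minimal sketch that tests only the rules stated above, plus balanced parentheses. The function name check_branches and the message wording are hypothetical, and passing the check does not mean the EPI Suite interpreter will accept the string.

```python
def check_branches(smiles):
    """Return a list of branch-rule problems; an empty list means none were found."""
    problems = []
    if smiles.startswith("("):
        problems.append("a branch cannot begin the notation")
    if "((" in smiles:
        problems.append("consecutive opening parentheses are not allowed")
    if "=(" in smiles or "#(" in smiles:
        problems.append("a branch cannot immediately follow '=' or '#'")
    # Every "(" needs a later ")" and no ")" may appear before its "(".
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                break
    if depth != 0:
        problems.append("parentheses are not balanced")
    return problems


for s in ["(C)CCO", "CC((CC))CC", "C=(CC)C", "C(=CC)C", "C(C(C(C)))"]:
    print(s, check_branches(s))
```

Returning a list of messages rather than a single True/False lets one pass report every rule that a notation breaks.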