Atoms, Bonds & Branches


 

(1) Atoms

Atoms are represented by their atomic symbols.  For example:

  C  is  carbon

  N  is  nitrogen

  S  is  sulfur

  F  is  fluorine

  O  is  oxygen

  P  is  phosphorus

  I  is iodine .....

Upper and lower case letters are significant.  All aliphatic atoms are entered in upper case, and all aromatic atoms are entered in lower case.  The possible aromatic atoms are carbon, oxygen, sulfur, nitrogen, and selenium.  Other potentially aromatic atoms are not allowed because the current estimation method cannot use them.

 

Chlorine and bromine must have the first letter entered in upper case.  The second letter of the atomic symbol can be either upper or lower case.  The "r" in bromine's symbol is usually entered in lower case.  It is suggested that the "l" in chlorine's symbol be entered in upper case ("L") because a lower case "l" is easily confused with the number one ("1").  Therefore, chlorine can be entered as either  Cl  or  CL  and bromine can be entered as either  Br  or  BR.
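The case rules above can be illustrated with a short Python sketch.  It is not the EPI Suite interpreter; the function name read_atoms is hypothetical, and the sketch assumes that only the symbols mentioned on this page are present (one-letter aliphatic atoms, Cl/CL, Br/BR, and the aromatic atoms c, n, o, s and se).  Other two-letter symbols are outside its scope and are rejected.

```python
# Hypothetical helper: lists the atoms in a SMILES string using the case
# rules described above. Not the EPI Suite interpreter.
AROMATIC_ONE_LETTER = {"c", "n", "o", "s"}
HALOGENS = {"CL": "Cl", "BR": "Br"}  # second letter may be upper or lower case


def read_atoms(smiles):
    """Return a list of (symbol, is_aromatic) pairs, in order of appearance."""
    atoms = []
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        pair = smiles[i:i + 2]
        if ch.isupper() and pair.upper() in HALOGENS:
            # Cl, CL, Br or BR: the first letter must be upper case
            atoms.append((HALOGENS[pair.upper()], False))
            i += 2
        elif pair == "se":
            # aromatic selenium
            atoms.append(("Se", True))
            i += 2
        elif ch.isupper():
            # any other upper-case letter is read as a one-letter aliphatic atom
            atoms.append((ch, False))
            i += 1
        elif ch in AROMATIC_ONE_LETTER:
            atoms.append((ch.upper(), True))
            i += 1
        elif ch.isalpha():
            raise ValueError(f"unexpected lower case letter {ch!r} at position {i}")
        else:
            # bond symbols, ring digits and parentheses are skipped here
            i += 1
    return atoms
```

With this sketch, read_atoms("CCBr") and read_atoms("CCBR") give the same three atoms, while an entry such as "Ccl" raises an error because a stray lower case "l" is not accepted.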

   

NOTE:   With very rare exceptions, hydrogen atoms are not included in a SMILES notation.   Hydrogens are NOT entered by the user UNLESS they are attached to aliphatic or aromatic nitrogen, and this is allowed ONLY for the purpose of designating explicit hydrogen attachment to nitrogens having a valence greater than +3; typical examples are various ammonium compounds.   ALL OTHER hydrogen attachments are determined solely by the program.  For more information concerning hydrogen, go to the Entering Hydrogen Directly page.

 

Not entering hydrogens greatly simplifies SMILES.  For example:

 

Compound     Mol Formula       SMILES

Methane       CH4               C

Ethane        CH3-CH3           CC

Propane       CH3-CH2-CH3       CCC

Butane        CH3-CH2-CH2-CH3   CCCC

Bromoethane   CH3-CH2-Br        CCBr

Ethanethiol   CH3-CH2-SH        CCS

Ethanol       CH3-CH2-OH        CCO

Propylamine   CH3-CH2-CH2-NH2   CCCN

 

Many more example SMILES are available at the Example SMILES page.

 

(2) Bonds

The four basic bonds in SMILES notation are single, double, triple, and aromatic bonds. Single bonds do not need to be shown and are usually omitted.  A single bond can be designated with the hyphen symbol "-".  For example, a correct SMILES notation for propane is  C-C-C ; however, there is no advantage to entering the single bond, so it is not normally used (the EPI Suite programs automatically remove any hyphens entered in a SMILES string).

 

The double bond is designated by the equal symbol "=" and is required to identify a double bond.  The following examples illustrate the double bond:

Compound   Mol Formula     SMILES

Ethylene   CH2=CH2         C=C

Propylene  CH2=CH-CH3      C=CC

2-Butene   CH3-CH=CH-CH3   CC=CC

 

The triple bond is designated by the number symbol "#" and is required to identify a triple bond.  The following examples illustrate the triple bond:

Compound       SMILES

Acetylene       C#C

Propyne         C#CC

1-Butyne        C#CCC

Acetonitrile    CC#N

Acrylonitrile   C=CC#N
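The way bond symbols and omitted hydrogens fit together can be shown with a short Python sketch.  It handles only unbranched, non-cyclic, aliphatic SMILES, and it assumes the usual valences (4 for carbon, 3 for nitrogen and phosphorus, 2 for oxygen and sulfur, 1 for the halogens).  The function name implicit_hydrogens and the valence table are illustrative assumptions, not the method used by the EPI Suite programs.

```python
# Hypothetical sketch: count the hydrogens implied on each atom of an
# unbranched, non-cyclic, aliphatic SMILES string.
# The valence table is an assumption made for this illustration.
VALENCE = {"C": 4, "N": 3, "O": 2, "S": 2, "P": 3,
           "F": 1, "Cl": 1, "Br": 1, "I": 1}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def implicit_hydrogens(smiles):
    """Return a list of (symbol, hydrogen_count) pairs for a simple chain."""
    atoms = []   # element symbols in chain order
    bonds = []   # bonds[k] is the order of the bond between atom k and atom k+1
    pending = 1  # a missing bond symbol means a single bond
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
            i += 1
            continue
        pair = smiles[i:i + 2]
        is_halogen = ch.isupper() and pair.upper() in ("CL", "BR")
        symbol = pair.capitalize() if is_halogen else ch
        if symbol not in VALENCE:
            raise ValueError(f"unsupported character {ch!r}")
        if atoms:
            bonds.append(pending)
        atoms.append(symbol)
        pending = 1
        i += len(symbol)
    result = []
    for k, symbol in enumerate(atoms):
        used = (bonds[k - 1] if k > 0 else 0) + (bonds[k] if k < len(bonds) else 0)
        result.append((symbol, VALENCE[symbol] - used))
    return result
```

For example, implicit_hydrogens("C=CC") assigns 2, 1 and 3 hydrogens to the three carbons, which matches the formula CH2=CH-CH3, and "C-C-C" gives the same result as "CCC".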

 

  The aromatic bond has no designation.  It is implied by a lower case letter for carbon, nitrogen, oxygen, selenium and sulfur.  For example, a typical SMILES notation for benzene is c1ccccc1 and a typical notation for pyridine is n1ccccc1.  The use of the numbers as ring opening and closing positions is discussed in Cyclic Structures.

 

(3)  Branches

  Branches in molecular structures are designated by enclosing them in parentheses; any structure that contains a branch requires them in its SMILES notation.  The following examples show 2-propanol, tert-butanol and isobutyric acid:

 

    OH

    |

CH3-CH-CH3

 

2-Propanol SMILES:  CC(O)C

 

 

   CH3

   |

CH3-C-CH3

   |

   OH

 

tert-Butanol SMILES:   CC(C)(O)C

 

 

CH3 OH

|   |

CH -C=O

|

CH3

 

Isobutyric acid SMILES:   CC(C)C(=O)O

 

 

As previously noted, a single structure can have more than one valid SMILES notation.  As an example, valid SMILES notations for the isobutyric acid structure shown above include the following:

 CC(C)C(=O)O

 C(C)(C)C(=O)O

 OC(=O)C(C)C

 O=C(O)C(C)C
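One way to see that these notations describe the same structure is to read each into a list of atoms and bonds and compare the results.  The Python sketch below does this for non-cyclic SMILES made of one-letter aliphatic atoms, using a stack to remember where each branch is attached.  The names parse_branched and fingerprint are hypothetical, and the fingerprint (each atom with its sorted list of neighbors) is only a simple check: identical structures always match, but a match is not a full proof of identity for larger molecules.

```python
# Hypothetical sketch: read a branched, non-cyclic SMILES string made of
# one-letter aliphatic atoms into atoms and bonds.
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_branched(smiles):
    """Return (atoms, bonds); bonds are (index_a, index_b, order) triples."""
    atoms = []
    bonds = []
    stack = []   # attachment atoms of the branches currently open
    prev = None  # index of the atom the next atom will bond to
    order = 1    # a missing bond symbol means a single bond
    for ch in smiles:
        if ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()  # return to the atom the branch hangs on
        elif ch in BOND_ORDER:
            order = BOND_ORDER[ch]
        elif ch.isupper():
            atoms.append(ch)
            index = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, index, order))
            prev = index
            order = 1
        else:
            raise ValueError(f"unsupported character {ch!r}")
    return atoms, bonds


def fingerprint(smiles):
    """Sorted list of (atom, neighbors); equal structures give equal lists."""
    atoms, bonds = parse_branched(smiles)
    neighbors = [[] for _ in atoms]
    for a, b, order in bonds:
        neighbors[a].append((atoms[b], order))
        neighbors[b].append((atoms[a], order))
    return sorted((atoms[k], tuple(sorted(neighbors[k]))) for k in range(len(atoms)))
```

Under these assumptions, fingerprint returns the same value for all four isobutyric acid notations listed above, and a different value for a different structure such as butyric acid, CCCC(=O)O.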

 

  A branch can not begin a SMILES notation. For example,  (C)CCO  is an invalid SMILES notation.  A branch must immediately follow the atom to which it is connected.  If an atom has more than one branch, the branches are coded as consecutive pairs of parentheses.  The tert-butanol structure shown above is an example. The order of the parentheses is not important; for example, tert-butanol can be either  CC(C)(O)C  or  CC(O)(C)C.

  A branch can not immediately follow a double bond symbol "=" or a triple bond symbol "#"; it must immediately follow the atom.  For example:  C=(CC)C  is invalid; if the double bond is connected to the carbon inside the parentheses, the SMILES should be  C(=CC)C; if the double bond is connected to the final carbon, the SMILES should be  C(CC)=C.

 Nested branches or "branches-within-branches" are allowed (and frequently needed).  The following example (3-isopropyl-3-tert-butyl-1-pentene) illustrates nested branches:

 

      CH3

      |

  CH3-C-CH3

      |

CH2=CH-C-CH2-CH3

      |

  CH3-CH-CH3

 

A correct SMILES for this structure is:  C=CC(C(C)(C)C)(C(C)C)CC

Dozens of different, valid SMILES notations could be coded for the structure shown above.  The notation could begin at any carbon in the structure.  For example, if the notation begins at the center-most carbon, the SMILES notation could be:   C(C=C)(CC)(C(C)C)(C(C)(C)C)

 

  The SMILES interpreter used in the EPI Suite programs does not allow two or more consecutive left-sided (starting) parentheses such as "((" to be used.  An example of a rejected notation is:  CC((CC))CC.  The reason is that two consecutive left-sided parentheses are never needed to represent any structure correctly, and their use promotes poorly coded SMILES notations.  SMILES notations are usually easiest to comprehend when they have the fewest possible branches, and unnecessary branching can complicate a notation.  For example, butane is best coded as:  CCCC   although it is also valid to code it as:  C(C(C(C))).
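The branch rules in this section can be collected into a small pre-check, sketched below in Python.  The function name branch_errors and the message texts are hypothetical, and the test for balanced parentheses is an added assumption rather than a rule stated on this page.  The sketch checks only the placement of parentheses; it does not validate the rest of the notation.

```python
# Hypothetical pre-check for the branch rules described in this section.
def branch_errors(smiles):
    """Return a list of messages for the branch rules that the string breaks."""
    errors = []
    if smiles.startswith("("):
        errors.append("a branch cannot begin a SMILES notation")
    if "((" in smiles:
        errors.append("consecutive left-sided parentheses are not allowed")
    if "=(" in smiles or "#(" in smiles:
        errors.append("a branch cannot immediately follow a bond symbol")
    # Assumption: every opening parenthesis needs a matching closing one.
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                break
    if depth != 0:
        errors.append("parentheses are not balanced")
    return errors
```

An empty list means that none of these rules is broken.  For example, branch_errors("C(=CC)C") returns an empty list, while "C=(CC)C", "(C)CCO" and "CC((CC))CC" each return one message.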