CAS Number Database
The SMILECAS Database fills in the SMILES notation and chemical name on the data entry screen when the user enters the Chemical Abstracts Service (CAS) Registry Number of a chemical.
Chemicals in the SMILECAS Database
The CAS Number database currently contains 112,000 records. Each record
has three fields (CAS Number, Chemical Name and SMILES Notation).
All records have CAS Number and SMILES Notation entries, but not all Chemical
Name fields are filled. Currently 34,300 Chemical Name fields are blank.
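The three-field record layout described above can be sketched as a small lookup
table. This is only an illustration: the names SmilecasRecord, normalize_cas,
build_index and lookup are hypothetical and are not part of SMILECAS, and the
treatment of leading zeros (so that 000050-00-0 and 50-00-0 match) is an
assumption about how CAS numbers might be compared. The benzene record is
given a blank name here purely to show the blank-name case; it does not
describe the actual database contents.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class SmilecasRecord:
    """One record: CAS number and SMILES are always present, name may be blank."""
    cas_number: str
    smiles: str
    chemical_name: Optional[str] = None  # None stands for a blank name field


def normalize_cas(cas: str) -> str:
    """Drop leading zeros from the first segment (assumed comparison rule)."""
    first, middle, check = cas.strip().split("-")
    return f"{int(first)}-{middle}-{check}"


def build_index(records: Iterable[SmilecasRecord]) -> Dict[str, SmilecasRecord]:
    """Index records by normalized CAS number for direct lookup."""
    return {normalize_cas(r.cas_number): r for r in records}


def lookup(index: Dict[str, SmilecasRecord], cas: str) -> Optional[SmilecasRecord]:
    """Return the record for a CAS number, or None if it is not in the index."""
    return index.get(normalize_cas(cas))


# Illustrative records only.
records = [
    SmilecasRecord("000050-00-0", "C=O", "Formaldehyde"),
    SmilecasRecord("71-43-2", "c1ccccc1"),  # name left blank for illustration
]
index = build_index(records)
hit = lookup(index, "50-00-0")
print(hit.smiles if hit else "not in database")
```

A lookup that returns None corresponds to a chemical that is not included in
the database; a record whose chemical_name is None corresponds to a blank
Chemical Name field.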
Origin of the data:
(1) The initial 20,000 entries were obtained from the U.S. EPA file of CAS
numbers, SMILES notations and chemical names used by the GEMS (Graphical
Exposure Modeling System) software. The entries in this file were
discrete organic compounds listed in the original U.S. EPA TSCA Inventory.
Although the original TSCA Inventory contained 62,000 chemicals, only 20,000
compounds were added to the EPA's CAS-SMILES database. Compounds such as
polymers and many inorganics were apparently excluded.
(2) Approximately 40,000 additional CAS numbers with corresponding SMILES
notations (some with blank names) were added as received from the U.S. EPA in
Duluth, Minnesota.
(3) Approximately 30,000 entries not already in SMILECAS were added as
received from the Danish Environmental Protection Agency (many had blank
name fields).
(4) Approximately 10,000 unique entries have been added through various tasks
with the Canadian regulatory agencies (Environment Canada and Health Canada).
(5) The remainder were added by Syracuse Research Corporation in conjunction
with various projects involving development of training and test sets of
chemicals for physical and chemical property estimation. Additional chemicals
were added through work on the National Library of Medicine's Hazardous
Substances Data Bank (http://toxnet.nlm.nih.gov/).
Considerations:
(1) SMILECAS currently contains 3,980 invalid CAS numbers.
These CAS numbers were "made-up" by Syracuse Research Corporation because the
real CAS number did not exist at the time the chemical was added to SMILECAS or
the real CAS number could not be found.
All made-up CAS numbers in SMILECAS are lower in sequential number than the
first real CAS number (which is 50-00-0 for formaldehyde). For example, the
first made-up number currently in SMILECAS is 000000-00-2; other examples
are 000000-01-7 and 000002-77-9. All made-up CAS numbers in SMILECAS are
retrievable either individually or in batch mode.
The vast majority of made-up CAS numbers pertain to chemicals and structures
used to develop estimation methodologies. Many were part of the KOWWIN program
development.
(2) Some CAS numbers added from the Danish EPA or Duluth EPA have been found to
correspond to salts of various compounds; many of these, though not all, have
blank name fields. The SMILES representation for some of these salts excludes
the salt designation. For example, the SMILES for a sodium salt of an acid is
actually the acid itself (the [Na] is not added).
(3) The CAS Registry contains more than 36 million organic and inorganic
compounds (CAS, 2008). Although SMILECAS contains over 108,000 actual CAS
numbers (including many common organic chemicals), it is only a small subset of
the entire CAS Registry. Various chemicals of interest may not be included in
the SMILECAS database.
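The rule in consideration (1), that every made-up CAS number sorts below
50-00-0, can be sketched as a simple comparison. The function names below are
hypothetical and are not part of SMILECAS, and reading "sequential number" as
the first two segments taken together is an assumption. The second function
applies the standard CAS check-digit rule (the last digit equals the weighted
sum of the preceding digits, modulo 10); this text does not state whether
made-up numbers satisfy that rule, although the three examples quoted above
happen to fail it.

```python
import re

# Three hyphen-separated digit groups; the last group is the single check digit.
CAS_PATTERN = re.compile(r"^(\d+)-(\d{2})-(\d)$")


def parse_cas(cas: str):
    """Split a CAS number into (first segment, middle segment, check digit) strings."""
    match = CAS_PATTERN.match(cas.strip())
    if match is None:
        raise ValueError(f"not in CAS number format: {cas!r}")
    return match.group(1), match.group(2), match.group(3)


def is_made_up(cas: str) -> bool:
    """True if the number sorts below 50-00-0, the first real CAS number."""
    first, middle, _ = parse_cas(cas)
    # Assumption: the sequential number is the first two segments combined.
    return (int(first), int(middle)) < (50, 0)


def check_digit_ok(cas: str) -> bool:
    """Apply the standard CAS check-digit rule."""
    first, middle, check = parse_cas(cas)
    digits = first + middle
    # The rightmost digit has weight 1, the next weight 2, and so on.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(check)


for cas in ("50-00-0", "000000-00-2", "000000-01-7", "000002-77-9"):
    print(cas, "made-up" if is_made_up(cas) else "real range", check_digit_ok(cas))
```

A batch of CAS numbers could be screened with is_made_up before being matched
against external sources, since made-up numbers will not be found in the CAS
Registry.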