STN Messenger Command-Language Searching: Basic Concepts


I590 Lecture Notes
Updated: 5 February 2006

0. Introduction

STN commands: You need to know basic STN commands in order to do the searches in I590. Examples of those can be found at:

http://www.indiana.edu/~cheminfo/cciim03.html

Use the files on that page from "Tips for Searching the CA & Registry Files" (CAS8721-0894). Note especially the "Display Scan" option to view modified answers at no charge.

REMEMBER THAT THE STN SYSTEM WE ARE USING COSTS REAL MONEY TO SEARCH, AND THE COST IS NOT COVERED BY YOUR STUDENT TECHNOLOGY FEE.

You can build your structures anytime, but the search must be performed when the STN computers allow the Academic Program passwords to access the LREG and LCA files:

http://www.indiana.edu/~cheminfo/31-35.html

I. Logging On and Logging Off

One way to gain access to the STN International system is via telephone lines and a modem. Another way to access the STN system is via the Telnet program on the Internet, using the address STNC.CAS.ORG or STN.FIZ-KARLSRUHE.DE. If using STN Express with Discover! or STN on the Web, the connection to STN is performed by the software.

All commercial systems that charge for online searching of their databases require a loginid and a password. For the STN Academic Program via the Internet, the logon sequence via modem or Telnet would be as below. User input is indicated in bold. "(CR)" means hit the "Enter" key.

Logging onto STN's CAS ONLINE Academic Program via Telnet:

telnet stnc.cas.org (CR)
(CR)
Welcome to STN International! Enter x: i (CR) (1)
LOGINID: dummyid (CR)
PASSWORD: ######### (Enter the password and a CR) (2)
TERMINAL (ENTER 1, 2, 3, OR ?): 3 (CR) (3)
* * * * * * * * * Welcome to STN International * * * * * * * * * *

[News messages appear here.]

=> file lreg (CR)

[Searching occurs here.]

=> log y (CR)


Comments:

(1) The "i" indicates that we are entering with a restricted access Academic Program account, accessible after 5:00 PM on weekdays and certain weekend hours. Users with full access enter "x" at this point.

(2) The LOGINID will appear on the screen, but the password (we hope!) is masked by the #########.

(3) Terminal choices are:

Once in the STN system, the prompt is: =>

II. Basic STN Commands

The STN Messenger search software assumes that you are a novice searcher if you spell out the entire command words. Some commands have single letter equivalents which, if used, signal Messenger that you do not want to be prompted for any information the system needs to complete your search. In this case, it will DEFAULT to system-defined parameters--what the computer assumes you want to do in the absence of explicit information to the contrary.

The five basic STN commands, with single letter equivalents in parentheses where appropriate, are:

"Basic STN Commands" gives fuller information.

III. Fields in Records

The STN database summary sheets have examples of the RECORDS in the corresponding databases. Limiting a search to a specific part of the record (a FIELD) is done on STN using a two-letter code preceded by a forward slash. When the search term or phrase is entered, the field code is appended as below:

=> S PARMENTER C?/AU (CR)

=> S ISATIN/CN (CR)

See "Ways to Narrow Your Answer Set in the CA File" for CA File examples of the use of the language field, the document type field, and the publication year field.

IV. The Concept of the Basic Index

What if no field code is used in the search statement? Messenger assumes by default that you want the search to run in the BASIC INDEX. The fields included in the Basic Index vary from database to database.

For the CA File, the Basic Index includes:

For the Registry File, the Basic Index includes:

V. Proximity Operators.

There are more specific variants of the AND command that can be used to define the spatial relationships of search terms. These are called POSITIONAL or PROXIMITY OPERATORS. On STN, they are:

STN assumes that multi-word phrases are to be searched using the (W) operator in the absence of explicit positional or other Boolean operators.
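The default (W) behavior for multi-word phrases can be pictured with a short Python sketch. This is only a toy model of "adjacent and in the order given"; the function name is made up for illustration, punctuation handling is ignored, and nothing here reflects how Messenger is actually implemented.

```python
def matches_w(text, phrase):
    """True if the words of `phrase` occur adjacent and in order in `text`."""
    words = text.lower().split()
    terms = phrase.lower().split()
    n = len(terms)
    # Slide a window of n words across the text and compare it to the phrase.
    return any(words[i:i + n] == terms for i in range(len(words) - n + 1))


print(matches_w("Novel cholinergic agonists in rats", "cholinergic agonists"))
print(matches_w("agonists of cholinergic receptors", "cholinergic agonists"))
```

The first call finds the two words side by side in the stated order; the second does not, even though both words are present, which is the difference between (W) and a plain AND.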

VI. Truncation (Masking) of Characters to Expand a Search

TRUNCATION is the search technique that allows the searching of more than one form of a word with a single command. In many cases where subject searches are concerned, we are looking for topics that involve words built on a common root word, or that have some other variations that are easily signaled to a computer by means of a special symbol. Truncation tells the computer to form an answer set consisting not only of all records that contain words with the characters input for the search, but also of records that contain related words with suffixes (or, in some cases, prefixes or variable characters at a given point in the word).

Truncation can occur at the left end or the right end of a word stem or within the word. STN now allows all three types of truncation in the CA File Basic Index. The limit of terms that can be gathered in a set by truncation is 30,000 stems. For left truncation the search term must have at least four characters.

On the STN system, truncation symbols are:

Symbol                  Function                   Example
exclamation point (!)   Exactly one character      cataly!e
hash mark (#)           One or no character        alcohol#
question mark (?)       Any number of characters   ?therap?

As noted in the table, the # sign can be used at the end of a word to pick up both singular and plural forms of a word. Another way of accomplishing the same thing on STN using the command language option is to enter SET PLURALS ON at the system prompt. Both left- and right-hand truncations are allowed with the "?". See:

for other examples of truncation in STN.

There are limits to the number of terms that can be gathered into a set using truncation. Therefore, caution must be exercised in using truncation to prevent too many search terms (or unexpected words) from entering the answer set.
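One way to get a feel for what the three truncation symbols in the table above will pull in is to translate them into regular expressions and try them against a word list. The Python sketch below does that. The function names and the sample vocabulary are invented for illustration, and the mapping to "word characters" is an assumption; STN's own matching rules may differ in detail.

```python
import re

# Assumed regex equivalents of the three STN truncation symbols.
_SYMBOLS = {"!": r"\w", "#": r"\w?", "?": r"\w*"}


def truncation_to_regex(term):
    """Translate an STN-style truncated term into a compiled regex."""
    parts = [_SYMBOLS.get(ch, re.escape(ch)) for ch in term]
    return re.compile("".join(parts), re.IGNORECASE)


def expand_truncation(term, vocabulary):
    """Return the words in `vocabulary` that the truncated term would match."""
    pattern = truncation_to_regex(term)
    return [word for word in vocabulary if pattern.fullmatch(word)]


words = ["catalyse", "catalyze", "catalyst", "alcohol", "alcohols",
         "alcoholic", "chemotherapy", "therapeutic"]
print(expand_truncation("cataly!e", words))  # ['catalyse', 'catalyze']
print(expand_truncation("alcohol#", words))  # ['alcohol', 'alcohols']
print(expand_truncation("?therap?", words))  # ['chemotherapy', 'therapeutic']
```

Running a candidate term against a list like this is a cheap way to spot unexpected words before they enter a real answer set.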

VII. Expanding (Neighboring)

We have already seen the expand technique profitably used in author searching. It is also a very useful option in subject searching, especially since it allows us to determine whether the search term we are considering is actually used in the system. In addition, keyboarding errors that have gone undetected may be revealed in an expand list. For example, in the STN CA file, the following list appeared when "organomagnesium" was expanded in the Basic Index at the time of the search:

 Set #  # of Answers  Variant Spelling

 E1    1  ORGANOMAGNESIATE/BI
 E2    1  ORGANOMAGNESIATES/BI
 E3  823  ORGANOMAGNESIUM/BI
 E4    1  ORGANOMAGNESIUMALUMINUM/BI
 E5    2  ORGANOMAGNESIUMOXANE/BI
 E6   59  ORGANOMAGNESIUMS/BI
 E7    1  ORGANOMAGNETIC/BI
 E8    1  ORGANOMAGNETISM/BI
 E9    1  ORGANOMAGNSIUM/BI
 E10   1  ORGANOMANGANATE/BI
 E11   1  ORGANOMANGANATES/BI
 E12  74  ORGANOMANGANESE/BI
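The neighboring behind an expand list like the one above can be imitated with a sorted term list and a binary search. The Python sketch below is illustrative only: the function name and its defaults are assumptions (the listing above suggests the expanded term lands at E3 in a twelve-line display), and it says nothing about how STN stores its indexes.

```python
from bisect import bisect_left


def expand(index_terms, term, before=2, total=12):
    """Return up to `total` alphabetical neighbors, starting `before` entries ahead of `term`."""
    terms = sorted(index_terms)
    pos = bisect_left(terms, term.upper())  # where the term sits, or would sit
    start = max(0, pos - before)
    window = terms[start:start + total]
    return [(f"E{n}", t) for n, t in enumerate(window, start=1)]


index = ["ORGANOMAGNESIATE", "ORGANOMAGNESIATES", "ORGANOMAGNESIUM",
         "ORGANOMAGNESIUMALUMINUM", "ORGANOMAGNESIUMOXANE", "ORGANOMAGNESIUMS",
         "ORGANOMAGNETIC", "ORGANOMAGNETISM", "ORGANOMAGNSIUM",
         "ORGANOMANGANATE", "ORGANOMANGANATES", "ORGANOMANGANESE"]
for e_number, spelling in expand(index, "organomagnesium"):
    print(e_number, spelling)
```

Because the list is strictly alphabetical, a misspelling such as ORGANOMAGNSIUM shows up a few lines away from the correct term rather than next to it.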

Note that E9, the one document in the file with the misspelled term "organomagnsium," would probably be missed in a subject search if not spotted in the expand list, so the search statement to pull all of the variants into one set in the CA file would be:

For the online CA File on STN, the preferred terms are searched with the field labels "CT" for phrases or "CW" for words. Thus, a search for parasympathomimetics would find in the printed CA Index Guide that the preferred phrase to search is Cholinergic agonists. The online CA file search using command language would then be:

=> S CHOLINERGIC AGONISTS/CT

In an online search, it is important to include the CAS standard abbreviations and acronyms since the abbreviations are used in preference to the full terms in the online records, hence, in the Basic Index of the CA File.

One can always issue the DISPLAY IND command to see how a particularly relevant document has been indexed and then input relevant indexing terms to broaden or narrow a search. Look especially for abbreviations such as DETN or DEGRDN. These are used in preference to the full terms such as determination, degradation, etc. in indexing CA. See CAS Standard Abbreviations and Acronyms. On the STN system, it is now possible to use the command SET ABBREVIATION ON to automatically check if there are CAS abbreviations used for the search terms you input. If so, the system automatically searches those forms. If SET PLURAL ON is also in use, the plural forms of the abbreviations will also be found. Users of SciFinder and SciFinder Scholar do not need to worry about such subtleties because the search algorithm automatically makes allowances for such variants.

Look at a sample record from the CA Student Edition on OCLC, paying particular attention to the index terms and the use of abbreviations.

VIII. CAS Roles in the CA and other Files

ROLES are CAS indexing terms assigned to every indexed substance and to controlled index terms for classes of compounds. Roles began to be applied to new online CA File records with v. 121 (July 1994). They were then applied retrospectively to all CA File records by means of a computer algorithm. Originally there were 38 specific roles and 7 broad super roles. They substantially expand the indexing terms that were used prior to their introduction. The role terms give a more precise link to the substance. For example, it is now possible to specify not only that you want the preparation of the substance, but also that the preparation be a synthetic preparation, as opposed to industrial manufacture. In the past, there was no distinction made in the use of the term "Preparation" in such cases. Nevertheless, it is still possible to search in the CA File for all manner of preparations of a substance or a group of substances found in the Registry File by appending a "/P" to the answer set number from the Registry File (or for a single substance, by appending a "P" directly to the Registry Number in a CA File search), e.g.,

=> SEARCH L2/P (where L2 is an answer set from the Registry File)

or

=> SEARCH 494-12-2P (where 494-12-2 is the CAS Registry Number for Flavan)

Roles must be attached to an L# answer set formed in the Registry File if used in conjunction with that L# to search the CA File. An example of the use of the role code "SPN" (Synthetic Preparation) is:

=> FILE REGISTRY

=> S FULLERENE/CNS

L2 3287 FULLERENE/CNS

CNS is the chemical name segment field designator on STN.

=> FILE CA

=> S L2/SPN OR FULLERENES/SPN
          5347 L2
         35422 SPN/RL
           206 L2/SPN (L2 (L) SPN/RL)
          1759 FULLERENES/CT
         35422 SPN/RL
           108 FULLERENES/SPN (FULLERENES/CT (L) SPN/RL)
L3         248 L2/SPN OR FULLERENES/SPN

The Roles can be viewed in an online thesaurus to see the role hierarchies and definitions. They are currently used in the CA and CAplus files and in the CASREACT and MARPAT files.

To ensure that the CAS Role Indicators are in agreement with the current focus and direction of chemistry, the following key changes to new and modified Role Indicators were made in late 2001. New Roles have been added:

Ambiguous Biological Roles have been discontinued. Most will fall into the Biological Study, Unclassified Role. Other Roles have been divided to allow for more precision, e.g., Reactant Roles and Reagent Roles. The Nonbiological Use, Unclassified Role definition has been clarified as Other Use, Unclassified Role.

IX. Searching the Registry File with a Chemical Name

The Registry File is the largest single source of chemical names in existence. It can be searched by a trade or common name for a substance (CN), by its CAS Index Name (CN) or by fragments of the CAS Index Name (CNS field). (See: Tips for Chemical Name Searching.) The Basic Index of the Registry File includes both chemical name fragments and molecular formula fragments. It may be necessary to follow certain protocols for special characters in order to search for a chemical name. Greek characters, for example, are spelled out in their entirety with a period before and after the Greek part of the name. Examples of chemical name searches in the Complete Chemical Name Index (/CN) or the Chemical Name Segment Index (/CNS) of the Registry File are:

=> SEARCH ISATIN/CN

=> SEARCH .ALPHA.-METHYLBENZOIN/CN

=> SEARCH ACETYLSALICYLIC ACID/CN

=> SEARCH IMINO/CNS

Since there is a fee to search terms in the Registry File, it is best to check the name by first expanding it in the relevant index. Often, the combination of a molecular formula search and a Chemical Name Segment search is an effective way to retrieve a substance when the molecular formula alone has many isomers.

An example of such a chemical name search in SciFinder Scholar is below. Note that in the SciFinder Scholar system, the search will work with or without the periods around the "alpha," but in STN command-language searching, the dots are mandatory.

alpha-Methylbenzoin Name Search

X. Section Codes for Online Searches

Since the information in Chemical Abstracts is classified into 80 major subject sections, the section numbers and codes can actually be used on STN with the CA Classification "CC" field in subject searches to assist in limiting a search. For example, works dealing primarily with enzymes are found in section 7 of the weekly Chemical Abstracts. Other documents are assigned to one of the 80 subject categories divided into the following gross categories:

Section Name                                  Section Code   Section Numbers
Biochemistry                                  BIO/CC         1-20
Organic Chemistry                             ORG/CC         21-34
Macromolecular Chemistry                      MAC/CC         35-46
Applied Chemistry & Chemical Engineering      APP/CC         47-64
Physical, Inorganic, & Analytical Chemistry   PIA/CC         65-80

Thus, a strategy that included in an online search on STN:

or

would have the effect of limiting the retrieved documents in answer set L4 to those dealing with enzymes (found in section 7 of the printed CA) or, more broadly, those of a biochemical nature found anywhere in sections 1-20 of the printed product.
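The relationship between the 80 section numbers and the five gross section codes in the table above is a simple range lookup. A minimal Python sketch follows; the function name is hypothetical and the code only restates the table, it does not generate or validate STN search statements.

```python
# Section ranges taken from the table of gross CA section categories.
SECTION_GROUPS = [
    ("BIO", range(1, 21)),
    ("ORG", range(21, 35)),
    ("MAC", range(35, 47)),
    ("APP", range(47, 65)),
    ("PIA", range(65, 81)),
]


def section_code(section_number):
    """Return the gross section code (with the /CC field label) for a CA section number."""
    for code, numbers in SECTION_GROUPS:
        if section_number in numbers:
            return f"{code}/CC"
    raise ValueError(f"CA section numbers run from 1 to 80, got {section_number}")


print(section_code(7))   # BIO/CC
print(section_code(35))  # MAC/CC
```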

XI. Introduction to Structure Searching on STN

STRUCTURE SEARCHING allows a search to be run using the chemical structure as input. The searches are generally run against online chemical dictionary files, such as STN's Registry File. Depending on the type of structure search allowed by the system, the complete molecule or any compound containing the structure of the molecule will be retrieved as an answer set. The retrieved structures may include salts, isotopically labeled substances, mixtures, and structures in which the drawn structure is contained as a subset of a larger structure.

Unlimited substitution of the input molecule may be allowed at free sites on the molecule (a FULL SUBSTRUCTURE SEARCH) or substitution may be limited to certain sites (a CLOSED SUBSTRUCTURE SEARCH). On the STN system, once an answer set is formed in the Registry File, it can be crossed over to the CA or other files to conduct further subject searches of the compounds thus isolated in a structure search. In these cases, it is actually the CAS Registry Number for the compounds that is being searched in the crossover files. Note that it is now possible to conduct a search that takes into account the stereochemistry of the chiral centers and double bonds. Stereo searching can be performed in the Registry File and the Beilstein File on STN or on the Beilstein CrossFire system. Finally, MARKUSH STRUCTURE SEARCHING, an important technique in patent searches that allows for considerable variability in the structures retrieved, is another option in some files.

XII. Why Use Structure Searching?

There are many reasons to do a substructure search, among them:

In combination with other types of searches, structure searching is a very powerful complement.

Over 30,000,000 registered small molecule substances appear in the Chemical Abstracts Service Registry File. All of those have been registered since 1965, but, of course, not all of the compounds in the Registry File were discovered since that date. In fact, there are many compounds in the Registry File that have no new information on them in the CA or CAplus Files (that is, in the literature from 1967 onward). However, most of the millions of compounds in the Registry File have their Registry Numbers linked to databases on the STN system. The LC (File Locater) field of a Registry File record tells in which databases on STN the Registry Number is found. In addition to the Registry File, structure searches can be conducted in such databases on STN as BEILSTEIN, CASREACT, and others. A similar file locater function is included in other chemical dictionary files, such as NLM's ChemID.

There are several types of structure searches possible in the Registry File, as well as different options for views of the molecules and different methods of inputting the structure. SciFinder Scholar masks to a certain extent the relationship between the Registry File and the CAplus File, CASREACT, and other databases intertwined with its software.

Once the structure is built and the answer set retrieved, the search proceeds as it does with compounds identified by name or molecular formula searches. The structure search can be further refined with additional structural features or by limiting it with other parameters. Once refined, the references can be retrieved that have the Registry Number of the compounds in their indexing.

In traditional, command-driven structure searching, when logging on to STN, the choice of terminal determines what type of view of the molecule you will see. If one selects option 3 at the prompt:

TERMINAL (Enter 1, 2, 3 OR ?)

the structural depictions will be encoded with regular punctuation symbols found on a computer keyboard. Thus a double bond might be indicated by a ":" or a "=". With the proper telecommunications software, selecting option 2 will depict the structures as true graphical representations. That is the default option when using STN Express with Discover! (front-end software that allows the building of the structures offline).

XIII. Types of Structure Searches in the STN Registry File.

The following types of structure searches are possible on STN:

With SciFinder Scholar, one of two true structure searching options is available, depending on whether the Substructure Search Module is included in the version of the software. The basic SciFinder Scholar search covers an exact and family search. The SSS module allows the fuller search options. (A similarity search has recently been added to the options available in SciFinder and SciFinder Scholar, but this is based on a different principle than the structure searches.)

There are actually several stages of a Registry File structure search. The first stage involves a screening of the huge file for compounds that have the requisite substituents and other features, without regard to their position on the molecule. The much more computer-intensive iteration stage involves an atom-by-atom, bond-by-bond look at the candidate molecules isolated in the screen search. Since this stage requires so much of STN's computer resources, there are limits on the number of compounds that can be looked at during the iterative stage. A sample search must be run on approximately 5% of the file, after which a prediction of whether the full file search will run to completion is given. Assuming the prediction is favorable, the full file can then be searched against the structure. Otherwise, the structure must be modified to be able to run to completion. With SciFinder Scholar, there is some built-in intelligence that offers to "autofix" a molecule that might give the system trouble. It is also wise to preview the SciFinder Scholar search to see what kinds of substances might be retrieved with the structure as drawn.

XIV. How to Create a Structure in the Registry File.

The "old-fashioned" way of building structures on the STN system is to use alphanumeric commands to gradually create the molecule. There are front-end programs such as STN Express or STN on the Web that can be used to draw a graphic depiction of the molecule offline and upload it to STN once the connection is made. Of course, SciFinder or SciFinder Scholar have a structure searching option. Nevertheless, it is instructive to see the original commands used to draw the molecule and the options for assigning parameters to the structure. When building the structure online via commands, it is advisable for cost reasons to build it in the cheap LREG file. Once complete and an L# is assigned to the structure query, you can transfer to the more expensive Registry File to run the search.

These are the basic steps that must be followed to create the structure online on STN using command language:

  1. Initiate the structure creation sub-program on the STN system by giving the STRUCTURE command at the STN LREG file prompt "=>".
  2. Build the outline of the structure using the GRAph command.
  3. Specify the non-carbon atoms with the NODe command.
  4. Specify the types of bonds in the molecule with the BONd command.
  5. Specify additional requirements for the molecule, such as:
  6. Do a final display of the molecule you have built with the DIS SIA (Display the Structure Image and Attributes) command.
  7. Terminate structure building with the END command.

At this point, an L# is assigned to the structure query you have created. Once the Registry File is entered, the structure search is initiated with the SEARCH L# command. An example of the structure building process using commands on STN and a Type 3 (alphanumeric) terminal setting is seen here.

XV. The GRA Command and the Use of Pre-Drawn Structures

The Graph command builds the basic outline of the molecule. This can be a cumbersome process for larger molecules. Hence, there are alternatives. One way is to start with the Registry Number of a known substance that is similar to the compound of interest. Once the STRUCTURE command is given, you are prompted to:

ENTER NAME OF STRUCTURE TO BE RECALLED (NONE):

At this point, you could enter a Registry Number or, if you have built another structure in this session, the L# for that query structure. Another alternative is to enter a code for the pre-drawn systems used in creating structures. Rings of size 5 to 12 ring atoms can be created simply by inputting the appropriate number at the prompt. Other pre-drawn options include STEROD (steroids) and ADAMAN (adamantanes).

If starting from scratch, the two basic options for the GRA command are to draw a chain (c) or ring (r) followed by a number indicating the size of the chain or ring. Thus, GRA c3 builds a chain of 3 atoms, and GRA r6 builds a 6-membered ring. The structures appear on the screen with carbon atoms as the default nodes, and unspecified bonds. All nodes are numbered, so further commands to the system utilize the node numbers for appropriate actions.

One potentially confusing use of the GRA command occurs when two nodes are to be connected. Intuitively, this would seem to involve the BON command because we want to form a bond between the two atom nodes. However, BON is used only to modify an unspecified bond created with the GRA command. Thus, if we wanted to create a 14-membered ring, one way to do it would be to GRA c14, then GRA 1-14. That puts the necessary link between the two end nodes (although some other moving of the atoms would be necessary to make it appear reasonable on the screen).
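It may help to think of what GRA is doing as editing a list of connections between numbered nodes. The Python sketch below models that idea only; the function names are invented, bonds are stored as pairs of node numbers with no bond type (as GRA leaves them unspecified), and it is not a rendering of the actual STN structure-building software.

```python
def gra_chain(n):
    """Model of GRA c<n>: nodes 1..n joined in a row."""
    return {(i, i + 1) for i in range(1, n)}


def gra_ring(n):
    """Model of GRA r<n>: a chain whose two end nodes are also joined."""
    return gra_chain(n) | {(1, n)}


def gra_connect(bonds, a, b):
    """Model of GRA a-b: add a connection between two existing nodes."""
    return bonds | {(min(a, b), max(a, b))}


def degrees(bonds):
    """Count the connections at each node."""
    counts = {}
    for a, b in bonds:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return counts


# GRA c14 followed by GRA 1-14 gives the same connections as a 14-membered ring.
ring = gra_connect(gra_chain(14), 1, 14)
print(ring == gra_ring(14))           # True
print(set(degrees(ring).values()))    # {2}
```

Every node ends up with exactly two connections, which is what makes the closed chain a ring; changing the type of any of those connections would then be a job for BON.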

XVI. Use of the NOD Command

The NOD command takes the form: NOD # symbol where the # refers to the number of the node in the molecule and the symbol is defined either by regular symbols for the elements or by special node symbols understood by the STN system. The latter include such things as "X" to represent any halogen, "M" to represent a metal, or "Gk" (where k represents a number from 1 to 20) to indicate a node which can vary according to your definition of the possible symbols (done with the VARiable command). There are also a number of SHORTCUT SYMBOLS for groups such as methyl "ME" or tert-butyl "T-BU".

There are four GENERIC GROUP SYMBOLS:

By issuing the GGC (Generic Group Category) command, these symbols can be further limited by type, for example, linear "LIN" or low carbon (6 or fewer carbons) "LOC".

Finally, it may be necessary to define a node as potentially being in either a ring or a chain. This is done with the command NOD # rc. Since the system assumes by default that the node is only to exist in the environment drawn, it is necessary to override the default with the rc specification when it is ok for an end node to be in either a ring or a chain in a substructure answer set.

XVII. Use of the BON Command

The bond codes used in the Registry File structure building process are letter codes to specify bond types such as "se" (single exact) or "d" (double) or "n" (normalized). A NORMALIZED BOND is an aromatic bond or one found in a tautomer or combinations of rings and tautomers. If a ring has an even number of atoms and contains alternating single and double bonds all the way around the ring system, the bonds in the ring are designated as normalized. For fused rings, only the outside path is considered.
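The ring part of the normalized-bond rule just stated (an even number of atoms, with single and double bonds alternating all the way around) is easy to check mechanically. The Python sketch below tests exactly that rule and nothing more; the function name is made up, the tautomer case is not covered, and for fused systems you would pass in the bonds of the outside path.

```python
def is_normalized_ring(bond_orders):
    """bond_orders: orders (1 or 2) of the bonds met walking once around the ring."""
    n = len(bond_orders)  # a ring has as many bonds as atoms
    if n == 0 or n % 2 != 0:
        return False
    # Each bond and the next one around the ring must be one single and one double.
    return all({bond_orders[i], bond_orders[(i + 1) % n]} == {1, 2} for i in range(n))


print(is_normalized_ring([2, 1, 2, 1, 2, 1]))  # benzene drawn with alternating bonds: True
print(is_normalized_ring([2, 1, 1, 1, 1, 1]))  # cyclohexene: False
print(is_normalized_ring([2, 1, 2, 1, 1]))     # five-membered ring: False
```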

For a tautomer, the following environment must exist:

where:

It is also possible to specify that a bond is only a ring bond or only a chain bond by defining it as BON rs or BON cd, for example. By default the system will assume that the bond is only to be part of a compound that has the environment in which it is drawn.

CASREACT: In addition, common functional groups in the reactants, reagents, and products are searchable with a name labeled as /FG. For example,

=> S PRIMARY AMINE/FG

or

=> S TRIHALIDE/FG.RCT

ROLES: used to describe the information that deals with the substances indexed in a document. One of the super roles is PREP (Preparation), which has more specific roles:

The PREP super role is equivalent to the STN CA/CAplus File search => S L#/P. The same results would be found with the strategy: => S L#/PREP

Roles must be appended to a Registry File answer set if used with an L#. However, they can be applied both to L#'s which may contain one or more substances and to individual CAS Registry Numbers or General Subject Index terms for classes of substances. For example, => S 91-56-5/SPN would find a laboratory-scale preparation of isatin.

It is also possible to label a substance with the role RCT (Reactant) in order to limit the answer set to references where a particular reactant is used, as in:

=> S L# AND 91-56-5/RCT (Note that the L# in this case may not be from the Registry File.)

In the CA File on STN, a convenient way to find all kinds of ways of preparing a substance is to search the CAS Registry Number, either directly or by crossing over from the Registry File. A "P" appended to the search strategy results in the search being limited to the items of interest. For example:

=> S 91-56-5P

or

=> S L#/P

In the second case, the "L#" (answer set) would have resulted from a search in the Registry File that found one or more compounds. It could represent a group of related substances from a substructure search. The "/" is required before the "P" in the CA/CAplus file search when using such a L#.
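The difference between the two forms (a bare "P" after a Registry Number, "/P" after an L#) is easy to get wrong at the keyboard. The Python helper below simply encodes the rule as described above. The function name is hypothetical, and the patterns used to recognize an L# and a CAS Registry Number (two to seven digits, two digits, one digit) are this sketch's own; it formats a search term and does not talk to STN.

```python
import re

L_NUMBER = re.compile(r"L\d+")
REGISTRY_NUMBER = re.compile(r"\d{2,7}-\d{2}-\d")


def preparation_term(term):
    """Append the preparation qualifier in the form the CA File expects."""
    term = term.strip().upper()
    if L_NUMBER.fullmatch(term):
        return f"{term}/P"   # answer sets need the slash
    if REGISTRY_NUMBER.fullmatch(term):
        return f"{term}P"    # Registry Numbers take the P directly
    raise ValueError(f"not an L# or a CAS Registry Number: {term!r}")


print("=> S " + preparation_term("91-56-5"))  # => S 91-56-5P
print("=> S " + preparation_term("L2"))       # => S L2/P
```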

MISCELLANEOUS COMMANDS:

SET REG OFF

This command allows you to suppress the automatic REGISTRY search and crossover initiated by REG1stRY when a CAS Registry Number is entered in the CA/CAplus family of files. REG1stRY is, by default, ON. Simply enter SET REG1stRY OFF at any arrow prompt to suppress REG1stRY. When SET REG OFF is used, the CAS Registry Number is searched in the BASIC INDEX (/BI). When you search terms in other REG1stRY fields, e.g. chemical name (/CN) or molecular formula (/MF), SET REG OFF does not affect the automatic REGISTRY search and crossover. See HELP SET REG for more details.

How Structure Searching Really Works

From a CHMINF-L posting by Harold Helson of CambridgeSoft's ChemFinder development team:

Most all structure search programs use the same basic approach nowadays. There are differences in how structures are drawn and interpreted, and in advanced capabilities, but the basic methodology described below is widely followed.

Atom-by-Atom Search

At the heart of the search is a utility called ABAS, or atom-by-atom-search. It starts with a more or less arbitrary atom in the query and maps it to every matching atom in the target, in turn. For each of these cases, it examines a neighbor atom of the mapped atom in the query, and explores all the (allowed) mappings of this onto the neighbor atoms of the mapped atom of the target. For each successful second-atom mapping, another neighbor is chosen, etc. This is a recursive process that explores all of the possible mappings of the query structure onto the target. As soon as it finds a complete mapping, one that has assigned every query atom to a matching target atom, it can quit with the answer. The matching criteria generally include atom type, and may or may not include charge, isotopy and radical character, depending on the search preferences. (Implicit in the description above is that bond types must also match along the way.)
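The recursive mapping described above can be sketched in a few lines of Python. This is a teaching sketch under stated assumptions, not ChemFinder's or STN's code: molecules are plain dictionaries of heavy atoms and bond orders, hydrogens are ignored, matching is on element symbol and bond order only, and every candidate target atom is tried rather than only the neighbors of the last mapped atom (the bond check makes the result the same, just less efficiently).

```python
def make_mol(atoms, bonds):
    """atoms: {id: symbol}; bonds: [(a, b, order)] -> (atoms, adjacency)."""
    adj = {a: {} for a in atoms}
    for a, b, order in bonds:
        adj[a][b] = order
        adj[b][a] = order
    return atoms, adj


def _visit_order(adj):
    """Depth-first order, so each query atom follows one of its neighbors."""
    order, seen = [], set()
    for start in adj:
        stack = [start]
        while stack:
            a = stack.pop()
            if a not in seen:
                seen.add(a)
                order.append(a)
                stack.extend(adj[a])
    return order


def abas(query, target):
    """Return one mapping {query atom: target atom}, or None if there is none."""
    q_atoms, q_adj = query
    t_atoms, t_adj = target
    order = _visit_order(q_adj)

    def compatible(q, t, mapping):
        if q_atoms[q] != t_atoms[t] or t in mapping.values():
            return False
        # Bonds to already-mapped query neighbors must exist in the target, same order.
        return all(t_adj[t].get(mapping[n]) == bond
                   for n, bond in q_adj[q].items() if n in mapping)

    def extend(i, mapping):
        if i == len(order):
            return dict(mapping)      # every query atom is mapped: quit with the answer
        q = order[i]
        for t in t_atoms:
            if compatible(q, t, mapping):
                mapping[q] = t
                found = extend(i + 1, mapping)
                if found is not None:
                    return found
                del mapping[q]        # backtrack and try the next target atom
        return None

    return extend(0, {})


carbonyl = make_mol({1: "C", 2: "O"}, [(1, 2, 2)])
acetone = make_mol({1: "C", 2: "C", 3: "O", 4: "C"}, [(1, 2, 1), (2, 3, 2), (2, 4, 1)])
ethanol = make_mol({1: "C", 2: "C", 3: "O"}, [(1, 2, 1), (2, 3, 1)])
print(abas(carbonyl, acetone))  # {1: 2, 2: 3}
print(abas(carbonyl, ethanol))  # None
```

In the acetone case the first attempt (query carbon onto a methyl carbon) fails at the oxygen and is undone, which is the backtracking the posting describes.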

A substructure search is one in which the query has fewer atoms than the target, whereas a full structure search is one with equal numbers of atoms.

Screening

ABAS is relatively slow, particularly when thousands or millions of targets must be searched. Therefore most all programs use a "screening" step before ABAS. For example, if the query contains a carbonyl group, it is pointless to examine targets that lack one. A bitscreen, i.e. a vector of ones and zeroes, representing boolean values ("present" or "absent") is used to describe the functionality in a given structure. Bit 89, for example, might signify that a C=O bond is present. If the query has bit 89 set, but the target does not, then there is no point in going on to ABAS; the target can be rejected in short order.

Bitscreens typically consist of between 100 and 5000 bits, each describing a different chemical facet, known as a "descriptor" or "key". Larger bitscreens take longer to process, but are better at weeding out impossible targets. In practice, fast boolean AND's can be used to operate on many bits in parallel, i.e. at once.

Keys must be carefully designed. For example, "Has exactly one ring" is a bad choice, because in a substructure search, the query might have exactly one ring, but the target might have three. A better key would be, "Has at least one ring". Variable features in the query reduce the effectiveness of screening. For example, if a bond is labelled, "single *or* double", then either the screen's keys must not take account of bond order, or else that bond in the query must be omitted from the screen.
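The screening test itself comes down to a single boolean AND. The Python sketch below uses an integer as the bit vector; the four keys and their bit positions are hypothetical, chosen only to echo the examples in the posting, and real screen sets are far larger.

```python
# Hypothetical keys and bit positions, for illustration only.
KEYS = {"C=O bond": 0, "at least one ring": 1, "contains N": 2, "contains halogen": 3}


def bitscreen(features):
    """Build a bit vector (stored in an int) with one bit set per feature present."""
    bits = 0
    for name in features:
        bits |= 1 << KEYS[name]
    return bits


def passes_screen(query_bits, target_bits):
    """A target survives only if it has every bit the query has."""
    return (query_bits & target_bits) == query_bits


query = bitscreen(["C=O bond"])
ketone_in_ring = bitscreen(["C=O bond", "at least one ring"])
amine = bitscreen(["contains N"])
print(passes_screen(query, ketone_in_ring))  # True: goes on to atom-by-atom search
print(passes_screen(query, amine))           # False: rejected without ABAS
```

Extra bits in the target do no harm, which is why keys of the "at least" kind work for substructure searching and keys of the "exactly" kind do not.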

Similarity Search

A similarity search is one in which targets are found that "look like" the query. What this means is obviously subject to interpretation and depends on application. In biological applications the drug absorption properties are relevant. In a toxicological context the metabolism is of interest. More simply, though, the similarity in functional groups is what is measured. The bitscreen used in screening is very convenient to this end. Comparing the number of keys the query and target have in common gives an indication of how similar they are. This method of similarity search is very fast to execute. A quantitative measure of similarity, called Tanimoto after its discoverer, is given by the number of bits in common divided by the total number of bits. It ranges from 0 to 100%. (That is appropriate for full-structure similarity. It is also possible to calculate substructural similarity, by substituting the number of bits set in the query as the denominator.)
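Both measures mentioned above can be computed directly from two bitscreens. In the Python sketch below, "the total number of bits" is read as the number of bits set in either structure, which is the usual definition of the Tanimoto coefficient; that reading, the function names, and the value returned for two empty screens are assumptions of this sketch.

```python
def count_bits(bits):
    """Number of bits set in an integer used as a bit vector."""
    return bin(bits).count("1")


def tanimoto(a, b):
    """Bits in common divided by bits set in either screen (0.0 to 1.0)."""
    either = count_bits(a | b)
    return count_bits(a & b) / either if either else 1.0  # convention for two empty screens


def substructure_similarity(query, target):
    """Bits in common divided by the bits set in the query."""
    in_query = count_bits(query)
    return count_bits(query & target) / in_query if in_query else 1.0


query, target = 0b0011, 0b1011
print(tanimoto(query, target))                 # 2 bits in common out of 3 set in either
print(substructure_similarity(query, target))  # 1.0: every query bit is in the target
```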

Tricks

Various tricks other than screening can be used to speed up searching. One of the slow steps is the disk access time involved in loading the candidate targets for ABAS. (This is of less concern to screening, because the screens occupy relatively little disk space.) By installing huge amounts of RAM in the computer, one can keep all of the targets' connection tables in memory, and avoid the disk access hit altogether. Alternatively, it may be possible to order the targets, so that the targets passing screening in various common searches are located adjacently on the disk. This decreases the number of disk reads required.

It is also possible to pre-analyze the query and targets in ways that speed up ABAS at search time. For example, sorting the atoms in the query so that rarer atoms appear first can speed rejection of "false" targets. That is, the most quixotic features of the query are exposed to the target early. Analysis of symmetry in the query or the target can reduce the number of pairings that must be tried.

Challenges

Matching stereochemistry is one of the harder aspects of structure searching. Internal (computer) representation of stereochemistry is somewhat cumbersome, as is its perception from a drawing. "Generic" (or "variable") features also offer formidable challenges. Various pseudo-atom types are used, such as "X" for any halogen, "Q" for any heteroatom, "M" for any metal, and "R" for anything at all. Atoms may be required to be neutral, or possess a given charge or any charge. Isotopy may likewise be of interest. It is often useful to specify that a given query atom have exactly (or at least; or no more than) "n" substituents, without drawing out what those substituents are. Likewise, bonds may be specified as residing in rings, or in chains. "Markush" searching is the term used to describe queries with generic features. It is possible, but very complicated, to screen generic structures well and rapidly.

Reactions

Searching reactions poses additional challenges. It is important that the "spirit" of the transformation in the query be present in the target. If the query consists of ---OH --> ===O then it is not enough that the target contains a hydroxyl among its reactants and a carbonyl in its products. Rather, it must contain a hydroxyl that reacts to form a carbonyl. To this end, an atom-to-atom-map (AAMap) relates the atoms in the reactants to those in the products. Atoms which have no counterparts represent parts of missing reagents or omitted products (such as water of hydrolysis or condensation). Bonds made, broken, or changing order clearly indicate a "reaction center". It is these reaction centers that are of interest, and must align between query and target. The AAMap is surprisingly hard to construct in real-life examples, which are drawn by humans with many "obvious" human assumptions that are oblique to a machine.
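Once an atom-to-atom map is available, picking out the reaction center is the easy part: compare the bonds on each side and keep the ones that were made, broken, or changed in order. The Python sketch below assumes the map has already been applied, so the same number refers to the same atom in reactant and product; the function name and data layout are invented for illustration, and hydrogens are left implicit.

```python
def reaction_center(reactant_bonds, product_bonds):
    """Bonds are {frozenset({a, b}): order}; returns the bonds whose order changed."""
    changed = {}
    for bond in reactant_bonds.keys() | product_bonds.keys():
        before = reactant_bonds.get(bond, 0)  # 0 means no bond on that side
        after = product_bonds.get(bond, 0)
        if before != after:
            changed[bond] = (before, after)
    return changed


# Oxidation of an alcohol to a carbonyl: atom 1 = C, atom 2 = O, atom 3 = neighboring C.
reactant = {frozenset({1, 2}): 1, frozenset({1, 3}): 1}
product = {frozenset({1, 2}): 2, frozenset({1, 3}): 1}
print(reaction_center(reactant, product))  # only the C-O bond, going from order 1 to 2
```

A target that merely contains a hydroxyl on one side and an unrelated carbonyl on the other would show no change on a mapped C-O bond, so its reaction center would not line up with this query.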

-----Original Message-----
From: CHEMICAL INFORMATION SOURCES DISCUSSION LIST [mailto:CHMINF-L@LISTSERV.INDIANA.EDU] On Behalf Of Dana Roth
Sent: Monday, February 26, 2001 9:56 PM
To: CHMINF-L@LISTSERV.INDIANA.EDU
Subject: Explanation of structure searching

Does anyone have a favorite written description of how structure searching works behind the scenes -- something understandable to undergraduate chemistry majors, some of whom have taken only the minimum required organic courses?