Substructure searching on very large files by using multiple storage techniques.

Authors: Bartmann, A (AUTHOR), Walkowiak, D (AUTHOR), Roth, B (AUTHOR), Hicks, M G (AUTHOR)
Subject:
Source: Journal of Chemical Information & Computer Sciences. Jul-Aug 1993, Vol. 33 Issue 4, p539-541. 3p.
Abstract: Traditional substructure search systems use a two-stage algorithm consisting of a preliminary screening which operates on (inverted) index files to determine a set of candidates to be processed by the atom-by-atom search (ABAS). The screening stage is usually fast, the performance of the system being governed by the screening efficiency. If a large number of candidates is left after the screening, the result is often a very large increase in retrieval time. The ABAS becomes the time-determining stage due to the excessively large number of disk seeks required to get the randomly distributed structure records into memory. The new search algorithm described in this paper is based on a special preprocessed structure file. It contains multiple copies of each molecule's connection table organized in clusters forming contiguous portions of the search file. Each cluster can be characterized by a substructure contained in all its molecules. A molecule may be a member of several different clusters, or it may appear repeatedly in the same cluster. File generation and update are fast and simple, and the mass storage requirements are only about 1 kbyte/molecule. A substructure search is performed by finding the minimum set of clusters containing all candidates for a given query. The ABAS only has to scan sequentially through the relevant portions of the structure file. Furthermore, each single I/O operation can read hundreds of structures into memory. Only the structures that are not already verified to be hits by the screening must be processed. The ABAS is CPU-bound. This architecture offers extremely good performance on very large files for various computer platforms (e.g., IBM-PC, IBM-Mainframe, VAX) and even on slow storage devices like CD-ROMs. [ABSTRACT FROM AUTHOR]
Database: Library, Information Science & Technology Abstracts
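
The cluster-based search outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: each molecule is reduced to a set of fragment labels (a toy stand-in for a connection table), every fragment defines one cluster, a molecule record is copied into the cluster of each of its fragments, and the atom-by-atom search is replaced by a simple subset test. The function names `build_cluster_file`, `toy_abas`, and `search` are hypothetical. Under these assumptions, any molecule matching a query must appear in the cluster of every query fragment, so scanning the single smallest such cluster is enough.

```python
from collections import defaultdict


def build_cluster_file(molecules):
    """Build a toy clustered structure file.

    molecules: dict mapping a name to a set of fragment labels
    (an assumed stand-in for a real connection table).
    Each record is duplicated into the cluster of every fragment it
    contains, so each cluster is one contiguous, sequentially readable list.
    """
    clusters = defaultdict(list)
    for name, fragments in sorted(molecules.items()):
        record = (name, frozenset(fragments))
        for fragment in sorted(fragments):
            clusters[fragment].append(record)
    return dict(clusters)


def toy_abas(query, fragments):
    """Assumed placeholder for the atom-by-atom search: a subset test."""
    return query <= fragments


def search(clusters, query):
    """Return the names of all molecules that contain every query fragment."""
    query = frozenset(query)
    if not query:
        raise ValueError("query must contain at least one fragment")
    # A query fragment without a cluster means no molecule can match.
    if any(fragment not in clusters for fragment in query):
        return []
    # Pick the smallest relevant cluster (ties broken by label) and
    # scan it sequentially instead of seeking to scattered records.
    best = min(query, key=lambda f: (len(clusters[f]), f))
    return [name for name, frags in clusters[best] if toy_abas(query, frags)]
```

With four toy molecules (phenol, toluene, ethanol, cresol) described by the fragments "benzene", "OH", and "CH3", a query for {"benzene", "OH"} scans one three-record cluster and reports cresol and phenol. The real system described in the abstract selects a minimum set of clusters and runs a genuine graph match; this sketch only shows why contiguous clusters turn random disk seeks into a sequential scan.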