Abstrakt: |
Large databases can be a source of useful knowledge. Yet this knowledge is implicit in the data. It must be mined and expressed in a concise, useful form of statistical patterns, equations, rules, conceptual hierarchies, and the like. Automation of knowledge discovery is important because databases are growing in size and number, and standard data analysis techniques are not designed for exploration of huge hypothesis spaces. We concentrate on discovery of regularities, defining a regularity by a pattern and the range in which that pattern holds. We argue that two types of patterns are particularly important: contingency tables and equations, and we present Forty-Niner (49er), a general-purpose database mining system that conducts a large-scale search for those patterns in many subsets of data and undertakes the more costly search for equations only when the data indicate a functional relationship. 49er can refine the initial regularities to yield stronger and more general regularities and more useful concepts. 49er combines several searches, each contributing to a different aspect of a regularity. Correspondence between the components of search and the structure of regularities makes the system easy to understand, use, and expand. Finally, we discuss 49er's performance in four categories of tests: (1) open exploration of new databases; (2) reproduction of human findings (limited because databases which have been extensively explored are very rare); (3) hide-and-seek testing on artificially created data, to evaluate 49er on a large scale against known results; (4) exploration of randomly generated databases. |