Description: |
The central theme of this dissertation is the statistical analysis of retrieval data. Features commonly used in modern retrieval systems are studied and modeled. The product of this analysis is a methodology for the study of retrieval data and the construction of probabilistic retrieval models. Model building is based on the formal concept of weight of evidence, which is a measure of how much our belief in a hypothesis (such as the relevance of a document) is increased as a result of observing the value of a random variable (for example, the number of times a query term appears in the document). Application of the methodology results in the development of a probabilistic model from which a ranking formula is derived. The ranking status value assigned to each document is equal to the weight of evidence due to the combination of features that have been observed. The resulting formula has two important properties: (1) it is decomposable, with each component corresponding to observed statistical regularities of retrieval situations; and (2) the value produced has a precise, empirically verifiable probabilistic interpretation. Experimental evidence is reported indicating that the ranking formula derived from the data analysis is able to produce retrieval performance comparable to that of a state-of-the-art IR system. In conjunction with the study of empirical data, a formal framework is developed which supports the approach to modeling that is used. The formalism is founded on the Maximum Entropy Principle. This principle states that the probability distribution that we attribute to an unknown stochastic process should be the one that assumes the least while remaining consonant with the constraints embodying the knowledge we do possess. Guided by this principle, a theory of weight of evidence is developed. In this theory, additivity of weight of evidence is proved to be a characteristic of the maximum entropy distribution under general conditions on the form of the constraints.
As well as serving as a justification for the modeling strategy adopted in the dissertation, the theory is shown to yield two classical probabilistic retrieval models. |