Detecting changes in high frequency data streams, with applications

Author: Ross, Gordon J.
Year of publication: 2013
Subject:
Document type: Electronic Thesis or Dissertation
Description: In recent years, problems relating to the analysis of data streams have become widespread. A data stream is a collection of time-ordered observations x1, x2, ... generated from the random variables X1, X2, ... It is assumed that the observations are univariate and independent, and that they arrive in discrete time. Unlike traditional sequential analysis problems considered by statisticians, the size of a data stream is not assumed to be fixed, and new observations may be received over time. The rate at which these observations are received can be very high, perhaps several thousand every second. Therefore, computational efficiency is very important, and methods used for analysis must be able to cope with potentially huge data sets. This thesis is concerned with the task of detecting whether a data stream contains a change point, and extends traditional methods for sequential change detection to the streaming context. We focus on two different settings of the change point problem. The first is nonparametric change detection where, in contrast to most of the existing literature, we assume that nothing is known about either the pre- or post-change stream distribution. The task is then to detect a change from an unknown base distribution F0 to an unknown distribution F1. Further, we impose the constraint that change detection methods must have a bounded rate of false positives, which is important when it comes to assessing the significance of discovered change points. It is this constraint which makes the nonparametric problem difficult. We present several novel methods for this problem, and compare their performance via extensive experimental analysis. The second strand of our research is Bernoulli change detection, with application to streaming classification. In this setting, we assume a parametric form for the stream distribution, but one where both the pre- and post-change parameters are unknown. The task is again to detect changes while controlling the rate of false positives. After developing two different methods for tackling the pure Bernoulli change detection task, we then show how our approach can be deployed in streaming classification applications. Here, the goal is to classify objects into one of several categories. In the streaming case, the optimal classification rule can change over time, and classification techniques which are not able to adapt to these changes will suffer performance degradation. We show that by focusing only on the frequency of errors produced by the classifier, we can treat this as a Bernoulli change detection problem, and again perform extensive experimental analysis to show the value of our methods.
Database: Networked Digital Library of Theses & Dissertations