Description: |
Fisher's exact test is often the preferred method for estimating the significance of statistical dependence. However, on large data sets the test is usually too computationally demanding to apply, especially in an exhaustive search (data mining). The traditional solution is to approximate the significance with the $\chi^2$-measure, but the accuracy of this approximation is often unacceptable. As a solution, we introduce a family of upper bounds that are fast to calculate and approximate Fisher's $p$-value accurately. In addition, unlike the $\chi^2$-based approximation, the new approximations are not sensitive to the data size, the distribution, or the smallest expected counts. According to both theoretical and experimental analysis, the new approximations produce accurate results for all sufficiently strong dependencies. The basic form of the approximation can fail with weak dependencies, but the general form of the upper bounds can be adjusted to be arbitrarily accurate. |
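
To make the contrast between the exact test and the $\chi^2$-based approximation concrete, the following minimal Python sketch computes both for a single 2x2 contingency table. It does not implement the upper bounds introduced in the paper. The function names `fisher_one_sided` and `chi2_one_sided` and the table layout `[[a, b], [c, d]]` are assumptions made for illustration only, as is the convention of halving the two-sided $\chi^2$ tail (1 degree of freedom) to obtain a one-sided value.

```python
from math import comb, erfc, sqrt


def fisher_one_sided(a, b, c, d):
    """Exact one-sided Fisher p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins whose top-left cell is at least `a` (positive dependence).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)  # largest feasible top-left count
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


def chi2_one_sided(a, b, c, d):
    """One-sided p-value from the chi-squared statistic (1 degree of freedom).

    Assumption for illustration: the two-sided tail erfc(sqrt(x / 2)) is
    halved. All four margins must be non-zero.
    """
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return 0.5 * erfc(sqrt(stat / 2.0))


if __name__ == "__main__":
    table = (3, 1, 1, 3)  # small expected counts: the approximation is poor
    print("exact:", fisher_one_sided(*table))
    print("chi2 :", chi2_one_sided(*table))
```

For the table `[[3, 1], [1, 3]]` the exact value is 17/70, roughly 0.243, whereas the $\chi^2$-based value under the halving convention is roughly 0.079. The exact sum has up to `min(row1, col1) - a + 1` terms with very large binomial coefficients, which illustrates why evaluating it for every candidate in an exhaustive search is costly.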