Popis: |
Text indexing is a fundamental problem in computer science, where the task is to index a given text (string) T 1 . . n , such that whenever a pattern P 1 . . p comes as a query, we can efficiently report all those locations where P occurs as a substring of T. In this paper, we consider the case when P contains wildcard characters (which can match with any other character). The first non-trivial solution for the problem was given by Cole et al. 11, where the index space is O ( n log k ? n ) words or O ( n log k + 1 ? n ) bits and the query time is O ( p + 2 h log ? log ? n + occ ) , where k is the maximum number of wildcard characters allowed in P, h ? k is the number of wildcard characters in P and occ represents the number of occurrences of P in T. Even though many indexes offering different space-time trade-offs were later proposed, a clear improvement on this result is still not known. In this paper, we first propose an O ( n log k + ? ? n ) bits index achieving the same query time as the of Cole et al.'s index, where 0 < ? < 1 is an arbitrary small constant. Then we propose another index of size O ( n log k ? n log ? ? ) bits, but with a slightly higher query time of O ( p + 2 h log ? n + occ ) , where ? denotes the alphabet set size.We also study a related problem, where the task is to index a collection of documents (of n characters in total) so as to find the number of distinct documents containing a query pattern P. For the case where P contains at most a single wildcard character, we propose an O ( n log ? n ) -word index with optimal O ( p ) query time. |