FastRCA-Seq: An efficient approach for extracting hierarchies of multilevel closed partially-ordered patterns

Autor: Cristina Nica, Adrian Groza, Victor-Petru Almăşan
Rok vydání: 2020
Předmět:
Zdroj: Knowledge-Based Systems. 210:106533
ISSN: 0950-7051
DOI: 10.1016/j.knosys.2020.106533
Popis: Discovering concise representations of sequential patterns in sequential data is a well-established data mining task. Recently, Nica et al. have put forward an original approach RCA-Seq for directly extracting a hierarchy of multilevel closed partially-ordered patterns (MCPO-patterns) from a sequence database within the Relational Concept Analysis (RCA) framework. RCA-Seq has been applied successfully to small ( ∼ 1 , 000 sequences) but interesting real hydro-ecological datasets. However, RCA-Seq only focuses on providing comprehensible results to the detriment of performance. To improve the performance of RCA-Seq , we propose a new approach FastRCA-Seq that stems from RCA-Seq , and whose contributions are beneficial for two fields: Formal Concept Analysis, namely the RCA extension, and sequential pattern mining. FastRCA-Seq spans two key steps: the exploration of sequential data based on RCA, and the extraction of MCPO-patterns by navigating the RCA result. Firstly, our approach introduces an effective RCA implementation based on bit-array representations, bitwise operations, parallel computing, and several new properties of RCA that may prevent expensive computations. In addition, we state the bottleneck of RCA. Secondly, FastRCA-Seq is a self-contained approach for directly and efficiently mining hierarchies of MCPO-patterns from sequential data. We assess FastRCA-Seq on various benchmark datasets, precisely Gazelle, Kosarak, and FIFA. The results show that FastRCA-Seq outperforms RCA-Seq in terms of execution time (in average ∼ 169 times faster) and memory usage (in average with ∼ 42 % less) while preserving the benefits of interpretability and usability of results by stakeholders.
Databáze: OpenAIRE