Mining cohesive patterns in sequences and extreme multi-label classification

Author: Feremans, Len
Contributors: Goethals, Bart
Language: English
Year of publication: 2020
Subject:
Description: Finding patterns in long event sequences is an important data mining task. Earlier research focused on finding all frequent patterns, exploiting the anti-monotonic property of frequency to design efficient algorithms. More recent research has focused on producing a smaller output containing only the most interesting patterns. In this thesis, we discover patterns using cohesion and quantile-based cohesion. Cohesion measures how close together the items making up a pattern are on average. Quantile-based cohesion measures the proportion of pattern occurrences that are cohesive. Because neither measure is anti-monotonic, we develop an upper bound to prune the search space. Experiments show that our method efficiently discovers important patterns that existing state-of-the-art methods fail to discover. In the second part of this thesis, we focus on multi-label classification, which is important in applications such as text categorisation, scene classification and bioinformatics. In machine learning, multi-label classification is the problem of identifying a set of labels for a new instance, based on a training database of labelled instances. Traditionally, methods learn a separate model for each label; however, this is not feasible for datasets with millions of labels. We propose a new algorithm that predicts labels using a linear ensemble of instance-based and feature-based nearest neighbours. We tackle the problem of computing cosine similarity and similarity-weighted predictions on large datasets using an inverted index and sparse optimisation. In addition, we propose a new top-k query with pruning based on a partition of the training database. Experiments show that our method is more accurate and orders of magnitude faster than state-of-the-art methods, and that it requires less than 20 ms per instance to predict labels for extreme datasets consisting of hundreds of thousands of labels, without the need for expensive hardware.
Database: OpenAIRE
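
Illustrative note: the description mentions computing cosine similarity and similarity-weighted predictions with an inverted index. The Python sketch below shows that general idea only, for the instance-based nearest-neighbour case. It is not the thesis's algorithm: it omits the feature-based neighbours, the linear ensemble, and the partition-based top-k pruning, and all function names and the toy data layout (sparse rows as dictionaries) are assumptions made for this example.

```python
from collections import defaultdict
from math import sqrt


def normalise(vec):
    """Scale a sparse vector (feature -> weight) to unit L2 norm."""
    norm = sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()} if norm else {}


def build_index(rows):
    """Inverted index: feature -> list of (row id, normalised weight)."""
    index = defaultdict(list)
    for i, row in enumerate(rows):
        for f, w in normalise(row).items():
            index[f].append((i, w))
    return index


def top_k_neighbours(index, query, k):
    """Cosine similarity against rows sharing at least one feature with the query."""
    scores = defaultdict(float)
    for f, qw in normalise(query).items():
        for i, w in index.get(f, ()):
            scores[i] += qw * w  # dot product of unit vectors = cosine
    return sorted(scores.items(), key=lambda p: (-p[1], p[0]))[:k]


def predict_labels(index, labels, query, k):
    """Rank labels by the summed similarity of the neighbours that carry them."""
    votes = defaultdict(float)
    for i, sim in top_k_neighbours(index, query, k):
        for label in labels[i]:
            votes[label] += sim
    return sorted(votes, key=lambda l: (-votes[l], l))
```

Only rows that share a non-zero feature with the query are ever touched, which is why an inverted index suits sparse, high-dimensional data.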