Abstract: |
High-recall Information REtrieval (HIRE) aims to identify (almost) all of the documents relevant to a given query, and only those. HIRE is paramount in applications such as systematic literature reviews, medicine, and legal jurisprudence. To address the HIRE goals, active learning methods have proven valuable in identifying informative, non-redundant documents, reducing the user's manual labeling effort. We propose REVEAL-HIRE, a new active learning framework for the HIRE task. REVEAL-HIRE selects a very small set of documents to be labeled, significantly reducing the user's effort. It selects the most representative documents using a novel active learning strategy designed specifically for HIRE, called REVEAL (RelEVant rulE-based Active Learning). REVEAL aims to select the maximum number of relevant documents for a given query based on discriminative rule-based patterns and a penalization factor. The method is applied to the top-ranked documents to choose the most informative ones to be labeled, a hard task due to data skewness: most documents are irrelevant to a given query. The enhanced active learning process is repeated incrementally until a stopping point is reached, with REVEAL used to identify the point at which sampling of relevant documents should stop. Experimental results on several standard benchmark datasets (e.g., 20-Newsgroups, TREC Total Recall, and CLEF eHealth) demonstrate that REVEAL-HIRE can reduce the user labeling effort by up to 3 times (reported as a 320% reduction) in comparison with state-of-the-art baselines while keeping effectiveness at the highest levels. [ABSTRACT FROM AUTHOR] |