Applying attention-based models to classify small datasets: An example using community-engaged research protocols (Preprint)

Authors: Brian J. Ferrell, Sarah E. Raskin, Emily B. Zimmerman, David H. Timberline, Bridget T. McInnes, Alex H. Krist
Year of publication: 2021
Description: BACKGROUND: Community-Engaged Research (CEnR) is a research approach in which scholars partner with community organizations or individuals with whom they share an interest in the study topic, typically with the goal of supporting that community’s wellbeing. CEnR is well established in numerous disciplines, including the clinical and social sciences. However, universities experience challenges reporting comprehensive CEnR metrics, which limits the development of appropriate CEnR infrastructure and the advancement of relationships with communities, funders, and stakeholders. OBJECTIVE: n/a. METHODS: We propose a novel approach to identifying and categorizing community-engaged studies by applying attention-based deep learning models to human subjects protocols submitted to the university’s Institutional Review Board (IRB). We manually classified a sample of protocols submitted to the IRB using 3-level and 6-level CEnR heuristics. We then trained an attention-based bidirectional LSTM (Bi-LSTM) on the classified protocols and compared it to transformer models such as BERT, Bio+ClinicalBERT, and XLM-RoBERTa. We applied the best-performing models to the full sample of unlabeled IRB protocols submitted in the years 2013-2019 (n > 6000). RESULTS: Transfer learning appears to be superior to the attention-based Bi-LSTM model, with a testing F1 score of .9952 for all transformer models implemented. This finding is consistent across several methodological adjustments: an augmented dataset with and without cross-validation, an unaugmented dataset with and without cross-validation, a 6-class CEnR spectrum, and a 3-class CEnR spectrum. Unlike the attention-based Bi-LSTM model, BERT and the other transformer models showed an understanding of our data, which is promising for real-world application. CONCLUSIONS: Transfer learning is a viable method for classifying small datasets characterized by the idiosyncrasies and errors of the CEnR descriptions that principal investigators use in research protocols.
Database: OpenAIRE