Cost-aware active learning for named entity recognition in clinical text
Author: Joshua C. Denny, Trevor Cohen, Hua Xu, Qiang Wei, Qiaozhu Mei, Qingxia Chen, Amy Franklin, Yukun Chen, Thomas A. Lasko, Stephen Wu, Mandana Salimi
Language: English
Year of publication: 2019
Subject: Active learning (machine learning); Named-entity recognition; Natural language processing; Electronic health records; Machine learning; Annotation; Clinical text; Artificial intelligence; Learning curve; Passive learning; Information storage and retrieval; Health informatics; Computer simulation; Algorithms; Humans
Source: J Am Med Inform Assoc
Description:
Objective: Active learning (AL) attempts to reduce annotation cost (ie, time) by selecting the most informative examples for annotation. Most approaches tacitly (and unrealistically) assume that the cost of annotating each sample is identical. This study introduces a cost-aware AL method that simultaneously models both the annotation cost and the informativeness of the samples, and evaluates it via both a simulation study and a user study.
Materials and Methods: We designed a novel, cost-aware AL algorithm (Cost-CAUSE) for annotating clinical named entities. We first used lexical and syntactic features to estimate annotation cost, then incorporated this cost measure into an existing AL algorithm. Using the 2010 i2b2/VA data set, we conducted a simulation study comparing Cost-CAUSE with non-cost-aware AL methods, and a user study comparing Cost-CAUSE with passive learning.
Results: Our cost model fit empirical annotation data well, and Cost-CAUSE increased the simulation area under the learning curve (ALC) scores by up to 5.6% and 4.9% compared with random sampling and alternate AL methods, respectively. Moreover, in a user annotation task, Cost-CAUSE outperformed passive learning on the ALC score and reduced annotation time by 20.5%–30.2%.
Discussion: Although AL has proven effective in simulations, our user study shows that a real-world environment is far more complex. Other factors have a noticeable effect on the AL method, such as users' annotation accuracy, fatigue, and even physical and mental condition.
Conclusion: Cost-CAUSE saves significant annotation cost compared with random sampling.
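The core idea described above, ranking unlabeled samples by informativeness per unit of estimated annotation cost rather than by informativeness alone, can be sketched as follows. This is a hypothetical illustration, not the paper's Cost-CAUSE implementation: the function names, the length-based cost proxy, and the entropy criterion are assumptions (the paper fits a richer cost model on lexical and syntactic features and plugs it into an existing AL algorithm).

```python
import math

def estimated_cost(sentence):
    """Crude proxy for annotation time: longer sentences cost more.
    (Assumption for illustration; the paper learns cost from
    lexical and syntactic features of the sentence.)"""
    return 1.0 + 0.5 * len(sentence.split())

def token_entropy(prob_dists):
    """Informativeness: mean entropy of the model's per-token
    label distributions for one sentence."""
    total = 0.0
    for dist in prob_dists:
        total += -sum(p * math.log(p) for p in dist if p > 0)
    return total / max(len(prob_dists), 1)

def select_batch(candidates, batch_size):
    """Cost-aware selection: pick the sentences with the highest
    informativeness-to-cost ratio.

    candidates: list of (sentence, per-token label distributions).
    """
    scored = [
        (token_entropy(dists) / estimated_cost(sent), sent)
        for sent, dists in candidates
    ]
    scored.sort(reverse=True)
    return [sent for _, sent in scored[:batch_size]]
```

A purely uncertainty-based selector would rank by `token_entropy` alone and tend to favor long, expensive sentences; dividing by the cost estimate is what makes the selection cost-aware.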
Database: OpenAIRE
External link: