Data programming with DDLite
Authors: | Henry R. Ehrenberg, Christopher Ré, Alexander Ratner, Jaeho Shin, Jason A. Fries |
---|---|
Year of publication: | 2016 |
Subject: | 0301 basic medicine; Feature engineering; Computer science; 02 engineering and technology; Construct (python library); Machine learning; Domain (software engineering); 03 medical and health sciences; Information extraction; 030104 developmental biology; Knowledge base; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; Data analysis; Programming paradigm; Artificial intelligence; Heuristics |
Source: | HILDA@SIGMOD |
DOI: | 10.1145/2939502.2939515 |
Description: | Populating large-scale structured databases from unstructured sources is a critical and challenging task in data analytics. As automated feature engineering methods grow increasingly prevalent, constructing sufficiently large labeled training sets has become the primary hurdle in building machine learning information extraction systems. In light of this, we have taken a new approach called data programming [7]. Rather than hand-labeling data, in the data programming paradigm, users generate large amounts of noisy training labels by programmatically encoding domain heuristics as simple rules. Using this approach over more traditional distant supervision methods and fully supervised approaches using labeled data, we have been able to construct knowledge base systems more rapidly and with higher quality. Since the ability to quickly prototype, evaluate, and debug these rules is a key component of this paradigm, we introduce DDLite, an interactive development framework for data programming. This paper reports feedback collected from DDLite users across a diverse set of entity extraction tasks. We share observations from several DDLite hackathons in which 10 biomedical researchers prototyped information extraction pipelines for chemicals, diseases, and anatomical named entities. Initial results were promising, with the disease tagging team obtaining an F1 score within 10 points of the state-of-the-art in only a single day-long hackathon's work. Our key insights concern the challenges of writing diverse rule sets for generating labels, and exploring training data. These findings motivate several areas of active data programming research. |
Database: | OpenAIRE |
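
The abstract describes users encoding domain heuristics as simple rules that emit noisy training labels. The following is a minimal sketch of that idea in Python, not DDLite's actual API: every function name, the tiny disease dictionary, the suffix pattern, and the label convention (+1 positive, -1 negative, 0 abstain) are assumptions made for illustration. The votes are combined with an unweighted majority vote, which is a simplified stand-in and may differ from how data programming models and denoises rule outputs.

```python
import re
from typing import Callable, List

# Assumed label convention: +1 = disease mention, -1 = not a disease, 0 = abstain.
LabelingFunction = Callable[[str], int]


def lf_disease_suffix(candidate: str) -> int:
    """Vote positive if the candidate ends in a common disease suffix."""
    return 1 if re.search(r"(itis|osis|emia|oma)$", candidate.lower()) else 0


def lf_known_disease(candidate: str) -> int:
    """Vote positive if the candidate is in a tiny, hypothetical dictionary."""
    return 1 if candidate.lower() in {"diabetes", "asthma", "hepatitis"} else 0


def lf_too_short(candidate: str) -> int:
    """Vote negative on very short tokens, which are rarely disease names."""
    return -1 if len(candidate) < 4 else 0


def lf_contains_digit(candidate: str) -> int:
    """Vote negative on tokens with digits (often gene or protein symbols)."""
    return -1 if any(ch.isdigit() for ch in candidate) else 0


LABELING_FUNCTIONS: List[LabelingFunction] = [
    lf_disease_suffix,
    lf_known_disease,
    lf_too_short,
    lf_contains_digit,
]


def apply_lfs(candidates: List[str], lfs: List[LabelingFunction]) -> List[List[int]]:
    """Build a label matrix: one row per candidate, one column per rule."""
    return [[lf(c) for lf in lfs] for c in candidates]


def majority_vote(votes: List[int]) -> int:
    """Return the sign of the summed votes; ties and all-abstain rows give 0."""
    total = sum(votes)
    return (total > 0) - (total < 0)


candidates = ["hepatitis", "melanoma", "p53", "the", "asthma", "protein"]
label_matrix = apply_lfs(candidates, LABELING_FUNCTIONS)
noisy_labels = [majority_vote(row) for row in label_matrix]

for cand, row, label in zip(candidates, label_matrix, noisy_labels):
    print(f"{cand:10s} votes={row} label={label}")
```

Candidates on which every rule abstains (here, "protein") receive no label, which illustrates why diverse rule sets matter for coverage.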