Green Interaction for Extracting Family Information from OCR'd Books

Author:	David W. Embley, George Nagy
Year of publication:	2018
Subject:	Computer science; business.industry; media_common.quotation_subject; 02 engineering and technology; computer.software_genre; Security token; Optical character recognition software; Punctuation; Dozen; 020204 information systems; Pointer (computer programming); 0202 electrical engineering electronic engineering information engineering; Proper noun; 020201 artificial intelligence & image processing; Artificial intelligence; business; computer; Natural language processing; media_common
Source:	DAS
DOI:	10.1109/das.2018.58
Description:	Repetitively formatted historical books are tokenized and tagged according to eight token types (capitalized words, numbers, punctuation …). To extract family information, templates consisting of short sequences of tags are generated around frequent proper nouns and specified tokens such as "born". Each template is associated with a user-assigned class (head of household, father, mother, spouse, geographic location …) and a pointer to an overlapping or nearby fragment of text to be extracted. Matching the templates against the book text yields class-labeled factoids. In each interaction cycle, new extraction templates are proposed for the user to approve or edit. Each edit-then-extract cycle typically yields thousands of factoids and a dozen new templates. In five interactive sessions of approximately half an hour each, 44,000 genealogical factoids were extracted from a 17th-century Scottish register of marriages and births and from published 19th- and 20th-century Ohio funeral parlor records. The experience indicates that this method quickly yields quality results, with a higher F-score than reported for hand-constructed rule templates.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::3cb33d88fbddb22af50b06e759ef0744 https://doi.org/10.1109/das.2018.58
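
The tag-and-match idea summarized in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the four-tag set, the template layout (a pattern of tags or literal tokens, a class label, and a pointer given as token offsets from the start of the match), the function names, and the sample sentence are all assumptions made up for illustration. The record names only three of the eight token types, and the "person" label is hypothetical.

```python
import re

def tokenize(text):
    # Split into words, numbers, and single punctuation marks.
    return re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", text)

def tag(token):
    # Hypothetical four-tag set; the paper uses eight token types.
    if token.isdigit():
        return "NUM"
    if token[0].isupper():
        return "CAP"
    if token.isalpha():
        return "LOW"
    return "PUNCT"

def item_matches(item, token):
    # A pattern item is either a literal token (e.g. "born") or a tag name.
    return item == token or item == tag(token)

# Hypothetical template: pattern, class label, and a pointer expressed as
# (start, end) token offsets relative to the start of the match.
TEMPLATES = [
    {"pattern": ["CAP", "CAP", ",", "born"], "label": "person", "extract": (0, 2)},
]

def extract(tokens, templates):
    # Slide every template over the token stream and collect labeled factoids.
    factoids = []
    for t in templates:
        n = len(t["pattern"])
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if all(item_matches(p, tok) for p, tok in zip(t["pattern"], window)):
                start, end = t["extract"]
                factoids.append((t["label"], " ".join(tokens[i + start:i + end])))
    return factoids

# Made-up sample text, not taken from the books studied in the paper.
sample = "John Smith, born 1652. Mary Brown, born 1655."
factoids = extract(tokenize(sample), TEMPLATES)
```

In the interactive cycle the description mentions, a user would approve or edit proposed entries like the one in `TEMPLATES` and rerun `extract`; how new templates are proposed is not covered by this sketch.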