Green Interaction for Extracting Family Information from OCR'd Books

Authors: David W. Embley, George Nagy
Year of publication: 2018
Subject:
Source: DAS
DOI: 10.1109/das.2018.58
Description: Repetitively formatted historical books are tokenized and tagged according to eight token types (capitalized words, numbers, punctuation …). To extract family information, templates of short sequences of tags are generated around frequent proper nouns and specified tokens like "born". Each template is associated with a user-assigned class (head of household, father, mother, spouse, geographic location …) and a pointer to an overlapping or nearby fragment of text to be extracted. Matching the templates against the book text yields class-labeled factoids. In each interaction cycle, new extraction templates are proposed for user approval or editing. Each edit-then-extract cycle typically yields thousands of factoids and a dozen new templates. In five interactive sessions of approximately half an hour each, 44,000 genealogical factoids were extracted from a 17th-century Scottish register of marriages and births and from published 19th- and 20th-century Ohio funeral parlor records. The experience indicates that this method quickly yields quality results, with a higher F-score than reported for hand-constructed rule templates.
Database: OpenAIRE
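
Illustration: the description outlines a pipeline of tagging tokens, matching short tag templates, and extracting a fragment that each template points to. The sketch below shows one way such a pipeline might look; it is not the authors' implementation. The tag set (CAP, LOW, NUM, PUN), the Template fields, the sample sentence, and the function names are all assumptions made for this example, and the record does not enumerate the eight token types actually used.

```python
import re
from dataclasses import dataclass

# Hypothetical tokenizer: runs of letters, runs of digits, or single punctuation marks.
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]")


def tag(token: str) -> str:
    """Assign a coarse type tag (an assumed tag set, not the paper's eight types)."""
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        return "CAP" if token[0].isupper() else "LOW"
    return "PUN"


def tokenize(text: str) -> list:
    """Return a list of (token, tag) pairs."""
    return [(t, tag(t)) for t in TOKEN_RE.findall(text)]


@dataclass
class Template:
    # Each pattern item is ("tag", "CAP") to match any token with that tag,
    # or ("lit", "born") to match one specific token.
    pattern: list
    label: str    # user-assigned class, e.g. "father"
    start: int    # offset of the extracted fragment relative to the match start
    length: int   # number of tokens in the extracted fragment


def item_matches(item, token, token_tag) -> bool:
    kind, value = item
    return value == (token_tag if kind == "tag" else token)


def extract(tokens: list, template: Template) -> list:
    """Slide the template over the tokens and collect (label, fragment) factoids."""
    n = len(template.pattern)
    factoids = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(item_matches(item, tok, tg)
               for item, (tok, tg) in zip(template.pattern, window)):
            lo = i + template.start
            hi = lo + template.length
            if lo < 0 or hi > len(tokens):
                continue  # the pointed-to fragment falls outside the text
            factoids.append((template.label, " ".join(t for t, _ in tokens[lo:hi])))
    return factoids


# Invented sample sentence, not taken from the books described in the record.
sample = tokenize("John Smith, born 1712, son of William Smith.")

birth_year = Template([("lit", "born"), ("tag", "NUM")], "birth year", 1, 1)
father = Template(
    [("lit", "son"), ("lit", "of"), ("tag", "CAP"), ("tag", "CAP")], "father", 2, 2
)
# The fragment may lie outside the matched span: here it precedes the match.
child = Template([("lit", ","), ("lit", "born")], "child", -2, 2)

birth_year_factoids = extract(sample, birth_year)
father_factoids = extract(sample, father)
child_factoids = extract(sample, child)
```

Keeping the fragment pointer as an offset relative to the match start lets one template capture text that overlaps the matched tags (as in the father template) or sits next to them (as in the child template), which mirrors the "overlapping or nearby fragment" wording of the description.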