Popis: |
Extraction of metadata from documents is a tedious and expensive process. In general, documents are manually reviewed for structured data such as title, author, date, organization, etc. The purpose of extraction is to build metadata for documents that can be used when formulating structured queries. In many large document repositories, such as the National Library of Medicine (NLM)1 or university libraries, the extraction task is a daily process that has spanned decades. Although some automation is used in the extraction process, metadata extraction remains largely a manual task. Aside from the cost and labor time, manual processing is error-prone and requires many levels of quality control. Recent advances in extraction technology, as reported at the Message Understanding Conference (MUC),2 have produced results comparable with extraction performed by humans. In addition, many organizations use historical data for lookup to improve the quality of extraction. For the large government document repository we are working with, the task involves extracting several fields from millions of OCR'd and electronic documents. Since this project is time-sensitive, automatic extraction is the only viable solution. Each document has more than a dozen fields that require extraction. In this paper, we report on the extraction and generation of the title field.