Extracting structured data from unstructured document with incomplete resources

Author:	Hervé Déjean
Publication year:	2015
Subject:	Set (abstract data type); Information retrieval; Data model; Data extraction; Computer science; Document layout analysis
Source:	ICDAR
Description:	We present a method for extracting structured elements of information, called structured data (sdata), from OCR'ed pages. The method first analyzes the layout of the page, building several concurrent layout structures. Then a tagging step is performed in order to tag textual elements based on their content. Combining the layout structures and the tagged elements, layout models for representing the structured data are inferred for the current page. These models are used to correct or tag some elements missed by the tagging step. The final set of structured data is extracted. An evaluation is presented.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::dff6bde6edf32073e14f9c2e2c4d5ed9 https://doi.org/10.1109/icdar.2015.7333766
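
The steps named in the description (layout analysis, content-based tagging, inferring a layout model, then using the model to tag missed elements) can be illustrated with a toy sketch. This is not the paper's implementation: the `Element` type, the price regex, the fixed-width column bucketing, and the `min_share` threshold are all hypothetical choices made here for illustration, and a single column grouping stands in for the "several concurrent layout structures" the record mentions.

```python
import re
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    text: str
    x: int                      # left coordinate of the OCR'ed text block
    tag: Optional[str] = None   # filled in by tagging or by the layout model


# Hypothetical content tagger: recognises only well-formed prices.
PRICE = re.compile(r"^\d+\.\d{2}$")


def tag_by_content(elements):
    """Tagging step: label elements whose text matches a known pattern."""
    for e in elements:
        if PRICE.match(e.text):
            e.tag = "price"


def columns(elements, width=10):
    """One simple layout structure: bucket elements by left coordinate."""
    cols = defaultdict(list)
    for e in elements:
        cols[e.x // width].append(e)
    return list(cols.values())


def apply_layout_model(elements, min_share=0.6):
    """Infer a per-column tag and propagate it to untagged elements."""
    for col in columns(elements):
        counts = Counter(e.tag for e in col if e.tag)
        if not counts:
            continue  # nothing tagged in this column, so no model to infer
        tag, n = counts.most_common(1)[0]
        if n / len(col) >= min_share:
            for e in col:
                if e.tag is None:
                    e.tag = tag  # element missed by the content tagger


def extract(elements):
    """Run tagging, then the layout model; return the tagged elements."""
    tag_by_content(elements)
    apply_layout_model(elements)
    return [(e.text, e.tag) for e in elements if e.tag]


# "2.8O" contains the letter O (a typical OCR confusion), so the regex misses it.
page = [
    Element("Coffee", 50), Element("3.50", 300),
    Element("Tea", 51),    Element("2.8O", 302),
    Element("Juice", 50),  Element("4.00", 301),
]
print(extract(page))
```

In this example the misrecognized value shares a column with two correctly tagged prices, so the inferred column model assigns it the same tag, while the untagged item-name column is left alone.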