Recognition of the Logical Structure of Arabic Newspaper Pages

Authors:	János Csirik, Hassina Bouressace
Publication year:	2018
Subject:	Connected component; Structure (mathematical logic); Computer science; Document processing; Set (abstract data type); Identification (information); Simple (abstract algebra); Hierarchical organization; Artificial intelligence; Natural language processing; XML
Source:	Text, Speech, and Dialogue ISBN: 9783030007935 TSD
DOI:	10.1007/978-3-030-00794-2_27
Description:	In document analysis and recognition, we seek to apply methods of automatic document identification. The main goal is to go from a simple image to a structured set of information that a machine can exploit. Here, we present a system for recognizing the logical structure (hierarchical organization) of Arabic newspaper pages. These pages are characterized by a rich and variable structure: they may contain several articles composed of titles, figures, author names and figure captions. Recognition of the logical structure of a newspaper page is preceded by extraction of its physical structure. In our system, this extraction is performed using a combined method based essentially on the RLSA (Run Length Smearing/Smoothing Algorithm) [1], projection profile analysis, and connected component labeling. Logical structure extraction is then performed using rules on the sizes and positions of the physical elements extracted earlier, together with a priori knowledge of certain properties of logical entities (titles, figures, authors, captions, etc.). Lastly, the hierarchical organization of the document is represented as an automatically generated XML file. To evaluate the performance of our system, we tested it on a set of images, and the results are encouraging.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::f3c4257883031ba47e9b1d32f5282219 https://doi.org/10.1007/978-3-030-00794-2_27