Document flow segmentation for business applications

Author:	Hani Daher, Abdel Belaïd
Contributors:	Hani Daher; Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA); Centre National de la Recherche Scientifique (CNRS); Université de Lorraine (UL); Institut National de Recherche en Informatique et en Automatique (Inria); Recognition of writing and analysis of documents (READ); Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD)
Year of publication:	2013
Subject:	Computer science; Feature vector; Scale-space segmentation; 02 engineering and technology; computer.software_genre; 01 natural sciences; 010309 optics; Document Flow segmentation; Fragment (logic); 0103 physical sciences; 0202 electrical engineering, electronic engineering, information engineering; Segmentation; Representation (mathematics); Textual descriptors; Information retrieval; Point (typography); Business flow; [INFO.INFO-NA] Computer Science [cs]/Numerical Analysis [cs.NA]; Continuity and rupture classification; Flow (mathematics); Binary classification; 020201 artificial intelligence & image processing; SPIE; Data mining; computer
Source:	IS&T/SPIE Electronic Imaging, Document Recognition and Retrieval XXI (DRR), Feb 2014, San Francisco, United States. pp.9201-15
ISSN:	0277-786X
DOI:	10.1117/12.2043141
Description:	International audience; This paper proposes a supervised approach to document flow segmentation, applied to real-world heterogeneous documents. The algorithm treats the flow as pairs of consecutive pages and studies the relationship between the two pages of each pair. First, sets of features are extracted from the pages, and each pair of pages is modeled as a single feature vector. This vector is passed to a binary classifier, which labels the relationship as either segmentation or continuity. In the case of segmentation, the document is considered complete and the analysis of the flow continues by starting a new document. In the case of continuity, the two pages are assigned to the same document and the analysis continues along the flow. If it is uncertain whether the relationship should be classified as continuity or segmentation, the pair is rejected and the pages analyzed up to that point are considered a "fragment". This first classification already gives good results, approaching 90% on certain documents, which is high at this stage of the system.
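The pairwise procedure in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `pair_features`, `segment_flow`, and `toy_score`, the concatenation-plus-difference pair representation, the `low`/`high` rejection thresholds, and the choice to close the last group as a document are all assumptions made for illustration; the record does not specify the features, the classifier, or the rejection rule.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Segment:
    pages: List[int]  # indices of the pages in the flow
    kind: str         # "document" or "fragment"


def pair_features(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Model a pair of pages as one vector (assumed layout: a, b, |a - b|)."""
    return list(a) + list(b) + [abs(x - y) for x, y in zip(a, b)]


def segment_flow(
    pages: Sequence[Sequence[float]],
    rupture_score: Callable[[List[float]], float],
    low: float = 0.3,
    high: float = 0.7,
) -> List[Segment]:
    """Walk the flow pair by pair and cut it into documents and fragments.

    rupture_score stands in for the binary classifier: it returns a value
    in [0, 1], where high values mean segmentation and low values mean
    continuity. Scores strictly between low and high are rejected.
    """
    if not pages:
        return []
    segments: List[Segment] = []
    current = [0]
    for i in range(1, len(pages)):
        score = rupture_score(pair_features(pages[i - 1], pages[i]))
        if score >= high:
            # Segmentation: the current document is complete.
            segments.append(Segment(current, "document"))
            current = [i]
        elif score <= low:
            # Continuity: the page joins the current document.
            current.append(i)
        else:
            # Rejection: the pages seen so far become a fragment.
            segments.append(Segment(current, "fragment"))
            current = [i]
    # Assumption: the end of the flow closes the last group as a document.
    segments.append(Segment(current, "document"))
    return segments


def toy_score(features: List[float]) -> float:
    """Hypothetical scorer: mean absolute difference, capped at 1."""
    n = len(features) // 3
    return min(1.0, sum(features[2 * n:]) / n)


# Five pages, each described by a single made-up feature value.
flow = [[0.0], [0.05], [1.0], [1.5], [1.55]]
for seg in segment_flow(flow, toy_score):
    print(seg.kind, seg.pages)
```

In a real system, `toy_score` would be replaced by a trained classifier that outputs a confidence value, and the two thresholds would be tuned on validation data to balance rejections against segmentation errors.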
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::ebbdd3218cdf42bed131429a022ffcd4 https://doi.org/10.1117/12.2043141