Document flow segmentation for business applications
Author: | Hani Daher, Abdel Belaïd |
---|---|
Contributors: | Hani Daher; Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria); Recognition of writing and analysis of documents (READ); Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD) |
Year of publication: | 2013 |
Subject: |
Computer science
Feature vector; Scale-space segmentation; 02 engineering and technology; computer.software_genre; 01 natural sciences; 010309 optics; Document Flow segmentation; Fragment (logic); 0103 physical sciences; 0202 electrical engineering, electronic engineering, information engineering; Segmentation; Representation (mathematics); Textual descriptors; Information retrieval; Point (typography); Business flow; [INFO.INFO-NA] Computer Science [cs]/Numerical Analysis [cs.NA]; Continuity and rupture classification; Flow (mathematics); Binary classification; 020201 artificial intelligence & image processing; SPIE; Data mining; computer |
Source: | DRR, IS&T/SPIE Electronic Imaging, Document Recognition and Retrieval XXI, Feb 2014, San Francisco, United States. pp.9201-15 |
ISSN: | 0277-786X |
DOI: | 10.1117/12.2043141 |
Description: | This paper proposes a supervised approach to document flow segmentation, applied to real-world heterogeneous documents. The algorithm treats the flow as a sequence of pairs of consecutive pages and studies the relationship between the two pages of each pair. First, sets of features are extracted from the pages, and each pair of pages is modeled as a single feature vector. This vector is passed to a binary classifier, which labels the relationship as either segmentation or continuity. In the case of segmentation, the pages seen so far are considered a complete document and the analysis of the flow continues by starting a new document. In the case of continuity, the two pages are assigned to the same document and the analysis continues along the flow. If it is uncertain whether the relationship should be classified as continuity or segmentation, a rejection is decided and the pages analyzed up to that point are considered a "fragment". This first classification already gives good results, approaching 90% on certain documents, which is high at this level of the system. |
Database: | OpenAIRE |
External link: |
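
The segmentation loop outlined in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the record does not specify the features, the classifier, or the rejection rule, so `pair_features`, the `score` callable, and the `low`/`high` thresholds are all hypothetical. Pages are assumed to be already reduced to numeric feature lists, and `score` is assumed to return an estimated probability that a pair of pages is a segmentation point.

```python
from typing import Callable, List, Sequence, Tuple

Page = Sequence[float]  # assumption: each page is already a numeric feature list


def pair_features(prev: Page, curr: Page) -> List[float]:
    """Hypothetical pair encoding: both page vectors plus their absolute differences."""
    return list(prev) + list(curr) + [abs(a - b) for a, b in zip(prev, curr)]


def segment_flow(
    pages: Sequence[Page],
    score: Callable[[List[float]], float],
    low: float = 0.3,   # assumed threshold: at or below this, continuity
    high: float = 0.7,  # assumed threshold: at or above this, segmentation
) -> List[Tuple[str, List[int]]]:
    """Split a page flow into ("document" | "fragment", page indices) groups."""
    groups: List[Tuple[str, List[int]]] = []
    current: List[int] = [0] if pages else []
    for i in range(1, len(pages)):
        p = score(pair_features(pages[i - 1], pages[i]))
        if p >= high:
            # Segmentation: the pages gathered so far form a complete document.
            groups.append(("document", current))
            current = [i]
        elif p <= low:
            # Continuity: the page belongs to the document being built.
            current.append(i)
        else:
            # Rejection: the pages analyzed so far are kept as a fragment.
            groups.append(("fragment", current))
            current = [i]
    if current:
        # Assumption: the end of the flow closes the last group as a document.
        groups.append(("document", current))
    return groups
```

In practice `score` would wrap a trained binary classifier; any callable that maps a pair feature vector to a value in [0, 1] fits this sketch.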