HRDoc: Dataset and Baseline Method Toward Hierarchical Reconstruction of Document Structures

Author: Ma, Jiefeng; Du, Jun; Hu, Pengfei; Zhang, Zhenrong; Zhang, Jianshu; Zhu, Huihui; Liu, Cong
Publication year: 2023
Subject:
Document type: Working Paper
Description: The problem of document structure reconstruction refers to converting digital or scanned documents into corresponding semantic structures. Most existing works focus on delineating the boundary of each element in a single document page, neglecting the reconstruction of semantic structure in multi-page documents. This paper introduces hierarchical reconstruction of document structures as a novel task suitable for the NLP and CV fields. To better evaluate system performance on the new task, we built a large-scale dataset named HRDoc, which consists of 2,500 multi-page documents with nearly 2 million semantic units. Every document in HRDoc has line-level annotations, including categories and relations, obtained from rule-based extractors and human annotators. Moreover, we proposed an encoder-decoder-based hierarchical document structure parsing system (DSPS) to tackle this problem. By adopting a multi-modal bidirectional encoder and a structure-aware GRU decoder with a soft-mask operation, the DSPS model surpasses the baseline method by a large margin. All scripts and datasets will be made publicly available at https://github.com/jfma-USTC/HRDoc.
Comment: 8 pages, 6 figures. Accepted by AAAI-2023
Database: arXiv