A Rule-Based Method for Table Detection in Website Images

Autor: Jihu Kim, Hyoseok Hwang
Jazyk: angličtina
Rok vydání: 2020
Předmět:
Zdroj: IEEE Access, Vol 8, Pp 81022-81033 (2020)
Druh dokumentu: article
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2020.2990901
Popis: Table detection is an essential part of a document analysis because tables are among the most efficient methods for systematically summarizing information. Therefore, numerous studies on detecting tables not only from documents but also from websites have been conducted. Although, the number of websites has been growing explosively recently, most of these studies suffer from detecting tables which are image types rather than tagging due to the variability of size, contents, color, and shapes. In this paper, we propose an efficient yet robust method for detecting tables in image formats, which can apply to both documents and websites. Instead of employing recently developed deep learning methods, which require extensive training for diversity, we apply a rule-based detection method by using key features of many tables, namely, the grid format of the text provided in the tables. The proposed method consists of two stages: a feature extraction stage and a grid pattern recognition stage. In the first stage, we extract the features of the contents in the tables. We then remove the features of non-text objects and texts not included in tables. In the second stage, we build tree structures from the features and apply a novel algorithm for determining the grid pattern. When we applied our method to a website dataset, the experimental results showed a precision, recall, and F1-measure of 84.5%, 72%, and 0.778, which are improvements of 3.6%, 24.16%, and 0.276 over a previous method, respectively, while also achieving the fastest processing time. In addition, the proposed rule-based method allows the structure of the contents in the table to be easily restored.
Databáze: Directory of Open Access Journals