Validation of an improved vision-based web page parsing pipeline

Authors: Michael Cormier, Robin Cohen, Richard Mann, Karyn Moffatt, Daniel Vogel, Mengfei Liu, Shangshang Zheng
Publication year: 2023
Subject:
Source: ACM Transactions on the Web.
ISSN: 1559-114X, 1559-1131
DOI: 10.1145/3580519
Description: In this paper, we present a novel approach to the quantitative evaluation of a model for parsing web pages as visual images. The model is intended to benefit users with assistive needs (for example, users with cognitive or visual deficits) by enabling decluttering or zooming and by supporting more effective screen reader output. This segmentation-classification pipeline is tested in stages. We first discuss the validation of the segmentation algorithm, showing that our approach produces automated segmentations that are very similar to those produced by real users who designate edges and regions with a drawing interface. We also examine the properties of these ground truth segmentations produced under different conditions. We then describe our Hidden-Markov tree approach for classification and present results that provide important validation for this model. The analysis is set against effective choices for dataset and pruning options, measured with respect to manual ground truth labelling of regions. In all, we offer a detailed quantitative validation (focused on complex news pages) of a fully pipelined approach for interpreting web pages as visual images, an approach which enables important advances for users with assistive needs.
Database: OpenAIRE