Web Page Segmentation for Non Visual Skimming

Authors: Judith Jeyafreeda Andrew, Stéphane Ferrari, Fabrice Maurel, Gaël Dias, Emmanuel Giguet
Contributors: Equipe Hultech - Laboratoire GREYC - UMR6072, Groupe de Recherche en Informatique, Image et Instrumentation de Caen (GREYC), Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Ingénieurs de Caen (ENSICAEN), Normandie Université (NU)-Normandie Université (NU)-Université de Caen Normandie (UNICAEN), Normandie Université (NU)-Centre National de la Recherche Scientifique (CNRS)-École Nationale Supérieure d'Ingénieurs de Caen (ENSICAEN), Normandie Université (NU), Emmanuel, Giguet
Language: English
Year of publication: 2019
Subject:
Source: The 33rd Pacific Asia Conference on Language, Information and Computation (PACLIC 33), Sep 2019, Hakodate, Japan
HAL
ISSN: 2619-7782
Description: International audience; Web page segmentation aims to break a page into smaller blocks in which contents with coherent semantics are kept together. Examples of tasks targeted by such a technique are advertisement detection and main content extraction. In this paper, we study different segmentation strategies for the task of non-visual skimming. For that purpose, we consider web page segmentation as a clustering problem over visual elements, where (1) all visual elements must be clustered, (2) a fixed number of clusters must be discovered, and (3) the elements of a cluster should be visually connected. We therefore study three algorithms that comply with these constraints: K-means, F-K-means, and Guided Expansion. The evaluation shows that Guided Expansion yields statistically relevant results in terms of compactness and separateness, and satisfies more logical constraints than the other strategies.
Database: OpenAIRE
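
Illustration: the description frames segmentation as clustering visual elements into a fixed number of groups. The sketch below is not the authors' implementation; it is a minimal plain K-means baseline under assumptions of our own: each visual element is a hypothetical bounding box `(x, y, width, height)`, elements are compared by the Euclidean distance between box centres, and seeding is a simple deterministic choice. It covers constraints (1) and (2) only; it does not enforce the visual-connectivity constraint (3), and it does not attempt F-K-means or Guided Expansion, whose details the record does not give.

```python
from math import dist


def kmeans(points, k, iters=50):
    """Plain Lloyd's K-means over 2-D points; returns one cluster label per point."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumed seeding: k points spread evenly through the input order.
    step = len(points) / k
    centroids = [points[int(i * step)] for i in range(k)]
    labels = None
    for _ in range(iters):
        # Assignment step: every point receives the label of its nearest
        # centroid, so no element is left unclustered.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels


def segment(boxes, k):
    """Cluster hypothetical (x, y, width, height) boxes by their centre points."""
    centres = [(x + w / 2, y + h / 2) for x, y, w, h in boxes]
    return kmeans(centres, k)
```

Calling `segment(boxes, 3)` on a list of boxes returns a list of labels in the range 0 to 2, one per box. Because plain K-means only looks at distances, two elements in the same cluster may still be visually separated by an element from another cluster, which is the gap that constraint (3) in the description addresses.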