Automatically Inferring the Document Class of a Scientific Article

Authors: Gauquier, Antoine; Senellart, Pierre
Contributors: Value from Data (VALDA), Département d'informatique - ENS Paris (DI-ENS), École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria de Paris, Institut National de Recherche en Informatique et en Automatique (Inria), Ecole nationale supérieure Mines-Télécom Lille Douai (IMT Nord Europe), Institut Mines-Télécom [Paris] (IMT), Télécom Paris, Institut Universitaire de France (IUF), Ministère de l'Education nationale, de l'Enseignement supérieur et de la Recherche (M.E.N.E.S.R.), ANR-19-P3IA-0001, PRAIRIE, PaRis Artificial Intelligence Research InstitutE (2019)
Language: English
Year of publication: 2023
Subject:
Source: DocEng 2023 - 23rd ACM Symposium on Document Engineering, Aug 2023, Limerick, Ireland. ⟨10.1145/3573128.3604894⟩
Description: International audience; We consider the problem of automatically inferring the (LaTeX) document class used to write a scientific article from its PDF representation. Applications include improving the performance of information extraction techniques that rely on the style used in each document class, and determining the publisher of a given scientific article. We introduce two approaches: a simple classifier based on hand-coded document style features, and a CNN-based classifier that takes as input the bitmap representation of the first page of the PDF article. We experiment on a dataset of around 100k articles from arXiv, where labels come from the LaTeX source associated with each article. Results show that the CNN approach significantly outperforms the one based on simple document style features, reaching over 90% average F1-score on the task of distinguishing among several dozen of the most common document classes.
Database: OpenAIRE