Extraction and evaluation of formulaic expressions used in scholarly papers

Authors: Florian Boudin, Akiko Aizawa, Kenichi Iwatsuki
Contributors: The University of Tokyo (UTokyo), Traitement Automatique du Langage Naturel (TALN), Laboratoire des Sciences du Numérique de Nantes (LS2N), IMT Atlantique Bretagne-Pays de la Loire (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-Université de Nantes - UFR des Sciences et des Techniques (UN UFR ST), Université de Nantes (UN)-Université de Nantes (UN)-École Centrale de Nantes (ECN)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique Bretagne-Pays de la Loire (IMT Atlantique), Université de Nantes (UN)-Université de Nantes (UN)-École Centrale de Nantes (ECN)-Centre National de la Recherche Scientifique (CNRS), National Institute of Informatics (NII). This work was supported by JSPS, Japan, KAKENHI Grant Numbers 19J12466 and 18H03297, and by the Atlanstic 2020 sabbatical grant IKEBANA, France.
Language: English
Year of publication: 2022
Subject:
Source: Expert Systems with Applications
Expert Systems with Applications, Elsevier, 2022, 187, pp.115840. ⟨10.1016/j.eswa.2021.115840⟩
ISSN: 0957-4174
DOI: 10.1016/j.eswa.2021.115840
Description: Formulaic expressions, such as 'in this paper we propose', are helpful for authors of scholarly papers because they convey communicative functions; in the example above, the function is 'showing the aim of this paper'. Thus, resources of formulaic expressions, such as a dictionary, that can be looked up easily would be useful. However, the forms of formulaic expressions often vary to a great extent. For example, 'in this paper we propose', 'in this study we propose' and 'in this paper we propose a new method to' are all regarded as formulaic expressions. Such diversity of spans and forms causes problems in both the extraction and the evaluation of formulaic expressions. In this paper, we propose a new approach that is robust to variation in the spans and forms of formulaic expressions. Our approach regards a sentence as consisting of a formulaic part and a non-formulaic part. Then, by extracting formulaic expressions from each sentence instead of from a whole corpus, different forms can be dealt with at once. Based on this formulation, to avoid the diversity problem, we propose evaluating extraction methods by how well the extracted expressions convey specific communicative functions rather than by comparing them to an existing lexicon. We also propose a new extraction method that utilises named entities and dependency structures to remove the non-formulaic part from a sentence. Experimental results show that the proposed extraction method achieved the best performance among the existing methods compared.
21 pages, 11 figures
Database: OpenAIRE