Extraction and evaluation of formulaic expressions used in scholarly papers
Authors: | Florian Boudin, Akiko Aizawa, Kenichi Iwatsuki |
---|---|
Contributors: | The University of Tokyo (UTokyo), Traitement Automatique du Langage Naturel (TALN), Laboratoire des Sciences du Numérique de Nantes (LS2N), IMT Atlantique Bretagne-Pays de la Loire (IMT Atlantique), Institut Mines-Télécom [Paris] (IMT)-Institut Mines-Télécom [Paris] (IMT)-Université de Nantes - UFR des Sciences et des Techniques (UN UFR ST), Université de Nantes (UN)-Université de Nantes (UN)-École Centrale de Nantes (ECN)-Centre National de la Recherche Scientifique (CNRS)-IMT Atlantique Bretagne-Pays de la Loire (IMT Atlantique), Université de Nantes (UN)-Université de Nantes (UN)-École Centrale de Nantes (ECN)-Centre National de la Recherche Scientifique (CNRS), National Institute of Informatics (NII), This work was supported by JSPS, Japan KAKENHI Grant Numbers 19J12466 and 18H03297 and by Atlanstic 2020 sabbatical grant IKEBANA, France. |
Language: | English |
Year of publication: | 2022 |
Subject: |
FOS: Computer and information sciences
Dependency (UML); Computer science; Formulaic expressions; computer.software_genre; Lexicon; Artificial Intelligence; Digital Libraries (cs.DL); ComputingMilieux_MISCELLANEOUS; 060201 languages & linguistics; Computer Science - Computation and Language; business.industry; 05 social sciences; General Engineering; 050301 education; Computer Science - Digital Libraries; 06 humanities and the arts; 16. Peace & justice; Computer Science Applications; [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing; Variation (linguistics); 0602 languages and literature; Extraction methods; Artificial intelligence; business; Computation and Language (cs.CL); 0503 education; computer; Natural language processing; Sentence |
Source: | Expert Systems with Applications, Elsevier, 2022, 187, pp.115840. ⟨10.1016/j.eswa.2021.115840⟩ |
ISSN: | 0957-4174 |
DOI: | 10.1016/j.eswa.2021.115840 |
Description: | Formulaic expressions, such as 'in this paper we propose', are helpful for authors of scholarly papers because they convey communicative functions; in the example above, the function is 'showing the aim of this paper'. Thus, resources of formulaic expressions, such as a dictionary, that can be looked up easily would be useful. However, the forms of formulaic expressions often vary to a great extent. For example, 'in this paper we propose', 'in this study we propose' and 'in this paper we propose a new method to' are all regarded as formulaic expressions. Such diversity of spans and forms causes problems in both the extraction and the evaluation of formulaic expressions. In this paper, we propose a new approach that is robust to variation in the spans and forms of formulaic expressions. Our approach regards a sentence as consisting of a formulaic part and a non-formulaic part. Then, instead of trying to extract formulaic expressions from a whole corpus, we extract them from each sentence, so that different forms can be dealt with at once. Based on this formulation, to avoid the diversity problem, we propose evaluating extraction methods by how well they convey specific communicative functions rather than by comparing extracted expressions to an existing lexicon. We also propose a new extraction method that utilises named entities and dependency structures to remove the non-formulaic part from a sentence. Experimental results show that the proposed extraction method achieved the best performance among the methods compared. 21 pages, 11 figures |
Database: | OpenAIRE |
External link: |
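
The description above treats each sentence as a formulaic part plus a non-formulaic part and extracts the formulaic part sentence by sentence. The following is only a minimal illustrative sketch of that idea, not the authors' implementation: the function name `formulaic_part`, the `*` slot marker, and the hand-written token spans are all assumptions made for this example. In the paper the non-formulaic part is identified using named entities and dependency structures; here the spans are simply supplied by hand.

```python
from typing import List, Tuple


def formulaic_part(
    tokens: List[str],
    non_formulaic: List[Tuple[int, int]],
    slot: str = "*",
) -> str:
    """Replace each non-formulaic token span [start, end) with one slot marker.

    Assumes the spans do not overlap. How the spans are found (for example
    from named entities or a dependency parse) is outside this sketch.
    """
    # Map every token index inside a span to the index where its span starts.
    span_start = {}
    for start, end in non_formulaic:
        for i in range(start, end):
            span_start[i] = start

    kept = []
    for i, tok in enumerate(tokens):
        if i in span_start:
            # Emit a single slot at the first token of the span, skip the rest.
            if span_start[i] == i:
                kept.append(slot)
            continue
        kept.append(tok)
    return " ".join(kept)


# Hypothetical example sentences; the spans below are hand-annotated.
s1 = "in this paper we propose a new method to extract formulaic expressions".split()
s2 = "in this study we propose a parser".split()

r1 = formulaic_part(s1, [(9, 12)])  # "extract formulaic expressions" is content
r2 = formulaic_part(s2, [(5, 7)])   # "a parser" is content
print(r1)
print(r2)
```

Because each sentence is processed on its own, variants such as 'in this paper we propose' and 'in this study we propose' are each reduced to their own formulaic remainder without requiring a single canonical form to be fixed across the corpus.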