A Hybrid Model Combining Formulae with Keywords for Mathematical Information Retrieval

Authors: Yuqi Shen, Cheng Chen, Yifan Dai, Jinfang Cai, Liangyu Chen
Publication year: 2021
Subject:
Source: International Journal of Software Engineering and Knowledge Engineering. 31:1583-1602
ISSN: 1793-6403
0218-1940
Description: Formula retrieval is an important research topic in Mathematical Information Retrieval (MIR). Most studies have focused on formula comparison to determine the similarity between mathematical documents. However, two similar formulae may appear in entirely different knowledge domains and have different meanings. Building on the N-ary Tree-based Formula Embedding Model (NTFEM, our previous work in [Y. Dai, L. Chen, and Z. Zhang, An N-ary tree-based model for similarity evaluation on mathematical formulae, in Proc. 2020 IEEE Int. Conf. Systems, Man, and Cybernetics, 2020, pp. 2578–2584]), we introduce a new hybrid retrieval model, NTFEM-K, which combines formulae with their surrounding keywords for more accurate retrieval. Using keyword extraction, we extract keywords from the context of each formula to supplement its semantic information. We then obtain vector representations of the keywords with the FastText N-gram embedding model and vector representations of the formulae with NTFEM. Finally, documents are sorted by keyword similarity, and the ranking is then refined by formula similarity. For performance evaluation, NTFEM-K is compared not only with NTFEM but also with hybrid retrieval models that combine formulae with long text and with hybrid retrieval models that combine formulae with keywords obtained by other keyword extraction algorithms. Experimental results show that the top-10 accuracy of NTFEM-K is at least 20% higher than that of NTFEM and can be 50% on some specific topics.
Database: OpenAIRE
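
The two-stage ranking mentioned in the description (sort by keyword similarity, then refine by formula similarity) can be illustrated with a minimal sketch. The abstract does not say how the refinement is performed, so everything below is an assumption for illustration only: cosine similarity as the measure, re-sorting a top-k shortlist as the refinement step, the function names `cosine` and `rank_documents`, the dictionary keys `kw_vec` and `formula_vec`, and the toy vectors, which stand in for FastText and NTFEM embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_kw, query_formula, docs, top_k=10):
    """Hypothetical two-stage ranking; not the authors' published procedure."""
    # Stage 1: order all documents by keyword-vector similarity to the query.
    by_keywords = sorted(
        docs, key=lambda d: cosine(query_kw, d["kw_vec"]), reverse=True
    )
    shortlist = by_keywords[:top_k]
    # Stage 2 (assumed refinement): re-order the shortlist by formula similarity.
    refined = sorted(
        shortlist, key=lambda d: cosine(query_formula, d["formula_vec"]), reverse=True
    )
    return [d["id"] for d in refined]


# Toy data: hand-written 2-d vectors in place of real embeddings.
docs = [
    {"id": "a", "kw_vec": [1.0, 0.0], "formula_vec": [0.0, 1.0]},
    {"id": "b", "kw_vec": [0.9, 0.1], "formula_vec": [1.0, 0.0]},
    {"id": "c", "kw_vec": [0.0, 1.0], "formula_vec": [1.0, 0.0]},
]
ranking = rank_documents([1.0, 0.0], [1.0, 0.0], docs, top_k=2)
```

In this toy run, document "c" has a formula identical to the query but unrelated keywords, so it is dropped at the keyword stage. That mirrors the motivation in the description: similar formulae from a different knowledge domain should not rank highly.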