Assessment of Latent Semantic Analysis (LSA) text mining algorithms for large scale mapping of patent and scientific publication documents : MSI Working Paper

Authors: Magerman, Tom; Van Looy, Bart; Baesens, Bart; Debackere, Koenraad
Language: English
Year of publication: 2011
Description: In this study we conduct a thorough assessment of the LSA text mining method and its options (preprocessing, weighting, …) for capturing similarities between patent documents and scientific publications, with the aim of developing a new method to detect direct science-technology linkages. Such a method is instrumental for research on innovation management topics, e.g. anticommons issues. We want to assess effectiveness (in terms of precision and recall) and derive best practices on weighting and dimensionality reduction for application to patent data. We use LSA to derive similarity from a large set of patent and scientific publication documents (88,248 patent documents and 948,432 scientific publications) based on 40 similarity measurement variants (four weighting schemas combined with ten levels of dimensionality reduction and the cosine metric). A thorough validation is set up to compare the performance of these measure variants (expert validation of 300 combinations plus a control set of 30,000 patents). In our data we find no evidence for the claim that LSA is superior to plain cosine measures or to simple measures based on common terms or co-occurrence; measures with dimensionality reduction only seem to approach the cosine measure applied on the full vector space. We propose the combination of two measures based on the number of common terms (one weighted by the minimum and one weighted by the maximum of the number of terms of both documents) as a more robust method to detect similarity between patents and publications.
Is part of: FBE Research Report MSI_1114
Number of pages: 77
Status: published
Database: OpenAIRE
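
Illustration: the two common-term measures proposed in the description can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes "weighted by" means dividing the number of common terms by the minimum (or maximum) number of terms, that terms are counted as distinct lowercase whitespace-separated tokens, and that both values are simply returned side by side, since the record does not say how the two measures are combined. The names term_set and common_term_measures are hypothetical.

```python
def term_set(text):
    """Return the distinct terms of a document.

    Assumption: lowercase whitespace tokenization stands in for the
    preprocessing (stop words, stemming, ...) a real pipeline would use.
    """
    return set(text.lower().split())


def common_term_measures(doc_a, doc_b):
    """Return (common / min terms, common / max terms) for two documents.

    Assumption: "weighted by the minimum/maximum number of terms" is read
    as division by that count; the source record does not give a formula.
    """
    terms_a = term_set(doc_a)
    terms_b = term_set(doc_b)
    if not terms_a or not terms_b:
        # No terms in one document: treat the pair as dissimilar.
        return 0.0, 0.0
    common = len(terms_a & terms_b)
    by_min = common / min(len(terms_a), len(terms_b))
    by_max = common / max(len(terms_a), len(terms_b))
    return by_min, by_max


if __name__ == "__main__":
    patent = "laser diode with quantum well"
    publication = "quantum well laser emission spectra measured at low temperature"
    print(common_term_measures(patent, publication))
```

Under this reading, the minimum-weighted value is high when the shorter document is largely contained in the longer one, while the maximum-weighted value is high only when both documents are of similar size and share most of their terms, which may be why the two are proposed together.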