Accelerating Economic Innovation and Impact Discovery Through HPC

Author:	Cody Kankel, Grace Enright, Scott S. Hampton, Conor Flynn
Year of publication:	2017
Subject:	Disk formatting; Information retrieval; Documentation; Computer science; Data integrity; Edit distance; Data mining; Levenshtein distance; Data structure; Fuzzy logic; Implementation
Source:	PEARC
DOI:	10.1145/3093338.3104183
Description:	This project uses historical census and patent data to determine the factors and conditions that influence innovative productivity, with the intention of applying the findings to foster future creativity and accelerate economic growth. It does so by matching patent data with inventor census information. The data set is large, spanning decades of patent documentation and several censuses, and its integrity is diminished by inconsistent formatting and incomplete fields; together these create significant computational complexity. The original implementation of the merge of the two data sets applied dynamic regular expressions to identify "fuzzy matches," or pairings with an acceptable amount of error. Using 1,152 cores to analyze a single census, it completed, on average, in one week. To improve run-time, we sought alternative implementations. Our first alternative used Levenshtein distance (edit distance) to discover the fuzzy matches. Though this improved the quality of the pairings substantially, run-time improved only modestly. To build on this improvement, we are currently exploring an implementation that combines Levenshtein distance with a special data structure that drastically reduces the number of string comparisons. Initial results show that, on 1,152 cores, the new algorithm can complete a census in approximately 24 hours. This implementation offers higher-quality pairings with a sevenfold improvement in execution time over the original. The motivation for this work stems from a collaboration with Dr. Kirk Doran, Economics professor at the University of Notre Dame.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::29c342160c5ca789449aa78ed914b395 https://doi.org/10.1145/3093338.3104183
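
The description mentions combining Levenshtein distance with a data structure that reduces the number of string comparisons, but the record does not say which structure the authors used. The following is a minimal sketch of one way such a combination might look, assuming a BK-tree (a metric tree keyed by edit distance) as the index. The BK-tree choice, the class and function names, and the sample names are all illustrative assumptions, not the authors' implementation or data.

```python
from typing import Dict, List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


class BKTree:
    """Assumed index: each child is keyed by its distance to its parent."""

    def __init__(self) -> None:
        # A node is a (word, children) pair; children maps distance -> node.
        self.root: Optional[Tuple[str, Dict[int, tuple]]] = None
        self.comparisons = 0  # distance computations made by search()

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query: str, max_dist: int) -> List[Tuple[int, str]]:
        results: List[Tuple[int, str]] = []
        if self.root is None:
            return results
        stack = [self.root]
        while stack:
            word, children = stack.pop()
            d = levenshtein(query, word)
            self.comparisons += 1
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: a match can only sit under a child whose
            # key lies in [d - max_dist, d + max_dist]; skip the rest.
            for key, child in children.items():
                if d - max_dist <= key <= d + max_dist:
                    stack.append(child)
        return sorted(results)


# Made-up sample names standing in for census records.
census_names = ["John Smith", "Jon Smith", "Jane Smyth",
                "Robert Miller", "Roberta Millar", "Alice Johnson"]
tree = BKTree()
for name in census_names:
    tree.add(name)

# Fuzzy-match a (made-up) patent inventor name with at most 2 edits.
matches = tree.search("John Smyth", max_dist=2)
print(matches)
print(f"{tree.comparisons} of {len(census_names)} names compared")
```

In this sketch the pruning step is what avoids comparing the query against every stored name; how much it saves depends on the data and on the error threshold, so the figures reported in the description should not be read as properties of this example.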