Software understanding: Automatic classification of software identifiers

Author:	Mathieu Lafourcade, Marianne Huchard, Pierre Pompidor, Anne Laurent, Pattaraporn Warintarawej
Contributors:	ADVanced Analytics for data SciencE (ADVANSE), Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier (LIRMM), Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM), Fuzziness, Alignments, Data & Ontologies (FADO), Models And Reuse Engineering, Languages (MAREL), Exploration et exploitation de données textuelles (TEXTE), WEB-CUBE
Language:	English
Publication year:	2015
Subject:	Computer science; 02 engineering and technology; computer.software_genre; Theoretical Computer Science; Software analytics; Artificial Intelligence; 0202 electrical engineering, electronic engineering, information engineering; Data Mining; Software system; Software verification and validation; Automatic Software Understanding; 060201 languages & linguistics; Software visualization; Information retrieval; business.industry; Software development; Software Engineering; 06 humanities and the arts; Identifier; Software framework; [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR]; 0602 languages and literature; Software construction; Text classification; 020201 artificial intelligence & image processing; Computer Vision and Pattern Recognition; business; computer
Source:	Intelligent Data Analysis, IOS Press, 2015, 19 (4), pp. 761-778. ⟨10.3233/IDA-150744⟩
ISSN:	1088-467X
DOI:	10.3233/IDA-150744
Description:	Identifier names (e.g., of packages, classes, methods, and variables) are one of the most important sources for software comprehension. They need to be analyzed in order to support collaborative software engineering and source code reuse, because they convey the domain concepts of the software. For instance, "getMinimumSupport" would be associated with the association rule concept in data mining software, whereas other identifiers are difficult to recognize, such as those that mix parts of words (e.g., "initFeatSet"). We therefore propose methods for assisting automatic software understanding by classifying identifier names into domain concept categories. Our approach, based on data mining algorithms, aims to learn the character patterns of identifier names. The main challenges are (1) to automatically split identifier names into relevant constituent subnames and (2) to build a model associating such a set of subnames with predefined domain concepts. For this purpose, we propose a novel way of splitting identifiers into their constituent words and use N-gram-based text classification to predict the related domain concept. In this article, we report the theoretical method and the algorithms we propose, together with experiments run on real software source code that show the value of our approach.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::3e23277f8c78ba423afb24e202129e1e https://hal-lirmm.ccsd.cnrs.fr/lirmm-00834051
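
Illustration:	The two steps named in the abstract (splitting an identifier into subnames, then classifying it with character N-grams) can be sketched roughly as follows. This is not the authors' method: the splitter is the common camelCase/underscore heuristic rather than the novel splitting the article proposes, the classifier is a simple N-gram overlap score, and the training identifiers, concept labels, and all function names are hypothetical.

```python
import re
from collections import Counter


def split_identifier(name):
    """Split on underscores and camelCase boundaries (a stand-in heuristic)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def char_ngrams(word, n=3):
    """Character n-grams of a word, padded with '_' to mark its boundaries."""
    padded = f"_{word}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def features(identifier, n=3):
    """Bag of character n-grams over all subnames of an identifier."""
    grams = Counter()
    for word in split_identifier(identifier):
        grams.update(char_ngrams(word, n))
    return grams


def train(labelled):
    """Build one n-gram frequency profile per domain concept."""
    profiles = {}
    for identifier, concept in labelled:
        profiles.setdefault(concept, Counter()).update(features(identifier))
    return profiles


def classify(identifier, profiles):
    """Return the concept whose profile overlaps most with the identifier."""
    grams = features(identifier)

    def score(concept):
        profile = profiles[concept]
        total = sum(profile.values())
        # Normalise by profile size so larger profiles are not favoured.
        return sum(count * profile[g] for g, count in grams.items()) / total

    return max(profiles, key=score)


# Hypothetical training data, invented for illustration only.
TRAINING = [
    ("getMinimumSupport", "association_rules"),
    ("minConfidence", "association_rules"),
    ("frequentItemSet", "association_rules"),
    ("numClusters", "clustering"),
    ("clusterCentroid", "clustering"),
    ("assignToCluster", "clustering"),
]

if __name__ == "__main__":
    model = train(TRAINING)
    print(split_identifier("initFeatSet"))
    print(classify("minSupportThreshold", model))
```

Working on character N-grams rather than whole words lets truncated subnames still share evidence with their full forms, which is presumably why the abstract emphasises character patterns; a real system would need far more training data and a proper statistical classifier.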