A Framework for Evaluating the Readability of Test Code in the Context of Code Maintainability: A Family of Empirical Studies

Author: Urbanke, Pirmin
Language: English
Year of publication: 2022
Subject:
DOI: 10.34726/hss.2022.103606
Description:
Context and Motivation: Software testing is a common practice in software development and serves many functions. It provides certain guarantees that the software works as expected across the life cycle of the system, it helps with finding and fixing erroneous behaviour, it acts as documentation, it provides usage examples, and more. Still, test code is often treated as an orphan, which leads to poor-quality tests, also with respect to readability. However, if a test has poor readability, subsequent activities such as maintaining it or drawing correct conclusions from it may be compromised. But what is readable test code? Since test code has a different purpose than production code and contains exclusive features such as assertion methods, the factors influencing its readability may deviate from those of production code.
Objective: We propose a framework that can be used to evaluate the readability of test code. It also provides information on factors influencing readability and gives best-practice examples for improvements. Aside from this main goal, we give an overview of the academic literature in the field of test code readability and compare it to the opinions of practitioners. We investigate the impact of modifications related to widely discussed readability factors on the readability of test cases. Furthermore, we gather readability rating criteria from free-text answers, investigate the impact of developer experience on readability ratings, and evaluate the accuracy of a readability rating tool that is often used in other studies.
Methods: We collect extensive information on test code readability by combining a systematic mapping of academic literature with a systematic mapping of grey literature. We conduct a human-based experiment on test code readability with 77 mostly junior-level participants in an academic context to investigate various factors influencing readability. We categorise and group the free-text answers of the experiment participants and compare the human readability ratings with tool-generated readability ratings. Finally, after constructing the readability assessment framework, which is based on the previous results, we perform an evaluation and compare it to the results of the initial human-based experiment.
Results: The literature studies yield 16 relevant sources from the scientific community and 56 sources from practitioners. Both literature mappings show an ongoing interest in test code readability. Scientific sources focus on investigating automatically generated test code, which is often compared to manually written tests (88%). For capturing human readability, they primarily use surveys (44%), which contain Likert scales in almost all cases. The grey literature (56 sources) mostly consists of blog posts from practitioners sharing their opinions and experience on problems found in their daily work. There is a clear overlap in the readability factors discussed by both communities, but some factors are exclusive to each community. In the human-based experiment, we found a statistically significant influence on the readability of test cases for five of the ten investigated modifications, which map to readability factors. We see little influence of experience on readability ratings, although previous research found experience to influence understanding and maintenance tasks. Judging from the categorisation of around 2,500 free-text answers, the participants rate readability based on Test naming, Structure, and Dependencies (i.e., does the test ensure only one behaviour?). The ratings of the readability rating tool fall between the 0.25 and 0.75 quantiles of our human ratings in around 51% of the investigated test cases. We also found that invisible differences in formatting (i.e., spaces and tabulators) affect the tool's ratings by up to 0.25 on a scale from 0 to 1. The framework evaluation shows decreased variation in the ratings across participants and increased rating speed compared to the gut-feeling ratings from the initial experiments. Overall, the framework rates tests too optimistically. Nevertheless, the validity is very limited due to the small number of survey participants (5). Therefore, this evaluation is merely a concept, which we pursue in future work.
Conclusion: The literature mappings reveal different views on test case readability between practitioners and academia, which stem from the different contexts of the two communities. The ratings from the readability tool are not accurate enough to be trusted blindly; they still need to be complemented with human expertise. Our readability evaluation framework enables a more efficient assessment of readability. A large-scale evaluation is planned for future work.
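To make the reported rating criteria (descriptive test naming, a clear structure, and testing only one behaviour per test) more concrete, the following minimal JUnit 5 sketch illustrates them; the ShoppingCart class and the test itself are hypothetical examples introduced here for illustration and do not come from the thesis or its experiments.

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.util.ArrayList;
    import java.util.List;

    import org.junit.jupiter.api.Test;

    class ShoppingCartTest {

        // Minimal, hypothetical class under test; included only so the sketch is self-contained.
        static class ShoppingCart {
            private final List<String> items = new ArrayList<>();

            void addItem(String name) {
                items.add(name);
            }

            int itemCount() {
                return items.size();
            }
        }

        @Test
        void addingAnItemIncreasesTheItemCount() {
            // Arrange: create an empty cart with no shared state or hidden dependencies.
            ShoppingCart cart = new ShoppingCart();

            // Act: exercise exactly one behaviour, the one named in the test method.
            cart.addItem("book");

            // Assert: a single, focused assertion documents the expected outcome.
            assertEquals(1, cart.itemCount());
        }
    }

The comments mark an arrange/act/assert structure; a test written this way documents exactly one expected behaviour under a name that matches it, which is the kind of readability the participants' free-text answers describe.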
Database: OpenAIRE