Towards Developing a Repository of Logical Errors Observed in Parallel Code for Teaching Code Correctness

Autor: Trung Nguyen Ba, Ritu Arora
Rok vydání: 2018
Předmět:
Zdroj: EduHPC@SC
DOI: 10.1109/eduhpc.2018.00011
Popis: Debugging parallel programs can be a challenging task, especially for the beginners. While the debuggers like DDT and TotalView can be extremely useful in tracking down the program statements that are connected to the bugs, often the onus is on the programmers to reason about the logic of the program statements in order to fix the bugs in them. These debuggers may neither be able to precisely indicate the logical errors in the parallel programs nor they may provide information on fixing those errors. Therefore, there is a need for developing tools and educational content on teaching the pitfalls in parallel programming and writing correct code. Such content can be useful to guide the beginners in avoiding commonly observed logical errors and in verifying the correctness of their parallel programs. In this paper, we 1) enumerate some of the logical errors that we have seen in the parallel programs (OpenMP, MPI, and CUDA) that were written by the beginners working with us, and 2) discuss the ways to fix those errors. The errors are mainly related to the data distribution, exiting distributed for-loops, and workload-imbalance. The documentation on these logical errors can contribute in enhancing the productivity of the beginners, and can potentially help them in their debugging efforts. We have added the code samples containing logical errors and their solutions in a Github repository so that the others in the community can reproduce the errors on their systems and learn from them. The content presented in this paper may also be useful for those developing high-level tools for detecting and removing logical errors in parallel programs.
Databáze: OpenAIRE