OpenMP Transient-Fault Tolerance via Tasks Redundancy on Multi and Many Core Architectures
Author: Oussama TAHAN (HEUDIASYC)
Conference: Paper with proceedings at an international conference - 23/01/2012 - 5th Workshop on Programmability Issues for Heterogeneous Multicores
The increasing need to secure the execution of critical embedded applications has drawn attention to the growing number of failures that transient faults cause in today's systems. Fault-tolerant systems and applications are therefore becoming increasingly important.
Shared-memory parallel programs on multi- and many-core architectures are now widely used, so fault tolerance techniques for this type of application are gaining interest as a research topic. OpenMP is a well-established shared-memory programming model. The few research works on fault-tolerant OpenMP applications rely either on application-level checkpointing and restart or on thread-level redundancy. However, checkpointing for OpenMP exhibits scalability issues as the data grows and as the number of threads and cores increases. Thread-level redundancy, which requires nested parallelism in OpenMP, may also scale poorly and incur more overhead than tasks. In this paper, we present our approach, based on dynamic task-level redundancy and voting, for obtaining task-centric fault-tolerant OpenMP applications.
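
As a rough illustration of task-level redundancy with voting, the sketch below runs three replicas of the same computation as OpenMP tasks and accepts the value on which at least two replicas agree. It is not the paper's implementation: the replica count is fixed at three rather than chosen dynamically, and the names `compute_sum`, `vote3`, and `redundant_sum` are hypothetical, introduced only for this example.

```c
#include <stddef.h>

/* Hypothetical workload standing in for any side-effect-free task body. */
static long compute_sum(const int *data, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; ++i) {
        s += data[i];
    }
    return s;
}

/* Majority vote over three replica results.
 * Returns 0 and stores the agreed value in *out when at least two
 * replicas match; returns -1 when all three disagree. */
static int vote3(long a, long b, long c, long *out) {
    if (a == b || a == c) { *out = a; return 0; }
    if (b == c)           { *out = b; return 0; }
    return -1;
}

/* Assumption: a fixed triple replication. Each replica is an OpenMP task
 * writing to its own slot of r, so the replicas do not race with each other. */
int redundant_sum(const int *data, size_t n, long *out) {
    long r[3] = {0, 0, 0};

    #pragma omp parallel
    #pragma omp single
    {
        for (int k = 0; k < 3; ++k) {
            #pragma omp task shared(r) firstprivate(k, data, n)
            r[k] = compute_sum(data, n);
        }
        /* Wait for all three replicas before voting. */
        #pragma omp taskwait
    }

    return vote3(r[0], r[1], r[2], out);
}
```

A caller would invoke `redundant_sum(data, n, &result)` and treat a return value of -1 as an unrecoverable disagreement, for example by re-running the replicas. Built with an OpenMP-enabled compiler (for instance `-fopenmp` on GCC or Clang), the replicas may run on different threads; without OpenMP support the pragmas are ignored and the replicas run one after another, with the same voting result.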