Energy and Completion Time Aware Fault Tolerance for Extreme Scale Computing in Heterogeneous Environments

Li, Longhao (2023) Energy and Completion Time Aware Fault Tolerance for Extreme Scale Computing in Heterogeneous Environments. Doctoral Dissertation, University of Pittsburgh. (Unpublished)

PDF
Restricted to University of Pittsburgh users only until 10 January 2026.


As future systems scale to the extreme, their propensity to fail increases significantly, making it difficult for long-running applications that span a large number of cores to make forward progress. Achieving resilience in extreme-scale environments under energy constraints is a major challenge. Current fault-tolerance frameworks rely mainly on the classic checkpoint-restart approach for recovery. At extreme scale, the mean time between failures is likely to fall below the time to recovery. Consequently, these techniques may fail to achieve forward progress, which often results in completion-time violations.
In this thesis, we take a radical approach to fault tolerance and explore proactive and reactive replication-based techniques to achieve fault tolerance in heterogeneous extreme-scale environments. The thesis focuses on tolerating both fail-stop errors and silent errors. The main objective is to simultaneously minimize completion time and energy consumption in failure-prone, energy-constrained extreme-scale computing environments. To this end, we present a new selective replication-based framework, referred to as Differential Shadowing (diffShadowing), for fault tolerance in extreme-scale environments. In these environments, cores fail independently but non-identically. The diffShadowing framework harnesses this property to selectively shadow processes, based on their likelihood of failure, to ensure forward progress while minimizing energy consumption and reducing time to completion.
To tolerate fail-stop errors, processes are replicated selectively, based on the failure rates of the computational cores. For silent errors, a pure replica and a diffShadow are initiated to run concurrently with the original process, enabling majority voting in the event of a failure. The computational attribute of the process is determined based on the core failure rates. To this end, a deep-attention-based failure prediction model is developed, focusing on offline and online failure prediction. It monitors runtime log event sequences to detect abnormal behaviors for potential failure mitigation.
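The majority voting among the original process, its pure replica, and its diffShadow can be illustrated with a minimal sketch. This is generic majority voting over three results, not the dissertation's implementation; the function name `majority_vote` and the sample values are hypothetical, and the sketch assumes each copy's result can be compared for equality.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of copies, or None."""
    if not results:
        return None
    # most_common(1) yields the single most frequent value and its count.
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


# Hypothetical outputs of the original process, its replica, and its shadow.
agreed = majority_vote([42, 42, 41])    # one copy silently corrupted
disputed = majority_vote([42, 41, 40])  # no two copies agree
```

With three copies, a single silent error is outvoted by the two matching results. If all three disagree, the sketch returns `None`, signalling that no majority exists and some other recovery step would be needed.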



Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators | Email | Pitt Username | ORCID
Li, Longhao | lol16@pitt.edu | lol16 |
ETD Committee:
Title | Member | Email Address | Pitt Username | ORCID
Committee Chair | Znati, Taieb | znati@pitt.edu | znati |
Committee Member | Zhang, Youtao | zhangyt@cs.pitt.edu | youtao |
Committee Member | Tang, Xulong | xulongtang@pitt.edu | xulongtang |
Committee Member | Babay, Amy | babay@pitt.edu | babay |
Date: 10 January 2023
Date Type: Publication
Defense Date: 29 July 2022
Approval Date: 10 January 2023
Submission Date: 28 October 2022
Access Restriction: 1 year -- Restrict access to University of Pittsburgh for a period of 1 year.
Number of Pages: 143
Institution: University of Pittsburgh
Schools and Programs: School of Computing and Information > Computer Science
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: differential shadowing, heterogeneous environment, extreme-scale, resilience, energy aware, completion time aware
Date Deposited: 10 Jan 2023 16:15
Last Modified: 10 Jan 2023 16:15

