Li, Longhao (2023) Energy and Completion Time Aware Fault Tolerance for Extreme Scale Computing in Heterogeneous Environments. Doctoral Dissertation, University of Pittsburgh. (Unpublished)
Abstract
As future systems scale to extreme sizes, their propensity to fail increases significantly, making it difficult for long-running applications that span a large number of cores to make forward progress. Achieving resilience in extreme-scale environments under energy constraints is a major challenge. Current fault-tolerance frameworks rely mainly on the classic checkpoint-restart approach for recovery. At extreme scale, the mean time between failures is likely to fall below the time to recovery. Consequently, these techniques may fail to achieve forward progress, which often results in completion-time violations.
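The forward-progress concern can be illustrated with a rough calculation. The sketch below is not taken from the dissertation: it assumes independent, exponentially distributed core failures and a recovery that restarts from scratch whenever a failure interrupts it. The function names `system_mtbf` and `expected_time_to_finish` are hypothetical.

```python
import math


def system_mtbf(core_failure_rates):
    """MTBF of the whole job (assumption: independent exponential failures,
    so the per-core failure rates simply add up)."""
    return 1.0 / sum(core_failure_rates)


def expected_time_to_finish(segment, mtbf):
    """Expected wall-clock time to complete `segment` time units without
    interruption, when any failure restarts the segment from the beginning.

    Standard result for exponential failures: mtbf * (exp(segment / mtbf) - 1).
    """
    return mtbf * (math.exp(segment / mtbf) - 1.0)


if __name__ == "__main__":
    recovery = 0.5  # hypothetical recovery time, in hours
    for cores in (1_000, 100_000, 1_000_000):
        # Hypothetical per-core failure rate: one failure per 100,000 hours.
        mtbf = system_mtbf([1e-5] * cores)
        cost = expected_time_to_finish(recovery, mtbf)
        print(f"{cores:>9} cores: MTBF {mtbf:8.3f} h, expected recovery {cost:8.3f} h")
```

Under these assumptions the expected recovery cost stays close to the nominal recovery time while the MTBF is much larger than it, and grows exponentially once the MTBF drops below it.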
In this thesis, we take a radical approach to fault tolerance and explore proactive and reactive replication-based techniques to achieve fault tolerance in heterogeneous extreme-scale environments. The focus of this thesis is to tolerate both fail-stop errors and silent errors. The main objective is to simultaneously minimize completion time and energy consumption in failure-prone, energy-constrained extreme-scale computing environments. To this end, we present a new selective replication-based framework, referred to as Differential Shadowing (diffShadowing), for fault tolerance in extreme-scale environments. In these environments, cores fail independently but non-identically. The diffShadowing framework harnesses this property to selectively shadow processes, based on their likelihood of failure, to ensure forward progress while minimizing energy consumption and reducing time to completion.
To tolerate fail-stop errors, processes are replicated selectively, based on the failure rates of their computational cores. For silent errors, a pure replica and a diffShadow are initiated to run concurrently with the original process, so that majority voting is possible in the event of a failure. The computational attribute of the process is determined based on the core failure rates. To this end, a deep-attention-based failure prediction model is developed, covering both offline and online failure prediction. It monitors runtime log event sequences to detect abnormal behavior for potential failure mitigation.
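The two mechanisms described above, selective replication and majority voting, can be sketched as follows. This is a minimal illustration and not the dissertation's implementation: the threshold-based selection rule and the names `select_for_shadowing` and `majority_vote` are assumptions, and the sketch does not model how a diffShadow executes differently from a pure replica.

```python
from collections import Counter


def select_for_shadowing(core_failure_rates, threshold):
    """Return the ids of processes whose core failure rate is at or above
    `threshold` (hypothetical selection rule; only these get a shadow)."""
    return [pid for pid, rate in core_failure_rates.items() if rate >= threshold]


def majority_vote(results):
    """Return the value reported by a strict majority of the copies,
    or None when no value reaches a majority."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


if __name__ == "__main__":
    # Hypothetical per-core failure rates (failures per hour).
    rates = {"p0": 0.001, "p1": 0.05, "p2": 0.02}
    print(select_for_shadowing(rates, threshold=0.02))

    # Original process, pure replica, and shadow: one copy silently corrupted.
    print(majority_vote([42, 42, 7]))
```

With three copies, a single silently corrupted result is outvoted by the two that agree. If all three disagree, the vote is inconclusive and the sketch returns `None`.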
Details
Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors: Li, Longhao
Date: 10 January 2023
Date Type: Publication
Defense Date: 29 July 2022
Approval Date: 10 January 2023
Submission Date: 28 October 2022
Access Restriction: 1 year -- Restrict access to University of Pittsburgh for a period of 1 year.
Number of Pages: 143
Institution: University of Pittsburgh
Schools and Programs: School of Computing and Information > Computer Science
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: differential shadowing, heterogeneous environment, extreme-scale, resilience, energy aware, completion time aware
Date Deposited: 10 Jan 2023 16:15
Last Modified: 10 Jan 2023 16:15
URI: http://d-scholarship.pitt.edu/id/eprint/43758