Cui, Xiaolong
(2018)
Adaptive and Power-aware Fault Tolerance for Future Extreme-scale Computing.
Doctoral Dissertation, University of Pittsburgh.
(Unpublished)
Abstract
Two major trends in large-scale computing are the rapid growth of HPC, driven in particular by an international exascale initiative, and the dramatic expansion of Cloud infrastructures that accompanies the enthusiasm for Big Data. To satisfy the continuous demand for increasing computing capacity, future extreme-scale systems will embrace a multi-fold increase in the number of computing, storage, and communication components in order to support an unprecedented level of parallelism. Despite the capacity and economic benefits, the transition to extreme scale poses numerous scientific and technological challenges, two of which are power consumption and fault tolerance. As system scale increases, failure will become the norm rather than the exception, driving the system to significantly lower efficiency with unforeseen power consumption.
This thesis aims to address both challenges simultaneously by introducing a novel fault-tolerant computational model, referred to as Leaping Shadows. Based on Shadow Replication, Leaping Shadows associates with each main process a suite of coordinated shadow processes, which execute in parallel but at differential rates, to deal with failures and meet the QoS requirements of the underlying application under strict power/energy constraints. In failure-prone extreme-scale computing environments, this new model addresses the limitations of the basic Shadow Replication model and achieves adaptive and power-aware fault tolerance that is more time and energy efficient than existing techniques.
In this thesis, we first present an optimization framework, based on an analytical model, that demonstrates Shadow Replication's adaptivity and flexibility in achieving multi-dimensional QoS requirements. Then, we introduce Leaping Shadows as a novel power-aware fault tolerance model, which tolerates multiple types of failures, guarantees forward progress, and maintains a consistent level of resilience. Lastly, the details of a Leaping Shadows implementation in MPI are discussed, along with an extensive performance evaluation that includes a comparison to checkpoint/restart. Collectively, these efforts advocate an adaptive and power-aware fault tolerance alternative for future extreme-scale computing.
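The abstract describes shadow processes that execute in parallel with a main process but at differential rates. A minimal sketch of that idea follows; it is an illustration under stated assumptions, not the model from the dissertation. It assumes a single main running at normalized rate 1.0, a single shadow running at a slower `shadow_rate`, and a shadow that switches to `recovery_rate` once the main fails. The function name and all parameters are hypothetical.

```python
def completion_time(work, shadow_rate, recovery_rate=1.0, main_failure_time=None):
    """Return the time at which a task of size `work` finishes.

    Assumptions (illustrative only, not taken from the dissertation):
      - the main process runs at normalized rate 1.0;
      - one shadow runs concurrently at `shadow_rate` (0.0 to 1.0);
      - if the main fails at `main_failure_time`, the shadow continues
        alone at `recovery_rate` from the progress it has already made.
    """
    if not 0.0 <= shadow_rate <= 1.0:
        raise ValueError("shadow_rate must be between 0.0 and 1.0")
    if recovery_rate <= 0.0:
        raise ValueError("recovery_rate must be positive")

    # Failure-free case: the main finishes the whole task at rate 1.0.
    if main_failure_time is None or main_failure_time >= work:
        return float(work)

    # Work the slower shadow has completed when the main fails.
    done_by_shadow = shadow_rate * main_failure_time
    remaining = work - done_by_shadow

    # The shadow finishes the remaining work at its post-failure rate.
    return main_failure_time + remaining / recovery_rate
```

In this sketch, a shadow rate of 0.0 behaves like restarting from scratch after the failure, while a shadow rate of 1.0 behaves like full-speed replication with no delay. Intermediate rates trade a longer recovery for less work done by the shadow while the main is healthy, which is the kind of time and energy trade-off the abstract refers to.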
Details
Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors: Cui, Xiaolong
ETD Committee:
Date: 24 January 2018
Date Type: Publication
Defense Date: 10 November 2017
Approval Date: 24 January 2018
Submission Date: 13 December 2017
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 132
Institution: University of Pittsburgh
Schools and Programs: School of Computing and Information > Computer Science
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: Fault tolerance; Extreme-scale computing; Forward recovery; Power awareness; Resilience
Date Deposited: 24 Jan 2018 16:27
Last Modified: 24 Jan 2018 16:27
URI: http://d-scholarship.pitt.edu/id/eprint/33624