Adaptive and Power-aware Fault Tolerance for Future Extreme-scale Computing

Cui, Xiaolong (2018) Adaptive and Power-aware Fault Tolerance for Future Extreme-scale Computing. Doctoral Dissertation, University of Pittsburgh. (Unpublished)

Abstract

Two major trends in large-scale computing are the rapid growth of HPC, driven in particular by an international exascale initiative, and the dramatic expansion of Cloud infrastructures, accompanied by the enthusiasm for Big Data. To satisfy the continuous demand for increased computing capacity, future extreme-scale systems will embrace a multi-fold increase in the number of computing, storage, and communication components in order to support an unprecedented level of parallelism. Despite the capacity and economic benefits, the transition to extreme scale poses numerous scientific and technological challenges, two of which are power consumption and fault tolerance. As system scale increases, failure will become the norm rather than the exception, driving the system to significantly lower efficiency with unforeseen power consumption.

This thesis aims to address both challenges simultaneously by introducing a novel fault-tolerant computational model, referred to as Leaping Shadows. Based on Shadow Replication, Leaping Shadows associates with each main process a suite of coordinated shadow processes, which execute in parallel but at differential rates, to deal with failures and meet the QoS requirements of the underlying application under strict power/energy constraints. In failure-prone extreme-scale computing environments, this new model addresses the limitations of the basic Shadow Replication model and achieves adaptive and power-aware fault tolerance that is more time and energy efficient than existing techniques.

In this thesis, we first present an optimization framework, based on an analytical model, that demonstrates Shadow Replication's adaptivity and flexibility in achieving multi-dimensional QoS requirements. Then, we introduce Leaping Shadows as a novel power-aware fault tolerance model, which tolerates multiple types of failures, guarantees forward progress, and maintains a consistent level of resilience. Lastly, the details of a Leaping Shadows implementation in MPI are discussed, along with an extensive performance evaluation that includes a comparison to checkpoint/restart. Collectively, these efforts advocate an adaptive and power-aware fault tolerance alternative for future extreme-scale computing.

Details

Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors:
Creators | Email | Pitt Username | ORCID
Cui, Xiaolong | sunshine870@gmail.com | xic51 |
ETD Committee:
Title | Member | Email Address | Pitt Username | ORCID
Committee Chair | Znati, Taieb | znati@pitt.edu | |
Committee CoChair | Melhem, Rami | melhem@cs.pitt.edu | |
Committee Member | Lange, John | jacklange@cs.pitt.edu | |
Committee Member | Meneses, Esteban | emeneses@ic-itcr.ac.cr | |
Date: 24 January 2018
Date Type: Publication
Defense Date: 10 November 2017
Approval Date: 24 January 2018
Submission Date: 13 December 2017
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 132
Institution: University of Pittsburgh
Schools and Programs: School of Computing and Information > Computer Science
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: Fault tolerance; Extreme-scale computing; Forward recovery; Power awareness; Resilience
Date Deposited: 24 Jan 2018 16:27
Last Modified: 24 Jan 2018 16:27
URI: http://d-scholarship.pitt.edu/id/eprint/33624