Adaptive and Power-aware Fault Tolerance for Future Extreme-scale Computing

Cui, Xiaolong (2018) Adaptive and Power-aware Fault Tolerance for Future Extreme-scale Computing. Doctoral Dissertation, University of Pittsburgh. (Unpublished)


Abstract

Two major trends in large-scale computing are the rapid growth of HPC, in particular the international exascale initiative, and the dramatic expansion of Cloud infrastructures accompanied by the enthusiasm for Big Data. To satisfy the continuous demand for increasing computing capacity, future extreme-scale systems will embrace a multi-fold increase in the number of computing, storage, and communication components in order to support an unprecedented level of parallelism. Despite the capacity and economic benefits, the transition to extreme scale poses numerous scientific and technological challenges, two of which are power consumption and fault tolerance. As system scale increases, failure will become the norm rather than the exception, driving the system to significantly lower efficiency with unforeseen power consumption.

This thesis aims at simultaneously addressing the above two challenges by introducing a novel fault-tolerant computational model, referred to as Leaping Shadows. Based on Shadow Replication, Leaping Shadows associates with each main process a suite of coordinated shadow processes, which execute in parallel but at differential rates, to deal with failures and meet the QoS requirements of the underlying application under strict power/energy constraints. In failure-prone extreme-scale computing environments, this new model addresses the limitations of the basic Shadow Replication model, and achieves adaptive and power-aware fault tolerance that is more time and energy efficient than existing techniques.

In this thesis, we first present an optimization framework, based on an analytical model, that demonstrates Shadow Replication's adaptivity and flexibility in achieving multi-dimensional QoS requirements. Then, we introduce Leaping Shadows as a novel power-aware fault tolerance model, which tolerates multiple types of failures, guarantees forward progress, and maintains a consistent level of resilience. Lastly, the details of a Leaping Shadows implementation in MPI are discussed, along with an extensive performance evaluation that includes a comparison to checkpoint/restart. Collectively, these efforts advocate an adaptive and power-aware fault tolerance alternative for future extreme-scale computing.


Details

Item Type:

University of Pittsburgh ETD

Status:

Unpublished

Creators/Authors:

Creators	Email	Pitt Username	ORCID
Cui, Xiaolong	sunshine870@gmail.com	xic51

ETD Committee:

Title	Member	Email Address
Committee Chair	Znati, Taieb	znati@pitt.edu
Committee CoChair	Melhem, Rami	melhem@cs.pitt.edu
Committee Member	Lange, John	jacklange@cs.pitt.edu
Committee Member	Meneses, Esteban	emeneses@ic-itcr.ac.cr

Date:

24 January 2018

Date Type:

Publication

Defense Date:

10 November 2017

Approval Date:

24 January 2018

Submission Date:

13 December 2017

Access Restriction:

No restriction; Release the ETD for access worldwide immediately.

Number of Pages:

132

Institution:

University of Pittsburgh

Schools and Programs:

School of Computing and Information > Computer Science

Degree:

PhD - Doctor of Philosophy

Thesis Type:

Doctoral Dissertation

Refereed:

Yes

Uncontrolled Keywords:

Fault tolerance; Extreme-scale computing; Forward recovery; Power awareness; Resilience

Date Deposited:

24 Jan 2018 16:27

Last Modified:

24 Jan 2018 16:27

URI:

http://d-scholarship.pitt.edu/id/eprint/33624
