Mills, Bryan
(2014)
Power-Aware Resilience for Exascale Computing.
Doctoral Dissertation, University of Pittsburgh.
(Unpublished)
Abstract
To enable future scientific breakthroughs and discoveries, the next generation of scientific applications will require exascale computing performance to support the execution of predictive models and the analysis of massive quantities of data, with significantly higher resolution and fidelity than is possible within existing computing infrastructure. Delivering exascale performance will require massive parallelism, which could result in a computing system with over a million sockets, each supporting many cores, and millions of components in total, including memory modules, communication networks, and storage devices. This increase in the number of components significantly increases the propensity of exascale computing systems to faults, while driving power consumption and operating costs to unforeseen heights. To achieve exascale performance, two challenges must be addressed: resilience to failures and adherence to power budget constraints. These two objectives conflict insofar as performance is concerned, as achieving high performance may push system components past their thermal limits and increase the likelihood of failure. In current systems, the dominant resilience technique is checkpoint/restart. It is believed, however, that this technique alone will not scale to the level necessary to support future systems. Alternative methods, such as process replication, have therefore been suggested to augment checkpoint/restart.
In this thesis, we present a new fault tolerance model called shadow replication that addresses resilience and power simultaneously. As in traditional replication, shadow replication associates a shadow process with each main process; the shadow process, however, executes at a reduced speed. In power-limited environments, shadow replication reduces energy consumption and produces solutions faster than checkpoint/restart and other replication methods. The energy savings reach up to 25%, depending on the application type, system parameters, and failure rates. The major contribution of this thesis is the development of shadow replication, a power-aware fault-tolerant computational model. The second contribution is an execution model applying shadow replication to future high-performance exascale-class systems. The third is a framework to analyze and simulate the power and energy consumption of fault tolerance methods in high-performance computing systems. Lastly, to demonstrate the viability of shadow replication, an implementation is presented for the Message Passing Interface.
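The idea of a main process paired with a slower shadow can be illustrated with a toy calculation. The sketch below is not the model from the dissertation: it assumes, purely for illustration, that power grows with the cube of execution speed, that the main process runs at full speed, and that the shadow switches to full speed if the main fails. The function name `shadow_pair_cost` and all of its parameters are hypothetical.

```python
def power(speed):
    """Assumed power model: power grows with the cube of speed."""
    return speed ** 3


def shadow_pair_cost(work, shadow_speed, failure_time=None, full_speed=1.0):
    """Return (completion_time, energy) for one main/shadow pair.

    Illustrative assumptions, not taken from the dissertation:
    - the main process always runs at full_speed;
    - the shadow runs at shadow_speed until the main fails or finishes;
    - if the main fails, the shadow continues alone at full_speed;
    - if the main finishes first, the shadow is stopped immediately.
    """
    if not 0.0 < shadow_speed <= full_speed:
        raise ValueError("shadow_speed must be in (0, full_speed]")

    main_finish = work / full_speed
    if failure_time is None or failure_time >= main_finish:
        # No failure before completion: both processes run until the main ends.
        energy = main_finish * (power(full_speed) + power(shadow_speed))
        return main_finish, energy

    # The main fails: the shadow has completed part of the work at reduced speed.
    done_by_shadow = shadow_speed * failure_time
    catch_up_time = (work - done_by_shadow) / full_speed
    energy = (failure_time * (power(full_speed) + power(shadow_speed))
              + catch_up_time * power(full_speed))
    return failure_time + catch_up_time, energy


# Failure-free run: a half-speed shadow versus a full-speed replica.
print(shadow_pair_cost(100.0, 0.5))
print(shadow_pair_cost(100.0, 1.0))
# The main fails at t = 40 and the shadow finishes the job at full speed.
print(shadow_pair_cost(100.0, 0.5, failure_time=40.0))
```

Under these assumptions, the half-speed shadow adds little energy in the failure-free case compared with a full-speed replica, at the price of a later completion time when the main process fails. The actual trade-off depends on the power model and failure rates used.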
Details
Item Type: University of Pittsburgh ETD
Status: Unpublished
Date: 24 September 2014
Date Type: Publication
Defense Date: 30 May 2014
Approval Date: 24 September 2014
Submission Date: 9 June 2014
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 143
Institution: University of Pittsburgh
Schools and Programs: Dietrich School of Arts and Sciences > Computer Science
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: resilience, fault tolerance, energy-aware, shadow computing
Date Deposited: 24 Sep 2014 15:58
Last Modified: 15 Nov 2016 14:20
URI: http://d-scholarship.pitt.edu/id/eprint/21776