Power-Aware Resilience for Exascale Computing

Mills, Bryan (2014) Power-Aware Resilience for Exascale Computing. Doctoral Dissertation, University of Pittsburgh. (Unpublished)

To enable future scientific breakthroughs and discoveries, the next generation of scientific applications will require exascale computing performance to support the execution of predictive models and analysis of massive quantities of data, with significantly higher resolution and fidelity than what is possible within existing computing infrastructure. Delivering exascale performance will require massive parallelism, which could result in a computing system with over a million sockets, each supporting many cores. The result is a system with millions of components, including memory modules, communication networks, and storage devices. This increase in the number of components significantly increases the propensity of exascale computing systems to faults, while driving power consumption and operating costs to unforeseen heights. To achieve exascale performance, two challenges must be addressed: resilience to failures and adherence to power budget constraints. These two objectives conflict insofar as performance is concerned, as achieving high performance may push system components past their thermal limits and increase the likelihood of failure. In current systems, the dominant resilience technique is checkpoint/restart. It is believed, however, that this technique alone will not scale to the level necessary to support future systems. Therefore, alternative methods, such as process replication, have been suggested to augment checkpoint/restart.

In this thesis, we present a new fault tolerance model called shadow replication that addresses resilience and power simultaneously. As in traditional replication, shadow replication associates a shadow process with each main process; however, the shadow process executes at a reduced speed. In limited power environments, shadow replication reduces energy consumption and produces solutions faster than checkpoint/restart and other replication methods. It reduces energy consumption by up to 25%, depending upon the application type, system parameters, and failure rates. The major contribution of this thesis is the development of shadow replication, a power-aware fault-tolerant computational model. The second contribution is an execution model applying shadow replication to future high performance exascale-class systems. The third is a framework to analyze and simulate the power and energy consumption of fault tolerance methods in high performance computing systems. Lastly, to demonstrate the viability of shadow replication, an implementation is presented for the Message Passing Interface.
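
As a rough illustration of why a slower shadow can save energy, the following toy sketch compares one main/shadow pair under different shadow speeds. It is not the model from the thesis: the cubic power law, the function names (`power`, `shadow_run`), and the `recovery_speed` parameter are all assumptions made for this example only.

```python
def power(speed: float) -> float:
    """Assumed power model (not from the thesis): power grows as speed cubed."""
    return speed ** 3


def shadow_run(work, shadow_speed, failure_time=None, recovery_speed=1.0):
    """Return (completion_time, energy) for one hypothetical main/shadow pair.

    The main process runs at speed 1.0. The shadow runs at shadow_speed
    until the main either finishes (the shadow is then stopped) or fails
    (the shadow then continues alone at recovery_speed).
    """
    main_finish = work / 1.0
    if failure_time is None or failure_time >= main_finish:
        # No failure: both processes run until the main completes.
        energy = (power(1.0) + power(shadow_speed)) * main_finish
        return main_finish, energy

    # The main fails; the shadow has completed only part of the work so far.
    done = shadow_speed * failure_time
    remaining_time = (work - done) / recovery_speed
    energy = (power(1.0) + power(shadow_speed)) * failure_time
    energy += power(recovery_speed) * remaining_time
    return failure_time + remaining_time, energy


if __name__ == "__main__":
    print(shadow_run(100, 1.0))                   # full-speed replica
    print(shadow_run(100, 0.5))                   # half-speed shadow, no failure
    print(shadow_run(100, 0.5, failure_time=40))  # half-speed shadow, main fails
```

Under these assumptions, a half-speed shadow adds little energy in the failure-free case but delays completion when the main process fails, which is the trade-off that the choice of shadow speed has to balance.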

Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators: Mills, Bryan (email: bmills@cs.pitt.edu; Pitt username: BNM15)
ETD Committee:
Committee Chair: Znati, Taieb (email: znati@cs.pitt.edu; Pitt username: ZNATI)
Committee Member: Melhem, Rami (email: melhem@cs.pitt.edu; Pitt username: MELHEM)
Committee Member: Mossé, Daniel (email: mosse@cs.pitt.edu; Pitt username: MOSSE)
Committee Member: Jones, Alex (email: akjones@pitt.edu; Pitt username: AKJONES)
Committee Member: Ferreira, Kurt
Date: 24 September 2014
Date Type: Publication
Defense Date: 30 May 2014
Approval Date: 24 September 2014
Submission Date: 9 June 2014
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 143
Institution: University of Pittsburgh
Schools and Programs: Dietrich School of Arts and Sciences > Computer Science
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: resilience, fault tolerance, energy-aware, shadow computing
Date Deposited: 24 Sep 2014 15:58
Last Modified: 15 Nov 2016 14:20