
Heterogeneity aware fault tolerance for extreme scale computing

Hussain, Zaeem (2020) Heterogeneity aware fault tolerance for extreme scale computing. Doctoral Dissertation, University of Pittsburgh. (Unpublished)



Upcoming Extreme Scale, or Exascale, Computing Systems are expected to deliver a peak performance of at least 10^18 floating point operations per second (FLOPS), primarily through significant expansion in scale. A major concern for such large scale systems, however, is how to deal with failures in the system. This is because the impact of failures on system efficiency, while utilizing existing fault tolerance techniques, generally also increases with scale. Hence, current research effort in this area has been directed at optimizing various aspects of fault tolerance techniques to reduce their overhead at scale. One characteristic that has been overlooked so far, however, is heterogeneity, specifically in the rate at which individual components of the underlying system fail, and in the execution profile of a parallel application running on such a system. In this thesis, we investigate the implications of such types of heterogeneity for fault tolerance in large scale high performance computing (HPC) systems. To that end, we 1) study how knowledge of heterogeneity in system failure likelihoods can be utilized to make current fault tolerance schemes more efficient, 2) assess the feasibility of utilizing application imbalance for improved fault tolerance at scale, and 3) propose and evaluate changes to system level resource managers in order to achieve reliable job placement over resources with unequal failure likelihoods. The results in this thesis, taken together, demonstrate that heterogeneity in failure likelihoods significantly changes the landscape of fault tolerance for large scale HPC systems.




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators: Hussain, Zaeem (Email: zah20@pitt.edu; Pitt Username: zah20)
ETD Committee:
Committee Chair: Melhem, Rami (Email: melhem@cs.pitt.edu; Pitt Username: melhem)
Committee CoChair: Znati, Taieb (Email: znati@pitt.edu; Pitt Username: znati)
Committee Member: Mosse, Daniel (Email: mosse@pitt.edu; Pitt Username: mosse)
Committee Member: Palanisamy, Balaji (Email: bpalan@pitt.edu; Pitt Username: bpalan)
Date: 20 August 2020
Date Type: Publication
Defense Date: 17 July 2020
Approval Date: 20 August 2020
Submission Date: 27 July 2020
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 140
Institution: University of Pittsburgh
Schools and Programs: School of Computing and Information > Computer Science
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: exascale, fault tolerance, hpc, high performance computing, resilience, parallel computing, speedup
Date Deposited: 20 Aug 2020 18:47
Last Modified: 20 Aug 2020 18:47

