
Resilient TensorFlow: A Software Approach to Dependable Hardware Acceleration of Space Applications

Garrett, Tyler (2024) Resilient TensorFlow: A Software Approach to Dependable Hardware Acceleration of Space Applications. Doctoral Dissertation, University of Pittsburgh. (Unpublished)

Restricted to University of Pittsburgh users only until 6 September 2026.


Abstract

Autonomous systems and artificial intelligence have become commonplace in today’s world, transforming everything from health care and manufacturing to automobiles and even our homes. Consequently, these technologies have drawn interest within the space domain as government and industry partners aim to enable new levels of spacecraft and crew autonomy. Systems achieve advanced autonomy through data-driven decision-making powered by machine-learning (ML) models. To effectively deploy ML-based applications, avionic systems must have sufficient memory, power, and computational capabilities to handle large volumes of data onboard. Unfortunately, such specifications are at odds with state-of-the-art flight hardware, which is often constrained by size, weight, power, and cost (SWaP-C). Moreover, to handle the harsh space environment and support necessary levels of safety-criticality, radiation-hardened (rad-hard) processors are typically used. However, current rad-hard processors fail to offer the performance needed to deploy ML models. Commercial-off-the-shelf (COTS) hardware accelerators are being explored to replace or accompany rad-hard processors but introduce reliability concerns.
This research aims to evaluate, understand, and mitigate vulnerabilities of COTS GPUs and TPUs to enable dependable hardware acceleration of ML models in onboard space applications. Focusing on data reliability, this work provides targeted model reinforcement through software by leveraging key attributes of each device’s microarchitecture to prevent catastrophic misclassifications. First, TensorFlow models are evaluated on Edge TPUs under neutron radiation to demonstrate the impact of soft errors on model caching. Through testing, this work characterizes and identifies a vulnerability in on-chip SRAM capable of compromising hardware redundancy approaches. A new method of preserving model parameters through refreshing is validated and optimized to mitigate persistent faults on the device during ML inference. Next, the Resilient TensorFlow (RTF) framework is introduced to improve the ease of deployment and scalability of fault-mitigation techniques for ML models. RTF embeds error detection and correction inside TensorFlow custom operations through fault-aware device kernels. The operations are then used as the building blocks of layers and models. RTF’s approach is demonstrated and validated on GPUs using fault injection. Lastly, the framework is applied and evaluated on a real mission use case leveraging data collected from STP-H7-CASPR aboard the International Space Station.



Details

Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors:
Creator | Email | Pitt Username | ORCID
Garrett, Tyler | tyler.garrett@pitt.edu | tmg61 |
ETD Committee:
Title | Member | Email Address | Pitt Username | ORCID
Committee Chair | George, Alan | Alan.George@pitt.edu | ADG91 | 0000-0001-9665-2879
Committee Member | Hu, Jingtong | JTHU@pitt.edu | jthu | 0000-0003-4029-4034
Committee Member | Zhou, Peipei | peipei.zhou@pitt.edu | pez41 |
Committee Member | Mahmoud, Amr | amm418@pitt.edu | amm418 |
Committee Member | Ramsey, Michael Sean | mramsey@pitt.edu | mramsey |
Thesis Advisor | George, Alan D | Alan.George@pitt.edu | ADG91 | 0000-0001-9665-2879
Date: 6 September 2024
Date Type: Publication
Defense Date: 12 July 2024
Approval Date: 6 September 2024
Submission Date: 27 June 2024
Access Restriction: 2 year -- Restrict access to University of Pittsburgh for a period of 2 years.
Number of Pages: 149
Institution: University of Pittsburgh
Schools and Programs: Swanson School of Engineering > Electrical and Computer Engineering
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: Computer Architecture, Deep Learning, Fault-Tolerant Design, GPU Computing, High-Performance Computing, Machine Learning, Onboard Processing, Space Computing, TensorFlow
Date Deposited: 06 Sep 2024 19:57
Last Modified: 06 Sep 2024 19:57
URI: http://d-scholarship.pitt.edu/id/eprint/46634
