Garrett, Tyler (2024) Resilient TensorFlow: A Software Approach to Dependable Hardware Acceleration of Space Applications. Doctoral Dissertation, University of Pittsburgh. (Unpublished)
Abstract
Autonomous systems and artificial intelligence have become commonplace in today’s world, transforming everything from health care and manufacturing to automobiles and even our homes. Consequently, these technologies have drawn interest within the space domain as government and industry partners aim to enable new levels of spacecraft and crew autonomy. Systems achieve advanced autonomy through data-driven decision-making powered by machine-learning (ML) models. To effectively deploy ML-based applications, avionic systems must have sufficient memory, power, and computational capabilities to handle large volumes of data onboard. Unfortunately, such specifications are at odds with state-of-the-art flight hardware, which is often constrained by size, weight, power, and cost (SWaP-C). Moreover, to handle the harsh space environment and support necessary levels of safety-criticality, radiation-hardened (rad-hard) processors are typically used. However, current rad-hard processors fail to offer the performance needed to deploy ML models. Commercial-off-the-shelf (COTS) hardware accelerators are being explored to replace or accompany rad-hard processors but introduce reliability concerns.
This research aims to evaluate, understand, and mitigate vulnerabilities of COTS GPUs and TPUs to enable dependable hardware acceleration of ML models in onboard space applications. Focusing on data reliability, this work provides targeted model reinforcement through software by leveraging key attributes of each device’s microarchitecture to prevent catastrophic misclassifications. First, TensorFlow models are evaluated on Edge TPUs under neutron radiation to demonstrate the impact of soft errors on model caching. Through testing, this work characterizes and identifies a vulnerability in on-chip SRAM capable of compromising hardware redundancy approaches. A new method of preserving model parameters through refreshing is validated and optimized to mitigate persistent faults on the device during ML inference. Next, the Resilient TensorFlow (RTF) framework is introduced to improve the ease of deployment and scalability of fault-mitigation techniques for ML models. RTF embeds error detection and correction inside TensorFlow custom operations through fault-aware device kernels. The operations are then used as the building blocks of layers and models. RTF’s approach is demonstrated and validated on GPUs using fault injection. Lastly, the framework is applied and evaluated on a real mission use case leveraging data collected from STP-H7-CASPR aboard the International Space Station.
Details
Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors: Garrett, Tyler
Date: 6 September 2024
Date Type: Publication
Defense Date: 12 July 2024
Approval Date: 6 September 2024
Submission Date: 27 June 2024
Access Restriction: 2 year -- Restrict access to University of Pittsburgh for a period of 2 years.
Number of Pages: 149
Institution: University of Pittsburgh
Schools and Programs: Swanson School of Engineering > Electrical and Computer Engineering
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: Computer Architecture, Deep Learning, Fault-Tolerant Design, GPU Computing, High-Performance Computing, Machine Learning, Onboard Processing, Space Computing, TensorFlow
Date Deposited: 06 Sep 2024 19:57
Last Modified: 06 Sep 2024 19:57
URI: http://d-scholarship.pitt.edu/id/eprint/46634