Garrett, Dillon M. (2023) Evaluation of Scalability for Distributed Data-Parallel Training of Swin Transformer V2. Master's Thesis, University of Pittsburgh. (Unpublished)
Abstract
Recent research demonstrates that model sizes across deep learning have increased rapidly, helping to further the state of the art. With larger models come greater computational demands on hardware and software computing platforms, making training scalability a topic of interest. Following the development of transformer-based models, it has become common practice to begin with a pre-trained model and fine-tune it on a specific dataset, allowing wider adoption without full model retraining. While originally designed for natural language processing, transformers have been adapted to many other domains. Swin Transformer V2 is a transformer model for computer-vision tasks that achieved state-of-the-art semantic segmentation results. This research provides a scalability analysis of distributed data-parallel training of Swin Transformer V2 on the semantic segmentation vision task, using the ADE20K semantic segmentation dataset to fine-tune the model. A weak scalability experiment is designed, increasing the number of GPUs for training while holding the problem size constant: the sub-batch size is held constant at 8 images per GPU per iteration and the total number of iterations is scaled down accordingly. Training time, GPU utilization, and CPU utilization are measured for single- and multi-GPU configurations on NVIDIA A100 SXM, NVIDIA A100 PCIe, and NVIDIA V100 PCIe GPU platforms hosted by the Center for Research Computing at the University of Pittsburgh, and training speedup and parallel efficiency metrics are calculated. Across all computing platforms, training on 2 GPUs is 26% faster on average than single-GPU training. However, diminishing returns are observed as additional GPUs are added: increasing from 2 to 4 GPUs makes training only 1.9% faster on average on NVIDIA A100 PCIe and NVIDIA V100 PCIe nodes, and on NVLink-enabled NVIDIA A100 nodes, increasing from 4 to 8 GPUs makes training only 2.9% faster. Consequently, distributed data-parallel training of Swin Transformer V2 scales poorly as the number of devices is increased.
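For readers unfamiliar with the scaling metrics referenced in the abstract, the following is a minimal Python sketch of the bookkeeping implied by the experimental design: the per-GPU sub-batch size is fixed at 8 images while the iteration count is scaled down with GPU count, and speedup and parallel efficiency follow the standard definitions S(n) = T(1)/T(n) and E(n) = S(n)/n. All variable names and the illustrative timing values below are hypothetical and are not taken from the thesis; this is not the author's actual training or measurement code.

```python
# Sketch of the weak-scaling bookkeeping described in the abstract.
# The per-GPU sub-batch size is fixed at 8 images per iteration, and the total
# iteration count is scaled down as GPUs are added so the overall problem size
# stays constant. Timing values are hypothetical placeholders, not thesis data.

BASE_ITERATIONS = 160_000   # hypothetical single-GPU iteration count
SUB_BATCH_PER_GPU = 8       # images per GPU per iteration (from the abstract)


def iterations_for(num_gpus: int) -> int:
    """Scale the iteration count down so the total number of images processed
    (iterations * num_gpus * SUB_BATCH_PER_GPU) stays roughly constant."""
    return BASE_ITERATIONS // num_gpus


def speedup(t_single: float, t_parallel: float) -> float:
    """Training speedup S(n) = T(1) / T(n)."""
    return t_single / t_parallel


def parallel_efficiency(t_single: float, t_parallel: float, num_gpus: int) -> float:
    """Parallel efficiency E(n) = S(n) / n."""
    return speedup(t_single, t_parallel) / num_gpus


if __name__ == "__main__":
    # Hypothetical wall-clock training times in hours, for illustration only.
    measured_hours = {1: 10.0, 2: 7.9, 4: 7.7}
    t1 = measured_hours[1]
    for n, t in measured_hours.items():
        print(f"{n} GPU(s): {iterations_for(n)} iterations, "
              f"speedup={speedup(t1, t):.2f}, "
              f"efficiency={parallel_efficiency(t1, t, n):.2f}")
```

With measurements like these, an efficiency well below 1.0 at higher GPU counts corresponds to the diminishing returns the abstract reports.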
Details
Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors: Garrett, Dillon M.
ETD Committee:
Date: 14 September 2023
Date Type: Publication
Defense Date: 27 April 2023
Approval Date: 14 September 2023
Submission Date: 8 May 2023
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 34
Institution: University of Pittsburgh
Schools and Programs: Swanson School of Engineering > Electrical and Computer Engineering
Degree: MS - Master of Science
Thesis Type: Master's Thesis
Refereed: Yes
Uncontrolled Keywords: high-performance computing, machine learning, GPU, transformer, training
Date Deposited: 14 Sep 2023 13:33
Last Modified: 14 Sep 2023 13:33
URI: http://d-scholarship.pitt.edu/id/eprint/44738