
Evaluation of Scalability for Distributed Data-Parallel Training of Swin Transformer V2

Garrett, Dillon M. (2023) Evaluation of Scalability for Distributed Data-Parallel Training of Swin Transformer V2. Master's Thesis, University of Pittsburgh. (Unpublished)



As recent research demonstrates, model size across deep learning has grown rapidly, helping to further the state of the art. Larger models place greater computational demands on hardware and software computing platforms, which makes training scalability a topic of interest. Since the development of transformer-based models, it has become common practice to start from a pre-trained model and fine-tune it on a specific dataset, allowing wider adoption without full model retraining. While originally designed for natural language processing, transformers have been adapted to many other domains. Swin Transformer V2 is a transformer model for computer-vision tasks that achieved state-of-the-art semantic segmentation results. This research provides a scalability analysis for the distributed data-parallel training of Swin Transformer V2 on the semantic segmentation vision task. The ADE20K semantic segmentation dataset supplies the training instances used to fine-tune the model. A weak scalability experiment is designed, increasing the number of GPUs for training while holding the problem size constant. To implement this experiment, the sub-batch size per GPU is held constant at 8 images per GPU per iteration and the total number of iterations is scaled down. Training time, GPU utilization, and CPU utilization are measured for single- and multi-GPU configurations on NVIDIA A100 SXM, NVIDIA A100 PCIe, and NVIDIA V100 PCIe GPU platforms hosted by the Center for Research Computing at the University of Pittsburgh. Training speedup and parallel efficiency metrics are calculated. Across all computing platforms, training on 2 GPUs is 26% faster on average than single-GPU training. However, diminishing returns are observed as further GPUs are added. When increasing the number of GPUs from 2 to 4, training is only 1.9% faster on average on NVIDIA A100 PCIe and NVIDIA V100 PCIe nodes. For NVLINK-enabled NVIDIA A100 nodes, training is only 2.9% faster when increasing the number of GPUs from 4 to 8. Consequently, distributed data-parallel training of Swin Transformer V2 scales poorly as the number of devices is increased.
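The abstract refers to training speedup and parallel efficiency without defining them. As a minimal sketch, assuming the conventional definitions (speedup S(N) = T(1) / T(N) and parallel efficiency E(N) = S(N) / N, where T(N) is the training time on N GPUs), the metrics could be computed as below. The function names and the timing values are hypothetical illustrations and are not taken from the thesis.

```python
def speedup(t_single: float, t_multi: float) -> float:
    """Speedup S(N) = T(1) / T(N), assuming the conventional definition."""
    if t_single <= 0 or t_multi <= 0:
        raise ValueError("training times must be positive")
    return t_single / t_multi


def parallel_efficiency(t_single: float, t_multi: float, n_gpus: int) -> float:
    """Parallel efficiency E(N) = S(N) / N, assuming the conventional definition."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return speedup(t_single, t_multi) / n_gpus


# Hypothetical wall-clock training times in seconds, keyed by GPU count.
# These are placeholder values for illustration, NOT measurements from the thesis.
hypothetical_times = {1: 1000.0, 2: 800.0, 4: 780.0}

for n_gpus, t in hypothetical_times.items():
    s = speedup(hypothetical_times[1], t)
    e = parallel_efficiency(hypothetical_times[1], t, n_gpus)
    print(f"{n_gpus} GPU(s): speedup={s:.2f}, efficiency={e:.2f}")
```

Under these definitions, ideal scaling gives an efficiency of 1.0, and values that fall well below 1.0 as GPUs are added correspond to the diminishing returns described in the abstract.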




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creator: Garrett, Dillon M. (Email: dmg111@pitt.edu; Pitt Username: dmg111; ORCID: 0009-0006-3961-6541)
ETD Committee:
Committee Chair: George, Alan
Committee Member: Dallal, Ahmed Hassan Sayed (Email: ahd12@pitt.edu; Pitt Username: ahd12)
Committee Member: Dickerson, Samuel J. (Email: dickerson@pitt.edu; Pitt Username: sjdst31)
Date: 14 September 2023
Date Type: Publication
Defense Date: 27 April 2023
Approval Date: 14 September 2023
Submission Date: 8 May 2023
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 34
Institution: University of Pittsburgh
Schools and Programs: Swanson School of Engineering > Electrical and Computer Engineering
Degree: MS - Master of Science
Thesis Type: Master's Thesis
Refereed: Yes
Uncontrolled Keywords: high-performance computing, machine learning, GPU, transformer, training
Date Deposited: 14 Sep 2023 13:33
Last Modified: 14 Sep 2023 13:33

