
Vertical Memory Optimization for High Performance Energy-efficient GPU

Mao, Mengjie (2016) Vertical Memory Optimization for High Performance Energy-efficient GPU. Doctoral Dissertation, University of Pittsburgh. (Unpublished)



GPUs rely heavily on massive multi-threading to achieve high throughput, which places tremendous pressure on their storage components. This dissertation focuses on optimizing the memory subsystem, including the register file, L1 data cache, and device memory, all of which are stressed by massive multi-threading and dominate the efficiency and scalability of GPUs.

GPUs demand a large register file to support fast thread switching. This dissertation first introduces a power-efficient GPU register file built on the newly emerged racetrack memory (RM). However, the shift operations of RM incur extra power and timing overhead. A holistic set of architecture-level techniques is developed to overcome these adverse impacts and preserve RM's energy advantage. Experimental results show that the proposed techniques keep GPU performance on par with an SRAM-based register file baseline while reducing register file energy by 48.5%.
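The shift overhead mentioned above can be illustrated with a minimal toy model. In racetrack memory, several bits (or registers) share one track, and an access must first shift the track until the target domain aligns with the access port; each one-domain shift costs time and energy. The class and parameters below are illustrative assumptions, not the dissertation's actual design:

```python
# Toy model of a racetrack-memory (RM) register bank with a single access
# port per track. Accessing a register requires shifting the track until
# the target domain sits under the port; the accumulated shift count is a
# simple proxy for the extra timing/energy overhead.

class RacetrackBank:
    def __init__(self, domains_per_track):
        self.domains_per_track = domains_per_track
        self.port_pos = 0   # domain currently aligned with the access port
        self.shifts = 0     # accumulated one-domain shifts (cost proxy)

    def access(self, domain):
        """Shift the track so `domain` reaches the port, then access it."""
        distance = abs(domain - self.port_pos)
        self.shifts += distance
        self.port_pos = domain
        return distance

bank = RacetrackBank(domains_per_track=8)
# An access pattern with poor locality forces long shifts (e.g. 1 -> 7);
# architecture-level remapping would place hot registers near the port.
total = sum(bank.access(d) for d in [0, 1, 1, 7, 6])
```

Under this model, reordering accesses or remapping hot registers closer to the port directly reduces the shift count, which is the kind of adverse impact the proposed techniques target.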

This work then proposes a versatile warp scheduler (VWS) to reduce L1 data cache misses in GPUs. VWS retains intra-warp cache locality with a simple yet effective per-warp working-set estimator, and enhances intra- and inter-thread-block cache locality using a thread-block-aware scheduler. VWS achieves 38.4% and 9.3% average IPC improvement over a widely used and a state-of-the-art warp scheduler, respectively.
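The two ideas behind VWS, a per-warp working-set estimate and thread-block-aware warp selection, can be sketched as follows. This is a simplified sketch under assumed parameters (cache capacity, estimation by counting distinct recently touched lines), not the dissertation's exact mechanism:

```python
# Sketch of a locality-aware warp scheduler: estimate each warp's working
# set as the number of distinct cache lines it touched recently, then
# greedily admit warps (grouped by thread block, to preserve inter-warp
# locality) until the aggregate estimate would exceed the L1D capacity.

L1D_LINES = 32  # assumed L1 data cache capacity, in cache lines

def working_set(recent_lines):
    """Per-warp working-set estimate: distinct cache lines touched recently."""
    return len(set(recent_lines))

def schedulable_warps(warps):
    """warps: list of (warp_id, thread_block_id, recent_lines).
    Returns the warp ids allowed to issue this cycle."""
    ordered = sorted(warps, key=lambda w: w[1])  # thread-block-aware grouping
    admitted, budget = [], L1D_LINES
    for wid, tb, lines in ordered:
        ws = working_set(lines)
        if ws <= budget:             # admit only if its data still fits in L1D
            admitted.append(wid)
            budget -= ws
        # otherwise the warp is throttled, protecting the admitted warps'
        # cache lines from thrashing
    return admitted
```

Throttling warps whose combined working sets exceed the cache is what preserves intra-warp locality; grouping by thread block exploits data sharing between warps of the same block.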

Finally, this work targets the off-chip, DRAM-based device memory. An integrated architectural substrate is introduced to improve the performance and energy efficiency of GPUs through efficient bandwidth utilization. The first part of the substrate, thread batch enabled memory partitioning (TEMP), improves memory access parallelism by introducing thread batching to separate the memory access streams of different SMs. The second part, the thread batch-aware scheduler (TBAS), is designed to improve memory access locality. Experimental results show that TEMP and TBAS together obtain up to 10.3% performance improvement and 11.3% DRAM energy reduction for GPU workloads.
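The thread-batching idea can be sketched in a few lines: thread blocks whose data maps to the same DRAM bank are grouped into a batch, so each bank sees one SM's access stream rather than an interleaved mix from many SMs. The bank-mapping function and sizes below are assumptions for illustration only:

```python
# Toy sketch of thread-batch-style memory partitioning. Assumed mapping:
# consecutive pairs of thread blocks touch pages in the same DRAM bank.

NUM_BANKS = 4

def bank_of(thread_block_id, blocks_per_bank=2):
    """Assumed page-to-bank mapping for a thread block's data."""
    return (thread_block_id // blocks_per_bank) % NUM_BANKS

def batch_blocks(num_blocks):
    """Group thread blocks into batches keyed by the DRAM bank they touch.
    Dispatching each batch to one SM separates per-bank access streams,
    reducing row-buffer conflicts between SMs."""
    batches = {}
    for tb in range(num_blocks):
        batches.setdefault(bank_of(tb), []).append(tb)
    return batches
```

With such batching in place, a batch-aware scheduler (in the spirit of TBAS) can further keep consecutively scheduled thread blocks on the same SM touching the same DRAM rows, improving row-buffer locality.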




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators:
Name | Email | Pitt Username
Mao, Mengjie | mem231@pitt.edu | MEM231
ETD Committee:
Role | Member | Email Address | Pitt Username
Committee Chair | Li, Hai | hal66@pitt.edu | HAL66
Committee Co-Chair | Chen, Yiran | yic52@pitt.edu | YIC52
Committee Member | Jones, Alex K. | akjones@pitt.edu | AKJONES
Committee Member | Mao, Zhi-Hong | zhm4@pitt.edu | ZHM4
Committee Member | Melhem, Rami | melhem@cs.pitt.edu | MELHEM
Date: 15 June 2016
Date Type: Publication
Defense Date: 24 March 2016
Approval Date: 15 June 2016
Submission Date: 27 March 2016
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 111
Institution: University of Pittsburgh
Schools and Programs: Swanson School of Engineering > Computer Engineering
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: GPU; Racetrack Memory; Warp scheduler; Memory partitioning
Date Deposited: 15 Jun 2016 17:40
Last Modified: 15 Nov 2016 14:32

