Vertical Memory Optimization for High Performance Energy-efficient GPU

Mao, Mengjie (2016) Vertical Memory Optimization for High Performance Energy-efficient GPU. Doctoral Dissertation, University of Pittsburgh. (Unpublished)

Abstract

GPUs rely heavily on massive multi-threading to achieve high throughput, and this multi-threading puts tremendous pressure on the storage components. This dissertation focuses on optimizing the memory subsystem, including the register file, the L1 data cache, and device memory, all of which are shaped by massive multi-threading and dominate the efficiency and scalability of the GPU.

A GPU needs a large register file to support fast thread switching. This dissertation first introduces a power-efficient GPU register file built on the emerging racetrack memory (RM). However, the shift operations of RM result in extra power and timing overhead. A holistic set of architecture-level techniques is developed to overcome these adverse effects and preserve the energy advantage of RM. Experimental results show that the proposed techniques keep GPU performance stable compared to the baseline with an SRAM-based register file, while reducing register file energy by 48.5%.
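The shift overhead mentioned above can be illustrated with a toy model. The sketch below is not the register file design from the dissertation. It assumes a single racetrack with one access port, where reaching a stored domain costs one shift step per position moved and the track stays where the last access left it. The function name `count_shifts` and the example access orders are hypothetical.

```python
def count_shifts(accesses, start=0):
    """Total shift steps needed to serve `accesses` on one single-port track.

    Assumption: the track holds its position after each access, and moving
    from domain a to domain b costs abs(a - b) shift steps.
    """
    position = start
    total = 0
    for target in accesses:
        total += abs(target - position)
        position = target
    return total


# The same four accesses, served in two different orders.
scattered = [0, 7, 1, 6]
ordered = sorted(scattered)  # [0, 1, 6, 7]

print(count_shifts(scattered))  # 0 + 7 + 6 + 5 = 18
print(count_shifts(ordered))    # 0 + 1 + 5 + 1 = 7
```

Under this simplified cost model, the same set of accesses needs 18 shift steps in one order and 7 in another, which hints at why reducing shifts matters for both power and timing. The abstract does not say which techniques the dissertation uses to do so.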

This work then proposes a versatile warp scheduler (VWS) to reduce L1 data cache misses in the GPU. VWS retains intra-warp cache locality with a simple yet effective per-warp working set estimator, and enhances intra- and inter-thread-block cache locality using a thread-block-aware scheduler. VWS achieves on average 38.4% and 9.3% IPC improvement compared to a widely used warp scheduler and a state-of-the-art warp scheduler, respectively.

Finally, this work targets the off-chip DRAM-based device memory. An integrated architecture substrate is introduced to improve the performance and energy efficiency of the GPU through more efficient bandwidth utilization. The first part of the substrate, thread batch enabled memory partitioning (TEMP), improves memory access parallelism by using thread batching to separate the memory access streams from the SMs. The second part, the thread batch-aware scheduler (TBAS), is designed to improve memory access locality. Experimental results show that TEMP and TBAS together obtain up to 10.3% performance improvement and 11.3% DRAM energy reduction for GPU workloads.

Details

Item Type:

University of Pittsburgh ETD

Status:

Unpublished

Creators/Authors:

Creators	Email	Pitt Username	ORCID
Mao, Mengjie	mem231@pitt.edu	MEM231

ETD Committee:

Title	Member	Email Address	Pitt Username
Committee Chair	Li, Hai	hal66@pitt.edu	HAL66
Committee CoChair	Chen, Yiran	yic52@pitt.edu	YIC52
Committee Member	Jones, Alex K.	akjones@pitt.edu	AKJONES
Committee Member	Mao, Zhi-Hong	zhm4@pitt.edu	ZHM4
Committee Member	Melhem, Rami	melhem@cs.pitt.edu	MELHEM

Date:

15 June 2016

Date Type:

Publication

Defense Date:

24 March 2016

Approval Date:

15 June 2016

Submission Date:

27 March 2016

Access Restriction:

No restriction; Release the ETD for access worldwide immediately.

Number of Pages:

111

Institution:

University of Pittsburgh

Schools and Programs:

Swanson School of Engineering > Computer Engineering

Degree:

PhD - Doctor of Philosophy

Thesis Type:

Doctoral Dissertation

Refereed:

Yes

Uncontrolled Keywords:

GPU Racetrack Memory Warp scheduler Memory partitioning

Date Deposited:

15 Jun 2016 17:40

Last Modified:

15 Nov 2016 14:32

URI:

http://d-scholarship.pitt.edu/id/eprint/27357
