Link to the University of Pittsburgh Homepage
Link to the University Library System Homepage Link to the Contact Us Form

CGAcc: A Compressed Sparse Row Representation-Based BFS Graph Traversal Accelerator on Hybrid Memory Cube

Qian, Cheng and Childers, Bruce and Huang, Libo and Guo, Hui and Wang, Zhiying (2018) CGAcc: A Compressed Sparse Row Representation-Based BFS Graph Traversal Accelerator on Hybrid Memory Cube. Electronics, 7 (11). p. 307. ISSN 2079-9292

[img]
Preview
PDF
Published Version

Download (1MB) | Preview

Abstract

Graph traversal is widely used in map routing, social network analysis, causal discovery and many more applications. Because it is a memory-bound process, graph traversal puts significant pressure on the memory subsystem. Due to poor spatial locality and the increasing size of today’s datasets, graph traversal consumes an ever-larger part of application execution time. One way to mitigate this cost is memory prefetching, which issues requests from the processor to the memory in anticipation of needing certain data. However, traditional prefetching does not work well for graph traversal due to data dependencies, the parallel nature of graphs and the need to move vast amounts of data from memory to the caches. In this paper, we propose a compressed sparse row representation-based graph accelerator on the Hybrid Memory Cube (HMC), called CGAcc. CGAcc combines Compressed Sparse Row (CSR) graph representation with in-memory prefetching and processing to improve the performance of graph traversal. Our approach integrates the prefetching and processing in the logic layer of a 3D stacked Dynamic Random-Access Memory (DRAM) architecture, based on Micron’s HMC. We selected HMC to implement CGAcc because it can provide quite high bandwidth and low access latency. Furthermore, this device has multiple DRAM layers connected to internal logic to control memory access and perform rudimentary computation. Using the CSR representation, CGAcc deploys prefetchers in the HMC to exploit the short transaction latency between the logic and DRAM layers. By doing this, it can also avoid large data movement costs. In the runtime, CGAcc pipelines the prefetching to fetch data from DRAM arrays to improve memory-level parallelism. To further reduce the access latency, several optimized internal caches are also introduced to hold the prefetched data to be Processed In-Memory (PIM). A comprehensive evaluation shows the effectiveness of CGAcc. Experimental results showed that, compared to a conventional HMC main memory equipped with a stream prefetcher, CGAcc achieved an average 3.51× speedup with moderate hardware cost.


Share

Citation/Export:
Social Networking:
Share |

Details

Item Type: Article
Status: Published
Creators/Authors:
CreatorsEmailPitt UsernameORCID
Qian, Cheng
Childers, Brucechilders@cs.pitt.edu
Huang, Libo
Guo, Hui
Wang, Zhiying
Date: 7 November 2018
Date Type: Publication
Journal or Publication Title: Electronics
Volume: 7
Number: 11
Publisher: MDPI AG
Page Range: p. 307
DOI or Unique Handle: 10.3390/electronics7110307
Schools and Programs: School of Computing and Information > Computer Science
Refereed: Yes
Uncontrolled Keywords: compressed sparse row, graph traversal, hybrid memory cube
ISSN: 2079-9292
Official URL: http://dx.doi.org/10.3390/electronics7110307
Funders: National Natural Science Foundation of China, YESSunder
Article Type: Research Article
Date Deposited: 21 Jul 2021 20:04
Last Modified: 21 Jul 2021 20:04
URI: http://d-scholarship.pitt.edu/id/eprint/41448

Metrics

Monthly Views for the past 3 years

Plum Analytics

Altmetric.com


Actions (login required)

View Item View Item