
Embedding indices and bloom filters in parquet files for fast Apache arrow retrievals

Lekshmi Narayanan, Arun Balajiee (2020) Embedding indices and bloom filters in parquet files for fast Apache arrow retrievals. Master's Thesis, University of Pittsburgh. (Unpublished)

This is the latest version of this item.



Apache Parquet is a column-major table file format developed for the Hadoop ecosystem, with support for data compression. Hadoop SQL engines process queries like relational databases but read the parquet file to retrieve data. The caveat is that reading takes time and needs to be optimized: I/O that is irrelevant to a query must be avoided for faster reads. The rows in the file are segmented serially per column, and each column segment is in turn segmented serially into DataPages. Two indices have been proposed, namely ColumnIndex (storing the minimum and maximum values of each DataPage) and OffsetIndex (storing DataPage offsets). Together they make it possible to read only the DataPages required to retrieve a row and to skip the irrelevant ones. In this thesis, we investigate methods to accelerate row retrieval in parquet files within Apache Arrow, an in-memory big data analytics library that supports fast data processing applications on modern hardware. To this end, we first implement the proposed ColumnIndex and OffsetIndex. We then propose and implement the integration of these indices with Split Block Bloom Filters (SBBF). Our hypothesis is that a combination of the indices and SBBF should enhance overall performance by avoiding unnecessary I/O in queries whose predicate values are not present in the parquet file. We validate our hypothesis through extensive experimentation. Our experiments show that using either the indices or SBBF reduces the average reading time by 20x, and that their combination reduces the average reading time by an additional 10%. Adding the indices does not significantly increase the parquet file size, but adding SBBF increases the parquet file size by approximately 2x. We contribute our code to the Apache Arrow open-source project, along with a conceptual design for DataPage-level SBBF for further read optimization.
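The page-skipping idea described in the abstract can be illustrated with a small, self-contained Python sketch. It is not the thesis code and does not use the Apache Arrow or Parquet APIs: `PageStats`, `ToyBloomFilter`, `build_pages`, and `pages_to_read` are hypothetical names introduced only for this illustration, and the filter is a plain Bloom filter standing in for the split-block variant.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterator, List, Sequence, Tuple


@dataclass
class PageStats:
    """Hypothetical stand-in for one ColumnIndex + OffsetIndex entry."""
    min_value: int   # smallest value stored in the page (ColumnIndex role)
    max_value: int   # largest value stored in the page (ColumnIndex role)
    offset: int      # position of the page's first value (OffsetIndex role)


class ToyBloomFilter:
    """Plain Bloom filter; an assumption, NOT the split-block variant."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value: int) -> Iterator[int]:
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: int) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: int) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def build_pages(values: Sequence[int],
                page_size: int) -> Tuple[List[List[int]], List[PageStats]]:
    """Split a column into fixed-size pages and record per-page statistics."""
    pages: List[List[int]] = []
    stats: List[PageStats] = []
    for start in range(0, len(values), page_size):
        page = list(values[start:start + page_size])
        pages.append(page)
        stats.append(PageStats(min(page), max(page), start))
    return pages, stats


def pages_to_read(target: int, stats: Sequence[PageStats],
                  bloom: ToyBloomFilter) -> List[int]:
    """Return the indices of the pages that could contain `target`."""
    if not bloom.might_contain(target):
        return []  # value is certainly absent, so no page needs to be read
    return [i for i, s in enumerate(stats)
            if s.min_value <= target <= s.max_value]


values = list(range(0, 100, 2))            # even numbers 0..98
pages, stats = build_pages(values, page_size=10)
bloom = ToyBloomFilter()
for v in values:
    bloom.add(v)

print(pages_to_read(42, stats, bloom))     # [2]
print(pages_to_read(1000, stats, bloom))   # []
```

In this sketch the filter answers "definitely absent" for most values that were never inserted, so no page is touched at all, while the minimum and maximum statistics narrow a possible hit down to the pages whose range covers the predicate value. A Bloom filter can return false positives, which is why the range check still runs after it.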




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators:
  • Lekshmi Narayanan, Arun Balajiee (Email: arl122@pitt.edu; Pitt Username: arl122)
ETD Committee:
  • Committee Chair: Chrysanthis, Panos
  • Committee CoChair: Costa,
  • Committee Member: Mosse,
  • Committee Member: Labrinidis,
Date: 20 August 2020
Date Type: Publication
Defense Date: 30 July 2020
Approval Date: 20 August 2020
Submission Date: 7 August 2020
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 81
Institution: University of Pittsburgh
Schools and Programs: School of Computing and Information > Computer Science
Degree: MS - Master of Science
Thesis Type: Master's Thesis
Refereed: Yes
Uncontrolled Keywords: PittCS Arrow, Apache Parquet, parquet, Apache Arrow, Impala, Big Data, ColumnIndex-OffsetIndex, Index, row retrieval, databases, bloom filters, split block bloom filters
Date Deposited: 20 Aug 2020 19:05
Last Modified: 05 Oct 2020 16:26

Available Versions of this Item

  • Embedding indices and bloom filters in parquet files for fast Apache arrow retrievals. (deposited 20 Aug 2020 19:05) [Currently Displayed]

