Embedding indices and bloom filters in parquet files for fast Apache arrow retrievals

Lekshmi Narayanan, Arun Balajiee (2020) Embedding indices and bloom filters in parquet files for fast Apache arrow retrievals. Master's Thesis, University of Pittsburgh. (Unpublished)

This is the latest version of this item.

Abstract

Apache Parquet is a column-major table file format developed for the Hadoop ecosystem, with support for data compression. Hadoop SQL engines process queries like relational databases but read the parquet file to retrieve data. The caveat is that reading takes time and needs to be optimized: I/O that is irrelevant to a query must be avoided for faster reads. Rows in the file are segmented serially by column, and each column segment is in turn segmented serially into DataPages. Two indices were proposed, namely ColumnIndex (storing DataPage minimum and maximum values) and OffsetIndex (storing DataPage offsets), which support reading only the required DataPages when retrieving a row and skipping irrelevant DataPages. In this thesis, we investigate methods to accelerate row retrieval in parquet files within Apache Arrow, an in-memory big data analytics library that supports fast data processing applications on modern hardware. To this end, we first implement the proposed ColumnIndex and OffsetIndex. We then propose integrating the indices with Split Block Bloom Filters (SBBF) and implement this integration. Our hypothesis is that a combination of the indices and SBBF should enhance overall performance by avoiding unnecessary I/O in queries with predicate values not present in the parquet file. We validate our hypothesis through extensive experimentation. Our experiments show that using either indices or SBBF reduces average reading time by 20x. Their combination reduces the average reading time by an additional 10%. Adding indices does not significantly increase the parquet file size, but adding SBBF increases the parquet file size by approximately 2x. We contribute our code to the Apache Arrow open source project, along with a conceptual design for DataPage-level SBBF for further read optimization.
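
The page-skipping idea described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis code and not the Apache Arrow API: PageStats, SimpleBloomFilter, and pages_to_read are hypothetical names, the column values are assumed to be integers, and a plain Bloom filter stands in for a Split Block Bloom Filter. The sketch only shows how a filter can rule out absent predicate values before any page is read, and how per-page minimum and maximum values narrow the read to candidate DataPages.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class PageStats:
    """Hypothetical stand-in for one ColumnIndex + OffsetIndex entry."""
    min_value: int  # ColumnIndex: smallest value stored in the DataPage
    max_value: int  # ColumnIndex: largest value stored in the DataPage
    offset: int     # OffsetIndex: byte offset of the DataPage in the file


class SimpleBloomFilter:
    """Plain Bloom filter (assumption: used here instead of a real SBBF)."""

    def __init__(self, num_bits: int = 4096, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value: int) -> Iterator[int]:
        # Derive num_hashes bit positions by hashing the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, value: int) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: int) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def pages_to_read(value: int, pages: List[PageStats],
                  bloom: SimpleBloomFilter) -> List[int]:
    """Return the offsets of the DataPages that may hold `value`."""
    # Step 1: the filter can prove absence, in which case no page is read.
    if not bloom.might_contain(value):
        return []
    # Step 2: min/max statistics keep only pages whose range covers the value.
    return [p.offset for p in pages if p.min_value <= value <= p.max_value]


# Toy column holding the even numbers 0..198, split across two DataPages.
pages = [PageStats(0, 98, offset=0), PageStats(100, 198, offset=4096)]
bloom = SimpleBloomFilter()
for v in range(0, 200, 2):
    bloom.add(v)

print(pages_to_read(42, pages, bloom))    # present value: one candidate page
print(pages_to_read(1000, pages, bloom))  # outside every page range: no pages
```

In this toy setup, a value that falls inside a page's range but is not stored (for example, an odd number) is where the filter helps: the min/max check alone would still select a page, while the filter can usually report the value as absent. Because Bloom filters allow false positives, such a lookup may occasionally still read a page, but a stored value is never skipped.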

Details

Item Type:

University of Pittsburgh ETD

Status:

Unpublished

Creators/Authors:

Creators	Email	Pitt Username	ORCID
Lekshmi Narayanan, Arun Balajiee	arl122@pitt.edu	arl122

ETD Committee:

Title	Member	Email Address
Committee Chair	Chrysanthis, Panos K.	panos@pitt.edu
Committee CoChair	Costa, Constantinos	costa.c@pitt.edu
Committee Member	Mosse, Daniel	mosse@pitt.edu
Committee Member	Labrinidis, Alexandros	labrinid@pitt.edu

Date:

20 August 2020

Date Type:

Publication

Defense Date:

30 July 2020

Approval Date:

20 August 2020

Submission Date:

7 August 2020

Access Restriction:

No restriction; Release the ETD for access worldwide immediately.

Number of Pages:

Institution:

University of Pittsburgh

Schools and Programs:

School of Computing and Information > Computer Science

Degree:

MS - Master of Science

Thesis Type:

Master's Thesis

Refereed:

Yes

Uncontrolled Keywords:

PittCS Arrow, Apache Parquet, parquet, Apache Arrow, Impala, Big Data, ColumnIndex-OffsetIndex, Index, row retrieval, databases, bloom filters, split block bloom filters

Date Deposited:

20 Aug 2020 19:05

Last Modified:

05 Oct 2020 16:26

URI:

http://d-scholarship.pitt.edu/id/eprint/39587
