Lekshmi Narayanan, Arun Balajiee
(2020)
Embedding indices and bloom filters in parquet files
for fast Apache arrow retrievals.
Master's Thesis, University of Pittsburgh.
(Unpublished)
This is the latest version of this item.
Abstract
Apache Parquet is a column-major table file format developed for the Hadoop ecosystem, with support for data compression. Hadoop SQL engines process queries like relational databases, but they read the parquet file to retrieve data. Reading takes time and needs to be optimized: I/O that is irrelevant to a query must be avoided for faster reads. The file is organized in rows segmented serially per column, and each column is segmented serially into DataPages. Two indices were proposed, namely ColumnIndex (storing DataPage minimum and maximum values) and OffsetIndex (storing DataPage offsets), which support reading only the DataPages required to retrieve a row and skipping the irrelevant ones. In this thesis, we investigate methods to accelerate row retrieval in parquet files within Apache Arrow, an in-memory big data analytics library that supports fast data processing applications on modern hardware. To this end, we first implement the proposed ColumnIndex and OffsetIndex. We then propose and implement the integration of these indices with Split Block Bloom Filters (SBBF). Our hypothesis is that combining the indices with SBBF should improve overall performance by avoiding unnecessary I/O for queries whose predicate values are not present in the parquet file. We validate this hypothesis through extensive experimentation. Our experiments show that using either the indices or SBBF reduces the average reading time by a factor of 20, and that their combination reduces it by an additional 10%. Adding the indices does not significantly increase the parquet file size, but adding SBBF roughly doubles it. We contribute our code to the Apache Arrow open source project, along with a conceptual design for DataPage-level SBBF for further read optimization.
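The read path described in the abstract can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the thesis code: `PageEntry`, `pages_to_read`, and `hash64` are hypothetical names, one filter per column is assumed, and BLAKE2b stands in for the xxHash64 function that the Parquet Bloom filter specification prescribes. The block selection and the eight salt constants follow that specification as we understand it.

```python
import hashlib
from dataclasses import dataclass

# Salt constants of the Parquet split block Bloom filter (one per 32-bit word).
SALT = (0x47b6137b, 0x44974d91, 0x8824ad5b, 0xa2b7289d,
        0x705495c7, 0x2df1424b, 0x9efc4947, 0x5c6bfb31)


def hash64(value: int) -> int:
    """Stand-in 64-bit hash (assumption: Parquet itself specifies xxHash64)."""
    raw = value.to_bytes(8, "little", signed=True)
    return int.from_bytes(hashlib.blake2b(raw, digest_size=8).digest(), "little")


class SplitBlockBloomFilter:
    """Bloom filter split into 256-bit blocks of eight 32-bit words."""

    def __init__(self, num_blocks: int):
        self.num_blocks = num_blocks
        self.blocks = [[0] * 8 for _ in range(num_blocks)]

    def _block_and_mask(self, value: int):
        h = hash64(value)
        # Upper 32 bits pick the block, lower 32 bits pick one bit per word.
        block = self.blocks[((h >> 32) * self.num_blocks) >> 32]
        low = h & 0xFFFFFFFF
        mask = [1 << (((low * salt) & 0xFFFFFFFF) >> 27) for salt in SALT]
        return block, mask

    def insert(self, value: int) -> None:
        block, mask = self._block_and_mask(value)
        for i, bit in enumerate(mask):
            block[i] |= bit

    def might_contain(self, value: int) -> bool:
        block, mask = self._block_and_mask(value)
        return all(block[i] & bit for i, bit in enumerate(mask))


@dataclass
class PageEntry:
    min_value: int  # ColumnIndex: smallest value in the DataPage
    max_value: int  # ColumnIndex: largest value in the DataPage
    offset: int     # OffsetIndex: where the DataPage starts (illustrative)


def pages_to_read(pages, bloom, target):
    """Return offsets of the DataPages that may hold `target`."""
    if not bloom.might_contain(target):
        return []  # value is definitely absent: no page I/O at all
    return [p.offset for p in pages if p.min_value <= target <= p.max_value]


# Toy column with three DataPages.
column = [[1, 5, 9], [12, 15, 20], [22, 30, 41]]
bloom = SplitBlockBloomFilter(num_blocks=4)
pages = []
for i, page in enumerate(column):
    for v in page:
        bloom.insert(v)
    pages.append(PageEntry(min(page), max(page), offset=i * 1024))

print(pages_to_read(pages, bloom, 15))   # only the second page qualifies
print(pages_to_read(pages, bloom, 100))  # no page qualifies
```

A Bloom filter never reports a false negative, so a present value always reaches the min/max check. An absent value is usually rejected before any DataPage is touched, and a false positive only costs the pages whose range happens to cover it.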
Details
Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators/Authors:
Creators | Email | Pitt Username | ORCID
---|---|---|---
Lekshmi Narayanan, Arun Balajiee | arl122@pitt.edu | arl122 |
ETD Committee:
Date: 20 August 2020
Date Type: Publication
Defense Date: 30 July 2020
Approval Date: 20 August 2020
Submission Date: 7 August 2020
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 81
Institution: University of Pittsburgh
Schools and Programs: School of Computing and Information > Computer Science
Degree: MS - Master of Science
Thesis Type: Master's Thesis
Refereed: Yes
Uncontrolled Keywords: PittCS Arrow, Apache Parquet, parquet, Apache Arrow, Impala, Big Data, ColumnIndex-OffsetIndex, Index, row retrieval, databases, bloom filters, split block bloom filters
Date Deposited: 20 Aug 2020 19:05
Last Modified: 05 Oct 2020 16:26
URI: http://d-scholarship.pitt.edu/id/eprint/39587