
Methods and techniques for efficient processing of aggregated data

Yang, Fan (2022) Methods and techniques for efficient processing of aggregated data. Doctoral Dissertation, University of Pittsburgh. (Unpublished)



With the explosion of information, massive amounts of data are generated daily from many sources. Because the infrastructure and human capacity for data integration are limited and processing must be efficient, some data, especially historical data, are stored in aggregated form at different levels of aggregation. For example, epidemiological data may preserve only monthly counts of infected people. Meanwhile, data analysis and machine learning models often require detailed knowledge of the data for accurate analysis and prediction. This information must be obtained from either the original or the aggregated data.

Motivated by this challenge, this thesis aims to facilitate the generation and utilization of aggregated data in three ways: 1) reconstructing higher-resolution time series from aggregated data with acceptable performance; 2) selecting aggregated data for analysis with minimal loss of performance; 3) generating aggregated data for future studies with less information loss.

Most data reconstruction methods use domain knowledge, e.g., smoothness, periodicity, or sparsity, to improve reconstruction accuracy. However, domain knowledge is limited and may be inaccurate in many applications, which degrades the reconstruction. To tackle this, I present two advanced methods: 1) ARES, which performs data reconstruction by automatically discovering patterns in the time series using the annihilating filter technique; 2) TURBOLIFT, which aims to improve the quality of any existing disaggregation method by refining its initial reconstruction.
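To make the notion of reconstruction from aggregated data concrete, the following is a minimal sketch of the kind of domain-knowledge baseline mentioned above: a least-squares disaggregation with a smoothness penalty. It is not ARES or TURBOLIFT, whose details are not given here. The function names, the non-overlapping window model, and the penalty weight `lam` are all illustrative assumptions.

```python
import numpy as np


def aggregation_matrix(n_fine, window):
    """Build A so that A @ x sums consecutive, non-overlapping windows of x.

    Assumption: aggregates are plain sums over windows of equal length.
    """
    n_coarse = n_fine // window
    A = np.zeros((n_coarse, n_fine))
    for i in range(n_coarse):
        A[i, i * window:(i + 1) * window] = 1.0
    return A


def reconstruct_smooth(y, n_fine, window, lam=1.0):
    """Estimate a fine-grained series x from aggregated values y.

    Solves  min_x ||A x - y||^2 + lam * ||D x||^2,
    where D is the first-difference operator (a smoothness prior).
    """
    A = aggregation_matrix(n_fine, window)
    # Each row of D has -1 and +1 in adjacent columns: (D x)[i] = x[i+1] - x[i].
    D = np.diff(np.eye(n_fine), axis=0)
    # Stack both terms into a single ordinary least-squares problem.
    M = np.vstack([A, np.sqrt(lam) * D])
    b = np.concatenate([np.asarray(y, dtype=float), np.zeros(n_fine - 1)])
    x, *_ = np.linalg.lstsq(M, b, rcond=None)
    return x


if __name__ == "__main__":
    # Hypothetical data: 12 fine-grained values reported as 4 sums of 3.
    fine = np.arange(12, dtype=float)
    coarse = aggregation_matrix(12, 3) @ fine
    print(reconstruct_smooth(coarse, n_fine=12, window=3, lam=0.1))
```

A sketch like this recovers the fine-grained series well only when the smoothness assumption actually holds, which is the limitation of fixed domain knowledge that the paragraph above describes.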

Although reconstruction provides a detailed view of the data, its performance may vary with the data aggregation level, and it incurs extra computational cost. Moreover, in some cases, analyzing coarse data may be sufficient to achieve acceptable accuracy. Therefore, I propose SMARTPROGNOSIS, which automatically suggests aggregation levels that maximize performance under specific machine learning models.

Most aggregation methods lose information as the aggregation level increases, which results in lossy aggregated data. For example, with annual counts it is hard to capture the detailed trend during the year. To tackle this drawback, I propose IAGG, which aggregates data while emphasizing the critical information in the original data.




Item Type: University of Pittsburgh ETD
Status: Unpublished
Creators: Yang, Fan (Email: fay28@pitt.edu; Pitt Username: FAY28)
ETD Committee:
Committee Chair: Zadorozhny, Vladimir (Email: vladimirz@gmail.com; Pitt Username: viz)
Committee Member: Faloutsos,
Committee Member: Munro,
Committee Member: Pelechrinis,
Date: 2 June 2022
Date Type: Publication
Defense Date: 18 March 2022
Approval Date: 2 June 2022
Submission Date: 20 April 2022
Access Restriction: No restriction; Release the ETD for access worldwide immediately.
Number of Pages: 109
Institution: University of Pittsburgh
Schools and Programs: School of Computing and Information > Information Science
Degree: PhD - Doctor of Philosophy
Thesis Type: Doctoral Dissertation
Refereed: Yes
Uncontrolled Keywords: Data Disaggregation, Data Navigation, Data Summarization
Date Deposited: 02 Jun 2022 21:11
Last Modified: 02 Jun 2022 21:11

