Methods and techniques for efficient processing of aggregated data
Yang, Fan (2022) Methods and techniques for efficient processing of aggregated data. Doctoral Dissertation, University of Pittsburgh. (Unpublished)
Abstract
With the explosion of information, massive amounts of data are generated daily from different sources. Because infrastructure and human capacity for data integration are limited and processing must be efficient, some data, especially historical data, are stored in aggregated form at different levels of aggregation. For example, epidemiological data may be preserved as monthly counts of infected people. Meanwhile, data analysis and machine learning models often require detailed knowledge of the data for accurate analysis and prediction. This information must be obtained either from the original data or from the aggregated data. Motivated by this challenge, this thesis aims to facilitate the generation and utilization of aggregated data in three respects: 1) reconstructing higher-resolution time series from aggregated data with acceptable performance; 2) selecting aggregated data for analysis with minimal loss of performance; 3) generating aggregated data for future studies with less information loss. Most data reconstruction methods utilize domain knowledge, e.g., smoothness, periodicity, or sparsity, to improve reconstruction accuracy. However, domain knowledge is limited and may be inaccurate in many applications, which leads to worse reconstruction. To tackle this, I present two advanced methods: 1) ARES, which performs data reconstruction by automatically discovering patterns in the time series using the annihilating filter technique; 2) TURBOLIFT, which aims to improve the quality of any existing disaggregation method by refining its initial reconstruction. Although reconstruction provides a detailed view of the data, its performance may vary depending on the data aggregation level, and it incurs extra computational cost. Moreover, in some cases, analyzing coarse data may be sufficient to achieve acceptable accuracy.
Therefore, I propose SMARTPROGNOSIS to automatically suggest aggregation levels that maximize performance under specific machine learning models. It is noteworthy that most aggregation methods suffer information loss as the aggregation level increases. This results in lossy aggregated data; e.g., with annual counts, it is hard to capture the detailed trend during the year. To address this drawback, I propose IAGG, which aggregates data by emphasizing the critical information in the original data.