Indexing for Very Large Data Series Collections

Zoumpatianos, Konstantinos (2016) Indexing for Very Large Data Series Collections. PhD thesis, University of Trento.

[img]PDF (Indexing for Very Large Data Series Collections) - Doctoral Thesis
Restricted to Repository staff only until 28 November 2018.

4Mb
[img]PDF - Doctoral Thesis
Restricted to Repository staff only

522Kb

Abstract

Data series are a prevalent data type that has attracted lots of interest in recent years. Specifically, there has been an explosive interest towards the analysis of large volumes of data series in many different domains. This is both in businesses (e.g., in mobile applications) and in sciences e.g., in biology). In several time-critical scenarios, analysts need to be able to explore these data as soon as they become available, which is not currently possible for very large data series collections. In this thesis, we present the first adaptive indexing mechanism, specifically tailored to solve the problem of indexing and querying very large data series collections. The main idea is that instead of building the complete index over the complete data set up-front and querying only later, we interactively and adaptively build parts of the index, only for the parts of the data on which the users pose queries. The contents and the resolution of the index are purely driven by query patterns; the more queries that arrive, the more data series are indexed and at a higher resolution. Adaptive indexing significantly outperforms previous solutions, gracefully handling large data series collections, reducing the data to query delay: by the time state-of-the-art indexing techniques finish indexing 1 billion data series (and before answering even a single query), our method has already answered 3 * 10^5 queries. At the same time, we present novel algorithms for both full indexing of data series collections, as well as for efficient exact query answering. Our algorithms perform efficient skip-sequential scans of the data, avoiding the need of costly random accesses on the disk. Moreover, up to this point very little attention has been paid to properly evaluating data series index structures, with most previous work relying solely on randomly selected data series to use as queries (with/without adding noise). In this thesis, we show that random workloads are inherently not suitable for the task at hand and we argue that there is a need for carefully generating a query workload. We define measures that capture the characteristics of queries, and we propose a method for generating workloads with the desired properties, that is, effectively evaluating and comparing data series summarizations and indexes. In our experimental evaluation, with carefully controlled query workloads, we shed light on key factors affecting the performance of nearest neighbor search in large data series collections. Finally, apart from ad hoc data exploration, we also investigate methods for the systematic analysis of very large data series collections, supporting business intelligence applications. We present techniques, which borrow ideas from Strategic Management, for a goal-oriented analysis of large collections of performance indicator data series. Such algorithms can additionally be sped up through the use of the index structures presented in this work.

Item Type:Doctoral Thesis (PhD)
Doctoral School:Information and Communication Technology
PhD Cycle:27
Subjects:Area 01 - Scienze matematiche e informatiche > INF/01 INFORMATICA
Uncontrolled Keywords:Data series, time series, indexing, adaptive indexing, data exploration, data mining, similarity search, nearest neighbors
Funders:FP7 EU ERC; FP7 EU IP
Repository Staff approval on:29 Nov 2016 09:40

Repository Staff Only: item control page