Three Case Studies For Understanding, Measuring and Using a Compound Notion of Data Quality With Emphasis on the data Staleness Dimension

Chayka, Oleksiy (2012) Three Case Studies For Understanding, Measuring and Using a Compound Notion of Data Quality With Emphasis on the data Staleness Dimension. PhD thesis, University of Trento.

PDF - Doctoral Thesis
4Mb

Abstract

By its nature, the term “data quality”, with its generic meaning of “fitness for use”, has both subjective and objective aspects. There are numerous methodologies and techniques to evaluate its subjective parts and to measure its objective parts. However, none of them is uniform enough to be applied across diverse real-world applications. Nor, in fact, can such a uniform approach be created, since data quality penetrates too deeply into business operations to allow a “silver bullet” for all of them: it normally extends from the representation of real-world entities or their properties as data in an information system, to data processing and delivery to consumers. In this work, we consider three real-world use cases which entirely or partially cover those areas of the data quality scope. In particular, we study the following problems: 1) how the quality of data can be defined and propagated to customers in a business intelligence application for quality-aware decision making; 2) how data quality can be defined, measured and used in a web-based system operating with semi-structured data coming from and intended for both humans and machines; 3) how a data-driven (vs. system-driven) time-related data quality notion of staleness can be defined, efficiently measured and monitored in a generic information system. Thus, we expand the corresponding state of the art with Application, System and Dimension aspects of data quality. In the Application context, we propose a quality-aware architecture for a typical business intelligence application in a healthcare environment. We demonstrate the potential implications of quality issues, including intra- and inter-dimensional quality dependencies, that can affect data from early processing stages up to the reporting level. In the part dedicated to the System, we demonstrate an approach to understanding, measuring and disseminating data quality measurement results in the context of a web-based system called Entity Name System (ENS).
On the Dimension side, we propose a definition of data staleness in accordance with key requirements for time-related quality metrics, building on similar notions elaborated by earlier researchers. We demonstrate an approach to measuring data staleness with different statistical methods, including exponential smoothing. In our experiments, we compare their space efficiency and their accuracy in predicting data update instants, using the update history of a sample of representative Wikipedia articles.
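
As an illustration only, the idea of predicting update instants with exponential smoothing might be sketched as follows. The abstract does not give the thesis's actual staleness definition or parameters, so the function names, the smoothing factor `alpha`, and the rule "possibly stale once the predicted next update has passed" are all assumptions made for this example.

```python
from typing import Sequence


def smoothed_interval(update_times: Sequence[float], alpha: float = 0.5) -> float:
    """Simple exponential smoothing over the gaps between consecutive updates.

    `alpha` is a hypothetical smoothing factor; it is not taken from the thesis.
    """
    if len(update_times) < 2:
        raise ValueError("need at least two update instants")
    gaps = [later - earlier
            for earlier, later in zip(update_times, update_times[1:])]
    estimate = gaps[0]  # initialise with the first observed gap
    for gap in gaps[1:]:
        estimate = alpha * gap + (1 - alpha) * estimate
    return estimate


def predict_next_update(update_times: Sequence[float], alpha: float = 0.5) -> float:
    """Predicted instant of the next update: last update plus the smoothed gap."""
    return update_times[-1] + smoothed_interval(update_times, alpha)


def is_possibly_stale(update_times: Sequence[float], now: float,
                      alpha: float = 0.5) -> bool:
    """Assumed rule: flag the data once the predicted next update has passed."""
    return now > predict_next_update(update_times, alpha)
```

This sketch recomputes the estimate from the full update history for clarity. An incremental variant would need to keep only the last update instant and the current smoothed gap, which is what makes exponential smoothing attractive when space efficiency matters.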

Item Type: Doctoral Thesis (PhD)
Doctoral School: Information and Communication Technology
PhD Cycle: XXIII
Subjects: Area 01 - Scienze matematiche e informatiche > INF/01 INFORMATICA
Repository Staff approval on: 13 Feb 2013 10:30
