Exploiting the Volatile Nature of Data and Information in Evolving Repositories and Systems with User Generated Content

Bykau, Siarhei (2013) Exploiting the Volatile Nature of Data and Information in Evolving Repositories and Systems with User Generated Content. PhD thesis, University of Trento.

PDF - Doctoral Thesis
Available under License Creative Commons Attribution Non-commercial.

4MB

Abstract

Modern technological advances have produced a plethora of extremely large, highly heterogeneous, and distributed collections of datasets that are highly volatile. This volatility makes understanding, integrating, and managing them a challenging task. One of the first challenges is to create models that capture not only the changes to the data values but also the semantic evolution of the concepts that the data structures represent. Once this information has been captured, mechanisms are needed to exploit the evolution information in query formulation, reasoning, answering, and representation. Additionally, the continuously evolving nature of the data makes it hard to determine the quality of the data observed at a specific moment, since there is a great deal of uncertainty about whether this information will remain as is. Finally, an important task in this context, known as information filtering, is to match a piece of information that was recently updated in (or added to) a repository to a user or query at hand.

In this dissertation, we propose a novel framework to model and query data with explicit evolution relationships among concepts. As the query language, we present an expressive evolution-graph traversal language, which is tested on a number of real case scenarios, including the history of biotechnology and the corporate history of US companies. To support query evaluation, we introduce an algorithm based on finding Steiner trees in graphs, which computes answers on the fly while taking into account the evolution connections among concepts. To address the problem of data quality in user-generated repositories (e.g., Wikipedia), we present a novel algorithm that detects individual controversies using the substitutions in the revision history of a piece of content. The algorithm groups the disagreements between users by means of a context, i.e., the surrounding content, and by applying custom filters. In an extensive experimental evaluation, we show that the proposed ideas are highly effective on various sources of controversies. Finally, we address the problem of producing recommendations in evolving repositories by focusing on the cold-start problem, i.e., when little or no past information about the users and/or items is available. We present a number of novel algorithms that cope with cold start by leveraging item features using the k-nearest-neighbor classifier, the Naive Bayes classifier, and the maximum entropy principle. The obtained results enable recommender systems to operate in rapidly updated domains such as news, university courses, and social data.

Item Type: Doctoral Thesis (PhD)
Doctoral School: Information and Communication Technology
PhD Cycle: XXIV
Subjects: Area 01 - Scienze matematiche e informatiche > INF/01 INFORMATICA
Repository Staff approval on: 10 Jun 2013 13:50
