A Semi-supervised Approach for Improving Search, Navigation and Data Quality in Autonomous Digital Libraries

Krapivin, Mikalai (2010) A Semi-supervised Approach for Improving Search, Navigation and Data Quality in Autonomous Digital Libraries. PhD thesis, University of Trento.



The rapid uptake of Autonomous Digital Libraries (ADLs), in both scholarly and generalist domains, has driven the need for automated procedures for extracting, processing and representing the digital information contained in these repositories. Concurrently, the development of Web 2.0 technologies and applications has created new opportunities and challenges for web-based information systems and user interaction: novel forms of social interaction, as well as new usability and visualization models, are changing how people search, navigate and rank digital objects (whether multimedia content such as music or movies or, as in the case of Autonomous Digital Libraries, digital publications). For example, a quick glance at a user-based "cloud of tags" for a given journal can now reveal a great deal about its content and about how that journal is perceived by its readers. In this thesis we tackle two open research problems in state-of-the-art Autonomous Digital Libraries, namely: 1. Automated extraction of key-phrases/tags from digital scientific contributions (papers): the ever-increasing scale of modern ADLs makes manual document processing unrealistic and calls for efficient, high-quality methods for automatic or semi-automatic key-phrase extraction. 2. Exploration of current metrics, and proposal of novel metrics, for ranking digital objects in order to improve navigation among the large number of objects present in a modern ADL. Moreover, we have implemented a prototype Digital Library interface that integrates the tools developed on the basis of the results obtained in these two research directions.
The prototype supports the user: (1) in searching for documents related to a topic, using the novel semi-automatic key-phrase extraction techniques proposed in this work; and (2) in navigating and identifying "relevant" documents for a given topic, based on a number of user-selectable relevance metrics. The first challenge we met in our work was the lack of large, high-quality, publicly available document datasets containing both the full text and key-phrases assigned by human experts, to be used for analytical assessment. We therefore constructed one from available public content sources and curated metadata repositories. Our dataset, referred to in the following as the Trento Computer Science (TCS) dataset, is a subset of 2000 scientific papers in the Computer Science domain published between 2003 and 2006 in the ACM Digital Library. The TCS dataset contains the full text of the papers, curated metadata (authors, title, affiliations, references etc.) and key-phrases assigned by humans (both authors and curators). The original paper formats (typically PDF, PS or LaTeX) were converted into plain text with a commercial PDF-to-text conversion tool, and the output was refined with maximum entropy machine learning in order to improve the final quality of the full texts. For the semi-supervised key-phrase/tag extraction task we compared several Machine Learning techniques on the same TCS dataset: Random Forest, Support Vector Machines, a novel Fast Local Kernel SVM, and the Naive Bayes learning-based system KEA. In particular, we performed a number of experiments and explored in detail the effect of including linguistic and domain-specific knowledge in the chosen feature sets.
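As a rough illustration of the kind of per-candidate features a key-phrase classifier can be trained on, the sketch below computes two features commonly associated with KEA-style extraction: TF×IDF and the relative position of a phrase's first occurrence. It is an assumption-laden sketch, not the thesis implementation: the function names `candidates` and `phrase_features`, the regex tokenizer, the IDF smoothing and the toy corpus are all hypothetical, and it omits stemming, stop-word filtering and the linguistic and domain-specific features studied in the thesis.

```python
import math
import re
from collections import Counter


def candidates(text, max_len=3):
    """Return ([(phrase, start_token_index), ...], token_count) for all word n-grams."""
    tokens = re.findall(r"[a-z]+", text.lower())
    found = []
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            if i + n <= len(tokens):
                found.append((" ".join(tokens[i:i + n]), i))
    return found, len(tokens)


def phrase_features(doc, corpus, max_len=3):
    """Map each candidate phrase of `doc` to (tf_idf, relative_first_occurrence)."""
    cands, n_tokens = candidates(doc, max_len)
    term_freq = Counter(phrase for phrase, _ in cands)

    # Index of the first token of the earliest occurrence of each phrase.
    first_pos = {}
    for phrase, start in cands:
        first_pos.setdefault(phrase, start)

    # Phrase sets per corpus document, used for document frequency.
    corpus_sets = [{p for p, _ in candidates(d, max_len)[0]} for d in corpus]
    n_docs = len(corpus)

    features = {}
    for phrase, count in term_freq.items():
        doc_freq = sum(1 for s in corpus_sets if phrase in s)
        # Smoothed IDF (an assumption; other variants are equally valid).
        idf = math.log((1 + n_docs) / (1 + doc_freq))
        features[phrase] = (count / n_tokens * idf, first_pos[phrase] / n_tokens)
    return features


# Hypothetical toy corpus; the last document is the one being analysed.
toy_corpus = [
    "random forest classifier",
    "support vector machines",
    "random forest key phrase extraction with random forest",
]
toy_features = phrase_features(toy_corpus[2], toy_corpus)
```

Feature vectors of this kind, labelled by whether a candidate matches a human-assigned key-phrase, could then be passed to any of the classifiers compared above.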
In our experiments, Random Forest was identified as the most precise method, outperforming KEA (used as the baseline for key-phrase extraction) by 36% when using a novel feature set that includes linguistic and domain-specific knowledge. Moreover, compared with the other Machine Learning techniques, Random Forest offers the best trade-off between accuracy and computational speed. The second task addressed by this thesis, navigation and ranking, relates to the large scale of current ADLs. In the presence of a huge quantity of documents connected to a specific topic, it is hard to navigate and find "interesting" contributions. The challenge here is to identify the most important set of papers on a specific topic or by a particular author. A number of metrics are in common use for this task, namely citation count for papers and the Hirsch index (h-index) for authors. We applied them alongside novel metrics based on PageRank, named PaperRank, Focused PaperRank and PaperRank h-index, which capture, where data is available, more of the information present in complete citation graphs. As part of our analysis, we developed methods and tools for qualitatively and quantitatively analyzing metrics that evaluate content and people. We used them to explore the differences between the various metrics and to understand in more detail what they measure. We also believe that these methods and tools could be used successfully to compare rankings in other domains (search engines, review processes, etc.). We carried out an extensive investigation of the various ranking metrics on a dataset of over 266,000 scientific papers and the related citation graphs. We found that the difference in ranking results between metrics is indeed very significant, and investigated in detail the reasons for this difference.
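To make the contrast between citation count and PageRank-style metrics concrete, the sketch below computes the h-index from a list of citation counts and a plain PageRank by power iteration over a toy citation graph. This is an illustrative sketch only: the abstract does not define PaperRank, Focused PaperRank or PaperRank h-index, so standard PageRank is used as a stand-in, and the function names, the damping factor of 0.85, the iteration count and the four-paper graph are all assumptions.

```python
def h_index(citation_counts):
    """Largest h such that at least h items have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
        else:
            break
    return h


def pagerank(cites, damping=0.85, iterations=100):
    """Plain PageRank on a citation graph.

    `cites` maps each paper to the list of papers it cites. Papers with no
    outgoing citations spread their rank uniformly over all papers.
    """
    papers = set(cites) | {q for refs in cites.values() for q in refs}
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        dangling = sum(rank[p] for p in papers if not cites.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in papers}
        for p, refs in cites.items():
            for q in refs:
                new_rank[q] += damping * rank[p] / len(refs)
        rank = new_rank
    return rank


# Hypothetical toy graph: A and B cite C, and C cites D.
toy_cites = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}

# Citation count: how many papers cite each paper.
citation_count = {p: 0 for p in toy_cites}
for refs in toy_cites.values():
    for q in refs:
        citation_count[q] += 1

toy_rank = pagerank(toy_cites)
```

In this toy graph, C has the highest citation count, yet D receives the highest PageRank score because its single citation comes from the well-cited paper C. This is one simple way in which the two families of metrics can order the same papers differently.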
Although this research started as an independent line of work, it has found a significant number of important interactions with the European FET-Open project LiquidPub. A prototype plug-in for tagged search and ranking has been incorporated into the LiquidPub portal http://demo.liquidpub.org:8081/ResevalGUI/, where it supports tagged search and ranking over the whole collection of 266,000 papers.

Item Type: Doctoral Thesis (PhD)
Doctoral School: Information and Communication Technology
PhD Cycle: XXII
Subjects: Area 01 - Scienze matematiche e informatiche > INF/01 INFORMATICA
Repository Staff approval on: 03 May 2010 14:45
