Contributions to Semantic Textual Similarity Algorithms

Vo, Ngoc Phuoc An (2016) Contributions to Semantic Textual Similarity Algorithms. PhD thesis, University of Trento.

PDF - Doctoral Thesis
Restricted to Repository staff only until 9999.

4Mb

Abstract

Similarity plays a central role in the process of language understanding. However, it is difficult to define precisely which similarity metrics can be applied to which types of data in order to assess the similarity of two texts. In this spirit, the Semantic Textual Similarity (STS) task was introduced as a pilot task at the Semantic Evaluation (SemEval) workshop in 2012. This thesis investigates how the performance of STS systems varies across heterogeneous data sources, and seeks solutions that reduce this variation and improve system performance. We carry out a series of studies addressing different aspects of measuring semantic similarity for texts under the umbrella of the Semantic Textual Similarity task. First, we analyze the variance of system performance on different corpora through preliminary experiments and propose the hypothesis that system performance depends heavily on the type of training and test corpora, which come from heterogeneous sources. We analyze a standard textual similarity model built on vectorial representation and derive a couple of modalities that significantly alleviate the negative influence of the vectorial mapping model. In particular, we study how structural information and the most advanced word alignment models in Machine Translation improve the accuracy of systems. Our analysis also leads us to carry out, for the first time, an analysis of the relationship between Semantic Relatedness and Textual Entailment; we then propose a co-learning model that improves accuracy on both tasks by exploiting their mutual relationship. Together, these steps lead to a consistent improvement over the standard model across corpora. The evaluation shows that our system systematically matches and surpasses the previous state of the art, while also reducing the variation in accuracy across different types of corpora.

Item Type: Doctoral Thesis (PhD)
Doctoral School: Information and Communication Technology
PhD Cycle: 28
Subjects: Area 01 - Scienze matematiche e informatiche > INF/01 INFORMATICA
Funders: Fondazione Bruno Kessler
Repository Staff approval on: 03 May 2016 09:15
