Learning to merge - language and vision: A deep evaluation of the encoder, the role of the two modalities, the role of the training task.

Shekhar, Ravi (2019) Learning to merge - language and vision: A deep evaluation of the encoder, the role of the two modalities, the role of the training task. PhD thesis, University of Trento.

Abstract

Most human language understanding is grounded in perception. There is thus growing interest in combining information from language and vision. Multiple models based on neural networks have been proposed to merge language and vision information. All these models share a common backbone consisting of an encoder that learns to merge the two types of representation to perform a specific task. While some models have seemed extremely successful on those tasks, it remains unclear how the reported results should be interpreted and what those models are actually learning. Our contribution is three-fold. We have proposed (a) a new model of Visually Grounded Dialogue; (b) a diagnostic dataset to evaluate the encoder's ability to merge visual and language input; (c) a method to evaluate the quality of the multimodal representations computed by the encoder as general-purpose representations. We have proposed and analyzed a cognitively plausible architecture in which dialogue system modules are connected through a common grounded dialogue state encoder. Our in-depth analysis of the dialogues shows the importance of going beyond task success in the evaluation of Visual Dialogues: the dialogues themselves should play a crucial role in such evaluation. We have proposed a diagnostic dataset, FOIL, which consists of images associated with incorrect captions that the model has to detect and correct. Finally, we have used FOIL to evaluate the quality of the multimodal representations produced by an encoder trained on different multimodal tasks. We have shown how the training task affects the stability of the representations, their transferability, and the model's confidence.

Item Type:	Doctoral Thesis (PhD)
Doctoral School:	Information and Communication Technology
PhD Cycle:	31
Subjects:	Area 01 - Mathematics and Computer Science (Scienze matematiche e informatiche) > INF/01 INFORMATICA
Repository Staff approval on:	04 Jun 2019 09:59

Università degli Studi di Trento

Unitn-eprints.PhD
