Cross-Domain and Cross-Language Porting of Shallow Parsing

Stepanov, Evgeny (2014) Cross-Domain and Cross-Language Porting of Shallow Parsing. PhD thesis, University of Trento.

PDF - Doctoral Thesis
592Kb

Abstract

English has been the main focus of attention of the Natural Language Processing (NLP) community for years. As a result, there are significantly more annotated linguistic resources in English than in any other language. Consequently, data-driven tools for automatic text or speech processing are developed mainly for English. Developing similar corpora and tools for other languages is an important issue; however, it requires a significant amount of effort. Recently, Statistical Machine Translation (SMT) techniques and parallel corpora have been used to transfer annotations from resource-rich languages to resource-poor languages for a variety of NLP tasks, including Part-of-Speech tagging, Noun Phrase chunking, dependency parsing, and textual entailment. This cross-language NLP paradigm relies on the solution of the following sub-problems:

- Data-driven NLP techniques are very sensitive to differences between training and testing conditions. Different domains, such as financial news-wire and biomedical publications, have different distributions of NLP task-specific properties; thus, domain adaptation of the source-language tools (either the development of models with good cross-domain performance or models tuned to the target domain) is critical.
- Another difference between training and testing conditions arises in cross-genre applications, such as written text (monologues) versus spontaneous dialog data. Properties of written text, such as punctuation and the notion of a sentence, are not present in transcriptions of spoken conversation. Thus, style-adaptation techniques that cover a wider range of genres are critical as well.
- The basis of cross-language porting is parallel corpora. Unfortunately, parallel corpora are scarce. Thus, the generation or retrieval of parallel corpora between the languages of interest is important. Additionally, these parallel corpora are most often not in the domains of interest; consequently, cross-language porting should be augmented with SMT domain adaptation techniques.
- Language distance plays an important role within the paradigm, since for language pairs from close families (e.g. the Romance languages Italian and Spanish) the range of linguistic phenomena to consider is significantly smaller than for pairs from distant families (e.g. Italian and Turkish). The developed cross-language techniques should be applicable to both conditions.

In this thesis we address these sub-problems on the complex NLP tasks of Discourse Parsing and Spoken Language Understanding. Both tasks are cast as token-level shallow parsing. Penn Discourse Treebank (PDTB) style discourse parsing is applied cross-domain, and we contribute feature-level domain adaptation techniques for the task. Additionally, we explore PDTB-style discourse parsing on dialog data in Italian and report on the challenges.

The problems of parallel corpora creation, language style adaptation, SMT domain adaptation, and language distance are addressed on the task of cross-language porting of Spoken Language Understanding. This thesis contributes to the task with language-style and domain adaptation techniques for machine translation of spoken conversations using off-the-shelf systems such as Google Translate and SMT systems trained on both out-of-domain and in-domain parallel data. We demonstrate that the techniques are beneficial for both close and distant language pairs. We propose methodologies for the creation of parallel spoken conversation corpora via professional translation services that consider speech phenomena such as disfluencies. Additionally, we explore semantic annotation transfer using automatic SMT methods and crowdsourcing. For the latter, we propose a computational methodology to obtain a corpus of acceptable quality without target-language references and under low worker agreement.

Item Type: Doctoral Thesis (PhD)
Doctoral School: Information and Communication Technology
PhD Cycle: 24
Subjects: Area 01 - Scienze matematiche e informatiche > INF/01 INFORMATICA
Repository Staff approval on: 04 Aug 2014 11:19
