Automatic Speech Recognition Quality Estimation

Jalalvand, Shahab (2017) Automatic Speech Recognition Quality Estimation. PhD thesis, University of Trento.

[img]PDF (Shahab's doctoral thesis) - Doctoral Thesis
Restricted to Repository staff only until 9999.

1128Kb
[img]PDF - Disclaimer
Restricted to Repository staff only until 9999.

969Kb

Abstract

Evaluation of automatic speech recognition (ASR) systems is difficult and costly, since it requires manual transcriptions. This evaluation is usually done by computing word error rate (WER) that is the most popular metric in ASR community. Such computation is doable only if the manual references are available, whereas in the real-life applications, it is a too rigid condition. A reference-free metric to evaluate the ASR performance is \textit{confidence measure} which is provided by the ASR decoder. However, the confidence measure is not always available, especially in commercial ASR usages. Even if available, this measure is usually biased towards the decoder. From this perspective, the confidence measure is not suitable for comparison purposes, for example between two ASR systems. These issues motivate the necessity of an automatic quality estimation system for ASR outputs. This thesis explores ASR quality estimation (ASR QE) from different perspectives including: feature engineering, learning algorithms and applications. From feature engineering perspective, a wide range of features extractable from input signal and output transcription are studied. These features represent the quality of the recognition from different aspects and they are divided into four groups: signal, textual, hybrid and word-based features. From learning point of view, we address two main approaches: i) QE via regression, suitable for single hypothesis scenario; ii) QE via machine-learned ranking (MLR), suitable for multiple hypotheses scenario. In the former, a regression model is used to predict the WER score of each single hypothesis that is created through a single automatic transcription channel. In the latter, a ranking model is used to predict the order of multiple hypotheses with respect to their quality. Multiple hypotheses are mainly generated by several ASR systems or several recording microphones. From application point of view, we introduce two applications in which ASR QE makes salient improvement in terms of WER: i) QE-informed data selection for acoustic model adaptation; ii) QE-informed system combination. In the former, we exploit single hypothesis ASR QE methods in order to select the best adaptation data for upgrading the acoustic model. In the latter, we exploit multiple hypotheses ASR QE methods to rank and combine the automatic transcriptions in a supervised manner. The experiments are mostly conducted on CHiME-3 English dataset. CHiME-3 consists of Wall Street Journal utterances, recorded by multiple far distant microphones in noisy environments. The results show that QE-informed acoustic model adaptation leads to 1.8\% absolute WER reduction and QE-informed system combination leads to 1.7% absolute WER reduction in CHiME-3 task. The outcomes of this thesis are packed in the frame of an open source toolkit named TranscRater -transcription rating toolkit- (https://github.com/hlt-mt/TranscRater) which has been developed based on the aforementioned studies. TranscRater can be used to extract informative features, train the QE models and predict the quality of the reference-less recognitions in a variety of ASR tasks.

Item Type:Doctoral Thesis (PhD)
Doctoral School:Information and Communication Technology
PhD Cycle:28
Subjects:Area 01 - Scienze matematiche e informatiche > MAT/05 ANALISI MATEMATICA
Area 01 - Scienze matematiche e informatiche > INF/01 INFORMATICA
Area 01 - Scienze matematiche e informatiche > MAT/06 PROBABILITÀ E STATISTICA MATEMATICA
Funders:Fondazione Bruno Kessler
Repository Staff approval on:17 May 2017 09:26

Repository Staff Only: item control page