Exploiting spatial and spectral information for audio source separation and speaker diarization

Abdelraheem, Mahmoud Fakhry Mahmoud (2016) Exploiting spatial and spectral information for audio source separation and speaker diarization. PhD thesis, University of Trento.

Abstract

The goal of multichannel audio source separation is to recover high-quality separated audio signals from observed mixtures of those signals. The problem is difficult not only because the sources propagate through noisy, reverberant environments, but also because the source signals overlap. Among the research directions pursued around this problem, probabilistic and advanced modeling aims to exploit the diversity of multichannel propagation and the redundancy of the source signals. Moreover, prior information about the environments or the signals helps to improve the quality of the separation and to accelerate it. In this thesis, we propose methods that increase the effectiveness of model-based audio source separation by exploiting prior information through spectral and sparse modeling. The work is divided into two main parts. In the first part, spectral modeling based on Nonnegative Matrix Factorization is adopted to represent the source signals. The parameters of Gaussian model-based source separation are estimated in the maximum-likelihood sense using a Generalized Expectation-Maximization algorithm that applies supervised Nonnegative Matrix and Tensor Factorization, given spectral descriptions of the source signals. Three modalities of making the descriptions available are addressed: the descriptions are trained on-line during the separation, pre-trained and made directly available, or pre-trained and made indirectly available. In the last case, a detection method is proposed to identify the descriptions that best represent the signals in the mixtures. In the second part, sparse modeling is adopted to represent the propagation environments. Spatial descriptions of the environments, either deterministic or probabilistic, are pre-trained and made indirectly available. A detection method is proposed to identify the deterministic descriptions that best represent the environments. The detected descriptions are then used to perform source separation by minimizing a non-convex $\ell_0$-norm function. For speaker diarization, where the task is to determine "who spoke when" in real meetings, a Watson mixture model is optimized using an Expectation-Maximization algorithm in order to detect the probabilistic descriptions that best represent the environments and to estimate the temporal activity of each source. The performance of the proposed methods is evaluated experimentally on several datasets, both simulated and live-recorded. The results show that the proposed methods outperform the recently developed methods used as baselines.
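
As a rough illustration of the supervised factorization idea mentioned in the abstract, the sketch below holds pre-trained spectral bases fixed and estimates only their activations for a mixture spectrogram. It is a simplified stand-in and not the thesis's algorithm: it uses the standard multiplicative update for a Euclidean cost instead of maximum-likelihood estimation in a Gaussian model, it is single-channel, and all names (`supervised_nmf_activations`, `separate_two_sources`, `W1`, `W2`) are hypothetical.

```python
import numpy as np


def supervised_nmf_activations(V, W, n_iter=200, eps=1e-12, seed=0):
    """Estimate nonnegative activations H such that V is approximately W @ H.

    W (the pre-trained spectral description) is held fixed; only H is updated,
    using the multiplicative rule for the Euclidean cost. This cost is an
    assumption made for brevity, not the criterion used in the thesis.
    """
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H


def separate_two_sources(V, W1, W2, **kwargs):
    """Split a nonnegative mixture spectrogram V using two fixed bases."""
    W = np.hstack([W1, W2])
    H = supervised_nmf_activations(V, W, **kwargs)
    k1 = W1.shape[1]
    V1 = W1 @ H[:k1]   # model of source 1
    V2 = W2 @ H[k1:]   # model of source 2
    total = V1 + V2 + 1e-12
    # Soft masks (Wiener-like) applied to the mixture magnitude.
    return V * V1 / total, V * V2 / total
```

Here `V` would be a nonnegative magnitude or power spectrogram (frequency bins by time frames), and `W1`, `W2` the bases trained beforehand on isolated recordings of each source.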

Item Type: Doctoral Thesis (PhD)
Doctoral School: Information and Communication Technology
PhD Cycle: 28
Subjects: Area 01 - Scienze matematiche e informatiche > INF/01 INFORMATICA
Funders: Fondazione Bruno Kessler
Repository Staff approval on: 02 Dec 2016 14:14