Masera, Luca (2019) Multi-target Prediction Methods for Bioinformatics: Approaches for Protein Function Prediction and Candidate Discovery for Gene Regulatory Network Expansion. PhD thesis, University of Trento.
Abstract
Biology has been experiencing a paradigm shift since the advent of next-generation sequencing technologies. The data produced far exceed biologists' capacity to investigate every possibility in the laboratory; hence, predictive tools able to guide research are now a fundamental component of their workflow. Given the central role of proteins in living organisms, in this thesis we focus on their functional analysis and on the intrinsic multi-target nature of this task. To this end, we propose different predictive methods, specifically developed to exploit side knowledge among target variables and examples. As a first contribution, we address the task of protein function prediction and, more generally, of hierarchical multilabel classification (HMC). We present Ocelot, a predictive pipeline for genome-wide protein characterization. It relies on a statistical relational learning tool, in which knowledge about the input examples is encoded by a combination of multiple kernel matrices, while relations among target variables are expressed as logical constraints. Both the mislabeling of examples and the violation of logical rules are penalized by the loss function, but Ocelot does not enforce hierarchical consistency. To overcome this limitation, we present AWX, a neural network output layer that guarantees the formal consistency of HMC predictions. The second contribution is VSC, a binary classifier designed to incorporate the concepts of subsampling and locality in the definition of the features used as input to a perceptron. A locality-based confidence measure is used to weight the contribution of maximum-margin hyperplanes built by subsampling pairs of examples of opposite class. The rationale is that local methods can be exploited when the task is expected to be multi-target but this is not reflected in the annotation space. The third and last contribution comprises NES2RA and OneGenE, two approaches for finding candidates to expand known gene regulatory networks.
NES2RA adopts variable-subsetting strategies, enabled by volunteer distributed computing, and the PC algorithm to discover candidate causal relationships within each subset of variables. Then, ranking aggregators combine the partial results into a single ranked list of candidate genes. OneGenE overcomes the main limitation of NES2RA, i.e., latency, by precomputing candidate expansion lists for each transcript of an organism, which are then aggregated on demand.
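The aggregation step mentioned above, in which partial candidate lists are merged into one ranking, can be illustrated with a minimal sketch. The abstract does not say which ranking aggregators the thesis uses, so the Borda count below is an assumed stand-in chosen for simplicity. The function name `borda_aggregate` and the gene identifiers are hypothetical.

```python
from collections import defaultdict


def borda_aggregate(partial_rankings):
    """Merge several ranked lists (best candidate first) with a Borda count.

    Illustrative assumption: this is NOT necessarily the aggregator used
    by NES2RA or OneGenE; it only shows the general idea.
    """
    # Every candidate that appears in at least one partial list.
    candidates = {gene for ranking in partial_rankings for gene in ranking}
    n = len(candidates)

    scores = defaultdict(int)
    for ranking in partial_rankings:
        for position, gene in enumerate(ranking):
            # Top position earns n points, the next n - 1, and so on.
            # A gene missing from a list earns nothing from that list.
            scores[gene] += n - position

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(candidates, key=lambda g: (-scores[g], g))


# Hypothetical partial results, e.g. one list per subset of variables.
partial = [
    ["geneA", "geneB", "geneC"],
    ["geneB", "geneA"],
    ["geneB", "geneC", "geneD"],
]
aggregated = borda_aggregate(partial)
print(aggregated)  # ['geneB', 'geneA', 'geneC', 'geneD']
```

Because each partial list contributes scores independently, lists computed ahead of time can be combined at query time, which is the kind of on-demand aggregation the abstract attributes to OneGenE.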
Item Type: | Doctoral Thesis (PhD) |
---|---|
Doctoral School: | Information and Communication Technology |
PhD Cycle: | 31 |
Subjects: | Area 01 - Scienze matematiche e informatiche > INF/01 INFORMATICA; Area 05 - Scienze biologiche > BIO/11 BIOLOGIA MOLECOLARE |
Repository Staff approval on: | 19 Jul 2019 08:54 |