Grapevine acidity: SVM tool development and NGS data analyses.

Leonardelli, Lorena (2014) Grapevine acidity: SVM tool development and NGS data analyses. PhD thesis, University of Trento, INRA-SupAgro Montpellier, Fondazione Edmund Mach - San Michele a/A.

[img]
Preview
PDF (Grapevine acidity: SVM tool development and NGS data analyses.) - Doctoral Thesis
4Mb

Abstract

Single Nucleotide Polymorphisms (SNPs) represent the most abundant type of genetic variation and they are a valuable tool for several biological applications like linkage mapping, integration of genetic and physical maps, population genetics as well as evolutionary and protein structure-function studies. SNP genotyping by mapping DNA reads produced via Next generation sequencing (NGS) technologies on a reference genome is a very common and convenient approach in our days, but still prone to a significant error rate. The need of defining in silico true genetic variants in genomic and transcriptomic sequences is prompted by the high costs of the experimental validation through re-sequencing or SNP arrays, not only in terms of money but also time and sample availability. Several open-source tools have been recently developed to identify small variants in whole-genome data, but still the candidate variants, provided in the VCF output format, present a high false positive calling rate. Goal of this thesis work is the development of a bioinformatic method that classifies variant calling outputs in order to reduce the number of false positive calls. With the aim to dissect the molecular bases of grape acidity (Vitis vinifera L.), this tool has been then used to select SNPs in two grapevine varieties, which show very different content of organic acids in the berry. The VCF parameters have been used to train a Support Vector Machine (SVM) that classifies the VCF records in true and false positive variants, cleaning the output from the most likely false positive results. The SVM approach has been implemented in a new software, called VerySNP, and applied to model and non-model organisms. In both cases, the machine learning method efficiently recognized true positive from false positive variants in both genomic and transcriptomic sequences. In the second part of the thesis, VerySNP was applied to identify true SNPs in RNA-seq data of the grapevine variety Gora Chirine, characterized by low acidity, and Sultanine, a normal acidity variety closely related to Gora. The comparative transcriptomic analysis crossed with the SNP information lead to discover non-synonymous polymorphisms inside coding regions and, thus, provided a list of candidate genes potentially affecting acidity in grapevine.

Item Type:Doctoral Thesis (PhD)
Doctoral School:Information and Communication Technology
PhD Cycle:26
Subjects:Area 01 - Scienze matematiche e informatiche > INF/01 INFORMATICA
Area 05 - Scienze biologiche > BIO/18 GENETICA
Repository Staff approval on:16 Dec 2014 09:17

Repository Staff Only: item control page