Frame-Based Ontology Population from Text: Models, Systems, and Applications

Corcoglioniti, Francesco (2016) Frame-Based Ontology Population from Text: Models, Systems, and Applications. PhD thesis, University of Trento.

PDF - Doctoral Thesis
Restricted to Repository staff only until 9999.
Available under License Creative Commons Attribution.

3069Kb

Abstract

Ontology Population from text is an interdisciplinary task that integrates Natural Language Processing (NLP), Knowledge Representation, and Semantic Web techniques to extract assertional knowledge from texts according to specific ontologies. As most information on the Web is available as unstructured text, Ontology Population plays an important role in bridging the gap between structured and unstructured data, thus helping to realize the vision of a (Semantic) Web where contents are equally consumable by humans and machines. In this thesis we move beyond Ontology Population of instances and binary relations, and focus on (what we call) Frame-based Ontology Population, whose target is the extraction of semantic frames from text. Semantic frames are defined by RDFS/OWL ontologies, such as FrameBase and the Event Situation Ontology derived from FrameNet, and consist of events, situations and other structured entities reified as ontological instances (e.g., a sell event) and connected to related instances via properties specifying their semantic roles in the frame (e.g., seller, buyer). This representation (called neo-Davidsonian) supports expressing n-ary and arbitrarily qualified relations, and permits leveraging complex NLP tasks such as Semantic Role Labeling (SRL), which annotates frame-like structures in text consisting of predicates and their semantic arguments as defined by domain-general predicate models. We contribute to the task of Frame-based Ontology Population from multiple directions. We start by developing an extension of the Lemon lexicon model for ontologies (PreMOn) to represent predicate models (PropBank, NomBank, VerbNet, and FrameNet) and their mappings to FrameBase.
Based on this, our core contribution is a Frame-based Ontology Population approach (PIKES) where processing is decoupled into two phases: first, an English text is processed by an SRL-based NLP pipeline to extract mentions, i.e., snippets of text denoting entities or facts; then, mentions are processed by mapping rules to extract ontological instances aligned to DBpedia and YAGO, and semantic frames aligned to FrameBase. We represent all the contents involved in this process in RDF with named graphs, according to an ontological model (KEM) built on top of the semiotic notions of meaning and reference, aligned to DOLCE and the NLP Interchange Format (NIF) ontologies. The model allows navigating from any piece of extracted knowledge to its mentions and back, and allows representing all the generated intermediate information (e.g., NLP annotations) and associated metadata (e.g., confidence, provenance). Based on this model, we propose a scalable system (KnowledgeStore) for storing and querying all the text, mentions, and RDF data involved in the population process, together with relevant RDF background knowledge, so that they can be jointly accessed by applications. Finally, to support the necessary RDF processing tasks, such as rule evaluation, RDFS and owl:sameAs inference, and data filtering and integration, we propose a tool (RDFpro) implementing a simple, non-distributed processing model combining streaming and sorting techniques in complex pipelines, capable of processing billions of RDF triples on a commodity machine. We describe the application of these solutions for processing datasets of different scope and size within and outside the NewsReader EU Project, and for improving search performance in Information Retrieval, through an approach (KE4IR) that enriches the term vectors of documents and queries with semantic terms obtained from extracted knowledge.
All the proposed solutions were implemented and released open-source with demonstrators, and ontological models were published online according to Linked Data best practices. The results obtained were validated via empirical performance evaluations and case studies.
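
As an illustration of the neo-Davidsonian representation mentioned in the abstract, the following minimal Python sketch reifies a sell event as an instance of its own and attaches each participant through a role property. It is not taken from the thesis or from PIKES: the `ex:` identifiers, the `reify_frame` and `roles_of` helpers, and the example sentence are all hypothetical, and triples are modeled as plain tuples rather than with an RDF library.

```python
# Minimal sketch of a neo-Davidsonian (reified) frame, using plain tuples
# as (subject, predicate, object) triples. All "ex:" names are hypothetical
# placeholders, not actual FrameBase or DBpedia identifiers.

def reify_frame(frame_id, frame_type, roles):
    """Return triples for one frame instance: a type triple plus one triple per role."""
    triples = [(frame_id, "rdf:type", frame_type)]
    for role, filler in roles.items():
        triples.append((frame_id, role, filler))
    return triples


def roles_of(triples, frame_id):
    """Collect the role -> filler mapping of a frame instance."""
    return {p: o for s, p, o in triples if s == frame_id and p != "rdf:type"}


# "Alice sold a car to Bob in 2016": four participants, so a single binary
# relation is not enough; the event itself becomes a node.
triples = reify_frame(
    "ex:sell_1",
    "ex:Sell",
    {
        "ex:seller": "ex:Alice",
        "ex:buyer": "ex:Bob",
        "ex:goods": "ex:car_1",
        "ex:time": "2016",
    },
)

for triple in triples:
    print(triple)
```

Because the event is a node, further qualifications (a price, a place, a confidence score) can be added as extra triples on `ex:sell_1` without changing the existing ones, which is what makes n-ary and arbitrarily qualified relations expressible.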

Item Type: Doctoral Thesis (PhD)
Doctoral School: Information and Communication Technology
PhD Cycle: 26
Subjects: Area 01 - Scienze matematiche e informatiche > INF/01 INFORMATICA
Area 09 - Ingegneria industriale e dell'informazione > ING-INF/05 SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI
Uncontrolled Keywords: Ontology Population, Semantic Frames, Ontologies, Predicate Models, Semantic Web
Funders: Fondazione Bruno Kessler
Repository Staff approval on: 06 Jun 2016 10:32
