Improving the Effectiveness of Information Extraction from Biomedical Text

Chowdhury, Md. Faisal Mahbub (2013) Improving the Effectiveness of Information Extraction from Biomedical Text. PhD thesis, University of Trento, Fondazione Bruno Kessler.

PDF - Doctoral Thesis
Available under License Creative Commons Attribution.



Information extraction (IE) is the task of automatically extracting specific target information from text by means of natural language processing (NLP) and machine learning (ML) techniques. The huge amount of available biomedical and clinical text is an important source of undiscovered knowledge and an interesting domain in which IE techniques can be applied. Although there has been a considerable amount of work on IE for other genres of text (such as newspaper articles), the results of state-of-the-art approaches for some IE tasks show that there is still a need for improvement. Moreover, when these IE approaches are applied directly to biomedical/clinical data, performance drops considerably. Customizing the IE approaches with features and pre/post-processing techniques specific to the biomedical/clinical genre does improve the results (with respect to applying the approaches directly), but the situation is still not completely satisfactory. There are many ways to pursue further improvement (e.g. exploiting the scope of negations, discourse structure, semantic roles, etc.) which are yet to be fully harnessed in IE systems. Additional challenges come from the use of ML techniques themselves. Imbalance in data distribution is quite common in many NLP (including IE) tasks. Previous studies have empirically shown that unbalanced datasets lead to poor performance for the minority class. In this PhD research, we aim to address the open issues outlined above. We focus on three core IE tasks which are crucial for text mining: named entity recognition (NER), coreference resolution (CoRef), and relation extraction (RE). For NER, we propose an approach for the recognition of disease entity mentions which achieves state-of-the-art performance and is later exploited as a component in our RE system.
Our NER system also achieves results on par with the state of the art for other bio-entity types such as genes/proteins, species and drugs. Since the creation of manually annotated training data is a costly process, we also investigate the practical usability of automatically annotated corpora for NER and propose how to automatically improve the quality of such corpora. CoRef, which is naturally the next step after NER, is often deemed one of the stumbling blocks for other IE tasks such as RE. We propose a greedy and constrained CoRef approach that achieves high results on clinical texts for each individual entity mention type and for each of the four evaluation metrics usually computed to assess system performance. As for RE, one of the fundamental characteristics of our approach is that it exploits other NLP areas such as scope of negations, elementary discourse units and semantic roles. We propose a novel hybrid kernel that takes advantage not only of different types of information (syntactic, semantic, contextual, etc.) but also of the different ways they can be represented (i.e. flat structure, tree, graph). Our approach yields significantly better results than the previous state-of-the-art approaches for the drug-drug interaction and protein-protein interaction extraction tasks. In each of the above tasks, we concentrate on developing pro-active IE approaches that automatically discard unnecessary training/test instances even before ML models are trained and applied to test data. This enables better performance, because the data distribution becomes less skewed, as well as faster runtime. We tested our NER and RE approaches on other genres of text such as newspaper articles and automatically transcribed broadcast news. The results show that our approaches are largely domain independent.

Item Type: Doctoral Thesis (PhD)
Doctoral School: Information and Communication Technology
PhD Cycle: XXV
Subjects: Area 01 - Scienze matematiche e informatiche > INF/01 INFORMATICA
Uncontrolled Keywords: Information extraction, biomedical text mining, named entity recognition, coreference resolution, relation extraction
Repository Staff approval on: 10 May 2013 13:29
