Thesis abstract:
Genes are the essential molecular units of a living organism, and knowledge of their functions is key to understanding physiological and pathological biological processes and to developing new drugs and therapies. The association between a gene and its function is called a biomolecular annotation. Unfortunately, discovering new annotations is often time-consuming and expensive, because it relies on in vitro biological experiments carried out by physicians and biologists.
Rapid advances in high-throughput technology have made many new gene functions available online in public databases and data banks. Despite their undeniable importance, these data sources can be considered neither complete nor fully accurate: annotations are not always reviewed before publication, they sometimes include erroneous information, and they are incomplete by definition. In this scenario, computational methods that can speed up the curation of such data are very important.
This has motivated the development of computational algorithms and software that use the available genomic information for gene function prediction and provide biologists with prioritized lists of biomolecular annotations to guide their future research and experiments.
In this thesis, we first address the problem of predicting novel gene functions (or biomolecular annotations) through different computational machine learning methods, in which we exploit the co-occurrence properties of annotations to suggest new likely gene functions. We propose several computational methods, implemented in an integrated framework, that produce prioritized lists of predicted annotations sorted by likelihood. In particular, we enhance an annotation prediction method already available in the literature, and then develop two variants of it, based on gene clustering and on term-term similarity weights.
We also deal with the validation of the predicted annotations. Scientists keep adding new data to the annotation data banks as they discover new gene functions, and sometimes these data are erroneous or inaccurate. Moreover, new discoveries are made every day, so the available information cannot be considered definitive. For these reasons, such databases are always incomplete. This leads to a significant validation problem, because there is no true gold standard to refer to. We therefore designed and developed different validation procedures to check the quality of our predictions. We introduce a validation phase consisting of a Receiver Operating Characteristic (ROC) analysis, a search for the predicted annotations in an updated version of the database, and, where possible, an analysis of the knowledge available in the literature and through available web tools.
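As an illustration of the ROC step, the sketch below computes the area under the ROC curve for a scored list of predictions. It rests on assumptions that are not stated in this abstract: each predicted annotation carries a likelihood score, and a prediction is labeled positive when it is found in the updated database version. The function name `roc_auc` and the sample data are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a
    randomly chosen positive is scored above a randomly chosen negative
    (ties count as one half)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical likelihood scores of four predicted annotations, and
# whether each one was found in the updated database (1) or not (0).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels))  # 0.75
```

A value near 1 means that confirmed annotations tend to rank above unconfirmed ones, while a value near 0.5 means the ranking is no better than chance. Because the reference database is itself incomplete, an unconfirmed prediction is not necessarily wrong.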
To better understand how the output lists of predicted annotations vary, we design and develop new measures based on the Spearman coefficient and the Kendall distance. These measures quantify the correlation between two lists by analyzing the difference between the positions of the same element in the two lists, and by counting the pairs of elements that appear in opposite order in the two lists. They proved able to reveal important patterns that are otherwise difficult to notice.
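The two ideas behind these measures can be sketched with their classic counterparts: a Spearman-style footrule distance (sum of position differences) and the Kendall distance (number of pairs in opposite order). This is a minimal sketch and not the measures developed in the thesis; it assumes both lists contain exactly the same elements without repetitions, and the function names and term labels are hypothetical.

```python
from itertools import combinations


def spearman_footrule(list_a, list_b):
    """Sum of the absolute differences between the positions that each
    element occupies in the two lists."""
    pos_b = {item: i for i, item in enumerate(list_b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(list_a))


def kendall_distance(list_a, list_b):
    """Number of element pairs whose relative order differs between the
    two lists."""
    pos_b = {item: i for i, item in enumerate(list_b)}
    # combinations() yields pairs (x, y) with x ranked before y in list_a;
    # the pair is discordant when list_b ranks y before x.
    return sum(1 for x, y in combinations(list_a, 2) if pos_b[x] > pos_b[y])


# Two hypothetical rankings of the same four predicted annotation terms.
ranking_1 = ["term_A", "term_B", "term_C", "term_D"]
ranking_2 = ["term_B", "term_A", "term_C", "term_D"]
print(spearman_footrule(ranking_1, ranking_2))  # 2
print(kendall_distance(ranking_1, ranking_2))   # 1
```

Both distances are 0 for identical lists and reach their maximum when one list is the reverse of the other. Comparing lists that only partially overlap requires extra conventions that this sketch does not cover.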
Finally, we provide a visualization and statistical tool, called the novelty indicator, that quantifies the novelty of the predicted gene annotations. For each gene, this tool depicts the tree graph of the predicted ontological annotation terms, producing images that are easy to understand even for non-experts, together with a statistical value that expresses the level of novelty of the prediction.
Our tests and experiments confirmed the efficiency and effectiveness of our algorithms, which retrieved many predicted annotations that were confirmed in the updated database or in the literature. The similarity measures proved very useful for comparing our predicted lists, allowing us to see specific similarity patterns as key parameters vary.
The novelty indicator also proved very useful, producing tree graphs that make our lists of predicted biomolecular annotations clearly usable by biologists and scientists.
We believe that the tools presented in this thesis may be very useful to the bioinformatics and scientific community in guiding future research experiments on gene functions.