BERNASCONI ANNA
Cycle: XXXII
Section: Computer Science and Engineering
Tutor: PERNICI BARBARA
Advisor: CERI STEFANO
Major Research topic: A Sound Approach to Building Integrated Repositories of Genomic Data
Abstract:
The integration of genomic metadata is, at the same time, an important, difficult, and well-recognized challenge. It is important because a wealth of public data repositories is available to drive biological and clinical research; combining information from various heterogeneous and widely dispersed sources is paramount to a number of biological discoveries. It is difficult because the domain is complex and there is no agreement among the various metadata definitions, which refer to different vocabularies and ontologies. It is well-recognized in the bioinformatics community because, in common practice, repositories are accessed one by one and their specific metadata definitions are learned only through long and tedious effort, a practice that is error-prone.
We propose a conceptual model of genomic metadata (the Genomic Conceptual Model), whose purpose is to query the underlying data sources for locating relevant experimental datasets; it describes a typical genomic region data file from different perspectives (biology, technology, management and extraction). We describe META-BASE, an architecture for integrating metadata extracted from a variety of genomic data sources, based upon a structured transformation process. We present a variety of innovative techniques for data extraction, cleaning, normalization and enrichment, and we show a general, open and extensible pipeline that can easily incorporate any number of new data sources. We then describe the resulting repository, which already integrates several important sources and is exposed by means of practical user interfaces to respond to biological researchers' needs. Finally, we introduce GenoSurf (http://www.gmql.eu/genosurf/), a multi-ontology semantic search system providing access to the consolidated repository of metadata attributes found in the most relevant genomic datasets and to the related genomic datasets, which can be analyzed with off-the-shelf bioinformatics tools.
Inspired by our work on genomic data integration, during the outbreak of the COVID-19 pandemic we searched for effective ways to help mitigate its effects with our research; we were able to re-apply the model-build-search paradigm used for human genomics. The domain of viral genomics is completely new, yet it presents many analogies with our previous challenges. We model nucleotide sequences accounting for their technological, biological and organizational aspects (the Viral Conceptual Model); we compute their annotations and variants on both nucleotide and amino acid sequences; we then integrate sequences with their metadata from a variety of different sources; finally, we propose a powerful search interface, ViruSurf (http://www.gmql.eu/virusurf/), able to quickly extract sequences based on their combined variants, compare different conditions, and build interesting populations for downstream analysis.
Our future work will take a three-systems viewpoint, centered on the human phenotype augmented with information from the viral sequence and the human genome (a new holistic approach stemming from our unique experience). Current efforts target novel ways to connect many existing datasets to one another in order to build such a general view.