Section: Computer Science and Engineering
Tutor: PERNICI BARBARA
Advisor: CERI STEFANO
Major Research topic: Exploiting the Array Data Model for Genomic Data Management
Abstract:
In 2015, the GenoMetric Query Language (GMQL) was published: an algebraic language for querying genomic datasets, supported by the Genomic Data Management System (GDMS), an open-source big data engine implemented on top of Apache Spark. GMQL datasets are represented as genomic regions (i.e., intervals of the genome delimited by a start and a stop position), each with an associated value representing the signal of that region; the most typical signals are gene expressions, peaks of expression, and variants relative to a reference genome. GMQL can process queries over billions of regions, organized within distinct datasets.
My research concerns the adoption of the array data model to support genomic computations. Starting from GMQL, the language developed within the GeCo ERC project, I have designed and implemented a new architecture and methods for genomic data management that use the array data model, in contrast with the standard row-based data model adopted so far in the project. My first solution was “incremental”: it uses the array data model to implement specific chains of operations where arrays give maximum advantage, and otherwise keeps the row-based data model. My second solution is radical: the array model replaces the row-based model and is used for every GMQL operation. Results show very significant performance improvements; this thesis indicates that the array data model can be effectively applied to arbitrary interval-based datasets, such as genomic regions.
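
As a purely illustrative sketch, and not the GMQL or GDMS implementation, the difference between a row-based layout and an array layout for genomic regions can be pictured in a few lines of Python. Everything here is an assumption made for illustration: the `Region` record, the helper names, the sample coordinates, the restriction to a single chromosome, and the half-open interval convention.

```python
from array import array
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical region record: an interval of the genome plus a signal value."""
    chrom: str
    start: int   # inclusive (assumed half-open convention)
    stop: int    # exclusive
    value: float


# Row-based layout: one record per region (made-up sample data).
rows = [
    Region("chr1", 100, 200, 1.5),
    Region("chr1", 150, 300, 2.0),
    Region("chr1", 400, 500, 0.5),
]


def rows_to_arrays(regions):
    """Array layout: one typed array per attribute, aligned by position.

    Assumes all regions lie on the same chromosome, so chrom is dropped.
    """
    starts = array("q", (r.start for r in regions))
    stops = array("q", (r.stop for r in regions))
    values = array("d", (r.value for r in regions))
    return starts, stops, values


def sum_overlapping_rows(regions, lo, hi):
    """Sum the values of regions overlapping [lo, hi), scanning whole records."""
    return sum(r.value for r in regions if r.start < hi and r.stop > lo)


def sum_overlapping_arrays(starts, stops, values, lo, hi):
    """Same query over the array layout, touching only the needed attributes."""
    return sum(v for s, e, v in zip(starts, stops, values) if s < hi and e > lo)


starts, stops, values = rows_to_arrays(rows)
row_result = sum_overlapping_rows(rows, 180, 250)
array_result = sum_overlapping_arrays(starts, stops, values, 180, 250)
```

Both layouts answer the same query identically; the sketch only shows how the data is organized, and says nothing about where the performance gains reported in the thesis come from.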