Section: Computer Science and Engineering
Tutor: PERNICI BARBARA
Advisor: CERI STEFANO
Major Research topic: Query optimization based on multi-dimensional data structure
Abstract:
The GenoMetric Query Language (GMQL), published in 2015, is an algebraic language for querying genomic datasets. It is supported by the Genomic Data Management System (GDMS), an open-source big data engine implemented on top of Apache Spark. GMQL datasets consist of genomic regions (i.e., intervals of the genome delimited by a start and a stop position), each with an associated value representing the signal measured on that region; the most typical signals are gene expression levels, expression peaks, and variants relative to a reference genome. GMQL can process queries over billions of regions organized in distinct datasets.
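As an illustration of the region representation described above, the following is a minimal sketch in plain Python. It is not the GMQL or GDMS API: the names `Region` and `filter_by_value`, the field names, and the use of a single numeric value per region are assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Region:
    """Hypothetical genomic region: an interval with an associated signal value."""
    chrom: str    # chromosome name, e.g. "chr1" (assumed field)
    start: int    # start position of the interval
    stop: int     # stop position of the interval
    value: float  # signal associated with the region (assumed to be one number)

    def __post_init__(self) -> None:
        # Reject intervals whose stop precedes their start.
        if self.stop < self.start:
            raise ValueError("stop must not be smaller than start")


def filter_by_value(regions: List[Region], min_value: float) -> List[Region]:
    """Keep only the regions whose signal value is at least min_value."""
    return [r for r in regions if r.value >= min_value]
```

A dataset is modelled here as a simple list of `Region` objects; a real engine would distribute the regions across a cluster rather than hold them in one list.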
In my research, we presented the Multi-Dimensional Genomic Data Model (MGDM) for region-preserving operations, i.e., operations whose result regions are a subset of the regions of one of the operands. We now propose a query optimization approach based on multi-dimensional data structures. The optimizer works with two internal data structures: a table (GDM) and an array (MGDM). It identifies and adapts the parts of a query that have certain characteristics, such as region preservation, and then selects the appropriate data model for each part. In this way, one part of a query can be computed using GDM while another part is computed using MGDM. To this end, we are currently developing array-based algorithms for arbitrary operations, so that the whole set of GMQL operations is covered and the most suitable algorithm can be chosen for each.
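The idea of splitting a query into parts and assigning each part a data model can be sketched as follows. This is a simplified illustration, not the actual optimizer: the names `Op`, `pick_model`, and `plan`, the operation names in the usage note, and the single rule "region-preserving goes to MGDM, everything else to GDM" are all assumptions. The optimizer described above may weigh other characteristics as well.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Tuple


@dataclass(frozen=True)
class Op:
    """Hypothetical query operation with a flag supplied by the caller."""
    name: str
    region_preserving: bool  # True if result regions are a subset of an operand's regions


def pick_model(op: Op) -> str:
    # Assumed rule: region-preserving operations run on the array model (MGDM),
    # all other operations run on the table model (GDM).
    return "MGDM" if op.region_preserving else "GDM"


def plan(ops: List[Op]) -> List[Tuple[str, List[str]]]:
    """Group consecutive operations that share a data model into stages."""
    stages = []
    for model, group in groupby(ops, key=pick_model):
        stages.append((model, [op.name for op in group]))
    return stages
```

For a query given as `[Op("filter_a", True), Op("filter_b", True), Op("combine", False), Op("filter_c", True)]` (hypothetical operation names), `plan` yields three stages: the first two operations on MGDM, the third on GDM, and the last on MGDM again. Grouping consecutive operations keeps the number of conversions between the table and array representations small, which is one plausible design goal for such a planner.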