GULINO ANDREA — Cycle: XXXIII
Section: Computer Science and Engineering
Tutor: TANCA LETIZIA
Advisor: CERI STEFANO
Major Research Topic: Distributed Processing and Optimization Techniques for Genomic Data

Abstract:
In recent years, new technologies for sequencing DNA, known as next-generation sequencing (NGS) technologies, have become available. These technologies allow reading the whole genome much more quickly and cheaply than their predecessors, producing an increasing amount of data that can be used to answer fundamental biological questions (e.g. how cancer arises, how mutations occur) and to set the ground for personalised medicine. Large and well-organised collections of sequencing data have become publicly available over the years, and new technologies to integrate, query, mine and visualise this type of data have been developed. One such technology is the “GenoMetric Query Language” (GMQL), a query language designed ad hoc for extracting useful information from genomic data. DNA, in this case, is represented in the form of genomic regions; similarly to interval data, a region of the DNA is characterised by a start position, a stop position and an arbitrary set of attributes. GMQL includes several operators: some are common relational algebra operators (SQL-like), while others are domain-specific and take into account the spatial information of regions, i.e. their start and stop coordinates. Each GMQL query can be described as a Directed Acyclic Graph (DAG) of tasks (GMQL operators) and runs on top of a cloud-computing-based system, called the Genomic Data Management System (GDMS), implemented using the Apache Spark framework. Custom “binning algorithms” were designed to further increase the parallelism of domain-specific operations; the main idea consists in splitting the genome into small portions, named bins, and processing the regions in each bin independently. This thesis starts by addressing the problem of modelling and optimizing the performance of the GDMS, proposing solutions that can be easily extended to broader domains and more generic applications.
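The binning idea described above can be illustrated with a minimal sketch (plain Python, not the actual GDMS implementation): regions, modelled as (start, stop, attributes) tuples, are assigned to fixed-size genome bins so that each bin can then be processed independently and in parallel. The bin size and the helper names here are assumptions chosen for illustration; the thesis derives the optimal bin size analytically.

```python
# Illustrative sketch of binning (hypothetical names; not the GDMS code).
BIN_SIZE = 10_000  # assumed bin width; the optimal value is query/data dependent


def bins_for_region(start, stop, bin_size=BIN_SIZE):
    """Return the indices of all bins that a region overlaps."""
    return range(start // bin_size, stop // bin_size + 1)


def bin_regions(regions, bin_size=BIN_SIZE):
    """Group regions by bin; a region spanning several bins is replicated
    into each of them, so every bin can be processed independently."""
    binned = {}
    for start, stop, attrs in regions:
        for b in bins_for_region(start, stop, bin_size):
            binned.setdefault(b, []).append((start, stop, attrs))
    return binned


regions = [(5_000, 12_000, {"score": 0.7}),   # overlaps bins 0 and 1
           (25_000, 26_000, {"score": 0.3})]  # falls entirely in bin 2
print(bin_regions(regions))
```

Replicating boundary-crossing regions into every bin they touch is what lets a domain-specific operator (e.g. a region join) run on each bin without looking at its neighbours.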
The first result shows how the optimal bin size can be determined through an analytical description of the domain-specific operators. By taking into account the input data, the specific query and some parameters that vary with the execution environment, the performance of domain-specific operators can be optimized. Another challenging task, helpful for proper resource allocation and optimal query scheduling, consists in predicting the execution time of a generic query. In this thesis this problem is addressed by proposing an approach that mixes machine learning (ML) and analytical modelling. A GMQL DAG is decomposed into its tasks, each implemented through a set of Spark operations. For each of them, an ML model is built, taking into account a variety of features which, similarly to those used for optimal binning, describe the data, the query and the environment. Analytical modelling replaces ML for predicting features related to intermediate data, which are not known offline. Moreover, this thesis describes a functional and architectural extension of GMQL, called Federated GMQL, in which a single GMQL query can run using data and computational resources from different, geographically spread instances of the GDMS, joined in the so-called “Federation”. The operations of a Federated GMQL query can therefore be allocated to different environments (clusters), and the scheduling of federated queries can take advantage of the aforementioned performance models. To the best of our knowledge, the design of a system in which multiple Spark applications cooperate to run this type of computation is unprecedented. While GMQL is suited for data manipulation, a complementary tool, called MutViz, is intended for visualisation. Specifically, MutViz provides several visualisations for analysing mutations that occur in the neighbourhood of user-provided genomic regions, giving useful hints on their role in the development of cancer.
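The hybrid prediction approach can be sketched as follows (a toy illustration, not the thesis models): the DAG is walked in topological order, a simple analytical formula estimates the cardinality of each intermediate result (since it is not known offline), and a per-operator learned model, here stood in for by plain functions, predicts each task's running time from those estimated features. All formulas, selectivities and names below are assumptions made for illustration.

```python
# Toy sketch of mixing analytical cardinality estimation with
# per-operator ML cost models (all numbers/formulas are assumed).

def estimate_output_rows(op, left_rows, right_rows):
    """Analytical stand-in for intermediate-result cardinality."""
    if op == "JOIN":
        return left_rows * right_rows * 0.001  # assumed join selectivity
    if op == "SELECT":
        return left_rows * 0.5                 # assumed filter selectivity
    return left_rows


def predict_query_time(dag, models, input_rows):
    """Walk a topologically sorted DAG, propagating estimated row counts
    and summing per-task time predictions from the per-operator models."""
    rows = dict(input_rows)  # node name -> estimated row count
    total = 0.0
    for node, op, inputs in dag:
        left = rows[inputs[0]]
        right = rows[inputs[1]] if len(inputs) > 1 else 0
        rows[node] = estimate_output_rows(op, left, right)
        total += models[op]({"rows_in": left + right})
    return total


# Toy query: SELECT over dataset A, then JOIN the result with dataset B.
dag = [("s", "SELECT", ["A"]),
       ("j", "JOIN", ["s", "B"])]
# Lambdas stand in for trained per-operator regressors.
models = {"SELECT": lambda f: 0.01 * f["rows_in"],
          "JOIN":   lambda f: 0.02 * f["rows_in"]}
print(predict_query_time(dag, models, {"A": 1_000, "B": 2_000}))
```

In the real system the per-operator models would be regressors trained on richer features describing the data, the query and the environment; the point of the sketch is only the control flow: analytical estimates feed the learned cost models task by task.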
We combined traditional databases and cloud computing, re-adapting algorithms used within the GDMS, to build a performant REST API, which serves its results both to an advanced web interface and to a Python library.