Section: Computer Science and Engineering
Tutor: TANCA LETIZIA
Advisor: MASSEROLI MARCO
Major Research topic: Computational methods for data-driven prediction and understanding of biological interactions.
Abstract:
The purpose of the research project is to develop innovative methodologies for the quantitative, data-driven study of the interactions among biological entities, in order to identify and understand them. To this end, two specific projects are being developed.
The first project aims to identify statistical relationships among transcription factors (proteins that bind DNA) based on their co-occurrence in specific DNA regions. It involved the extension and optimization of a new R/Bioconductor software package, TFARM, which was submitted to and passed the Bioconductor review process in September 2017. The package is now publicly available in the official Bioconductor release, with full documentation and usage examples, and has been downloaded more than 1000 times so far (https://bioconductor.org/packages/release/bioc/html/TFARM.html). The software finds associations using the Apriori method, the most frequently used method for mining association rules, and ranks them using an innovative index called the Importance Index. The project has used this software to select the interactors of a target of interest and to rank them in an unbiased way, based on the analysis of public data from ENCODE (Encyclopedia of DNA Elements) on the localization of transcription factors in genomic regions of interest. These data were extracted and preprocessed using the GenoMetric Query Language (GMQL), a high-level, declarative query language for genomic big data developed by the Genomic Computing group at Politecnico di Milano.
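As an illustration of the kind of association rule mining described above, the following is a minimal Python sketch of a level-wise (Apriori-style) search over genomic regions, each represented as the set of transcription factors found in it. It is not the TFARM implementation, which is written in R, and it does not compute the Importance Index, whose definition is not given here. The transcription factor labels (`TF_A`, `TF_B`, `TF_T`), the toy regions, and the thresholds are all hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search: return {itemset: support}
    for every itemset whose support is at least min_support."""
    n = len(transactions)
    support = {}
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]  # level-1 candidates
    k = 1
    while current:
        # Keep the candidates of size k that are frequent enough.
        level = {}
        for cand in current:
            s = sum(1 for t in transactions if cand <= t) / n
            if s >= min_support:
                level[cand] = s
        support.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (downward closure).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return support


def rules_for_target(support, target, min_confidence):
    """Build rules {other TFs} -> target and rank them by confidence."""
    rules = []
    for itemset, supp in support.items():
        if target in itemset and len(itemset) > 1:
            lhs = itemset - {target}
            confidence = supp / support[lhs]  # lhs is frequent by closure
            if confidence >= min_confidence:
                rules.append((sorted(lhs), target, supp, confidence))
    return sorted(rules, key=lambda r: (-r[3], -r[2]))


# Hypothetical toy data: each set lists the TFs bound in one region.
regions = [
    {"TF_A", "TF_B", "TF_T"},
    {"TF_A", "TF_B", "TF_T"},
    {"TF_A", "TF_T"},
    {"TF_B"},
    {"TF_A", "TF_B"},
]

support = frequent_itemsets(regions, min_support=0.4)
rules = rules_for_target(support, target="TF_T", min_confidence=0.6)
for lhs, rhs, supp, conf in rules:
    print(f"{lhs} -> {rhs}: support={supp:.2f}, confidence={conf:.2f}")
```

In this sketch the rules are ranked by plain confidence, which is only a stand-in: a dedicated index such as the one mentioned above would replace the sort key.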
The second project, in turn, aims to analyze heterogeneous data using a generalizable network-based clustering approach that can classify entities of a particular data type. We collected four datasets of drug-protein, protein-pathway, protein-protein and pathway-pathway interactions, extracted from different databases. We used them to construct four different networks, which we then merged into a single tri-partite network. The aim is to classify drugs according to their behaviour in the tri-partite network; to this end, we proposed a novel approach based on Non-negative Matrix Tri-Factorization (NMTF). NMTF is a dimensionality reduction method and an established co-clustering technique in machine learning, used to cope with the heterogeneity of networks. It plays an important role in data integration, owing to its capacity to factorize any relation matrix between heterogeneous data types. However, it has never been used for multi-label classification. The newly proposed NMTF classifier has been compared with baseline classifiers such as k-nearest neighbours (KNN) and random forest (RF). Their performance was computed using multi-label evaluation metrics.
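To make the factorization concrete, the sketch below shows a generic NMTF of a single relation matrix R into three non-negative factors, R ≈ F S Gᵀ, using plain multiplicative updates for the squared Frobenius error. It is an assumption-laden illustration, not the classifier proposed in this project: the function name `nmtf`, the ranks, the toy drug-by-label matrix, and the idea of reading the reconstructed matrix as label scores are all hypothetical, and constraints often used in practice (such as orthogonality of the factors) are omitted.

```python
import numpy as np


def nmtf(R, k1, k2, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix R (n x m) as F @ S @ G.T with
    F (n x k1), S (k1 x k2), G (m x k2) all non-negative."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    F = rng.random((n, k1)) + eps
    S = rng.random((k1, k2)) + eps
    G = rng.random((m, k2)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates: each factor is scaled by the ratio of
        # the negative to the positive part of its gradient, so entries
        # that start non-negative stay non-negative.
        F *= (R @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (R.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ R @ G) / (F.T @ F @ S @ G.T @ G + eps)
        errors.append(float(np.linalg.norm(R - F @ S @ G.T)))
    return F, S, G, errors


# Hypothetical toy relation: 6 drugs (rows) x 4 labels (columns),
# with two blocks of drugs sharing labels.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

F, S, G, errors = nmtf(R, k1=2, k2=2)
scores = F @ S @ G.T            # reconstructed drug-label scores
row_clusters = F.argmax(axis=1)  # co-cluster assignment of each drug
print(np.round(scores, 2))
print(row_clusters)
```

One possible way to use such a factorization for multi-label prediction is to threshold the reconstructed `scores` for entries that were unknown in R; whether this matches the approach actually adopted in the project is not stated in this abstract.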
Masseroli M, Pinoli P, Venco F, Kaitoua A, Jalili V, Palluzzi F, Muller H, Ceri S. GenoMetric Query Language: a novel approach to large-scale genomic data management. Bioinformatics, 2015, 31(12): 1881-1888.
Gligorijevic V, Malod-Dognin N, Przulj N. Patient-specific data fusion for cancer stratification and personalized treatment. Pacific Symposium on Biocomputing.