Section: Computer Science and Engineering

Major Research topic:
Data-driven models for anomaly detection and clustering

Anomaly Detection (AD) in datastreams is a fundamental problem in Data Science that raises both theoretical and practical challenges, particularly when the monitored datastreams are high-dimensional and non-stationary. My research focuses on designing new data-driven AD methods that can handle high-dimensional data while adapting to changes in the process generating normal (i.e., anomaly-free) data. I am particularly interested in unsupervised techniques, which are very practical in real-world scenarios but deserve further investigation from an algorithmic perspective.
My research in the AD domain is supported by a collaboration between Politecnico di Milano and Cleafy, a company providing systems for monitoring and assessing transactional risk factors. In this collaboration, I am exploring and applying tree-based AD techniques, with particular emphasis on the Isolation Forest approach, to detect threats in online web sessions.
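To illustrate the idea behind Isolation Forest mentioned above, here is a minimal pure-Python sketch. It is a textbook-style illustration, not the method developed in the collaboration: the function names (`fit_forest`, `anomaly_score`, and so on), the default parameters, and the 2-D toy data are all assumptions made for this example. The intuition is that anomalous points are separated from the rest by fewer random splits, so their average path length in the trees is shorter and their score is closer to 1.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Approximate average path length of an unsuccessful BST search among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split the points on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # all values equal on this feature: stop splitting here
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(tree, x, depth=0):
    """Depth at which x lands, plus a correction for the unsplit points in the leaf."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, dim, split, left, right = tree
    child = left if x[dim] < split else right
    return path_length(child, x, depth + 1)


def fit_forest(points, n_trees=100, sample_size=64, seed=0):
    """Build n_trees isolation trees, each on a random subsample of the points."""
    size = min(sample_size, len(points))
    if size < 2:
        raise ValueError("need at least two points to build a forest")
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(points, size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, size


def anomaly_score(forest, x):
    """Score in (0, 1); higher values mean x is isolated more quickly."""
    trees, size = forest
    mean_path = sum(path_length(t, x) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / c_factor(size))


if __name__ == "__main__":
    rng = random.Random(0)
    normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    forest = fit_forest(normal)
    print("score at the centre:", anomaly_score(forest, (0.0, 0.0)))
    print("score far away:     ", anomaly_score(forest, (8.0, 8.0)))
```

In this sketch the forest is fitted on anomaly-free data only, and a point far from the training distribution should receive a higher score than a point near its centre.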
My research also concerns new Clustering techniques, which are closely related to the AD field: anomaly-free data are often identified as those belonging to a certain group or satisfying a particular model. Clustering is a widely studied problem in pattern recognition and data mining, especially in the data exploration phase, yet it has no universal solution. Among the possible Clustering techniques, I am interested in the Multi-Model Fitting (MMF) approach, which assumes that data belonging to the same cluster satisfy the equation of some parametric model.
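The MMF assumption above can be made concrete with a small sketch. The example below uses sequential RANSAC on 2-D lines, a common baseline for multi-model fitting; it is chosen here purely for illustration and is not necessarily the algorithm studied in this research. The function names, the inlier threshold `eps`, and the other parameters are assumptions. Points that fit no model keep the label -1, which shows how clustering by model and anomaly detection meet.

```python
import math
import random


def line_through(p, q):
    """Return (a, b, c) with a*x + b*y + c = 0 and a^2 + b^2 = 1, or None if p == q."""
    a, b = q[1] - p[1], p[0] - q[0]
    norm = math.hypot(a, b)
    if norm == 0.0:
        return None
    return a / norm, b / norm, -(a * p[0] + b * p[1]) / norm


def residual(line, point):
    """Orthogonal distance from the point to the (normalised) line."""
    a, b, c = line
    return abs(a * point[0] + b * point[1] + c)


def sequential_ransac(points, eps=0.1, n_trials=200, min_inliers=3, seed=0):
    """Greedily extract line models; return per-point labels (-1 = no model) and the models."""
    rng = random.Random(seed)
    labels = [-1] * len(points)
    models = []
    remaining = list(range(len(points)))
    while len(remaining) >= max(2, min_inliers):
        best_line, best_inliers = None, []
        for _ in range(n_trials):
            # Hypothesise a line from a random pair of still-unexplained points.
            i, j = rng.sample(remaining, 2)
            line = line_through(points[i], points[j])
            if line is None:
                continue
            inliers = [k for k in remaining if residual(line, points[k]) <= eps]
            if len(inliers) > len(best_inliers):
                best_line, best_inliers = line, inliers
        if len(best_inliers) < min_inliers:
            break  # no remaining model has enough support
        for k in best_inliers:
            labels[k] = len(models)
        models.append(best_line)
        remaining = [k for k in remaining if labels[k] == -1]
    return labels, models


if __name__ == "__main__":
    xs = [0, 1, 2, 3, 4, 6, 7, 8, 9]
    pts = ([(float(x), float(x)) for x in xs]
           + [(float(x), 10.0 - x) for x in xs]
           + [(2.0, 30.0)])
    labels, models = sequential_ransac(pts)
    print(labels)
```

Here each cluster is the set of points satisfying one line equation up to `eps`; the isolated point at the end of the toy data is explained by no model.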
In particular, I have been investigating an extended version of an MMF algorithm for the case where multiple families of parametric models need to be employed. I am currently investigating how to combine MMF with Ensemble Clustering approaches based on the construction of Random Trees through Locality-Sensitive Hashing, a technique grounded in the probabilistic approximation of distance functions. A particularly interesting problem that I will address is the clustering of data structures characterized by highly variable density.
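As a rough illustration of how hashing can approximate a distance probabilistically, the sketch below uses random-hyperplane hashing, one well-known Locality-Sensitive Hashing family for angular distance. The choice of this family, the function names, and the number of bits are assumptions for the example; the text above does not say which hash family is used. For a random hyperplane, two vectors fall on opposite sides with probability equal to their angle divided by pi, so the fraction of differing signature bits gives an estimate of the angle.

```python
import math
import random


def random_hyperplanes(n_bits, dim, rng):
    """Draw n_bits random normal vectors, one hyperplane through the origin per bit."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(x, planes):
    """One bit per hyperplane: which side of the hyperplane x falls on."""
    return tuple(sum(w * v for w, v in zip(plane, x)) >= 0.0 for plane in planes)


def estimated_angle(sig_a, sig_b):
    """Estimate the angle (radians) from the fraction of differing bits."""
    differing = sum(a != b for a, b in zip(sig_a, sig_b))
    return math.pi * differing / len(sig_a)


if __name__ == "__main__":
    rng = random.Random(0)
    planes = random_hyperplanes(2000, 2, rng)
    a, b = (1.0, 0.0), (0.0, 1.0)
    print("true angle:     ", math.pi / 2)
    print("estimated angle:", estimated_angle(signature(a, planes), signature(b, planes)))
```

Vectors with a small angle between them share most signature bits, which is the property that lets hash-based random trees group nearby points without computing all pairwise distances.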
In this direction, I will investigate new probabilistic strategies to improve AD and Clustering performance.