Current students


Section: Computer Science and Engineering

Major Research topic:
On learning from massive, evolving, and imbalanced data streams

In recent years, a growing number of Internet-connected devices (e.g., smartphones, wearables, computers, and Internet of Things sensors) and web applications (e.g., Facebook, Instagram, and Twitter) have been producing continuous, unbounded, time-annotated flows of data. In the past, those flows were mostly stored as time series and analysed in batches; nowadays they are often processed in real time as data streams. This poses new challenges. For instance, keeping the entire stream in memory is infeasible because an unbounded stream would require an unbounded amount of space. These challenges have been addressed in databases, distributed systems, and information systems since the turn of the millennium. More recently, the Machine Learning (ML) community has also started to address them. Streaming Machine Learning (SML) is a new approach able to tackle those challenges. Every time a new instance arrives, a streaming learner inspects it, uses it to update the model incrementally, and immediately discards it. In this way, the streaming learner is able to make predictions at any moment. Moreover, SML includes techniques that cope with the possible lack of stationarity in the process that generates the data (ADWIN, for example). In SML terminology, those techniques detect when concept drift occurs and adapt the model accordingly. However, SML still lacks some fundamental techniques for time-series analysis that are used in the traditional batch scenario. For instance, in batch ML it is very common to detect and remove trends and seasonality during time-series analysis. In SML, ADWIN does not consider these factors when detecting concept drift. Moreover, in ML it is well known that a classification problem cannot always ignore a possibly unequal distribution of the classes, i.e., the learner should not neglect the minority instances and focus only on the most frequent ones.
This problem becomes even more evident in the stream context because instances are observed one at a time and gradually, which further reduces how often minority instances are seen. This may prevent or delay the discovery of any existing patterns in the minority class. The aim of this thesis is to investigate how to fill the gap between batch time-series analysis and SML. Our goal is to conceive, design, and evaluate streaming versions of popular batch approaches. We have already performed some investigations in this direction: we created a streaming version of a popular rebalancing technique for classification. Our next step is to extend it to regression.
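
To make the inspect, update, and discard cycle described above concrete, the following is a minimal sketch in plain Python. It is only an illustration under stated assumptions, not the technique developed in this thesis: the learner (an online logistic regression trained with one gradient step per instance), the synthetic stream, and all names such as OnlineLogisticRegression, learn_one, and prequential_accuracy are hypothetical. The sketch handles neither concept drift nor class imbalance.

```python
import math
import random


class OnlineLogisticRegression:
    """Hypothetical minimal streaming learner: one SGD step per instance."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features  # model size is fixed, independent of stream length
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else 0

    def learn_one(self, x, y):
        # Gradient of the logistic loss for a single instance.
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


def synthetic_stream(n, seed=0):
    """Assumed toy stream: two features, label is 1 when their sum is positive."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, (1 if x[0] + x[1] > 0 else 0)


def prequential_accuracy(model, stream):
    """Test-then-train: predict on each instance first, then learn from it."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # the model can predict at any moment
        model.learn_one(x, y)                  # incremental update
        total += 1
        # x and y are not stored anywhere: the instance is discarded here
    return correct / total if total else 0.0


model = OnlineLogisticRegression(n_features=2)
acc = prequential_accuracy(model, synthetic_stream(5000))
print(f"prequential accuracy: {acc:.3f}")
```

Because the stream is consumed through a generator and the model keeps only its weights, memory use stays constant however long the stream runs.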