Thesis abstract:
A task that has received growing attention in recent years is the selection of the most relevant elements across a number of data collections, e.g., the opinion leaders for a set of topics, the best hotel offers, or the best cities to live in. At present, the problem is addressed through ad-hoc solutions tailored to the target scenario. These solutions usually rely on data centralisation and local offline processing; they work when the amount of data is limited, but they do not scale to Big Data. Big Data are characterized by the so-called three Vs: volume (large amounts of data), velocity (highly dynamic data) and variety (data with structural and semantic heterogeneity).
In the last decade, research in this domain has addressed different sub-problems: finding the most relevant elements can be expressed through top-k queries, i.e., queries that ask for the top k tuples of a dataset, given an order expressed through a scoring function; dynamicity and velocity are addressed by stream computation and online streaming algorithms that process data in real time; data variety and data access can be addressed by exploiting ontologies to obtain a holistic view of heterogeneous data sets. How to combine these methods and techniques is an open research issue: RDF stream engines, top-k ontological query answering and top-k computation over data streams are examples of novel research trends that have been gaining attention in recent years.
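To make the notion of a top-k query concrete, the following minimal sketch ranks a small static data set with a scoring function and returns the k best tuples. The hotel records, the weights and the `top_k` and `score` names are illustrative assumptions, not part of the thesis.

```python
import heapq

def top_k(tuples, score, k):
    """Return the k tuples with the highest score, best first."""
    return heapq.nlargest(k, tuples, key=score)

# Hypothetical data: hotel offers with a rating (0-5) and a price.
hotels = [
    {"name": "A", "rating": 4.5, "price": 120},
    {"name": "B", "rating": 4.0, "price": 80},
    {"name": "C", "rating": 3.5, "price": 60},
    {"name": "D", "rating": 4.8, "price": 200},
]

MAX_PRICE = 200  # assumed upper bound used to normalise prices

def score(hotel):
    # Assumed scoring function: a weighted combination of two criteria,
    # a normalised rating (higher is better) and a normalised price
    # (lower is better).
    return 0.7 * hotel["rating"] / 5 + 0.3 * (1 - hotel["price"] / MAX_PRICE)

best_two = [h["name"] for h in top_k(hotels, score, k=2)]
```

Here the scoring function defines the order, and k bounds the size of the answer; changing the weights changes which offers count as most relevant.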
My research activity will address the following problem: given a collection of data sets with both streaming and static data, a formal model describing them, and a set of top-k queries in which each scoring function describes relevance as a combination of several criteria, optimize the computation of the top k relevant items for each query. My activity starts from an in-depth analysis of the state of the art in the data management and description logic fields. Another initial activity is the identification of a set of use cases: they are important to show that this research addresses real problems and to consolidate its potential for exploitation. Next, the research activity aims at defining: 1) techniques and approaches for top-k query answering over multiple streams and static data sets; and 2) optimizations for sets of top-k queries over those data sets. To achieve these goals, my activity will follow an iterative approach. First, I will target the problem of ontological query answering over dynamic and static data collections; then, I will target the top-k problem over data streams. In the last step, I will study how to bring all the results together, combining ontological query answering with top-k query answering in the context of data stream processing.
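As an illustration of the problem setting only, the sketch below shows a naive baseline for one top-k query over a stream combined with static data: it keeps a count-based sliding window and recomputes the answer at every arrival. The window policy, the data, the scoring weights and all names are assumptions made for this example; it is not the approach proposed in the thesis, whose aim is precisely to optimize beyond such recomputation.

```python
from collections import deque
import heapq

def sliding_top_k(stream, static_info, score, k, window_size):
    """Yield the top-k items of the current window after each arrival."""
    window = deque(maxlen=window_size)  # oldest items are evicted automatically
    for item in stream:
        window.append(item)
        # Naive baseline: rank the whole window again on every arrival.
        yield heapq.nlargest(k, window, key=lambda it: score(it, static_info))

# Hypothetical static data: number of followers per user.
followers = {"ann": 100, "bob": 50, "cy": 10}

# Hypothetical stream: (user, mentions) pairs in arrival order.
mentions_stream = [("ann", 1), ("bob", 5), ("cy", 2), ("bob", 1)]

def relevance(item, static_info):
    # Assumed scoring function mixing a streaming criterion (mentions)
    # with a static one (followers).
    user, mentions = item
    return 0.5 * mentions + 0.5 * static_info[user] / 100

leaders = [
    answer[0]
    for answer in sliding_top_k(mentions_stream, followers, relevance,
                                k=1, window_size=2)
]
```

The answer changes both when a better item arrives and when the current best item leaves the window, which is what makes top-k computation over streams harder than over static data.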