Current students


Section: Computer Science and Engineering

Major Research topic:
New techniques for big data integration: enriching data lakes potentialities

Data integration addresses the problem of reconciling data from different sources, with inconsistent schemata and formats, and possibly conflicting values.
Modern data integration is composed of four phases: the first is data extraction, tackling the problem of format heterogeneity (i.e. structured, semi-structured and unstructured data); the second phase is schema alignment, with the purpose of aligning different database schemata and understanding which attributes have the same semantics; the third phase is entity linkage, in which a given set of records is partitioned in such a way that each partition corresponds to a distinct real-world entity; finally, through data fusion, we resolve all possible conflicts that can arise when different sources provide different values for the same attribute of an entity.
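As a rough illustration of the last three phases, the following minimal sketch aligns, links, and fuses a handful of records. It is a toy under stated assumptions, not the method developed in this research: the attribute mapping `ATTRIBUTE_MAP`, the helper names, the use of a normalized name as linkage key, and majority voting as the fusion rule are all hypothetical simplifications, and every record is assumed to carry a name.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from source-specific attribute names to a shared schema.
ATTRIBUTE_MAP = {"full_name": "name", "name": "name", "town": "city", "city": "city"}


def align(record):
    """Schema alignment: rename known attributes to the shared schema."""
    return {ATTRIBUTE_MAP[k]: v for k, v in record.items() if k in ATTRIBUTE_MAP}


def link(records):
    """Entity linkage: partition records by a normalized name.

    A stand-in for real linkage, which would use similarity measures
    rather than exact matching on a single normalized attribute.
    """
    clusters = defaultdict(list)
    for record in records:
        key = " ".join(record["name"].lower().split())
        clusters[key].append(record)
    return list(clusters.values())


def fuse(cluster):
    """Data fusion: resolve conflicts per attribute by majority vote."""
    fused = {}
    attributes = {a for record in cluster for a in record}
    for attribute in sorted(attributes):
        values = [record[attribute] for record in cluster if attribute in record]
        fused[attribute] = Counter(values).most_common(1)[0][0]
    return fused


def integrate(raw_records):
    """Run alignment, linkage, and fusion in sequence."""
    aligned = [align(record) for record in raw_records]
    return [fuse(cluster) for cluster in link(aligned)]


raw = [
    {"full_name": "Ada Lovelace", "town": "London"},
    {"name": "ada lovelace", "city": "London"},
    {"name": "Ada Lovelace", "city": "Paris"},
]
result = integrate(raw)
```

Here the three records end up in a single partition, and the conflicting city values are resolved in favour of the value provided by most sources.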
My research focuses on the automation of a quality-aware data integration process, in the context of user-generated and IoT data, where we often find different data formats, semantic and representation ambiguity, and data inconsistency.
With the advent of big data, we must handle far more incomplete, dirty, and outdated information than before; new data cleaning techniques must therefore be developed to reduce noise and errors in the values provided by data sources. Part of my research is to define methods to evaluate the quality, authority, and trustworthiness of data sources.
An emerging trend is to use data lakes to collect huge amounts of data and documents, exploiting metadata and modern storage techniques. Since current systems offer rather weak support for data quality and for the automatic alignment and matching of source schemata, one objective of my research is to enhance the data lake model by introducing a robust integration component that supports data sharing.
An application domain that could benefit from quality-aware data integration is healthcare, where many heterogeneous data sources often need to be consulted easily and in an integrated way, regardless of their local design and structure. This context is also very interesting for the problem of multi-truth data fusion, since there can be more than one true value for a given data object (e.g. a patient having more than one pathology). This is particularly challenging if neither the true values nor their cardinality is known a priori, and we therefore have no indication of how the values provided by the sources are interrelated.
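To make the multi-truth setting concrete, the sketch below accepts every value whose trust-weighted support exceeds a threshold, so the number of accepted values is not fixed in advance. This is only an illustrative baseline, not the technique pursued in this research: the function name `multi_truth_fusion`, the source names, the trust weights, and the threshold of 0.5 are all assumptions made for the example.

```python
def multi_truth_fusion(claims, trust, threshold=0.5):
    """Return all values whose weighted support reaches the threshold.

    claims: dict mapping each source to the set of values it provides
            for one data object (e.g. the pathologies of one patient).
    trust:  dict mapping each source to a hypothetical trust weight.
    """
    total = sum(trust[source] for source in claims)
    support = {}
    for source, values in claims.items():
        for value in values:
            support[value] = support.get(value, 0.0) + trust[source]
    # Several values can pass the threshold, so cardinality is not assumed.
    return {value for value, weight in support.items() if weight / total >= threshold}


claims = {
    "hospital_a": {"diabetes", "hypertension"},
    "hospital_b": {"diabetes"},
    "clinic_c": {"hypertension", "asthma"},
}
trust = {"hospital_a": 0.9, "hospital_b": 0.8, "clinic_c": 0.3}
accepted = multi_truth_fusion(claims, trust)
```

With these made-up weights, two pathologies are accepted and the one backed only by the least trusted source is rejected. A realistic approach would also have to estimate the trust weights themselves, which this sketch takes as given.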