Thesis abstract:
The thesis tackles the problem of Big Data analytics by focusing on how to extract a synopsis from the data, that is, how to find recurring patterns in it. We will show how these patterns can be used to get the gist of the data, that is, to represent its most frequent properties succinctly. These properties can serve as a compact representation of the data as well as a basis for making efficient decisions. Moreover, we will see how the use of aggregates is decisive in Big Data because it allows for better analysis of the data itself.
The first aim of the thesis is to propose novel applications of data mining techniques that provide advanced database functionalities. In particular, we focus on extracting frequent information from a dataset in order to use it for query answering, that is, allowing users to query the frequent patterns rather than the data. We consider such patterns intensional information because they represent a dataset in terms of a set of properties rather than in terms of the data itself (which is called extensional information). Our goal is to propose a methodology for the XML scenario that uses association rules to represent intensional knowledge and provides an automatic strategy for translating user queries over the original dataset into queries over the mined association rules. Intensional knowledge provides (often hidden) information about the actual data contained in the database. Such information is particularly valuable when the original documents are no longer available or reachable, or when the user prefers a synthetic answer that is possibly faster to obtain but only partial.
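As a rough illustration of querying patterns rather than data, the following Python sketch mines pairwise association rules from a toy dataset and then answers a query from the rules alone. It is a simplified assumption-laden example, not the methodology of the thesis: the records are flat sets of property strings rather than XML trees, and the names `records`, `mine_rules`, and `answer_from_rules`, as well as the thresholds, are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy dataset: each record is a set of "property=value" items.
records = [
    {"author=Smith", "topic=XML", "year=2008"},
    {"author=Smith", "topic=XML", "year=2009"},
    {"author=Smith", "topic=Datalog", "year=2009"},
    {"author=Jones", "topic=XML", "year=2009"},
]


def mine_rules(records, min_support=0.5, min_confidence=0.6):
    """Mine rules `body -> head` between single items (assumed thresholds)."""
    n = len(records)
    counts = Counter()
    for rec in records:
        # Count every single item and every unordered pair of items.
        for item in rec:
            counts[frozenset([item])] += 1
        for pair in combinations(sorted(rec), 2):
            counts[frozenset(pair)] += 1

    rules = {}
    for itemset, c in counts.items():
        # Keep only frequent pairs.
        if len(itemset) != 2 or c / n < min_support:
            continue
        a, b = sorted(itemset)
        # Try both directions and keep those that are confident enough.
        for body, head in ((a, b), (b, a)):
            confidence = c / counts[frozenset([body])]
            if confidence >= min_confidence:
                rules[(body, head)] = (c / n, confidence)
    return rules


def answer_from_rules(rules, body):
    """Answer 'what usually holds when `body` holds?' using only the rules."""
    return {head: stats for (b, head), stats in rules.items() if b == body}


rules = mine_rules(records)
for head, (support, confidence) in sorted(
    answer_from_rules(rules, "author=Smith").items()
):
    print(f"author=Smith -> {head} (support={support:.2f}, confidence={confidence:.2f})")
```

The answer is intensional and partial by construction: infrequent facts, such as the single record by Jones, fall below the support threshold and cannot be recovered from the rules, which is the trade-off accepted in exchange for a compact and possibly faster answer.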
From the tree-based XML scenario we take a step further and analyze a similar but more complex representation, namely graph-based data. We present DatalogFS, an extension of Datalog that introduces more flexibility into the querying process by using count-based aggregates. Our approach allows users to write queries as DatalogFS programs, which are considered synopses of expanded Datalog programs. We provide a rewriting of DatalogFS programs into Datalog and a semantics that allows us to keep the simple and elegant least-fixpoint semantics of Datalog and all of its optimizations, such as the differential fixpoint and magic sets. We will see how to write DatalogFS programs that implement Apriori and PageRank, making our proposal helpful in analyzing both relational and web-based data. Moreover, we will also focus on the application of DatalogFS programs to the analysis of data coming from social networks. For example, using Markov chains and diffusion models, we will show how DatalogFS can be used efficiently to analyze the role of retweets in the Twitter network.