SALVI DAVIDE | Cycle: XXXVI |
Section: Telecommunications
Advisor: BESTAGINI PAOLO
Tutor: MONTI-GUARNIERI ANDREA VIRGILIO
Major Research topic:
A multimodal approach to face multimedia forensics challenges
Abstract:
Creating and sharing multimedia content is an increasingly simple operation. Thanks to the improvement of smartphones and other consumer devices, we generate a large amount of material every day and share it on the web and on social media platforms. Moreover, as connection speed, bandwidth and storage space are no longer limiting factors, shared content is becoming more and more complex, moving from simple images to high-quality videos. While this progress opens the door to exciting new possibilities, it also makes it easier for malicious users to disseminate unpleasant material. For this reason, the ability to analyze multimedia material automatically, in order to prevent dangerous consequences, is becoming of paramount importance.
In this work, we propose to tackle this problem with a multimodal approach that simultaneously analyzes different content modalities (e.g., video, audio and text). By exploiting consistencies and differences between modalities, we can increase the amount of useful information available to solve specific tasks. We do so by combining signal processing and deep learning techniques, with the goal of exploiting the strengths of both while overcoming their individual weaknesses. As an example, using the information provided by model-based approaches, we can reduce the amount of data needed by data-driven techniques and improve the results obtained.
The proposed multimodal approach can be exploited in the field of multimedia forensics. Here we are interested in a few specific and subtle traces contained within an image, a video or an audio track, while most of the other content (e.g., the colors in a scene or the timbre of a sound) is not relevant to the purpose. With a purely data-driven approach, the risk is to learn a content-based bias that is harmful to the final goal, leading to poor models and results. A hybrid approach that is both data- and model-driven, on the other hand, allows us to focus the model on the data of interest and improve its results. Moreover, the joint use of multiple data modalities opens the door to a new class of detectors that are intrinsically more robust to adversarial and anti-forensic attacks. Finally, this kind of approach can be used in many other fields. For example, jointly analyzing audio and video may help improve event classification methods, and mixing model-based and data-driven approaches may prove useful in applications related to facial analysis.