Data Engineering for Data Science (DEDS)

Project Overview:

Data is a key asset in modern society. Data Science, which focuses on deriving valuable insight and knowledge from raw data, is indispensable for any economic, governmental, and scientific activity. Data Engineering provides the data ecosystem (i.e., data management pipelines, tools and services) that makes Data Science possible. The European Joint Doctorate in “Data Engineering for Data Science” (DEDS) is designed to develop education, research, and innovation at the intersection of Data Science and Data Engineering. Its core objective is to provide holistic support for the end-to-end management of the full lifecycle of data, from capture to exploitation by data scientists.

DEDS operates under the Horizon 2020 – Marie Skłodowska-Curie Innovative Training Networks (H2020-MSCA-ITN-2020) framework. It is jointly organised by Université Libre de Bruxelles (Belgium), Universitat Politècnica de Catalunya (Spain), Aalborg Universitet (Denmark), and the Athena Research and Innovation Centre (Greece). Partner organisations from research, industry and the public sector prominently contribute to the programme by training students and providing secondments in a wide range of domains including Energy, Finance, Health, Transport, and Customer Relationship and Support.

DEDS is a 3-year doctoral programme based on a co-tutelle model. A complementary set of 15 joint, fully funded doctoral projects focuses on the main aspects of holistic management of the full data lifecycle. Each doctoral project is co-supervised by two beneficiaries and includes a secondment in a partner organisation, which grounds the research in practice and validates the proposed solutions. DEDS delivers innovative training comprising technical and transversal courses, four jointly organised summer and winter schools, as well as dissemination activities including open science events and a final conference. Upon graduation, a joint degree from the universities of the co-tutelle will be awarded.

 

Our involvement:

Data-driven systems play a crucial role in many applications and are indispensable for any scientific, economic, and governmental activity nowadays. Given the ubiquity of such systems, the importance of implementing secure learning algorithms and practices can hardly be overestimated. A notable example comes from online banking, where fraud detection systems rely heavily on past data and employ machine learning to limit the number and size of undetected bank frauds, which cause losses amounting to billions of euros every year.

Furthermore, fraud detection systems are an excellent test bed for secure learning algorithms, given the challenging nature of the task: the difficulties range from machine learning problems such as imbalanced classification, concept drift, and delayed feedback to practical constraints such as scalability and reaction time. Therefore, studying secure learning in fraud detection can benefit other fields that face one or more of the same challenges and improve the security of data-driven systems as a whole.

So far, many techniques have been developed and deployed for detecting fraud using machine learning. However, an aspect of fraud detection that has not been sufficiently covered is secure learning and its scalable implementation in a realistic streaming setting. In such a setting, a model is evaluated both for its accuracy and for its robustness to attackers aiming to exploit the system's vulnerabilities. Unfortunately, existing literature typically considers simplified and/or static environments and makes stringent and unrealistic assumptions about the data distribution and the attacker's capabilities.

This doctoral research aims to integrate the existing literature on fraud detection with adversarial learning, testing existing methods against classical adversarial machine learning attacks and creating original defensive techniques in a streaming environment. In particular, we plan to test our solution in a realistic fraud detection environment, based on a simulator recently developed by the Machine Learning Group (MLG) of ULB and possibly with real-world data provided by the secondment partner.

 

Funding

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 955895.

 

Duration

2021-2024

 

Involved MLG Researchers and Supervisors:

Daniele Lunghi  

Prof. Gianluca Bontempi