For the 2017/2018 academic year, the Machine Learning Group proposes about ten thesis topics for Master's students. Application domains include high-performance computing, bioinformatics, sensor networks, artificial evolution, computer-assisted medicine, artificial proteins, and network dynamics.

**NB: The number of topics is limited. Interested students are asked to come forward as early as possible.**

**1. Machine learning on big data (Gianluca Bontempi, Catharina Olsen, Yann-Aël Le Borgne)**

**4. Affective Computing (Tom Lenaerts)**

**5. Simulators for data mining of geographical mobility data (Gianluca Bontempi, Giovanni Buroni)**

**6. Evolutionary origins of perception, an AI perspective (Tom Lenaerts)**

**7. Evolution of cooperation in Bayesian games (Tom Lenaerts)**

**8. Visual assessment of machine learning techniques for fraud detection (Gianluca Bontempi)**

**9. Bioinformatics proposal: Identifying tasks in exome data (Tom Lenaerts)**

**10. Chat bots playing strategic games (Tom Lenaerts)**

**11. Machine learning in medical and biological contexts (Tom Lenaerts)**

The collection of gigantic datasets in several domains (e.g. social networks, finance, the internet) and the need to extract useful information from them call for new and effective techniques to store and mine very large data structures. The Master's thesis will focus on methods to scale up and parallelize machine learning algorithms so as to deal effectively with very large and distributed databases (e.g. Hadoop, Spark/MLlib). The objective of the thesis is to design and set up a running distributed system (based on existing open-source solutions) to store and analyze huge datasets.

Required competences: machine learning, computational statistics, programming.

Useful links:

- Mahout and Apache
- Machine learning and Hadoop
- Machine learning and Hadoop guide
- R and Hadoop
- R and Hadoop tutorial
- R, Apache and Hadoop
- Spark
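The map/shuffle/reduce programming model that underlies Hadoop and Spark can be sketched in plain Python. This is a toy single-machine illustration of the paradigm, not a use of any actual Hadoop or Spark API; in a real cluster each phase runs on different machines over partitions of the data.

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word occurrence
    return [(w, 1) for line in lines for w in line.split()]

def shuffle_phase(pairs):
    # shuffle: group emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: aggregate the values of each key (here: sum the counts)
    return {key: sum(values) for key, values in groups.items()}

def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))

counts = word_count(["big data big", "data mining"])
```

The frameworks listed above provide distributed, fault-tolerant versions of exactly these three phases; the machine learning part of the thesis consists in expressing learning algorithms in this model.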

Multiscale or multiresolution analysis is a technique for analyzing and processing data in a telescopic way: the data is decomposed into a representation that separates global, large-scale features from small-scale details, with a broad spectrum in between. In that sense, multiscale analysis is related to frequency (Fourier) analysis (with slowly and fast oscillating components); unlike a Fourier transform, however, a multiscale analysis keeps information about the location in the original time or space domain.

The best-known example of a multiscale analysis is a wavelet decomposition. Wavelets are particularly popular in image processing, for instance in the JPEG 2000 compression standard. This thesis investigates another multiresolution algorithm, known as a Laplacian pyramid. The Laplacian pyramid is a slightly overcomplete transform, meaning that it maps n data values onto up to 2n coefficients in the multiscale representation. It can be implemented as an overcomplete version of a lifting scheme, which is a fast implementation of the wavelet transform.

In this thesis, the Laplacian pyramid is equipped with a local polynomial smoothing technique popular in statistics. The objective is to investigate the properties of a Laplacian pyramid with local polynomial smoothing in image-processing applications (denoising, compression).
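One level of the decomposition can be sketched in a few lines of NumPy. This is a minimal 1-D illustration with a pairwise-average coarsening filter and a trivial (nearest-neighbour) prediction step; in the thesis, the prediction step would be replaced by a local polynomial smoother. Note how n samples become n/2 coarse plus n detail coefficients, the "slightly overcomplete" property mentioned above.

```python
import numpy as np

def downsample(x):
    # coarse approximation: average adjacent pairs (even-length input)
    return 0.5 * (x[0::2] + x[1::2])

def upsample(c, n):
    # predict the fine scale from the coarse one; a local polynomial
    # smoother would replace this simple repetition
    return np.repeat(c, 2)[:n]

def laplacian_level(x):
    c = downsample(x)
    d = x - upsample(c, len(x))   # detail = residual of the prediction
    return c, d

def reconstruct(c, d):
    # perfect reconstruction: add the detail back to the prediction
    return upsample(c, len(d)) + d

x = np.array([1.0, 2.0, 4.0, 8.0])
c, d = laplacian_level(x)
x_rec = reconstruct(c, d)
```

Whatever prediction operator is plugged in, reconstruction is exact by construction, which is what makes the pyramid attractive for denoising and compression experiments.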

The selection of an optimal model from a broad spectrum of non-nested models can be driven by a criterion that balances good prediction of the training set against the complexity of the model, that is, the number of selected variables. Optimization over the number of variables, or even the comparison of models with a given number of variables, is a problem of combinatorial complexity, and thus not feasible in the context of high-dimensional data. The problem can be well approximated by replacing the number of selected variables in the criterion by the sum of absolute values of the estimated coefficients of these variables within the selected model. The counting measure is replaced by a sum of magnitudes, turning a combinatorial problem into a convex quadratic programming problem. This problem can be solved by a wide range of algorithms, including direct methods, such as least angle regression, and iterative methods, such as iterative thresholding or gradient projection. Moreover, for a fixed value of model complexity, the relaxed problem selects approximately the same model as the original combinatorial one. This is no longer the case when the model complexity is part of the optimization problem, but a correction for the divergence between the combinatorial and quadratic problems can be established. The thesis is about applying this variable selection to sparse inverse problems, or to deblurring and denoising images, using gradient projection or iterative thresholding.
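The iterative thresholding method mentioned above can be sketched as follows for the l1-penalised least squares problem min_b ½‖y − Xb‖² + λ‖b‖₁, i.e. the convex relaxation of best-subset selection. The data here are synthetic, chosen only to show the method recovering a sparse coefficient vector.

```python
import numpy as np

def soft_threshold(z, t):
    # proximal operator of the l1 norm: shrink towards zero by t
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    # iterative soft-thresholding: gradient step on the smooth part,
    # then soft-thresholding to enforce sparsity
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)           # gradient of 0.5*||y - Xb||^2
        b = soft_threshold(b - grad / L, lam / L)
    return b

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
b_true = np.zeros(10)
b_true[0], b_true[3] = 3.0, -2.0           # sparse ground truth
y = X @ b_true
b_hat = ista(X, y, lam=0.5)                # recovers the two active variables
```

The soft-thresholding step is exactly where the counting measure has been traded for a sum of magnitudes: coefficients whose gradient update stays below λ/L are set to zero, which is how variable selection emerges from a convex iteration.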

As defined on Wikipedia: "Affective computing is the study and development of systems and devices that can recognize, interpret, process, and simulate human affects. It is an interdisciplinary field spanning computer science, psychology, and cognitive science. While the origins of the field may be traced as far back as to early philosophical inquiries into emotion, the more modern branch of computer science originated with Rosalind Picard's 1995 paper on affective computing. A motivation for the research is the ability to simulate empathy. The machine should interpret the emotional state of humans and adapt its behavior to them, giving an appropriate response to those emotions."

Students interested in this field can contact prof. Tom Lenaerts (tlenaert@ulb.ac.be) for a discussion on the topic they can address. Possibilities are:

1) Machine learning applications in the area of affective computing (recognition, detection, classification of affects, etc.)

2) Modelling of the evolutionary origins of human emotions using game theory and agent-based modelling

References:

1) http://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199942237.001.0001/oxfordhb-9780199942237

2) http://acii2017.org

3) http://opencv.org

Mobility is an aspect of growing relevance in our daily lives. It acts as 'the economy's backbone', supporting other sectors throughout the economic system. Many studies, e.g. the IBM smart cities study in Brussels, have shown that Brussels is lagging behind other capital cities. The objective of the MA thesis is to compare and assess existing open-source traffic simulators.

The first part of the work will consist in reviewing and assessing existing open-source solutions (e.g. MATSim [1]). The second part will consist in using the chosen simulator to simulate scenarios of Brussels Region traffic. Note that the simulator is expected to allow the user to simulate what-if scenarios (what if a car accident takes place in the tunnel? what if roadworks start?) and to generate a continuous stream of sensed measures about the traffic. The MA student will also be supervised by a researcher involved in the ICITY MobiAId project [2] and is expected to interact with researchers of Bruxelles Mobilité (offices located in Gare du Nord).
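The kind of output expected from the simulator can be sketched as a generator of sensed measures with a what-if event injected at a chosen time. Everything here (sensor name, capacities, demand levels) is invented for illustration; a real simulator such as MATSim models individual agents on a road network rather than aggregate flows.

```python
import random

def traffic_stream(n_steps, incident_at=None, seed=0):
    # yield one sensed measure per time step; from `incident_at` onward,
    # a hypothetical accident halves the capacity of the road section
    rng = random.Random(seed)
    capacity = 100                 # vehicles/min the section can carry
    for t in range(n_steps):
        if incident_at is not None and t >= incident_at:
            capacity = 50          # what-if: accident in the tunnel
        demand = 80 + rng.randint(-10, 10)
        flow = min(demand, capacity)
        yield {"t": t, "sensor": "tunnel_A", "flow": flow,
               "congested": demand > capacity}

measures = list(traffic_stream(20, incident_at=10))
```

Comparing the stream with and without the incident is the simplest form of the what-if analysis the thesis should support at the scale of the whole Brussels network.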

References:

[2] http://mlg.ulb.ac.be/node/810

The goal of this proposal is to delve deeper into the work of Hoffman and colleagues (see refs) and to expand it into a system that uses actual cameras. The starting point is the article entitled "Natural selection and veridical perception". The question we wish to raise is how complex vision systems need to be in order to capture sufficient data to act in a real environment. The research contains two parts: 1) reproduce the work in one of the papers and analyse the outcome more deeply; 2) define and test a vision system that would allow one to explore the questions posed in their paper in a real environment.

Contact prof. Tom Lenaerts (tlenaert@ulb.ac.be) for more information.

References:

Classic studies of the origin of cooperation examine games with full information: each player knows exactly what the actions are and what preferences every player has over those actions. Research has shown that mechanisms like kin selection, direct/indirect reciprocity, or network reciprocity are required to induce cooperative behaviour. So far, no studies have taken into account that players may not have all the information (games with imperfect information). Such situations can be represented by Bayesian games, which include the signals and beliefs players have about the game being played. In this proposal we aim to examine the evolutionary dynamics of this kind of game. First, the state of the art will be investigated; then simulations will be performed in order to understand the influence of this uncertainty on the evolved behaviour.
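The baseline claim above, that without mechanisms such as reciprocity cooperation is driven to extinction, can be reproduced with a few lines of replicator dynamics on the one-shot Prisoner's Dilemma. A Bayesian variant of this sketch would replace the fixed payoff matrix by an average over the players' beliefs about which game is being played.

```python
import numpy as np

# Prisoner's Dilemma payoffs: rows = own strategy (C, D), cols = opponent's
R, S, T, P = 3.0, 0.0, 5.0, 1.0
A = np.array([[R, S],
              [T, P]])

def replicator_step(x, dt=0.01):
    # discrete Euler step of the replicator equation dx_i = x_i (f_i - phi)
    f = A @ x            # fitness of each strategy against the population
    phi = x @ f          # average population fitness
    return x + dt * x * (f - phi)

x = np.array([0.9, 0.1])   # start with 90% cooperators
for _ in range(20000):
    x = replicator_step(x)
# cooperation collapses: x converges to (almost) all defectors
```

The thesis would start from simulations of this kind and extend the state space with signals and beliefs, following the Bayesian game formalism in the references below.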

Contact prof. Tom Lenaerts (tlenaert@ulb.ac.be) for more information.

References:

1) http://www.ma.huji.ac.il/~zamir/documents/BayesianGames_ShmuelZamir.pdf

2) http://www.eecs.harvard.edu/cs286r/courses/fall08/files/lecture5.pdf

3) https://www.nature.com/articles/srep25813

4) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3443135/

As businesses continue to evolve and migrate to the Internet and money is transacted electronically in an ever-growing cashless banking economy, accurate fraud detection remains a key concern for modern banking systems. MLG and Worldline have been closely collaborating on this topic for 5 years, using machine learning techniques to improve fraud detection in online transactions. In this context, the goal of this master's thesis will be to further improve the fraud detection system and the interpretation of alerts by (i) designing a methodological framework to compare the performance of different machine learning techniques on the transaction data shared by Worldline, (ii) developing an interactive tool to facilitate the comparison of algorithms, and (iii) exploring data visualisation techniques that could help interpret the fraud detection algorithms. Through this topic, the student will have the opportunity to gain skills in machine learning, fraud detection techniques, and interactive Web development. The student will also have opportunities to collaborate with the Worldline fraud detection team.
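One building block of the comparison framework in point (i) can be sketched as follows: scoring detectors by precision at the top-k alerts, a metric often preferred on highly imbalanced fraud data because the alert budget of human investigators is limited. The detectors and labels below are synthetic stand-ins, not Worldline data.

```python
import numpy as np

def precision_at_k(scores, labels, k):
    # fraction of true frauds among the k highest-scored transactions
    top = np.argsort(scores)[::-1][:k]
    return labels[top].mean()

rng = np.random.default_rng(42)
labels = (rng.random(1000) < 0.02).astype(float)      # ~2% fraud rate
# a detector whose scores correlate with the labels, vs. a random one
good_detector = labels + 0.3 * rng.standard_normal(1000)
random_detector = rng.standard_normal(1000)

p_good = precision_at_k(good_detector, labels, k=20)
p_rand = precision_at_k(random_detector, labels, k=20)
```

The framework would compute such metrics (together with ROC/PR curves) for each candidate algorithm on held-out transaction data, feeding the interactive comparison tool of point (ii).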

References:

[1] A. Dal Pozzolo. Adaptive Machine Learning for Credit Card Fraud Detection. PhD thesis, Université Libre de Bruxelles, Belgium, 2015. http://www.ulb.ac.be/di/map/adalpozz/pdf/Dalpozzolo2015PhD.pdf

[2] F. Carcillo, Y. Le Borgne, O. Caelen, and G. Bontempi. An assessment of streaming active learning strategies for real-life credit card fraud detection. In Proceedings of the 4th IEEE International Conference on Data Science and Advanced Analytics (DSAA 2017), October 2017.

In this research, the objective is to examine the Pareto task inference (ParTI) approach developed by the Alon group and apply it to exome data. More information on the approach can be found at: https://www.weizmann.ac.il/mcb/UriAlon/download/ParTI

A first task will be to see whether this package can be translated to Python. In a second part, you will evaluate the relevance of the approach for understanding neurodevelopmental data.
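ParTI looks for data arranged inside polytopes whose vertices ("archetypes") are each optimal for one task. A minimal related notion that a Python translation would build on is the Pareto front: the points not dominated in every task objective by any other point. The sketch below assumes, purely for illustration, that lower values are better in both tasks.

```python
import numpy as np

def pareto_front(points):
    # return indices of points not dominated by any other point
    # (q dominates p if q <= p in all objectives and q < p in at least one)
    front = []
    for i, p in enumerate(points):
        dominated = any(
            np.all(q <= p) and np.any(q < p)
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            front.append(i)
    return front

points = np.array([[1.0, 4.0],   # on the front
                   [2.0, 2.0],   # on the front
                   [4.0, 1.0],   # on the front
                   [3.0, 3.0]])  # dominated by [2.0, 2.0]
front = pareto_front(points)
```

The actual ParTI algorithm goes further, fitting the enclosing polytope and assigning a task to each archetype; this filter only identifies the trade-off surface.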

References:

1) Introduction to chat bots, e.g. https://tutorialzine.com/2016/11/introduction-to-chatbots

2) The telegram.org API

Different projects are available on the use and exploration of machine learning methods for medical and biological problems. One of the current focuses of the group is on diseases whose origin is determined by two or more variants in two or more genes. Below are some projects we would like a Master's student to work on:

- Semi-supervised learning on digenic disease data: explore the state of the art in semi-supervised learning and prepare a thorough synthesis of the relevant approaches. Then evaluate a number of these approaches on the problem of predicting the type of digenic disease, as discussed in: Gazzo, A., Raimondi, D., Daneels, D., Moreau, Y., Smits, G., Van Dooren, S., & Lenaerts, T. (2017). Understanding mutational effects in digenic diseases. Nucleic Acids Research, 45(15), e140.
- Feature evaluation and selection for a digenic pathogenicity predictor: the quality of a predictor depends on its features: are they relevant? Can they be used to generalise? Moreover, we want to limit the number of features used to build a predictor, as too many features lead to overfitting. The goal of this project is to explore a collection of features that may be used to predict the pathogenicity of variants in genes, and to examine which are most relevant in a digenic context.
- Exploring approaches to making the decision process of black-box machine learning methods understandable: we want you to explore techniques like https://github.com/andosa/treeinterpreter and others, and see how they can be used to improve our insight into why certain predictions are made by some of the predictors we have constructed.
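One simple, model-agnostic technique relevant to both the feature-selection and the black-box-interpretation projects above is permutation importance: shuffle one feature at a time and measure how much a fitted predictor's accuracy drops. The predictor below is a toy stand-in (it thresholds a single feature) so that the expected behaviour is obvious; in practice it would be one of the trained pathogenicity predictors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X = rng.standard_normal((n, 3))
y = (X[:, 0] > 0).astype(int)          # only feature 0 is informative

def predictor(X):
    # stand-in for any fitted black-box model
    return (X[:, 0] > 0).astype(int)

def permutation_importance(X, y, predict):
    # drop in accuracy when each feature is shuffled in turn
    base = np.mean(predict(X) == y)
    drops = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])   # destroy feature j
        drops.append(base - np.mean(predict(Xp) == y))
    return np.array(drops)

imp = permutation_importance(X, y, predictor)
# imp[0] is large; imp[1] and imp[2] are zero, as the model ignores them
```

Unlike treeinterpreter, which decomposes individual tree predictions, permutation importance needs only the model's predict function, so it applies to any of the predictors constructed in the group.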
