
Master Thesis

Master thesis subjects MLG 2017-2018

For the 2017/2018 academic year, the Machine Learning Group offers about ten thesis subjects for master students. Application domains include high-performance computing, bioinformatics, sensor networks, artificial evolution, computer-assisted medicine, artificial proteins and network dynamics.

NB: The number of subjects is limited. Interested students are asked to come forward as early as possible.


Topics:

1. Machine learning on big streams of data (Gianluca Bontempi, Catharina Olsen)

2. Multiscale kernel smoothing for use in image processing (Maarten Jansen)

3. Fast variable selection without shrinkage (Maarten Jansen)

4. Affective Computing (Tom Lenaerts)

5. Simulators for data mining of geographical mobility data (Gianluca Bontempi, Giovanni Buroni)

6. Deep reinforcement learning (Tom Lenaerts, Axel Abels)

7. Visual assessment of machine learning techniques for fraud detection (Gianluca Bontempi)

8. Chat bots playing strategic games (Tom Lenaerts)

9. Developing an improved variant combination pathogenicity predictor (Tom Lenaerts, Sofia Papadimitriou)

10. Measuring learning in networks (Tom Lenaerts)

11. Evaluation of biomarker resilience to noise and missing data (Matthieu Defrance)

12. Analysis of non-coding RNA expression in immune cells using single cell sequencing (Matthieu Defrance)

 

 

 

 

1. Machine learning on big streams of data (Gianluca Bontempi, Catharina Olsen)

The collection of gigantic datasets in several domains (e.g. social networks, finance, Internet of Things) and the need to extract useful information from them call for new and effective techniques to mine very large data sets without storing them. The Master thesis will focus on methods to scale up and parallelise machine learning algorithms so that they deal effectively with fast, high-dimensional streams of data, with a particular focus on time series forecasting. The objective of the thesis is to design and set up a running distributed system (based on Spark Streaming) to predict the future behaviour of massive streams of online data.
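As a single-machine illustration of forecasting a stream without storing it, the sketch below updates a small autoregressive model one observation at a time and keeps only the last few values in memory. The class name, the normalised LMS update rule and the synthetic sine stream are assumptions made for the example; they are not part of the thesis description, and the distributed (Spark) side is not covered.

```python
from collections import deque
import math


class OnlineARForecaster:
    """One-step-ahead autoregressive forecaster updated one sample at a time."""

    def __init__(self, order=2, step=0.5):
        self.order = order
        self.step = step                    # normalised LMS step size, 0 < step < 2
        self.weights = [0.0] * order
        self.window = deque(maxlen=order)   # only the last `order` values are kept

    def predict(self):
        """Forecast the next value; returns 0.0 until the window is full."""
        if len(self.window) < self.order:
            return 0.0
        return sum(w * x for w, x in zip(self.weights, self.window))

    def update(self, value):
        """Consume one new observation and return the forecast error made on it."""
        error = value - self.predict()
        if len(self.window) == self.order:
            norm = sum(x * x for x in self.window) + 1e-8
            self.weights = [w + self.step * error * x / norm
                            for w, x in zip(self.weights, self.window)]
        self.window.append(value)
        return error


def sine_stream(n):
    """Hypothetical data source: values arrive one by one and are never stored."""
    for t in range(n):
        yield math.sin(0.1 * t)


model = OnlineARForecaster(order=2, step=0.5)
errors = [abs(model.update(x)) for x in sine_stream(2000)]
early = sum(errors[:50]) / 50    # mean absolute error on the first 50 samples
late = sum(errors[-50:]) / 50    # mean absolute error on the last 50 samples
```

Memory use is constant in the length of the stream, which is the property a streaming learner needs; a distributed version would additionally have to partition the stream and combine model updates.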



 

2. Multiscale kernel smoothing for use in image processing (Maarten Jansen)

Multiscale or multiresolution analysis is a technique for the analysis and processing of data in a telescopic way. That means that the data is decomposed into a representation that separates global, large scale features from small scale details, with a broad spectrum in between. In that sense, multiscale is related to a frequency (Fourier) analysis (with slowly and rapidly oscillating components), but, unlike a Fourier transform, a multiscale analysis keeps information on the location in the original time or space domain.
The best known example of a multiscale analysis is a wavelet decomposition. Wavelets are particularly popular in image processing, for instance in the JPEG 2000 compression standard. This thesis investigates the use of another algorithm for a multiresolution analysis, known as a Laplacian pyramid. This Laplacian pyramid is a slightly overcomplete transform, meaning that it maps n data onto approximately 2n coefficients in the multiscale representation. It can be implemented as an overcomplete version of a lifting scheme, which is a fast implementation of the wavelet transform.
In this thesis, the Laplacian pyramid is equipped with a local polynomial smoothing technique, popular in statistics. The objective is to investigate the properties of a Laplacian pyramid with local polynomial smoothing in applications of image processing (denoising, compression).
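To make the overcompleteness concrete, here is a minimal one-dimensional Laplacian pyramid sketch. It assumes a simple [1, 2, 1]/4 smoothing filter and linear interpolation as the prediction step; these are stand-ins chosen for brevity, not the local polynomial smoother studied in the thesis, and all function names are hypothetical.

```python
def smooth_and_subsample(signal):
    """Low-pass with a [1, 2, 1]/4 filter, then keep every second sample."""
    n = len(signal)
    smoothed = []
    for i in range(n):
        left = signal[max(i - 1, 0)]
        right = signal[min(i + 1, n - 1)]
        smoothed.append(0.25 * left + 0.5 * signal[i] + 0.25 * right)
    return smoothed[::2]


def predict_fine(coarse, n):
    """Upsample the coarse signal back to length n by linear interpolation."""
    fine = []
    for i in range(n):
        k = i // 2
        if i % 2 == 0 or k + 1 >= len(coarse):
            fine.append(coarse[k])
        else:
            fine.append(0.5 * (coarse[k] + coarse[k + 1]))
    return fine


def decompose(signal, levels):
    """Return the coarsest approximation and the detail signals, finest first."""
    details = []
    current = list(signal)
    for _ in range(levels):
        coarse = smooth_and_subsample(current)
        prediction = predict_fine(coarse, len(current))
        # Details are kept at full resolution: this is what makes the
        # transform overcomplete, unlike a critically sampled wavelet transform.
        details.append([x - p for x, p in zip(current, prediction)])
        current = coarse
    return current, details


def reconstruct(coarse, details):
    """Invert decompose() by adding each detail signal back to its prediction."""
    current = list(coarse)
    for detail in reversed(details):
        prediction = predict_fine(current, len(detail))
        current = [d + p for d, p in zip(detail, prediction)]
    return current


signal = [float((i % 8) ** 2) for i in range(64)]
coarse, details = decompose(signal, levels=3)
restored = reconstruct(coarse, details)
n_coeffs = len(coarse) + sum(len(d) for d in details)   # 8 + 64 + 32 + 16 = 120
```

With 64 input samples and three levels the representation holds 120 coefficients, which approaches 2n as the number of levels grows. Reconstruction is exact whatever smoother is used, which is why the smoothing step can be swapped for a local polynomial fit.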

3. Fast variable selection without shrinkage (Maarten Jansen)

The selection of an optimal model from a broad spectrum of non-nested models can be driven by a criterion that balances a good prediction of the training set against the complexity of the model, that is, the number of selected variables. Optimization over a number of variables, or even comparison of models with a given number of variables, is a problem of combinatorial complexity, and thus not feasible in the context of high-dimensional data. Part of the problem can be well approximated by replacing the number of selected variables in the criterion by the sum of absolute values of the estimators of these variables within the selected model. The counting measure is replaced by a sum of magnitudes, thus changing a combinatorial problem into a convex quadratic programming problem. This problem can be solved by a wide range of algorithms, including direct methods, such as least angle regression, or iterative methods, such as iterative thresholding or gradient projection. Moreover, for a fixed value of model complexity, the relaxed problem selects approximately the same model as the original combinatorial one. This is no longer the case when the model complexity is part of the optimization problem, but a correction for the divergence between the combinatorial and quadratic problem can be established. The thesis is about the application of the variable selection in sparse inverse problems, or in deblurring and denoising images, using gradient projection or iterative thresholding.
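The iterative thresholding mentioned above can be sketched as follows for the relaxed criterion 0.5·||y − Xb||² + λ·Σ|b_j|. This is a generic iterative soft-thresholding loop under assumptions made for the example: the function names, the conservative step size 1/||X||_F² and the small orthogonal test design are illustrative choices, not the method developed in the thesis.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def ista(X, y, lam, n_iter=500):
    """Iterative soft thresholding for 0.5*||y - X b||^2 + lam * sum(|b_j|)."""
    n, p = len(X), len(X[0])
    # ||X||_F^2 is an upper bound on the largest eigenvalue of X^T X,
    # so 1 / ||X||_F^2 is a safe (if conservative) gradient step.
    step = 1.0 / sum(x * x for row in X for x in row)
    beta = [0.0] * p
    for _ in range(n_iter):
        residual = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        gradient = [-sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        beta = [soft_threshold(beta[j] - step * gradient[j], step * lam)
                for j in range(p)]
    return beta


# Hypothetical orthogonal design with 8 observations and 4 variables (+1/-1 entries).
X = []
for i in range(8):
    c0, c1, c2 = (-1) ** i, (-1) ** (i // 2), (-1) ** (i // 4)
    X.append([float(c0), float(c1), float(c2), float(c0 * c1)])

# True coefficients: two large effects (3 and -2), one small (0.1), one absent.
y = [3.0 * row[0] + 0.1 * row[1] - 2.0 * row[3] for row in X]

beta = ista(X, y, lam=4.0)   # approximately [2.5, 0.0, 0.0, -1.5]
```

In this example the two large effects are selected and the small one is discarded, but the selected coefficients are pulled towards zero (3 becomes 2.5, −2 becomes −1.5). That bias is the shrinkage referred to in the title of this topic.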

 

4. Affective computing (Tom Lenaerts)

As defined on Wikipedia: "Affective computing is the study and development of systems and devices that can recognize, interpret, process, and simulate human affects. It is an interdisciplinary field spanning computer science, psychology, and cognitive science. While the origins of the field may be traced as far back as to early philosophical inquiries into emotion, the more modern branch of computer science originated with Rosalind Picard's 1995 paper on affective computing. A motivation for the research is the ability to simulate empathy. The machine should interpret the emotional state of humans and adapt its behavior to them, giving an appropriate response to those emotions."

Students interested in this field can contact Prof. Tom Lenaerts (tlenaert@ulb.ac.be) to discuss the topic they could address. Possible directions are:
1) Machine learning applications in the area of affective computing (recognition, detection, classification of affects, etc)
2) Modelling of the evolutionary origins of human emotions using game theory and agent-based modelling


References:

1) http://www.oxfordhandbooks.com/view/10.1093/oxfordhb/9780199942237.001.0001/oxfordhb-9780199942237
2) http://acii2017.org
3) http://opencv.org

 

5. Simulators for data mining of geographical mobility data (Gianluca Bontempi, Giovanni Buroni)

Mobility is an aspect of growing relevance in our daily lives. It acts as 'the economy's backbone' by supporting other sectors throughout the economic system. Many studies, e.g. the IBM smart cities study in Brussels, have shown that Brussels is lagging behind other capital cities. The objective of the MA thesis is to compare and assess existing open source traffic simulators.

The first part of the work will consist in reviewing and assessing existing open-source solutions (e.g. MATSim [1]). The second part will consist in using the simulator to simulate scenarios of the Brussels Region traffic. Note that the simulator is expected to allow the user to simulate what-if scenarios (what if a car accident takes place in the tunnel? what if roadworks start?) and to generate a continuous stream of sensed measures about the traffic. The MA student will also be supervised by a researcher involved in the ICITY MobiAId project [2] and is expected to interact with researchers of Bruxelles Mobilité (offices located in Gare du Nord).

References:

[1] http://www.matsim.org

[2] http://mlg.ulb.ac.be/node/810


6. Deep reinforcement learning (Tom Lenaerts, Axel Abels)

One of the foundations of the DQN [1] algorithm is the replay buffer
from which experiences are sampled to train the agent on. Prioritized
Experience Replay (PER) [2] significantly improves performance by
replacing the uniform sampling of standard DQN by prioritized sampling,
allowing the agent to focus its training on experiences from which it
can potentially learn most. In PER, an experience's priority is computed
based on the training error it had when it was last sampled (and
updated). Ideally, each experience's priority would always be
representative of its current error. However, because PER implements no
explicit mechanism to update the priority of experiences which have not
been sampled for a long time, their priority is unlikely to be accurate.
While the authors of PER suggest the use of a 'staleness bonus' to
ensure stale experiences are revisited, they do not evaluate it
experimentally.

For this work the student should: 1) reproduce the results of DQN with
PER, 2) evaluate the effect of stale experiences on performance, and 3)
propose and evaluate solutions to the staleness problem.
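The staleness issue can be made visible with a small proportional replay buffer that also records when each priority was last refreshed. The sketch below is illustrative only: class and method names are hypothetical, importance-sampling weights are omitted, sampling is done in linear time rather than with a sum tree, and the TD errors are random placeholders instead of errors from a trained network.

```python
import random


class PrioritizedReplayBuffer:
    """Proportional prioritized replay that also tracks how stale each priority is."""

    def __init__(self, capacity, alpha=0.6, epsilon=1e-3, seed=0):
        self.capacity = capacity
        self.alpha = alpha            # 0 = uniform sampling, 1 = fully proportional
        self.epsilon = epsilon        # keeps every priority strictly positive
        self.experiences = []
        self.priorities = []
        self.refreshed_at = []        # training step of the last priority update
        self.step = 0
        self.cursor = 0               # next slot to overwrite once the buffer is full
        self.rng = random.Random(seed)

    def add(self, experience):
        # New experiences receive the current maximum priority.
        priority = max(self.priorities, default=1.0)
        if len(self.experiences) < self.capacity:
            self.experiences.append(experience)
            self.priorities.append(priority)
            self.refreshed_at.append(self.step)
        else:
            self.experiences[self.cursor] = experience
            self.priorities[self.cursor] = priority
            self.refreshed_at[self.cursor] = self.step
        self.cursor = (self.cursor + 1) % self.capacity

    def sample(self, batch_size):
        """Draw indices with probability proportional to priority ** alpha."""
        weights = [p ** self.alpha for p in self.priorities]
        return self.rng.choices(range(len(self.experiences)),
                                weights=weights, k=batch_size)

    def update_priorities(self, indices, td_errors):
        """Refresh the priorities of the experiences that were just trained on."""
        self.step += 1
        for index, error in zip(indices, td_errors):
            self.priorities[index] = abs(error) + self.epsilon
            self.refreshed_at[index] = self.step

    def staleness(self):
        """Number of training steps since each priority was last refreshed."""
        return [self.step - t for t in self.refreshed_at]


buffer = PrioritizedReplayBuffer(capacity=100)
for i in range(100):
    buffer.add(("state", i))
for _ in range(50):
    batch = buffer.sample(batch_size=8)
    placeholder_errors = [buffer.rng.random() for _ in batch]   # not real TD errors
    buffer.update_priorities(batch, placeholder_errors)
ages = buffer.staleness()
```

The `ages` list gives one possible measurement for task 2: experiences with a large age carry a priority computed under a much older version of the network.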


References:

[1] https://deepmind.com/research/dqn/

[2] https://arxiv.org/abs/1511.05952

[3] https://arxiv.org/abs/1710.02298

 

7. Visual assessment of machine learning techniques for fraud detection (Gianluca Bontempi)

As businesses continue to evolve and migrate to the Internet and money is transacted electronically in an ever-growing cashless banking economy, accurate fraud detection remains a key concern for modern banking systems. MLG and Worldline have been closely collaborating on this topic for 5 years, using machine learning techniques to improve fraud detection in online transactions. In this context, the goal of this master thesis will be to further improve the fraud detection system, and the interpretation of alerts, by (i) designing a methodological framework for comparing the performance of different machine learning techniques on the transaction data shared by Worldline, (ii) developing an interactive tool to facilitate the comparison of algorithms, and (iii) exploring data visualisation techniques that could help to interpret the fraud detection algorithms. Through this topic, the student will have the opportunity to gain skills in machine learning, fraud detection techniques, and interactive Web development. The student will also have opportunities to collaborate with the Worldline fraud detection team.
 
References:
[1] A. Dal Pozzolo. Adaptive Machine Learning for Credit Card Fraud Detection. PhD Thesis, Université Libre de Bruxelles, Belgium. 2015. http://www.ulb.ac.be/di/map/adalpozz/pdf/Dalpozzolo2015PhD.pdf
[2] F. Carcillo, Y. Le Borgne, O. Caelen, and G. Bontempi. An assessment of streaming active learning strategies for real-life credit card fraud detection. In Proceedings of the 4th IEEE International Conference on Data Science and Advanced Analytics 2017, October 2017.

 

 

8. Chat bots playing strategic games (Tom Lenaerts)

In recent years there has been a revolution in the development of chat bots that try to act like humans or learn how people behave in chat sessions. This technology incorporates a number of AI methods such as natural language processing and machine learning. It is not difficult to imagine that at some point we will expect these bots to also perform simple tasks for us. These tasks may involve interacting with another chat bot, possibly in a situation where there is a conflict of interest. In this thesis project, you will expand an existing chat bot system (Telegram for instance) so that the bots play a strategic game in place of a user chatting with the bot. The question will be how people will use this system and what outcomes one can expect.

References:

2) telegram.org API
 

9. Developing an improved variant combination pathogenicity predictor (Tom Lenaerts, Sofia Papadimitriou)

 

Rapid advances in high-throughput sequencing technologies have contributed to uncovering the genetic basis of many human genetic diseases [1,2], notably those considered to be monogenic (i.e. controlled by a single gene). Several computational tools have been developed that predict or prioritize candidate pathogenic variants for monogenic diseases based on genetic, molecular, evolutionary and structural information [3–6], and their predictive ability has opened the path to promising preventive, diagnostic and therapeutic strategies [7]. However, the analysis of a growing number of rare human disorders highlights the difficulties in establishing their genotype-phenotype relationship, due to non-Mendelian patterns of inheritance, incomplete penetrance, phenotypic variability or locus heterogeneity [8,9]. In these situations, we often need to consider more complex genetic patterns, where mutations in multiple genes cause or modulate the development of disease (multi-locus or oligogenic diseases) [10,11]. For the clinical variant prediction tools to remain valuable for diagnostic purposes, they need to be updated for these more elaborate scenarios, as they show limitations in detecting candidate genes and variants in such complex cases.

As data on bi-locus or digenic diseases (i.e. diseases caused by variants at two genes) continuously accumulate in the Digenic Diseases Database (DIDA) [12], the Variant Combinations Pathogenicity Predictor (VarCoPP) was developed: the first clinical predictive method for the pathogenicity of bi-locus variant combinations (http://varcopp.ibsquare.be/). VarCoPP is an ensemble predictor consisting of multiple Random Forest algorithms, each trained on DIDA and a different subset of the 1000 Genomes Project [13] (1KGP), and it utilises information at the variant, gene and gene pair level. The predictor performs well during cross-validation and also when applied to new disease-causing data (True Positives, TPs), reaching an accuracy of 88%. Based on a validation with neutral sets, 95% and 99% confidence zones have been created that include instances with a 5% or 1% probability, respectively, of being False Positive (FP) results. Furthermore, VarCoPP performs as a "white-box" predictor that is able to provide justifications for its predictions, by unravelling the importance of the used features and how they contribute to either the neutral or disease-causing class.
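The ensemble design described above, where every member sees all disease-causing examples but a different subset of the much larger neutral set, can be sketched as follows. This is not VarCoPP's code: the base learner is a deliberately simple nearest-centroid classifier standing in for a Random Forest, the two-dimensional data are synthetic, and all names are hypothetical.

```python
import random


def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CentroidClassifier:
    """Stand-in base learner (NOT a Random Forest): nearest class centroid."""

    def fit(self, positives, negatives):
        self.pos_centre = centroid(positives)
        self.neg_centre = centroid(negatives)
        return self

    def predict(self, row):
        closer_to_pos = (squared_distance(row, self.pos_centre)
                         < squared_distance(row, self.neg_centre))
        return int(closer_to_pos)


def train_balanced_ensemble(positives, negatives, n_members, rng):
    """Each member sees all positives and its own equally sized negative sample."""
    members = []
    for _ in range(n_members):
        subset = rng.sample(negatives, len(positives))
        members.append(CentroidClassifier().fit(positives, subset))
    return members


def support_score(members, row):
    """Fraction of ensemble members voting for the positive class."""
    return sum(member.predict(row) for member in members) / len(members)


rng = random.Random(0)
# Synthetic, heavily imbalanced data: 30 positives against 3000 negatives.
positives = [[rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)] for _ in range(30)]
negatives = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(3000)]

ensemble = train_balanced_ensemble(positives, negatives, n_members=25, rng=rng)
score_far = support_score(ensemble, [4.0, 4.0])       # deep in the positive region
score_low = support_score(ensemble, [-2.0, -2.0])     # deep in the negative region
```

Because each member is trained on a balanced sample, no single model is dominated by the neutral class, and the fraction of agreeing members gives a graded score rather than a hard label. Alternative ways of handling the imbalance are part of subtask 1 below.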

 

Goal of the project

Although VarCoPP is a pioneering tool that provides an important step towards variant pathogenicity prediction for oligogenic diseases, its methodology and performance can be improved further. VarCoPP should generalise better and perform better when tested on new TP data. Another issue is the high FP rate, which arises because VarCoPP investigates combinations of variants and therefore always explores a search space that grows combinatorially with the number of provided variants. Furthermore, in order for the method to be even more relevant for oligogenic diseases, further biological synergistic features should be explored that can better capture the relationship between the variants and genes belonging to the same bi-locus combination.

This project aims to continue transforming VarCoPP by improving its performance and enriching its annotation. This will allow it to be more effectively used in medical practice and be incorporated in diagnostic pipelines in industry. In order to achieve this goal, two main subtasks need to be undertaken:

  1. The investigation of the potential of other machine-learning techniques that could be applied to improve its performance, as well as more efficient ways to handle the class imbalance problem between DIDA and the 1KGP. With this improvement, VarCoPP should be able to correctly predict more TPs and further limit the number of FP predictions.
  2. The exploration and selection of biological features relevant for oligogenic diseases (e.g. gene mutation rates or involvement of the two genes of a pair in the same phenotype/pathway/complex) that can be used to include more synergistic effects in VarCoPP and at the same time improve its classification performance.

References

 

1.        Biesecker, L. G. Exome sequencing makes medical genomics a reality. Nat. Genet. 42, 13–14 (2010).

2.        Ng, S. B., Nickerson, D. A., Bamshad, M. J. & Shendure, J. Massively parallel sequencing and rare disease. Hum. Mol. Genet. 19, 119–124 (2010).

3.        Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nature Methods 7, 248–249 (2010).

4.        Kumar, P., Henikoff, S. & Ng, P. C. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat. Protoc. 4, 1073–1081 (2009).

5.        Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).

6.        Sifrim, A. et al. eXtasy: variant prioritization by genomic data fusion. Nat. Methods 10, 1083–1084 (2013).

7.        Ng, S. B. et al. Exome sequencing identifies the cause of a mendelian disorder. Nat. Genet. 42, 30–35 (2010).

8.        van Heyningen, V. & Yeyati, P. L. Mechanisms of non-Mendelian inheritance in genetic disease. Hum. Mol. Genet. 13, 225–233 (2004).

9.        Badano, J. L. & Katsanis, N. Beyond Mendel: an evolving view of human genetic disease transmission. Nat. Rev. Genet. 3, 779–789 (2002).

10.      Robinson, J. F. & Katsanis, N. in Vogel and Motulsky’s Human Genetics: Problems and Approaches (ed. M.R. Speicher et al.) 243–262 (Springer-Verlag Berlin Heidelberg, 2010). doi:10.1007/978-3-540-37654-5

11.      Nussbaum, R., McInnes, R. & Willard, H. in Thompson and Thompson Genetics in Medicine 151–174 (Saunders, 2007).

12.      Gazzo, A. M. et al. DIDA: A curated and annotated digenic diseases database. Nucleic Acids Res. 44, D900–D907 (2015).

13.      Sudmant, P. H. et al. An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015).

 

 

10. Measuring learning in networks (Tom Lenaerts)

 
In many situations, both social and professional, people are interacting in networks.  When interacting they adapt their behaviour according to the situation they find themselves in.  They can either do this by analysing their knowledge about the game in a stimulus response kind of fashion (individual learning) or by learning from the behaviours of the other persons with whom they interact (social learning).  In most cases it is a mixture of both, which makes it difficult to discern whether one or the other method is used to change behaviour. In this project your goal will be to develop and/or analyse existing methods that aim to achieve that goal.
 
In the preparation part of the thesis, you will survey the literature that addresses this topic, focussing on studies in evolutionary game theory and learning in games. At the same time you will create a simple simulator to reproduce the results of [1]. These two components will prepare you for the actual thesis work in the second year, in which you will use this preparation to answer the main question, i.e. how to determine from interaction data the fraction of individual and social learning used by the individuals in the game.
 
Ref:
[1] Learning to coordinate in complex networks. S Van Segbroeck, S De Jong, A Nowé, FC Santos, T Lenaerts. Adaptive Behavior 18 (5), 416-427
[2] Voelkl, B. (2014). Social learning. an introduction to mechanisms, methods, and models.
[3] Santos, F. C., Pacheco, J. M., & Lenaerts, T. (2006). Evolutionary dynamics of social dilemmas in structured heterogeneous populations. Proceedings of the National Academy of Sciences of the United States of America, 103(9), 3490-3494.

 

11. Evaluation of biomarker resilience to noise and missing data (Matthieu Defrance)

Biomarkers (e.g. epigenetic, expression) can be used to monitor alterations that are occurring at the cellular level in a given organism. One challenging task is to identify a restricted set of markers (e.g. genes) that allow an accurate estimation of the monitored properties. The main objective of this project is to evaluate the influence of noise and missing measurements on the prediction accuracy. To that aim, next generation sequencing data (RNA-seq, RRBS) will be used to explore real case settings.
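A possible shape for such an evaluation is sketched below on synthetic data: a fixed predictor is applied to marker values that are corrupted with increasing noise or with randomly missing measurements, and the accuracy is compared with the clean case. The predictor (a threshold on the mean of the observed markers), the noise model and every name here are assumptions made for the example; real RNA-seq or RRBS data would replace the synthetic samples.

```python
import random


def predict(markers):
    """Predict class 1 when the mean of the observed markers exceeds 0.5."""
    observed = [m for m in markers if m is not None]
    if not observed:
        return 0                      # nothing measured: fall back to class 0
    return int(sum(observed) / len(observed) > 0.5)


def corrupt(markers, noise_sd, missing_rate, rng):
    """Drop each marker with probability missing_rate, else add Gaussian noise."""
    return [None if rng.random() < missing_rate else m + rng.gauss(0.0, noise_sd)
            for m in markers]


def accuracy(samples, labels, noise_sd, missing_rate, rng):
    """Fraction of samples still classified correctly after corruption."""
    hits = sum(predict(corrupt(s, noise_sd, missing_rate, rng)) == y
               for s, y in zip(samples, labels))
    return hits / len(samples)


rng = random.Random(0)
labels = [i % 2 for i in range(400)]
# Synthetic panel of 5 markers: values near 0 for class 0 and near 1 for class 1.
samples = [[y + rng.gauss(0.0, 0.1) for _ in range(5)] for y in labels]

clean = accuracy(samples, labels, noise_sd=0.0, missing_rate=0.0, rng=rng)
noisy = accuracy(samples, labels, noise_sd=3.0, missing_rate=0.0, rng=rng)
sparse = accuracy(samples, labels, noise_sd=0.0, missing_rate=0.8, rng=rng)
```

Sweeping `noise_sd` and `missing_rate` over a grid, and repeating the experiment for different marker subsets, would give resilience curves that can be compared between candidate biomarker sets.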

 

12. Analysis of non-coding RNA expression in immune cells using single cell sequencing (Matthieu Defrance)

The transcription level of coding genes has been widely used to characterize functional changes in immune cells. Non-coding transcripts such as long non-coding RNAs are known to play an important role in cell function, but their alterations in immune cells remain poorly characterised at large scale. The main objective of this project is to evaluate, using both single cell RNA-seq and bulk RNA-seq data, the contribution of non-coding RNAs to the alteration of immune cell functions and to relate those alterations to changes occurring at the coding level.
