
Master Thesis

Master thesis subjects MLG 2012-2013

For the 2012/2013 academic year, the Machine Learning Group offers about ten subjects for master's students. Application domains include high-performance computing, bioinformatics, sensor networks, artificial evolution, computer-assisted medicine, artificial proteins and network dynamics.

NB: The number of subjects is limited. Interested students are asked to come forward as early as possible.


Topics:

1. Machine learning on big data (Gianluca Bontempi, Yann-Aël Le Borgne)

2. Machine learning for motion analysis in the treatment of spasticity (Yann-Aël Le Borgne, Gianluca Bontempi)

3. Computational Biology: Investigations into protein structure and function (Tom Lenaerts, Elisa Cilia)

4. Computational Biology: Evolutionary Dynamics of advanced strategies (Tom Lenaerts)

5. Computational Biology: Stochastic dynamics of chronic myeloid leukemia (Tom Lenaerts)

6. Computational Biology: The environment's impact on cooperativity between microorganisms (Tom Lenaerts)

7. Security analysis of physical implementations of SHA3 competition finalists by means of machine learning (thesis supervised by QualSec and MLG (Machine Learning Group))

8. Trading challenge (Gianluca Bontempi, Andrea Dal Pozzolo)

9. Machine Learning : Coalition-based Naive Bayesian Classification (Tom Lenaerts, Elisa Cilia)

10. Multiscale kernel smoothing for use in image processing (Maarten Jansen)

11. Fast variable selection without shrinkage (Maarten Jansen)

12. Online recognition of human activities (Manuel Pegalajar Cuéliar, Yann-Aël Le Borgne, Gianluca Bontempi)

 

1. Machine learning on big data (Gianluca Bontempi, Yann-Aël Le Borgne)

The collection of gigantic datasets in several domains (e.g. social networks, finance, internet) and the need to extract useful information from them call for the development of new and effective techniques to store and mine very large data structures. The Master thesis will focus on methods to scale up and parallelize machine learning algorithms in order to deal effectively with very large and distributed databases (e.g. Hadoop). The objective of the thesis is to design and set up a running distributed system (based on existing open-source solutions) to store and analyze huge datasets.
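As a rough illustration of what "making an algorithm parallel" can mean, the sketch below computes a mean and a variance in a map/reduce style: each data partition is summarised independently, and the summaries are merged with an associative operation. It is plain Python, not Hadoop code; the function names and the toy partitions are assumptions made for this example only.

```python
from functools import reduce


def map_partition(rows):
    """Map step: summarise one partition as (count, sum, sum of squares)."""
    return (len(rows), sum(rows), sum(x * x for x in rows))


def combine(a, b):
    """Reduce step: merge two partial summaries (associative, commutative)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(partitions):
    """Combine per-partition summaries into a global mean and variance."""
    n, s, ss = reduce(combine, map(map_partition, partitions))
    mean = s / n
    variance = ss / n - mean * mean  # population variance
    return mean, variance


# Hypothetical data, split into three partitions as if stored on three nodes.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, variance = mean_and_variance(partitions)
print(mean, variance)
```

Because the merge step is associative, the partial summaries can be computed on different machines and combined in any order, which is the property that distributed frameworks rely on.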

Required skills: machine learning, computational statistics, programming.

Useful links:

 

2. Machine learning for motion analysis in the treatment of spasticity (Yann-Aël Le Borgne, Gianluca Bontempi)

The treatment of muscle spasticity can nowadays take advantage of a large amount of clinical data. This is made possible by the development of new sensor technologies (e.g. cameras, magnetic sensors, wearable inertial sensors like gyroscopes and accelerometers) and their integration in daily life monitoring systems. This opens the way to the development of data-driven approaches to the modelling and detection of human movement with the purpose of obtaining better diagnosis of patients (e.g. affected by Parkinson's disease), improving the medication process and recognizing movement patterns (e.g. in biometrics). The Master thesis will focus on automated approaches based on statistical machine learning and data mining to emulate the innate human capability of recognizing, disambiguating and classifying types of movement, and to support clinicians in diagnosis and decision making.

The thesis will be carried out in the context of the ICT4REHAB project funded by the Brussels Region.

Required skills: statistical analysis, numerical computing, machine learning, passion for interdisciplinary research

3. Computational Biology: Investigations into protein structure and function (Tom Lenaerts, Elisa Cilia)

One of the main research tracks in the group is linked to questions related to the structure and function of proteins. Machine learning methods can assist in answering these questions. This is a short list of topics which we want to investigate:

1. Mining relevant features that drive protein function : Understanding how proteins behave is a complex task. Mutants of the same protein provide an invaluable source of data for the application of statistical learning techniques. Mutation data can be analyzed to find patterns of amino acids that are most relevant for the manifestation of a certain protein behavior. This thesis proposal aims at learning relevant features and rules to explain a specific protein behavior. The techniques used for attaining this goal will draw on game theory (used for feature selection) and/or inductive logic programming.

2. Preference handling as an approach to analyze and understand protein binding preferences: Proteins, and especially their domains, have a finely tuned preference for particular peptides. One of the main interests of bioinformaticians is to provide a description of these preferences so that potential binding partners can be searched for in the database of known proteins. This project proposes to investigate the relevance of preference handling methods to solve this problem.

Interested? Contact Tom Lenaerts or Elisa Cilia for more details.

Required skills: machine learning, programming skills and passion for interdisciplinary research.
Course prerequisite: INFO-F-208 (Introduction à la Bioinformatique) or some equivalent course.
 

4. Computational Biology : Evolutionary Dynamics of Advanced Strategies (Tom Lenaerts and Matteo Gagliolo)

1. Multi-agent learning is a field of increasing importance, both for its direct technological impact on our society, and for its importance in modeling real complex phenomena in economics and social sciences. The social aspect of learning is often neglected in the design and analysis of adaptive multi-agent systems. We propose to study the synergy of social and individual learning in the light of recent advances in evolutionary anthropology (cultural evolution in particular), evolutionary game theory, and learning theory.

 
To isolate the effect of this synergy, we propose the study of a simple one-person game (e.g., a bandit problem), in which multiple learning agents can learn both individually and socially (e.g., by observing the actions played, and rewards obtained, by other players). Based on current literature, the candidate should devise and implement different social learning mechanisms, in order to gain insights into the added value of social learning. Since real-world social interactions typically occur along the edges of a network, a possible extension of this work would be to study the impact of network structure on social learning.
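A minimal sketch of such a setting is given below, assuming a Bernoulli bandit, epsilon-greedy agents, and one very simple social mechanism in which every agent observes the arm and reward of every other agent. These choices, and all names and parameter values, are illustrative assumptions and not the mechanisms the thesis would necessarily study.

```python
import random


def simulate(n_agents, n_steps, arm_probs, social, epsilon=0.1, seed=0):
    """Run epsilon-greedy agents on a shared Bernoulli bandit.

    If `social` is True, each agent also updates its estimates from the
    actions and rewards of all other agents; otherwise only from its own.
    Returns the average reward per pull and each agent's observation counts.
    """
    rng = random.Random(seed)
    n_arms = len(arm_probs)
    counts = [[0] * n_arms for _ in range(n_agents)]
    means = [[0.0] * n_arms for _ in range(n_agents)]
    total_reward = 0.0
    for _ in range(n_steps):
        observations = []
        for agent in range(n_agents):
            if rng.random() < epsilon:  # explore
                arm = rng.randrange(n_arms)
            else:  # exploit the current best estimate
                arm = max(range(n_arms), key=lambda a: means[agent][a])
            reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
            total_reward += reward
            observations.append((agent, arm, reward))
        for agent in range(n_agents):
            for source, arm, reward in observations:
                if social or source == agent:
                    counts[agent][arm] += 1
                    # Incremental update of the running mean reward.
                    means[agent][arm] += (reward - means[agent][arm]) / counts[agent][arm]
    return total_reward / (n_agents * n_steps), counts


avg_social, _ = simulate(5, 200, [0.2, 0.8], social=True)
avg_individual, _ = simulate(5, 200, [0.2, 0.8], social=False)
print(avg_social, avg_individual)
```

Restricting the inner loop to observations from an agent's neighbours would be one way to introduce the network structure mentioned above.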

Interested? Contact Tom Lenaerts or Matteo Gagliolo for more details.

Required skills: Modeling, dynamical systems, programming skills and passion for interdisciplinary research
Course prerequisite: INFO-F-409 (Learning Dynamics) or some equivalent course.
 

5. Computational Biology: Stochastic dynamics of chronic myeloid leukemia  (Tom Lenaerts)

The aim of this project is to investigate, through a model of the hematopoietic system and chronic myeloid leukemia (CML), the emergence and dynamics of therapy-resistant clones, and the relation between patient treatment response, survival and the diagnostic risk groups. Patients diagnosed with early-phase CML may relapse during treatment due to the appearance of cancer cells resistant to first-line treatment compounds like Imatinib. Understanding how treatment affects the dynamics of these resistant cells is therefore important, and the resulting insights will aid medical practitioners in setting up treatment protocols for individual patients. In addition, each patient responds differently to Imatinib. Using our model and available serial Q-RT-PCR patient data, we can determine the severity of the disease and the quality of the initial treatment response. Together, these will reclassify patients with respect to their survival chances. Additionally, this will shed light on the correlation between the risk groups identified at diagnosis and treatment response, which is not yet clear.

Interested? Contact Tom Lenaerts for more details.

Required skills: Programming skills and passion for interdisciplinary research

Course prerequisite: INFO-F-305 (Modélisation et Simulation) or some equivalent course.

 

6. Computational Biology: The environment's impact on cooperativity between microorganisms  (Tom Lenaerts)

This master thesis work is performed in collaboration with the Bioinformatics and (eco-) systems biology (Raes Lab) at the VUB (VIB).

Microorganisms interact in multiple ways, for instance by competing for resources or by cooperating in cross-feeding chains. A case of particular interest is the release of extracellular enzymes to degrade polymers such as cellulose or starch. The degradation products benefit all neighboring bacteria able to transport and exploit them, whereas only the enzyme-producing bacteria pay a cost. Since this system is highly vulnerable to cheaters, it is interesting to ask, on the one hand, whether evolutionary models can explain the emergence of this cooperative behavior and, on the other hand, why extracellular polymer-degrading enzymes exist at all (when degradation could be carried out alternatively within the cell, where cheaters cannot steal the products of costly enzymes). 
 
1. Models explaining the evolution of resource sharing between competing microorganisms have been proposed to explain, for instance, the coexistence between cooperators (those that produce the enzyme) and cheats (those that do not produce the enzyme but profit from those that do). It has been argued that Evolutionary Game Theoretical (EGT) models cannot explain such coexistence. Yet no attention has been given to EGT studies on common-resource games, which may provide the techniques and results needed to understand the experimental work on resource sharing between microorganisms. In addition, models of cross-feeding, which focus on the idea that the metabolic products of one type of microorganism can serve as a nutrient for other microorganisms, also ignore these common-resource games, even though they fit nicely with the idea of nutrient production and sharing. In this proposal the idea is to explore whether EGT common-resource models can in fact provide meaningful insights into both the problem of resource sharing and cross-feeding between microorganisms.
 
To answer this first question the aim is to first explore the existing literature on both subjects to identify in detail what has been done and which answers have been provided.  Afterwards, EGT models will be constructed to explore the evolutionary relevance of certain strategies, as for instance the choice to produce enzymes to degrade some polymers that benefit also those microorganisms that do not produce the enzyme.  
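As a baseline for the kind of model meant here, the following sketch integrates the replicator equation for a linear public goods game in a well-mixed population, where x is the fraction of enzyme producers. The game, the parameter names (group size, multiplication factor r, production cost) and the Euler integration are assumptions chosen for illustration; the models built in the thesis may differ.

```python
def payoffs(x, group_size, r, cost):
    """Expected payoffs of a producer and a cheat when a fraction x produces.

    Each producer pays `cost`; the total contribution of a group is
    multiplied by r and shared equally among all group members.
    """
    others = (group_size - 1) * x  # expected producers among co-players
    p_cheat = r * cost * others / group_size
    p_producer = r * cost * (others + 1) / group_size - cost
    return p_producer, p_cheat


def replicator(x0, group_size, r, cost, dt=0.01, steps=5000):
    """Euler integration of dx/dt = x (1 - x) (P_producer - P_cheat)."""
    x = x0
    for _ in range(steps):
        p_producer, p_cheat = payoffs(x, group_size, r, cost)
        x += dt * x * (1.0 - x) * (p_producer - p_cheat)
        x = min(1.0, max(0.0, x))  # keep the fraction in [0, 1]
    return x


# With r < group size, producers decline; with r > group size, they take over.
print(replicator(0.5, group_size=5, r=3.0, cost=1.0))
print(replicator(0.5, group_size=5, r=6.0, cost=1.0))
```

In this simple setting the payoff difference is cost * (r / group_size - 1), independent of x, so one type always takes over and no coexistence arises. Explaining coexistence therefore requires richer models, such as the common-resource games mentioned above.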
 
2. One suggestion for the second question is that surrounding benefiting bacteria are closely related to the enzyme producers. An alternative explanation is that cells benefiting from the degradation products excrete compounds needed by the enzyme producers, thus forming a mutualistic relationship. However, exchanging compounds and maintaining high relatedness in neighboring bacteria is easier in static and solid environments than in well-mixed liquid environments.
So the question is whether "cooperative" enzymes (e.g. extracellular polymer-degrading enzymes for which intracellular alternatives exist) are more likely to occur in static, solid environments than in others. As a start, taxonomic groups occurring in both soil and the mammalian gut could be compared to see whether there is a difference in the abundance of "cooperative" enzymes.
 
To carry out this second task, one would first need to compile a list of extracellular versus intracellular polymer-degrading enzymes for selected taxonomic groups and next analyze soil and gut metagenomic data sets to assess the abundance of genes coding for these enzymes. This study could also be extended to check whether cooperativity is more likely to occur in taxonomic groups producing adhesion-enhancing proteins (so neighbors are more likely to be highly related) than in related groups that do not produce these proteins.
 

Interested? Contact Tom Lenaerts for more details.

Required skills: Simulation, machine learning, programming skills and passion for interdisciplinary research

Course prerequisite: INFO-F-305 (Modélisation et Simulation) and/or INFO-F-208 (Introduction à la Bioinformatique) or some equivalent courses.

 

7. Security analysis of physical implementations of SHA3 competition finalists by means of machine learning (thesis supervised by QualSec and MLG (Machine Learning Group))



The NIST hash function competition (SHA3) is a worldwide competition (due to conclude during this year, 2012) whose goal is to select a new cryptographic hash function as a standard for the United States. Such a hash function will make it possible, among other things, to ensure the integrity and authenticity of transmitted or stored information through the use of a secret key (i.e. a Message Authentication Code, MAC). Such a MAC must resist known cryptanalytic attacks. One effective class of attacks is side-channel attacks on physical implementations. These focus on the behaviour of the physical devices (energy consumption or computation time) to check whether secret information can be deduced from it. However, these known and already implemented techniques appear open to improvement by combining them with machine learning techniques. The work will focus on this novel combination with the aim of testing SHA3 finalists.

Contact: Olivier Markowitch and Gianluca Bontempi

8. Trading challenge (Gianluca Bontempi, Andrea Dal Pozzolo)

Directa Sim, an Italian online trading broker, is organizing a trading challenge with real money for European master students.

The goal of the project is to use machine learning/statistical techniques to build a trading model. The people interested in participating are supposed to form a group and choose a leader who will manage the trading operation.

For more information: http://www.universiadideltrading.it/index_fr.html or contact Andrea Dal Pozzolo at adalpozz@ulb.ac.be

Required skills: Statistical analysis, machine learning.

9. Machine Learning: Coalition-based Naive Bayesian Classification (Tom Lenaerts, Elisa Cilia)

"Classification is a basic task in data analysis and pattern recognition that requires the construction of a classifier, that is, a function that assigns a class label to instances described by a set of attributes. The induction of classifiers from data sets of preclassified instances is a central problem in machine learning. Numerous approaches to this problem are based on various functional representations such as decision trees, decision lists, neural networks, decision graphs, and rules.

One of the most effective classifiers, in the sense that its predictive performance is competitive with state-of-the-art classifiers, is the so-called naive Bayesian classifier described, for example, by Duda and Hart (1973) and by Langley et al. (1992). This classifier learns from training data the conditional probability of each attribute Ai given the class label C. Classification is then done by applying Bayes rule to compute the probability of C given the particular instance of A1 , . . . , An , and then predicting the class with the highest posterior probability. This computation is rendered feasible by making a strong independence assumption: all the attributes Ai are conditionally independent given the value of the class C. [...]
 
The performance of naive Bayes is somewhat surprising, since the above assumption is clearly unrealistic. Consider, for example, a classifier for assessing the risk in loan applications: it seems counterintuitive to ignore the correlations between age, education level, and income. This example raises the following question: can we improve the performance of naive Bayesian classifiers by avoiding unwarranted (by the data) assumptions about independence? [...]" (extracted from Friedman, N., Geiger, D., & Goldszmidt, M. (1997). Bayesian Network Classifiers. Machine Learning, 29, 131–163.)
 
This question posed by Friedman, Geiger and Goldszmidt led to the development of an extension of the Naive Bayesian Classifier that incorporates information on the dependencies between the variables, creating the so-called Tree Augmented Naive Bayes (TAN) Classifier. In this proposal, we have two aims. First, we want to examine the performance of TAN classifiers in comparison to Naive Bayes, Chow-Liu trees and AODE (which is another extension of TAN) on particular biological data. Second, we want to examine how Game Theory may assist in achieving the same goal.
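The naive Bayesian classifier described in the quoted passage can be sketched in a few lines for categorical attributes. The version below estimates the conditional probability of each attribute value given the class, then predicts the class with the highest posterior. The Laplace smoothing, the use of log-probabilities, and the toy loan data are assumptions added for this illustration; it does not implement TAN or AODE.

```python
import math
from collections import Counter, defaultdict


def train(instances, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for attrs, label in zip(instances, labels):
        for i, value in enumerate(attrs):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen


def predict(model, attrs):
    """Return the class maximising log P(C) + sum_i log P(A_i | C)."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_c in class_counts.items():
        score = math.log(n_c / total)
        for i, value in enumerate(attrs):
            # Laplace smoothing avoids zero probabilities for unseen pairs.
            numerator = value_counts[(i, label)][value] + 1
            denominator = n_c + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical loan applications described by (age, income).
instances = [("young", "low"), ("young", "low"), ("old", "high"),
             ("old", "high"), ("old", "low")]
labels = ["risky", "risky", "safe", "safe", "safe"]
model = train(instances, labels)
print(predict(model, ("young", "low")), predict(model, ("old", "high")))
```

The product over attributes in `predict` is exactly where the independence assumption enters: each attribute contributes its own factor given the class, with no term linking, say, age and income.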
 
Interested? Contact Tom Lenaerts
 
Required Skills : Machine Learning, programming skills and interest in interdisciplinary research.

Course prerequisite: INFO-F-422 (Statistical Methods of Machine Learning)  or some equivalent courses.

 

10. Multiscale kernel smoothing for use in image processing (Maarten Jansen)

Multiscale or multiresolution analysis is a technique for the analysis and processing of data in a telescopic way. That means that the data is decomposed into a representation that separates global, large-scale features from small-scale details, with a broad spectrum in between. In that sense, multiscale analysis is related to a frequency (Fourier) analysis (with slowly and rapidly oscillating components), but, unlike a Fourier transform, a multiscale analysis keeps information on the location in the original time or space domain.

The best-known example of a multiscale analysis is a wavelet decomposition. Wavelets are particularly popular in image processing, for instance in the JPEG 2000 compression standard. This thesis investigates the use of another algorithm for multiresolution analysis, known as a Laplacian pyramid. This Laplacian pyramid is a slightly overcomplete transform, meaning that it maps n data onto 2n coefficients in the multiscale representation. It can be implemented as an overcomplete version of a lifting scheme, which is a fast implementation of the wavelet transform.

In this thesis, the Laplacian pyramid is equipped with a local polynomial smoothing technique, popular in statistics. The objective is to investigate the properties of a Laplacian pyramid with local polynomial smoothing in image processing applications (denoising, compression).
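To make the structure of the transform concrete, here is a minimal one-dimensional Laplacian pyramid. It assumes a fixed [1/4, 1/2, 1/4] low-pass filter and linear interpolation as the prediction step, standing in for the local polynomial smoothing studied in the thesis; all function names are hypothetical.

```python
def smooth(signal):
    """Low-pass filter with weights [1/4, 1/2, 1/4], repeating edge values."""
    n = len(signal)
    return [0.25 * signal[max(i - 1, 0)] + 0.5 * signal[i]
            + 0.25 * signal[min(i + 1, n - 1)] for i in range(n)]


def upsample(coarse, n):
    """Predict n fine-scale samples from the coarse ones by linear interpolation."""
    out = []
    for i in range(n):
        k, odd = divmod(i, 2)
        if odd and k + 1 < len(coarse):
            out.append(0.5 * (coarse[k] + coarse[k + 1]))
        else:
            out.append(coarse[k])
    return out


def decompose(signal, levels):
    """Return the coarsest approximation and the detail signal of each level."""
    details = []
    current = list(signal)
    for _ in range(levels):
        coarse = smooth(current)[::2]  # smooth, then keep every second sample
        prediction = upsample(coarse, len(current))
        details.append([x - p for x, p in zip(current, prediction)])
        current = coarse
    return current, details


def reconstruct(coarse, details):
    """Invert the pyramid: predict from the coarse level and add the details."""
    current = coarse
    for detail in reversed(details):
        prediction = upsample(current, len(detail))
        current = [p + d for p, d in zip(prediction, detail)]
    return current


signal = [float(i * i) for i in range(8)]
coarse, details = decompose(signal, levels=2)
print(len(coarse) + sum(len(d) for d in details))  # coefficients stored for n = 8
print(reconstruct(coarse, details))
```

Each level stores a detail value for every sample rather than for every second one, which is the source of the overcompleteness. Reconstruction is exact up to floating-point rounding whatever smoothing or prediction operator is used, so swapping in a local polynomial smoother would not affect invertibility.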

11. Fast variable selection without shrinkage (Maarten Jansen)

The selection of an optimal model from a broad spectrum of non-nested models can be driven by a criterion that balances a good prediction of the training set against the complexity of the model, that is, the number of selected variables. Optimization over a number of variables, or even comparison of models with a given number of variables, is a problem of combinatorial complexity, and thus not feasible in the context of high-dimensional data. Part of the problem can be well approximated by replacing the number of selected variables in the criterion by the sum of absolute values of the estimators of these variables within the selected model. The counting measure is replaced by a sum of magnitudes, thus changing a combinatorial problem into a convex quadratic programming problem. This problem can be solved by a wide range of algorithms, including direct methods, such as least angle regression, or iterative methods, such as iterative thresholding or gradient projection. Moreover, for a fixed value of model complexity, the relaxed problem selects approximately the same model as the original combinatorial one. This is no longer the case when the model complexity is part of the optimization problem, but a correction for the divergence between the combinatorial and quadratic problem can be established. The thesis is about the application of this variable selection in sparse inverse problems, or in deblurring and denoising images, using gradient projection or iterative thresholding.
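The relaxed problem described above, a least-squares fit penalised by the sum of absolute values, can be solved by iterative soft thresholding. The sketch below is a bare-bones version for small dense problems, assuming the objective 0.5 * ||y - X b||^2 + lam * ||b||_1 and a conservative step size derived from the Frobenius norm of X. The names and the toy data are illustrative, and the correction for shrinkage discussed in the topic is not included.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t, setting it to zero when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def ista(X, y, lam, n_iter=500):
    """Iterative soft thresholding for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    # Step 1/L, where the squared Frobenius norm bounds the largest
    # eigenvalue of X^T X from above (a safe but conservative choice).
    step = 1.0 / sum(v * v for row in X for v in row)
    beta = [0.0] * p
    for _ in range(n_iter):
        residual = [sum(X[i][j] * beta[j] for j in range(p)) - y[i]
                    for i in range(n)]
        grad = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        beta = [soft_threshold(beta[j] - step * grad[j], step * lam)
                for j in range(p)]
    return beta


# Toy problem with an identity design, where the solution is known in closed
# form: each coefficient is the soft-thresholded observation.
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 0.5, -2.0]
print(ista(X, y, lam=1.0))
```

On this toy problem the small coefficient is set exactly to zero, which is the variable selection effect, while the two retained coefficients are pulled toward zero by lam (3.0 becomes 2.0 and -2.0 becomes -1.0), which is the shrinkage effect.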

12. Online recognition of human activities (Manuel Pegalajar Cuéliar, Yann-Aël Le Borgne, Gianluca Bontempi)

The study and modelling of human behaviour is a key aspect in the development of Ambient Intelligence (AmI) systems, since their intrinsic goal is user assistance. Nowadays, advances in technology, especially in sensor networks and the miniaturization of electronic devices, allow us to monitor user activity at any time and place with the aim of improving quality of life. However, the real-time recognition of these activities becomes a challenge when the sensor signals provide multivariate data. We need fast and efficient methods not only to learn a behaviour from the training data, but also to recognize the learned activities online. The aim of this thesis is the design and development of machine learning methods to solve this problem, applied to Human-Computer Interaction.

Required skills: Machine learning, statistical analysis, programming skills, passion for interdisciplinary research


 
