Learning from imbalanced datasets is an important issue in many practical classification tasks where the class of interest (e.g. fraud, anomaly) occurs at a much lower rate than the other (e.g. normal behaviour). This MA thesis will focus on the adoption of Generative Deep Learning to deal with class imbalance in business data (fraud detection, churn detection).
The student should be an expert in Python programming, interested in Deep Learning and in the statistical aspects of Machine Learning, and registered at the Computational Intelligence module of the Master. The student is expected to interact with MLG researchers working on applied machine learning projects in collaboration with companies.
Forecasting time series is one of the most challenging tasks in data science. A well-known benchmark for comparing forecasting strategies is provided by the M competitions, which recently provided up to 100K historical time series. The goal of the thesis will be to develop machine learning strategies for forecasting and to take part in the forthcoming and highly challenging M5 competition. An important part of the preparatory work will be devoted to studying the successful approach proposed by Uber in the M4 competition.
The student should be an expert in Python programming, interested in Data Science and in the statistical aspects of Machine Learning, and registered at the Computational Intelligence module of the Master. The student is expected to take part in the M5 competition, which is expected to begin in February 2020. Given the highly competitive nature of the task, only highly motivated and ready-to-start students will be taken into consideration.
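As a rough illustration of what comparing forecasting strategies involves, the sketch below scores a seasonal naive baseline with the symmetric mean absolute percentage error (sMAPE), one of the accuracy measures used in the M4 competition. The function names and the toy series are hypothetical and chosen only for this example; the proposal itself does not prescribe a baseline or a metric.

```python
def seasonal_naive_forecast(history, horizon, season_length):
    """Forecast by repeating the last observed season of the series."""
    if len(history) < season_length:
        raise ValueError("history must cover at least one full season")
    last_season = history[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]


def smape(actual, forecast):
    """Symmetric MAPE in percent: mean of 200 * |a - f| / (|a| + |f|)."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equally long")
    terms = []
    for a, f in zip(actual, forecast):
        denominator = abs(a) + abs(f)
        # Convention assumed here: a term with a zero denominator counts as 0.
        terms.append(0.0 if denominator == 0 else 200.0 * abs(a - f) / denominator)
    return sum(terms) / len(terms)


# Synthetic quarterly series (illustrative values only, not competition data).
train = [10.0, 20.0, 30.0, 40.0, 12.0, 22.0, 32.0, 42.0]
test = [13.0, 24.0, 33.0, 45.0]

baseline = seasonal_naive_forecast(train, horizon=len(test), season_length=4)
print("seasonal naive forecast:", baseline)
print("sMAPE (%):", round(smape(test, baseline), 2))
```

Any candidate machine learning strategy can be scored with the same `smape` function on the same hold-out window, which gives a simple reference point before moving to more elaborate models.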
This MA thesis will take place in the context of a collaboration between MLG and the Laboratory of Neurophysiology and Movement Biomechanics (LNMB). An electroencephalogram (EEG) uses multiple electrodes to measure the electrical activity produced by the post-synaptic potentials of cortical neurons located in specific parts of the brain. LNMB is composed of several researchers who have developed solid expertise in EEG signal acquisition and analysis. Over the years they have acquired a large amount of EEG data from different domains (NASA astronauts in the ISS, hockey players from the Belgian national hockey team, tennis players from the Justine Henin Academy, children and adults with hyperactivity disorder…) and for various applications (brain-computer interfaces, human performance enhancement, diagnostic tools…).
The objective of the MA thesis is to work with cutting-edge technology and use state-of-the-art signal processing and Machine Learning techniques on EEG data.
The work will focus on (i) exploring different EEG datasets, (ii) extracting relevant features of the brain state that may not be directly visible with standard EEG analysis, and (iii) deploying different classification models to match or improve on state-of-the-art results.
The student should be an expert in Python programming, be registered at the MA module on computational intelligence, have a passion for interdisciplinary research and be available to visit the Erasme lab frequently.
– MNE: A. Gramfort, M. Luessi, E. Larson, D. Engemann, D. Strohmeier, C. Brodbeck, L. Parkkonen, M. Hämäläinen, MNE software for processing MEG and EEG data, NeuroImage, Volume 86, 1 February 2014, Pages 446-460, ISSN 1053-8119
– EEGLAB: A. Delorme, S. Makeig, EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics, Journal of Neuroscience Methods, Volume 134, 2004, Pages 9-21
PS. For interested students, the LNMB also offers the possibility of an internship compatible with the TRAN-F-501 course available in the MA Computer Science programme.
The selection of an optimal model from a broad spectrum of non-nested models can be driven by a criterion that balances a good prediction of the training set against the complexity of the model, that is, the number of selected variables. Optimization over the number of variables, or even comparison of models with a given number of variables, is a problem of combinatorial complexity, and thus not feasible in the context of high-dimensional data. Part of the problem can be well approximated by replacing the number of selected variables in the criterion with the sum of the absolute values of the estimators of these variables within the selected model. The counting measure is thus replaced by a sum of magnitudes, which turns a combinatorial problem into a convex quadratic programming problem. This problem can be solved by a wide range of algorithms, including direct methods, such as least angle regression, or iterative methods, such as iterative thresholding or gradient projection. Moreover, for a fixed value of model complexity, the relaxed problem selects approximately the same model as the original combinatorial one. This is no longer the case when the model complexity is part of the optimization problem, but a correction for the divergence between the combinatorial and the quadratic problem can be established. The thesis concerns the application of variable selection to sparse inverse problems, or to image deblurring and denoising, using gradient projection or iterative thresholding.
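The relaxed problem described above can be illustrated with a minimal sketch of iterative soft-thresholding (often called ISTA), which minimises 0.5 * ||y - X b||^2 + lam * ||b||_1. This is only one of the iterative methods mentioned, written under stated assumptions: the function names, the step-size rule (a conservative bound based on the squared Frobenius norm of X) and the toy data are choices made for this example, and plain Python lists are used to keep it self-contained, whereas an array library would normally be preferred.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; values within the band map to 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def matvec(matrix, vector):
    """Compute X v for a matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def rmatvec(matrix, vector):
    """Compute X^T v for a matrix stored as a list of rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [sum(matrix[i][j] * vector[i] for i in range(n_rows)) for j in range(n_cols)]


def lipschitz_bound(matrix):
    """Squared Frobenius norm of X, an upper bound on the top eigenvalue of X^T X."""
    return sum(a * a for row in matrix for a in row)


def ista(X, y, lam, n_iter=500):
    """Iterative soft-thresholding for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    step = 1.0 / lipschitz_bound(X)  # conservative step size (assumes X is non-zero)
    b = [0.0] * len(X[0])
    for _ in range(n_iter):
        residual = [pred - obs for pred, obs in zip(matvec(X, b), y)]
        gradient = rmatvec(X, residual)  # gradient of the quadratic data-fit term
        # Gradient step on the data fit, then shrinkage for the l1 penalty.
        b = [soft_threshold(bj - step * gj, step * lam) for bj, gj in zip(b, gradient)]
    return b


# Toy denoising case: X is the identity, so the minimiser is simply the
# soft-thresholded observation and small entries are set exactly to zero.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
observed = [3.0, -0.5, 0.2, -2.0]
coefficients = ista(identity, observed, lam=1.0)
selected = [j for j, c in enumerate(coefficients) if c != 0.0]
print("coefficients:", coefficients)
print("selected variables:", selected)
```

The shrinkage step is what performs variable selection: coefficients whose update stays within the threshold are set exactly to zero, so the support of the solution plays the role of the selected model. Larger values of `lam` select fewer variables.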
Feature selection is a crucial step in any machine learning pipeline. However, most feature selection methods do not attempt to uncover causal relationships between the features and the target, and focus instead on making the best predictions. The MA thesis will focus on:
⋅ A review and comparative assessment of existing causal feature selection algorithms, including the methods developed at MLG
⋅ The design of a validation strategy of those techniques on real datasets (e.g. ChaLearn competition datasets, other datasets)
The student should be interested in statistical aspects of Machine Learning and registered at the Computational Intelligence module of the Master.
Biomarkers (e.g. epigenetic, expression) can be used to monitor alterations occurring at the cellular level in a given organism. One challenging task is to identify a restricted set of markers (e.g. genes) that allows an accurate estimation of the monitored properties. The main objective of this project is to evaluate the influence of noise and missing measurements on the prediction accuracy. To that end, next-generation sequencing data (RNA-seq, RRBS) will be used to explore real-case settings.
The transcription level of coding genes has been widely used to characterize functional changes in immune cells. Non-coding transcripts such as long non-coding RNAs are known to play an important role in cell function, but their large-scale alterations in immune cells remain poorly understood. The main objective of this project is to evaluate, using both single-cell RNA-seq and bulk RNA-seq data, the contribution of non-coding RNAs to the alteration of immune cell functions and to relate those alterations to changes occurring at the coding level.