2024
@phdthesis{nachtegael2024active,
title = {Active learning for biomedical relation extraction, the oligogenic use case},
author = {Nachtegael, Charlotte},
url = {https://difusion.ulb.ac.be/vufind/Record/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/375304/Holdings},
year = {2024},
date = {2024-06-28},
abstract = {In a context where technological advancements have enabled increased availability of genetic data through high-throughput sequencing technologies, the complexity of genetic diseases has become increasingly apparent. Oligogenic diseases, characterised by a combination of genetic variants in two or more genes, have emerged as a crucial research area, challenging the traditional model of "one genotype, one phenotype". Understanding the underlying mechanisms and genetic interactions of oligogenic diseases has thus become a major priority in biomedical research, underlining the importance of developing dedicated tools to study these complex diseases. Our first major contribution, OLIDA, is an innovative database designed to collect data on variant combinations responsible for these diseases, filling significant gaps in current knowledge, which until now has focused on digenic diseases. This resource, accessible via a web platform, adheres to FAIR principles and represents a significant advance over its predecessor, DIDA, in terms of data curation and quality assessment. Furthermore, to support the biocuration of oligogenic diseases, we used active learning to construct DUVEL, a biomedical corpus focused on digenic variant combinations. To achieve this, we first investigated how to optimise these methods across numerous biomedical relation extraction datasets and developed a web-based platform, ALAMBIC, for text annotation using active learning. Our results and the quality of the corpus obtained demonstrate the effectiveness of active learning methods in biomedical relation annotation tasks. By establishing a curation pipeline for oligogenic diseases, as well as standards for integrating active learning methods into biocuration, our work represents a significant advance in the field of biomedical natural language processing and the understanding of oligogenic diseases.
},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
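The active-learning annotation loop described in the abstract can be illustrated with a minimal pool-based uncertainty-sampling sketch. This is a generic illustration assuming scikit-learn and synthetic data, not the thesis's actual ALAMBIC or DUVEL pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Minimal pool-based active learning with uncertainty sampling.
# Generic sketch: data and model choices are illustrative assumptions.
X, y = make_classification(n_samples=500, random_state=0)

# Small seed set with both classes represented; the rest is the unlabeled pool.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(20):  # annotation rounds
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])[:, 1]
    # Query the pool instance the model is least certain about (p closest to 0.5).
    query = pool[int(np.argmin(np.abs(proba - 0.5)))]
    labeled.append(query)  # "annotate" it (labels are known in this toy setting)
    pool.remove(query)

print(len(labeled), len(pool))  # 30 labeled examples, 470 left in the pool
```

Each round sends the single most informative example to the annotator; the abstract's finding is that such targeted querying produces high-quality corpora with far fewer labels than exhaustive annotation.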
@phdthesis{versbraegen2024discovering,
title = {Discovering multivariant pathogenic patterns among patients with rare diseases},
author = {Versbraegen, Nassim},
url = {https://difusion.ulb.ac.be/vufind/Record/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/375378/Holdings},
year = {2024},
date = {2024-06-24},
abstract = {Increasing evidence points to the complex interplay of multiple genetic variants as a major contributing factor in many human diseases. Oligogenic diseases, in which a small set of genes collaborate to cause a pathology, present a compelling example of this phenomenon and necessitate a shift away from traditional single-gene inheritance models. Our work aimed to develop robust methods for pinpointing pathogenic combinations of genetic variants across patient cohorts, ultimately improving disease understanding and potentially guiding future diagnostic approaches. We began by developing a novel machine learning framework that integrates explainable AI (XAI) techniques and game-theoretic concepts. This framework allows us to classify and characterise different types of oligogenic effects, providing insights into the specific mechanisms by which multiple genes interact to drive disease. Next, we focused on refining existing computational methods used to predict the pathogenicity of variant combinations. Our emphasis was two-fold: improving computational efficiency for handling the expansive datasets associated with cohort analysis, and critically, reducing false-positive rates to ensure the reliability of our results. With these tools in hand, we developed a specialised cohort analysis approach tailored to investigating diseases with complex genetic origins. To demonstrate the capabilities of our methodology, we delved into a Marfan syndrome cohort. Marfan syndrome is a hereditary condition affecting the body's connective tissue. Our analysis successfully uncovered potential modifier mutations that appear to interact with the primary disease-causing variant, offering new clues about the intricate genetic landscape of this condition.
},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
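The "game-theoretic concepts" mentioned alongside explainable AI commonly refer to Shapley-value attribution, which fairly distributes a joint effect over the contributing players (here, genes). A self-contained sketch with a hypothetical characteristic function, not the thesis's actual framework:

```python
from itertools import permutations

# Toy characteristic function (hypothetical numbers): v(S) is the
# "pathogenicity evidence" carried jointly by a set of genes.
# Genes A and B interact synergistically; gene C adds nothing.
v = {
    frozenset(): 0.0,
    frozenset("A"): 0.1, frozenset("B"): 0.1, frozenset("C"): 0.0,
    frozenset("AB"): 0.6, frozenset("AC"): 0.1, frozenset("BC"): 0.1,
    frozenset("ABC"): 0.6,
}

def shapley(players, v):
    """Exact Shapley values: average marginal contribution over all orderings."""
    phi = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        coalition = frozenset()
        for p in order:
            phi[p] += v[coalition | {p}] - v[coalition]
            coalition = coalition | {p}
    return {p: val / len(perms) for p, val in phi.items()}

print(shapley("ABC", v))  # A and B share the synergy equally; C gets ~0
```

The attribution exposes the interaction structure: A and B each receive 0.3 because the 0.6 effect only appears when they co-occur, which is the kind of mechanism-level explanation the abstract describes.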
@phdthesis{abels2024resolving,
title = {Resolving Knowledge Limitations for Improved Collective Intelligence: A novel online machine learning approach},
author = {Abels, Axel},
url = {https://difusion.ulb.ac.be/vufind/Record/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/373334/Holdings},
year = {2024},
date = {2024-04-23},
urldate = {2024-04-23},
abstract = {One of the reasons human groups struggle to make the best decisions is that they are inherently biased in their beliefs. In essence, our perception of what is true is often distorted by individual and social biases, including stereotypes. When individuals deliberate about a decision, they tend to transmit these beliefs to others, thereby steering the entire group away from the best decision. For example, a senior doctor could spread a misinterpretation of symptoms to junior doctors, resulting in inappropriate treatments. The primary objective of this thesis is to mitigate the impact of such biases on group decision-making in domains such as medical diagnostics, policy-making, and crowdsourced fact-checking. We propose to achieve this by having humans interact through a collective decision-making platform in charge of handling the aggregation of group knowledge. The key hypothesis here is that by carefully managing the collectivization of knowledge through this platform, it will be substantially harder for humans to impose their biases on the final decision. The core of our work involves the development and analysis of algorithms for decision-making systems. These algorithms are designed to effectively aggregate diverse expertise while addressing biases. We thus focus on aggregation methods that use online learning to foster collective intelligence more effectively. In doing so, we take into account the nuances of individual expertise and the impact of biases, aiming to filter out noise and enhance the reliability of collective decisions. Our theoretical analysis of the proposed algorithms is complemented by rigorous testing in both simulated and online experimental environments to validate the system’s effectiveness. Our results demonstrate a significant improvement in performance and reduction in bias influence. 
These findings not only highlight the potential of technology-assisted decision-making but also underscore the value of addressing human biases in collaborative environments.
},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
@phdthesis{verhelst2024causal,
title = {Causal and predictive modeling of customer churn - Lessons learned from empirical and theoretical research},
author = {Theo Verhelst},
url = {https://difusion.ulb.ac.be/vufind/Record/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/368384/Holdings},
year = {2024},
date = {2024-01-29},
urldate = {2024-01-29},
abstract = {Customer churn is an important concern for large companies, especially in the
telecommunications sector. Customer retention campaigns are often used to mitigate
churn, but targeting the right customers based on their historical profiles
presents an important challenge. Companies usually have recourse to two data-driven
approaches: churn prediction and uplift modeling. In churn prediction,
customers are selected on the basis of their propensity to churn in the near future.
In uplift modeling, only customers who react positively to the campaign
are considered. Uplift modeling is used in various other domains, such as marketing,
healthcare, and finance. Despite the theoretical appeal of uplift modeling, its
added value with respect to conventional machine learning approaches has rarely
been quantified in the literature.
This doctoral thesis is the result of a collaborative research project between
the Machine Learning Group (ULB) and Orange Belgium, funded by Innoviris.
This collaboration offers a unique research opportunity to assess the added value
of causal-oriented strategies to address customer churn in the telecommunications
sector. Following the introduction, we give the necessary background in probability
theory, causality theory, and machine learning, and we describe the state of
the art in uplift modeling and counterfactual identification. Then, we present the
contributions of this thesis:
• An empirical comparison of various predictive and causal models for selecting
customers in churn prevention campaigns. We perform several benchmarks
of different state-of-the-art approaches on real-world datasets and in
live campaigns with our industrial partner, we propose a new approach that
exploits domain knowledge to improve predictions, and we make available
the first public churn dataset for uplift modeling, whose unique characteristics
make it more challenging than the few other public uplift datasets.
• Counterfactual identification allows one to classify the different behaviors
of customers in response to a marketing incentive. This can be used to establish
profiles of customers sensitive to the campaign, and subsequently
improve marketing operations. We derive novel bounds and point estimators
on the probability of counterfactual statements based on uplift models.
• A comprehensive comparison of predictive and uplift modeling, starting
from firm theoretical foundations and highlighting the parameters that influence
the performance of both approaches. In particular, we provide a new
formulation of the measure of profit, a formal proof of the convergence of
the uplift curve to the measure of profit, and an illustration, through simulations,
of the conditions under which predictive approaches still outperform
uplift modeling.
Our theoretical and empirical assessments of uplift modeling suggest that it often
fails to deliver the anticipated advantages over predictive modeling, especially in
scenarios such as customer churn within the telecom sector, characterized by class
imbalance, limited separability, and cost-benefit considerations. These results are
broadly aligned with the practical experience of our industrial partner and with
the existing scientific literature. Our counterfactual probability estimators allow
us to characterize customers at a level inaccessible to conventional predictive modeling,
revealing new insights on the behavior and preferences of customers.},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
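The distinction the abstract draws between churn prediction (modelling the propensity to churn) and uplift modelling (modelling the effect of the campaign) can be sketched with the standard "two-model" uplift baseline. Scikit-learn and the synthetic data below are illustrative assumptions, not the thesis's datasets or its proposed approach:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic example: X = customer features, t = received campaign (0/1),
# y = churned (0/1). Purely illustrative data, not the Orange dataset.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
t = rng.integers(0, 2, size=n)
# The campaign only helps customers with a positive first feature.
p = 0.3 - 0.15 * t * (X[:, 0] > 0)
y = (rng.random(n) < p).astype(int)

# Churn prediction: a single model of P(churn | x).
churn_model = RandomForestClassifier(random_state=0).fit(X, y)

# Two-model uplift: fit treated and control groups separately and subtract,
# estimating P(churn | x, treated) - P(churn | x, control).
m_t = RandomForestClassifier(random_state=0).fit(X[t == 1], y[t == 1])
m_c = RandomForestClassifier(random_state=0).fit(X[t == 0], y[t == 0])
uplift = m_t.predict_proba(X)[:, 1] - m_c.predict_proba(X)[:, 1]

# Rank customers: most negative uplift = campaign most reduces churn.
targets = np.argsort(uplift)[:100]
print(uplift.shape)
```

Churn prediction targets the likeliest churners regardless of whether the campaign changes their behaviour; uplift modelling targets those whose churn probability the campaign actually lowers, which is the added value the thesis sets out to quantify.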
2021
@phdthesis{info:hdl:2013/334819,
title = {On-Board-Unit big data analytics: from data architecture to traffic forecasting},
author = {Giovanni Buroni},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/334819},
year = {2021},
date = {2021-01-01},
note = {Funder: Université Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
2020
@phdthesis{info:hdl:2013/314684,
title = {Statistical biophysics of hematopoiesis and growing cell populations},
author = {Nathaniel Mon Père},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/314684},
year = {2020},
date = {2020-01-01},
note = {Funder: Université Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
@phdthesis{info:hdl:2013/312576,
title = {Towards multivariant pathogenicity predictions: Using machine-learning to directly predict and explore disease-causing oligogenic variant combinations},
author = {Sofia Papadimitriou},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/312576},
year = {2020},
date = {2020-01-01},
note = {Funder: Université Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
2019
@phdthesis{info:hdl:2013/287368,
title = {The role of dynamics in emergent protein properties},
author = {Gabriele Orlando},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/287368},
year = {2019},
date = {2019-01-01},
note = {Funder: Université Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
@phdthesis{info:hdl:2013/285389,
title = {Weneya'a – "quien habla con los cerros". Memoria, mántica y paisaje sagrado en la Sierra Norte de Oaxaca},
author = {Caroll Isabelle Davila},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/285389},
year = {2019},
date = {2019-01-01},
note = {Funder: Université Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
2018
@phdthesis{info:hdl:2013/272617,
title = {Beyond monogenic diseases: a first collection and analysis of digenic diseases},
author = {Andrea Gazzo},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/272617},
year = {2018},
date = {2018-01-01},
note = {Funder: Université Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
@phdthesis{info:hdl:2013/271782,
title = {Some Domain Decomposition and Convex Optimization Algorithms with Applications to Inverse Problems},
author = {Jixin Chen},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/271782},
year = {2018},
date = {2018-01-01},
note = {Funder: Université Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
@phdthesis{info:hdl:2013/272119,
title = {Beyond Supervised Learning in Credit Card Fraud Detection: A Dive into Semi-supervised and Distributed Learning},
author = {Fabrizio Carcillo},
url = {https://dipot.ulb.ac.be/dspace/bitstream/2013/272119/5/ContratDiCarcillo.pdf},
year = {2018},
date = {2018-01-01},
abstract = {The expansion of electronic commerce, as well as the increasing confidence of customers in electronic payments, makes fraud detection a critical issue. The design of a prompt and accurate Fraud Detection System is a priority for many organizations in the credit card business. In this thesis we present a series of studies to increase the precision and the speed of fraud detection systems. The thesis has three main contributions. The first concerns the integration of unsupervised techniques and supervised classifiers. We proposed several approaches to integrate outlier scores in the detection process and found that the accuracy of a conventional classifier may be improved when information about the input distribution is used to augment the training set. The second contribution concerns the role of active learning in fraud detection. We extensively compared several state-of-the-art techniques and found that Stochastic Semi-supervised Learning is a convenient approach to tackle the selection bias problem in the active learning process. The third contribution of the thesis is the design, implementation and assessment of SCARFF, an original framework for near real-time streaming fraud detection. This framework integrates Big Data technology (notably tools like Kafka, Spark and Cassandra) with a machine learning approach to deal with imbalance, non-stationarity and feedback latency in a scalable manner. Experimental results on a massive dataset of real credit card transactions have shown that our framework is scalable, efficient and accurate over a large stream of transactions.},
note = {Funder: Université Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
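The first contribution (augmenting a supervised classifier with unsupervised outlier scores) can be sketched generically with an isolation forest. Scikit-learn and the imbalanced synthetic data are assumptions made for illustration; this is not the thesis's actual pipeline, nor SCARFF itself:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Imbalanced toy "fraud" data: roughly 2% positives (illustrative only).
X, y = make_classification(n_samples=4000, weights=[0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Unsupervised step: outlier scores capture the input distribution
# without using any labels.
iso = IsolationForest(random_state=0).fit(X_tr)
score_tr = iso.score_samples(X_tr).reshape(-1, 1)
score_te = iso.score_samples(X_te).reshape(-1, 1)

# Supervised step: augment the feature set with the outlier score,
# then train a conventional classifier on the enriched inputs.
clf = RandomForestClassifier(random_state=0)
clf.fit(np.hstack([X_tr, score_tr]), y_tr)
proba = clf.predict_proba(np.hstack([X_te, score_te]))[:, 1]
print(proba.shape)  # one fraud score per test transaction
```

The design choice mirrors the abstract's finding: the outlier score injects distributional information the labeled set alone may not convey, which can lift the accuracy of an otherwise conventional classifier.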
@phdthesis{info:hdl:2013/265092,
title = {Bioinformatic inference of a prognostic epigenetic signature of immunity in breast cancers},
author = {Martin Bizet},
url = {https://dipot.ulb.ac.be/dspace/bitstream/2013/265092/7/ContratDiBizet.pdf},
year = {2018},
date = {2018-01-01},
abstract = {The alteration of epigenetic marks is increasingly recognised as a fundamental characteristic of cancers. In this thesis, we used DNA methylation profiles to improve the classification of breast cancer patients through a machine learning-based approach. The long-term objective is the development of clinical tools for personalised medicine. The DNA methylation data were acquired with a methylation-dedicated DNA microarray called Infinium. This technology is recent compared, for example, to gene expression microarrays, and its preprocessing is not yet standardised. The first part of this thesis was therefore devoted to evaluating normalisation methods by comparing the normalised data with other technologies (pyrosequencing and RRBS) for the two most recent Infinium technologies (450k and 850k). We also evaluated the coverage of biologically relevant regions (promoters and enhancers) by the two technologies. We then used the (properly preprocessed) Infinium data to develop a score, called the MeTIL score, which has prognostic and predictive value in breast cancers. We took advantage of the ability of DNA methylation to reflect cellular composition to extract a methylation signature (that is, a set of DNA positions where methylation varies) reflecting the presence of lymphocytes in the tumour sample. After selecting sites with lymphocyte-specific methylation, we developed a machine learning-based approach to obtain a signature reduced to an optimal size of five sites, potentially allowing clinical use. After converting this signature into a score, we showed its specificity for lymphocytes using external data and computer simulations. We then showed the ability of the MeTIL score to predict the response to chemotherapy, as well as its prognostic power in independent breast cancer cohorts and even in other cancers.},
note = {Funder: Université Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
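The abstract above describes reducing a lymphocyte-specific methylation signature to five CpG sites and converting it into a per-sample score. A minimal sketch of the score step, assuming a simple mean-of-beta-values convention; the site names and beta values below are invented for illustration and are not the actual MeTIL sites defined in the thesis:

```python
# Toy sketch: turning a small methylation signature into a per-sample score.
# Site IDs and beta values are hypothetical, not the real MeTIL signature.

SIGNATURE_SITES = ["cg_A", "cg_B", "cg_C", "cg_D", "cg_E"]  # invented IDs

def methylation_score(sample_betas):
    """Average methylation (beta value, 0..1) over the signature sites.

    In this toy convention, lower methylation at lymphocyte-specific sites
    would indicate stronger lymphocyte infiltration in the tumour sample.
    """
    values = [sample_betas[site] for site in SIGNATURE_SITES]
    return sum(values) / len(values)

tumour_sample = {"cg_A": 0.2, "cg_B": 0.3, "cg_C": 0.25, "cg_D": 0.15, "cg_E": 0.1}
print(round(methylation_score(tumour_sample), 3))  # prints 0.2
```

A real score would also need the site-selection step (machine learning on lymphocyte versus tumour methylomes) that the abstract describes.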
Reggiani, Claudio Bioinformatic discovery of novel exons expressed in human brain and their association with neurodevelopmental disorders PhD Thesis 2018, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/270994,
title = {Bioinformatic discovery of novel exons expressed in human brain and their association with neurodevelopmental disorders},
author = {Claudio Reggiani},
url = {https://dipot.ulb.ac.be/dspace/bitstream/2013/270994/5/ContratDiReggiani.pdf},
year = {2018},
date = {2018-01-01},
abstract = {An important quest in genomics since the publication of the first complete human genome in 2003 has been its functional annotation. DNA holds the instructions for the production of the components necessary for the life of cells and organisms. A complete functional catalog of genomic regions will help the understanding of the cell body and its dynamics, thus creating links between genotype and phenotypic traits. The need for annotations prompted the development of several bioinformatic methods. In the context of promoter and first-exon predictors, the majority of models rely principally on structural and chemical properties of the DNA sequence. Some of them integrate information from epigenomic and transcriptomic data as secondary features. Current genomic research asserts that reference genomes are far from fully annotated (the human genome included). Physicians rely on reference genome annotations and functional databases to understand disorders with a genetic basis, and missing annotations may lead to unresolved cases. Because of their complexity, neurodevelopmental disorders are under study to identify all the genetic regions involved. Besides functional validation in model organisms, the search for genotype-phenotype associations is supported by statistical analysis, which is typically biased towards known functional regions. This thesis addresses the use of an in-silico integrative analysis to improve reference genome annotations and discover novel functional regions associated with neurodevelopmental disorders. The contributions outlined in this document have practical applications in clinical settings. The presented bioinformatic method is based on epigenomic and transcriptomic data, thus excluding features derived from the DNA sequence.
Such an integrative approach applied to brain data allowed the discovery of two novel promoters and coding first exons in the human DLG2 gene, which were also found to be statistically associated with neurodevelopmental disorders, and with intellectual disability in particular. The application of the same methodology to the whole genome resulted in the discovery of other novel exons expressed in the brain. Concerning the in-silico method itself, the research demanded a large number of functional and clinical datasets to properly support and validate our discoveries. This work describes a bioinformatic method for genome annotation, in the specific area of promoters and first exons. So far the method has been applied to brain data, and the extension to whole-body data would be a logical by-product. We will leverage distributed frameworks to tackle the even higher amount of data to analyse, a task that has already begun. Another interesting research direction that came out of this work is the temporal enrichment analysis of epigenomic data across different developmental stages, in which changes of epigenomic enrichment suggest time-specific and tissue-specific regulation of genes and gene isoforms.},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
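As a rough illustration of the integrative filtering the abstract describes (retaining a candidate novel promoter/first exon only when it is supported by both an epigenomic promoter mark and transcription evidence), here is a toy sketch; the interval coordinates and the two evidence sets are invented, and real analyses would use ChIP-seq peaks and RNA-seq coverage on genomic coordinates:

```python
# Toy sketch of integrative candidate filtering: a region is kept only if it
# overlaps both an H3K4me3-like promoter mark and an expressed region.
# All intervals are invented; real data would come from ChIP-seq / RNA-seq.

def overlaps(a, b):
    """True if half-open genomic intervals (start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]

def supported_candidates(candidates, promoter_peaks, expressed_regions):
    """Candidates overlapping both a promoter mark and expression evidence."""
    return [c for c in candidates
            if any(overlaps(c, p) for p in promoter_peaks)
            and any(overlaps(c, e) for e in expressed_regions)]

cands = [(100, 200), (500, 600), (900, 950)]
peaks = [(150, 300), (480, 520)]   # promoter-mark intervals
expr  = [(90, 210), (940, 1000)]   # transcription-evidence intervals
print(supported_candidates(cands, peaks, expr))  # prints [(100, 200)]
```

The point of the sketch is the conjunction of two independent data types, which is what distinguishes this approach from sequence-only predictors.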
2017
|
Amghar, Mohamed Multiscale local polynomial transforms in smoothing and density estimation PhD Thesis 2017, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/262040,
title = {Multiscale local polynomial transforms in smoothing and density estimation},
author = {Mohamed Amghar},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/262040},
year = {2017},
date = {2017-01-01},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
Raimondi, Daniele The effect of genome variation on human proteins: understanding variants and improving their deleteriousness prediction through extensive contextualisation PhD Thesis 2017, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/251313,
title = {The effect of genome variation on human proteins: understanding variants and improving their deleteriousness prediction through extensive contextualisation},
author = {Daniele Raimondi},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/251313},
year = {2017},
date = {2017-01-01},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
2016
|
Zisis, Ioannis The Effect of Group Formation on Behaviour: An Experimental and Evolutionary Analysis PhD Thesis 2016, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/231974,
title = {The Effect of Group Formation on Behaviour: An Experimental and Evolutionary Analysis},
author = {Ioannis Zisis},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/231974},
year = {2016},
date = {2016-01-01},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
2015
|
Lopes, Miguel Inference of gene networks from time series expression data and application to type 1 Diabetes PhD Thesis 2015, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/216729b,
title = {Inference of gene networks from time series expression data and application to type 1 Diabetes},
author = {Miguel Lopes},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/216729},
year = {2015},
date = {2015-01-01},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
Hajingabo, Leon Analyzing molecular network perturbations in human cancer: application to mutated genes and gene fusions involved in acute lymphoblastic leukemia PhD Thesis 2015, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/209126b,
title = {Analyzing molecular network perturbations in human cancer: application to mutated genes and gene fusions involved in acute lymphoblastic leukemia},
author = {Leon Hajingabo},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/209126},
year = {2015},
date = {2015-01-01},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
Dal Pozzolo, Andrea Adaptive Machine Learning for Credit Card Fraud Detection PhD Thesis 2015, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/221654,
title = {Adaptive Machine Learning for Credit Card Fraud Detection},
author = {Andrea Dal Pozzolo},
url = {https://dipot.ulb.ac.be/dspace/bitstream/2013/221654/5/contratDalPozzolo.pdf},
year = {2015},
date = {2015-01-01},
abstract = {Billions of dollars are lost every year to fraudulent credit card transactions. The design of efficient fraud detection algorithms is key to reducing these losses, and more and more algorithms rely on advanced machine learning techniques to assist fraud investigators. The design of fraud detection algorithms is, however, particularly challenging due to the non-stationary distribution of the data, the highly unbalanced class distributions and the availability of few transactions labeled by fraud investigators. At the same time, public data are scarcely available due to confidentiality issues, leaving many questions about the best strategy unanswered. In this thesis we aim to provide some answers by focusing on crucial issues such as: i) why and how undersampling is useful in the presence of class imbalance (i.e. frauds are a small percentage of the transactions), ii) how to deal with unbalanced and evolving data streams (non-stationarity due to fraud evolution and changes in spending behavior), iii) how to assess performance in a way which is relevant for detection and iv) how to use the feedback provided by investigators on the generated fraud alerts. Finally, we design and assess a prototype of a Fraud Detection System able to meet real-world working conditions and to integrate investigators' feedback to generate accurate alerts.},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
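Point i) of the abstract concerns undersampling under class imbalance. A minimal, self-contained sketch of random undersampling on synthetic data, not the thesis's actual pipeline: all minority ("fraud") samples are kept and an equal-sized random subset of the majority class is drawn before training a classifier.

```python
# Sketch of random undersampling for class imbalance on synthetic data.
# Labels, features and proportions are invented for illustration.

import random

def undersample(X, y, majority_label=0, seed=42):
    """Keep all minority samples plus an equal-size random majority subset."""
    rng = random.Random(seed)
    minority = [(x, l) for x, l in zip(X, y) if l != majority_label]
    majority = [(x, l) for x, l in zip(X, y) if l == majority_label]
    kept = rng.sample(majority, k=len(minority))
    balanced = minority + kept
    rng.shuffle(balanced)
    xs, ls = zip(*balanced)
    return list(xs), list(ls)

X = [[i] for i in range(100)]
y = [1 if i < 5 else 0 for i in range(100)]   # 5% "fraud"
Xb, yb = undersample(X, y)
print(sum(yb), len(yb))  # prints: 5 10
```

Undersampling also biases the posterior fraud probability of the trained model, which is one reason the "why and how" of the question above is not trivial.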
Lerman, Liran A machine learning approach for automatic and generic side-channel attacks PhD Thesis 2015, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/209070,
title = {A machine learning approach for automatic and generic side-channel attacks},
author = {Liran Lerman},
url = {https://dipot.ulb.ac.be/dspace/bitstream/2013/209070/2/be487c5b-7b94-414c-bf2e-96847aa98284.txt},
year = {2015},
date = {2015-01-01},
abstract = {L'omniprésence de dispositifs interconnectés amène à un intérêt massif pour la sécurité informatique fournie entre autres par le domaine de la cryptographie. Pendant des décennies, les spécialistes en cryptographie estimaient le niveau de sécurité d'un algorithme cryptographique indépendamment de son implantation dans un dispositif. Cependant, depuis la publication des attaques d'implantation en 1996, les attaques physiques sont devenues un domaine de recherche actif en considérant les propriétés physiques de dispositifs cryptographiques. Dans notre dissertation, nous nous concentrons sur les attaques profilées. Traditionnellement, les attaques profilées appliquent des méthodes paramétriques dans lesquelles une information a priori sur les propriétés physiques est supposée. Le domaine de l'apprentissage automatique produit des modèles automatiques et génériques ne nécessitant pas une information a priori sur le phénomène étudié.
Cette dissertation apporte un éclairage nouveau sur les capacités des méthodes d'apprentissage automatique. Nous démontrons d'abord que les attaques profilées paramétriques surpassent les méthodes d'apprentissage automatique lorsqu'il n'y a pas d'erreur d'estimation ni d'hypothèse. En revanche, les attaques fondées sur l'apprentissage automatique sont avantageuses dans des scénarios réalistes où le nombre de données lors de l'étape d'apprentissage est faible. Par la suite, nous proposons une nouvelle métrique formelle d'évaluation qui permet (1) de comparer des attaques paramétriques et non-paramétriques et (2) d'interpréter les résultats de chaque méthode. La nouvelle mesure fournit les causes d'un taux de réussite élevé ou faible d'une attaque et, par conséquent, donne des pistes pour améliorer l'évaluation d'une implantation. Enfin, nous présentons des résultats expérimentaux sur des appareils non protégés et protégés.
La première étude montre que l'apprentissage automatique a un taux de réussite plus élevé qu'une méthode paramétrique lorsque seules quelques données sont disponibles. La deuxième expérience démontre qu'un dispositif protégé est attaquable avec une approche appartenant à l'apprentissage automatique. La stratégie basée sur l'apprentissage automatique nécessite le même nombre de données lors de la phase d'apprentissage que lorsque celle-ci attaque un produit non protégé. Nous montrons également que des méthodes paramétriques surestiment ou sous-estiment le niveau de sécurité fourni par l'appareil alors que l'approche basée sur l'apprentissage automatique améliore cette estimation.
En résumé, notre thèse est que les attaques basées sur l'apprentissage automatique sont avantageuses par rapport aux techniques classiques lorsque la quantité d'information a priori sur l'appareil cible et le nombre de données lors de la phase d'apprentissage sont faibles.},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
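The profiled attacks discussed above can be cast as supervised learning: fit a model on labelled leakages from a profiling device, then classify the leakage of a target device. A toy sketch using a nearest-centroid rule and a synthetic Hamming-weight leakage model; the noise level and all measurements are invented stand-ins for real power traces, and labelling by Hamming weight of an intermediate value is an assumption, not the thesis's exact setup:

```python
# Toy profiled attack: learn per-label leakage centroids ("templates"),
# then classify a target leakage. Synthetic Hamming-weight + noise model.

import random

def hamming_weight(v):
    return bin(v).count("1")

def profile(traces):
    """Nearest-centroid model: mean leakage per label from labelled traces."""
    sums, counts = {}, {}
    for leak, label in traces:
        sums[label] = sums.get(label, 0.0) + leak
        counts[label] = counts.get(label, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

def attack(centroids, leak):
    """Predict the label whose centroid is closest to the observed leakage."""
    return min(centroids, key=lambda k: abs(centroids[k] - leak))

rng = random.Random(0)
# Label = Hamming weight of the processed byte (a common target in profiling).
training = [(hw + rng.gauss(0, 0.1), hw) for hw in range(9) for _ in range(50)]
centroids = profile(training)
print(attack(centroids, hamming_weight(0xB6) + 0.03))  # prints 5
```

The abstract's contrast between parametric templates and machine learning amounts to replacing this centroid rule with either a Gaussian model or a generic learned classifier.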
2014
|
Kidzinski, Lukasz Inference for stationary functional time series: dimension reduction and regression PhD Thesis 2014, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/209226,
title = {Inference for stationary functional time series: dimension reduction and regression},
author = {Lukasz Kidzinski},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/209226},
year = {2014},
date = {2014-01-01},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
Taieb, Souhaib Ben Machine learning strategies for multi-step-ahead time series forecasting PhD Thesis 2014, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/209234,
title = {Machine learning strategies for multi-step-ahead time series forecasting},
author = {Souhaib Ben Taieb},
url = {https://dipot.ulb.ac.be/dspace/bitstream/2013/209234/4/2c5e8bfe-3eab-4c2a-acb0-843504ddfcbd.txt},
year = {2014},
date = {2014-01-01},
abstract = {How much electricity is going to be consumed for the next 24 hours? What will be the temperature for the next three days? What will be the number of sales of a certain product for the next few months? Answering these questions often requires forecasting several future observations from a given sequence of historical observations, called a time series.
Historically, time series forecasting has been mainly studied in econometrics and statistics. In the last two decades, machine learning, a field that is concerned with the development of algorithms that can automatically learn from data, has become one of the most active areas of predictive modeling research. This success is largely due to the superior performance of machine learning prediction algorithms in many different applications as diverse as natural language processing, speech recognition and spam detection. However, there has been very little research at the intersection of time series forecasting and machine learning.
The goal of this dissertation is to narrow this gap by addressing the problem of multi-step-ahead time series forecasting from the perspective of machine learning. To that end, we propose a series of forecasting strategies based on machine learning algorithms.
Multi-step-ahead forecasts can be produced recursively by iterating a one-step-ahead model, or directly using a specific model for each horizon. As a first contribution, we conduct an in-depth study to compare recursive and direct forecasts generated with different learning algorithms for different data generating processes. More precisely, we decompose the multi-step mean squared forecast errors into the bias and variance components, and analyze their behavior over the forecast horizon for different time series lengths.
The results and observations made in this study then guide us for the development of new forecasting strategies.
In particular, we find that choosing between recursive and direct forecasts is not an easy task since it involves a trade-off between bias and estimation variance that depends on many interacting factors, including the learning model, the underlying data generating process, the time series length and the forecast horizon. As a second contribution, we develop multi-stage forecasting strategies that do not treat the recursive and direct strategies as competitors, but seek to combine their best properties. More precisely, the multi-stage strategies generate recursive linear forecasts, and then adjust these forecasts by modeling the multi-step forecast residuals with direct nonlinear models at each horizon, called rectification models. We propose a first multi-stage strategy, that we called the rectify strategy, which estimates the rectification models using the nearest neighbors model. However, because recursive linear forecasts often need small adjustments with real-world time series, we also consider a second multi-stage strategy, called the boost strategy, that estimates the rectification models using gradient boosting algorithms that use so-called weak learners.
Generating multi-step forecasts using a different model at each horizon provides a large modeling flexibility. However, selecting these models independently can lead to irregularities in the forecasts that can contribute to increase the forecast variance. The problem is exacerbated with nonlinear machine learning models estimated from short time series. To address this issue, and as a third contribution, we introduce and analyze multi-horizon forecasting strategies that exploit the information contained in other horizons when learning the model for each horizon.
In particular, to select the lag order and the hyperparameters of each model, multi-horizon strategies minimize forecast errors over multiple horizons rather than just the horizon of interest.
We compare all the proposed strategies with both the recursive and direct strategies. We first apply a bias and variance study, then we evaluate the different strategies using real-world time series from two past forecasting competitions. For the rectify strategy, in addition to avoiding the choice between recursive and direct forecasts, the results demonstrate that it has better, or at least has close performance to, the best of the recursive and direct forecasts in different settings. For the multi-horizon strategies, the results emphasize the decrease in variance compared to single-horizon strategies, especially with linear or weakly nonlinear data generating processes. Overall, we found that the accuracy of multi-step-ahead forecasts based on machine learning algorithms can be significantly improved if an appropriate forecasting strategy is used to select the model parameters and to generate the forecasts.
Lastly, as a fourth contribution, we have participated in the Load Forecasting track of the Global Energy Forecasting Competition 2012. The competition involved a hierarchical load forecasting problem where we were required to backcast and forecast hourly loads for a US utility with twenty geographical zones. Our team, TinTin, ranked fifth out of 105 participating teams, and we have been awarded an IEEE Power & Energy Society award.},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
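The recursive/direct distinction at the heart of the abstract can be sketched with a trivial linear rule standing in for a learned model (coefficients invented): the recursive strategy iterates a single one-step model, while the direct strategy fits a dedicated model per horizon.

```python
# Sketch of recursive vs direct multi-step forecasting with an AR(1)-style
# rule in place of a learned model. The coefficients are invented; in the
# thesis the models are estimated by machine learning algorithms.

def recursive_forecast(last, phi, horizon):
    """Iterate one one-step model: x_{t+h} = phi * x_{t+h-1}."""
    preds, x = [], last
    for _ in range(horizon):
        x = phi * x
        preds.append(x)
    return preds

def direct_forecast(last, phis_per_horizon):
    """One dedicated model per horizon h: x_{t+h} = phi_h * x_t."""
    return [phi_h * last for phi_h in phis_per_horizon]

print(recursive_forecast(10.0, 0.5, 3))           # prints [5.0, 2.5, 1.25]
print(direct_forecast(10.0, [0.5, 0.25, 0.125]))  # prints [5.0, 2.5, 1.25]
```

With a correctly specified linear process the two coincide, as here; the bias/variance trade-off the abstract analyzes arises once the models are estimated from finite, possibly nonlinear data.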
2013
|
Olsen, Catharina Causal inference and prior integration in bioinformatics using information theory PhD Thesis 2013, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/209401b,
title = {Causal inference and prior integration in bioinformatics using information theory},
author = {Catharina Olsen},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/209401},
year = {2013},
date = {2013-01-01},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
2011
|
Miranda, Abhilash Alexander Spectral factor model for time series learning PhD Thesis 2011, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/209812b,
title = {Spectral factor model for time series learning},
author = {Abhilash Alexander Miranda},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/209812},
year = {2011},
date = {2011-01-01},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
2009
|
Caelen, Olivier Sélection séquentielle en environnement aléatoire appliquée à l'apprentissage supervisé PhD Thesis 2009, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/210265b,
title = {Sélection séquentielle en environnement aléatoire appliquée à l'apprentissage supervisé},
author = {Olivier Caelen},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/210265},
year = {2009},
date = {2009-01-01},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
Borgne, Yann-Aël Le Learning in wireless sensor networks for energy-efficient environmental monitoring PhD Thesis 2009, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/210334b,
title = {Learning in wireless sensor networks for energy-efficient environmental monitoring},
author = {Yann-Aël Le Borgne},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/210334},
year = {2009},
date = {2009-01-01},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
Haibe-Kains, Benjamin Identification and assessment of gene signatures in human breast cancer PhD Thesis 2009, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/210348b,
title = {Identification and assessment of gene signatures in human breast cancer},
author = {Benjamin Haibe-Kains},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/210348},
year = {2009},
date = {2009-01-01},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
Kontos, Kevin Gaussian graphical model selection for gene regulatory network reverse engineering and function prediction PhD Thesis 2009, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/210301,
title = {Gaussian graphical model selection for gene regulatory network reverse engineering and function prediction},
author = {Kevin Kontos},
url = {https://dipot.ulb.ac.be/dspace/bitstream/2013/210301/1/453ad8e7-667f-4c22-ab95-dc953d05b89d.txt},
year = {2009},
date = {2009-01-01},
abstract = {One of the most important and challenging "knowledge extraction" tasks in bioinformatics is the reverse engineering of gene regulatory networks (GRNs) from DNA microarray gene expression data. Indeed, as a result of the development of high-throughput data-collection techniques, biology is experiencing a data flood phenomenon that pushes biologists toward a new view of biology, systems biology, which aims at system-level understanding of biological systems.

Unfortunately, even for small model organisms such as the yeast Saccharomyces cerevisiae, the number p of genes is much larger than the number n of expression data samples. The dimensionality issue induced by this "small n, large p" data setting renders standard statistical learning methods inadequate. Restricting the complexity of the models makes it possible to deal with this serious impediment. Indeed, by introducing (a priori undesirable) bias in the model selection procedure, one reduces the variance of the selected model, thereby increasing its accuracy.

Gaussian graphical models (GGMs) have proven to be a very powerful formalism for inferring GRNs from expression data. Standard GGM selection techniques can unfortunately not be used in the "small n, large p" data setting. One way to overcome this issue is to resort to regularization. In particular, shrinkage estimators of the covariance matrix, which is required to infer GGMs, have proven to be very effective. Our first contribution is a new shrinkage estimator that improves upon existing ones through the use of a Monte Carlo (parametric bootstrap) procedure.

Another approach to GGM selection in the "small n, large p" data setting consists of reverse engineering limited-order partial correlation graphs (q-partial correlation graphs) to approximate GGMs. Our second contribution is an inference algorithm, the q-nested procedure, that builds a sequence of nested q-partial correlation graphs, exploiting the topology of the smaller-order graphs to infer the higher-order ones. This significantly speeds up the inference of such graphs and avoids problems related to multiple testing. Consequently, we are able to consider higher-order graphs, thereby increasing the accuracy of the inferred graphs.

Another important challenge in bioinformatics is the prediction of gene function. An example of such a prediction task is the identification of genes that are targets of the nitrogen catabolite repression (NCR) selection mechanism in the yeast Saccharomyces cerevisiae. The study of model organisms such as Saccharomyces cerevisiae is indispensable for the understanding of more complex organisms. Our third contribution extends the standard two-class classification approach by enriching the set of variables and comparing several feature selection techniques and classification algorithms.

Finally, our fourth contribution formulates the prediction of NCR target genes as a network inference task. We use GGM selection to infer multivariate dependencies between genes and, starting from a set of genes known to be sensitive to NCR, we classify the remaining genes. We hence avoid problems related to the choice of a negative training set and take advantage of the robustness of GGM selection techniques in the "small n, large p" data setting.},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|
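The regularization idea in the abstract above, shrinking the sample covariance so that a GGM can be estimated when n < p, can be sketched in a few lines. This is an illustrative toy using a simple diagonal shrinkage target and a fixed shrinkage weight, not the thesis's Monte Carlo (parametric bootstrap) estimator:

```python
import numpy as np

def shrink_covariance(X, lam):
    """Convex combination of the sample covariance and a diagonal target.

    X: (n, p) data matrix; lam in (0, 1] is the shrinkage intensity.
    Even when the sample covariance is singular (n < p), the shrunk
    estimate is positive definite and can therefore be inverted.
    """
    S = np.cov(X, rowvar=False)      # sample covariance, singular when n < p
    T = np.diag(np.diag(S))          # shrinkage target: diagonal of S
    return (1 - lam) * S + lam * T

def partial_correlations(sigma):
    """Partial correlations from the precision matrix.

    Nonzero off-diagonal entries correspond to edges of the GGM.
    """
    K = np.linalg.inv(sigma)
    d = np.sqrt(np.diag(K))
    P = -K / np.outer(d, d)
    np.fill_diagonal(P, 1.0)
    return P

# Usage: with n = 10 samples of p = 20 variables, np.cov alone is not
# invertible, but the shrunk estimate supports GGM-style inference.
```

Choosing lam from the data (the point of the thesis's bootstrap procedure) is what distinguishes a principled shrinkage estimator from the fixed weight used here.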
2000
|
Bontempi, Gianluca Local learning techniques for modeling, prediction and control PhD Thesis 2000, (Funder: Universite Libre de Bruxelles). @phdthesis{info:hdl:2013/211823b,
title = {Local learning techniques for modeling, prediction and control},
author = {Gianluca Bontempi},
url = {http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/211823},
year = {2000},
date = {2000-01-01},
note = {Funder: Universite Libre de Bruxelles},
keywords = {},
pubstate = {published},
tppubtype = {phdthesis}
}
|