ADM - Analyse de Données et Modélisation probabiliste

M. Sc. course - Data analysis and probabilistic modeling

Making data speak: Advanced probabilistic data analysis and modeling

Data, whatever their nature, are of limited value unless one can extract information from them to better summarize, understand, and predict. Statistical methods for data analysis and probabilistic models for statistical machine learning are commonly used for this purpose. This course covers the basic techniques of data analysis (exploratory statistics) and probabilistic modeling (inferential statistics) and studies their application to different types of data (symbolic data, language, numerical data, signals, images, etc.). The lectures are organized around the two major steps of any modeling process: understand your data, then design an adequate model.

Keywords: data analysis, factor analysis, variance analysis, clustering, hypothesis testing, decision theory, estimation theory, Gaussian mixture models, EM algorithm, Markov chains, Markov fields, hidden Markov chains, Viterbi algorithm, Bayesian networks, token passing algorithm

Lectures, with the 2020-2021 dates

Thu. Sep. 17 14h45. A gentle reminder of the basics of probability: Kolmogorov axioms, random variables, moments, classical distributions
Fri. Sep. 18 11h. Exploratory statistics: visualization, summaries, correlation, factor analysis, PCA/LDA
Thu. Sep. 24 14h45. Cluster analysis: k-means, agglomerative/divisive clustering, spectral clustering and other weird things
Thu. Oct. 1 14h45. Fundamentals of statistical machine learning and estimation theory: cost function, decision theory, empirical estimation, estimation theory, practical estimation techniques
Fri. Oct. 2 11h. Mixture models: mixture models, hidden variables, expectation-maximization (EM) algorithm

See also the handwritten notes on EM for a two-component Gaussian mixture model

Thu. Oct. 8 14h45. Observable and hidden Markov models: Markov property, Markov chain, hidden Markov chain, Viterbi algorithm
Fri. Oct. 9 11h. Hidden Markov models (continued): Baum-Welch algorithm, practical examples
Thu. Oct. 15 11h. Entropy and conditional random fields: maximum entropy principle, maxent model, logistic regression, log-linear sequence models, parameter estimation
Thu. Oct. 22 14h45. Graphical models and Bayesian networks: directed/undirected graphical models, Bayesian networks, inference and reasoning, moralization, variable elimination, junction tree algorithm
Fri. Oct. 23 11h. Hypothesis testing: typology, likelihood ratio test, classical mean value tests, comparison and statistical significance, variance analysis
Fri. Nov. 13 14h45. Final exam. The exact form of the exam is not fully defined yet, because it must account for potential COVID-19 contact cases ("cas contacts") who cannot come to the exam room.

Evaluation / exams

Homework. Write a short commentary on one of the articles below. Maximum length is 1000 words, ca. 1.5-2 pages in single-column, 11-point font (English or French, as you wish). Your report should identify the techniques seen in class, explain why they are appropriate in the context of the paper and what efforts the authors made to cast their work into a probabilistic framework, explain how the techniques were adapted and/or extended, and discuss the limits you foresee (whether mentioned in the paper or not). Deadline for emailing your commentary: before Nov. 2, 2020, 08:00 CET.

Douglas Reynolds, Thomas Quatieri and Robert Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10:19-41, 2000
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579-2605, 2008
Tao Xiang and Shaogang Gong. Spectral clustering with eigenvector selection. Pattern Recognition, 41:1012-1029, 2008

Final exam. Standard 2h written exam. You can check the text of past exams below.

2019 in French
2018 in French, in English

2017 in French, in English