In the past I have worked on tree- and graph-structured patterns.
I then have become more interested in generic pattern mining algorithms, where a single algorithm can handle a broad range of pattern definitions.
Efficiency is a key concern for pattern miners, in this regard I am interested on condensed representations and exploiting the parallelism of multicore processors.
Today's critical challenge for the pattern mining domain is to output few meaningful patterns. In this regard, I am interested in combining pattern mining approaches with optimization techniques.
Anomaly Detection in Streams with Extreme Value Theory
2017
Alban Siffer, Pierre-Alain Fouque, Alexandre Termier, Christine Largouët
Conference Paper KDD 2017, pp. 1067-1075.
Abstract: Anomaly detection in time series has attracted considerable attention due to its importance in many real-world applications including intrusion detection, energy management and finance. Most approaches for detecting outliers rely on either manually set thresholds or assumptions on the distribution of data according to Chandola, Banerjee and Kumar.
Here, we propose a new approach to detect outliers in streaming univariate time series based on Extreme Value Theory that does not require to hand-set thresholds and makes no assumption on the distribution: the main parameter is only the risk, controlling the number of false positives. Our approach can be used for outlier detection, but more generally for automatically setting thresholds, making it useful in wide number of situations. We also experiment our algorithms on various real-world datasets which confirm its soundness and efficiency
Purchase Signatures of Retail Customers
2017
Clément Gautrais, René Quiniou, Peggy Cellier, Thomas Guyet, Alexandre Termier
Conference Paper PAKDD'17, pp. 110-121.
Abstract: In the retail context, there is an increasing need for understanding individual customer behavior in order to personalize marketing actions. We propose the novel concept of customer signature, that identifies a set of important products that the customer refills regularly. Both the set of products and the refilling time periods give new insights on the customer behavior. Our approach is inspired by methods from the domain of sequence segmentation, thus benefiting from efficient exact and approximate algorithms. Experiments on a real massive retail dataset show the interest of the signatures for understanding individual customers.
TopPI: An efficient algorithm for item-centric mining
2017
Vincent Leroy, Martin Kirchgessner, Alexandre Termier, Sihem Amer-Yahia
Journal Paper Information Systems, pp. 104-118.
Abstract: In this paper, we introduce item-centric mining, a new semantics for mining long-tailed datasets. Our algorithm, TopPI, finds for each item its top-k most frequent closed itemsets. While most mining algorithms focus on the globally most frequent itemsets, TopPI guarantees that each item is represented in the results, regardless of its frequency in the database.
TopPI allows users to efficiently explore Web data, answering questions such as “what are the k most common sets of songs downloaded together with the ones of my favorite artist?”. When processing retail data consisting of 55 million supermarket receipts, TopPI finds the itemset “milk, puff pastry” that appears 10,315 times, but also “frangipane, puff pastry” and “nori seaweed, wasabi, sushi rice” that occur only 1120 and 163 times, respectively. Our experiments with analysts from the marketing department of our retail partner demonstrate that item-centric mining discover valuable itemsets. We also show that TopPI can serve as a building-block to approximate complex itemset ranking measures such as the p-value.
Thanks to efficient enumeration and pruning strategies, TopPI avoids the search space explosion induced by mining low support itemsets. We show how TopPI can be parallelized on multi-cores and distributed on Hadoop clusters. Our experiments on datasets with different characteristics show the superiority of TopPI when compared to standard top-k solutions, and to Parallel FP-Growth, its closest competitor.
Efficient local search for L1 and L2 binary matrix factorization
2016
Seyed Hamid Mirisaee, Éric Gaussier, Alexandre Termier
Journal Paper Intelligent Data Analysis, pp. 783-807.
Abstract: Rank K Binary Matrix Factorization (BMF) approximates a binary matrix by the product of two binary matrices of lower rank, K. Several researchers have addressed this problem, focusing on either approximations of rank 1 or higher, using either the L_1 or L_2-norms for measuring the quality of the approximation. The rank 1 problem (for which the L_1 and L_2-norms are equivalent) has been shown to be related to the Integer Linear Programming (ILP) problem. We first show here that the alternating strategy with the L_2-norm, at the core of several methods used to solve BMF, can be reformulated as an Unconstrained Binary Quadratic Programming (UBQP) problem. This reformulation allows us to use local search procedures designed for UBQP in order to improve the solutions of BMF. We then introduce a new local search dedicated to the BMF problem. We show in particular that this solution is in average faster than the previously proposed ones. We then assess its behavior on several collections and methods and show that it significantly improves methods targeting the L_2-norms on all the datasets considered; for the L_1-norm, the improvement is also significant for real, structured datasets and for the BMF problem without the binary reconstruction constraint.
TopPI: An Efficient Algorithm for Item-Centric Mining
2016
Martin Kirchgessner, Vincent Leroy, Alexandre Termier, Sihem Amer-Yahia, Marie-Christine Rousset
Conference Paper DaWaK 2016, pp. 19-33.
Abstract: We introduce TopPI, a new semantics and algorithm designed
to mine long-tailed datasets. For each item, and regardless of
its frequency, TopPI finds the k most frequent closed itemsets that item
belongs to. For example, in our retail dataset, TopPI finds the itemset
“nori seaweed, wasabi, sushi rice, soy sauce” that occurrs in only 133 store
receipts out of 290 million. It also finds the itemset “milk, puff pastry”,
that appears 152,991 times. Thanks to a dynamic threshold adjustment
and an adequate pruning strategy, TopPI efficiently traverses the relevant
parts of the search space and can be parallelized on multi-cores. Our
experiments on datasets with different characteristics show the high performance
of TopPI and its superiority when compared to state-of-the-art
mining algorithms. We show experimentally on real datasets that TopPI
allows the analyst to explore and discover valuable itemsets.
Identifying Genetic Variant Combinations Using Skypatterns
2016
Hoang-Son Pham, Dominique Lavenier, Alexandre Termier
Workshop BioKDD 2016 (DEXA Workshops), pp. 44-48.
Abstract: Identifying variant combination association with
disease is a bioinformatics challenge. This problem can be solved
by discriminative pattern mining that use statistical function to
evaluate the significance of individual biological patterns. There is
a wide range of such measures. However, selecting an appropriate
measure as well as a suitable threshold in some specific practical
situations is a difficult task. In this article, we propose to use
the skypattern technique which allows combinations of measures
to be used to evaluate the importance of variant combinations
without having to select a given measure and a fixed threshold.
Experiments on several real variant datasets demonstrate that
the skypattern method effectively identifies the risk variant
combinations related to diseases.
Understanding Customer Attrition at an Individual Level: a New Model in Grocery Retail Context
2016
Clément Gautrais, Peggy Cellier, Thomas Guyet, René Quiniou, Alexandre Termier
Conference Poster EDBT 2016 (poster), pp. 686-687.
Abstract: This paper presents a new model to detect and explain customer
defection in a grocery retail context. This new model
analyzes the evolution of each customer basket content. It
therefore provides actionable knowledge for the retailer at an
individual scale. In addition, this model is able to identify
customers that are likely to defect in the future months.
Steady Patterns
2016
Willy Ugarte, Alexandre Termier, Miguel Santana
Workshop DSBDA 2016 (ICDM Workshops), pp. 692-699.
Abstract: Skypatterns are an elegant answer to the pattern
explosion issue, when a set of measures can be provided. Skypatterns
for all possible measure combinations can be explored
thanks to recent work on the skypattern cube. However, this
leads to too many skypatterns, where it is difficult to quickly
identify which ones are more important. First, we introduce a
new notion of pattern steadiness which measures the conservation
of the skypattern property across the skypattern cube, allowing
to see which are the “most universal” skypatterns. Then, we
extended this notion to partitions of the dataset, and show in
our experiments that this both allows to discover especially
stable skypatterns, and identify interesting differences between
the partitions.
Towards Visualizing Hidden Structures
2016
Remy Dautriche, Alexandre Termier, Renaud Blanch, Miguel Santana
Workshop ICDM 2016 (PhD forum), pp. 1183-1190.
Abstract: There is an increasing need to quickly understand the contents log data. A wide range of patterns can be computed and provide valuable information: for example existence of repeated sequences of events or periodic behaviors. However patternminingtechniquesoftenproducemanypatternsthathave to be examined one by one, which is time consuming for experts. On the other hand, visualization techniques are easier to understand, but cannot provide the in-depth understanding provided by pattern mining approaches. Our contribution is to propose a novel visualanalytics methodthat allows toimmediately visualize hidden structures such as repeated sets/sequences and periodicity, allowing to quickly gain a deep understanding of the log.
TraceViz: a visualization framework for interactive analysis of execution traces
2016
Rémy Dautriche, Renaud Blanch, Alexandre Termier, Miguel Santana
National Conference IHM 2016, pp. 115-125.
Abstract: Les plateformes matérielles pour systèmes embarqués deviennent
plus puissantes à chaque nouvelle génération
grâce à l’intégration de système sur une puce (Systemon-Chip
ou SoC). Développer des applications pour la
lecture de contenu multimedia sur systèmes embarqués devient
une tâche de plus en plus complexe. Les applications
modernes sont massivement parallèles et doivent décoder
un flux multimédia en temps réel pour éviter l’apparition
d’artéfacts audio et video. Le débogage de ce type de problème
ne peut pas être fait avec les outils traditionnels qui
interromptent le décodage et perturbent la synchronisation
des différents fils d’exécution. Une solution consiste
à enregistrer tous les évènements apparus durant le dé-
codage dans une trace et à procéder à l’analyse a posteriori.
Il existe de multiples outils de visualisation pour
analyser de telles traces d’exécution. Cependant, leurs limites
sont atteintes lorsque de grosses quantités de données
générées telles que celles au cours de l’exécution
d’applications modernes doivent être analysées. Les outils
existants fournissent tantôt une vue trop haut niveau pour
être réellement utile, tantôt une vue trop détaillée rendant
l’exploration des données fastidieuse. Nous proposons une
nouvelle plateforme de visualisation interactive pour ré-
soudre ces problèmes. Notre contribution consiste en deux
volet : (a) nous présentons un nouveau système de stockage
pour trace d’exécution suffisamment rapide pour permettre
l’exploration interactive de traces volumineuses et (b) un
nouvel outil de visualisation pour explorer interactivement
les traces à différents niveaux de détails.
Interactive User Group Analysis
2015
Behrooz Omidvar Tehrani, Sihem Amer-Yahia, Alexandre Termier
Conference Paper CIKM 2015, pp. 403-412.
Abstract: User data is becoming increasingly available in multiple domains ranging from phone usage traces to data on the social Web. The analysis of user data is appealing to scientists who work on population studies, recommendations, and large-scale data analytics. We argue for the need for an interactive analysis to understand the multiple facets of user data and address different analytics scenarios. Since user data is often sparse and noisy, we propose to produce labeled groups that describe users with common properties and develop IUGA, an interactive framework based on group discovery primitives to explore the user space. At each step of IUGA, an analyst visualizes group members and may take an action on the group (add/remove members) and choose an operation (exploit/explore) to discover more groups and hence more users. Each discovery operation results in k most relevant and diverse groups. We formulate group exploitation and exploration as optimization problems and devise greedy algorithms to enable efficient group discovery. Finally, we design a principled validation methodology and run extensive experiments that validate the effectiveness of IUGA on large datasets for different user space analysis scenarios.
Data mining approach to temporal debugging of embedded streaming applications
2015
Oleg Iegorov, Vincent Leroy, Alexandre Termier, Jean-François Méhaut, Miguel Santana
Conference Paper EMSoft 2015, pp. 167-176.
Abstract: One of the greatest challenges in the embedded systems area is to empower software developers with tools that speed up the debugging of QoS properties in applications. Typical streaming applications, such as multimedia (audio/video) decoding, fulfill the QoS properties by respecting the realtime deadlines. A perfectly functional application, when missing these deadlines, may lead to cracks in the sound or perceptible artifacts in the image. We start from the premise that most of the streaming applications that run on embedded systems can be expressed under a dataflow model of computation, where the application is represented as a directed graph of the data flowing through computational units called actors. It has been shown that in order to meet real-time constraints the actors should be scheduled in a periodic manner. We exploit this property to propose SATM – a novel approach based on data mining techniques that automatically analyzes execution traces of streaming applications, and discovers significant breaks in the periodicity of actors, as well as potential causes of these breaks. We show on a real use case that our debugging approach can uncover important defects and pinpoint their location to the application developer.
Improved Local Search for Binary Matrix Factorization
2015
Seyed Hamid Mirisaee, Éric Gaussier, Alexandre Termier
Conference Paper AAAI 2015, pp. 1198-1204.
Abstract: Rank K Binary Matrix Factorization (BMF) approximates a binary matrix by the product of two binary matrices of lower rank, K, using either L1 or L2 norm. In this paper, we first show that the BMF with L2 norm can be reformulated as an Unconstrained Binary Quadratic Programming (UBQP) problem. We then review several local search strategies that can be used to improve the BMF solutions obtained by previously proposed methods, before introducing a new local search dedicated to the BMF problem. We show in particular that the proposed solution is in general faster than the previously proposed ones. We then assess its behavior on several collections and methods and show that it significantly improves methods targeting the L2 norms on all the datasets considered; for the L1 norm, the improvement is also significant for real, structured datasets and for the BMF problem without the binary reconstruction constraint.
Selecting representative instances from datasets
2015
Seyed Hamid Mirisaee, Ahlame Douzal, Alexandre Termier
Conference Paper DSAA 2015, pp. 1-10.
Abstract: We propose in this paper a new, alternative approach for the problem of finding a set of representative objects in large datasets. To do so, we first formulate the general Instance Selection Problem (ISP) and then study three variants of that in order to select instances from dierent regions of the data.These variants aim at finding the objects located in three very different locations of the data: the inner frontier, the central area and the outer frontier. Solutions to these problems have been discussed and their complexities have been studied.To illustrate the effectiveness of the proposed techniques, we first use a small, synthetic dataset for visualization purpose. We then study them on the Reuters dataset and show that the integration of instances selected by the ISP techniques is able to provide a good representation of the data and can be considered as a complementary approach for the state-of-the-art methods. Finally, we examine the quality of the selected objects by applying a topic-based analysis in order to show how well the selected documents cover the topics in the Reuters dataset.
Reducing trace size in multimedia applications endurance tests
2015
Serge Vladimir Emteu Tchagou, Alexandre Termier, Jean-François Méhaut, Brice Videau,Miguel Santana, René Quiniou
Conference paper DATE2015, pp. 984-985.
Nominated for best paper award, track A
Abstract: Proper testing of applications over embedded systems such as set-top boxes requires endurance tests, i.e. running applications for extended periods of times, typically several days. In order to understand bugs or poor performances, execution traces have to be analyzed, however current trace analysis methods are not designed to handle several days of execution traces due to the huge quantity of data generated. Our proposal, designed for regular applications such as multimedia decoding/encoding, is to monitor execution by analyzing trace on the fly in order to record trace only in time periods where a suspicious activity is detected. Our experiments show a significant reduction in the trace size compared to recording the whole trace.
ParaMiner: a generic pattern mining algorithm for multi-core architectures
2014
Benjamin Négrevergne, Alexandre Termier, Marie-Christine Rousset, Jean-François Méhaut
Journal paper Data Mining and Knowledge Discovery, 28(3), pp. 593-633, 2014.
Abstract: In this paper, we present Para Miner which is a generic and parallel algorithm for closed pattern mining. Para Miner is built on the principles of pattern enumeration in strongly accessible set systems. Its efficiency is due to a novel dataset reduction technique (that we call EL-reduction), combined with novel technique for performing dataset reduction in a parallel execution on a multi-core architecture. We illustrate Para Miner’s genericity by using this algorithm to solve three different pattern mining problems: the frequent itemset mining problem, the mining frequent connected relational graphs problem and the mining gradual itemsets problem. In this paper, we prove the soundness and the completeness of Para Miner. Furthermore, our experiments show that despite being a generic algorithm, Para Miner can compete with specialized state of the art algorithms designed for the pattern mining problems mentioned above. Besides, for the particular problem of gradual itemset mining, Para Miner outperforms the state of the art algorithm by two orders of magnitude.
Sofware
PGLCM: efficient parallel mining of closed frequent gradual itemsets
2015
Trong Dinh Thac Do, Alexandre Termier, Anne Laurent, Benjamin Negrevergne, Behrooz Omidvar-Tehrani, Sihem Amer-Yahia
Journal paper Knowledge and Information Systems, 43(3), pp 497-527, 2015.
Abstract: Numerical data (e.g., DNA micro-array data, sensor data) pose a challenging problem to existing frequent pattern mining methods which hardly handle them. In this framework, gradual patterns have been recently proposed to extract covariations of attributes, such as: “When X increases, Y decreases”. There exist some algorithms for mining frequent gradual patterns, but they cannot scale to real-world databases. We present in this paper GLCM, the first algorithm for mining closed frequent gradual patterns, which proposes strong complexity guarantees: the mining time is linear with the number of closed frequent gradual itemsets. Our experimental study shows that GLCM is two orders of magnitude faster than the state of the art, with a constant low memory usage. We also present PGLCM, a parallelization of GLCM capable of exploiting multicore processors, with good scale-up properties on complex datasets. These algorithms are the first algorithms capable of mining large real world datasets to discover gradual patterns.
Sofware
Itemset approximation using Constrained Binary Matrix Factorization
2014
Seyed Hamid Mirisaee, Éric Gaussier, Alexandre Termier
Conference Paper DSAA 2014, pp. 39-45
Abstract: We address in this paper the problem of efficiently finding a few number of representative frequent itemsets in transaction matrices. To do so, we propose to rely on matrix decomposition techniques, and more precisely on Constrained Binary Matrix Factorization (CBMF) which decomposes a given binary matrix into the product of two lower dimensional binary matrices, called factors. We first show, under binary constraints, that one can interpret the first factor as a transaction matrix operating on packets of items, whereas the second factor indicates which item belongs to which packet. We then formally prove that one can directly mine the CBMF factors in order to find (approximate) itemsets of a given size and support in the original transaction matrix. Then through a detailed experimental study, we show that the frequent itemsets produced by our method represent a significant portion of the set of all frequent itemsets according to existing metrics, while being up to several orders of magnitude less numerous.
Scalability bottlenecks discovery in MPSoC platforms using data mining on simulation traces
2014
Sofiane Lagraa, Alexandre Termier, Frédéric Pétrot
Conference Paper DATE 2014, pp. 1-6
Best paper award, track E
Abstract: Nowadays, a challenge faced by many developers is the profiling of parallel applications so that they can scale over more and more cores. This is especially critical for embedded systems powered by Multi-Processor System-on-Chip (MPSoC), where ever demanding applications have to run smoothly on numerous cores, each with modest power budget. The reasons for the lack of scalability of parallel applications are numerous, and it can be time consuming for a developer to pinpoint the correct one. In this paper, we propose a fully automatic method which detects the instructions of the code which lead to a lack of scalability. The method is based on data mining techniques exploiting low level execution traces produced by MPSoC simulators. Our experiments show the accuracy of the proposed technique on five different kinds of applications, and how the information reported can be exploited by application developers.
Benchmarking of triple stores scalability for MPSoC trace analysis
2014
Fopa Leon Constantin, Fabrice Jouanot, Alexandre Termier, Maurice Tchuente and Oleg Iegorov
Workshop VLDB workshop on benchmarking RDF systems (Bersys), 2014
Abstract: A Multi Processor System-on-Chip (MPSoC) is a complex embedded system used in consumer electronic devices, such as smartphones, tablets and set-top boxes. In order to cope with the complexity of MPSoC architectures, software developers rely on post-mortem trace analysis for application debugging or optimization. The traces are explored to localize expected and unexpected programs behaviors. However, the low semantic value of low-level trace events make the trace
exploration difficult. We propose to perform trace exploration through an ontology which adds semantics to events and provides a declarative language for querying data. Because traces can be huge, such an ontology contains a large number of instances stored as RDF triples. Because analysts need fast results on classical computer, an efficient system for query answering is preferred. Therefore, saturating, loading and querying those triples pose a scalability challenge to state-of-the-art knowledge base repositories (KBR). In this paper, we have conducted a benchmark of 7 KBRs: Jena, Sesame-native, Sesame-memory, tdb, sdb, rdf-3x and vertical-mdb, to test their scalability in a non-distributed environment close to analyst environment. We used these KBRs to analyze real traces through VIDECOM, an ontology we designed for trace analysis of applications on MPSoC. Results show that vertical-mdb has a loading rate 3 times faster than the others. It is the only KBR able to saturate the biggest trace of our dataset without exceeding system memory and to run complex queries on it in an acceptable time. Other approaches failed, due to memory limitation or inefficient join implementation.
Companion website with datasets and code
Data mining MPSoC simulation traces to identify concurrent memory access patterns
2013
Sofiane Lagraa, Alexandre Termier, Frédéric Pétrot
Conference Paper DATE 2013, pp. 755-760
Abstract: Due to a growing need for flexibility, massively parallel Multiprocessor SoC (MPSoC) architectures are currently being developed. This leads to the need for parallel software, but poses the problem of the efficient deployment of the software on these architectures. To address this problem, the execution of the parallel program with software traces enabled on the platform and the visualization of these traces to detect irregular timing behavior is the rule. This is error prone as it relies on software logs and human analysis, and requires an existing platform. To overcome these issues and automate the process, we propose the conjoint use of a virtual platform logging at hardware level the memory accesses and of a data-mining approach to automatically report unexpected instructions timings, and the context of occurrence of these instructions. We demonstrate the approach on a multiprocessor platform running a video decoding application.
Efficiently rewriting large multimedia application execution traces with few event sequences
2013
Christiane Kamdem Kengne, Leon Constantin Fopa, Alexandre Termier, Noha Ibrahim, Marie-Christine Rousset, Takashi Washio, Miguel Santana
Conference Paper KDD 2013, pp. 1348-1356
Abstract: The analysis of multimedia application traces can reveal important information to enhance program execution comprehension. However typical size of traces can be in gigabytes, which hinders their effective exploitation by application developers. In this paper, we study the problem of finding a set of sequences of events that allows a reduced-size rewriting of the original trace. These sequences of events, that we call blocks, can simplify the exploration of large execution traces by allowing application developers to see an abstraction instead of low-level events.
The problem of computing such set of blocks is NP-hard and naive approaches lead to prohibitive running times that prevent analysing real world traces. We propose a novel algorithm that directly mines the set of blocks. Our experiments show that our algorithm can analyse real traces of up to two hours of video. We also show experimentally the quality of the set of blocks proposed, and the interest of the rewriting to understand actual trace data.
Towards a Framework for Semantic Exploration of Frequent Patterns
2013
Behrooz Omidvar Tehrani, Sihem Amer-Yahia, Alexandre Termier, Aurélie Bertaux, Éric Gaussier, Marie-Christine Rousset
Workshop Workshop on Information Management for Mobile Applications (IMMoA) 2013, pp. 7-14
Abstract: Mining frequent patterns is an essential task in discovering
hidden correlations in datasets. Although frequent patterns
unveil valuable information, there are some challenges which
limits their usability. First, the number of possible patterns
is often very large which hinders their effective exploration.
Second, patterns with many items are hard to read and the
analyst may be unable to understand their meaning. In addition,
the only available information about patterns is their
support, a very coarse piece of information. In this paper,
we are particularly interested in mining datasets that reflect
usage patterns of users moving in space and time and for
whom demographics attributes are available (age, occupation,
etc). Such characteristics are typical of data collected
from smart phones, whose analysis has critical business applications
nowadays. We propose pattern exploration primitives,
abstraction and refinement, that use hand-crafted taxonomies
on time, space and user demographics. We show on
two real datasets, Nokia and MovieLens, how the use of
such taxonomies reduces the size of the pattern space and
how demographics enable their semantic exploration. This
work opens new perspectives in the semantic exploration of
frequent patterns that reflect the behavior of different user
communities.
Debugging embedded multimedia application traces through periodic pattern mining
2012
Patricia López Cueva, Aurélie Bertaux, Alexandre Termier, Jean-François Méhaut, Miguel Santana
Conference Paper EMSOFT'2012, pp. 13-22
Abstract: Increasing complexity in both the software and the underlying hardware, and ever tighter time-to-market pressures are some of the key challenges faced when designing multimedia embedded systems. Optimizing the debugging phase can help to reduce development time significantly. A powerful approach used extensively during this phase is the analysis of execution traces. However, huge trace volumes make manual trace analysis unmanageable. In such situations, Data Mining can help by automatically discovering interesting patterns in large amounts of data. In this paper, we are interested in discovering periodic behaviors in multimedia applications. Therefore, we propose a new pattern mining approach for automatically discovering all periodic patterns occurring in a multimedia application execution trace.
Furthermore, gaps in the periodicity are of special interest since they can correspond to cracks or drop-outs in the stream. Existing periodic pattern definitions are too restrictive regarding the size of the gaps in the periodicity. So, in this paper, we specify a new definition of frequent periodic patterns that removes this limitation. Moreover, in order to simplify the analysis of the set of frequent periodic patterns we propose two complementary approaches: (a) a lossless representation that reduces the size of the set and facilitates its analysis, and (b) a tool to identify pairs of "competitors" where a pattern breaks the periodicity of another pattern. Several experiments were carried out on embedded video and audio decoding application traces, demonstrating that using these new patterns it is possible to identify abnormal behaviors.
Enhancing the Analysis of Large Multimedia Applications Execution Traces with FrameMiner
2012
Christiane Kamdem Kengne, Leon Constantin Fopa, Noha Ibrahim, Alexandre Termier, Marie-Christine Rousset, Takashi Washio
Workshop ICDM workshop on Practical Theories for Exploratory Data Mining (PTDM), 2012, pp. 595-602
Abstract: The analysis of multimedia application traces can reveal important information to enhance program comprehension. However traces can be very large, which hinders their effective exploitation. In this paper, we study the problem of finding a k-golden set of blocks that best characterize data. Sequential pattern mining can help to automatically discover the blocks, and we called k-golden set, a set of k blocks that maximally covers the trace. These kind of blocks can simplify the exploration of large traces by allowing programmers to see an abstraction instead of low-level events. We propose an approach for mining golden blocks and finding coverage of frames. The experiments carried out on video and audio application decoding show very promising results.
Automatic congestion detection in MPSoC programs using data mining on simulation traces
2012
Sofiane Lagraa, Alexandre Termier, Frédéric Pétrot:
Conference paper RSP 2012, pp. 64-70
Abstract: The efficient deployment of parallel software, specifically legacy one, on Multiprocessor systems on chip (MPSoC) is a challenging task. In this paper, we introduce the use of a data-mining approach on traces of a functionally correct program to automatically identify recurring congestion points and their sources. Each memory transaction, i.e. instruction fetch, data load and data store, occurring in the system is logged, thanks to the use of a virtual platform of the system. The resulting trace is analyzed to discover memory access patterns that are occurring frequently and that feature high latencies. These patterns are sorted by order of decreasing occurrence and estimated congestion level, allowing the easy identification of the sources of inefficiency. We have simulated a MPSoC with 16 processors running multiple applications, and have been able to automatically detect congestion on resources and their sources in the parallel program using this technique by analyzing gigabytes of traces.
Discovery of Probabilistic Mappings between Taxonomies: Principles and Experiments
2011
Rémi Tournaire, Jean-Marc Petit, Marie-Christine Rousset, Alexandre Termier
Journal paper Journal of Data Semantics, 15, pp. 66-101, 2011.
Abstract: In this paper, we investigate a principled approach for defining and discovering probabilistic mappings between two taxonomies. First, we compare two ways of modeling probabilistic mappings which are compatible with the logical constraints declared in each taxonomy. Then we describe a generate and test algorithm which minimizes the number of calls to the probability estimator for determining those mappings whose probability exceeds a certain threshold. Finally, we provide an experimental analysis of this approach.
PGLCM: Efficient Parallel Mining of Closed Frequent Gradual Itemsets
2010
Trong Dinh Thac Do, Anne Laurent, Alexandre Termier
Conference paper ICDM 2010, pp. 138-147
Abstract: Numerical data (e.g., DNA micro-array data, sensor data) pose a challenging problem to existing frequent pattern mining methods which hardly handle them. In this framework, gradual patterns have been recently proposed to extract covariations of attributes, such as: "When X increases, Y decreases". There exist some algorithms for mining frequent gradual patterns, but they cannot scale to real-world databases. We present in this paper GLCM, the first algorithm for mining closed frequent gradual patterns, which proposes strong complexity guarantees: the mining time is linear with the number of closed frequent gradual item sets. Our experimental study shows that GLCM is two orders of magnitude faster than the state of the art, with a constant low memory usage. We also present PGLCM, a parallelization of GLCM capable of exploiting multicore processors, with good scale-up properties on complex datasets. These algorithms are the first algorithms capable of mining large real world datasets to discover gradual patterns.
Software
PGP-mc: Towards a Multicore Parallel Approach for Mining Gradual Patterns
2010
Anne Laurent, Benjamin Négrevergne, Nicolas Sicard, Alexandre Termier
Conference paper DASFAA 2010, pp. 78-84
Abstract: Gradual patterns highlight complex order correlations of the form “The more/less X, the more/less Y”. Only recently algorithms have appeared to mine efficiently gradual rules. However, due to the complexity of mining gradual rules, these algorithms cannot yet scale on huge real world datasets. In this paper, we propose to exploit parallelism in order to enhance the performances of the fastest existing one (GRITE). Through a detailed experimental study, we show that our parallel algorithm scales very well with the number of cores available.
Discovering closed frequent itemsets on multicore: Parallelizing computations and optimizing memory accesses
2010
Benjamin Négrevergne, Alexandre Termier, Jean-François Méhaut, Takeaki Uno
Conference paper HPCS 2010, pp. 521-528
Abstract: The problem of closed frequent itemset discovery is a fundamental problem of data mining, having applications in numerous domains. It is thus very important to have efficient parallel algorithms to solve this problem, capable of efficiently harnessing the power of multicore processors that exists in our computers (notebooks as well as desktops). In this paper we present PLCMQS, a parallel algorithm based on the LCM algorithm, recognized as the most efficient algorithm for sequential discovery of closed frequent itemsets. We also present a simple yet powerfull parallelism interface based on the concept of Tuple Space, which allows an efficient dynamic sharing of the work. Thanks to a detailed experimental study, we show that PLCMQS is efficient on both on sparse and dense databases.
Combining Logic and Probabilities for Discovering Mappings between Taxonomies
2010
Rémi Tournaire, Jean-Marc Petit, Marie-Christine Rousset, Alexandre Termier
Conference paper KSEM'2010, pp. 530-542
Abstract: In this paper, we investigate a principled approach for defining and discovering probabilistic mappings between two taxonomies. First, we compare two ways of modeling probabilistic mappings which are compatible with the logical constraints declared in each taxonomy. Then we describe a generate and test algorithm which minimizes the number of calls to the probability estimator for determining those mappings whose probability exceeds a certain threshold. Finally, we provide an experimental analysis of this approach.
Efficient Parallel Mining of Gradual Patterns on Multicore Processors
2010
Anne Laurent, Benjamin Négrevergne, Nicolas Sicard, Alexandre Termier
National conference EGC (best of volume) 2010, pp. 137-151
Abstract: Mining gradual patterns plays a crucial role in many real world applications where huge volumes of complex numerical data must be handled, e.g., biological databases, survey databases, data streams or sensor readings. Gradual patterns highlight complex order correlations of the form “The more/less X, the more/less Y”. Only recently algorithms have appeared to mine efficiently gradual rules. However, due to the complexity of mining gradual rules, these algorithms cannot yet scale on huge real world datasets. In this paper, we thus propose to exploit parallelism in order to enhance the performances of the fastest existing one (GRITE) on multicore processors. Through a detailed experimental study, we show that our parallel algorithm scales very well with the number of cores available.
PGP-mc : extraction parallèle efficace de motifs graduels
2010
Anne Laurent, Benjamin Négrevergne, Nicolas Sicard, Alexandre Termier
National conference EGC 2010, pp. 453-464
(in french)
Abstract: Initialement utilisés pour les systèmes de commande, les règles et motifs
graduels (de la forme “plus une personne est âgée, plus son salaire est élevé”)
trouvent de très nombreuses applications, par exemple dans les domaines
de la biologie, des données en flots (e.g. issues de réseaux de capteurs), etc. Très
récemment, des algorithmes ont été proposés pour extraire automatiquement
de tels motifs. Cependant, même si certains d’entre eux ont permis des gains
de performance importants, les algorithmes restent coûteux et ne permettent
pas de traiter efficacement les bases de données réelles souvent très volumineuses
(en nombre de lignes et/ou nombre d’attributs). Nous proposons donc
dans cet article une méthode originale de recherche de ces motifs utilisant le
multi-threading pour exploiter au mieux les multiples coeurs présents dans la
plupart des ordinateurs et serveurs actuels. L’efficacité de cette approche est validée
par une étude expérimentale.
Découverte d'itemsets fréquents fermés sur architecture multicoeurs
2010
Benjamin Négrevergne, Jean-François Méhaut, Alexandre Termier, Takeaki Uno
National conference EGC 2010, pp. 465-470
(in french)
Abstract: Dans ce papier nous proposons PLCM, un algorithme parallèle de
découverte d’itemsets fréquents fermés basé sur l’algorithme LCM, reconnu
comme l’algorithme séquentiel le plus efficace pour cette tâche. Nous présentons
aussi une interface de parallélisme à la fois simple et puissante basée sur la
notion de Tuple Space, qui permet d’avoir une bonne répartition dynamique du
travail.
Grâce à une étude expérimentale détaillée, nous montrons que PLCM est le seul
algorithme qui soit suffisamment générique pour calculer efficacement des itemsets
fréquents fermés à la fois sur des bases creuses et sur des bases denses,
améliorant ainsi l’état de l’art.
DryadeParent, An Efficient and Robust Closed Attribute Tree Mining Algorithm
2008
Alexandre Termier, Marie-Christine Rousset, Michèle Sebag, Kouzou Ohara, Takashi Washio, Hiroshi Motoda
Journal paper IEEE Transactions on Knowledge and Data Engineering, 20(3), pp. 300-320, 2008.
Abstract: In this paper, we present a new tree mining algorithm, DryadeParent, based on the hooking principle first introduced in DRYADE. In the experiments, we demonstrate that the branching factor and depth of the frequent patterns to find are key factors of complexity for tree mining algorithms, even if often overlooked in previous work. We show that DryadeParent outperforms the current fastest algorithm, CMTreeMiner, by orders of magnitude on data sets where the frequent tree patterns have a high branching factor.
DIGDAG, a First Algorithm to Mine Closed Frequent Embedded Sub-DAGs
2007
Alexandre Termier, Yoshinori Tamada, Kazuyuki Numata, Seiya Imoto, Takashi Washio, Tomoyuki Higuchi
Workshop Mining and Learning with Graphs (MLG) 2007
Abstract: Although tree and graph mining have attracted a lot of
attention, there are nearly no algorithms devoted to DAGmining,
whereas many applications are in dire need of such
algorithms. We present in this paper DIGDAG, the first algorithm
capable of mining closed frequent embedded subDAGs.
This algorithm combines efficient closed frequent
itemset algorithms with novel techniques in order to scale
up to complex input data.
Efficient Mining of High Branching Factor Attribute Trees
2005
Alexandre Termier, Marie-Christine Rousset, Michèle Sebag, Kouzou Ohara, Takashi Washio, Hiroshi Motoda
Conference paper ICDM 2005, pp. 785-788
Abstract: In this paper, we present a new tree mining algorithm, DryadeParent, based on the hooking principle first introduced in Dryade (Termier et al, 2004). In the experiments, we demonstrate that the branching factor and depth of the frequent patterns to find are key factor of complexity for tree mining algorithms. We show that DryadeParent outperforms the current fastest algorithm, CMTreeMiner, by orders of magnitude on datasets where the frequent patterns have a high branching factor.
DRYADE: A New Approach for Discovering Closed Frequent Trees in Heterogeneous Tree Databases
2004
Alexandre Termier, Marie-Christine Rousset, Michèle Sebag
Conference paper ICDM 2004, pp. 543-546
Abstract: In this paper we present a novel algorithm for discovering tree patterns in a tree database. This algorithm uses a relaxed tree inclusion definition, making the problem more complex (checking tree inclusion is NP-complete), but allowing to mine highly heterogeneous databases. To obtain good performances, our DRYADE algorithm, discovers only closed frequent tree patterns.
Highlighting Latent Structure in Documents
2004
Helka Folch, Benoit Habert, Michèle Jardino, Nathalie Pernelle, Marie-Christine Rousset, Alexandre Termier
Conference paper LREC 2004, pp. 1331-1334
Abstract: Extensible Markup Language (XML) is playing an increasingly important role in the exchange of a wide variety of data on the Web
and elsewhere. It is a simple, very flexible text format, used to annotate data by means of markup. XML documents can be checked for
syntactic well-formedness and semantic coherence through DTD and schema validation which makes their processing easier. In
particular, data with nested structure can be easily represented with embedded tags. This structured representation should be used in
information retrieval models which take structure into account. As such, it is meta-data and therefore a contribution to the Semantic
Web. However, nowadays, there exists huge quantities of raw texts and the issue is how to find an easy way to provide these texts with
sensible XML structure.
Here we present an automatic method to extract tree structure from raw texts. This work has been supported by the Paris XI
University (BQR2002 project, Paris-XI University).
TreeFinder: a First Step towards XML Data Mining
2002
Alexandre Termier, Marie-Christine Rousset, Michèle Sebag
Conference paper ICDM 2002, pp. 450-457
Abstract: In this paper we consider the problem of searching frequent trees from a collection of tree-structured data modeling XML data. The TreeFinder algorithm aims at finding trees, such that their exact or perturbed copies are frequent in a collection of labelled trees. To cope with complexity issues, TreeFinder is correct but not complete: it finds a subset of actually frequent trees. The default of completeness is experimentally investigated on artificial medium size datasets; it is shown that TreeFinder reaches completeness or falls short for a range of experimental settings.
Raising the Dead: Extending Evolutionary Algorithms with a Case-Based Memory
2001
Jeroen Eggermont, Tom Lenaerts, Sanna Poyhonen, Alexandre Termier
Conference paper EuroGP 2001, pp. 280-290
Abstract: In dynamically changing environments, the performance of a standard evolutionary algorithm deteriorates. This is due to the fact that the population, which is considered to contain the history of the evolutionary process, does not contain enough information to allow the algorithm to react adequately to changes in the fitness landscape. Therefore, we added a simple, global case-based memory to the process to keep track of interesting historical events. Through the introduction of this memory and a storing and replacement scheme we were able to improve the reaction capabilities of an evolutionary algorithm with a periodically changing fitness function.
Combining Statistics and Semantics for Word and Document Clustering
2001
Alexandre Termier, Michèle Sebag, Marie-Christine Rousset
Workshop IJCAI workshop on Ontology Learning 2001
Abstract: A new approach for constructing pseudo-keywords,
referred to as Sense Units, is proposed. Sense
Units are obtained by a word clustering process,
where the underlying similarity reflects both statistical
and semantic properties, respectively detected
through Latent Semantic Analysis and WordNet.
Sense Units are used to recode documents and are
evaluated from the performance increase they permit
in classification tasks.
Experimental results show that accounting for semantic
information in fact decreases the performances
compared to LSI standalone.
The main weakenesses of the current hybrid
scheme are discussed and several tracks for improvement
are sketched.
I teach at the Computer Science department of University Rennes 1 (
Last updated: 2015/04/01. In-lab picture