Negative Sequential Patterns

This web page is a compagnon page for the article "NegPSpan: efficient extraction of negative sequential patterns with embedding constraints" published in the Data Mining Journal (DMKD). The web site contains:

source code
Extended version of the article including proposition proofs
synthetic datasets
evaluate our implementation on line

Negative Sequential Patterns

Mining frequent sequential patterns consists in extracting recurrence behaviors, modeled as patterns, in a large dataset of sequences. Such pattern informs about was occurs in a sequence, ie what does happen. With frequent negative sequential patterns (NSP), behaviors are described by both events that happens and some of them that are absent. Few approaches have been proposed to mine such kind of behaviors but does not necessarily clearly state that extracted NSP sets differ. This article defines some syntaxes and semantics of negative patterns to position the proposed approach NegPSpan wrt to the state of the art approach eNSP. The NegPSpan approach extracts NSP using a PrefixSpan depth-first scheme and enabling maxgap constraints while eNSP can not (due to its semantics). Our experiments on synthetic and real datasets show that NegPSpan extracts meaningful NSP and it processes larger datasets thanks to significantly lower memory requirements and better computing times than eNSP.

NegPSpan is a complete algorithm and it uses an anti-monotonicity property of the support measure. It is usually stated that such properties does not hold for negative sequential patterns, but we identifies some semantics and partial orders for which it is possible to have such property that enables to reuse decades of research on frequent pattern mining, including the notion of closed-patterns that could be redefined for negative sequentials patterns.

This investigation of the semantics of negative sequential patterns raised interesting questions about the meaning of the negation in a sequential pattern and how it is intuitivelly understood. We are interested in understanding your understanding of the semantic of negative sequential patterns. You can fill the survey (5 questions, no need to be an expert in pattern mining):

Survey

The articles we pusblished on negative sequential patterns are the following (with video presentations of the articles):

T. Guyet and R. Quiniou. "NegPSpan: efficient extraction of negative sequential pattern with embedding constraints". Data Min. Knowl. Discov., Vol. 34(2) : p. 563–609. 2020. A preliminary version of this article can be downloaded on ArXiV
T. Guyet and P. Besnard. "Semantics of negative sequential patterns". European Conference on Artificial Intelligence : p. 1–11. Santiago de Compostela, Spain. 2020.
T. Guyet and P. Besnard. "Declarative mining of negative sequential patterns". Proceedings of DPSW@ECAI : p. 1–11. Santiago de Compostela, Spain. 2020.
- More details about pattern mining with Answer Set Programming (ASP) can be found here.
- ASP code are available here

Source code

The archive below contains source code in C++ of algorithms that are presented and evaluated in the article: 1) implementation of our algorithm NegPSpan et 2) implementation of eNSP of Dong et al. This source code requires a recent C++ compiler (compatible with C++11 norms). All details about compilation and execution of the code is given in the README file (we provide a Makefile).

sources.zip

Synthetic datasets

Our experiments use synthetic datasets for which, we can control the main parameters. In the following, we provide the synthetiser itself but also the datasets that have been generated to draw curves of the article. All datasets are in the IBM format (one line provide the description of an itemset detailing, in that order, the sequence id, the itemset timestamp, the size of the itemset and the list of items (integers) in the itemset.

sequences generator: GitLab open-access project (developped in Python3 language, it generates different kind of datasets, see details in the readme of the project)
dataset 1 (used to evaluate evaluating number of patterns and computation time wrt maxgap constraints) : dataset1.zip
dataset 2 (used to evaluate evaluating computation time/memory consumption wrt wrt minimal frequency threshold) : dataset2.zip
article examples (may be used to check examples of the article): Article_Examples.zip
dataset Instacart (simplified, IBM format): sequences_instacart.seq

The script to reproduce the experiments of the DMKD article on synthetic data (and the Figures) are here benchmark.zip.

Real-life datasets

Classical real-life datasets for sequential pattern mining tasks can be found here. We simply changed their format to fit our software requirements.

Due to their intrusive nature, care pathway data we used for real experiments can not be provided.

Test our code on-line (experimental)

This section is an experimental feature using the AllGo platform (based on Docker and REST technologies) to provide an access to a functional executable online, without any installation requirement. This platform is maintained by SED/Inria/IRISA. Note that we do not track the executions (eg their IP origin and so on) neither collect datasets you are evaluating with this tool.

This is an experimental tool: you may have trouble if several people use it at the same time. I f s/he is interested on evaluating our algorithm, your are strongly invited to download the source code on your own computer.

The purpose of this section is to enable readers to run example of our papers and even to test their own new example. The focus is not computation times. We are interested in enabling readers to better understand the semantic of negative patterns (depending on semantic choices and algorithms).

Usage details are given at the bottom of the page.

Dataset representation

Results

Usage instructions

To make run the algorithm, the main steps are the following:

(optional) select an example dataset. This step is optional but illustrate the format of the data and enable to make quick experiments. The proposed datasets are those that have been used to illustrate the semantic differences in the article. Three datasets are avaiblable:
- embeddings: use minimal frequency X, and ...
- multipleinstances: use minimal frequency X, and ..
- noninclusion: use minimal frequency X, and ..
create/complete/modify your dataset (in the IBM Format). You can modify the dataset as you want and thus test our algorithm NegPSpan (or our implementation of eNSP) with some particular cases.
set up the algorithm defining its parameters
- minimal frequency threshold
- to avoid combinatorial explosion (for fair use of the server), the maximum length of extracted patterns is 4
run the algorithm: the default waiting time is set up to 10 seconds (be patient and DO NOT CLICK TWICE). This means that the webpage will be automatically updated after 10s. If results are available, they are printed. Otherwise, you will have to click on a button to check whether it is available or not. This waiting time is just here to avoid you click to much on the button to request results!
patterns are outputed in a (expected interpretable) textual format