{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# this notebook requires the specific packages. Run this cell to automatically install them.\n", "!pip install pandas" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# TD / TP 2" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## First Exercice: Numerical interval patterns\n", "\n", "Itemset mining handles datasets with only categorical variables (boolean dataset). However, some dataset can contain numerical variables. The objective of this first part is to define a new domain of patterns for numerical attributes.\n", "\n", "Let's first illustrate the type of dataset we consider in this exercice. The table below illustrates a dataset that has three numerical attributes $m_1$, $m_2$ and $m_3$. The dataset contains 6 examples. There is no missing values. Without loss of generality we assume that all attributes are real valued.\n", "\n", "
\n", "\n", "| Id | $m_1$ | $m_2$ | $m_3$ | \n", "|----|---|---|---|\n", "| 1 | 0.7 | 0.5 | 10.5 |\n", "| 2 | 0.2 | 3.5 | 20.3 |\n", "| 3 | 0.4 | 0.6 | 14.7 |\n", "| 4 | 0.2 | 2.5 | 15.89 |\n", "| 5 | 0.6 | 6.5 | 23.5 |\n", "| 6 | 0.5 | 7.4 | 10.4 |\n", "\n", "
\n", "\n", "An *interval pattern* is a multidimensional closed interval (same number of dimension as the number of attributes). An interval pattern matches an example of the dataset is each value of the example belongs to the closed interval (at the corresponding dimension). \n", "For instance, $p=([0.1,0.3],[1.0,4.0],[10.0,30.0])$ is an *interval pattern* that matches the examples 2 and 5.\n", "\n", "A *minimal interval pattern* is an interval pattern whose intervals are the smallest to cover a given set of transactions. For instance, $p=([0.1,0.3],[1.0,4.0],[10.0,30.0])$ covers the transactions $\\{2,4\\}$ but is not minimal. $p=([0.2,0.2],[2.5,3.5],[15.89,20.3])$ is a minimal interval pattern that covers the same transactions. \n", "\n", "**Q1.** List all the minimal interval patterns that matches at least $4$ examples of the dataset.\n", "
\n", "\n", "Given a set of attributes $M = \\{m_i\\}_{i\\in[|M|]}$, a dataset is a collection of examples $(\\bm{x}^i)_i$ where $\\bm{x}^i=(x^i_1,\\dots, x^i_{|M|}) \\in \\mathbb{R}^{|M|}$ is a tuple. An interval pattern of dimension $M$ is denoted $p=([l_i,u_i])_{i\\in[|M|]}$ where $l_i, u_i \\in \\bar{\\mathbb{R}}$.\n", "\n", "**Q2.** Define the relation $\\sqsubseteq$ of containment of a pattern $p$ in an example $\\bm{x}$.\n", "\n", "**Q3.** Define the notion of support for the domain of interval patterns of dimension $M$\n", "\n", "**Q4.** Proposed an definition of a partial order $\\triangleleft$ on the set of the interval patterns of dimension $M$ such that the support measure is anti-monotonic w.r.t. this order.\n", "\n", "\n", "
\n", "We now define a closure operator as follows:\n", "\n", "$cl(p) = \\left( \\left[ min(\\{ x_i \\mid \\bm{x}\\in \\mathcal{D} \\wedge p\\sqsubseteq\\bm{x}\\}), max(\\{ x_i \\mid \\bm{x}\\in \\mathcal{D} \\wedge p\\sqsubseteq\\bm{x}\\}) \\right] \\right)_{i\\in[|M|]}$\n", "\n", "**Q5.** What is $cl(p)$ when $p=([0.1,0.3],[1.0,4.0],[10.0,30.0])$? Comment?\n", "\n", " \n", "By construction, $cl(p)$ is the *smaller* interval patterns that matches the same examples as $p$. The pattern such that $p=cl(p)$ are called *closed patterns*.\n", "\n", "**Q6.** Illustrate a Hasse diagram of the closed interval patterns that match some examples in the dataset above (do at most two levels starting from the degenerated pattern $([-\\infty,+\\infty],[-\\infty,+\\infty],[-\\infty,+\\infty])$ ). \n", "Why do not draw the Hasse diagram of the whole interval patterns?\n", "\n", "\n", "**Q7.** We want to implement a Depth-First Search algorithm that extracts frequent closed interval patterns from a dataset (frequent = support above a threshold $\\sigma$). Propose the function that evaluates the support and extends a pattern $p$ to explore further the search space.\n", "" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Implementation of frequent interval pattern extraction\n", "We now implement the algorithm. \n", "For sake of simplicity, you can implement a version of the algorithm specific to a dataset with 3 dimensions ... but it is not so complicated to generalize it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset=[ (0.7, 0.5, 10.5), \n", " (0.2, 3.5, 20.3),\n", " (0.4, 0.6, 14.7),\n", " (0.2, 2.5, 15.89),\n", " (0.6, 6.5, 23.5),\n", " (0.5, 7.4, 10.4) ]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def matches(x, p):\n", " \"\"\"\n", " x: example in dimension 1\n", " p: interval pattern\n", " \"\"\"\n", " \n", " return True" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#example of interval pattern in dimension 3\n", "# -> this is just a proposal for interval-pattern representation: if can change it, if you want\n", "p=([0.1,0.3],[1.0,4.0],[10.0,30.0])\n", "\n", "print(matches(dataset[2], p))\n", "print(matches(dataset[1], p))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def support(dataset,p):\n", " \"\"\"Compute the support of pattern p in the dataset\"\"\"\n", " s=0\n", " \n", " return s\n", "\n", "# test\n", "support(dataset,p)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#TODO ... depth first search algorithm" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "**Q8.** Assuming you implement a naive approach of the algorithm, what is the limit/default of the proposed algorithm?\n", "\n", "" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Improved version\n", "\n", "In this improved version, we propose to tranverse the search space without redundancy. \n", "The problem we had in the former version is that there is not a unique path for reach a pattern of the search space.\n", "\n", "The solution we propose to avoid exploring twice the same interval pattern is is known as *lectic order*, that is to understand as an order on the possible actions (or \"generating the rightmost pattern\"). It is based on the definition of total order on the extentions alternatives. In this case, we assume that there is an order on the attributes $m_1<_a m_2<_a m_3$, and that the left change is *lower* than the right change.\n", "Then, \n", "\n", "In practice, this means that:\n", "1. after applying a change to the right boundary of an interval, then one cannot apply a change to the left boundary\n", "2. after applying a change to attribute $m$, one cannot apply a change to attribute $m'<_a m$.\n", "\n", "This rules induce a new partial order on the set of interval patterns. We denote $\\blacktriangleleft$ this order (note that it is a bit technical to formalize).\n", "\n", "\n", "**Q9.** Considering interval patterns in dimension 1 and the set of interval patterns $\\mathcal{P}=\\{[4,4], [4,5], [5,4], [4,6], [5,5]\\}$. Draw the Hasse diagram of $(\\mathcal{P},\\triangleleft)$ to represent the search paths used by the previous algorithm. Then, draw the Hasse diagram of $(\\mathcal{P},\\blacktriangleleft)$ and comment.\n", "\n", "**Q10.** Are all patterns accessible with it? (we do not ask for a formal proof)\n", "\n", "" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "**Q11.** (**) Propose an implementation with the new search space strategy and compare the number of output patterns \n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "This exercice has been inspired by the work of M. Kaytoue et al:\n", "\n", "> Mehdi Kaytoue, Sergei O. Kuznetsov, and Amedeo Napoli. Revisiting numerical\n", "> pattern mining with formal concept analysis. In Proceedings of International Join\n", "> Conference on Artificial Intelligence (IJCAI), pages 1342–1347, 2011." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Second Exercice: SetChronicle mining\n", "\n", "This section introduces a new type of patterns we named *SetChronicles* that is dedicated to temporal sequences. The first objective of this algorithm is to design a new frequent pattern mining algorithm to extract SetChronicles from a collection of temporal sequences. The second objective is to apply this algorithm on the \n", "\n", "Let's start with some definitions. Let us first define a finite alphabet $(\\Sigma,<)$ equipped with a total order $<$.\n", "\n", "A **temporal sequence** of length $n$ over $\\Sigma$ is a finite sequence $\\rho = \\langle\n", "(\\sigma_1,\\tau_1), \\dots, (\\sigma_{n},\\tau_{n})\\rangle$ in $(\\Sigma\\times \\mathbb{R}_{\\geq 0})^{n}$ where for all \$1\\leq i Christophe Dousson and Thang Vu Duong, Discovering Chronicles with Numerical \n", "> Time Constraints from Alarm Logs for Monitoring Dynamic Systems, In Proceedings \n", "> of International Join Conference on Artificial Intelligence (IJCAI), \n", "> pages 620-626, 1999." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a" } } }, "nbformat": 4, "nbformat_minor": 2 }