{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# this notebook requires the specific packages. Run this cell to automatically install them.\n",
    "!pip install pandas"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# TD / TP 2"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## First Exercice: Numerical interval patterns\n",
    "\n",
    "Itemset mining handles datasets with only categorical variables (boolean dataset). However, some dataset can contain numerical variables. The objective of this first part is to define a new domain of patterns for numerical attributes.\n",
    "\n",
    "Let's first illustrate the type of dataset we consider in this exercice. The table below illustrates a dataset that has three numerical attributes $m_1$, $m_2$ and $m_3$. The dataset contains 6 examples. There is no missing values. Without loss of generality we assume that all attributes are real valued.\n",
    "\n",
    "<center>\n",
    "\n",
    "| Id  | $m_1$  |  $m_2$  | $m_3$  | \n",
    "|----|---|---|---|\n",
    "| 1  |  0.7 | 0.5 | 10.5 |\n",
    "| 2  |   0.2 | 3.5 | 20.3 |\n",
    "| 3 |   0.4 | 0.6 | 14.7 |\n",
    "| 4 |   0.2 | 2.5 | 15.89 |\n",
    "| 5 |    0.6 | 6.5 | 23.5 |\n",
    "| 6 |   0.5 | 7.4 | 10.4 |\n",
    "\n",
    "</center>\n",
    "\n",
    "An *interval pattern* is a multidimensional closed interval (same number of dimension as the number of attributes). An interval pattern matches an example of the dataset is each value of the example belongs to the closed interval (at the corresponding dimension). \n",
    "For instance, $p=([0.1,0.3],[1.0,4.0],[10.0,30.0])$ is an *interval pattern* that matches the examples 2 and 5.\n",
    "\n",
    "A *minimal interval pattern* is an interval pattern whose intervals are the smallest to cover a given set of transactions. For instance, $p=([0.1,0.3],[1.0,4.0],[10.0,30.0])$ covers the transactions $\\{2,4\\}$ but is not minimal. $p=([0.2,0.2],[2.5,3.5],[15.89,20.3])$ is a minimal interval pattern that covers the same transactions. \n",
    "\n",
    "**Q1.** List all the minimal interval patterns that matches at least $4$ examples of the dataset.\n",
    "</br>\n",
    "\n",
    "Given a set of attributes $M = \\{m_i\\}_{i\\in[|M|]}$, a dataset is a collection of examples $(\\bm{x}^i)_i$ where $\\bm{x}^i=(x^i_1,\\dots, x^i_{|M|}) \\in \\mathbb{R}^{|M|}$ is a tuple. An interval pattern of dimension $M$ is denoted $p=([l_i,u_i])_{i\\in[|M|]}$ where $l_i, u_i \\in \\bar{\\mathbb{R}}$.\n",
    "\n",
    "**Q2.** Define the relation $\\sqsubseteq$ of containment of a pattern $p$ in an example $\\bm{x}$.\n",
    "\n",
    "**Q3.** Define the notion of support for the domain of interval patterns of dimension $M$\n",
    "\n",
    "**Q4.** Proposed an definition of a partial order $\\triangleleft$ on the set of the interval patterns of dimension $M$ such that the support measure is anti-monotonic w.r.t. this order.\n",
    "<!-- I assume that the student will define the relation based on the subset relation dimension by dimension-->\n",
    "\n",
    "</br>\n",
    "We now define a closure operator as follows:\n",
    "\n",
    "$cl(p) = \\left( \\left[ min(\\{ x_i \\mid \\bm{x}\\in \\mathcal{D} \\wedge p\\sqsubseteq\\bm{x}\\}), max(\\{ x_i \\mid \\bm{x}\\in \\mathcal{D} \\wedge p\\sqsubseteq\\bm{x}\\}) \\right] \\right)_{i\\in[|M|]} $\n",
    "\n",
    "**Q5.** What is $cl(p)$ when $p=([0.1,0.3],[1.0,4.0],[10.0,30.0])$? Comment?\n",
    "\n",
    "<!-- $cl(p)=([0.2,0.2],[2.5,3.5],[15.89,20.3])$ It is made of values in the database --> \n",
    "By construction, $cl(p)$ is the *smaller* interval patterns that matches the same examples as $p$. The pattern such that $p=cl(p)$ are called *closed patterns*.\n",
    "\n",
    "**Q6.** Illustrate a Hasse diagram of the closed interval patterns that match some examples in the dataset above (do at most two levels starting from the degenerated pattern $([-\\infty,+\\infty],[-\\infty,+\\infty],[-\\infty,+\\infty])$ ). \n",
    "Why do not draw the Hasse diagram of the whole interval patterns?\n",
    "<!-- if we use the whole set of interval patterns, there is an infinite numbers of parents because attributes are real-valued.-->\n",
    "\n",
    "**Q7.** We want to implement a Depth-First Search algorithm that extracts frequent closed interval patterns from a dataset (frequent = support above a threshold $\\sigma$). Propose the function that evaluates the support and extends a pattern $p$ to explore further the search space.\n",
    "<!-- The trick is not to work directly with patterns, but the compute a sorted list of modalities and then to have intervals that are are pair of pointers on the elements of the modalities-->"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Implementation of frequent interval pattern extraction\n",
    "We now implement the algorithm. \n",
    "For sake of simplicity, you can implement a version of the algorithm specific to a dataset with 3 dimensions ... but it is not so complicated to generalize it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset=[   (0.7, 0.5, 10.5), \n",
    "            (0.2, 3.5, 20.3),\n",
    "            (0.4, 0.6, 14.7),\n",
    "            (0.2, 2.5, 15.89),\n",
    "            (0.6, 6.5, 23.5),\n",
    "            (0.5, 7.4, 10.4) ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def matches(x, p):\n",
    "    \"\"\"\n",
    "    x: example in dimension 1\n",
    "    p: interval pattern\n",
    "    \"\"\"\n",
    "    \n",
    "    return True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#example of interval pattern in dimension 3\n",
    "#   -> this is just a proposal for interval-pattern representation: if can change it, if you want\n",
    "p=([0.1,0.3],[1.0,4.0],[10.0,30.0])\n",
    "\n",
    "print(matches(dataset[2], p))\n",
    "print(matches(dataset[1], p))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def support(dataset,p):\n",
    "    \"\"\"Compute the support of pattern `p` in the `dataset`\"\"\"\n",
    "    s=0\n",
    "    \n",
    "    return s\n",
    "\n",
    "# test\n",
    "support(dataset,p)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#TODO ... depth first search algorithm"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Q8.** Assuming you implement a naive approach of the algorithm, what is the limit/default of the proposed algorithm?\n",
    "\n",
    "<!-- The algorithm has a lot of redundancies-->"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Improved version\n",
    "\n",
    "In this improved version, we propose to tranverse the search space without redundancy. \n",
    "The problem we had in the former version is that there is not a unique path for reach a pattern of the search space.\n",
    "\n",
    "The solution we propose to avoid exploring twice the same interval pattern is is known as *lectic order*, that is to understand as an order on the possible actions (or \"generating the rightmost pattern\"). It is based on the definition of total order on the extentions alternatives. In this case, we assume that there is an order on the attributes $m_1<_a m_2<_a m_3$, and that the left change is *lower* than the right change.\n",
    "Then, \n",
    "\n",
    "In practice, this means that:\n",
    "1. after applying a change to the right boundary of an interval, then one cannot apply a change to the left boundary\n",
    "2. after applying a change to attribute $m$, one cannot apply a change to attribute $m'<_a m$.\n",
    "\n",
    "This rules induce a new partial order on the set of interval patterns. We denote $\\blacktriangleleft$ this order (note that it is a bit technical to formalize).\n",
    "\n",
    "\n",
    "**Q9.** Considering interval patterns in dimension 1 and the set of interval patterns $\\mathcal{P}=\\{[4,4], [4,5], [5,4], [4,6], [5,5]\\}$. Draw the Hasse diagram of $(\\mathcal{P},\\triangleleft)$ to represent the search paths used by the previous algorithm. Then, draw the Hasse diagram of $(\\mathcal{P},\\blacktriangleleft)$ and comment.\n",
    "\n",
    "**Q10.** Are all patterns accessible with it? (we do not ask for a formal proof)\n",
    "\n",
    "<!-- Yes -->"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Q11.** (**) Propose an implementation with the new search space strategy and compare the number of output patterns \n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This exercice has been inspired by the work of M. Kaytoue et al:\n",
    "\n",
    "> Mehdi Kaytoue, Sergei O. Kuznetsov, and Amedeo Napoli. Revisiting numerical\n",
    "> pattern mining with formal concept analysis. In Proceedings of International Join\n",
    "> Conference on Artificial Intelligence (IJCAI), pages 1342–1347, 2011."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Second Exercice: SetChronicle mining\n",
    "\n",
    "This section introduces a new type of patterns we named *SetChronicles* that is dedicated to temporal sequences. The first objective of this algorithm is to design a new frequent pattern mining algorithm to extract SetChronicles from a collection of temporal sequences. The second objective is to apply this algorithm on the \n",
    "\n",
    "Let's start with some definitions. Let us first define a finite alphabet $(\\Sigma,<)$ equipped with a total order $<$.\n",
    "\n",
    "A **temporal sequence** of length $n$ over $\\Sigma$ is a finite sequence $\\rho = \\langle\n",
    "(\\sigma_1,\\tau_1), \\dots, (\\sigma_{n},\\tau_{n})\\rangle$ in $(\\Sigma\\times \\mathbb{R}_{\\geq 0})^{n}$ where for all $1\\leq i<n$, it holds $\\tau_i\\leq \\tau_{i+1}$.\n",
    "\n",
    "\n",
    "A **set-chronicle** is a pair $(\\mathcal{E}, \\mathcal{T})$ where\n",
    "* $\\mathcal{E}$ is a set over~$\\Sigma$\n",
    "* $\\mathcal{T}$ is a set of **temporal constraints**, expressions of the form $c{[t^-,t^+]}c'$ such that \n",
    "    * $c, c' \\in \\mathcal{E}$\n",
    "    * $t^-, t^+ \\in \\mathbb{R}\\cup\\{-\\infty,+\\infty\\}$\n",
    "    * c<c'\n",
    "\n",
    "The size of a set-chronicle $(\\mathcal E, \\mathcal T)$ is the size~$m$ of its set $\\mathcal E$.\n",
    "\n",
    "Let $\\bm{s}=\\langle (\\sigma_1,\\tau_1), (\\sigma_2,\\tau_2), \\ldots, (\\sigma_n,\\tau_n) \\rangle$ be a timed sequence of length $n$, and\n",
    "$\\mathcal{C}=(\\mathcal{E}=\\{ c_1,\\ldots,c_m \\}, \\mathcal{T})$ be a SetChronicle of size $m$.\n",
    "\n",
    "$\\mathcal{C}$ is said to **occur** in $\\bm{s}$ if, and only if, there exists an injective function $\\varepsilon\\colon [m] \\to [n]$\n",
    "such that:\n",
    "* for all $1\\leq i\\leq m$, $\\sigma_{\\varepsilon(i)} = c_i$,\n",
    "* for all $1\\leq i, j\\leq m$, $\\tau_{\\varepsilon(j)}-\\tau_{\\varepsilon(i)}\\in [t^-,t^+]$ whenever $c_i[t^-,t^+]c_j \\in \\mathcal{T}$.\n",
    "\n",
    "Then, $\\tilde{\\bm{s}}=\\{(\\sigma_{\\varepsilon(1)},\\tau_{\\varepsilon(1)}),\\ldots,(\\sigma_{\\varepsilon(m)},\\tau_{\\varepsilon(m)})\\}$ is an *occurrence* of $\\mathcal{C}$ in $\\bm{s}$.\n",
    "\n",
    "Chronicle $\\mathcal{C}$ is said to **match the sequence** $\\bm{s}$, denoted $\\mathcal{C} \\in \\bm{s}$, if, and only if, there is at least one occurrence of $\\mathcal{C}$ in $\\bm{s}$.\n",
    "\n",
    "\n",
    "**Q1.** What are the occurrences of the chronicle $\\mathcal{C}=(\\{A,B,C\\},T)$ where $\\mathcal{T}=\\{A[-3.5,2]B, A[-2,2.3]C, B[-1,5]C\\}$ in the sequences $\\bm{s}_1$ and $\\bm{s}_2$ defined as follows (over $\\Sigma=\\{A,B,C\\}$):\n",
    "* $\\bm{s}_1=\\langle (A, 1.8), (A, 3.5), (B, 3.9), (B, 4.1), (C, 4.2), (C, 5.7)\\rangle$\n",
    "* $\\bm{s}_1=\\langle (B, 0.2), (B, 0.9), (C, 2.5), (B, 3.2), (A, 3.7), (A, 4.7)\\rangle$\n",
    "\n",
    "\n",
    "**Q2.** Define: \n",
    "* the notion of support of a SetChronicle in a dataset of sequence, and \n",
    "* the notion of SetChronicle extension (in the light of the previous exercise, we suggest to investigate the notion of minimal-SetChronicle)\n",
    "\n",
    "**NOTE: handling the possible multiple occurrences of an itemset in a sequence may become technical: a solution that considers only the \"first occurrence\" of an itemset in a sequence is already a good implementation!**\n",
    "\n",
    "**Q3.** Propose an algorithm to extract frequent minimal-SetChronicles from a collection of sequences\n",
    "\n",
    "\n",
    "**Q4.** (*) Implement your algorithm in Python: we suggest the following steps:\n",
    "1. choose a representation of a SetChronicle and of temporal sequences (you can have a look at the question **Q6** that already propose a representation of temporal sequence, but feel free to modify it)\n",
    "2. implement and test a function to test whether a SetChronicle matches a sequence or not.\n",
    "3. implement a DFS search algorithm based on the usual strategy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# example of tests dataset\n",
    "dataset = [[('a',1), ('c',2), ('d',4)],\n",
    "           [('b',4), ('c',5), ('e',7)],\n",
    "           [('a',3), ('b',1.5), ('c',3), ('e',3.5)],\n",
    "           [('b',34), ('e',35)],\n",
    "           [('c',4), ('b',5), ('a',7)],\n",
    "           [('a',2), ('b',4), ('c',5), ('a',3)],\n",
    "           [('b',2), ('a',4), ('c',5), ('a',6)],\n",
    "           [('a',1), ('b',5), ('b',5.5), ('a',6)]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "**Q5.** Modify your code to limit the size of the set of event type to a user-defined number $m$. This should prevent from computing explosion in the next question.\n",
    "\n",
    "**Q6.** Apply your algorithm on the mining of frequent SetChronicles from the retail database (we recommend to test with very low $m$ values, $3$ for instance). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the itemsets from the preprocessed dataset\n",
    "fo = open(\"sequences_instacart.seq\", \"r\")\n",
    "dataset = []\n",
    "id=0\n",
    "for line in fo.readlines():\n",
    "    elems=line[:-1].split(' ')\n",
    "    if id!=elems[0]: #create a new sequence\n",
    "        id=elems[0]\n",
    "        dataset.append([])\n",
    "    date=elems[1]\n",
    "    dataset[-1] += [ (elems[i],int(date)) for i in range(2,len(elems))]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# example of sequence\n",
    "dataset[0]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "**Q7.** In the case we would like to implement the extration of frequent SetChronicles from a **single long sequence**, what would be the important definition to revise? You can make a proposal that would enable the design efficient correct algorithm.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This exercice has been inspired by the work of C. Dousson and T.V. Duong:\n",
    "\n",
    "> Christophe Dousson and Thang Vu Duong, Discovering Chronicles with Numerical \n",
    "> Time Constraints from Alarm Logs for Monitoring Dynamic Systems, In Proceedings \n",
    "> of International Join Conference on Artificial Intelligence (IJCAI), \n",
    "> pages 620-626, 1999."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}