{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%pip install mlxtend\n", "%pip install pandas\n", "%pip install matplotlib" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# TD / TP 1 -- Frequent itemsets mining\n", "\n", "### Table of Contents\n", "* [Implementation of basic algorithms](#fim)\n", " * [Breadth first search strategy](#bfs)\n", " * [Depth first search strategy](#dfs)\n", "* [Comparison of algorithms on real data](#cmp)\n", " * [Data preparation](#cmp_prep)\n", " * [Graphic](#cmp_graphics)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let us consider the transaction database depicted in the following Table.\n", "\n", "| Id | Motif | \n", "|----------|----------|\n", "| 1 | {a, c, d} | \n", "| 2 | {b, c, e} | \n", "| 3 | {a, b, c, e} | \n", "| 4 | {b, e} | \n", "| 5 | {a, b, c, e} | \n", "| 6 | {a, b, c, e} | \n", "\n", "1. General questions independent of the minimum frequency threshold:\n", " 1. What is the maximal number of frequent itemsets that can be extracted from this dataset?\n", " 2. Draw the Hasse diagram of itemsets (subset relation).\n", " 3. What is the maximal number of scans over the database with APriori algorithm ? \n", "2. Assuming $\\sigma = 2$\n", " 1. what are the frequent itemsets?\n", " 2. what are the closed itemsets?\n", " 3. what are the maximal itemsets?\n", "2. Extract the frequent itemsets with $\\sigma = 2$ with a breadth-first enumeration (generate and test strategy).\n", "3. Compute the frequent itemsets (minsup = 2) by using a depth first strategy." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Implementation of a frequent itemset mining algorithm \n", "\n", "#### Breadth first search \n", "\n", "Fulfill the following functions to implement the APriori algorithm using a breadth-first search. \n", "A _transaction_ of the database is represented by a set, but we suggest to represent an _itemset pattern_ by a list of items (not a set). Using set is more convenient to add and remove items from them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# dataset example: list of sets\n", "dataset = [set(['a', 'c', 'd']),\n", " set(['b', 'c', 'e']),\n", " set(['a', 'b', 'c', 'e']),\n", " set(['b', 'e']),\n", " set(['a', 'b', 'c', 'e']),\n", " set(['a', 'b', 'c', 'e'])]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def support(dataset,I):\n", " \"\"\"Compute the support of the itemset I in the dataset D\n", " returns an integer\n", "\n", " dataset: list of sets of items\n", " I: list of items (representing an itemset pattern)\n", "\n", " hints: use the function \"issubset\" between sets (convert list as set if necessary)\n", " \"\"\"\n", " c=0\n", " #TODO\n", " return c\n", "\n", "# test of the function\n", "support(dataset, ['a','b'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def candidates(F):\n", " \"\"\"Generate candidate itemsets of size k+1 from frequent itemsets of size k\n", " F is a list of frequent itemsets (represented as a list) of size k.\n", "\n", " \"\"\"\n", " C=[]\n", " #TODO\n", " return C\n", "\n", "# test of the function\n", "candidates( [ ['a', 'b', 'c'], \n", " ['a', 'b', 'd'], \n", " ['a', 'b', 'e'],\n", " ['a', 'c', 'e'],\n", " ['a', 'd', 'e'], \n", " ['a', 'e', 'f'] ] )" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def frequent1itemsets(dataset,sigma):\n", " \"\"\"\n", " dataset: list of sets\n", " sigma: minimal frequency threshold\n", "\n", " find the 1-itemsets\n", " \"\"\"\n", " I=[]\n", " #TODO\n", " return I\n", "\n", "frequent1itemsets(dataset,2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def APriori(dataset, sigma):\n", " \"\"\"\n", " dataset: list of sets\n", " sigma: minimal frequency threshold\n", "\n", " Level-wise (breadth first search) algorithm for mining frequent itemsets\n", " \"\"\"\n", " F=[] #frequent patterns\n", " #TODO\n", " return F\n", "\n", "#test\n", "APriori(dataset,3)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Depth-first search \n", "\n", "Implement the APriori stragety (generate and test) using a depth first search :\n", "\n", "1. identify the 1-itemsets\n", "2. create a recursive function that test a pattern $p$, and extent it by an item (if necessary)\n", "\n", "You will take care that your function do not explore twice the same patterns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def test_and_generate(p, dataset, sigma, F1):\n", " \"\"\"\n", " p: an itemset pattern\n", " dataset: dataset\n", " ...\n", "\n", " Recursive function\n", " \"\"\"\n", " \n", " #TODO\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def APriori_rec(dataset, sigma):\n", " \"\"\"\n", " dataset: list of sets\n", " sigma: minimal frequency threshold\n", "\n", " Depth first search algorithm for mining frequent itemses. \n", " This function call the `test_and_generate` recursive function\n", " \"\"\"\n", " F=[] #frequent patterns\n", " #TODO\n", " return F\n", "\n", "#test\n", "APriori_rec(dataset,3)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Comparison of algorithms on a real dataset \n", "\n", "We use the Instacart dataset from which we extracted the data of the $\\approx 10000$ first customers. \n", "\n", "Dataset characteristics: \n", "* number of sequences: 9536\n", "* number of itemsets: 61959\n", "* number of items: 33635\n", "\n", "We use the library `mlxtend` that implements two frequent itemsets mining algorithms. I advice to not use your own implementation on this large dataset. In do not think that it will be as efficient." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load and unzip the data (for linux only)\n", "!wget https://team.inria.fr/aistrosight/files/2023/02/instacard_data.zip\n", "!unzip -o -qq instacard_data.zip" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Data preparation and running examples " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mlxtend.preprocessing import TransactionEncoder\n", "from mlxtend.frequent_patterns import apriori, fpgrowth\n", "from time import time\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get the name of the product\n", "products = pd.read_csv('./products.csv')\n", "# load the itemsets from the preprocessed dataset\n", "fo = open(\"sequences_instacart.seq\", \"r\")\n", "itemsets = []\n", "for line in fo.readlines():\n", " elems=line[:-1].split(' ')\n", " itemset = set([elems[i] for i in range(2,len(elems))]) #itemset of products\n", " itemsets.append(itemset)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# transformation of the dataset for mlxtend algorithms\n", "te = TransactionEncoder()\n", "te_ary = te.fit(itemsets).transform(itemsets)\n", "df_itemsets = pd.DataFrame(te_ary, columns=te.columns_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t0=time()\n", "patterns=apriori(df_itemsets, min_support=0.1, low_memory=True)\n", "d=time()-t0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "t0=time()\n", "patterns=fpgrowth(df_itemsets, min_support=0.1)\n", "d=time()-t0\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The file `products.csv` contains the name of the product. \n", "\n", "1. Write a small and simple function to transform an itemsets with figures into itemsets with names\n", "2. Which item is the most frequent?\n", "3. Among the patterns of length 3, is there some surprizing ones?\n", "4. How these patterns could be used in a retail company?" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "#### Comparisons of the algorithms \n", "\n", "1. Play with the threshold to evaluate which lower value of the threshold is reasonable in time (let's say: it takes less than 30s). Which problem is encountered with APriori? (then, look at the documentation to resolve the problem)\n", "2. Evaluate the computing time of the two implementations of the frequent itemsets mining for frequency threshold from 0.1 to the lowest possible value for each implementation.\n", "3. Draw the curve of time wrt the threshold with the collected figures (you can use the `matplotlib` library to draw this curve)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a" } } }, "nbformat": 4, "nbformat_minor": 2 }