{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Kaggle Featured Prediction Competition: H&M Personalized Fashion Recommendations"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"In this [competition](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations), the task is to recommend products to customers based on their previous purchases. A whole range of data is available, including customer metadata, product metadata, and data that spans from simple attributes, such as garment type and customer age, to text from product descriptions, to images of the garments.\n",
"\n",
"In this notebook we build a recommender system with the Alternating Least Squares (ALS) model from the `implicit` library. Please check out the [docs](https://benfred.github.io/implicit/index.html) for more information."
]
},
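{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"As a quick orientation, the snippet below is a minimal, illustrative sketch of the `implicit` ALS API on a tiny made-up purchase-count matrix (it is not part of the pipeline): build a sparse user-item matrix, fit the model, and request recommendations for one user.\n",
"\n",
"```python\n",
"import implicit\n",
"import numpy as np\n",
"import scipy.sparse as sparse\n",
"\n",
"# Toy purchase counts: 3 users x 4 items (made-up values, for illustration only)\n",
"counts = np.array([\n",
"    [2, 0, 1, 0],\n",
"    [0, 3, 0, 1],\n",
"    [1, 0, 0, 2],\n",
"])\n",
"user_items = sparse.csr_matrix(counts)\n",
"\n",
"# A deliberately small model just to show the API; the real parameters are set in the Model Training section\n",
"toy_model = implicit.als.AlternatingLeastSquares(factors=8, iterations=5)\n",
"toy_model.fit(user_items)\n",
"\n",
"# Recommend 2 items for user 0; returns an array of item indices and an array of scores\n",
"item_ids, scores = toy_model.recommend(0, user_items[0], N=2)\n",
"print(item_ids, scores)\n",
"```"
]
},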
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Install necessary packages\n",
"\n",
"We can install the necessary packages either by running `pip install --user <package_name>` for each package, or by listing everything in a `requirements.txt` file and running `pip install --user -r requirements.txt`. We have put the dependencies in a `requirements.txt` file, so we will use the latter method.\n",
"\n",
"Restart the kernel after the installation finishes."
]
},
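{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"For reference, judging from the pip output below (which resolves lines 1-5 of the requirements file), `requirements.txt` contains roughly the following unpinned packages:\n",
"\n",
"```\n",
"numpy\n",
"pandas\n",
"implicit\n",
"sklearn\n",
"kaggle\n",
"```"
]
},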
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: numpy in /usr/local/lib/python3.6/dist-packages (from -r requirements.txt (line 1)) (1.19.5)\n",
"Requirement already satisfied: pandas in /usr/local/lib/python3.6/dist-packages (from -r requirements.txt (line 2)) (1.1.5)\n",
"Collecting implicit\n",
"  Downloading implicit-0.5.2-cp36-cp36m-manylinux2014_x86_64.whl (18.6 MB)\n",
"     |████████████████████████████████| 18.6 MB 8.6 MB/s \n",
"\u001b[?25hCollecting sklearn\n",
"  Downloading sklearn-0.0.tar.gz (1.1 kB)\n",
"  Preparing metadata (setup.py) ... \u001b[?25ldone\n",
"\u001b[?25hCollecting kaggle\n",
"  Downloading kaggle-1.5.12.tar.gz (58 kB)\n",
"     |████████████████████████████████| 58 kB 8.3 MB/s \n",
"\u001b[?25h  Preparing metadata (setup.py) ... \u001b[?25ldone\n",
"\u001b[?25hRequirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas->-r requirements.txt (line 2)) (2021.3)\n",
"Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.6/dist-packages (from pandas->-r requirements.txt (line 2)) (2.8.2)\n",
"Collecting tqdm>=4.27\n",
"  Downloading tqdm-4.64.0-py2.py3-none-any.whl (78 kB)\n",
"     |████████████████████████████████| 78 kB 7.2 MB/s \n",
"\u001b[?25hRequirement already satisfied: scipy>=0.16 in /usr/local/lib/python3.6/dist-packages (from implicit->-r requirements.txt (line 3)) (1.5.4)\n",
"Requirement already satisfied: scikit-learn in /usr/local/lib/python3.6/dist-packages (from sklearn->-r requirements.txt (line 4)) (0.23.2)\n",
"Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.6/dist-packages (from kaggle->-r requirements.txt (line 5)) (1.16.0)\n",
"Requirement already satisfied: certifi in /usr/local/lib/python3.6/dist-packages (from kaggle->-r requirements.txt (line 5)) (2021.10.8)\n",
"Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from kaggle->-r requirements.txt (line 5)) (2.27.1)\n",
"Collecting python-slugify\n",
"  Downloading python_slugify-6.1.2-py2.py3-none-any.whl (9.4 kB)\n",
"Requirement already satisfied: urllib3 in /usr/local/lib/python3.6/dist-packages (from kaggle->-r requirements.txt (line 5)) (1.26.8)\n",
"Requirement already satisfied: importlib-resources in /usr/local/lib/python3.6/dist-packages (from tqdm>=4.27->implicit->-r requirements.txt (line 3)) (5.4.0)\n",
"Collecting text-unidecode>=1.3\n",
"  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)\n",
"     |████████████████████████████████| 78 kB 10.0 MB/s \n",
"\u001b[?25hRequirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.6/dist-packages (from requests->kaggle->-r requirements.txt (line 5)) (2.0.10)\n",
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->kaggle->-r requirements.txt (line 5)) (3.3)\n",
"Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.6/dist-packages (from scikit-learn->sklearn->-r requirements.txt (line 4)) (3.0.0)\n",
"Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.6/dist-packages (from scikit-learn->sklearn->-r requirements.txt (line 4)) (1.1.0)\n",
"Requirement already satisfied: zipp>=3.1.0 in /usr/local/lib/python3.6/dist-packages (from importlib-resources->tqdm>=4.27->implicit->-r requirements.txt (line 3)) (3.6.0)\n",
"Building wheels for collected packages: sklearn, kaggle\n",
"  Building wheel for sklearn (setup.py) ... \u001b[?25ldone\n",
"\u001b[?25h  Created wheel for sklearn: filename=sklearn-0.0-py2.py3-none-any.whl size=2381 sha256=b097d9ed6edd1c879d42180dffc20aa2659b2628f9711ae4204ce4b1cb636dc2\n",
"  Stored in directory: /home/jovyan/.cache/pip/wheels/23/9d/42/5ec745cbbb17517000a53cecc49d6a865450d1f5cb16dc8a9c\n",
"  Building wheel for kaggle (setup.py) ... \u001b[?25ldone\n",
"\u001b[?25h  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73994 sha256=6e228faf71a3c29b864d64cfe8fe5b59a1257cfc4afb24f7c4712943489a5710\n",
"  Stored in directory: /home/jovyan/.cache/pip/wheels/77/47/e4/44a4ba1b7dfd53faaa35f59f1175e123b213ff401a8a56876b\n",
"Successfully built sklearn kaggle\n",
"Installing collected packages: text-unidecode, tqdm, python-slugify, sklearn, kaggle, implicit\n",
"Successfully installed implicit-0.5.2 kaggle-1.5.12 python-slugify-6.1.2 sklearn-0.0 text-unidecode-1.3 tqdm-4.64.0\n"
]
}
],
"source": [
"!pip install --user -r requirements.txt"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Download Data from Kaggle"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Download the relevant data from Kaggle by running the code cell below. Follow the initial setup steps in the GitHub README.md to get the Kaggle username and key used to authenticate against the Kaggle public API; the credentials are stored in the kaggle.json file. No secret needs to be created for this step. This cell needs to be run before starting the Kale pipeline from the Kale deployment panel. Please make sure you run the cell only once so that you don't create nested directories, and restart the kernel before running the cell again."
]
},
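{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"For reference, the `kaggle.json` file downloaded from your Kaggle account page contains the two credentials used in the next cell, in this form (the values below are placeholders):\n",
"\n",
"```json\n",
"{\n",
"  \"username\": \"KAGGLE_USERNAME\",\n",
"  \"key\": \"KAGGLE_KEY\"\n",
"}\n",
"```"
]
},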
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"customers.csv.zip: Skipping, found more recently modified local copy (use --force to force download)\n",
"transactions_train.csv.zip: Skipping, found more recently modified local copy (use --force to force download)\n",
"articles.csv.zip: Skipping, found more recently modified local copy (use --force to force download)\n",
"sample_submission.csv.zip: Skipping, found more recently modified local copy (use --force to force download)\n",
"Unzipping the files ...\n",
"\n",
"\n",
"Checking the files are extracted properly ...\n",
"\n",
"\n",
"sample_submission.csv 258MB 258MB\n",
"customers.csv 198MB 198MB\n",
"transactions_train.csv 3GB 3GB\n",
"articles.csv 34MB 34MB\n",
"All files are downloaded and unzipped inside the data directory. Please move on to next step\n"
]
}
],
"source": [
"import os\n",
"\n",
"# Get the Kaggle username and key from the kaggle.json file\n",
"# and paste them in place of KAGGLE_USERNAME and KAGGLE_KEY on the right-hand side\n",
"\n",
"os.environ['KAGGLE_USERNAME'] = \"KAGGLE_USERNAME\"\n",
"os.environ['KAGGLE_KEY'] = \"KAGGLE_KEY\"\n",
"\n",
"path = \"data/\"\n",
"\n",
"# Create the data directory (if it doesn't exist yet) and move into it\n",
"os.makedirs(path, exist_ok=True)\n",
"os.chdir(path)\n",
"\n",
"import kaggle\n",
"from kaggle.api.kaggle_api_extended import KaggleApi\n",
"api = KaggleApi()\n",
"api.authenticate()\n",
"\n",
"# Get the list of competition files from Kaggle using the Kaggle API\n",
"file_list = api.competition_list_files('h-and-m-personalized-fashion-recommendations')\n",
"\n",
"# Download the required files individually. You can also choose to download the entire dataset\n",
"# if you want to work with the image data as well.\n",
"api.competition_download_file('h-and-m-personalized-fashion-recommendations', 'customers.csv')\n",
"api.competition_download_file('h-and-m-personalized-fashion-recommendations', 'transactions_train.csv')\n",
"api.competition_download_file('h-and-m-personalized-fashion-recommendations', 'articles.csv')\n",
"api.competition_download_file('h-and-m-personalized-fashion-recommendations', 'sample_submission.csv')\n",
"\n",
"print(\"Unzipping the files ...\")\n",
"\n",
"# Path of the directory where the files were downloaded\n",
"path_dir = os.getcwd()\n",
"\n",
"from zipfile import ZipFile\n",
"\n",
"# Extract each CSV from its individual zip archive\n",
"for csv_name in [\"customers.csv\", \"transactions_train.csv\", \"articles.csv\", \"sample_submission.csv\"]:\n",
"    with ZipFile(path_dir + '/' + csv_name + '.zip', 'r') as archive:\n",
"        archive.extract(csv_name)\n",
"\n",
"print(\"Checking the files are extracted properly ...\")\n",
"\n",
"for file in os.listdir(path_dir):\n",
"    filename = os.fsdecode(file)\n",
"    if filename.endswith(\".csv\"):\n",
"        # Human-readable size of the extracted file, compared against the size reported by Kaggle\n",
"        file_size = os.path.getsize(path_dir + \"/\" + filename)\n",
"        if file_size < 1e9:\n",
"            file_size = str(round(file_size/(1024*1024))) + \"MB\"\n",
"        else:\n",
"            file_size = str(round(file_size/(1024*1024*1024))) + \"GB\"\n",
"        for remote_file in file_list:\n",
"            if remote_file.name == filename and remote_file.size == file_size:\n",
"                print(remote_file.name, remote_file.size, file_size)\n",
"\n",
"print(\"All files are downloaded and unzipped inside the data directory. Please move on to next step\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"imports"
]
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import implicit\n",
"import scipy.sparse as sparse"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Load Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"block:load_and_preprocess_data"
]
},
"outputs": [],
"source": [
"path = \"data/\"\n",
"train_data_filepath = path + \"transactions_train.csv\"\n",
"article_metadata_filepath = path + \"articles.csv\"\n",
"customer_metadata_filepath = path + \"customers.csv\"\n",
"test_data_filepath = path + \"sample_submission.csv\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"train_data = pd.read_csv(train_data_filepath)\n",
"test_data = pd.read_csv(test_data_filepath)\n",
"customer_data = pd.read_csv(customer_metadata_filepath)\n",
"article_data = pd.read_csv(article_metadata_filepath)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Exploring the dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"train_data.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"train_data.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"train_data.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"test_data.tail()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"customer_data.tail()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"article_data.tail()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Drop t_dat, sales_channel_id and price, since these columns won't be used by the recommendation system we are building\n",
"train_data.drop(['t_dat', 'sales_channel_id', 'price'], axis=1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"train_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Preprocess Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"block:"
]
},
"outputs": [],
"source": [
"# Create a purchase_count column that gives the number of times each customer bought each article\n",
"X = train_data.groupby(['customer_id', 'article_id'])['article_id'].count().reset_index(name=\"purchase_count\")\n",
"\n",
"# Get the unique customers and articles from the customer and article metadata files\n",
"unique_customers = customer_data['customer_id'].unique()\n",
"unique_articles = article_data['article_id'].unique()\n",
"\n",
"# Number of unique customers and articles\n",
"n_customers = len(unique_customers)\n",
"n_articles = len(unique_articles)\n",
"\n",
"# Map each customer_id (an object/string column) to an integer index for the sparse matrix creation\n",
"customer_id_dict = {unique_customers[i]: i for i in range(len(unique_customers))}\n",
"reverse_customer_id_dict = {i: unique_customers[i] for i in range(len(unique_customers))}\n",
"numeric_cus_id = []\n",
"for i in range(len(X['customer_id'])):\n",
"    numeric_cus_id.append(customer_id_dict.get(X['customer_id'][i]))\n",
"X['customer_id'] = numeric_cus_id\n",
"\n",
"# Map each article_id to a compact integer index so the sparse matrix doesn't become\n",
"# unnecessarily large because of the large integer values of the original article_ids\n",
"article_id_dict = {unique_articles[i]: i for i in range(len(unique_articles))}\n",
"reverse_article_id_dict = {i: unique_articles[i] for i in range(len(unique_articles))}\n",
"numeric_art_id = []\n",
"for i in range(len(X['article_id'])):\n",
"    numeric_art_id.append(article_id_dict.get(X['article_id'][i]))\n",
"X['article_id'] = numeric_art_id"
]
},
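{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"The per-row loops above are easy to read but slow on a transaction table of this size. A sketch of an equivalent, vectorized alternative using pandas `Series.map` with the same mapping dictionaries (not part of the pipeline) would be:\n",
"\n",
"```python\n",
"# Vectorized equivalent of the two index-mapping loops above\n",
"X['customer_id'] = X['customer_id'].map(customer_id_dict)\n",
"X['article_id'] = X['article_id'].map(article_id_dict)\n",
"```"
]
},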
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"X.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Sparse Matrix Creation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"block:sparse_matrix_creation",
"prev:load_and_preprocess_data"
]
},
"outputs": [],
"source": [
"# Construct sparse customer-article matrices of purchase counts for the alternating least squares algorithm;\n",
"# only the CSR matrix is used by the model below\n",
"sparse_user_item_coo = sparse.coo_matrix((X.purchase_count, (X.customer_id, X.article_id)), shape=(n_customers, n_articles))\n",
"sparse_user_item_csr = sparse.csr_matrix((X['purchase_count'], (X['customer_id'], X['article_id'])), shape=(n_customers, n_articles))"
]
},
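{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"To make the matrix construction concrete, here is a minimal sketch (with made-up counts) of how `scipy` builds a CSR matrix from `(values, (rows, cols))` triplets, which is exactly the form used above with purchase counts, customer indices, and article indices:\n",
"\n",
"```python\n",
"import scipy.sparse as sparse\n",
"\n",
"# Three purchases: customer 0 bought article 2 twice; customer 1 bought articles 0 and 2 once each\n",
"purchase_count = [2, 1, 1]\n",
"customer_idx = [0, 1, 1]\n",
"article_idx = [2, 0, 2]\n",
"\n",
"toy_csr = sparse.csr_matrix((purchase_count, (customer_idx, article_idx)), shape=(2, 3))\n",
"print(toy_csr.toarray())\n",
"# [[0 0 2]\n",
"#  [1 0 1]]\n",
"```"
]
},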
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"sparse_user_item_csr"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Model Training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"block:train_model",
"prev:sparse_matrix_creation"
]
},
"outputs": [],
"source": [
"# Parameters for the model\n",
"als_params = dict(\n",
"    factors=200,          # number of latent factors - try values between 50 and 1000\n",
"    regularization=0.01,  # regularization factor - try values between 0.001 and 0.2\n",
"    iterations=5,         # number of ALS iterations - try values between 2 and 100\n",
")\n",
"\n",
"# Initialize the model\n",
"model = implicit.als.AlternatingLeastSquares(**als_params)\n",
"\n",
"# Train the model on a sparse matrix of user/item/confidence weights\n",
"model.fit(sparse_user_item_csr)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"model"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Predictions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"block:predictions",
"prev:sparse_matrix_creation",
"prev:train_model",
"prev:load_and_preprocess_data"
]
},
"outputs": [],
"source": [
"# Generate the top 10 recommendations for every customer in the sample submission\n",
"predictions = []\n",
"for cust_id in test_data.customer_id:\n",
"    cust_id = customer_id_dict.get(cust_id)\n",
"    if cust_id is not None:\n",
"        recommendations = model.recommend(cust_id, sparse_user_item_csr[cust_id], N=10)\n",
"        result = []\n",
"        for i in range(len(recommendations[0])):\n",
"            # Map the internal article index back to the original article_id\n",
"            val = reverse_article_id_dict.get(recommendations[0][i])\n",
"            result.append(val)\n",
"        predictions.append(result)\n",
"    else:\n",
"        # Customers without a mapped index get an empty recommendation list,\n",
"        # so that len(predictions) matches len(test_data) in the next cell\n",
"        predictions.append([])\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"test_data['prediction'] = predictions\n",
"test_data"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Final Submission"
]
},
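{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Note that `test_data['prediction']` currently holds Python lists of integer article ids. The Kaggle submission format expects the `prediction` column to be a single space-separated string of zero-padded article ids, as in `sample_submission.csv`, so a conversion along the following lines may be needed before submitting (the 10-character zero padding is an assumption based on the sample file):\n",
"\n",
"```python\n",
"# Convert each list of integer article ids into one space-separated, zero-padded string\n",
"test_data['prediction'] = test_data['prediction'].apply(\n",
"    lambda ids: ' '.join(str(article_id).zfill(10) for article_id in ids)\n",
")\n",
"```"
]
},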
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"test_data.to_csv('data/submission.csv', index=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"kubeflow_notebook": {
"autosnapshot": true,
"experiment": {
"id": "new",
"name": "hm-fash-recomm"
},
"experiment_name": "hm-fash-recomm",
"katib_metadata": {
"algorithm": {
"algorithmName": "grid"
},
"maxFailedTrialCount": 3,
"maxTrialCount": 12,
"objective": {
"objectiveMetricName": "",
"type": "minimize"
},
"parallelTrialCount": 3,
"parameters": []
},
"katib_run": false,
"pipeline_description": "",
"pipeline_name": "predict-hm-purchases-kale-1",
"snapshot_volumes": true,
"steps_defaults": [
"label:access-ml-pipeline:true",
"label:access-rok:true"
],
"volume_access_mode": "rwm",
"volumes": [
{
"annotations": [],
"mount_point": "/home/jovyan",
"name": "hm-test2-workspace-6vjtz",
"size": 50,
"size_type": "Gi",
"snapshot": false,
"type": "clone"
}
]
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}