examples/store-sales-forecasting-kag.../store-sales-kale.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Kaggle Getting Started Prediction Competition: Store Sales - Time Series Forecasting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "In this [competition](https://www.kaggle.com/competitions/store-sales-time-series-forecasting), we will use time-series forecasting to forecast store sales on data from Corporación Favorita, a large Ecuadorian-based grocery retailer. The notebook is a buildup of hands-on-exercises presented in Kaggle Learn course of [Time Series Course](https://www.kaggle.com/learn/time-series) where you will learn to leverage periodic trends for forecasting as well as combine different models such as linear regression and XGBoost to perfect your forecasting. For the purpose of this tutorial we are looking at periodic trend for forecasting."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Install necessary packages\n",
    "\n",
    "We can install the necessary package by either running `pip install --user <package_name>` or include everything in a `requirements.txt` file and run `pip install --user -r requirements.txt`. We have put the dependencies in a `requirements.txt` file so we will use the former method.\n",
    "\n",
    "Restart the kernel after installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "skip"
    ]
   },
   "outputs": [],
   "source": [
    "!pip install --user -r requirements.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Download Data from Kaggle\n",
    "\n",
    "Download relevant data from kaggle by running the below code cell. Follow the initial steps information mentioned in Github README.md to get the Kaggle username and key for authentication of Kaggle Public API. There's no need of secret to be created for the following step. The credentials will be present in the kaggle.json file. This cell needs to be run before starting Kale pipeline from  Kale deployment panel. Please ensure that you run the cell only once so you don't create nested directories. Restart the kernel before running the code cell again. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "skip"
    ]
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Get the Kaggle Username and password from the kaggle.json file\n",
    "# and paste it in place of KAGGLE_USERNAME AND KAGGLE_KEY on right hand side\n",
    "\n",
    "os.environ['KAGGLE_USERNAME'] = \"KAGGLE_USERNAME\"\n",
    "os.environ['KAGGLE_KEY'] = \"KAGGLE_KEY\"\n",
    "\n",
    "path = \"data/\"\n",
    "\n",
    "print(\"Current directory: %s\" % os.getcwd())\n",
    "os.chdir(os.getcwd())\n",
    "os.system(\"mkdir \" + path)\n",
    "os.chdir(path)\n",
    "\n",
    "import kaggle\n",
    "from kaggle.api.kaggle_api_extended import KaggleApi\n",
    "api = KaggleApi()\n",
    "api.authenticate()\n",
    "\n",
    "# Getting the files list from Kaggle using Kaggle api\n",
    "file_list = api.competition_list_files('store-sales-time-series-forecasting')\n",
    "# print(file_list)\n",
    "\n",
    "# Download the entire dataset   \n",
    "api.competition_download_files('store-sales-time-series-forecasting')\n",
    "\n",
    "print(\"Unzipping the files ...\")\n",
    "\n",
    "# Get the path of the directory where the files are downloaded\n",
    "path_dir = os.getcwd()\n",
    "print(\"Data file path: %s\" % path_dir)\n",
    "\n",
    "from zipfile import ZipFile \n",
    "\n",
    "# # Extracting all files from competition zip file\n",
    "zipfile = ZipFile(path_dir + '/store-sales-time-series-forecasting.zip', 'r')\n",
    "zipfile.extractall()\n",
    "zipfile.close()\n",
    "    \n",
    "print(\"Checking the files are extracted properly ...\")\n",
    "print(\"For the current dataset, we are only looking at bigger files as those are ones that take longer time for extraction\")\n",
    "\n",
    "for file in os.listdir(path_dir):\n",
    "     filename = os.fsdecode(file)\n",
    "     if filename.endswith(\".csv\"):\n",
    "        file_size = os.path.getsize(path_dir + \"/\" + filename)\n",
    "        if file_size< 1e9:\n",
    "            file_size = str(round(file_size/(1024*1024))) + \"MB\"\n",
    "        else:\n",
    "            file_size = str(round(file_size/(1024*1024*1024))) + \"GB\"\n",
    "        for file in file_list:\n",
    "            if file.name == filename and file.size == file_size:\n",
    "                print(file.name,file.size, file_size)\n",
    "\n",
    "print(\"All files are downloaded and unzipped inside the data directory. Please move on to next step\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "imports"
    ]
   },
   "outputs": [],
   "source": [
    "import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)\n",
    "import os\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import mean_absolute_error"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "block:load_data"
    ]
   },
   "outputs": [],
   "source": [
    "path = 'data'\n",
    "# path = os.getcwd()\n",
    "\n",
    "train_data_filepath = path + \"/train.csv\"\n",
    "test_data_filepath = path + \"/test.csv\"\n",
    "holidays_filepath = path + \"/holidays_events.csv\"\n",
    "\n",
    "# Read the csv files into dataframes\n",
    "# Training data\n",
    "train_sales = pd.read_csv(train_data_filepath,\n",
    "    usecols=['store_nbr', 'family', 'date', 'sales'],\n",
    "    dtype={\n",
    "        'store_nbr': 'category',\n",
    "        'family': 'category',\n",
    "        'sales': 'float32',\n",
    "    },\n",
    "    parse_dates=['date'],\n",
    "    infer_datetime_format=True,\n",
    ")\n",
    "train_sales['date'] = train_sales.date.dt.to_period('D')\n",
    "train_sales = train_sales.set_index(['store_nbr', 'family', 'date']).sort_index()\n",
    "\n",
    "# Holiday features dataset\n",
    "holidays_events = pd.read_csv(\n",
    "    holidays_filepath,\n",
    "    dtype={\n",
    "        'type': 'category',\n",
    "        'locale': 'category',\n",
    "        'locale_name': 'category',\n",
    "        'description': 'category',\n",
    "        'transferred': 'bool',\n",
    "    },\n",
    "    parse_dates=['date'],\n",
    "    infer_datetime_format=True,\n",
    ")\n",
    "holidays_events = holidays_events.set_index('date').to_period('D')\n",
    "\n",
    "\n",
    "# Test data id required for submission of forecast sales\n",
    "df_test = pd.read_csv(\n",
    "    test_data_filepath,\n",
    "    dtype={\n",
    "        'store_nbr': 'category',\n",
    "        'family': 'category',\n",
    "        'onpromotion': 'uint32',\n",
    "    },\n",
    "    parse_dates=['date'],\n",
    "    infer_datetime_format=True,\n",
    ")\n",
    "df_test['date'] = df_test.date.dt.to_period('D')\n",
    "df_test = df_test.set_index(['store_nbr', 'family', 'date']).sort_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "skip"
    ]
   },
   "outputs": [],
   "source": [
    "train_sales.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "skip"
    ]
   },
   "outputs": [],
   "source": [
    "holidays_events.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "skip"
    ]
   },
   "outputs": [],
   "source": [
    "df_test.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Create features\n",
    "\n",
    "1. indicators for weekly seasons\n",
    "2. Fourier features of order 4 for monthly seasons\n",
    "3. Creating holiday features provided in the Store Sales Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "block:create_features",
     "prev:load_data"
    ]
   },
   "outputs": [],
   "source": [
    "# National and regional holidays of Ecuador in the training set\n",
    "# Holiday features\n",
    "holidays = (\n",
    "    holidays_events\n",
    "    .query(\"locale in ['National', 'Regional']\")\n",
    "    .loc['2017':'2017-08-15', ['description']]\n",
    "    .assign(description=lambda x: x.description.cat.remove_unused_categories())\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "skip"
    ]
   },
   "outputs": [],
   "source": [
    "print(holidays)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Create training data features\n",
    "\n",
    "y = train_sales.unstack(['store_nbr', 'family']).loc[\"2017\"]\n",
    "\n",
    "# Using CalendarFourier to create fourier features \n",
    "fourier = CalendarFourier(freq='M', order=4)\n",
    "\n",
    "# Using DeterministicProcess to create indicators for both \n",
    "# weekly and monthly seasons\n",
    "dp = DeterministicProcess(\n",
    "    index=y.index,\n",
    "    constant=True,\n",
    "    order=1,\n",
    "    seasonal=True,               # weekly seasonality (indicators)\n",
    "    additional_terms=[fourier],  # annual seasonality (fourier)\n",
    "    drop=True,\n",
    ")\n",
    "\n",
    "# `in_sample` creates features for the dates given in the `index` argument\n",
    "X = dp.in_sample()\n",
    "\n",
    "ohe = OneHotEncoder(sparse=False)\n",
    "\n",
    "X_holidays = pd.DataFrame(\n",
    "    ohe.fit_transform(holidays),\n",
    "    index=holidays.index,\n",
    "    columns=holidays.description.unique(),\n",
    ")\n",
    "\n",
    "X_holidays = pd.get_dummies(holidays)\n",
    "\n",
    "# Join holiday features to training data\n",
    "X_2= X.join(X_holidays, on='date').fillna(0.0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Train and Evaluate the Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "block:train_and_evaluate_model",
     "prev:create_features"
    ]
   },
   "outputs": [],
   "source": [
    "# Split the data to train and valid datasets\n",
    "X_train, X_valid, y_train, y_valid = train_test_split(X_2, y, test_size=0.1, shuffle=False)\n",
    "\n",
    "# Train the model\n",
    "model = LinearRegression(fit_intercept=False)\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "# Get the training and valid data predictions\n",
    "y_train_pred = pd.DataFrame(model.predict(X_train), index=X_train.index, columns=y.columns)\n",
    "y_valid_pred = pd.DataFrame(model.predict(X_valid), index=X_valid.index, columns=y.columns)\n",
    "# Evaluate the model using mean_squared_log_error\n",
    "print(mean_absolute_error(y_valid, y_valid_pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Forecast sales"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "block:forecast_sales",
     "prev:create_features",
     "prev:train_and_evaluate_model"
    ]
   },
   "outputs": [],
   "source": [
    "# Create features for test set\n",
    "# \"out of sample\" refers to times outside of the observation period of the training data.\n",
    "# We are forecasting for next 16 days from the end of the training data date\n",
    "test = dp.out_of_sample(steps=16)\n",
    "test.index.name = 'date'\n",
    "X_test = test.join(X_holidays, on='date').fillna(0.0)\n",
    "y_forecast = pd.DataFrame(model.predict(X_test), index=X_test.index, columns=y.columns)\n",
    "y_forecast"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Submission"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "skip"
    ]
   },
   "outputs": [],
   "source": [
    "y_submit = y_forecast.stack(['store_nbr', 'family'])\n",
    "y_submit = y_submit.join(df_test.id).reindex(columns=['id', 'sales'])\n",
    "y_submit.to_csv('submission.csv', index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "kubeflow_notebook": {
   "autosnapshot": true,
   "experiment": {
    "id": "new",
    "name": "store-sales-forecast"
   },
   "experiment_name": "store-sales-forecast",
   "katib_metadata": {
    "algorithm": {
     "algorithmName": "grid"
    },
    "maxFailedTrialCount": 3,
    "maxTrialCount": 12,
    "objective": {
     "objectiveMetricName": "",
     "type": "minimize"
    },
    "parallelTrialCount": 3,
    "parameters": []
   },
   "katib_run": false,
   "pipeline_description": "",
   "pipeline_name": "store-sales-kale",
   "snapshot_volumes": true,
   "steps_defaults": [
    "label:access-ml-pipeline:true",
    "label:access-rok:true"
   ],
   "volume_access_mode": "rwm",
   "volumes": [
    {
     "annotations": [],
     "mount_point": "/home/jovyan",
     "name": "store-sales-workspace-tx8b7",
     "size": 20,
     "size_type": "Gi",
     "snapshot": false,
     "type": "clone"
    }
   ]
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}