examples/G-research-crypto-forecasti.../g-research-crypto-forecast-...

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "# 🪙 G-Research Crypto Original Notebook\n",
    "![](./images/vector-blockchain-poster.jpg)\n",
    "\n",
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "In this [Kaggle competition](https://www.kaggle.com/competitions/g-research-crypto-forecasting/overview), you'll use your machine learning expertise to forecast short term returns in 14 popular cryptocurrencies. The dataset provided contains information on historic trades for several cryptoassets, such as Bitcoin and Ethereum. \n",
    "\n",
    "> G-Research is a leading quantitative research and technology company. By using the latest scientific techniques, they produce world-beating predictive research and build advanced technology to analyse the world's data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Install necessary packages\n",
    "\n",
    "We can install the necessary package by either running pip install --user <package_name> or include everything in a requirements.txt file and run pip install --user -r requirements.txt. We have put the dependencies in a requirements.txt file so we will use the former method.\n",
    "\n",
    "NOTE: After installing python packages, restart notebook kernel before proceeding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "skip"
    ]
   },
   "outputs": [],
   "source": [
    "!pip install -r requirements.txt --user --quiet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Imports\n",
    "\n",
    "In this section we import the packages we need for this example. Make it a habit to gather your imports in a single place. It will make your life easier if you are going to transform this notebook into a Kubeflow pipeline using Kale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "imports"
    ]
   },
   "outputs": [],
   "source": [
    "import os, random, subprocess\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import time, datetime, zipfile\n",
    "import joblib, talib\n",
    "from tqdm import tqdm\n",
    "import lightgbm as lgb\n",
    "\n",
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Project hyper-parameters\n",
    "\n",
    "In this cell, we define the different hyper-parameters. Defining them in one place makes it easier to experiment with their values and also facilitates the execution of HP Tuning experiments using Kale and Katib."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "tags": [
     "pipeline-parameters"
    ]
   },
   "outputs": [],
   "source": [
    "# Hyper-parameters\n",
    "LR = 0.01\n",
    "N_EST = 1200"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "Set random seed for reproducibility"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "tags": [
     "skip"
    ]
   },
   "outputs": [],
   "source": [
    "def fix_all_seeds(seed):\n",
    "    np.random.seed(seed)\n",
    "    random.seed(seed)\n",
    "    os.environ['PYTHONHASHSEED'] = str(seed)\n",
    "\n",
    "fix_all_seeds(2022)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Download data\n",
    "\n",
    "In this section, we download the data from kaggle using the Kaggle API credentials"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "tags": [
     "block:download_data"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "CompletedProcess(args=['kaggle', 'competitions', 'download', '-c', 'g-research-crypto-forecasting'], returncode=0)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# setup kaggle environment for data download\n",
    "dataset = \"g-research-crypto-forecasting\"\n",
    "\n",
    "# setup kaggle environment for data download\n",
    "with open('/secret/kaggle-secret/password', 'r') as file:\n",
    "    kaggle_key = file.read().rstrip()\n",
    "with open('/secret/kaggle-secret/username', 'r') as file:\n",
    "    kaggle_user = file.read().rstrip()\n",
    "\n",
    "os.environ['KAGGLE_USERNAME'], os.environ['KAGGLE_KEY'] = kaggle_user, kaggle_key\n",
    "\n",
    "# download kaggle's g-research-crypto-forecast data\n",
    "subprocess.run([\"kaggle\",\"competitions\", \"download\", \"-c\", dataset])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "tags": [
     "block:"
    ]
   },
   "outputs": [],
   "source": [
    "# path to download to\n",
    "data_path = 'data'\n",
    "\n",
    "# extract g-research-crypto-forecasting.zip to load_data_path\n",
    "with zipfile.ZipFile(f\"{dataset}.zip\",\"r\") as zip_ref:\n",
    "    zip_ref.extractall(data_path, members=['train.csv', 'asset_details.csv'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Load the dataset\n",
    "\n",
    "First, let us load and analyze the data.\n",
    "\n",
    "The data is in csv format, thus, we use the handy read_csv pandas method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "tags": [
     "block:load_data",
     "prev:download_data"
    ]
   },
   "outputs": [],
   "source": [
    "TRAIN_CSV = f'{data_path}/train.csv'\n",
    "ASSET_DETAILS_CSV = f'{data_path}/asset_details.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "df_train = pd.read_csv(TRAIN_CSV)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(24236806, 10)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Asset_ID</th>\n",
       "      <th>Weight</th>\n",
       "      <th>Asset_Name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>4.304065</td>\n",
       "      <td>Binance Coin</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>6.779922</td>\n",
       "      <td>Bitcoin</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2</td>\n",
       "      <td>2.397895</td>\n",
       "      <td>Bitcoin Cash</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>3</td>\n",
       "      <td>4.406719</td>\n",
       "      <td>Cardano</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>4</td>\n",
       "      <td>3.555348</td>\n",
       "      <td>Dogecoin</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5</td>\n",
       "      <td>1.386294</td>\n",
       "      <td>EOS.IO</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>6</td>\n",
       "      <td>5.894403</td>\n",
       "      <td>Ethereum</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>7</td>\n",
       "      <td>2.079442</td>\n",
       "      <td>Ethereum Classic</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>8</td>\n",
       "      <td>1.098612</td>\n",
       "      <td>IOTA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>9</td>\n",
       "      <td>2.397895</td>\n",
       "      <td>Litecoin</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>10</td>\n",
       "      <td>1.098612</td>\n",
       "      <td>Maker</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>11</td>\n",
       "      <td>1.609438</td>\n",
       "      <td>Monero</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>12</td>\n",
       "      <td>2.079442</td>\n",
       "      <td>Stellar</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>13</td>\n",
       "      <td>1.791759</td>\n",
       "      <td>TRON</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Asset_ID    Weight        Asset_Name\n",
       "1          0  4.304065      Binance Coin\n",
       "2          1  6.779922           Bitcoin\n",
       "0          2  2.397895      Bitcoin Cash\n",
       "10         3  4.406719           Cardano\n",
       "13         4  3.555348          Dogecoin\n",
       "3          5  1.386294            EOS.IO\n",
       "5          6  5.894403          Ethereum\n",
       "4          7  2.079442  Ethereum Classic\n",
       "11         8  1.098612              IOTA\n",
       "6          9  2.397895          Litecoin\n",
       "12        10  1.098612             Maker\n",
       "7         11  1.609438            Monero\n",
       "9         12  2.079442           Stellar\n",
       "8         13  1.791759              TRON"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_asset_details = pd.read_csv(ASSET_DETAILS_CSV).sort_values(\"Asset_ID\")\n",
    "df_asset_details"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "df_train['datetime'] = pd.to_datetime(df_train['timestamp'], unit='s')\n",
    "df_train = df_train[df_train['datetime'] >= '2020-01-01 00:00:00'].copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(12228898, 11)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Timestamp('2021-09-21 00:00:00')"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train['datetime'].max()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "timestamp         0\n",
       "Asset_ID          0\n",
       "Count             0\n",
       "Open              0\n",
       "High              0\n",
       "Low               0\n",
       "Close             0\n",
       "Volume            0\n",
       "VWAP              9\n",
       "Target       262453\n",
       "datetime          0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_train.isna().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Define Helper Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "tags": [
     "functions"
    ]
   },
   "outputs": [],
   "source": [
    "# define the evaluation metric\n",
    "def weighted_correlation(a, train_data):\n",
    "    \n",
    "    weights = train_data.add_w.values.flatten()\n",
    "    b = train_data.get_label()\n",
    "    \n",
    "    \n",
    "    w = np.ravel(weights)\n",
    "    a = np.ravel(a)\n",
    "    b = np.ravel(b)\n",
    "\n",
    "    sum_w = np.sum(w)\n",
    "    mean_a = np.sum(a * w) / sum_w\n",
    "    mean_b = np.sum(b * w) / sum_w\n",
    "    var_a = np.sum(w * np.square(a - mean_a)) / sum_w\n",
    "    var_b = np.sum(w * np.square(b - mean_b)) / sum_w\n",
    "\n",
    "    cov = np.sum((a * b * w)) / np.sum(w) - mean_a * mean_b\n",
    "    corr = cov / np.sqrt(var_a * var_b)\n",
    "\n",
    "    return 'eval_wcorr', corr, True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "tags": [
     "functions"
    ]
   },
   "outputs": [],
   "source": [
    "def RSI(df, n):\n",
    "    return talib.RSI(df['Close'], n)\n",
    "\n",
    "def ATR(df, n):\n",
    "    return talib.ATR(df[\"High\"], df.Low, df.Close, n)\n",
    "\n",
    "#Create a function to calculate the Double Exponential Moving Average (DEMA)\n",
    "def DEMA(data, time_period):\n",
    "    #Calculate the Exponential Moving Average for some time_period (in days)\n",
    "    EMA = data['Close'].ewm(span=time_period, adjust=False).mean()\n",
    "    #Calculate the DEMA\n",
    "    DEMA = 2*EMA - EMA.ewm(span=time_period, adjust=False).mean()\n",
    "    return DEMA\n",
    "\n",
    "def upper_shadow(df):\n",
    "    return df['High'] - np.maximum(df['Close'], df['Open'])\n",
    "\n",
    "def lower_shadow(df):\n",
    "    return np.minimum(df['Close'], df['Open']) - df['Low']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Feature Engineering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "tags": [
     "block:feature_engineering",
     "prev:load_data"
    ]
   },
   "outputs": [],
   "source": [
    "def get_features(df, \n",
    "                 asset_id, \n",
    "                 train=True):\n",
    "    '''\n",
    "    This function takes a dataframe with all asset data and return the lagged features for a single asset.\n",
    "    \n",
    "    df - Full dataframe with all assets included\n",
    "    asset_id - integer from 0-13 inclusive to represent a cryptocurrency asset\n",
    "    train - True - you are training your model\n",
    "          - False - you are submitting your model via api\n",
    "    '''\n",
    "    # filter based on asset id\n",
    "    df = df[df['Asset_ID']==asset_id]\n",
    "    \n",
    "    # sort based on time stamp\n",
    "    df = df.sort_values('timestamp')\n",
    "    \n",
    "    if train == True:\n",
    "        df_feat = df.copy()\n",
    "        \n",
    "        # define a train_flg column to split your data into train and validation\n",
    "        totimestamp = lambda s: np.int32(time.mktime(datetime.datetime.strptime(s, \"%d/%m/%Y\").timetuple()))\n",
    "        valid_window = [totimestamp(\"01/05/2021\")]\n",
    "        \n",
    "        df_feat['train_flg'] = np.where(df_feat['timestamp']>=valid_window[0], 0,1)\n",
    "        df_feat = df_feat[['timestamp','Asset_ID', 'High', 'Low', 'Open', 'Close', 'Volume','Target','train_flg']].copy()\n",
    "    else:\n",
    "        df = df.sort_values('row_id')\n",
    "        df_feat = df[['Asset_ID', 'High', 'Low', 'Open', 'Close', 'Volume','row_id']].copy()\n",
    "        \n",
    "    for i in tqdm([30, 120, 240]):\n",
    "        # creating technical indicators\n",
    "        df_feat[f'RSI_{i}'] = RSI(df_feat, i)\n",
    "        df_feat[f'ATR_{i}'] = ATR(df_feat, i)\n",
    "        df_feat[f'DEMA_{i}'] = DEMA(df_feat, i)\n",
    "\n",
    "    for i in tqdm([30, 120, 240]):\n",
    "        # creating lag features\n",
    "        df_feat[f'sma_{i}'] = df_feat['Close'].rolling(i).mean()/df_feat['Close'] -1\n",
    "        df_feat[f'return_{i}'] = df_feat['Close']/df_feat['Close'].shift(i) -1\n",
    "    \n",
    "    # new featu# creating technical indicators featureses\n",
    "    df_feat['HL'] = np.log(df_feat['High'] - df_feat['Low'])\n",
    "    df_feat['OC'] = np.log(df_feat['Close'] - df_feat['Open'])\n",
    "    \n",
    "    df_feat['lower_shadow'] = np.log(lower_shadow(df)) \n",
    "    df_feat['upper_shadow'] = np.log(upper_shadow(df))\n",
    "    \n",
    "    # replace inf with nan\n",
    "    df_feat.replace([np.inf, -np.inf], np.nan, inplace=True)\n",
    "    \n",
    "    # datetime features\n",
    "    df_feat['Date'] = pd.to_datetime(df_feat['timestamp'], unit='s')\n",
    "    df_feat['Day'] = df_feat['Date'].dt.weekday.astype(np.int32)\n",
    "    df_feat[\"dayofyear\"] = df_feat['Date'].dt.dayofyear\n",
    "    df_feat[\"weekofyear\"] = df_feat['Date'].dt.weekofyear\n",
    "    df_feat[\"season\"] = ((df_feat['Date'].dt.month)%12 + 3)//3\n",
    "\n",
    "    df_feat = df_feat.drop(['Open','Close','High','Low', 'Volume', 'Date'], axis=1)\n",
    "    \n",
    "    # fill nan values with 0\n",
    "    df_feat = df_feat.fillna(0)\n",
    "    return df_feat"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 3/3 [00:00<00:00,  7.64it/s]\n",
      "100%|██████████| 3/3 [00:00<00:00, 12.66it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 3/3 [00:00<00:00,  7.61it/s]\n",
      "100%|██████████| 3/3 [00:00<00:00, 16.32it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 3/3 [00:00<00:00,  9.89it/s]\n",
      "100%|██████████| 3/3 [00:00<00:00, 22.05it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 3/3 [00:00<00:00,  8.52it/s]\n",
      "100%|██████████| 3/3 [00:00<00:00, 17.92it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 3/3 [00:00<00:00, 10.35it/s]\n",
      "100%|██████████| 3/3 [00:00<00:00, 22.56it/s]\n"
     ]
    }
   ],
   "source": [
    "# create your feature dataframe for each asset and concatenate\n",
    "feature_df = pd.DataFrame()\n",
    "for i in range(14):\n",
    "    print(i)\n",
    "    feature_df = pd.concat([feature_df,get_features(df_train,i,train=True)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Merge Assets Features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "block:merge_assets_features",
     "prev:load_data",
     "prev:feature_engineering"
    ]
   },
   "outputs": [],
   "source": [
    "# assign weight column feature dataframe\n",
    "feature_df = pd.merge(feature_df, df_asset_details[['Asset_ID','Weight']], how='left', on=['Asset_ID'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "feature_df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Modelling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "block:modelling",
     "prev:merge_assets_features"
    ]
   },
   "outputs": [],
   "source": [
    "# define features for LGBM\n",
    "features = ['Asset_ID', 'RSI_30', 'ATR_30',\n",
    "       'DEMA_30', 'RSI_120', 'ATR_120', 'DEMA_120', 'RSI_240', 'ATR_240',\n",
    "       'DEMA_240', 'sma_30', 'return_30', 'sma_120', 'return_120', 'sma_240',\n",
    "       'return_240', 'HL', 'OC', 'lower_shadow', 'upper_shadow', 'Day',\n",
    "       'dayofyear', 'weekofyear', 'season']\n",
    "categoricals = ['Asset_ID']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# define train and validation weights and datasets\n",
    "weights_train = feature_df.query('train_flg == 1')[['Weight']]\n",
    "weights_test = feature_df.query('train_flg == 0')[['Weight']]\n",
    "\n",
    "train_dataset = lgb.Dataset(feature_df.query('train_flg == 1')[features], \n",
    "                            feature_df.query('train_flg == 1')['Target'].values, \n",
    "                            feature_name = features,\n",
    "                           categorical_feature= categoricals)\n",
    "val_dataset = lgb.Dataset(feature_df.query('train_flg == 0')[features], \n",
    "                          feature_df.query('train_flg == 0')['Target'].values, \n",
    "                          feature_name = features,\n",
    "                         categorical_feature= categoricals)\n",
    "\n",
    "train_dataset.add_w = weights_train\n",
    "val_dataset.add_w = weights_test\n",
    "\n",
    "evals_result = {}\n",
    "params = {'n_estimators': int(N_EST),\n",
    "        'objective': 'regression',\n",
    "        'metric': 'rmse',\n",
    "        'boosting_type': 'gbdt',\n",
    "        'max_depth': -1, \n",
    "        'learning_rate': float(LR),\n",
    "        'seed': 2022,\n",
    "        'verbose': -1,\n",
    "        }\n",
    "\n",
    "# train LGBM2\n",
    "model = lgb.train(params = params,\n",
    "                  train_set = train_dataset, \n",
    "                  valid_sets = [val_dataset],\n",
    "                  early_stopping_rounds=60,\n",
    "                  verbose_eval = 30,\n",
    "                  feval=weighted_correlation,\n",
    "                  evals_result = evals_result \n",
    "                 )\n",
    "\n",
    "joblib.dump(model, 'lgb.jl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "fea_imp = pd.DataFrame({'imp':model.feature_importance(), 'col': features})\n",
    "fea_imp = fea_imp.sort_values(['imp', 'col'], ascending=[True, False]).iloc[-30:]\n",
    "_ = fea_imp.plot(kind='barh', x='col', y='imp', figsize=(20, 10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "block:evaluation_result",
     "prev:modelling"
    ]
   },
   "outputs": [],
   "source": [
    "model = joblib.load('lgb.jl')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "root_mean_squared_error = model.best_score.get('valid_0').get('rmse')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "weighted_correlation = model.best_score.get('valid_0').get('eval_wcorr')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "pipeline-metrics"
    ]
   },
   "outputs": [],
   "source": [
    "print(root_mean_squared_error)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "pipeline-metrics"
    ]
   },
   "outputs": [],
   "source": [
    "print(weighted_correlation)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "kubeflow_notebook": {
   "autosnapshot": true,
   "experiment": {
    "id": "2efb8e27-3b2e-439b-a53c-b1f9d7b94cfc",
    "name": "g-research-crypto-forecasting"
   },
   "experiment_name": "g-research-crypto-forecasting",
   "katib_metadata": {
    "algorithm": {
     "algorithmName": "grid"
    },
    "maxFailedTrialCount": 3,
    "maxTrialCount": 12,
    "objective": {
     "objectiveMetricName": "",
     "type": "minimize"
    },
    "parallelTrialCount": 3,
    "parameters": []
   },
   "katib_run": false,
   "pipeline_description": "forecasting short term returns in 14 popular cryptocurrencies.",
   "pipeline_name": "g-research-crypto-forecasting-pipeline",
   "snapshot_volumes": true,
   "steps_defaults": [
    "label:access-ml-pipeline:true",
    "label:kaggle-secret:true",
    "label:access-rok:true"
   ],
   "volume_access_mode": "rwm",
   "volumes": [
    {
     "annotations": [],
     "mount_point": "/home/jovyan",
     "name": "test-workspace-6lhtr",
     "size": 15,
     "size_type": "Gi",
     "snapshot": false,
     "type": "clone"
    }
   ]
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}