new PR for XGBoost due to problems with history rewrite

Puneith Kaul 2018-08-21 18:49:50 -07:00
parent e6b6730650
commit ecc1aab0e4
9 changed files with 468 additions and 0 deletions

37
xgboost/Dockerfile Normal file

@ -0,0 +1,37 @@
# Copyright 2018 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Use the official Python 2.7 image as the parent image
FROM python:2.7
# Install build tools in a single layer so the package index is never stale
RUN apt-get update && apt-get install -y git make g++
# Build XGBoost
RUN git clone --recursive https://github.com/dmlc/xgboost && \
    cd xgboost && \
    make -j4 && \
    cd python-package && \
    python setup.py install
# Download code and install dependencies
RUN git clone https://github.com/kubeflow/examples.git && \
    cd examples/xgboost && \
    pip install -r seldon_serve/requirements.txt
ENTRYPOINT ["python", "examples/xgboost/housing.py"]

211
xgboost/README.md Normal file

@ -0,0 +1,211 @@
# Ames housing value prediction using XGBoost on Kubeflow
In this example we demonstrate how to use Kubeflow with XGBoost on the [Kaggle Ames Housing Prices prediction](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/) dataset. We walk through how to implement, train, and serve the model. You can run the same workload on-prem or on any cloud provider; here we use [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine/) to show how the end-to-end workflow runs on [Google Cloud Platform](https://cloud.google.com/).
# Prerequisites
To run this setup on Google Cloud Platform, make sure you have enabled [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine/). You will also need to install
[Docker](https://docs.docker.com/install/) and [gcloud](https://cloud.google.com/sdk/downloads). This setup can run on-prem and on any cloud provider, but here we demonstrate the GCP option. Finally, follow the [instructions](https://www.kubeflow.org/docs/started/getting-started-gke/) to create a GKE cluster.
# Steps
* [Kubeflow Setup](#kubeflow-setup)
* [Data Preparation](#data-preparation)
* [Dockerfile](#dockerfile)
* [Model Training on GKE](#model-training-on-gke)
* [Model Export](#model-export)
* [Model Serving Locally](#model-serving-locally)
* [Deploying Model to Kubernetes Cluster](#model-serving-on-gke)
## Kubeflow Setup
In this part you will set up Kubeflow on an existing Kubernetes cluster. Check out the Kubeflow [getting started guide](https://www.kubeflow.org/docs/started/getting-started/).
## Data Preparation
You can download the dataset from the [Kaggle competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data). For convenience, we have also uploaded the dataset to GCS:
```
gs://kubeflow-examples-data/ames_dataset/
```
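The training script in this example (`housing.py`) drops rows without a `SalePrice`, keeps only the numeric columns, and fills the remaining gaps with scikit-learn's `Imputer`, whose default strategy is the column mean. The sketch below illustrates that mean-imputation idea with the standard library only. The helper name `impute_column_means` and the sample rows are made up for illustration and are not part of the example code.

```python
def impute_column_means(rows):
    """Replace None entries with the mean of the known values in that column.

    `rows` is a list of equal-length lists holding numbers or None.
    Hypothetical helper for illustration; housing.py uses sklearn's Imputer.
    """
    n_cols = len(rows[0])
    means = []
    for col in range(n_cols):
        known = [row[col] for row in rows if row[col] is not None]
        # Assumes every column has at least one known value.
        means.append(sum(known) / float(len(known)))
    return [
        [means[col] if value is None else value for col, value in enumerate(row)]
        for row in rows
    ]


# Made-up rows of (LotArea, YearBuilt) with one missing LotArea.
sample = [[8450, 2003], [None, 1976], [11250, 2001]]
filled = impute_column_means(sample)
```

Note that the real script fits the imputer on the training split only and reuses those means for the test split; this sketch works on a single table to keep the idea visible.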
## Dockerfile
This example includes a Dockerfile that you can use to build a Docker
image. We have also uploaded a prebuilt image to gcr.io, which you can
pull directly. To build your own image, first set the image name and version:
```
IMAGE_NAME=ames-housing
VERSION=v1
```
Use the `gcloud` command to get your GCP project ID:
```
PROJECT_ID=`gcloud config get-value project`
```
Let's create a docker image from our Dockerfile
```
docker build -t gcr.io/$PROJECT_ID/${IMAGE_NAME}:${VERSION} .
```
Once the above command succeeds, you should see the Docker
image on your local machine when you run `docker images`. Next, upload the image to
[Google Container Registry](https://cloud.google.com/container-registry/):
```
gcloud auth configure-docker
docker push gcr.io/${PROJECT_ID}/${IMAGE_NAME}:${VERSION}
```
A public copy is available at `gcr.io/kubeflow-examples/ames-housing:v1`.
## Model training on GKE
In this section we run the above Docker container on [Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine/). Training takes three steps:
* Create a GKE cluster
* Create a Persistent Volume
  * Follow the instructions [here](https://kubernetes.io/docs/tasks/configure-pod-container/configure-persistent-volume-storage/). You will need to run the following `kubectl create` commands in order to get the `claim` attached to the `pod`.
```
kubectl create -f py-volume.yaml
kubectl create -f py-claim.yaml
```
* Run the Docker container on GKE
  * Use the `kubectl` command to run the image on GKE
```
kubectl create -f py-pod.yaml
```
Once the pod completes, the XGBoost model is available on the Persistent Volume at `/mnt/xgboost/housing.dat`.
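`py-pod.yaml` asks for up to 30000 trees but also passes `--early-stopping-rounds`, and the training script fits the model with early stopping against an evaluation set. As a rough illustration of the idea (not XGBoost's actual implementation), the sketch below scans a list of per-round evaluation errors and stops once no improvement has been seen for `patience` rounds. The function name and the sample errors are hypothetical.

```python
def best_round_with_early_stopping(eval_errors, patience):
    """Return (best_index, best_error), stopping after `patience` rounds without improvement."""
    best_index, best_error = 0, eval_errors[0]
    for index, error in enumerate(eval_errors[1:], start=1):
        if error < best_error:
            # A new best round resets the patience window.
            best_index, best_error = index, error
        elif index - best_index >= patience:
            # No improvement for `patience` rounds: stop scanning.
            break
    return best_index, best_error


# Made-up evaluation errors, one per boosting round.
errors = [10.0, 8.0, 7.5, 7.6, 7.7, 7.4, 7.3]
result = best_round_with_early_stopping(errors, patience=2)
patient_result = best_round_with_early_stopping(errors, patience=5)
```

With a small `patience` the scan stops at the first plateau and never sees the later, better rounds; a larger value trades extra training time for a chance to get past such plateaus.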
## Model Export
The model is exported to the path passed as `--model-file`, which `py-pod.yaml` sets to `/mnt/xgboost/housing.dat`. We will use [Seldon Core](https://github.com/SeldonIO/seldon-core/) to serve the model asset. To make the model servable we have created `xgboost/seldon_serve` with the following assets:
* `HousingServe.py`
* `housing.dat`
* `requirements.txt`
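`HousingServe.py` defines a class with a `predict(X, feature_names)` method that returns a nested list, which is the shape the Seldon Core Python wrapper used in this example calls into. The sketch below mirrors that shape with a stand-in model so it runs without `housing.dat`. `ConstantModel`, `HousingServeSketch`, and the returned price are made up for illustration.

```python
class ConstantModel(object):
    """Hypothetical stand-in for the trained XGBoost model."""

    def predict(self, rows):
        # Always return the same made-up price, one value per input row.
        return [100000.0 for _ in rows]


class HousingServeSketch(object):
    def __init__(self, model=None):
        # The real class loads housing.dat with joblib instead.
        self.model = model or ConstantModel()

    def predict(self, X, feature_names):
        """Return predictions as a nested list for a single request."""
        prediction = self.model.predict(X)
        # Mirrors the workaround in HousingServe.py: the first value is repeated.
        return [[prediction[0], prediction[0]]]


# One request row with 37 features, matching the sample requests in this README.
served = HousingServeSketch().predict([[0.0] * 37], None)
```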
## Model Serving Locally
We are going to use [seldon-core](https://github.com/SeldonIO/seldon-core/) to serve the model. [HousingServe.py](seldon_serve/HousingServe.py) contains the code to serve the model. Run the following command to create a microservice
```
docker run -v $(pwd):/seldon_serve seldonio/core-python-wrapper:0.7 /seldon_serve HousingServe 0.1 gcr.io --base-image=python:3.6 --image-name=${PROJECT_ID}/housingserve
```
Let's build the Seldon Core microservice image. You can find Seldon Core model wrapping details [here](https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python.md).
```
cd build
./build_image.sh
```
You should now see the Docker image `gcr.io/${PROJECT_ID}/housingserve` on your machine; it can be run locally to serve the model. Before running the image, push it to `gcr.io`:
```
gcloud auth configure-docker
docker push gcr.io/${PROJECT_ID}/housingserve:0.1
```
Let's run the docker image now
```
docker run -p 5000:5000 gcr.io/${PROJECT_ID}/housingserve:0.1
```
Now you are ready to send requests to `localhost:5000`:
```
curl -H "Content-Type: application/x-www-form-urlencoded" -d 'json={"data":{"tensor":{"shape":[1,37],"values":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37]}}}' http://localhost:5000/predict
```
```
{
  "data": {
    "names": [
      "t:0",
      "t:1"
    ],
    "tensor": {
      "shape": [
        1,
        2
      ],
      "values": [
        97522.359375,
        97522.359375
      ]
    }
  }
}
```
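The same request body can be built from Python. The sketch below only constructs the form-encoded body that the `curl` command sends, with the JSON document in a form field named `json`. `build_predict_body` is a hypothetical helper name, the code assumes Python 3, and actually posting the body (for example with `urllib.request.urlopen`) is left out so the snippet runs without a server.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_predict_body(values):
    """Build the form-encoded body for a single-row prediction request."""
    values = list(values)
    payload = {"data": {"tensor": {"shape": [1, len(values)], "values": values}}}
    # The JSON document travels in a form field named "json", as in the curl call.
    return urlencode({"json": json.dumps(payload)})


# 37 feature values, matching the sample request.
body = build_predict_body(range(1, 38))

# Decode the body again to confirm it round-trips to the expected tensor.
decoded = json.loads(parse_qs(body)["json"][0])
```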
## Model serving on GKE
Kubernetes runs in many environments: locally, on-prem, and in the cloud. Here we show how to serve the model on Google Kubernetes Engine. First, start a GKE cluster.
Deploy Seldon Core to your GKE cluster by following the instructions in the Deploy Seldon Core section [here](https://github.com/kubeflow/examples/blob/fb2fb26f710f7c03996f08d81607f5ebf7d5af09/github_issue_summarization/serving_the_model.md#deploy-seldon-core). Once everything is deployed, verify it with `kubectl get pods -n ${NAMESPACE}`:
```
NAME                                      READY     STATUS    RESTARTS   AGE
ambassador-849fb9c8c5-5kx6l               2/2       Running   0          16m
ambassador-849fb9c8c5-pww4j               2/2       Running   0          16m
ambassador-849fb9c8c5-zn6gl               2/2       Running   0          16m
redis-75c969d887-fjqt8                    1/1       Running   0          30s
seldon-cluster-manager-6c78b7d6c9-6qhtg   1/1       Running   0          30s
spartakus-volunteer-66cc8ccd5b-9f8tw      1/1       Running   0          16m
tf-hub-0                                  1/1       Running   0          16m
tf-job-dashboard-7b57c549c8-bfpp8         1/1       Running   0          16m
tf-job-operator-594d8c7ddd-lqn8r          1/1       Running   0          16m
Deploy the XGBoost model
```
ks generate seldon-serve-simple xgboost-ames \
  --name=xgboost-ames \
  --image=gcr.io/${PROJECT_ID}/housingserve:0.1 \
  --namespace=${NAMESPACE} \
  --replicas=1
ks apply ${KF_ENV} -c xgboost-ames
```
## Sample request and response
Seldon Core uses Ambassador to route its requests. To send requests to the model, you can port-forward an Ambassador pod to your local machine:
```
kubectl port-forward $(kubectl get pods -n ${NAMESPACE} -l service=ambassador -o jsonpath='{.items[0].metadata.name}') -n ${NAMESPACE} 8080:80
```
Now you are ready to send requests to `localhost:8080`:
```
curl -H "Content-Type: application/x-www-form-urlencoded" -d 'json={"data":{"tensor":{"shape":[1,37],"values":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37]}}}' http://localhost:8080/predict
```
```
{
  "data": {
    "names": [
      "t:0",
      "t:1"
    ],
    "tensor": {
      "shape": [
        1,
        2
      ],
      "values": [
        97522.359375,
        97522.359375
      ]
    }
  }
}
```
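Because `HousingServe.predict` returns the same value twice to satisfy the wrapper, a client only needs the first entry of `values`. A minimal sketch of reading the response, assuming the JSON layout shown above; `extract_prediction` is a hypothetical helper and not part of the example code:

```python
import json


def extract_prediction(response_text):
    """Return the first predicted value from a tensor-style JSON response."""
    tensor = json.loads(response_text)["data"]["tensor"]
    rows, cols = tensor["shape"]
    # Guard against a response whose values do not match the declared shape.
    if len(tensor["values"]) != rows * cols:
        raise ValueError("tensor values do not match the declared shape")
    return tensor["values"][0]


# The sample response from this README, collapsed onto one line.
sample_response = (
    '{"data": {"names": ["t:0", "t:1"], '
    '"tensor": {"shape": [1, 2], "values": [97522.359375, 97522.359375]}}}'
)
price = extract_prediction(sample_response)
```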

121
xgboost/housing.py Normal file

@ -0,0 +1,121 @@
# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error


def read_input(file_name, test_size=0.25):
    """Read input data and split it into train and test."""
    # --train-input is parsed with nargs='+', so file_name is a list;
    # only the first file is read.
    data = pd.read_csv(file_name[0])
    data.dropna(axis=0, subset=['SalePrice'], inplace=True)

    y = data.SalePrice
    X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])

    train_X, test_X, train_y, test_y = train_test_split(X.values,
                                                        y.values,
                                                        test_size=test_size)

    # Fit the imputer on the training split only, then reuse it for the test split.
    imputer = Imputer()
    train_X = imputer.fit_transform(train_X)
    test_X = imputer.transform(test_X)

    return (train_X, train_y), (test_X, test_y)


def train_model(train_X,
                train_y,
                test_X,
                test_y,
                n_estimators,
                learning_rate,
                early_stopping_rounds=40):
    """Train the model using XGBRegressor."""
    model = XGBRegressor(n_estimators=n_estimators,
                         learning_rate=learning_rate)

    # Stop adding trees once the eval metric has not improved for
    # early_stopping_rounds consecutive rounds.
    model.fit(train_X,
              train_y,
              early_stopping_rounds=early_stopping_rounds,
              eval_set=[(test_X, test_y)])

    print("Best RMSE on eval: {:.2f} with {} rounds".format(
        model.best_score,
        model.best_iteration+1))
    return model


def eval_model(model, test_X, test_y):
    """Evaluate the model performance."""
    predictions = model.predict(test_X)
    print("")
    print("MAE on test: {:.2f}".format(mean_absolute_error(test_y, predictions)))


def save_model(model, model_file):
    """Save XGBoost model for serving."""
    joblib.dump(model, model_file)
    print("Model export success {}".format(model_file))


def main(args):
    (train_X, train_y), (test_X, test_y) = read_input(args.train_input,
                                                      args.test_size)
    model = train_model(train_X,
                        train_y,
                        test_X,
                        test_y,
                        args.n_estimators,
                        args.learning_rate,
                        args.early_stopping_rounds)

    eval_model(model, test_X, test_y)
    save_model(model, args.model_file)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--train-input',
        help="Input training file",
        nargs='+',
        required=True
    )
    parser.add_argument(
        '--n-estimators',
        help='Number of trees in the model',
        type=int,
        default=1000
    )
    parser.add_argument(
        '--learning-rate',
        help='Learning rate for the model',
        type=float,
        default=0.1
    )
    parser.add_argument(
        '--model-file',
        help='Model file location for XGBoost',
        required=True
    )
    parser.add_argument(
        '--test-size',
        help='Fraction of training data to be reserved for test',
        type=float,
        default=0.25
    )
    parser.add_argument(
        '--early-stopping-rounds',
        help='XGBoost argument for stopping early',
        type=int,
        default=50
    )

    args = parser.parse_args()
    main(args)

11
xgboost/py-claim.yaml Normal file

@ -0,0 +1,11 @@
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: claim
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi

30
xgboost/py-pod.yaml Normal file

@ -0,0 +1,30 @@
apiVersion: v1
kind: Pod
metadata:
  name: xgboost
  labels:
    team: platform
spec:
  containers:
  - name: xgboost
    image: gcr.io/cloudmlplat/ames-housing:v1
    volumeMounts:
    - mountPath: "/mnt/xgboost"
      name: datadir
    args:
      - --train-input
      - /mnt/xgboost/ames_dataset/train.csv
      - --model-file
      - /mnt/xgboost/housing.dat
      - --learning-rate
      - "0.1"
      - --n-estimators
      - "30000"
      - --early-stopping-rounds
      - "50"
  volumes:
  - name: datadir
    persistentVolumeClaim:
      claimName: claim

14
xgboost/py-volume.yaml Normal file

@ -0,0 +1,14 @@
kind: PersistentVolume
apiVersion: v1
metadata:
  name: vol
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/stateful_partition/data2/"

View File

@ -0,0 +1,37 @@
# Copyright 2018 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import joblib
import numpy as np


class HousingServe(object):

    def __init__(self, model_file='housing.dat'):
        """Load the housing model using joblib."""
        self.model = joblib.load(model_file)

    def predict(self, X, feature_names):
        """Predict using the model for given ndarray."""
        prediction = self.model.predict(data=X)
        # added this temporarily to keep seldon happy
        # TODO: Fix https://github.com/SeldonIO/seldon-core/blob/master/wrappers/python/model_microservice.py#L55
        return [[prediction.item(0), prediction.item(0)]]

    def sample_test(self):
        """Generate a random sample feature."""
        # One row of 37 random feature values, matching the model's input width.
        return np.random.rand(1, 37)


if __name__ == '__main__':
    serve = HousingServe()
    print(serve.predict(serve.sample_test(), None))

Binary file not shown.

View File

@ -0,0 +1,7 @@
joblib==0.12.0
numpy==1.14.4
pandas==0.23.0
scikit-learn==0.19.1
scipy==1.1.0
sklearn==0.0
xgboost==0.72.1