* Trigger unittests on postsubmit and periodic runs.
* Rename the unittests workflow because it's running unittests, not E2E tests.
Fix #510
* Shorten the name; otherwise step names become too long.
* [mnist] Add support for S3 in TensorBoard component; Update docs.
* [mnist] reverted autonumbering in README
* [mnist] add expected fail for predict_test, until it's fixed
* Add e2e test for xgboost housing example
* fix typo
add ks apply
add [
modify example to trigger tests
add prediction test
add xgboost ks param
rename the job name without _
use - instead of _
libsonnet params
rm redundant component
rename component in prow config
add ames-hoursing-env
use - for all names
use _ for params names
use xgboost_ames_accross
rename component name
shorten the name
change deploy-test command
change to xgboost-
namespace
init ks app
fix type
add conftest.py
change path
change deploy command
change dep
change the query URL for seldon
add ks_app with seldon lib
update ks_app
use ks init only
rerun
change to kf-v0-4-n00 cluster
add ks_app
use ks-13
remove --namespace
use kubeflow as namespace
delete seldon deployment
simplify ks_app
retry on 503
fix typo
query 1285
move deletion after prediction
wait 10s
always retry till 10 mins
move check to retry
fix pylint
move clean-up to the delete template
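The retry behaviour described in the commits above (retry on 503, wait 10s, keep retrying for up to 10 minutes) could be sketched roughly as follows. This is a minimal illustration, not the code from the test itself; the helper name `retry_until` and its parameters are hypothetical.

```python
import time


def retry_until(call, should_retry, timeout_s=600, wait_s=10,
                sleep=time.sleep, clock=time.monotonic):
    """Call `call()` until `should_retry(result)` is False or time runs out.

    Hypothetical helper: the defaults mirror the "wait 10s" and
    "retry till 10 mins" commits, but the real test may differ.
    """
    deadline = clock() + timeout_s
    while True:
        result = call()
        if not should_retry(result):
            return result
        if clock() >= deadline:
            raise TimeoutError("still retryable after %s seconds" % timeout_s)
        sleep(wait_s)
```

A caller would pass a function that issues the prediction request and returns its HTTP status, together with a predicate such as `lambda status: status == 503`.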
* set up xgboost component
* check in ks component & run it directly
* change comments
* add comment on why we use 'ks delete'
* add two modules to pylint whitelist
* ignore tf_operator/py
* disable pylint per line
* reorder import
* Create an E2E test for TFServing using the rest API
* We use the pytest framework because
1. it has really good support for using command line arguments
2. it can emit a JUnit XML file to report results to prow.
Related to #270: Create a generic test runner
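As an illustration of the two pytest features mentioned above, a `conftest.py` along these lines would expose command line arguments to the tests. The option names `--namespace` and `--service` and their defaults are assumptions made for this sketch, not the options the actual test defines.

```python
# conftest.py (sketch): option names and defaults below are hypothetical.


def pytest_addoption(parser):
    """Register custom command line options with pytest."""
    parser.addoption("--namespace", action="store", default="kubeflow",
                     help="Namespace the model server runs in (assumed).")
    parser.addoption("--service", action="store", default="mnist",
                     help="Name of the K8s service to query (assumed).")


def pytest_generate_tests(metafunc):
    """Pass each option to any test that names it as an argument."""
    for name in ("namespace", "service"):
        if name in metafunc.fixturenames:
            metafunc.parametrize(name, [metafunc.config.getoption(name)])

# Running with pytest's built-in --junitxml flag writes the JUnit XML report:
#   pytest predict_test.py --namespace=kubeflow --junitxml=junit_predict.xml
```

The file names in the final comment are placeholders; only the `--junitxml` flag itself is part of pytest.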
* Address comments.
* Fix lint.
* Add retries to the prediction.
* Add some comments.
* Fix model path.
* Fix the workflow labels
* Set the K8s service name correctly on the test.
* Fix the workflow.
* Fix lint.
* Update model inference wrapping to use S2I and update docs
* Add s2i reference in docs
* Fix typo highlighted in review
* Add pylint annotation to allow protected-access on the Keras make predict function method
* Refactor Python module:
- Replace MPI by GLOO as backend to avoid having to recompile PyTorch
- Replace DistributedDataParallel() class with official version when using GPUs
- Remove unnecessary method to disable logs in workers
- Refactor run()
* Simplify Dockerfile by using the official PyTorch 0.4 image with CUDA and remove mpirun call
* Add the web-ui for the mnist example
Copy the mnist web app from
https://github.com/googlecodelabs/kubeflow-introduction
* Update the web app
* Change "server-name" argument to "model-name" because this is what
it is.
* Update the prediction client code; The prediction code was copied
from https://github.com/googlecodelabs/kubeflow-introduction and
that model used slightly different values for the input names
and outputs.
* Add a test for the mnist_client code; currently it needs to be run
manually.
* Fix the label selector for the mnist service so that it matches the
TFServing deployment.
* Delete the old copy of mnist_client.py; we will go with the copy in web-ui from https://github.com/googlecodelabs/kubeflow-introduction
* Delete model-deploy.yaml, model-train.yaml, and tf-user.yaml.
The K8s resources for training and deploying the model are now in ks_app.
* Fix tensorboard; tensorboard only partially works behind Ambassador. It seems like some requests don't work behind a reverse proxy.
* Fix lint.
* Add the TFServing component
* Create TFServing components.
* The model.py code doesn't appear to be exporting a model in saved model
format; it was missing a call to export.
* I'm not sure how this ever worked.
* It also looks like there is a bug in the code in that it's using the cnn input fn even if the model is the linear one. I'm going to leave that as is for now.
* Create a namespace for each test run; delete the namespace on teardown
* We need to copy the GCP service account key to the new namespace.
* Add a shell script to do that.
* Update training to use Kubeflow 0.4 and add testing.
* To support testing we need to create a ksonnet template to train
the model so we can easily substitute in different parameters during
training.
* We create a ksonnet component for just training; we don't use Argo.
This makes the example much simpler.
* To support S3 we add a generic ksonnet parameter to take environment
variables as a comma separated list of variables. This should make it
easy for users to set the environment variables needed to talk to S3.
This is compatible with the existing Argo workflow which supports S3.
* By default the training job runs non-distributed; this is because to
run distributed the user needs a shared filesystem (e.g. S3/GCS/NFS).
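To make the comma separated environment variable parameter concrete, the sketch below shows one way such a string could be turned into the `name`/`value` entries a Kubernetes container spec expects. It assumes each item has the form `NAME=VALUE`; the function name and the sample variables are hypothetical, and the actual ksonnet template may parse the parameter differently.

```python
def parse_env_variables(spec):
    """Turn "A=1,B=2" into [{"name": "A", "value": "1"}, ...].

    Assumes NAME=VALUE items separated by commas (an assumption made
    for this sketch); empty items are skipped.
    """
    env = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, value = item.partition("=")
        if not sep or not name:
            raise ValueError("expected NAME=VALUE, got %r" % item)
        env.append({"name": name, "value": value})
    return env


# Example with made-up, non-secret settings.
print(parse_env_variables("S3_ENDPOINT=s3.example.com,AWS_REGION=us-west-2"))
```

Only non-secret settings are shown here; credentials belong in a Kubernetes secret rather than in a parameter string.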
* Update the mnist workflow to correctly build the images.
* We didn't update the workflow in the previous example to actually
build the correct images.
* Update the workflow to run the tfjob_test
* Related to #460 E2E test for mnist.
* Add a parameter to specify a secret that can be used to mount
a secret such as the GCP service account key.
* Update the README with instructions for GCS and S3.
* Remove the instructions about Argo; the Argo workflow is outdated.
Using Argo adds complexity to the example and the thinking is to remove
that to provide a simpler example and to mirror the pytorch example.
* Add a TOC to the README
* Update prerequisite instructions.
* Delete instructions for installing Kubeflow; just link to the
getting started guide.
* Argo CLI should no longer be needed.
* GitHub token shouldn't be needed; I think that was only needed
for ksonnet to pull the registry.
* Fix instructions; access keys shouldn't be stored as ksonnet parameters
as these will get checked into source control.
* Add job_mnist_DDP_CPU for v1beta1
* Add job_mnist_DDP_GPU for v1beta1
* Update 02_distributed_training.md to use v1beta1
* Remove pytorch v1alpha2 config
* Add missing CPU training config
* Fix link to next section, training the model
* Added links to next and previous sections in training the model README
* Fix link to previous section, training the model
* Remove TODO list
* This is the first step in adding E2E tests for the mnist example.
* Add a Makefile and .jsonnet file to build the Docker images using GCB
* Define an Argo workflow to trigger the image builds on pre & post submit.
Related to: #460
* Create a test for submitting the TFJob for the GitHub issue summarization example.
* This test needs to be run manually right now. In a follow-on PR we will
integrate it into CI.
* We use the image built from Dockerfile.estimator because that is the image
we are running train_test.py in.
* Note: The current version of the code now requires Python3 (I think this
is due to an earlier PR which refactored the code into a shared
implementation for both the TF estimator and non-estimator versions).
* Create a TFJob component for TFJob v1beta1; this is the version
in KF 0.4.
TFJob component
* Upgrade to v1beta to work with 0.4
* Update command line arguments to match the versions in the current code
* input & output are now single parameters rather than separate parameters
for bucket and name
* change default input to a CSV file because the current version of the
code doesn't handle unzipping it.
* Use ks_util from kubeflow/testing
* Address comments.
* Setup continuous building of Docker images and testing for GH Issue Summarization Example.
* This is the first step in setting up a continuously running CI test.
* Add support for building the Docker images using GCB; we will use GCB
to trigger the builds from our CI system.
* Make the Makefile top level (at root of GIS example) so that we can
easily access all the different resources.
* Add a .gitignore file to avoid checking in the build directory used by
the Makefile.
* Define an Argo workflow to use as the E2E test.
Related to #92: E2E test & CI for github issue summarization
* Trigger the test on pre & post submit
* Dockerfile.estimator: don't install the data_download.sh script
* It doesn't look like we are currently using data_download.sh in the
Docker image
* It looks like it only gets used via the ksonnet job which mounts the
script via a config map
* Copying data_download.sh to the Docker image is currently weird
given the organization of the Dockerfile and context.
* Copy the test_data to the Docker images so that we can run the test
inside the images.
* Invoke the python unittest for training from our CI system.
* In a follow-on PR we will update the test to emit a JUnit XML file to
report results to prow.
* Fix image build.
* Update tfjob components to v1beta1
Remove old version of tensor2tensor component
* Combine UI into a single jsonnet file
* Upgrade GH issue summarization to kf v0.4.0-rc.2
Use latest ksonnet v0.13.1
Use latest seldon v1alpha2
Remove ksonnet app with full kubeflow platform & replace with components specific to this example.
Remove outdated scripts
Add cluster creation links to Click-to-deploy & kfctl
Add warning not to use the Training with an Estimator guide
Replace commandline with bash for better syntax highlighting
Replace messy port-forwarding commands with svc/ambassador
Add modelUrl param to ui component
Modify teardown instructions to remove the deployment
Fix grammatical mistakes
* Rearrange tfjob instructions
* An Argo workflow to use as the E2E test for code_search example.
* The workflow builds the Docker images and then runs the python test
to train and export a model
* Move common utilities into util.libsonnet.
* Add the workflow to the set of triggered workflows.
* Update the test environment used by the test ksonnet app; we've since
changed the location of the app.
Related to #295
* Refactor the jsonnet file defining the GCB build workflow
* Use an external variable to conditionally pull and use a previous
Docker image as a cache
* Reduce code duplication by building a shared template for all the different
workflows.
* BUILD_ID needs to be defined in the default parameters otherwise we get an error when adding a new environment.
* Define suitable defaults.
* create pv for pets-pv
On many users' k8s clusters, dynamic volume provisioning isn't
enabled, so newcomers may be blocked because pets-pv will stay
Pending. We can guide them to create an NFS PV as an option.
* tell user how to check if a default storage class is defined
* add link about how to create PV