rm stale agents example (#487)

Christopher Beitel 2019-01-23 16:27:50 -08:00 committed by Kubernetes Prow Robot
parent 2b0eec34c3
commit 89e960202a
27 changed files with 0 additions and 77705 deletions


@@ -1,26 +0,0 @@
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FROM tensorflow/tensorflow:1.4.1
# Needed for rendering and uploading renders
RUN apt-get update && apt-get install -y libav-tools ffmpeg git
ADD requirements.txt /app/
RUN pip install -r /app/requirements.txt
ADD trainer /app/trainer/
WORKDIR /app/
ENTRYPOINT ["python", "-m", "trainer.task"]


@@ -1,66 +0,0 @@
# [WIP] Reinforcement Learning with [tensorflow/agents](https://github.com/tensorflow/agents)
Here we provide a demonstration of training a reinforcement learning agent to perform a robotic grasping task using Kubeflow running on Google Kubernetes Engine. In this demonstration you will learn how to parameterize a training job, submit it to run on your cluster, monitor the job (including launching a TensorBoard instance), and finally produce renders of the agent performing the robotic grasping task.
For clarity and fun, you can preview the product of this tutorial by clicking through the render screenshot below to a short video of a trained agent performing a simulated robotic block-grasping task:
[![](doc/render_preview.png)](https://youtu.be/0X0w5XOtcHw)
### Setup
##### Training locally
To run the example locally, we'll need to install the necessary requirements in a local conda environment, which can be done as follows:
```bash
$ conda create -y -n dev python=2.7
$ source activate dev
$ pip install -r requirements.txt
```
The trainer can be run as follows (in this case to display information on the available parameters):
```bash
$ python -m trainer.task --help
usage: task.py [-h] [--run_mode RUN_MODE] [--logdir LOGDIR] [--hparam_set_id HPARAM_SET_ID]
[--run_base_tag RUN_BASE_TAG] [--env_processes [ENV_PROCESSES]] [--noenv_processes]
[--num_gpus NUM_GPUS] [--save_checkpoint_secs SAVE_CHECKPOINT_SECS]
[--log_device_placement [LOG_DEVICE_PLACEMENT]] [--nolog_device_placement]
[--debug [DEBUG]] [--nodebug] [--render_secs RENDER_SECS]
[--render_out_dir RENDER_OUT_DIR] [--algorithm ALGORITHM] [--num_agents NUM_AGENTS]
[--eval_episodes EVAL_EPISODES] [--env ENV] [--max_length MAX_LENGTH] [--steps STEPS]
[--network NETWORK] [--init_mean_factor INIT_MEAN_FACTOR] [--init_std INIT_STD]
[--learning_rate LEARNING_RATE] [--optimizer OPTIMIZER] [--update_epochs UPDATE_EPOCHS]
[--update_every UPDATE_EVERY] [--discount DISCOUNT] [--kl_target KL_TARGET]
[--kl_cutoff_factor KL_CUTOFF_FACTOR] [--kl_cutoff_coef KL_CUTOFF_COEF]
[--kl_init_penalty KL_INIT_PENALTY]
...
```
##### GCP and Kubeflow configuration
This tutorial assumes you have deployed a Kubernetes cluster on your provider of choice and have completed the steps described in the [Kubeflow User Guide](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md) to deploy the core, argo, and nfs components.
##### Launching base image on JupyterHub
This example is intended to be run inside of the `gcr.io/kubeflow/tensorflow-notebook-cpu` container running on JupyterHub, which is in turn running on Kubeflow. You may provide the name of this container via the spawner options dialog.
For general troubleshooting of the spawning of notebook containers on JupyterHub or anything else related to your Kubeflow deployment please refer to the [Kubeflow User Guide](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md).
There are two steps to perform from within the JupyterHub environment before the demonstration notebook can be used as intended.
First, we need to obtain the kubeflow example code as follows:
```bash
$ cd /home/jovyan
$ git clone https://github.com/kubeflow/examples kubeflow-examples
```
We will also need to authenticate our notebook environment to make calls to the underlying Kubernetes cluster. For example, if this is running on Google Kubernetes Engine the command would be as follows:
```bash
$ gcloud container clusters get-credentials {CLUSTER} --project={PROJECT} --zone={ZONE}
```
Initial setup is finished 🎉🎉 and it's time to start playing around with that shiny new demonstration notebook! You'll find it in `doc/demo.ipynb`.


@@ -1,39 +0,0 @@
apiVersion: 0.1.0
gitVersion:
  commitSha: 422d521c05aa905df949868143b26445f5e4eda5
  refSpec: master
kind: ksonnet.io/registry
libraries:
  apache:
    path: apache
    version: master
  efk:
    path: efk
    version: master
  mariadb:
    path: mariadb
    version: master
  memcached:
    path: memcached
    version: master
  mongodb:
    path: mongodb
    version: master
  mysql:
    path: mysql
    version: master
  nginx:
    path: nginx
    version: master
  node:
    path: node
    version: master
  postgres:
    path: postgres
    version: master
  redis:
    path: redis
    version: master
  tomcat:
    path: tomcat
    version: master


@@ -1,18 +0,0 @@
apiVersion: 0.1.0
gitVersion:
  commitSha: 8b48d28127cb719410a9c40be214d3c76c2b4cb7
  refSpec: master
kind: ksonnet.io/registry
libraries:
  argo:
    path: argo
    version: master
  core:
    path: core
    version: master
  tf-job:
    path: tf-job
    version: master
  tf-serving:
    path: tf-serving
    version: master


@@ -1,24 +0,0 @@
apiVersion: 0.1.0
kind: ksonnet.io/app
libraries:
  tf-job:
    gitVersion:
      commitSha: 8b48d28127cb719410a9c40be214d3c76c2b4cb7
      refSpec: master
    name: tf-job
    registry: kubeflow
name: app
registries:
  incubator:
    gitVersion:
      commitSha: 422d521c05aa905df949868143b26445f5e4eda5
      refSpec: master
    protocol: github
    uri: github.com/ksonnet/parts/tree/master/incubator
  kubeflow:
    gitVersion:
      commitSha: 8b48d28127cb719410a9c40be214d3c76c2b4cb7
      refSpec: master
    protocol: github
    uri: github.com/kubeflow/kubeflow/tree/master/kubeflow
version: 0.0.1


@@ -1,56 +0,0 @@
{
  global: {
  },
  components: {
    "train": {
      algorithm: "agents.ppo.PPOAlgorithm",
      discount: 0.995,
      dump_dependency_versions: "True",
      env: "KukaBulletEnv-v0",
      eval_episodes: 25,
      generate_data: "True",
      hparam_set_id: "pybullet_kuka_ff",
      image: "gcr.io/kubeflow-rl/agents:0405-1658-39bf",
      image_gpu: "null",
      init_mean_factor: 0.1,
      job_tag: "0206-1409-6174",
      kl_cutoff_coef: 1000,
      kl_cutoff_factor: 2,
      kl_init_penalty: 1,
      kl_target: 0.01,
      learning_rate: 0.0001,
      log_dir: "/mnt/nfs-1/train_dirs/studies/replicated-kuka-demo/kuka-0405-1707-545d",
      max_length: 1000,
      name: "kuka-0405-1707-545d",
      namespace: "kubeflow",
      network: "agents.scripts.networks.feed_forward_gaussian",
      nfs_claim_name: "nfs-1",
      num_agents: 30,
      num_cpu: 30,
      num_gpus: 0,
      num_masters: 1,
      num_ps: 1,
      num_replicas: 1,
      num_workers: 1,
      optimizer: "tensorflow.train.AdamOptimizer",
      render_secs: 600,
      run_base_tag: "0e90193e",
      run_mode: "train",
      save_checkpoint_secs: 600,
      save_checkpoints_secs: 600,
      steps: 15000000,
      sync_replicas: "False",
      update_epochs: 25,
      update_every: 60,
    },
    "render": {
      image: "gcr.io/kubeflow-rl/agents:0319-1806-6614",
      log_dir: "/mnt/nfs-1/train_dirs/kubeflow-rl/studies/replicated-kuka-demo-1/kuka-0319-1735-222e",
      name: "render-0319-2043-47e6",
      namespace: "kubeflow",
      nfs_claim_name: "nfs-1",
      num_cpu: 4,
      num_gpus: 0,
    },
  },
}


@@ -1,63 +0,0 @@
local params = std.extVar("__ksonnet/params").components["render"];
local k = import 'k.libsonnet';
local deployment = k.extensions.v1beta1.deployment;
local container = deployment.mixin.spec.template.spec.containersType;
local podTemplate = k.extensions.v1beta1.podTemplate;
local tfJob = import 'kubeflow/tf-job/tf-job.libsonnet';
local name = params.name;
local namespace = params.namespace;
local num_gpus = params.num_gpus;
local log_dir = params.log_dir;
local imageGpu = "";
local image = params.image;
local numCpu = params.num_cpu;
local args = [
"--run_mode=render",
"--logdir=" + log_dir,
"--num_agents=1"
];
local workerSpec = if num_gpus > 0 then
  tfJob.parts.tfJobReplica("MASTER", 1, args, imageGpu, num_gpus)
else
  tfJob.parts.tfJobReplica("MASTER", 1, args, image);
local nfsClaimName = params.nfs_claim_name;
local replicas = std.map(function(s)
  s + {
    template+: {
      spec+: {
        containers: [
          s.template.spec.containers[0] + {
            resources: {
              limits: {
                cpu: numCpu,
              },
              requests: {
                cpu: numCpu,
              },
            },
            volumeMounts: [{
              name: "nfs",
              mountPath: "/mnt/" + nfsClaimName,
            }],
          },
        ],
        volumes: [{
          name: "nfs",
          persistentVolumeClaim: {
            claimName: nfsClaimName,
          },
        }],
      },
    },
  },
  std.prune([workerSpec]));
local job = tfJob.parts.tfJob(name, namespace, replicas);
std.prune(k.core.v1.list.new([job]))


@@ -1,110 +0,0 @@
local params = std.extVar("__ksonnet/params").components["train"];
local k = import 'k.libsonnet';
local deployment = k.extensions.v1beta1.deployment;
local container = deployment.mixin.spec.template.spec.containersType;
local podTemplate = k.extensions.v1beta1.podTemplate;
local tfJob = import 'kubeflow/tf-job/tf-job.libsonnet';
local name = params.name;
local namespace = params.namespace;
local num_gpus = params.num_gpus;
local hparam_set_id = params.hparam_set_id;
local jobTag = params.job_tag;
local image = params.image;
local imageGpu = params.image_gpu;
local numCpu = params.num_cpu;
local dumpDependencyVersions = params.dump_dependency_versions;
local log_dir = params.log_dir;
local hparamSetID = params.hparam_set_id;
local runBaseTag = params.run_base_tag;
local syncReplicas = params.sync_replicas;
local algorithm = params.algorithm;
local numAgents = params.num_agents;
local evalEpisodes = params.eval_episodes;
local env = params.env;
local maxLength = params.max_length;
local steps = params.steps;
local network = params.network;
local initMeanFactor = params.init_mean_factor;
local learningRate = params.learning_rate;
local optimizer = params.optimizer;
local updateEpochs = params.update_epochs;
local updateEvery = params.update_every;
local discount = params.discount;
local klTarget = params.kl_target;
local klCutoffFactor = params.kl_cutoff_factor;
local klCutoffCoef = params.kl_cutoff_coef;
local klInitPenalty = params.kl_init_penalty;
local renderSecs = params.render_secs;
local args = [
"--run_mode=train",
"--logdir=" + log_dir,
"--hparam_set_id=" + hparamSetID,
"--run_base_tag=" + runBaseTag,
"--sync_replicas=" + syncReplicas,
"--num_gpus=" + num_gpus,
"--algorithm=" + algorithm,
"--num_agents=" + numAgents,
"--eval_episodes=" + evalEpisodes,
"--env=" + env,
"--max_length=" + maxLength,
"--steps=" + steps,
"--network=" + network,
"--init_mean_factor=" + initMeanFactor,
"--learning_rate=" + learningRate,
"--optimizer=" + optimizer,
"--update_epochs=" + updateEpochs,
"--update_every=" + updateEvery,
"--discount=" + discount,
"--kl_target=" + klTarget,
"--kl_cutoff_factor=" + klCutoffFactor,
"--kl_cutoff_coef=" + klCutoffCoef,
"--kl_init_penalty=" + klInitPenalty,
"--dump_dependency_versions=" + dumpDependencyVersions,
"--render_secs=" + renderSecs,
];
local workerSpec = if num_gpus > 0 then
  tfJob.parts.tfJobReplica("MASTER", 1, args, imageGpu, num_gpus)
else
  tfJob.parts.tfJobReplica("MASTER", 1, args, image);
local nfsClaimName = params.nfs_claim_name;
local replicas = std.map(function(s)
  s + {
    template+: {
      spec+: {
        containers: [
          s.template.spec.containers[0] + {
            resources: {
              limits: {
                cpu: numCpu,
              },
              requests: {
                cpu: numCpu,
              },
            },
            volumeMounts: [{
              name: "nfs",
              mountPath: "/mnt/" + nfsClaimName,
            }],
          },
        ],
        volumes: [{
          name: "nfs",
          persistentVolumeClaim: {
            claimName: nfsClaimName,
          },
        }],
      },
    },
  },
  std.prune([workerSpec]));
local job = tfJob.parts.tfJob(name, namespace, replicas);
std.prune(k.core.v1.list.new([job]))


@@ -1,4 +0,0 @@
local components = std.extVar("__ksonnet/components");
components + {
// Insert user-specified overrides here.
}


@@ -1,80 +0,0 @@
local k8s = import "k8s.libsonnet";
local apps = k8s.apps;
local core = k8s.core;
local extensions = k8s.extensions;
local hidden = {
  mapContainers(f):: {
    local podContainers = super.spec.template.spec.containers,
    spec+: {
      template+: {
        spec+: {
          // IMPORTANT: This overwrites the 'containers' field
          // for this deployment.
          containers: std.map(f, podContainers),
        },
      },
    },
  },
  mapContainersWithName(names, f)::
    local nameSet =
      if std.type(names) == "array"
      then std.set(names)
      else std.set([names]);
    local inNameSet(name) = std.length(std.setInter(nameSet, std.set([name]))) > 0;
    self.mapContainers(
      function(c)
        if std.objectHas(c, "name") && inNameSet(c.name)
        then f(c)
        else c
    ),
};
k8s + {
  apps:: apps + {
    v1beta1:: apps.v1beta1 + {
      local v1beta1 = apps.v1beta1,
      daemonSet:: v1beta1.daemonSet + {
        mapContainers(f):: hidden.mapContainers(f),
        mapContainersWithName(names, f):: hidden.mapContainersWithName(names, f),
      },
      deployment:: v1beta1.deployment + {
        mapContainers(f):: hidden.mapContainers(f),
        mapContainersWithName(names, f):: hidden.mapContainersWithName(names, f),
      },
    },
  },
  core:: core + {
    v1:: core.v1 + {
      list:: {
        new(items)::
          {apiVersion: "v1"} +
          {kind: "List"} +
          self.items(items),
        items(items):: if std.type(items) == "array" then {items+: items} else {items+: [items]},
      },
    },
  },
  extensions:: extensions + {
    v1beta1:: extensions.v1beta1 + {
      local v1beta1 = extensions.v1beta1,
      daemonSet:: v1beta1.daemonSet + {
        mapContainers(f):: hidden.mapContainers(f),
        mapContainersWithName(names, f):: hidden.mapContainersWithName(names, f),
      },
      deployment:: v1beta1.deployment + {
        mapContainers(f):: hidden.mapContainers(f),
        mapContainersWithName(names, f):: hidden.mapContainersWithName(names, f),
      },
    },
  },
}
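The name-set filtering that `mapContainersWithName` performs above can be expressed as a short Python sketch. The function and container names here are illustrative, not part of the library:

```python
def map_containers_with_name(names, f, containers):
    # Accept either a single name or a list of names, as the jsonnet helper does.
    name_set = set(names) if isinstance(names, list) else {names}
    # Apply f only to containers whose "name" field is in the set.
    return [f(c) if c.get("name") in name_set else c for c in containers]

containers = [{"name": "web", "image": "nginx:1.9"},
              {"name": "sidecar", "image": "busybox"}]
# Patch only the "web" container's image; "sidecar" is left untouched.
patched = map_containers_with_name(
    "web", lambda c: dict(c, image="nginx:1.13"), containers)
```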



@@ -1,76 +0,0 @@
# tf-job
> Prototypes for running TensorFlow jobs.
* [Quickstart](#quickstart)
* [Using Prototypes](#using-prototypes)
* [io.ksonnet.pkg.tf-job](#io.ksonnet.pkg.tf-job)
* [io.ksonnet.pkg.tf-cnn](#io.ksonnet.pkg.tf-cnn)
## Quickstart
*The following commands use the `io.ksonnet.pkg.tf-job` prototype to generate Kubernetes YAML for tf-job and then deploy it to your Kubernetes cluster.*
First, create a cluster and install the ksonnet CLI (see the root-level [README.md][rootReadme]).
If you haven't yet created a [ksonnet application](linkToSomewhere), do so using `ks init <app-name>`.
Finally, in the ksonnet application directory, run the following:
```shell
# Expand prototype as a Jsonnet file, place in a file in the
# `components/` directory. (YAML and JSON are also available.)
$ ks prototype use io.ksonnet.pkg.tf-job tf-job \
--namespace default \
--name tf-job
# Apply to server.
$ ks apply -f tf-job.jsonnet
```
## Using the library
The library files for tf-job define a set of relevant *parts* (_e.g._, deployments, services, secrets, and so on) that can be combined to configure tf-job for a wide variety of scenarios. For example, a database like Redis may need a secret to hold the user password, or it may have no password if it's acting as a cache.
This library provides a set of pre-fabricated "flavors" (or "distributions") of tf-job, each of which is configured for a different use case. These are captured as ksonnet *prototypes*, which allow users to interactively customize these distributions for their specific needs.
These prototypes, as well as how to use them, are enumerated below.
### io.ksonnet.pkg.tf-job
A TensorFlow job (could be training or evaluation).
#### Example
```shell
# Expand prototype as a Jsonnet file, place in a file in the
# `components/` directory. (YAML and JSON are also available.)
$ ks prototype use io.ksonnet.pkg.tf-job tf-job \
--name YOUR_NAME_HERE
```
#### Parameters
The available options to pass to the prototype are:
* `--name=<name>`: Name to give to each of the components [string]
### io.ksonnet.pkg.tf-cnn
A TensorFlow CNN Benchmarking job
#### Example
```shell
# Expand prototype as a Jsonnet file, place in a file in the
# `components/` directory. (YAML and JSON are also available.)
$ ks prototype use io.ksonnet.pkg.tf-cnn tf-job \
--name YOUR_NAME_HERE
```
#### Parameters
The available options to pass to the prototype are:
* `--name=<name>`: Name for the job. [string]
[rootReadme]: https://github.com/ksonnet/mixins


@@ -1,35 +0,0 @@
{
  "name": "tf-job",
  "apiVersion": "0.0.1",
  "kind": "ksonnet.io/parts",
  "description": "Prototypes for running TensorFlow jobs.\n",
  "author": "kubeflow team <kubeflow-team@google.com>",
  "contributors": [
    {
      "name": "Jeremy Lewi",
      "email": "jlewi@google.com"
    }
  ],
  "repository": {
    "type": "git",
    "url": "https://github.com/kubeflow/kubeflow"
  },
  "bugs": {
    "url": "https://github.com/kubeflow/kubeflow/issues"
  },
  "keywords": [
    "kubeflow",
    "tensorflow",
    "database"
  ],
  "quickStart": {
    "prototype": "io.ksonnet.pkg.tf-job",
    "componentName": "tf-job",
    "flags": {
      "name": "tf-job",
      "namespace": "default"
    },
    "comment": "Run TensorFlow Job"
  },
  "license": "Apache 2.0"
}


@@ -1,104 +0,0 @@
// @apiVersion 0.1
// @name io.ksonnet.pkg.tf-cnn
// @description A TensorFlow CNN Benchmarking job
// @shortDescription Run the TensorFlow CNN benchmarking job.
// @param name string Name for the job.
// @optionalParam namespace string default Namespace
// @optionalParam batch_size number 32 The batch size
// @optionalParam model string resnet50 Which model to use
// @optionalParam num_gpus number 0 The number of GPUs to attach to workers.
// @optionalParam image string gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3 The docker image to use for the job.
// @optionalParam image_gpu string gcr.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3 The docker image to use when using GPUs.
// @optionalParam num_ps number 1 The number of ps to use
// @optionalParam num_workers number 1 The number of workers to use
// We need at least 1 parameter server.
// TODO(jlewi): Should we move this into an examples package?
// TODO(https://github.com/ksonnet/ksonnet/issues/222): We have to add namespace as an explicit parameter
// because ksonnet doesn't support inheriting it from the environment yet.
local k = import "k.libsonnet";
local deployment = k.extensions.v1beta1.deployment;
local container = deployment.mixin.spec.template.spec.containersType;
local podTemplate = k.extensions.v1beta1.podTemplate;
local tfJob = import "kubeflow/tf-job/tf-job.libsonnet";
local name = import "param://name";
local namespace = import "param://namespace";
local numGpus = import "param://num_gpus";
local batchSize = import "param://batch_size";
local model = import "param://model";
local args = [
"python",
"tf_cnn_benchmarks.py",
"--batch_size=" + batchSize,
"--model=" + model,
"--variable_update=parameter_server",
"--flush_stdout=true",
] +
if numGpus == 0 then
// We need to set num_gpus=1 even if not using GPUs because otherwise the device
// list is empty because of this code:
// https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/benchmark_cnn.py#L775
// We won't actually use GPUs because, based on other flags, no ops will be assigned to GPUs.
[
"--num_gpus=1",
"--local_parameter_device=cpu",
"--device=cpu",
"--data_format=NHWC",
]
else
[
"--num_gpus=" + numGpus,
]
;
local image = import "param://image";
local imageGpu = import "param://image_gpu";
local numPs = import "param://num_ps";
local numWorkers = import "param://num_workers";
local numGpus = import "param://num_gpus";
local workerSpec = if numGpus > 0 then
  tfJob.parts.tfJobReplica("WORKER", numWorkers, args, imageGpu, numGpus)
else
  tfJob.parts.tfJobReplica("WORKER", numWorkers, args, image);
// TODO(jlewi): Look at how the redis prototype modifies a container by
// using mapContainersWithName. Can we do something similar?
// https://github.com/ksonnet/parts/blob/9d78d6bb445d530d5b927656d2293d4f12654608/incubator/redis/redis.libsonnet
local replicas = std.map(function(s)
  s {
    template+: {
      spec+: {
        // TODO(jlewi): Does this overwrite containers?
        containers: [
          s.template.spec.containers[0] {
            workingDir: "/opt/tf-benchmarks/scripts/tf_cnn_benchmarks",
          },
        ],
      },
    },
  },
  std.prune([workerSpec, tfJob.parts.tfJobReplica("PS", numPs, args, image)]));
local job =
  if numWorkers < 1 then
    error "num_workers must be >= 1"
  else if numPs < 1 then
    error "num_ps must be >= 1"
  else
    tfJob.parts.tfJob(name, namespace, replicas) + {
      spec+: {
        tfImage: image,
        terminationPolicy: { chief: { replicaName: "WORKER", replicaIndex: 0 } },
      },
    };
std.prune(k.core.v1.list.new([job]))
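The CPU/GPU branch in the benchmark args assembly above can be sketched in Python. This is an illustrative helper mirroring the jsonnet logic, not part of the package:

```python
def benchmark_args(batch_size, model, num_gpus):
    # Flags shared by the CPU and GPU variants.
    args = ["python", "tf_cnn_benchmarks.py",
            "--batch_size=%d" % batch_size,
            "--model=%s" % model,
            "--variable_update=parameter_server",
            "--flush_stdout=true"]
    if num_gpus == 0:
        # num_gpus=1 keeps the benchmark's device list non-empty; the cpu
        # device flags ensure no ops are actually placed on a GPU.
        args += ["--num_gpus=1", "--local_parameter_device=cpu",
                 "--device=cpu", "--data_format=NHWC"]
    else:
        args += ["--num_gpus=%d" % num_gpus]
    return args
```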


@@ -1,51 +0,0 @@
// @apiVersion 0.1
// @name io.ksonnet.pkg.tf-job
// @description A TensorFlow job (could be training or evaluation).
// @shortDescription A TensorFlow job.
// @param name string Name to give to each of the components
// @optionalParam namespace string default Namespace
// @optionalParam args string null Comma separated list of arguments to pass to the job
// @optionalParam image string null The docker image to use for the job.
// @optionalParam image_gpu string null The docker image to use when using GPUs.
// @optionalParam num_masters number 1 The number of masters to use
// @optionalParam num_ps number 0 The number of ps to use
// @optionalParam num_workers number 0 The number of workers to use
// @optionalParam num_gpus number 0 The number of GPUs to attach to workers.
// TODO(https://github.com/ksonnet/ksonnet/issues/235): ks param set args won't work if the arg starts with "--".
// TODO(https://github.com/ksonnet/ksonnet/issues/222): We have to add namespace as an explicit parameter
// because ksonnet doesn't support inheriting it from the environment yet.
local k = import "k.libsonnet";
local tfJob = import "kubeflow/tf-job/tf-job.libsonnet";
local name = import "param://name";
local namespace = import "param://namespace";
local argsParam = import "param://args";
local args =
  if argsParam == "null" then
    []
  else
    std.split(argsParam, ",");
local image = import "param://image";
local imageGpu = import "param://image_gpu";
local numMasters = import "param://num_masters";
local numPs = import "param://num_ps";
local numWorkers = import "param://num_workers";
local numGpus = import "param://num_gpus";
local workerSpec = if numGpus > 0 then
  tfJob.parts.tfJobReplica("WORKER", numWorkers, args, imageGpu, numGpus)
else
  tfJob.parts.tfJobReplica("WORKER", numWorkers, args, image);
std.prune(k.core.v1.list.new([
  tfJob.parts.tfJob(name, namespace, [
    tfJob.parts.tfJobReplica("MASTER", numMasters, args, image),
    workerSpec,
    tfJob.parts.tfJobReplica("PS", numPs, args, image),
  ]),
]))
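The `args` handling above (ksonnet passes an unset optional param as the literal string "null") amounts to the following, shown here as a Python sketch:

```python
def parse_args_param(args_param):
    # ksonnet passes an unset optional param as the literal string "null",
    # which should expand to no arguments at all.
    if args_param == "null":
        return []
    # Otherwise the param is a comma-separated list of job arguments.
    return args_param.split(",")
```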


@@ -1,49 +0,0 @@
local k = import "k.libsonnet";
{
  parts:: {
    tfJobReplica(replicaType, number, args, image, numGpus=0)::
      local baseContainer = {
        image: image,
        name: "tensorflow",
      };
      local containerArgs = if std.length(args) > 0 then
        {
          args: args,
        }
      else {};
      local resources = if numGpus > 0 then {
        resources: {
          limits: {
            "nvidia.com/gpu": numGpus,
          },
        },
      } else {};
      if number > 0 then
        {
          replicas: number,
          template: {
            spec: {
              containers: [
                baseContainer + containerArgs + resources,
              ],
              restartPolicy: "OnFailure",
            },
          },
          tfReplicaType: replicaType,
        }
      else {},
    tfJob(name, namespace, replicas):: {
      apiVersion: "kubeflow.org/v1alpha1",
      kind: "TFJob",
      metadata: {
        name: name,
        namespace: namespace,
      },
      spec: {
        replicaSpecs: replicas,
      },
    },
  },
}
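For illustration, the replica-spec structure that `tfJobReplica` emits can be mirrored in Python. This is a standalone sketch (the field names follow the jsonnet above; the function name is ours):

```python
def tf_job_replica(replica_type, number, args, image, num_gpus=0):
    """Mirror of tfJob.parts.tfJobReplica, for illustration only."""
    if number <= 0:
        return None  # the jsonnet helper returns {} and std.prune drops it
    container = {"image": image, "name": "tensorflow"}
    if args:
        container["args"] = args
    if num_gpus > 0:
        container["resources"] = {"limits": {"nvidia.com/gpu": num_gpus}}
    return {
        "replicas": number,
        "template": {"spec": {"containers": [container],
                              "restartPolicy": "OnFailure"}},
        "tfReplicaType": replica_type,
    }

master = tf_job_replica("MASTER", 1, ["--run_mode=train"], "some-image")
worker_gpu = tf_job_replica("WORKER", 2, [], "some-gpu-image", num_gpus=1)
```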



@@ -1,5 +0,0 @@
-e git://github.com/tensorflow/agents.git@459c4f88ece996eac3489e6e97a6ee0b30bdd6b3#egg=agents
pybullet==1.7.5
gym==0.9.4
tensorflow==1.4.1
google-cloud-storage==1.7.0


@@ -1,19 +0,0 @@
# Copyright 2017 The TensorFlow Agents Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Proximal Policy Optimization algorithm."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function


@@ -1,320 +0,0 @@
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Provides an entrypoint for the training and rendering tasks.
Usage: python -m trainer.task [options]
"""
from __future__ import absolute_import, division, print_function
import datetime
import logging
import os
import pprint
import uuid
import shutil
from google.cloud import storage
import tensorflow as tf
#pylint: disable=unused-import
import pybullet_envs
import agents
flags = tf.app.flags
flags.DEFINE_string("run_mode", "train",
"Run mode, one of [train, render, train_and_render].")
flags.DEFINE_string("logdir", '/tmp/test',
"The base directory in which to write logs and "
"checkpoints.")
flags.DEFINE_string("hparam_set_id", "pybullet_kuka_ff",
"The name of the config object to be used to parameterize "
"the run.")
flags.DEFINE_string("run_base_tag",
datetime.datetime.now().strftime('%Y%m%dT%H%M%S'),
"Base tag to prepend to logs dir folder name. Defaults "
"to timestamp.")
flags.DEFINE_boolean("env_processes", True,
"Step environments in separate processes to circumvent "
"the GIL.")
flags.DEFINE_integer("num_gpus", 0,
"Total number of GPUs for each machine. "
"If you don't use GPUs, set this to 0.")
flags.DEFINE_integer("save_checkpoint_secs", 600,
"Number of seconds between checkpoint save.")
flags.DEFINE_boolean("log_device_placement", False,
"Whether to output logs listing the devices on which "
"variables are placed.")
flags.DEFINE_boolean("debug", True,
"Run in debug mode.")
# Render
flags.DEFINE_integer("render_secs", 600,
"Number of seconds between triggering render jobs.")
flags.DEFINE_string("render_out_dir", None,
"The path to which to copy generated renders.")
# Algorithm
flags.DEFINE_string("algorithm", "agents.ppo.PPOAlgorithm",
"The name of the algorithm to use.")
flags.DEFINE_integer("num_agents", 30,
"The number of agents to use.")
flags.DEFINE_integer("eval_episodes", 25,
"The number of eval episodes to use.")
flags.DEFINE_string("env", "AntBulletEnv-v0",
"The gym / bullet simulation environment to use.")
flags.DEFINE_integer("max_length", 1000,
"The maximum length of an episode.")
flags.DEFINE_integer("steps", 10000000,
"The number of steps.")
# Network
flags.DEFINE_string("network", "agents.scripts.networks.feed_forward_gaussian",
"The registered network name to use for policy and value.")
flags.DEFINE_float("init_mean_factor", 0.1,
"Initialization scale for the mean weights of the policy network.")
flags.DEFINE_float("init_std", 0.35,
"Initial standard deviation of the policy.")
# Optimization
flags.DEFINE_float("learning_rate", 1e-4,
"The learning rate of the optimizer.")
flags.DEFINE_string("optimizer", "tensorflow.train.AdamOptimizer",
"The import path of the optimizer to use.")
flags.DEFINE_integer("update_epochs", 25,
"The number of update epochs.")
flags.DEFINE_integer("update_every", 60,
"The update frequency.")
# Losses
flags.DEFINE_float("discount", 0.995,
"The discount.")
flags.DEFINE_float("kl_target", 1e-2,
"The KL target.")
flags.DEFINE_integer("kl_cutoff_factor", 2,
"The KL cutoff factor.")
flags.DEFINE_integer("kl_cutoff_coef", 1000,
"The KL cutoff coefficient.")
flags.DEFINE_integer("kl_init_penalty", 1,
"The initial KL penalty coefficient.")
FLAGS = flags.FLAGS
hparams_base = {
    # General
    "algorithm": agents.ppo.PPOAlgorithm,
    "num_agents": 30,
    "eval_episodes": 30,
    "use_gpu": False,
    # Environment
    "env": 'KukaBulletEnv-v0',
    "normalize_ranges": True,
    "max_length": 1000,
    # Network
    "network": agents.scripts.networks.feed_forward_gaussian,
    "weight_summaries": dict(
        all=r'.*', policy=r'.*/policy/.*', value=r'.*/value/.*'),
    "policy_layers": (200, 100),
    "value_layers": (200, 100),
    "init_output_factor": 0.1,
    "init_logstd": -1,
    "init_std": 0.35,
    # Optimization
    "update_every": 60,
    "update_epochs": 25,
    "optimizer": tf.train.AdamOptimizer,
    "learning_rate": 1e-4,
    "steps": 3e7,  # 30M
    # Losses
    "discount": 0.995,
    "kl_target": 1e-2,
    "kl_cutoff_factor": 2,
    "kl_cutoff_coef": 1000,
    "kl_init_penalty": 1,
}
def _object_import_from_string(name):
  """Import and return an object from a string import path.

  Args:
    name (str): A string import path
      (e.g. "tf.train.AdamOptimizer")

  Returns:
    obj: The imported Python object
  """
  components = name.split('.')
  mod = __import__(components[0])
  for comp in components[1:]:
    mod = getattr(mod, comp)
  return mod
def _realize_import_attrs(d, hparam_filter):
  """Import objects from string paths in dict if in `hparam_filter`.

  Notes:
    The following call with an optimizer object referenced as a str:
      _realize_import_attrs(
          {"optimizer": "tf.train.AdamOptimizer"},
          ["optimizer"])
    returns {"optimizer": tf.train.AdamOptimizer}
    This is part of an experiment on how to make all hyperparameters
    configurable, including python objects, towards more flexible
    tuning.
  """
  for k, v in d.items():
    if k in hparam_filter:
      try:
        imported = _object_import_from_string(v)
      except (ImportError, AttributeError):
        raise ValueError("Failed to realize import path %s." % v)
      d[k] = imported
  return d
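The two helpers above can be exercised in isolation. This standalone sketch re-declares them and uses a standard-library object in place of `tf.train.AdamOptimizer` so it runs without TensorFlow installed:

```python
def _object_import_from_string(name):
    # Import the root module, then walk the remaining dotted components.
    components = name.split('.')
    mod = __import__(components[0])
    for comp in components[1:]:
        mod = getattr(mod, comp)
    return mod

def _realize_import_attrs(d, hparam_filter):
    # Replace string import paths with the objects they name,
    # but only for keys listed in hparam_filter.
    for k, v in d.items():
        if k in hparam_filter:
            d[k] = _object_import_from_string(v)
    return d

# "os.path.join" stands in for "tf.train.AdamOptimizer"; keys outside
# the filter (here "env") are left as plain strings.
hparams = {"optimizer": "os.path.join", "env": "KukaBulletEnv-v0"}
realized = _realize_import_attrs(hparams, ["optimizer"])
```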
def _get_agents_configuration(log_dir=None):
  """Load hyperparameter config.

  Args:
    log_dir (str): The directory in which to search for a
      tensorflow/agents config file.

  Returns:
    dict: A dictionary storing the hyperparameter config
      for this run.
  """
  try:
    # Try to resume training.
    hparams = agents.scripts.utility.load_config(log_dir)
  except IOError:
    hparams = hparams_base
    # --------
    # Experimental
    for k, v in FLAGS.__dict__['__flags'].items():
      hparams[k] = v
    hparams = _realize_import_attrs(
        hparams, ["network", "algorithm", "optimizer"])
    # --------
    hparams = agents.tools.AttrDict(hparams)
    hparams = agents.scripts.utility.save_config(hparams, log_dir)
  pprint.pprint(hparams)
  return hparams
def gcs_upload(local_dir, gcs_out_dir):
  """Upload the contents of a local directory to a specific GCS path.

  Args:
    local_dir (str): The local directory containing files to upload.
    gcs_out_dir (str): The target Google Cloud Storage directory path.

  Raises:
    ValueError: If `gcs_out_dir` does not start with "gs://".
  """
  # Get a list of all files in local_dir
  local_files = [f for f in os.listdir(local_dir)
                 if os.path.isfile(os.path.join(local_dir, f))]
  tf.logging.info("Preparing local files for upload:\n %s" % local_files)

  # Raise an error if the target directory is not a GCS path
  if not gcs_out_dir.startswith("gs://"):
    raise ValueError(
        "gcs_upload expected gcs_out_dir argument to start with gs://, "
        "saw %s" % gcs_out_dir)

  # Initialize the GCS API client
  storage_client = storage.Client()

  # TODO: Detect and handle the case where a GCS path has been provided
  # for a bucket that does not exist or for which the user does not have
  # permissions.

  # Obtain the bucket name from the full path
  bucket_path = gcs_out_dir.split('/')[2]
  bucket = storage_client.get_bucket(bucket_path)

  # Construct a target upload path that excludes the initial gs://bucket-name
  blob_base_path = '/'.join(gcs_out_dir.split('/')[3:])

  # For each local file name, construct the target and local paths and upload
  for local_filename in local_files:
    blob_path = os.path.join(blob_base_path, local_filename)
    blob = bucket.blob(blob_path)
    local_file_path = os.path.join(local_dir, local_filename)
    blob.upload_from_filename(local_file_path)
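The bucket/blob path slicing that `gcs_upload` relies on can be checked in isolation (the helper name here is ours):

```python
def split_gcs_path(gcs_out_dir):
    # "gs://my-bucket/a/b" -> ("my-bucket", "a/b"), the same slicing
    # gcs_upload uses to pick the bucket name and blob base path.
    if not gcs_out_dir.startswith("gs://"):
        raise ValueError("expected a gs:// path, saw %s" % gcs_out_dir)
    bucket_path = gcs_out_dir.split('/')[2]
    blob_base_path = '/'.join(gcs_out_dir.split('/')[3:])
    return bucket_path, blob_base_path
```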
def main(_):
  """Configures run and initiates either training or rendering."""
  tf.logging.set_verbosity(tf.logging.INFO)
  if FLAGS.debug:
    tf.logging.set_verbosity(tf.logging.DEBUG)

  log_dir = FLAGS.logdir
  agents_config = _get_agents_configuration(log_dir)

  if FLAGS.run_mode == 'train':
    for score in agents.scripts.train.train(agents_config, env_processes=True):
      logging.info('Score %s.', score)

  if FLAGS.run_mode == 'render':
    now = datetime.datetime.now()
    subdir = now.strftime("%m%d-%H%M") + "-" + uuid.uuid4().hex[0:4]
    render_tmp_dir = "/tmp/agents-render/"
    os.system('mkdir -p %s' % render_tmp_dir)
    agents.scripts.visualize.visualize(
        logdir=FLAGS.logdir, outdir=render_tmp_dir, num_agents=1,
        num_episodes=1, checkpoint=None, env_processes=True)
    render_out_dir = FLAGS.render_out_dir
    # Unless a render out dir is specified explicitly, upload to a unique
    # subdir of the log dir under the parent render/
    if render_out_dir is None:
      render_out_dir = os.path.join(FLAGS.logdir, "render", subdir)
    if render_out_dir.startswith("gs://"):
      gcs_upload(render_tmp_dir, render_out_dir)
    else:
      shutil.copytree(render_tmp_dir, render_out_dir)
  return True


if __name__ == '__main__':
  tf.app.run()
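The render subdirectory naming used in `main` (timestamp plus a short random suffix) can be seen in isolation; a fixed timestamp is used here for illustration:

```python
import datetime
import uuid

# Names like "0319-2043-47e6" match the render component names seen in
# the ksonnet params above (e.g. render-0319-2043-47e6).
now = datetime.datetime(2018, 3, 19, 20, 43)
subdir = now.strftime("%m%d-%H%M") + "-" + uuid.uuid4().hex[0:4]
```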