mirror of https://github.com/kubeflow/examples.git
rm stale agents example (#487)
parent 2b0eec34c3
commit 89e960202a
				|  | @ -1,26 +0,0 @@ | |||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||||
| # you may not use this file except in compliance with the License. | ||||
| # You may obtain a copy of the License at | ||||
| # | ||||
| #      http://www.apache.org/licenses/LICENSE-2.0 | ||||
| # | ||||
| # Unless required by applicable law or agreed to in writing, software | ||||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||
| # See the License for the specific language governing permissions and | ||||
| # limitations under the License. | ||||
| 
 | ||||
| FROM tensorflow/tensorflow:1.4.1 | ||||
| 
 | ||||
| # Needed for rendering and uploading renders | ||||
| RUN apt-get update && apt-get install -y libav-tools ffmpeg git | ||||
| 
 | ||||
| ADD requirements.txt /app/ | ||||
| RUN pip install -r /app/requirements.txt | ||||
| 
 | ||||
| ADD trainer /app/trainer/ | ||||
| 
 | ||||
| WORKDIR /app/ | ||||
| 
 | ||||
| ENTRYPOINT ["python", "-m", "trainer.task"] | ||||
|  | @ -1,66 +0,0 @@ | |||
| # [WIP] Reinforcement Learning with [tensorflow/agents](https://github.com/tensorflow/agents) | ||||
| 
 | ||||
| Here we provide a demonstration of training a reinforcement learning agent to perform a robotic grasping task using Kubeflow running on Google Kubernetes Engine. In this demonstration you will learn how to parameterize a training job, submit it to run on your cluster, monitor the job (including launching a TensorBoard instance), and finally produce renders of the agent performing the robotic grasping task. | ||||
| 
 | ||||
| For clarity and fun you can check out what the product of this tutorial will look like by clicking through the render screenshot below to a short video of a trained agent performing a simulated robotic block grasping task: | ||||
| 
 | ||||
| [](https://youtu.be/0X0w5XOtcHw) | ||||
| 
 | ||||
| ### Setup | ||||
| 
 | ||||
| ##### Training locally | ||||
| 
 | ||||
| In order to run the example locally we'll need to install the necessary requirements in a local conda environment, which can be done as follows: | ||||
| 
 | ||||
| ```bash | ||||
| $ conda create -y -n dev python=2.7 | ||||
| $ source activate dev | ||||
| $ pip install -r requirements.txt | ||||
| ``` | ||||
| 
 | ||||
| The trainer can be run as follows (in this case to display information on the available parameters): | ||||
| 
 | ||||
| ```bash | ||||
| $ python -m trainer.task --help | ||||
| usage: task.py [-h] [--run_mode RUN_MODE] [--logdir LOGDIR] [--hparam_set_id HPARAM_SET_ID] | ||||
|                [--run_base_tag RUN_BASE_TAG] [--env_processes [ENV_PROCESSES]] [--noenv_processes] | ||||
|                [--num_gpus NUM_GPUS] [--save_checkpoint_secs SAVE_CHECKPOINT_SECS] | ||||
|                [--log_device_placement [LOG_DEVICE_PLACEMENT]] [--nolog_device_placement] | ||||
|                [--debug [DEBUG]] [--nodebug] [--render_secs RENDER_SECS] | ||||
|                [--render_out_dir RENDER_OUT_DIR] [--algorithm ALGORITHM] [--num_agents NUM_AGENTS] | ||||
|                [--eval_episodes EVAL_EPISODES] [--env ENV] [--max_length MAX_LENGTH] [--steps STEPS] | ||||
|                [--network NETWORK] [--init_mean_factor INIT_MEAN_FACTOR] [--init_std INIT_STD] | ||||
|                [--learning_rate LEARNING_RATE] [--optimizer OPTIMIZER] [--update_epochs UPDATE_EPOCHS] | ||||
|                [--update_every UPDATE_EVERY] [--discount DISCOUNT] [--kl_target KL_TARGET] | ||||
|                [--kl_cutoff_factor KL_CUTOFF_FACTOR] [--kl_cutoff_coef KL_CUTOFF_COEF] | ||||
|                [--kl_init_penalty KL_INIT_PENALTY] | ||||
| ... | ||||
| ``` | ||||
| 
 | ||||
| 
 | ||||
| ##### GCP and Kubeflow configuration | ||||
| 
 | ||||
| This tutorial assumes you have deployed a Kubernetes cluster on your provider of choice and have completed the steps described in the [Kubeflow User Guide](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md) to deploy the core, argo, and nfs components. | ||||
| 
 | ||||
| ##### Launching base image on JupyterHub | ||||
| 
 | ||||
| This example is intended to be run inside of the `gcr.io/kubeflow/tensorflow-notebook-cpu` container running on JupyterHub which is in turn running on Kubeflow. You may provide the name of this container via the spawner options dialog. | ||||
| 
 | ||||
| For general troubleshooting of the spawning of notebook containers on JupyterHub or anything else related to your Kubeflow deployment please refer to the [Kubeflow User Guide](https://github.com/kubeflow/kubeflow/blob/master/user_guide.md). | ||||
| 
 | ||||
| There are two steps to perform from within the JupyterHub environment before the demonstration notebook can be used as intended. | ||||
| 
 | ||||
| First, we need to obtain the kubeflow example code as follows: | ||||
| 
 | ||||
| ```bash | ||||
| $ cd /home/jovyan | ||||
| $ git clone https://github.com/kubeflow/examples kubeflow-examples | ||||
| ``` | ||||
| 
 | ||||
| We will also need to authenticate our notebook environment to make calls to the underlying Kubernetes cluster. For example, if this is running on Google Kubernetes Engine the command would be as follows: | ||||
| 
 | ||||
| ```bash | ||||
| $ gcloud container clusters --project={PROJECT} --zone={ZONE} get-credentials {CLUSTER} | ||||
| ``` | ||||
| 
 | ||||
| Well it looks like our initial setup is finished 🎉🎉 and it's time to start playing around with that shiny new demonstration notebook!! You'll find it in doc/demo.ipynb. | ||||
|  | @ -1,39 +0,0 @@ | |||
| apiVersion: 0.1.0 | ||||
| gitVersion: | ||||
|   commitSha: 422d521c05aa905df949868143b26445f5e4eda5 | ||||
|   refSpec: master | ||||
| kind: ksonnet.io/registry | ||||
| libraries: | ||||
|   apache: | ||||
|     path: apache | ||||
|     version: master | ||||
|   efk: | ||||
|     path: efk | ||||
|     version: master | ||||
|   mariadb: | ||||
|     path: mariadb | ||||
|     version: master | ||||
|   memcached: | ||||
|     path: memcached | ||||
|     version: master | ||||
|   mongodb: | ||||
|     path: mongodb | ||||
|     version: master | ||||
|   mysql: | ||||
|     path: mysql | ||||
|     version: master | ||||
|   nginx: | ||||
|     path: nginx | ||||
|     version: master | ||||
|   node: | ||||
|     path: node | ||||
|     version: master | ||||
|   postgres: | ||||
|     path: postgres | ||||
|     version: master | ||||
|   redis: | ||||
|     path: redis | ||||
|     version: master | ||||
|   tomcat: | ||||
|     path: tomcat | ||||
|     version: master | ||||
|  | @ -1,18 +0,0 @@ | |||
| apiVersion: 0.1.0 | ||||
| gitVersion: | ||||
|   commitSha: 8b48d28127cb719410a9c40be214d3c76c2b4cb7 | ||||
|   refSpec: master | ||||
| kind: ksonnet.io/registry | ||||
| libraries: | ||||
|   argo: | ||||
|     path: argo | ||||
|     version: master | ||||
|   core: | ||||
|     path: core | ||||
|     version: master | ||||
|   tf-job: | ||||
|     path: tf-job | ||||
|     version: master | ||||
|   tf-serving: | ||||
|     path: tf-serving | ||||
|     version: master | ||||
|  | @ -1,24 +0,0 @@ | |||
| apiVersion: 0.1.0 | ||||
| kind: ksonnet.io/app | ||||
| libraries: | ||||
|   tf-job: | ||||
|     gitVersion: | ||||
|       commitSha: 8b48d28127cb719410a9c40be214d3c76c2b4cb7 | ||||
|       refSpec: master | ||||
|     name: tf-job | ||||
|     registry: kubeflow | ||||
| name: app | ||||
| registries: | ||||
|   incubator: | ||||
|     gitVersion: | ||||
|       commitSha: 422d521c05aa905df949868143b26445f5e4eda5 | ||||
|       refSpec: master | ||||
|     protocol: github | ||||
|     uri: github.com/ksonnet/parts/tree/master/incubator | ||||
|   kubeflow: | ||||
|     gitVersion: | ||||
|       commitSha: 8b48d28127cb719410a9c40be214d3c76c2b4cb7 | ||||
|       refSpec: master | ||||
|     protocol: github | ||||
|     uri: github.com/kubeflow/kubeflow/tree/master/kubeflow | ||||
| version: 0.0.1 | ||||
|  | @ -1,56 +0,0 @@ | |||
| { | ||||
|   global: { | ||||
|   }, | ||||
|   components: { | ||||
|     "train": { | ||||
|       algorithm: "agents.ppo.PPOAlgorithm", | ||||
|       discount: 0.995, | ||||
|       dump_dependency_versions: "True", | ||||
|       env: "KukaBulletEnv-v0", | ||||
|       eval_episodes: 25, | ||||
|       generate_data: "True", | ||||
|       hparam_set_id: "pybullet_kuka_ff", | ||||
|       image: "gcr.io/kubeflow-rl/agents:0405-1658-39bf", | ||||
|       image_gpu: "null", | ||||
|       init_mean_factor: 0.1, | ||||
|       job_tag: "0206-1409-6174", | ||||
|       kl_cutoff_coef: 1000, | ||||
|       kl_cutoff_factor: 2, | ||||
|       kl_init_penalty: 1, | ||||
|       kl_target: 0.01, | ||||
|       learning_rate: 0.0001, | ||||
|       log_dir: "/mnt/nfs-1/train_dirs/studies/replicated-kuka-demo/kuka-0405-1707-545d", | ||||
|       max_length: 1000, | ||||
|       name: "kuka-0405-1707-545d", | ||||
|       namespace: "kubeflow", | ||||
|       network: "agents.scripts.networks.feed_forward_gaussian", | ||||
|       nfs_claim_name: "nfs-1", | ||||
|       num_agents: 30, | ||||
|       num_cpu: 30, | ||||
|       num_gpus: 0, | ||||
|       num_masters: 1, | ||||
|       num_ps: 1, | ||||
|       num_replicas: 1, | ||||
|       num_workers: 1, | ||||
|       optimizer: "tensorflow.train.AdamOptimizer", | ||||
|       render_secs: 600, | ||||
|       run_base_tag: "0e90193e", | ||||
|       run_mode: "train", | ||||
|       save_checkpoint_secs: 600, | ||||
|       save_checkpoints_secs: 600, | ||||
|       steps: 15000000, | ||||
|       sync_replicas: "False", | ||||
|       update_epochs: 25, | ||||
|       update_every: 60, | ||||
|     }, | ||||
|     "render": { | ||||
|       image: "gcr.io/kubeflow-rl/agents:0319-1806-6614", | ||||
|       log_dir: "/mnt/nfs-1/train_dirs/kubeflow-rl/studies/replicated-kuka-demo-1/kuka-0319-1735-222e", | ||||
|       name: "render-0319-2043-47e6", | ||||
|       namespace: "kubeflow", | ||||
|       nfs_claim_name: "nfs-1", | ||||
|       num_cpu: 4, | ||||
|       num_gpus: 0, | ||||
|     }, | ||||
|   }, | ||||
| } | ||||
|  | @ -1,63 +0,0 @@ | |||
| local params = std.extVar("__ksonnet/params").components["render"]; | ||||
| local k = import 'k.libsonnet'; | ||||
| local deployment = k.extensions.v1beta1.deployment; | ||||
| local container = deployment.mixin.spec.template.spec.containersType; | ||||
| local podTemplate = k.extensions.v1beta1.podTemplate; | ||||
| 
 | ||||
| local tfJob = import 'kubeflow/tf-job/tf-job.libsonnet'; | ||||
| 
 | ||||
| local name = params.name; | ||||
| local namespace = params.namespace; | ||||
| local num_gpus = params.num_gpus; | ||||
| local log_dir = params.log_dir; | ||||
| local imageGpu = ""; | ||||
| local image = params.image; | ||||
| local numCpu = params.num_cpu; | ||||
| 
 | ||||
| local args = [ | ||||
|   "--run_mode=render", | ||||
|   "--logdir=" + log_dir, | ||||
|   "--num_agents=1" | ||||
| ]; | ||||
| 
 | ||||
| local workerSpec = if num_gpus > 0 then | ||||
|   	tfJob.parts.tfJobReplica("MASTER", 1, args, imageGpu, num_gpus) | ||||
|   	else | ||||
|   	tfJob.parts.tfJobReplica("MASTER", 1, args, image); | ||||
| 
 | ||||
| local nfsClaimName = params.nfs_claim_name; | ||||
| 
 | ||||
| local replicas = std.map(function(s) | ||||
|   s + { | ||||
|     template+: { | ||||
|       spec+:  { | ||||
|         containers: [ | ||||
|           s.template.spec.containers[0] + { | ||||
|             resources: { | ||||
|               limits: { | ||||
|                 cpu: numCpu | ||||
|               }, | ||||
|               requests: { | ||||
|                 cpu: numCpu | ||||
|               } | ||||
|             }, | ||||
|             volumeMounts:[{ | ||||
|               name: "nfs", | ||||
|               mountPath: "/mnt/" + nfsClaimName | ||||
|             }] | ||||
|           }, | ||||
|         ], | ||||
|         volumes: [{ | ||||
|           name: "nfs", | ||||
|           persistentVolumeClaim: { | ||||
|             claimName: nfsClaimName | ||||
|           } | ||||
|         }] | ||||
|       }, | ||||
|     }, | ||||
|   }, | ||||
|   std.prune([workerSpec])); | ||||
| 
 | ||||
| local job = tfJob.parts.tfJob(name, namespace, replicas); | ||||
| 
 | ||||
| std.prune(k.core.v1.list.new([job])) | ||||
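
The render component above patches each replica so that its first container gets CPU limits and requests plus an NFS volume mount. A minimal Python sketch of the same patching step follows, assuming a hypothetical helper `with_cpu_and_nfs` and plain dicts in place of jsonnet objects; it is an illustration, not the component's implementation. As in the jsonnet, `resources` is replaced rather than merged, so any resource limits already on the container are dropped.

```python
import copy


def with_cpu_and_nfs(replica, num_cpu, nfs_claim_name):
    """Return a patched copy of a replica spec (hypothetical helper).

    Mirrors the jsonnet above: the container list is rebuilt from the
    first container only, its resources are overwritten, and an NFS
    persistent volume claim is mounted under /mnt/<claim name>.
    """
    patched = copy.deepcopy(replica)
    spec = patched["template"]["spec"]
    container = dict(spec["containers"][0])
    container["resources"] = {
        "limits": {"cpu": num_cpu},
        "requests": {"cpu": num_cpu},
    }
    container["volumeMounts"] = [
        {"name": "nfs", "mountPath": "/mnt/" + nfs_claim_name},
    ]
    spec["containers"] = [container]
    spec["volumes"] = [
        {"name": "nfs", "persistentVolumeClaim": {"claimName": nfs_claim_name}},
    ]
    return patched


# Illustrative input shaped like the output of tfJobReplica.
replica = {
    "replicas": 1,
    "tfReplicaType": "MASTER",
    "template": {"spec": {"containers": [{"name": "tensorflow"}]}},
}
patched = with_cpu_and_nfs(replica, 4, "nfs-1")
print(patched["template"]["spec"]["containers"][0]["volumeMounts"])
```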
|  | @ -1,110 +0,0 @@ | |||
| local params = std.extVar("__ksonnet/params").components["train"]; | ||||
| local k = import 'k.libsonnet'; | ||||
| local deployment = k.extensions.v1beta1.deployment; | ||||
| local container = deployment.mixin.spec.template.spec.containersType; | ||||
| local podTemplate = k.extensions.v1beta1.podTemplate; | ||||
| 
 | ||||
| local tfJob = import 'kubeflow/tf-job/tf-job.libsonnet'; | ||||
| 
 | ||||
| local name = params.name; | ||||
| local namespace = params.namespace; | ||||
| local num_gpus = params.num_gpus; | ||||
| local hparam_set_id = params.hparam_set_id; | ||||
| local jobTag = params.job_tag; | ||||
| local image = params.image; | ||||
| local imageGpu = params.image_gpu; | ||||
| local numCpu = params.num_cpu; | ||||
| local dumpDependencyVersions = params.dump_dependency_versions; | ||||
| local log_dir = params.log_dir; | ||||
| local hparamSetID = params.hparam_set_id; | ||||
| local runBaseTag = params.run_base_tag; | ||||
| local syncReplicas = params.sync_replicas; | ||||
| local algorithm = params.algorithm; | ||||
| local numAgents = params.num_agents; | ||||
| local evalEpisodes = params.eval_episodes; | ||||
| local env = params.env; | ||||
| local maxLength = params.max_length; | ||||
| local steps = params.steps; | ||||
| local network = params.network; | ||||
| local initMeanFactor = params.init_mean_factor; | ||||
| local learningRate = params.learning_rate; | ||||
| local optimizer = params.optimizer; | ||||
| local updateEpochs = params.update_epochs; | ||||
| local updateEvery = params.update_every; | ||||
| local discount = params.discount; | ||||
| local klTarget = params.kl_target; | ||||
| local klCutoffFactor = params.kl_cutoff_factor; | ||||
| local klCutoffCoef = params.kl_cutoff_coef; | ||||
| local klInitPenalty = params.kl_init_penalty; | ||||
| 
 | ||||
| local renderSecs = params.render_secs; | ||||
| 
 | ||||
| local args = [ | ||||
|   "--run_mode=train", | ||||
|   "--logdir=" + log_dir, | ||||
|   "--hparam_set_id=" + hparamSetID, | ||||
|   "--run_base_tag=" + runBaseTag, | ||||
|   "--sync_replicas=" + syncReplicas, | ||||
|   "--num_gpus=" + num_gpus, | ||||
|   "--algorithm=" + algorithm, | ||||
|   "--num_agents=" + numAgents, | ||||
|   "--eval_episodes=" + evalEpisodes, | ||||
|   "--env=" + env, | ||||
|   "--max_length=" + maxLength, | ||||
|   "--steps=" + steps, | ||||
|   "--network=" + network, | ||||
|   "--init_mean_factor=" + initMeanFactor, | ||||
|   "--learning_rate=" + learningRate, | ||||
|   "--optimizer=" + optimizer, | ||||
|   "--update_epochs=" + updateEpochs, | ||||
|   "--update_every=" + updateEvery, | ||||
|   "--discount=" + discount, | ||||
|   "--kl_target=" + klTarget, | ||||
|   "--kl_cutoff_factor=" + klCutoffFactor, | ||||
|   "--kl_cutoff_coef=" + klCutoffCoef, | ||||
|   "--kl_init_penalty=" + klInitPenalty, | ||||
|   "--dump_dependency_versions=" + dumpDependencyVersions, | ||||
|   "--render_secs=" + renderSecs, | ||||
| ]; | ||||
| 
 | ||||
| local workerSpec = if num_gpus > 0 then | ||||
|   	tfJob.parts.tfJobReplica("MASTER", 1, args, imageGpu, num_gpus) | ||||
|   	else | ||||
|   	tfJob.parts.tfJobReplica("MASTER", 1, args, image); | ||||
| 
 | ||||
| local nfsClaimName = params.nfs_claim_name; | ||||
| 
 | ||||
| local replicas = std.map(function(s) | ||||
|   s + { | ||||
|     template+: { | ||||
|       spec+:  { | ||||
|         containers: [ | ||||
|           s.template.spec.containers[0] + { | ||||
|             resources: { | ||||
|               limits: { | ||||
|                 cpu: numCpu | ||||
|               }, | ||||
|               requests: { | ||||
|                 cpu: numCpu | ||||
|               } | ||||
|             }, | ||||
|             volumeMounts:[{ | ||||
|               name: "nfs", | ||||
|               mountPath: "/mnt/" + nfsClaimName | ||||
|             }] | ||||
|           }, | ||||
|         ], | ||||
|         volumes: [{ | ||||
|           name: "nfs", | ||||
|           persistentVolumeClaim: { | ||||
|             claimName: nfsClaimName | ||||
|           } | ||||
|         }] | ||||
|       }, | ||||
|     }, | ||||
|   }, | ||||
|   std.prune([workerSpec])); | ||||
| 
 | ||||
| local job = tfJob.parts.tfJob(name, namespace, replicas); | ||||
| 
 | ||||
| std.prune(k.core.v1.list.new([job])) | ||||
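
The train component above turns each ksonnet parameter into a `--flag=value` argument for `trainer.task` by string concatenation. A minimal Python sketch of that mapping follows, assuming a hypothetical helper `build_args`, a small subset of the parameters, and an illustrative log directory; it is not the component's implementation.

```python
def build_args(params, run_mode="train"):
    """Build a list of --flag=value strings (hypothetical helper).

    Mirrors the jsonnet `args` list: log_dir is passed as --logdir,
    and every other key keeps its own name.
    """
    args = ["--run_mode=" + run_mode]
    for key, value in params.items():
        flag = "logdir" if key == "log_dir" else key
        args.append("--{}={}".format(flag, value))
    return args


# The path below is an illustrative value, not one from the source.
example = build_args({
    "log_dir": "/mnt/nfs-1/train_dirs/demo",
    "num_agents": 30,
    "discount": 0.995,
})
print(example)
```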
|  | @ -1,4 +0,0 @@ | |||
| local components = std.extVar("__ksonnet/components"); | ||||
| components + { | ||||
|   // Insert user-specified overrides here. | ||||
| } | ||||
|  | @ -1,80 +0,0 @@ | |||
| local k8s = import "k8s.libsonnet"; | ||||
| 
 | ||||
| local apps = k8s.apps; | ||||
| local core = k8s.core; | ||||
| local extensions = k8s.extensions; | ||||
| 
 | ||||
| local hidden = { | ||||
|   mapContainers(f):: { | ||||
|     local podContainers = super.spec.template.spec.containers, | ||||
|     spec+: { | ||||
|       template+: { | ||||
|         spec+: { | ||||
|           // IMPORTANT: This overwrites the 'containers' field | ||||
|           // for this deployment. | ||||
|           containers: std.map(f, podContainers), | ||||
|         }, | ||||
|       }, | ||||
|     }, | ||||
|   }, | ||||
| 
 | ||||
|   mapContainersWithName(names, f) :: | ||||
|     local nameSet = | ||||
|       if std.type(names) == "array" | ||||
|       then std.set(names) | ||||
|       else std.set([names]); | ||||
|     local inNameSet(name) = std.length(std.setInter(nameSet, std.set([name]))) > 0; | ||||
|     self.mapContainers( | ||||
|       function(c) | ||||
|         if std.objectHas(c, "name") && inNameSet(c.name) | ||||
|         then f(c) | ||||
|         else c | ||||
|     ), | ||||
| }; | ||||
| 
 | ||||
| k8s + { | ||||
|   apps:: apps + { | ||||
|     v1beta1:: apps.v1beta1 + { | ||||
|       local v1beta1 = apps.v1beta1, | ||||
| 
 | ||||
|       daemonSet:: v1beta1.daemonSet + { | ||||
|         mapContainers(f):: hidden.mapContainers(f), | ||||
|         mapContainersWithName(names, f):: hidden.mapContainersWithName(names, f), | ||||
|       }, | ||||
| 
 | ||||
|       deployment:: v1beta1.deployment + { | ||||
|         mapContainers(f):: hidden.mapContainers(f), | ||||
|         mapContainersWithName(names, f):: hidden.mapContainersWithName(names, f), | ||||
|       }, | ||||
|     }, | ||||
|   }, | ||||
| 
 | ||||
|   core:: core + { | ||||
|     v1:: core.v1 + { | ||||
|       list:: { | ||||
|         new(items):: | ||||
|           {apiVersion: "v1"} + | ||||
|           {kind: "List"} + | ||||
|           self.items(items), | ||||
| 
 | ||||
|         items(items):: if std.type(items) == "array" then {items+: items} else {items+: [items]}, | ||||
|       }, | ||||
|     }, | ||||
|   }, | ||||
| 
 | ||||
|   extensions:: extensions + { | ||||
|     v1beta1:: extensions.v1beta1 + { | ||||
|       local v1beta1 = extensions.v1beta1, | ||||
| 
 | ||||
|       daemonSet:: v1beta1.daemonSet + { | ||||
|         mapContainers(f):: hidden.mapContainers(f), | ||||
|         mapContainersWithName(names, f):: hidden.mapContainersWithName(names, f), | ||||
|       }, | ||||
| 
 | ||||
|       deployment:: v1beta1.deployment + { | ||||
|         mapContainers(f):: hidden.mapContainers(f), | ||||
|         mapContainersWithName(names, f):: hidden.mapContainersWithName(names, f), | ||||
|       }, | ||||
|     }, | ||||
|   }, | ||||
| } | ||||
										
											
												File diff suppressed because it is too large
											
										
									
								
							
										
											
												File diff suppressed because it is too large
											
										
									
								
							|  | @ -1,76 +0,0 @@ | |||
| # tf-job | ||||
| 
 | ||||
| > Prototypes for running TensorFlow jobs. | ||||
| 
 | ||||
| 
 | ||||
| * [Quickstart](#quickstart) | ||||
| * [Using the library](#using-the-library) | ||||
|   * [io.ksonnet.pkg.tf-job](#io.ksonnet.pkg.tf-job) | ||||
|   * [io.ksonnet.pkg.tf-cnn](#io.ksonnet.pkg.tf-cnn) | ||||
| 
 | ||||
| ## Quickstart | ||||
| 
 | ||||
| *The following commands use the `io.ksonnet.pkg.tf-job` prototype to generate Kubernetes YAML for tf-job, and then deploy it to your Kubernetes cluster.* | ||||
| 
 | ||||
| First, create a cluster and install the ksonnet CLI (see root-level [README.md][rootReadme]). | ||||
| 
 | ||||
| If you haven't yet created a [ksonnet application](linkToSomewhere), do so using `ks init <app-name>`. | ||||
| 
 | ||||
| Finally, in the ksonnet application directory, run the following: | ||||
| 
 | ||||
| ```shell | ||||
| # Expand prototype as a Jsonnet file, place in a file in the | ||||
| # `components/` directory. (YAML and JSON are also available.) | ||||
| $ ks prototype use io.ksonnet.pkg.tf-job tf-job \ | ||||
|   --namespace default \ | ||||
|   --name tf-job | ||||
| 
 | ||||
| # Apply to server. | ||||
| $ ks apply -f tf-job.jsonnet | ||||
| ``` | ||||
| 
 | ||||
| ## Using the library | ||||
| 
 | ||||
| The library files for tf-job define a set of relevant *parts* (_e.g._, deployments, services, secrets, and so on) that can be combined to configure tf-job for a wide variety of scenarios. For example, a database like Redis may need a secret to hold the user password, or it may have no password if it's acting as a cache. | ||||
| 
 | ||||
| This library provides a set of pre-fabricated "flavors" (or "distributions") of tf-job, each of which is configured for a different use case. These are captured as ksonnet *prototypes*, which allow users to interactively customize these distributions for their specific needs. | ||||
| 
 | ||||
| These prototypes, as well as how to use them, are enumerated below. | ||||
| 
 | ||||
| ### io.ksonnet.pkg.tf-job | ||||
| 
 | ||||
| A TensorFlow job (could be training or evaluation). | ||||
| #### Example | ||||
| 
 | ||||
| ```shell | ||||
| # Expand prototype as a Jsonnet file, place in a file in the | ||||
| # `components/` directory. (YAML and JSON are also available.) | ||||
| $ ks prototype use io.ksonnet.pkg.tf-job tf-job \ | ||||
|   --name YOUR_NAME_HERE | ||||
| ``` | ||||
| 
 | ||||
| #### Parameters | ||||
| 
 | ||||
| The available options to pass to the prototype are: | ||||
| 
 | ||||
| * `--name=<name>`: Name to give to each of the components [string] | ||||
| ### io.ksonnet.pkg.tf-cnn | ||||
| 
 | ||||
| A TensorFlow CNN Benchmarking job | ||||
| #### Example | ||||
| 
 | ||||
| ```shell | ||||
| # Expand prototype as a Jsonnet file, place in a file in the | ||||
| # `components/` directory. (YAML and JSON are also available.) | ||||
| $ ks prototype use io.ksonnet.pkg.tf-cnn tf-job \ | ||||
|   --name YOUR_NAME_HERE | ||||
| ``` | ||||
| 
 | ||||
| #### Parameters | ||||
| 
 | ||||
| The available options to pass to the prototype are: | ||||
| 
 | ||||
| * `--name=<name>`: Name for the job. [string] | ||||
| 
 | ||||
| 
 | ||||
| [rootReadme]: https://github.com/ksonnet/mixins | ||||
|  | @ -1,35 +0,0 @@ | |||
| { | ||||
|    "name": "tf-job", | ||||
|    "apiVersion": "0.0.1", | ||||
|    "kind": "ksonnet.io/parts", | ||||
|    "description": "Prototypes for running TensorFlow jobs.\n", | ||||
|    "author": "kubeflow team <kubeflow-team@google.com>", | ||||
|    "contributors": [ | ||||
|       { | ||||
|          "name": "Jeremy Lewi", | ||||
|          "email": "jlewi@google.com" | ||||
|       } | ||||
|    ], | ||||
|    "repository": { | ||||
|       "type": "git", | ||||
|       "url": "https://github.com/kubeflow/kubeflow" | ||||
|    }, | ||||
|    "bugs": { | ||||
|       "url": "https://github.com/kubeflow/kubeflow/issues" | ||||
|    }, | ||||
|    "keywords": [ | ||||
|       "kubeflow", | ||||
|       "tensorflow", | ||||
|       "database" | ||||
|    ], | ||||
|    "quickStart": { | ||||
|       "prototype": "io.ksonnet.pkg.tf-job", | ||||
|       "componentName": "tf-job", | ||||
|       "flags": { | ||||
|          "name": "tf-job", | ||||
|          "namespace": "default" | ||||
|       }, | ||||
|       "comment": "Run TensorFlow Job" | ||||
|    }, | ||||
|    "license": "Apache 2.0" | ||||
| } | ||||
|  | @ -1,104 +0,0 @@ | |||
| // @apiVersion 0.1 | ||||
| // @name io.ksonnet.pkg.tf-cnn | ||||
| // @description A TensorFlow CNN Benchmarking job | ||||
| // @shortDescription Run the TensorFlow CNN benchmarking job. | ||||
| // @param name string Name for the job. | ||||
| // @optionalParam namespace string default Namespace | ||||
| // @optionalParam batch_size number 32 The batch size | ||||
| // @optionalParam model string resnet50 Which model to use | ||||
| // @optionalParam num_gpus number 0 The number of GPUs to attach to workers. | ||||
| // @optionalParam image string gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3 The docker image to use for the job. | ||||
| // @optionalParam image_gpu string gcr.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3 The docker image to use when using GPUs. | ||||
| // @optionalParam num_ps number 1 The number of ps to use | ||||
| // @optionalParam num_workers number 1 The number of workers to use | ||||
| 
 | ||||
| // We need at least 1 parameter server. | ||||
| 
 | ||||
| // TODO(jlewi): Should we move this into an examples package? | ||||
| 
 | ||||
| // TODO(https://github.com/ksonnet/ksonnet/issues/222): We have to add namespace as an explicit parameter | ||||
| // because ksonnet doesn't support inheriting it from the environment yet. | ||||
| 
 | ||||
| local k = import "k.libsonnet"; | ||||
| local deployment = k.extensions.v1beta1.deployment; | ||||
| local container = deployment.mixin.spec.template.spec.containersType; | ||||
| local podTemplate = k.extensions.v1beta1.podTemplate; | ||||
| 
 | ||||
| local tfJob = import "kubeflow/tf-job/tf-job.libsonnet"; | ||||
| 
 | ||||
| local name = import "param://name"; | ||||
| local namespace = import "param://namespace"; | ||||
| 
 | ||||
| local numGpus = import "param://num_gpus"; | ||||
| local batchSize = import "param://batch_size"; | ||||
| local model = import "param://model"; | ||||
| 
 | ||||
| local args = [ | ||||
|                "python", | ||||
|                "tf_cnn_benchmarks.py", | ||||
|                "--batch_size=" + batchSize, | ||||
|                "--model=" + model, | ||||
|                "--variable_update=parameter_server", | ||||
|                "--flush_stdout=true", | ||||
|              ] + | ||||
|              if numGpus == 0 then | ||||
|                // We need to set num_gpus=1 even if not using GPUs because otherwise the device list | ||||
|                // is empty because of this code | ||||
|                // https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/benchmark_cnn.py#L775 | ||||
|                // We won't actually use GPUs because based on other flags no ops will be assigned to GPUs. | ||||
|                [ | ||||
|                  "--num_gpus=1", | ||||
|                  "--local_parameter_device=cpu", | ||||
|                  "--device=cpu", | ||||
|                  "--data_format=NHWC", | ||||
|                ] | ||||
|              else | ||||
|                [ | ||||
|                  "--num_gpus=" + numGpus, | ||||
|                ] | ||||
| ; | ||||
| 
 | ||||
| local image = import "param://image"; | ||||
| local imageGpu = import "param://image_gpu"; | ||||
| local numPs = import "param://num_ps"; | ||||
| local numWorkers = import "param://num_workers"; | ||||
| local numGpus = import "param://num_gpus"; | ||||
| 
 | ||||
| local workerSpec = if numGpus > 0 then | ||||
|   tfJob.parts.tfJobReplica("WORKER", numWorkers, args, imageGpu, numGpus) | ||||
| else | ||||
|   tfJob.parts.tfJobReplica("WORKER", numWorkers, args, image); | ||||
| 
 | ||||
| // TODO(jlewi): Look at how the redis prototype modifies a container by | ||||
| // using mapContainersWithName. Can we do something similar? | ||||
| // https://github.com/ksonnet/parts/blob/9d78d6bb445d530d5b927656d2293d4f12654608/incubator/redis/redis.libsonnet | ||||
| local replicas = std.map(function(s) | ||||
|                            s { | ||||
|                              template+: { | ||||
|                                spec+: { | ||||
|                                  // TODO(jlewi): Does this overwrite containers? | ||||
|                                  containers: [ | ||||
|                                    s.template.spec.containers[0] { | ||||
|                                      workingDir: "/opt/tf-benchmarks/scripts/tf_cnn_benchmarks", | ||||
|                                    }, | ||||
|                                  ], | ||||
|                                }, | ||||
|                              }, | ||||
|                            }, | ||||
|                          std.prune([workerSpec, tfJob.parts.tfJobReplica("PS", numPs, args, image)])); | ||||
| 
 | ||||
| local job = | ||||
|   if numWorkers < 1 then | ||||
|     error "num_workers must be >= 1" | ||||
|   else | ||||
|     if numPs < 1 then | ||||
|       error "num_ps must be >= 1" | ||||
|     else | ||||
|       tfJob.parts.tfJob(name, namespace, replicas) + { | ||||
|         spec+: { | ||||
|           tfImage: image, | ||||
|           terminationPolicy: { chief: { replicaName: "WORKER", replicaIndex: 0 } }, | ||||
|         }, | ||||
|       }; | ||||
| 
 | ||||
| std.prune(k.core.v1.list.new([job])) | ||||
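
The tf-cnn prototype above picks different benchmark flags depending on whether GPUs are requested. A minimal Python sketch of that selection follows, assuming a hypothetical function `benchmark_args` whose defaults are copied from the prototype's `@optionalParam` lines; it illustrates the branching only.

```python
def benchmark_args(batch_size=32, model="resnet50", num_gpus=0):
    """Assemble the tf_cnn_benchmarks command line (hypothetical helper)."""
    base = [
        "python",
        "tf_cnn_benchmarks.py",
        "--batch_size={}".format(batch_size),
        "--model={}".format(model),
        "--variable_update=parameter_server",
        "--flush_stdout=true",
    ]
    if num_gpus == 0:
        # CPU-only run: num_gpus stays at 1 so the device list is not
        # empty, while the remaining flags keep all ops on the CPU.
        device = [
            "--num_gpus=1",
            "--local_parameter_device=cpu",
            "--device=cpu",
            "--data_format=NHWC",
        ]
    else:
        device = ["--num_gpus={}".format(num_gpus)]
    return base + device


print(benchmark_args())
print(benchmark_args(num_gpus=2))
```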
|  | @ -1,51 +0,0 @@ | |||
| // @apiVersion 0.1 | ||||
| // @name io.ksonnet.pkg.tf-job | ||||
| // @description A TensorFlow job (could be training or evaluation). | ||||
| // @shortDescription A TensorFlow job. | ||||
| // @param name string Name to give to each of the components | ||||
| // @optionalParam namespace string default Namespace | ||||
| // @optionalParam args string null Comma separated list of arguments to pass to the job | ||||
| // @optionalParam image string null The docker image to use for the job. | ||||
| // @optionalParam image_gpu string null The docker image to use when using GPUs. | ||||
| // @optionalParam num_masters number 1 The number of masters to use | ||||
| // @optionalParam num_ps number 0 The number of ps to use | ||||
| // @optionalParam num_workers number 0 The number of workers to use | ||||
| // @optionalParam num_gpus number 0 The number of GPUs to attach to workers. | ||||
| 
 | ||||
| // TODO(https://github.com/ksonnet/ksonnet/issues/235): ks param set args won't work if the arg starts with "--". | ||||
| 
 | ||||
| // TODO(https://github.com/ksonnet/ksonnet/issues/222): We have to add namespace as an explicit parameter | ||||
| // because ksonnet doesn't support inheriting it from the environment yet. | ||||
| 
 | ||||
| local k = import "k.libsonnet"; | ||||
| local tfJob = import "kubeflow/tf-job/tf-job.libsonnet"; | ||||
| 
 | ||||
| local name = import "param://name"; | ||||
| local namespace = import "param://namespace"; | ||||
| 
 | ||||
| local argsParam = import "param://args"; | ||||
| local args = | ||||
|   if argsParam == "null" then | ||||
|     [] | ||||
|   else | ||||
|     std.split(argsParam, ","); | ||||
| 
 | ||||
| local image = import "param://image"; | ||||
| local imageGpu = import "param://image_gpu"; | ||||
| local numMasters = import "param://num_masters"; | ||||
| local numPs = import "param://num_ps"; | ||||
| local numWorkers = import "param://num_workers"; | ||||
| local numGpus = import "param://num_gpus"; | ||||
| 
 | ||||
| local workerSpec = if numGpus > 0 then | ||||
|   tfJob.parts.tfJobReplica("WORKER", numWorkers, args, imageGpu, numGpus) | ||||
| else | ||||
|   tfJob.parts.tfJobReplica("WORKER", numWorkers, args, image); | ||||
| 
 | ||||
| std.prune(k.core.v1.list.new([ | ||||
|   tfJob.parts.tfJob(name, namespace, [ | ||||
|     tfJob.parts.tfJobReplica("MASTER", numMasters, args, image), | ||||
|     workerSpec, | ||||
|     tfJob.parts.tfJobReplica("PS", numPs, args, image), | ||||
|   ]), | ||||
| ])) | ||||
|  | @ -1,49 +0,0 @@ | |||
| local k = import "k.libsonnet"; | ||||
| 
 | ||||
| { | ||||
|   parts:: { | ||||
|     tfJobReplica(replicaType, number, args, image, numGpus=0):: | ||||
|       local baseContainer = { | ||||
|         image: image, | ||||
|         name: "tensorflow", | ||||
|       }; | ||||
|       local containerArgs = if std.length(args) > 0 then | ||||
|         { | ||||
|           args: args, | ||||
|         } | ||||
|       else {}; | ||||
|       local resources = if numGpus > 0 then { | ||||
|         resources: { | ||||
|           limits: { | ||||
|             "nvidia.com/gpu": numGpus, | ||||
|           }, | ||||
|         }, | ||||
|       } else {}; | ||||
|       if number > 0 then | ||||
|         { | ||||
|           replicas: number, | ||||
|           template: { | ||||
|             spec: { | ||||
|               containers: [ | ||||
|                 baseContainer + containerArgs + resources, | ||||
|               ], | ||||
|               restartPolicy: "OnFailure", | ||||
|             }, | ||||
|           }, | ||||
|           tfReplicaType: replicaType, | ||||
|         } | ||||
|       else {}, | ||||
| 
 | ||||
|     tfJob(name, namespace, replicas):: { | ||||
|       apiVersion: "kubeflow.org/v1alpha1", | ||||
|       kind: "TFJob", | ||||
|       metadata: { | ||||
|         name: name, | ||||
|         namespace: namespace, | ||||
|       }, | ||||
|       spec: { | ||||
|         replicaSpecs: replicas, | ||||
|       }, | ||||
|     }, | ||||
|   }, | ||||
| } | ||||
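For readers less familiar with Jsonnet, the structure that `tfJobReplica` and `tfJob` produce can be approximated in Python. This is only an illustrative sketch: the names `tf_job_replica` and `tf_job` are hypothetical mirrors of the Jsonnet functions above, and the filtering of empty replica specs stands in for the `std.prune` call made by the prototype.

```python
def tf_job_replica(replica_type, number, args, image, num_gpus=0):
    """Hypothetical Python mirror of tfJob.parts.tfJobReplica."""
    # A replica count of zero yields an empty spec, as in the Jsonnet version.
    if number <= 0:
        return {}
    container = {"image": image, "name": "tensorflow"}
    # Only set args when at least one argument was supplied.
    if len(args) > 0:
        container["args"] = list(args)
    # Only request GPUs when asked for.
    if num_gpus > 0:
        container["resources"] = {"limits": {"nvidia.com/gpu": num_gpus}}
    return {
        "replicas": number,
        "template": {
            "spec": {
                "containers": [container],
                "restartPolicy": "OnFailure",
            },
        },
        "tfReplicaType": replica_type,
    }


def tf_job(name, namespace, replicas):
    """Hypothetical Python mirror of tfJob.parts.tfJob."""
    return {
        "apiVersion": "kubeflow.org/v1alpha1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        # Assumption: dropping empty specs approximates what std.prune does
        # in the prototype; the Jsonnet tfJob itself does not filter.
        "spec": {"replicaSpecs": [r for r in replicas if r]},
    }
```

With one master, no workers and no parameter servers, only the master spec survives the filtering step, which matches the prototype defaults (`num_masters=1`, `num_ps=0`, `num_workers=0`).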
										
Binary file not shown. (Before size: 620 KiB)
File diff suppressed because one or more lines are too long
Binary file not shown. (Before size: 81 KiB)
Binary file not shown.
Binary file not shown. (Before size: 1.3 MiB)
Binary file not shown. (Before size: 89 KiB)
Binary file not shown. (Before size: 472 KiB)
|  | @ -1,5 +0,0 @@ | |||
| -e git://github.com/tensorflow/agents.git@459c4f88ece996eac3489e6e97a6ee0b30bdd6b3#egg=agents | ||||
| pybullet==1.7.5 | ||||
| gym==0.9.4 | ||||
| tensorflow==1.4.1 | ||||
| google-cloud-storage==1.7.0 | ||||
|  | @ -1,19 +0,0 @@ | |||
| # Copyright 2017 The TensorFlow Agents Authors. | ||||
| # | ||||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||||
| # you may not use this file except in compliance with the License. | ||||
| # You may obtain a copy of the License at | ||||
| # | ||||
| #      http://www.apache.org/licenses/LICENSE-2.0 | ||||
| # | ||||
| # Unless required by applicable law or agreed to in writing, software | ||||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||
| # See the License for the specific language governing permissions and | ||||
| # limitations under the License. | ||||
| 
 | ||||
| """Proximal Policy Optimization algorithm.""" | ||||
| 
 | ||||
| from __future__ import absolute_import | ||||
| from __future__ import division | ||||
| from __future__ import print_function | ||||
|  | @ -1,320 +0,0 @@ | |||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||||
| # you may not use this file except in compliance with the License. | ||||
| # You may obtain a copy of the License at | ||||
| # | ||||
| #      http://www.apache.org/licenses/LICENSE-2.0 | ||||
| # | ||||
| # Unless required by applicable law or agreed to in writing, software | ||||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||
| # See the License for the specific language governing permissions and | ||||
| # limitations under the License. | ||||
| 
 | ||||
| """Provides an entrypoint for the training and rendering tasks. | ||||
| 
 | ||||
| Usage: python -m trainer.task [options] | ||||
| 
 | ||||
| """ | ||||
| 
 | ||||
| from __future__ import absolute_import, division, print_function | ||||
| 
 | ||||
| import datetime | ||||
| import logging | ||||
| import os | ||||
| import pprint | ||||
| import uuid | ||||
| import shutil | ||||
| 
 | ||||
| from google.cloud import storage | ||||
| import tensorflow as tf | ||||
| 
 | ||||
| #pylint: disable=unused-import | ||||
| import pybullet_envs | ||||
| 
 | ||||
| import agents | ||||
| 
 | ||||
| 
 | ||||
| flags = tf.app.flags | ||||
| 
 | ||||
| flags.DEFINE_string("run_mode", "train", | ||||
|                     "Run mode, one of [train, render, train_and_render].") | ||||
| flags.DEFINE_string("logdir", '/tmp/test', | ||||
|                     "The base directory in which to write logs and " | ||||
|                     "checkpoints.") | ||||
| flags.DEFINE_string("hparam_set_id", "pybullet_kuka_ff", | ||||
|                     "The name of the config object to be used to parameterize " | ||||
|                     "the run.") | ||||
| flags.DEFINE_string("run_base_tag", | ||||
|                     datetime.datetime.now().strftime('%Y%m%dT%H%M%S'), | ||||
|                     "Base tag to prepend to logs dir folder name. Defaults " | ||||
|                     "to timestamp.") | ||||
| flags.DEFINE_boolean("env_processes", True, | ||||
|                      "Step environments in separate processes to circumvent " | ||||
|                      "the GIL.") | ||||
| flags.DEFINE_integer("num_gpus", 0, | ||||
|                      "Total number of GPUs for each machine. " | ||||
|                      "If you don't use GPUs, set it to 0.") | ||||
| flags.DEFINE_integer("save_checkpoint_secs", 600, | ||||
|                      "Number of seconds between checkpoint save.") | ||||
| flags.DEFINE_boolean("log_device_placement", False, | ||||
|                      "Whether to output logs listing the devices on which " | ||||
|                      "variables are placed.") | ||||
| flags.DEFINE_boolean("debug", True, | ||||
|                      "Run in debug mode.") | ||||
| 
 | ||||
| # Render | ||||
| flags.DEFINE_integer("render_secs", 600, | ||||
|                      "Number of seconds between triggering render jobs.") | ||||
| flags.DEFINE_string("render_out_dir", None, | ||||
|                     "The path to which to copy generated renders.") | ||||
| 
 | ||||
| # Algorithm | ||||
| flags.DEFINE_string("algorithm", "agents.ppo.PPOAlgorithm", | ||||
|                     "The name of the algorithm to use.") | ||||
| flags.DEFINE_integer("num_agents", 30, | ||||
|                      "The number of agents to use.") | ||||
| flags.DEFINE_integer("eval_episodes", 25, | ||||
|                      "The number of eval episodes to use.") | ||||
| flags.DEFINE_string("env", "AntBulletEnv-v0", | ||||
|                     "The gym / bullet simulation environment to use.") | ||||
| flags.DEFINE_integer("max_length", 1000, | ||||
|                      "The maximum length of an episode.") | ||||
| flags.DEFINE_integer("steps", 10000000, | ||||
|                      "The number of steps.") | ||||
| 
 | ||||
| # Network | ||||
| flags.DEFINE_string("network", "agents.scripts.networks.feed_forward_gaussian", | ||||
|                     "The registered network name to use for policy and value.") | ||||
| flags.DEFINE_float("init_mean_factor", 0.1, | ||||
|                    "") | ||||
| flags.DEFINE_float("init_std", 0.35, | ||||
|                    "") | ||||
| 
 | ||||
| # Optimization | ||||
| flags.DEFINE_float("learning_rate", 1e-4, | ||||
|                    "The learning rate of the optimizer.") | ||||
| flags.DEFINE_string("optimizer", "tensorflow.train.AdamOptimizer", | ||||
|                     "The import path of the optimizer to use.") | ||||
| flags.DEFINE_integer("update_epochs", 25, | ||||
|                      "The number of update epochs.") | ||||
| flags.DEFINE_integer("update_every", 60, | ||||
|                      "The update frequency.") | ||||
| 
 | ||||
| # Losses | ||||
| flags.DEFINE_float("discount", 0.995, | ||||
|                    "The discount.") | ||||
| flags.DEFINE_float("kl_target", 1e-2, | ||||
|                    "The KL target.") | ||||
| flags.DEFINE_integer("kl_cutoff_factor", 2, | ||||
|                      "The KL cutoff factor.") | ||||
| flags.DEFINE_integer("kl_cutoff_coef", 1000, | ||||
|                      "The KL cutoff coefficient.") | ||||
| flags.DEFINE_integer("kl_init_penalty", 1, | ||||
|                      "The initial KL penalty.") | ||||
| 
 | ||||
| FLAGS = flags.FLAGS | ||||
| 
 | ||||
| 
 | ||||
| hparams_base = { | ||||
| 
 | ||||
|   # General | ||||
|   "algorithm": agents.ppo.PPOAlgorithm, | ||||
|   "num_agents": 30, | ||||
|   "eval_episodes": 30, | ||||
|   "use_gpu": False, | ||||
| 
 | ||||
|   # Environment | ||||
|   "env": 'KukaBulletEnv-v0', | ||||
|   "normalize_ranges": True, | ||||
|   "max_length": 1000, | ||||
| 
 | ||||
|   # Network | ||||
|   "network": agents.scripts.networks.feed_forward_gaussian, | ||||
|   "weight_summaries": dict( | ||||
|     all=r'.*', policy=r'.*/policy/.*', value=r'.*/value/.*'), | ||||
|   "policy_layers": (200, 100), | ||||
|   "value_layers": (200, 100), | ||||
|   "init_output_factor": 0.1, | ||||
|   "init_logstd": -1, | ||||
|   "init_std": 0.35, | ||||
| 
 | ||||
|   # Optimization | ||||
|   "update_every": 60, | ||||
|   "update_epochs": 25, | ||||
|   "optimizer": tf.train.AdamOptimizer, | ||||
|   "learning_rate": 1e-4, | ||||
|   "steps": 3e7,  # 30M | ||||
| 
 | ||||
|   # Losses | ||||
|   "discount": 0.995, | ||||
|   "kl_target": 1e-2, | ||||
|   "kl_cutoff_factor": 2, | ||||
|   "kl_cutoff_coef": 1000, | ||||
|   "kl_init_penalty": 1, | ||||
| } | ||||
| 
 | ||||
| 
 | ||||
| def _object_import_from_string(name): | ||||
|   """Import and return an object from a string import path. | ||||
| 
 | ||||
|   Args: | ||||
|     name (str): A string import path that starts with an importable | ||||
|         module name (e.g. "tensorflow.train.AdamOptimizer") | ||||
| 
 | ||||
|   Returns: | ||||
|     obj: The imported Python object | ||||
| 
 | ||||
|   """ | ||||
|   components = name.split('.') | ||||
|   mod = __import__(components[0]) | ||||
|   for comp in components[1:]: | ||||
|     mod = getattr(mod, comp) | ||||
|   return mod | ||||
| 
 | ||||
| 
 | ||||
| def _realize_import_attrs(d, hparam_filter): | ||||
|   """Import objects from string paths in dict if in `hparam_filter`. | ||||
| 
 | ||||
|   Notes: | ||||
|   The following call with an optimizer object referenced as a str: | ||||
|       _realize_import_attrs( | ||||
|           {"optimizer": "tensorflow.train.AdamOptimizer"}, | ||||
|           ["optimizer"]) | ||||
|   returns {"optimizer": tf.train.AdamOptimizer} | ||||
| 
 | ||||
|   This is part of an experiment on how to make all hyperparameters | ||||
|   configurable, including python objects, towards more flexible | ||||
|   tuning. | ||||
| 
 | ||||
|   """ | ||||
|   for k, v in d.items(): | ||||
|     if k in hparam_filter: | ||||
|       imported = _object_import_from_string(v) | ||||
|       # TODO: Provide an appropriately informative error if the import fails | ||||
|       # except ImportError as e: | ||||
|       #   msg = ("Failed to realize import path %s." % v) | ||||
|       #   raise e | ||||
|       d[k] = imported | ||||
|   return d | ||||
| 
 | ||||
| 
 | ||||
| def _get_agents_configuration(log_dir=None): | ||||
|   """Load hyperparameter config. | ||||
| 
 | ||||
|   Args: | ||||
|     log_dir (str): The directory in which to search for a | ||||
|         tensorflow/agents config file. | ||||
| 
 | ||||
|   Returns: | ||||
|     dict: A dictionary storing the hyperparameter config | ||||
|         for this run. | ||||
| 
 | ||||
|   """ | ||||
|   try: | ||||
|     # Try to resume training. | ||||
|     hparams = agents.scripts.utility.load_config(log_dir) | ||||
|   except IOError: | ||||
| 
 | ||||
|     # Copy so that the module-level defaults are not mutated below. | ||||
|     hparams = dict(hparams_base) | ||||
| 
 | ||||
|     # -------- | ||||
|     # Experimental | ||||
|     for k, v in FLAGS.__dict__['__flags'].items(): | ||||
|       hparams[k] = v | ||||
|     hparams = _realize_import_attrs( | ||||
|         hparams, ["network", "algorithm", "optimizer"]) | ||||
|     # -------- | ||||
| 
 | ||||
|     hparams = agents.tools.AttrDict(hparams) | ||||
|     hparams = agents.scripts.utility.save_config(hparams, log_dir) | ||||
| 
 | ||||
|   pprint.pprint(hparams) | ||||
|   return hparams | ||||
| 
 | ||||
| 
 | ||||
| def gcs_upload(local_dir, gcs_out_dir): | ||||
|   """Upload the contents of a local directory to a specific GCS path. | ||||
| 
 | ||||
|   Args: | ||||
|     local_dir (str): The local directory containing files to upload. | ||||
|     gcs_out_dir (str): The target Google Cloud Storage directory path. | ||||
| 
 | ||||
|   Raises: | ||||
|     ValueError: If `gcs_out_dir` does not start with "gs://". | ||||
| 
 | ||||
|   """ | ||||
| 
 | ||||
|   # Get a list of all files in the local_dir | ||||
|   local_files = [f for f in os.listdir( | ||||
|       local_dir) if os.path.isfile(os.path.join(local_dir, f))] | ||||
|   tf.logging.info("Preparing local files for upload:\n %s" % local_files) | ||||
| 
 | ||||
|   # Initialize the GCS API client | ||||
|   storage_client = storage.Client() | ||||
| 
 | ||||
|   # Raise an error if the target directory is not a GCS path | ||||
|   if not gcs_out_dir.startswith("gs://"): | ||||
|     raise ValueError( | ||||
|         "gcs_upload expected gcs_out_dir argument to start with gs://, saw %s" % gcs_out_dir) | ||||
| 
 | ||||
|   # TODO: Detect and handle case where a GCS path has been provided | ||||
|   # corresponding to a bucket that does not exist or for which the user does | ||||
|   # not have permissions. | ||||
| 
 | ||||
|   # Obtain the bucket path from the total path | ||||
|   bucket_path = gcs_out_dir.split('/')[2] | ||||
|   bucket = storage_client.get_bucket(bucket_path) | ||||
| 
 | ||||
|   # Construct a target upload path that excludes the initial gs://bucket-name | ||||
|   blob_base_path = '/'.join(gcs_out_dir.split('/')[3:]) | ||||
| 
 | ||||
|   # For each local file *name* in the list of local file names | ||||
|   for local_filename in local_files: | ||||
| 
 | ||||
|     # Construct the target and local *paths* | ||||
|     blob_path = os.path.join(blob_base_path, local_filename) | ||||
|     blob = bucket.blob(blob_path) | ||||
|     local_file_path = os.path.join(local_dir, local_filename) | ||||
| 
 | ||||
|     # Perform the upload operation | ||||
|     blob.upload_from_filename(local_file_path) | ||||
| 
 | ||||
| 
 | ||||
| def main(_): | ||||
|   """Configures run and initiates either training or rendering.""" | ||||
| 
 | ||||
|   tf.logging.set_verbosity(tf.logging.INFO) | ||||
| 
 | ||||
|   if FLAGS.debug: | ||||
|     tf.logging.set_verbosity(tf.logging.DEBUG) | ||||
| 
 | ||||
|   log_dir = FLAGS.logdir | ||||
| 
 | ||||
|   agents_config = _get_agents_configuration(log_dir) | ||||
| 
 | ||||
|   if FLAGS.run_mode == 'train': | ||||
|     for score in agents.scripts.train.train( | ||||
|         agents_config, env_processes=FLAGS.env_processes): | ||||
|       logging.info('Score %s.', score) | ||||
|   if FLAGS.run_mode == 'render': | ||||
|     now = datetime.datetime.now() | ||||
|     subdir = now.strftime("%m%d-%H%M") + "-" + uuid.uuid4().hex[0:4] | ||||
|     render_tmp_dir = "/tmp/agents-render/" | ||||
|     os.system('mkdir -p %s' % render_tmp_dir) | ||||
|     agents.scripts.visualize.visualize( | ||||
|         logdir=FLAGS.logdir, outdir=render_tmp_dir, num_agents=1, num_episodes=1, | ||||
|         checkpoint=None, env_processes=FLAGS.env_processes) | ||||
|     render_out_dir = FLAGS.render_out_dir | ||||
|     # Unless a render out dir is specified explicitly, upload to a unique | ||||
|     # subdir of <logdir>/render/. | ||||
|     if render_out_dir is None: | ||||
|       render_out_dir = os.path.join(FLAGS.logdir, "render", subdir) | ||||
|     if render_out_dir.startswith("gs://"): | ||||
|       gcs_upload(render_tmp_dir, render_out_dir) | ||||
|     else: | ||||
|       shutil.copytree(render_tmp_dir, render_out_dir) | ||||
| 
 | ||||
|   return True | ||||
| 
 | ||||
| 
 | ||||
| if __name__ == '__main__': | ||||
|   tf.app.run() | ||||
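Two helpers in `trainer/task.py` above do small string-handling jobs: `_object_import_from_string` resolves a dotted import path to a Python object, and `gcs_upload` splits a `gs://` path into a bucket name and an object prefix. The sketch below shows one way these steps could be written with the error reporting that the TODO comments ask for. The function names `resolve_import_path` and `split_gcs_path` are hypothetical and not part of the original file, and the sketch swaps the original `__import__` call for the standard-library `importlib.import_module`.

```python
import importlib


def resolve_import_path(path):
    """Resolve a dotted path such as "os.path.join" to the object it names.

    Hypothetical helper: tries the longest importable module prefix first,
    then walks the remaining components as attributes.
    """
    parts = path.split(".")
    for i in range(len(parts), 0, -1):
        module_name = ".".join(parts[:i])
        try:
            obj = importlib.import_module(module_name)
        except ImportError:
            # Not a module at this length; try a shorter prefix.
            continue
        try:
            for attr in parts[i:]:
                obj = getattr(obj, attr)
        except AttributeError:
            raise ImportError("Failed to realize import path %s." % path)
        return obj
    raise ImportError("Failed to realize import path %s." % path)


def split_gcs_path(gcs_path):
    """Split "gs://bucket/some/prefix" into ("bucket", "some/prefix").

    Hypothetical helper: raises ValueError for paths without the gs:// scheme.
    """
    scheme = "gs://"
    if not gcs_path.startswith(scheme):
        raise ValueError(
            "expected a path starting with gs://, saw %s" % gcs_path)
    bucket, _, prefix = gcs_path[len(scheme):].partition("/")
    return bucket, prefix
```

A possible usage: `resolve_import_path("tensorflow.train.AdamOptimizer")` would stand in for the optimizer lookup (assuming TensorFlow 1.x is installed), and `split_gcs_path(render_out_dir)` would supply the bucket name and blob prefix. Since GCS object names use `/` as the separator regardless of platform, joining the prefix and file name with `posixpath.join` rather than `os.path.join` may be the safer choice.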