mirror of https://github.com/docker/docs.git

First pass at jobrunner docs for DTR

parent 9127e313ee, commit 2fcee427e3

```diff
@@ -1336,6 +1336,8 @@ manuals:
         title: Chain multiple caches
       - path: /datacenter/dtr/2.2/guides/admin/configure/garbage-collection/
         title: Garbage collection
+      - path: /datacenter/dtr/2.2/guides/admin/configure/jobrunner/
+        title: Jobrunner
       - sectiontitle: Manage users
         section:
           - path: /datacenter/dtr/2.2/guides/admin/manage-users/
```

```diff
@@ -1,6 +1,6 @@
 ---
 description: Configure garbage collection in Docker Trusted Registry
-keyworkds: docker, registry, garbage collection, gc, space, disk space
+keywords: docker, registry, garbage collection, gc, space, disk space
 title: Docker Trusted Registry 2.2 Garbage Collection
 ---
```

@@ -0,0 +1,265 @@

---
title: Jobrunner
description: Learn about the inner workings of the jobrunner container in the DTR workflow.
keywords: docker, job, runner
---

The jobrunner container is a DTR mechanism that:

1. Consumes jobs from a cluster-wide jobs queue
2. Performs the work of the given action

There is one jobrunner container per replica.

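Because each replica runs its own jobrunner, you can see the container with `docker ps` on any node that hosts a DTR replica. A minimal sketch, assuming DTR's default container naming of `dtr-jobrunner-<replica-id>` (adjust the filter if your deployment names containers differently):

```bash
# List the jobrunner container on this node.
# There should be one per DTR replica; the name is assumed to
# include the replica ID (e.g. dtr-jobrunner-000000000000).
docker ps --filter "name=dtr-jobrunner"
```
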
## Jobrunner Workflow

[//]: # (uncomment once diagrams are complete)

The following diagram depicts the behavior of the jobrunner:

[//]: # (Placeholder for jobrunner diagram. @sarahpark will work on the diagram)

When a job is scheduled (see [Job Actions](#job-actions) below), it is put on the
cluster-wide jobs queue with an initial status of `waiting`. When a jobrunner worker
is available to pick up the job, it claims it (that is, it sets the job's `workerID`
to the replica ID of the jobrunner container that claimed the job) and sets the job
status to `running`. The worker carries out the job and then sets the appropriate
status when complete (see [Job Statuses](#job-statuses) below).

[//]: # (Placeholder for jobrunner scheduling. @sarahpark will work on the diagram)

Each jobrunner has an internal queue of `waiting` jobs sorted by their `scheduledAt`
time (from earliest to latest). When a worker is free to claim the next job, it claims
it after a delay of up to 3 seconds. This delay is imposed so that each available worker
has a chance to compete for the job. The worker that successfully claims the job sets
the job's `workerID`, and all other workers drop the job from their internal queues.
If a job cannot be claimed due to capacity limits (see [Job Capacities](#job-capacities)),
it is placed into a separate queue and goes through the claiming process once the
worker has enough free capacity for it.

Jobrunners monitor each other's `heartbeatExpiration` in the workers table. When a worker
sees that another worker hasn't updated its expiration in a long time, it sets the
dead worker's status to `dead` and its jobs to `worker_dead`. If the dead worker is able to
reconnect to the database and notices that its jobs have been set to `worker_dead`,
it sets those job statuses to `worker_resurrection` and cancels them.

### Job Actions

The available job actions are:

- `gc`
: Garbage collection deletes layers associated with deleted images.
- `sleep`
: Sleep is used to test the correctness of the jobrunner. It sleeps for 60 seconds.
- `false`
: False is used to test the correctness of the jobrunner. It runs the `false` command and immediately fails.
- `tagmigration`
: Tag migration is used to sync tag and manifest information from the blobstore into the database.
This information is used by the API and UI, and is also used for GC.
- `bloblinkmigration`
: bloblinkmigration is a 2.1 to 2.1 upgrade process that adds references for blobs to repositories in the database.
- `license_update`
: License update checks for license expiration extensions if online license updates are enabled.
- `nautilus_scan_check`
: An image security scanning job. This job does not perform the actual scanning; rather, it
spawns `nautilus_scan_check_single` jobs (one for each layer in the image). Once
all of the `nautilus_scan_check_single` jobs are complete, this job terminates.
- `nautilus_scan_check_single`
: A security scanning job for a particular layer, given by the `SHA256SUM` parameter. This job
breaks up the layer into components and checks each component for vulnerabilities
(see [Security Scanning](../../user/manage-images/scan-images-for-vulnerabilities.md)).
- `nautilus_update_db`
: A job that is created to update DTR's vulnerability database. It uses an
Internet connection to check for database updates through `https://dss-cve-updates.docker.com/` and
updates the `dtr-scanningstore` container if there is a new update available (see [Set up vulnerability scans](set-up-vulnerability-scans.md)).
- `webhook`
: A job that is used to dispatch a webhook payload to a single endpoint.

#### Job Capacities

As mentioned above, each jobrunner container acts as one worker that can carry out these actions.
The number of a particular action a worker can carry out is defined by its capacity, which can be
seen in the `GET /api/v0/workers` endpoint. For example, the workers entry may look like this:

```json
{
  "workers": [
    {
      "id": "000000000000",
      "status": "running",
      "capacityMap": {
        "scan": 1,
        "scanCheck": 1
      },
      "heartbeatExpiration": "2017-02-18T00:51:02Z"
    }
  ]
}
```

This means that the worker with the replica ID `000000000000` has a capacity of 1 `scan` and 1
`scanCheck`. A job may have a `capacityMap` field which dictates how much capacity a worker
must have available for the job to be executed.

For example, if we take the above worker's `capacityMap` and the following jobs:

```json
{
  "jobs": [
    {
      "id": "0",
      "workerID": "",
      "status": "waiting",
      "capacityMap": {
        "scan": 1
      }
    },
    {
      "id": "1",
      "workerID": "",
      "status": "waiting",
      "capacityMap": {
        "scan": 1
      }
    },
    {
      "id": "2",
      "workerID": "",
      "status": "waiting",
      "capacityMap": {
        "scanCheck": 1
      }
    }
  ]
}
```

Our worker will be able to pick up jobs `0` and `2`, since it has the capacity for both,
while job `1` will have to wait until the previous scan job is complete:

```json
{
  "jobs": [
    {
      "id": "0",
      "workerID": "000000000000",
      "status": "running",
      "capacityMap": {
        "scan": 1
      }
    },
    {
      "id": "1",
      "workerID": "",
      "status": "waiting",
      "capacityMap": {
        "scan": 1
      }
    },
    {
      "id": "2",
      "workerID": "000000000000",
      "status": "running",
      "capacityMap": {
        "scanCheck": 1
      }
    }
  ]
}
```

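You can query this endpoint directly to inspect your own workers' capacities. A minimal sketch, assuming your DTR is reachable at `dtr.example.org` (hypothetical) and that you authenticate with admin credentials:

```bash
# Fetch the workers list, including each worker's capacityMap.
# The host and credentials below are placeholders; replace them with your own.
curl --user admin:password \
  https://dtr.example.org/api/v0/workers
```
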
## Job Statuses

Jobs can have the following statuses:

- `waiting`
: the job is unclaimed and waiting to be picked up by a worker
- `running`
: the worker defined by `workerID` is currently running the job
- `done`
: the job has completed successfully
- `error`
: the job has completed with errors
- `cancel_request`
: the worker monitors the job statuses in the database. If the status for a job changes
to `cancel_request`, the worker cancels the job
- `cancel`
: the job has been cancelled and not fully executed
- `deleted`
: the job and its logs have been removed
- `worker_dead`
: the worker for this job has been declared `dead` and the job will not continue
- `worker_shutdown`
: the worker that was running this job has been gracefully stopped
- `worker_resurrection`
: the worker for this job has reconnected to the database and will cancel this job

## Troubleshooting a Job

An entry for a job can look like:

```json
{
  "id": "1fcf4c0f-ff3b-471a-8839-5dcb631b2f7b",
  "retryFromID": "1fcf4c0f-ff3b-471a-8839-5dcb631b2f7b",
  "workerID": "000000000000",
  "status": "done",
  "scheduledAt": "2017-02-17T01:09:47.771Z",
  "lastUpdated": "2017-02-17T01:10:14.117Z",
  "action": "nautilus_scan_check_single",
  "retriesLeft": 0,
  "retriesTotal": 0,
  "capacityMap": {
    "scan": 1
  },
  "parameters": {
    "SHA256SUM": "1bacd3c8ccb1f15609a10bd4a403831d0ec0b354438ddbf644c95c5d54f8eb13"
  },
  "deadline": "",
  "stopTimeout": ""
}
```

The fields of interest here are:

- `id`: the ID of the job itself
- `workerID`: the ID of the jobrunner worker (synonymous with the DTR replica ID) that is running this job
- `status`: the current state of the job (see [Job Statuses](#job-statuses))
- `action`: what job the worker will actually perform (see [Job Actions](#job-actions))
- `capacityMap`: the available "capacity" a worker needs for this job to run (see [Job Capacities](#job-capacities))

You can view the logs of a particular job by hitting the `GET /api/v0/jobs/{jobID}/logs` endpoint
with the job's `id` as `{jobID}`.

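For example, to fetch the logs of the job shown above — a minimal sketch, again assuming a DTR at `dtr.example.org` (hypothetical) and admin credentials:

```bash
# Retrieve the logs for a single job, using the job's id as {jobID}.
# Replace the host and credentials with your own.
curl --user admin:password \
  https://dtr.example.org/api/v0/jobs/1fcf4c0f-ff3b-471a-8839-5dcb631b2f7b/logs
```
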
## Cron jobs

Several of the jobs listed in [Job Actions](#job-actions) have been set to run on a
recurring schedule. You can view these jobs with the `GET /api/v0/crons` endpoint, which
returns a list similar to this example:

```json
{
  "crons": [
    {
      "id": "48875b1b-5006-48f5-9f3c-af9fbdd82255",
      "action": "license_update",
      "schedule": "57 54 3 * * *",
      "retries": 2,
      "capacityMap": null,
      "parameters": null,
      "deadline": "",
      "stopTimeout": "",
      "nextRun": "2017-02-22T03:54:57Z"
    },
    {
      "id": "b1c1e61e-1e74-4677-8e4a-2a7dacefffdc",
      "action": "nautilus_update_db",
      "schedule": "0 0 3 * * *",
      "retries": 0,
      "capacityMap": null,
      "parameters": null,
      "deadline": "",
      "stopTimeout": "",
      "nextRun": "2017-02-22T03:00:00Z"
    }
  ]
}
```

The `schedule` uses a format similar to a typical Unix crontab, with an extra leading
field for seconds: `"second minute hour day month year"`. It determines the next time
the `action` will take place.

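As a worked reading of the two schedules in the example above:

```bash
# Decoding the six-field schedule expressions returned by GET /api/v0/crons:
#
#   "57 54 3 * * *"  ->  second 57, minute 54, hour 3  ->  daily at 03:54:57
#   "0 0 3 * * *"    ->  second 0,  minute 0,  hour 3  ->  daily at 03:00:00
#
# Both match the nextRun timestamps shown in the example response.
```
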