Model Runner

The backend library for the Docker Model Runner.

Overview

Note: This package is still under rapid development and its APIs should not be considered stable.

This package supports the Docker Model Runner in Docker Desktop (in conjunction with Model Distribution and the Model CLI). It also includes a main.go that mimics the Docker Desktop integration, allowing the package to be run in standalone mode.

Using the Makefile

This project includes a Makefile to simplify common development tasks. It requires Docker Desktop >= 4.41.0. The Makefile provides the following targets (typical invocations are shown after the list):

  • build - Build the Go application
  • run - Run the application locally
  • clean - Clean build artifacts
  • test - Run tests
  • docker-build - Build the Docker image
  • docker-run - Run the application in a Docker container with TCP port access and mounted model storage
  • help - Show available targets
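For example, a typical local development loop uses the build, test, and run targets:

# Build the Go application
make build

# Run the test suite
make test

# Run the application locally
make run

# Clean build artifacts
make clean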

Running in Docker

The application can be run in Docker with the following features enabled by default:

  • TCP port access (default port 8080)
  • Persistent model storage in a local models directory
# Run with default settings
make docker-run

# Customize port and model storage location
make docker-run PORT=3000 MODELS_PATH=/path/to/your/models

This will (see the sketch after this list for roughly the equivalent docker command):

  • Create a models directory in your current working directory (or use the specified path)
  • Mount this directory into the container
  • Start the service on port 8080 (or the specified port)
  • Store all downloaded models in the host's models directory so that they persist between container runs
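
Under the hood, make docker-run roughly amounts to publishing the TCP port and bind-mounting the models directory into the container. The sketch below is illustrative only; the image name, container-side mount path, and exact flags are assumptions, so consult the Makefile for the actual command:

# Illustrative approximation of `make docker-run` (not the exact command)
# Image name and container mount path are assumptions; see the Makefile
docker run --rm \
  -p 8080:8080 \
  -v "$(pwd)/models:/models" \
  docker-model-runner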

llama.cpp integration

The Docker image includes the llama.cpp server binary from the docker/docker-model-backend-llamacpp image. You can specify the version of the image to use by setting the LLAMA_SERVER_VERSION variable. Additionally, you can configure the target OS, architecture, and acceleration type:

# Build with a specific llama.cpp server version
make docker-build LLAMA_SERVER_VERSION=v0.0.4

# Specify all parameters
make docker-build LLAMA_SERVER_VERSION=v0.0.4 LLAMA_SERVER_VARIANT=cpu

Default values:

  • LLAMA_SERVER_VERSION: latest
  • LLAMA_SERVER_VARIANT: cpu

The binary path in the image follows this pattern: /com.docker.llama-server.native.linux.${LLAMA_SERVER_VARIANT}.${TARGETARCH}
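
For example, with the default LLAMA_SERVER_VARIANT=cpu on an amd64 image (TARGETARCH=amd64), the binary path resolves to:

/com.docker.llama-server.native.linux.cpu.amd64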

API Examples

The Model Runner exposes a REST API over a TCP port. You can interact with it using curl.

Using the API

When running with make docker-run, you can use regular HTTP requests:

# List all available models
curl http://localhost:8080/models

# Create a new model
curl http://localhost:8080/models/create -X POST -d '{"from": "ai/smollm2"}'

# Get information about a specific model
curl http://localhost:8080/models/ai/smollm2

# Chat with a model
curl http://localhost:8080/engines/llama.cpp/v1/chat/completions -X POST -d '{
  "model": "ai/smollm2",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"}
  ]
}'

# Delete a model
curl http://localhost:8080/models/ai/smollm2 -X DELETE

# Get metrics
curl http://localhost:8080/metrics

The response will contain the model's reply:

{
  "id": "chat-12345",
  "object": "chat.completion",
  "created": 1682456789,
  "model": "ai/smollm2",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking! How can I assist you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 16,
    "total_tokens": 40
  }
}
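
If jq is available on your machine (an assumption; it is not required by this project), you can extract just the assistant's reply from the chat completion response:

# Extract only the reply text (requires jq)
curl -s http://localhost:8080/engines/llama.cpp/v1/chat/completions -X POST -d '{
  "model": "ai/smollm2",
  "messages": [
    {"role": "user", "content": "Hello, how are you?"}
  ]
}' | jq -r '.choices[0].message.content'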

Metrics

The Model Runner exposes the llama.cpp server's metrics at the /metrics endpoint. This allows you to monitor model performance, request statistics, and resource usage.

Accessing Metrics

# Get metrics in Prometheus format
curl http://localhost:8080/metrics

Configuration

  • Enable metrics: metrics are enabled by default, no configuration needed
  • Disable metrics: set the DISABLE_METRICS=1 environment variable (see the example after this list)
  • Monitoring integration: add the /metrics endpoint to your Prometheus scrape configuration
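
For example, to run locally with metrics disabled (this assumes make run forwards environment variables to the binary):

# Run locally with the metrics endpoint disabled
DISABLE_METRICS=1 make run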

Check METRICS.md for more details.