Commit Graph

297 Commits

Author SHA1 Message Date
Emily Casey bb7abccf47
Update pkg/inference/backends/llamacpp/llamacpp.go
Co-authored-by: Jacob Howard <jacob.howard@docker.com>
2025-08-22 10:38:25 -06:00
Emily Casey 9f7f778e82 Fix remote memory estimation:
* pull in blob URL fix - https://github.com/docker/model-distribution/pull/123
* don't attempt to estimate sharded models

Signed-off-by: Emily Casey <emily.casey@docker.com>
2025-08-22 10:35:47 -06:00
Piotr Stankiewicz d8ed374455 inference: Use common system memory size getter in the loader
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-22 17:11:18 +02:00
Piotr Stankiewicz 03f7adc077 inference: Fix ignoring parse errors for unknown models
For now, we ignore parse errors for models that gguf-parser-go can't
parse yet. This regressed in the pre-pull memory estimation PR.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-22 17:06:23 +02:00
Piotr Stankiewicz 6d72f943f6 Make sure I don't commit vendor/ again
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-22 17:05:56 +02:00
Piotr Stankiewicz 77e0de486f Remove vendor/
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-22 17:05:56 +02:00
Piotr Stankiewicz 933edd2249 inference: Fix up review comments
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-22 10:15:03 +02:00
Piotr Stankiewicz 64c85dcd83 inference: Support disabling pre-pull memory checks
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-22 10:15:03 +02:00
Piotr Stankiewicz 15e31feb30 inference: Block pull if model requires too much memory to run
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-22 10:15:03 +02:00
Piotr Stankiewicz 880818f741 inference: Support memory estimation for remote models
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-22 10:15:03 +02:00
Piotr Stankiewicz 59da65a365 Bump docker/model-distribution
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-22 10:15:03 +02:00
Piotr Stankiewicz 1c13e4fc61 inference: Ignore parse errors when estimating model memory
We will run into cases where our model runner is ahead of
gguf-parser-go. In such cases we may want to load a model whose parse
will fail. So, for now, we ignore model parsing errors in such cases
and assume the model takes no resources. In the future we
should come up with a cleaner way of dealing with this (e.g. ship a
model memory estimator along with the llama-server).

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-06 16:52:42 +02:00
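The fallback described in this commit message could be sketched as follows. All names here (`estimateMemory`, `requiredMemoryOrZero`, the error value) are hypothetical stand-ins, not the actual model-runner API:

```go
package main

import (
	"errors"
	"fmt"
)

// errUnsupportedModel stands in for a parse failure from gguf-parser-go
// on a model format it does not yet understand.
var errUnsupportedModel = errors.New("unsupported GGUF version")

// estimateMemory is a stand-in for the real estimator; it fails for
// models the parser cannot yet handle.
func estimateMemory(model string) (uint64, error) {
	if model == "future-model" {
		return 0, errUnsupportedModel
	}
	return 4 << 30, nil // pretend every known model needs 4 GiB
}

// requiredMemoryOrZero ignores parse errors and assumes the model
// takes no resources, matching the interim behaviour in the commit.
func requiredMemoryOrZero(model string) uint64 {
	size, err := estimateMemory(model)
	if err != nil {
		// Parser is behind the runner: log and assume zero so the
		// load is not blocked.
		fmt.Printf("ignoring memory estimation error for %s: %v\n", model, err)
		return 0
	}
	return size
}

func main() {
	fmt.Println(requiredMemoryOrZero("known-model"))
	fmt.Println(requiredMemoryOrZero("future-model"))
}
```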
Ignasi d61ffd5311
updated the RequestResponsePair struct to differentiate between successful responses and error responses (#128) 2025-08-06 16:29:58 +02:00
Jacob Howard 9e639fd253
Merge pull request #124 from aivantsov/patch-1
Fix the broken link to the Helm chart README.
2025-07-30 11:30:13 +03:00
Andrei Ivantsov 29a306b5af
Fix the broken link to the Helm chart README. 2025-07-30 10:14:38 +02:00
Jacob Howard 6b1cfee5a3
Merge pull request #123 from docker/nicks/chart
charts: add Kubernetes examples
2025-07-30 10:55:53 +03:00
Nick Santos b42f3a0cb5
charts: add Kubernetes examples
- a helm chart
- static Kubernetes configs for a few common setups

I put these under ./charts so we can expose
this as a Helm chart repo later if we want,
but for now we'll just tell people to install it
from source.

Signed-off-by: Nick Santos <nick.santos@docker.com>
2025-07-29 12:53:05 -04:00
Piotr Stankiewicz ecfa5e7e68 gpuinfo: Make CGO optional on darwin
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-24 14:24:32 +02:00
Dorin-Andrei Geman 7777c22890
Merge pull request #113 from docker/model-load
Adds model load endpoint to models API
2025-07-24 14:52:22 +03:00
Dorin Geman e2a0473732 Bump model-distribution to a11d745e58
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-07-24 14:48:53 +03:00
Dorin Geman e748a3c4de chore: group and sort imports
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-07-24 14:48:53 +03:00
Dorin-Andrei Geman 602f657781
Revert "models/load: ensure request body is closed"
Co-authored-by: Jacob Howard <jacob.howard@docker.com>
2025-07-24 14:39:11 +03:00
Piotr Stankiewicz 43b96fc9a8 gpuinfo: Make building without cgo possible on Linux
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-24 11:58:55 +02:00
Dorin Geman db19d8318f
models/load: ensure request body is closed
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-07-24 12:27:25 +03:00
Emily Casey 4215c129be add model/load endpoint
Signed-off-by: Emily Casey <emily.casey@docker.com>
2025-07-23 22:01:20 -06:00
Piotr fc9b2a7171 inference: Fix typo in log
Co-authored-by: Dorin-Andrei Geman <dorin.geman@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz 47517fdefa inference: Fallback behaviour if reading RAM/VRAM size fails
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz 2810fc21bd inference: Always return 1 as VRAM size on win/arm64
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz 5f7d3a22a9 gpuinfo: Use go:build instead of obsolete +build
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
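For context on the constraint-syntax change this commit refers to, an illustrative fragment (the package name and tags here are examples, not necessarily the repo's actual ones):

```go
// Modern build-constraint syntax, supported since Go 1.17.
//go:build darwin && cgo

// The obsolete equivalent removed by commits like this one was:
// // +build darwin,cgo

package gpuinfo
```

`gofmt` keeps the two forms in sync during the transition period, but new code should carry only the `//go:build` line.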
Piotr Stankiewicz 263e4c7732 inference: Fix nv-gpu-info path and wrap errors
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz ecc3f8dde4 inference: Fix failing llama_config unit tests
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz 3548e5f3e6 inference: Keep track of RAM allocated by runners
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz cc9656e64c inference, gpuinfo: Limit allowed models to 1 on windows/arm64 for now
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz 00e3d60de5 gpuinfo: Release Metal device handle in VRAM size getter
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz ea3bb71830 Use nv-gpu-info on Windows to get VRAM size
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz 6e096b2caa Move VRAM size getters to a separate package
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz 96ecef4eed VRAM size getter for windows
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz c458b232a8 VRAM size getter for linux
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz a4dc5834d1 Implement basic memory estimation in scheduler
First pass implementation of memory estimation logic in model scheduler.
This change heavily relies on gguf-parser-go to calculate estimated peak
memory requirement for running inference with a given model. It adds
GetRequiredMemoryForModel() to the Backend interface to allow each
backend to deal with its config and calculate required memory usage
based on it.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
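The interface change this commit message describes might look roughly like the sketch below. The signatures, the `RequiredMemory` breakdown, and the `canLoad` admission check are illustrative assumptions; the actual `Backend` interface in the repo has more methods and different types:

```go
package main

import "fmt"

// RequiredMemory is an illustrative breakdown of a model's estimated
// peak memory needs.
type RequiredMemory struct {
	RAM  uint64
	VRAM uint64
}

// Backend is a sketch of an inference backend interface.
type Backend interface {
	Name() string
	// GetRequiredMemoryForModel lets each backend apply its own config
	// when estimating how much memory a model needs to run.
	GetRequiredMemoryForModel(modelPath string) (RequiredMemory, error)
}

// canLoad is a scheduler-side admission check: refuse to load a model
// whose estimate exceeds what the system can offer.
func canLoad(b Backend, modelPath string, availRAM, availVRAM uint64) (bool, error) {
	req, err := b.GetRequiredMemoryForModel(modelPath)
	if err != nil {
		return false, err
	}
	return req.RAM <= availRAM && req.VRAM <= availVRAM, nil
}

// llamaCpp is a toy backend with a fixed estimate.
type llamaCpp struct{}

func (llamaCpp) Name() string { return "llama.cpp" }
func (llamaCpp) GetRequiredMemoryForModel(string) (RequiredMemory, error) {
	return RequiredMemory{RAM: 2 << 30, VRAM: 6 << 30}, nil
}

func main() {
	ok, _ := canLoad(llamaCpp{}, "model.gguf", 8<<30, 8<<30)
	fmt.Println(ok)
}
```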
Jacob Howard 606aead0e5
Merge pull request #117 from docker/config-delete
Unload configs based on model ID and for both modes.
2025-07-23 14:05:29 +03:00
Jacob Howard 77f24abb8b
Unload configs based on model ID and for both modes.
Follow-up to #98.

Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-07-23 13:57:32 +03:00
Dorin-Andrei Geman 6a695dc026
Merge pull request #116 from doringeman/lock
fix: switch to a RWMutex for synchronizing the router rebuild
2025-07-22 15:32:43 +03:00
Dorin Geman 5fa2bee652
fix: switch to a RWMutex for synchronizing the router rebuild
Global locks in ServeHTTP methods were preventing concurrent request processing. Individual handlers should manage their own synchronization.

Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-07-22 15:29:30 +03:00
Dorin-Andrei Geman 0c1a6b7bec
Merge pull request #115 from doringeman/misc
models: avoid error response when request is canceled/timed out
2025-07-22 13:42:38 +03:00
Dorin Geman 0eff85e3c2
models: avoid error response when request is canceled/timed out
Prevents "superfluous response.WriteHeader" errors by checking if the request context was canceled or timed out before attempting to call http.Error().

Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-07-22 09:45:49 +03:00
Jacob Howard 505613344c
Merge pull request #111 from docker/recorder-use-model-ref
recorder: use model ref rather than model ID
2025-07-18 18:28:59 +03:00
Jacob Howard 3b5b4b58e4
recorder: use model ref rather than model ID
The model ID is already recorded elsewhere and used as the record key.

Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-07-18 18:22:25 +03:00
Jacob Howard c873021e07
Merge pull request #112 from docker/context-regulation
Use more aggressive server shutdown and resequence termination
2025-07-18 18:19:30 +03:00
Jacob Howard 843a94cf1b
Merge pull request #110 from docker/recorder-memory-leak
recorder: fix memory leak due to slice appends
2025-07-18 18:18:52 +03:00
Jacob Howard 3fef8f2980
Use more aggressive server shutdown and resequence termination
Using Shutdown with an already cancelled context will cause the method
to return almost immediately, with only idle connections being closed.
More problematically, it can't close active connections, which can
remain in flight indefinitely. At the moment, these active connections
(and their associated contexts) can cause loader.load() to block
loader.run() from exiting, especially if a backend is misbehaving, which
can cause shutdown to halt waiting on the request. Even if a backend
isn't misbehaving, an inference request can take many seconds. The
best solution would be to make loader.load() unblock if the context
passed to loader.run() is cancelled, but this is fairly complicated to
implement. The easier solution for now is just to use a hard server
Close() to cancel inflight requests (and their contexts) and then wait
for scheduler shutdown. This is what we do in Docker Desktop.

Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-07-18 13:17:47 +03:00