For now, we ignore parse errors for models that gguf-parser-go can't
parse yet. This behavior regressed in the pre-pull memory estimation PR.
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
We will run into cases where our model runner is ahead of
gguf-parser-go and we want to load a model whose metadata it can't yet
parse. For now, ignore model parsing errors in such cases and assume
the model requires no resources. In the future we should come up with a
cleaner way of dealing with this (e.g. ship a model memory estimator
along with llama-server).
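
For illustration, a minimal sketch of the fallback; the function names
and signature are hypothetical, not the actual scheduler code:

    package scheduler

    import "log"

    // requiredMemory returns the estimated peak memory (in bytes) for a
    // model, falling back to zero when the model can't be parsed yet.
    // The estimate function is assumed to wrap gguf-parser-go.
    func requiredMemory(modelPath string, estimate func(string) (uint64, error)) (uint64, error) {
        bytes, err := estimate(modelPath)
        if err != nil {
            // gguf-parser-go may lag behind the model runner; rather than
            // refusing to load the model, assume it needs no resources.
            log.Printf("ignoring model parse error for %s: %v", modelPath, err)
            return 0, nil
        }
        return bytes, nil
    }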
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
- a Helm chart
- static Kubernetes configs for a few common setups
I put these under ./charts so we can expose
this as a Helm chart repo later if we want,
but for now we'll just tell people to install it
from source.
Signed-off-by: Nick Santos <nick.santos@docker.com>
First-pass implementation of memory estimation logic in the model
scheduler. This change relies heavily on gguf-parser-go to estimate the
peak memory required to run inference with a given model. It adds
GetRequiredMemoryForModel() to the Backend interface so that each
backend can take its own configuration into account when calculating
required memory.
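
Roughly, the interface change looks like the sketch below; the
parameter list and result type are illustrative assumptions, not the
exact signature:

    package inference

    import "context"

    // RequiredMemory is an illustrative result type; the real scheduler
    // may use something different.
    type RequiredMemory struct {
        RAM  uint64 // estimated peak host memory, in bytes
        VRAM uint64 // estimated peak GPU memory, in bytes
    }

    // Backend shows only the new method; existing methods are omitted.
    type Backend interface {
        // GetRequiredMemoryForModel estimates the peak memory needed to
        // run inference with the given model, taking the backend's own
        // configuration into account.
        GetRequiredMemoryForModel(ctx context.Context, model string) (RequiredMemory, error)
    }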
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Global locks in ServeHTTP methods were preventing concurrent request processing. Individual handlers should manage their own synchronization.
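
A minimal sketch of the pattern, with hypothetical names (not the
actual handler code): hold a lock only while reading shared state, not
around the whole ServeHTTP body, and leave request-level
synchronization to the individual handlers.

    package routing

    import (
        "net/http"
        "sync"
    )

    // router is a hypothetical stand-in for the real handler type.
    type router struct {
        mu       sync.Mutex // previously held for the whole ServeHTTP call
        handlers map[string]http.Handler
    }

    func (rt *router) ServeHTTP(w http.ResponseWriter, r *http.Request) {
        // Lock only while touching shared state, not for the duration of
        // the request, so requests can be processed concurrently.
        rt.mu.Lock()
        h, ok := rt.handlers[r.URL.Path]
        rt.mu.Unlock()
        if !ok {
            http.NotFound(w, r)
            return
        }
        h.ServeHTTP(w, r) // each handler manages its own synchronization
    }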
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Prevents "superfluous response.WriteHeader" errors by checking if the request context was canceled or timed out before attempting to call http.Error().
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Using Shutdown with an already cancelled context will cause the method
to return almost immediately, with only idle connections being closed.
More problematically, it can't close active connections, which can
remain in flight indefinitely. At the moment, these active connections
(and their associated contexts) can cause loader.load() to block
loader.run() from exiting, especially if a backend is misbehaving, which
can cause shutdown to halt waiting on the request. Even if a backend
isn't misbehaving, an inference request can take many seconds. The
best solution would be to make loader.load() unblock if the context
passed to loader.run() is cancelled, but this is fairly complicated to
implement. The easier solution for now is just to use a hard server
Close() to cancel inflight requests (and their contexts) and then wait
for scheduler shutdown. This is what we do in Docker Desktop.
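
A rough sketch of the sequence, with placeholder names (server,
schedulerDone); the actual shutdown code may differ:

    package runner

    import "net/http"

    // stop hard-closes the HTTP server, then waits for the scheduler.
    func stop(server *http.Server, schedulerDone <-chan struct{}) error {
        // Close (unlike Shutdown with an already-cancelled context) also
        // closes active connections, cancelling in-flight request
        // contexts so loader.load() can't keep loader.run() from exiting.
        err := server.Close()

        // Then wait for the scheduler to finish shutting down.
        <-schedulerDone
        return err
    }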
Signed-off-by: Jacob Howard <jacob.howard@docker.com>