Commit Graph

70 Commits

Emily Casey bb7abccf47
Update pkg/inference/backends/llamacpp/llamacpp.go
Co-authored-by: Jacob Howard <jacob.howard@docker.com>
2025-08-22 10:38:25 -06:00
Emily Casey 9f7f778e82 Fix remote memory estimation:
* pull in blob URL fix - https://github.com/docker/model-distribution/pull/123
* don't attempt to estimate sharded models

Signed-off-by: Emily Casey <emily.casey@docker.com>
2025-08-22 10:35:47 -06:00
Piotr Stankiewicz 03f7adc077 inference: Fix ignoring parse errors for unknown models
We ignore parse errors for models that gguf-parser-go can't parse yet.
This regressed in the pre-pull memory estimation PR.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-22 17:06:23 +02:00
Piotr Stankiewicz 880818f741 inference: Support memory estimation for remote models
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-22 10:15:03 +02:00
Piotr Stankiewicz 1c13e4fc61 inference: Ignore parse errors when estimating model memory
We will run into cases where our model runner is ahead of
gguf-parser-go, and we may want to load a model whose parse fails. So,
for now, ignore model parsing errors in such cases and assume the model
takes no resources. In the future we should come up with a cleaner way
of dealing with this (e.g. ship a model memory estimator along with
llama-server).

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-06 16:52:42 +02:00
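A minimal sketch of the fallback this commit describes, with hypothetical names (parseModel stands in for the gguf-parser-go call): a parse failure is logged and treated as zero required memory instead of blocking the load.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// parseModel stands in for the gguf-parser-go call used by the real
// scheduler; here it always fails, simulating a GGUF revision the parser
// does not understand yet.
func parseModel(path string) (uint64, error) {
	return 0, errors.New("unsupported GGUF version")
}

// estimateRequiredMemory applies the fallback described above: a parse
// failure is logged and treated as "no resources required" rather than
// blocking the model load.
func estimateRequiredMemory(path string) (uint64, error) {
	required, err := parseModel(path)
	if err != nil {
		log.Printf("ignoring model parse error for %q: %v", path, err)
		return 0, nil
	}
	return required, nil
}

func main() {
	required, err := estimateRequiredMemory("model.gguf")
	fmt.Println(required, err) // prints: 0 <nil>
}
```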
Piotr Stankiewicz 2810fc21bd inference: Always return 1 as VRAM size on win/arm64
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz 263e4c7732 inference: Fix nv-gpu-info path and wrap errors
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz ecc3f8dde4 inference: Fix failing llama_config unit tests
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz 3548e5f3e6 inference: Keep track of RAM allocated by runners
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz cc9656e64c inference, gpuinfo: Limit allowed models to 1 on windows/arm64 for now
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz a4dc5834d1 Implement basic memory estimation in scheduler
First-pass implementation of memory estimation logic in the model scheduler.
This change relies heavily on gguf-parser-go to calculate the estimated peak
memory requirement for running inference with a given model. It adds
GetRequiredMemoryForModel() to the Backend interface so that each backend
can account for its own configuration when calculating required memory
usage.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
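A sketch of the interface extension this commit describes; the result type and exact signature are assumptions, not the actual model-runner API.

```go
package inference

import "context"

// RequiredMemory is an illustrative stand-in for whatever structure the
// scheduler uses to track a model's estimated peak usage.
type RequiredMemory struct {
	RAM  uint64 // bytes of system memory
	VRAM uint64 // bytes of GPU memory
}

// Backend sketches the addition described above: each backend reports the
// memory a given model would need, based on its own configuration (e.g. by
// consulting gguf-parser-go). The actual signature may differ.
type Backend interface {
	GetRequiredMemoryForModel(ctx context.Context, model string) (RequiredMemory, error)
}
```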
Ignasi 9806a43e79
Adds support for Multimodal projector file (#104)
* Include --proj if the model includes a multimodal projector file

* Fix test
2025-07-10 14:41:28 +02:00
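A small sketch of the behaviour described above, with illustrative names and paths: the --proj flag is appended to the llama-server arguments only when the model ships a multimodal projector file.

```go
package main

import "fmt"

// llamaServerArgs adds --proj only when a projector file is present.
func llamaServerArgs(modelPath, projectorPath string) []string {
	args := []string{"--model", modelPath}
	if projectorPath != "" {
		args = append(args, "--proj", projectorPath)
	}
	return args
}

func main() {
	fmt.Println(llamaServerArgs("/models/example.gguf", "/models/example.mmproj"))
}
```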
Emily Casey cbdbb83bf7 Incorporate review feedback
Signed-off-by: Emily Casey <emily.casey@docker.com>
2025-06-27 10:55:23 -06:00
Emily Casey aac988e62a
Apply suggestions from code review
Signed-off-by: Emily Casey <emily.casey@docker.com>

Co-authored-by: Jacob Howard <jacob.howard@docker.com>
2025-06-27 10:47:18 -06:00
Emily Casey 247d9e06a3 Respect context size from model config
Signed-off-by: Emily Casey <emily.casey@docker.com>
2025-06-27 09:35:14 -06:00
Jacob Howard cce6a71cc9
Allow configuration through argument list (in addition to string)
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-06-26 15:19:04 -06:00
Piotr Stankiewicz 5d31399dca Log Hub response in llama-server auto update failure path
If Hub responds with an unexpected tag (or other unexpected content)
when we query for the desired llama-server tag during auto-update, the
error we return is fine, but extra information would help debugging. So
add a log entry when the Hub response does not contain the desired tag.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-06-25 14:27:23 +02:00
Ignasi 9933b7dabf
Adds metrics endpoint (#78)
* Adds metrics endpoint

* Remove NewSchedulerMetricsHandler, not used

* replace custom parser with official Prometheus libraries

- Remove custom prometheus_metrics.go
- Use expfmt.TextParser for parsing and expfmt.NewEncoder for output

* acquire/release the loader's lock

* I missed committing this

* remove unneeded dep

* clean up
2025-06-16 10:18:20 +02:00
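A minimal sketch of the parsing side named above: expfmt.TextParser turns text-format metrics into MetricFamily values, which the real handler then re-encodes with an expfmt encoder. The payload and metric name are illustrative.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/prometheus/common/expfmt"
)

func main() {
	payload := "# TYPE llamacpp_requests_total counter\nllamacpp_requests_total 42\n"

	// Parse the text exposition format into MetricFamily values.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(strings.NewReader(payload))
	if err != nil {
		panic(err)
	}
	for name, mf := range families {
		fmt.Println(name, mf.GetType())
	}
}
```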
Piotr Stankiewicz 099b12231d Support runner configuration (temporary solution)
We need to allow users to configure the model runtime, whether to
control inference settings or low-level llama.cpp-specific settings.

In the interest of unblocking users quickly, this patch adds a very simple
mechanism for configuring runtime settings. A `_configure` endpoint is
added per engine and accepts POST requests to set the context size and raw
runtime CLI flags. Those settings are applied to any run of a given
model until unload is called for that model or model-runner is
terminated.

This is a temporary solution and therefore subject to change once a
design for specifying runtime settings is finalised.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-06-13 10:36:29 +02:00
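Illustrative only: the URL, engine path, and JSON field names below are assumptions, not the documented model-runner API. The sketch shows the kind of POST the per-engine `_configure` endpoint described above would accept.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical payload: a context size and raw runtime CLI flags,
	// applied to future runs of the model until it is unloaded.
	body := []byte(`{"model":"ai/example","context-size":8192,"runtime-flags":["--threads","8"]}`)
	resp, err := http.Post(
		"http://localhost:12434/engines/llama.cpp/_configure", // hypothetical path
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```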
Piotr Stankiewicz 5047eede95 Allow specifying a target llama-server version
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-06-11 16:57:09 +02:00
Piotr Stankiewicz 36e1441454 Include llama-server output in server error
The llama-server exit status won't be sufficiently informative in most
cases. So gather llama-server's stderr in a tail buffer and include it
in the error returned from llamacpp.Run.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-06-03 08:48:59 +02:00
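A minimal sketch of the tail-buffer idea described above, under assumed names: keep only the last few kilobytes of the child's stderr and fold them into the returned error, so a noisy llama-server cannot grow memory without bound.

```go
package main

import (
	"fmt"
	"os/exec"
)

// tailBuffer keeps only the last `limit` bytes written to it.
type tailBuffer struct {
	limit int
	data  []byte
}

func (t *tailBuffer) Write(p []byte) (int, error) {
	t.data = append(t.data, p...)
	if len(t.data) > t.limit {
		t.data = t.data[len(t.data)-t.limit:]
	}
	return len(p), nil
}

func main() {
	stderr := &tailBuffer{limit: 4096}
	cmd := exec.Command("llama-server", "--bad-flag") // illustrative invocation
	cmd.Stderr = stderr
	if err := cmd.Run(); err != nil {
		// Include the captured stderr tail alongside the exit status,
		// as the commit describes llamacpp.Run doing.
		fmt.Printf("llama-server failed: %v\nstderr tail:\n%s\n", err, stderr.data)
	}
}
```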
Ignasi 23896c491b
Revert "Revert "Revert "Revert "configure backend args""" (#54)" (#55)
This reverts commit 1898347e1a.
2025-05-30 16:51:34 +02:00
Piotr Stankiewicz d73a30c01a Fix return on server update disabled
If we don't return an error from `ensureLatestLlamaCpp` when update is
disabled, we'll fall into the wrong code path, and may end up trying to
execute a non-existent `llama-server`. So, make sure we return an error
from `ensureLatestLlamaCpp` if auto-update is disabled.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-05-29 18:35:29 +02:00
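A sketch of the control flow described above, with illustrative names and fields: returning a sentinel error when auto-update is disabled keeps the caller from assuming a freshly downloaded llama-server exists.

```go
package main

import (
	"errors"
	"fmt"
)

// errUpdateDisabled mirrors the behaviour described above: when auto-update
// is off, ensureLatestLlamaCpp must fail so the caller does not try to run
// a llama-server that was never downloaded.
var errUpdateDisabled = errors.New("llama-server auto-update is disabled")

type llamaCppBackend struct {
	updateDisabled bool
}

func (l *llamaCppBackend) ensureLatestLlamaCpp() error {
	if l.updateDisabled {
		return errUpdateDisabled // caller should fall back to the bundled binary
	}
	// ... query Docker Hub for the desired tag and download if newer ...
	return nil
}

func main() {
	b := &llamaCppBackend{updateDisabled: true}
	fmt.Println(b.ensureLatestLlamaCpp())
}
```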
Piotr Stankiewicz 6dd6230e85 Allow disabling inference server auto update
Being able to disable the inference server auto update process is useful
for testing. Eventually we should also provide it as an option to DD
users. So, add an option to disable the inference server auto update.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-05-29 15:26:09 +02:00
Piotr Stankiewicz 0343b8cbea Use int64 for size instead of float64
Representing byte sizes as float64s can be confusing and potentially
inefficient. So, use an integer type for representing byte sizes.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-05-29 09:09:14 +02:00
Ignasi 1898347e1a
Revert "Revert "Revert "configure backend args""" (#54) 2025-05-23 18:40:34 +02:00
Ignasi 54f2b98f2e
Revert "Revert "configure backend args (#41)" (#50)"
This reverts commit 25eefbf1e6.
2025-05-23 17:01:56 +02:00
Ignasi 25eefbf1e6
Revert "configure backend args (#41)" (#50)
This reverts commit e59c062759.
2025-05-23 17:01:38 +02:00
Ignasi e59c062759
configure backend args (#41)
* configure backend args

* Fail if disallowed args are overridden
2025-05-22 15:22:41 +02:00
Dorin Geman b881521c88
Add /engines/df
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-05-21 09:59:37 +03:00
Piotr Stankiewicz 84ed5fdb94 llamacpp: Use --host instead of DD_INF_UDS
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-05-12 07:45:26 +02:00
Dorin Geman 359af9c951 llama.cpp: linux: Set running status on errLlamaCppUpToDate
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-05-08 19:46:36 +03:00
Jacob Howard 3c6008e869
llamacpp: fix socket removal error detection and formatting
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-05-01 16:35:02 -06:00
Jacob Howard 0f1ffa2c19
llamacpp: disable CUDA check for Windows/ARM64
We don't have the com.docker.nv-gpu-info.exe tool there (yet).

Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-05-01 14:20:18 -06:00
Jacob Howard 1fef60e334
native: don't apply version restrictions on Adreno GPUs
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-04-30 13:48:34 -06:00
Jacob Howard 3b5bf559db
native: add Adreno device constraints for OpenCL backend
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-04-30 13:48:34 -06:00
Jacob Howard 3d8c73c355
[AIE-151] native: support dynamic detection of OpenCL
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-04-30 13:48:31 -06:00
Ignasi dbbb7afe9f
Dockerize (#22)
* Adds Makefile for local development

* Fix chat completions example request

* Added delete example

* Dockerize model-runner

* WIP Run container with host access to socket

* Dockerize model-runner

* WIP Run container with host access to socket

* Debugging

* Run in Docker container with TCP port access

* mounted model storage

* - Remove duplication in .gitignore
- Do not use Alpine in the builder image
- NVIDIA seems to use Ubuntu in all of their CDI docs and produces Ubuntu tags for nvidia/cuda but not Debian, so use Ubuntu for our final image
For more details: https://github.com/docker/model-runner/pull/22

* - Add MODELS_PATH environment variable to configure model storage location
- Default to $HOME/.docker/models when MODELS_PATH is not set
- Update Docker container to use /models as the default storage path
- Update Makefile to pass MODELS_PATH to container
- Update Dockerfile to create and set permissions for /models directory

This change allows users to:
- Override the model storage location via MODELS_PATH
- Maintain backward compatibility with default $HOME/.docker/models path
- Use the more idiomatic /models folder

* Removes unneeded logs
2025-04-29 18:03:12 +02:00
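A minimal sketch of the storage-path resolution described above: use MODELS_PATH when set, otherwise fall back to $HOME/.docker/models. The function name is illustrative; the real resolution logic may differ.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// modelStoragePath returns MODELS_PATH if set, else $HOME/.docker/models.
func modelStoragePath() (string, error) {
	if p := os.Getenv("MODELS_PATH"); p != "" {
		return p, nil
	}
	home, err := os.UserHomeDir()
	if err != nil {
		return "", err
	}
	return filepath.Join(home, ".docker", "models"), nil
}

func main() {
	p, err := modelStoragePath()
	fmt.Println(p, err)
}
```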
Piotr Stankiewicz 4239791795 Basic param tuning on windows/arm64
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-04-25 16:26:35 +02:00
Piotr Stankiewicz 978875e99c Enable basic windows/arm64 support
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-04-25 16:26:35 +02:00
Piotr Stankiewicz 87fd6f6466 Fix llama-server auto update
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-04-24 17:31:42 +02:00
Jacob Howard ca5fbbd8e8
chore: run go fmt
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-04-17 14:04:02 -06:00
Jacob Howard ed476dcbb8
chore: code review suggestions and go mod tidy
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-04-17 13:59:32 -06:00
Dorin Geman 40f7438308 Lock ShouldUseGPUVariant
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-04-17 19:11:23 +02:00
Dorin Geman e5d5ccf2dd Add Status to Backend interface
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-04-17 19:11:20 +02:00
Dorin Geman 5e4719501a Reset on Install
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-04-17 19:10:16 +02:00
Dorin Geman 75f963a112 Show the GPU-backed setting only if it is available
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-04-17 19:10:11 +02:00
Dorin Geman eb0dba0cc8 No need to use the updated llama.cpp if the bundled one is up to date
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-04-17 19:10:05 +02:00
Dorin Geman a3fb86a0bb Force a re-installation if EnableInferenceGPUVariant has changed
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-04-17 19:10:00 +02:00
Dorin Geman 5d56ba5ad3 Kill process on Windows
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-04-17 19:09:52 +02:00