Commit Graph

70 Commits

Emily Casey bb7abccf47
Update pkg/inference/backends/llamacpp/llamacpp.go
Co-authored-by: Jacob Howard <jacob.howard@docker.com>
2025-08-22 10:38:25 -06:00
Emily Casey 9f7f778e82 Fix remote memory estimation:
* pull in blob URL fix - https://github.com/docker/model-distribution/pull/123
* don't attempt to estimate sharded models

Signed-off-by: Emily Casey <emily.casey@docker.com>
2025-08-22 10:35:47 -06:00
Piotr Stankiewicz 03f7adc077 inference: Fix ignoring parse errors for unknown models
We ignore parse errors for models that gguf-parser-go can't parse yet.
This regressed in the pre-pull memory estimation PR.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-22 17:06:23 +02:00
Piotr Stankiewicz 880818f741 inference: Support memory estimation for remote models
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-22 10:15:03 +02:00
Piotr Stankiewicz 1c13e4fc61 inference: Ignore parse errors when estimating model memory
We will run into cases where our model runner is ahead of
gguf-parser-go, and we may want to load a model whose parse fails. So,
for now, ignore model parsing errors in such cases and assume the model
takes no resources. In the future we should come up with a cleaner way
of dealing with this (e.g. ship a model memory estimator along with
llama-server).

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-06 16:52:42 +02:00
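A minimal sketch of the fallback this commit describes, with hypothetical names (parseModel stands in for the gguf-parser-go call): a parse failure is logged and treated as zero required memory instead of blocking the load.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// parseModel stands in for the gguf-parser-go call used by the real
// scheduler; here it always fails, simulating a GGUF revision the parser
// does not understand yet.
func parseModel(path string) (uint64, error) {
	return 0, errors.New("unsupported GGUF version")
}

// estimateRequiredMemory applies the fallback described above: a parse
// failure is logged and treated as "no resources required" rather than
// blocking the model load.
func estimateRequiredMemory(path string) (uint64, error) {
	required, err := parseModel(path)
	if err != nil {
		log.Printf("ignoring model parse error for %q: %v", path, err)
		return 0, nil
	}
	return required, nil
}

func main() {
	required, err := estimateRequiredMemory("model.gguf")
	fmt.Println(required, err) // prints: 0 <nil>
}
```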
Piotr Stankiewicz 2810fc21bd inference: Always return 1 as VRAM size on win/arm64
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz 263e4c7732 inference: Fix nv-gpu-info path and wrap errors
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz ecc3f8dde4 inference: Fix failing llama_config unit tests
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz 3548e5f3e6 inference: Keep track of RAM allocated by runners
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz cc9656e64c inference, gpuinfo: Limit allowed models to 1 on windows/arm64 for now
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
Piotr Stankiewicz a4dc5834d1 Implement basic memory estimation in scheduler
First-pass implementation of memory estimation logic in the model scheduler.
This change relies heavily on gguf-parser-go to calculate the estimated peak
memory requirement for running inference with a given model. It adds
GetRequiredMemoryForModel() to the Backend interface so that each backend
can account for its own configuration when calculating required memory
usage.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
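A sketch of the interface extension this commit describes; the result type and exact signature are assumptions, not the actual model-runner API.

```go
package inference

import "context"

// RequiredMemory is an illustrative stand-in for whatever structure the
// scheduler uses to track a model's estimated peak usage.
type RequiredMemory struct {
	RAM  uint64 // bytes of system memory
	VRAM uint64 // bytes of GPU memory
}

// Backend sketches the addition described above: each backend reports the
// memory a given model would need, based on its own configuration (e.g. by
// consulting gguf-parser-go). The actual signature may differ.
type Backend interface {
	GetRequiredMemoryForModel(ctx context.Context, model string) (RequiredMemory, error)
}
```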
Ignasi 9806a43e79
Adds support for Multimodal projector file (#104)
* Include --proj if the model includes a multimodal projector file

* Fix test
2025-07-10 14:41:28 +02:00
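A small sketch of the behaviour described above, with illustrative names and paths: the --proj flag is appended to the llama-server arguments only when the model ships a multimodal projector file.

```go
package main

import "fmt"

// llamaServerArgs adds --proj only when a projector file is present.
func llamaServerArgs(modelPath, projectorPath string) []string {
	args := []string{"--model", modelPath}
	if projectorPath != "" {
		args = append(args, "--proj", projectorPath)
	}
	return args
}

func main() {
	fmt.Println(llamaServerArgs("/models/example.gguf", "/models/example.mmproj"))
}
```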
Emily Casey cbdbb83bf7 Incorporate review feedback
Signed-off-by: Emily Casey <emily.casey@docker.com>
2025-06-27 10:55:23 -06:00
Emily Casey aac988e62a
Apply suggestions from code review
Signed-off-by: Emily Casey <emily.casey@docker.com>

Co-authored-by: Jacob Howard <jacob.howard@docker.com>
2025-06-27 10:47:18 -06:00
Emily Casey 247d9e06a3 Respect context size from model config
Signed-off-by: Emily Casey <emily.casey@docker.com>
2025-06-27 09:35:14 -06:00
Jacob Howard cce6a71cc9
Allow configuration through argument list (in addition to string)
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-06-26 15:19:04 -06:00
Piotr Stankiewicz 5d31399dca Log Hub response in llama-server auto update failure path
If Hub responds with an unexpected tag (or other unexpected content)
when we query for the desired llama-server tag during auto-update, the
error we return is fine, but extra information would help debugging. So
add a log entry when the Hub response does not contain the desired tag.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-06-25 14:27:23 +02:00
Ignasi 9933b7dabf
Adds metrics endpoint (#78)
* Adds metrics endpoint

* Remove NewSchedulerMetricsHandler, not used

* replace custom parser with official Prometheus libraries

- Remove custom prometheus_metrics.go
- Use expfmt.TextParser for parsing and expfmt.NewEncoder for output

* acquire/release the loader's lock

* I missed committing this

* remove unneeded dep

* clean up
2025-06-16 10:18:20 +02:00
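A minimal sketch of the parsing side named above: expfmt.TextParser turns text-format metrics into MetricFamily values, which the real handler then re-encodes with an expfmt encoder. The payload and metric name are illustrative.

```go
package main

import (
	"fmt"
	"strings"

	"github.com/prometheus/common/expfmt"
)

func main() {
	payload := "# TYPE llamacpp_requests_total counter\nllamacpp_requests_total 42\n"

	// Parse the text exposition format into MetricFamily values.
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(strings.NewReader(payload))
	if err != nil {
		panic(err)
	}
	for name, mf := range families {
		fmt.Println(name, mf.GetType())
	}
}
```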
Piotr Stankiewicz 099b12231d Support runner configuration (temporary solution)
We need to allow users to configure the model runtime, whether to
control inference settings or low-level llama.cpp-specific settings.

In the interest of unblocking users quickly, this patch adds a very simple
mechanism for configuring runtime settings. A `_configure` endpoint is
added per engine and accepts POST requests to set the context size and raw
runtime CLI flags. Those settings are applied to any run of a given
model until unload is called for that model or model-runner is
terminated.

This is a temporary solution and therefore subject to change once a
design for specifying runtime settings is finalised.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-06-13 10:36:29 +02:00
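Illustrative only: the URL, engine path, and JSON field names below are assumptions, not the documented model-runner API. The sketch shows the kind of POST the per-engine `_configure` endpoint described above would accept.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// Hypothetical payload: a context size and raw runtime CLI flags,
	// applied to future runs of the model until it is unloaded.
	body := []byte(`{"model":"ai/example","context-size":8192,"runtime-flags":["--threads","8"]}`)
	resp, err := http.Post(
		"http://localhost:12434/engines/llama.cpp/_configure", // hypothetical path
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```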
Piotr Stankiewicz 5047eede95 Allow specifying a target llama-server version
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-06-11 16:57:09 +02:00
Piotr Stankiewicz 36e1441454 Include llama-server output in server error
The llama-server exit status won't be sufficiently informative in most
cases. So gather llama-server's stderr in a tail buffer and include it
in the error returned from llamacpp.Run.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-06-03 08:48:59 +02:00
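A minimal sketch of the tail-buffer idea described above, under assumed names: keep only the last few kilobytes of the child's stderr and fold them into the returned error, so a noisy llama-server cannot grow memory without bound.

```go
package main

import (
	"fmt"
	"os/exec"
)

// tailBuffer keeps only the last `limit` bytes written to it.
type tailBuffer struct {
	limit int
	data  []byte
}

func (t *tailBuffer) Write(p []byte) (int, error) {
	t.data = append(t.data, p...)
	if len(t.data) > t.limit {
		t.data = t.data[len(t.data)-t.limit:]
	}
	return len(p), nil
}

func main() {
	stderr := &tailBuffer{limit: 4096}
	cmd := exec.Command("llama-server", "--bad-flag") // illustrative invocation
	cmd.Stderr = stderr
	if err := cmd.Run(); err != nil {
		// Include the captured stderr tail alongside the exit status,
		// as the commit describes llamacpp.Run doing.
		fmt.Printf("llama-server failed: %v\nstderr tail:\n%s\n", err, stderr.data)
	}
}
```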
Ignasi 23896c491b
Revert "Revert "Revert "Revert "configure backend args""" (#54)" (#55)
This reverts commit 1898347e1a.
2025-05-30 16:51:34 +02:00
Piotr Stankiewicz d73a30c01a Fix return on server update disabled
If we don't return an error from `ensureLatestLlamaCpp` when update is
disabled, we'll fall into the wrong code path, and may end up trying to
execute a non-existent `llama-server`. So, make sure we return an error
from `ensureLatestLlamaCpp` if auto-update is disabled.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-05-29 18:35:29 +02:00
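A sketch of the control flow described above, with illustrative names and fields: returning a sentinel error when auto-update is disabled keeps the caller from assuming a freshly downloaded llama-server exists.

```go
package main

import (
	"errors"
	"fmt"
)

// errUpdateDisabled mirrors the behaviour described above: when auto-update
// is off, ensureLatestLlamaCpp must fail so the caller does not try to run
// a llama-server that was never downloaded.
var errUpdateDisabled = errors.New("llama-server auto-update is disabled")

type llamaCppBackend struct {
	updateDisabled bool
}

func (l *llamaCppBackend) ensureLatestLlamaCpp() error {
	if l.updateDisabled {
		return errUpdateDisabled // caller should fall back to the bundled binary
	}
	// ... query Docker Hub for the desired tag and download if newer ...
	return nil
}

func main() {
	b := &llamaCppBackend{updateDisabled: true}
	fmt.Println(b.ensureLatestLlamaCpp())
}
```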
Piotr Stankiewicz 6dd6230e85 Allow disabling inference server auto update
Being able to disable the inference server auto update process is useful
for testing. Eventually we should also provide it as an option to DD
users. So, add an option to disable the inference server auto update.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-05-29 15:26:09 +02:00
Piotr Stankiewicz 0343b8cbea Use int64 for size instead of float64
Representing byte sizes as float64s can be confusing and potentially
inefficient. So, use an integer type for representing byte sizes.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-05-29 09:09:14 +02:00
Ignasi 1898347e1a
Revert "Revert "Revert "configure backend args""" (#54) 2025-05-23 18:40:34 +02:00
Ignasi 54f2b98f2e
Revert "Revert "configure backend args (#41)" (#50)"
This reverts commit 25eefbf1e6.
2025-05-23 17:01:56 +02:00
Ignasi 25eefbf1e6
Revert "configure backend args (#41)" (#50)
This reverts commit e59c062759.
2025-05-23 17:01:38 +02:00
Ignasi e59c062759
configure backend args (#41)
* configure backend args

* Fail if disallowed args are overridden
2025-05-22 15:22:41 +02:00
Dorin Geman b881521c88
Add /engines/df
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-05-21 09:59:37 +03:00
Piotr Stankiewicz 84ed5fdb94 llamacpp: Use --host instead of DD_INF_UDS
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-05-12 07:45:26 +02:00
Dorin Geman 359af9c951 llama.cpp: linux: Set running status on errLlamaCppUpToDate
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-05-08 19:46:36 +03:00
Jacob Howard 3c6008e869
llamacpp: fix socket removal error detection and formatting
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-05-01 16:35:02 -06:00
Jacob Howard 0f1ffa2c19
llamacpp: disable CUDA check for Windows/ARM64
We don't have the com.docker.nv-gpu-info.exe tool there (yet).

Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-05-01 14:20:18 -06:00
Jacob Howard 1fef60e334
native: don't apply version restrictions on Adreno GPUs
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-04-30 13:48:34 -06:00
Jacob Howard 3b5bf559db
native: add Adreno device constraints for OpenCL backend
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-04-30 13:48:34 -06:00
Jacob Howard 3d8c73c355
[AIE-151] native: support dynamic detection of OpenCL
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-04-30 13:48:31 -06:00
Ignasi dbbb7afe9f
Dockerize (#22)
* Adds Makefile for local development

* Fix chat completions example request

* Added delete example

* Dockerize model-runner

* WIP Run container with host access to socket

* Dockerize model-runner

* WIP Run container with host access to socket

* Debugging

* Run in Docker container with TCP port access

* mounted model storage

* - Remove duplication in .gitignore
- Do not use Alpine in the builder image
- NVIDIA seems to use Ubuntu in all of their CDI docs and produces Ubuntu tags for nvidia/cuda but not Debian, so use Ubuntu for our final image
For more details: https://github.com/docker/model-runner/pull/22

* - Add MODELS_PATH environment variable to configure model storage location
- Default to $HOME/.docker/models when MODELS_PATH is not set
- Update Docker container to use /models as the default storage path
- Update Makefile to pass MODELS_PATH to container
- Update Dockerfile to create and set permissions for /models directory

This change allows users to:
- Override the model storage location via MODELS_PATH
- Maintain backward compatibility with default $HOME/.docker/models path
- Use the more idiomatic /models folder

* Removes unneeded logs
2025-04-29 18:03:12 +02:00
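A minimal sketch of the storage-path resolution described above: use MODELS_PATH when set, otherwise fall back to $HOME/.docker/models. The function name is illustrative; the real resolution logic may differ.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// modelStoragePath returns MODELS_PATH if set, else $HOME/.docker/models.
func modelStoragePath() (string, error) {
	if p := os.Getenv("MODELS_PATH"); p != "" {
		return p, nil
	}
	home, err := os.UserHomeDir()
	if err != nil {
		return "", err
	}
	return filepath.Join(home, ".docker", "models"), nil
}

func main() {
	p, err := modelStoragePath()
	fmt.Println(p, err)
}
```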
Piotr Stankiewicz 4239791795 Basic param tuning on windows/arm64
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-04-25 16:26:35 +02:00
Piotr Stankiewicz 978875e99c Enable basic windows/arm64 support
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-04-25 16:26:35 +02:00
Piotr Stankiewicz 87fd6f6466 Fix llama-server auto update
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-04-24 17:31:42 +02:00
Jacob Howard ca5fbbd8e8
chore: run go fmt
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-04-17 14:04:02 -06:00
Jacob Howard ed476dcbb8
chore: code review suggestions and go mod tidy
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-04-17 13:59:32 -06:00
Dorin Geman 40f7438308 Lock ShouldUseGPUVariant
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-04-17 19:11:23 +02:00
Dorin Geman e5d5ccf2dd Add Status to Backend interface
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-04-17 19:11:20 +02:00
Dorin Geman 5e4719501a Reset on Install
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-04-17 19:10:16 +02:00
Dorin Geman 75f963a112 Show the GPU-backed setting only if it is available
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-04-17 19:10:11 +02:00
Dorin Geman eb0dba0cc8 No need to use the updated llama.cpp if the bundled one is up to date
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-04-17 19:10:05 +02:00
Dorin Geman a3fb86a0bb Force a re-installation if EnableInferenceGPUVariant has changed
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-04-17 19:10:00 +02:00
Dorin Geman 5d56ba5ad3 Kill process on Windows
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-04-17 19:09:52 +02:00