For now, we ignore parse errors for models that gguf-parser-go cannot
parse yet. This regressed in the pre-pull memory estimation PR.
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
We will run into cases where our model runner is ahead of
gguf-parser-go. In such cases we may want to load a model that will
cause the model parse to fail. So, for now, ignore model parsing
errors in such cases and assume the model takes no resources. In the
future we
should come up with a cleaner way of dealing with this (e.g. ship a
model memory estimator along with the llama-server).
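A minimal sketch of that fallback, assuming a hypothetical estimate
callback and RequiredMemory type (names are illustrative, not the
repository's actual API):

```go
package scheduler

import "log"

// RequiredMemory is an illustrative stand-in for the scheduler's
// per-model memory estimate (all values in bytes).
type RequiredMemory struct {
	RAM  uint64
	VRAM uint64
}

// estimateOrZero returns the estimate for modelPath, falling back to a
// zero estimate when parsing fails (e.g. gguf-parser-go does not yet
// understand the model). The estimate callback is hypothetical.
func estimateOrZero(modelPath string, estimate func(string) (RequiredMemory, error)) RequiredMemory {
	req, err := estimate(modelPath)
	if err != nil {
		log.Printf("warning: could not estimate memory for %s, assuming zero: %v", modelPath, err)
		return RequiredMemory{}
	}
	return req
}
```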
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
First pass implementation of memory estimation logic in model scheduler.
This change relies heavily on gguf-parser-go to calculate the
estimated peak memory requirement for running inference with a given
model. It adds GetRequiredMemoryForModel() to the Backend interface so
that each backend can take its own configuration into account when
calculating required memory.
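A rough sketch of what the interface addition might look like; the
RequiredMemory type and the exact method signature are assumptions
based on the description above:

```go
package inference

import "context"

// RequiredMemory approximates the estimate the scheduler needs; the
// field set here is an assumption.
type RequiredMemory struct {
	RAM  uint64 // bytes of system memory
	VRAM uint64 // bytes of GPU memory
}

// Backend is a pared-down sketch of the backend interface showing only
// the new method; the real interface carries more methods.
type Backend interface {
	// GetRequiredMemoryForModel estimates the peak memory needed to run
	// inference with the given model under this backend's configuration.
	GetRequiredMemoryForModel(ctx context.Context, model string) (*RequiredMemory, error)
}
```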
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
If Hub responds with an unexpected tag (or other unexpected content)
when we query for the desired llama-server tag during the auto-update
process, we return without treating it as a failure, but extra
information would help with debugging. So add a log entry when the Hub
response does not contain the desired tag.
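A small sketch of the added logging, assuming the update code already
holds the desired tag and the list of tags returned by Hub (names are
illustrative):

```go
package llamacpp

import "log"

// logIfTagMissing emits a debugging log entry when the desired
// llama-server tag is absent from the tags Hub returned.
func logIfTagMissing(desiredTag string, tags []string) {
	for _, tag := range tags {
		if tag == desiredTag {
			return
		}
	}
	log.Printf("llama-server tag %q not found in Hub response (%d tags returned)", desiredTag, len(tags))
}
```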
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
* Adds metrics endpoint
* Remove NewSchedulerMetricsHandler, not used
* Replace custom parser with official Prometheus libraries
- Remove custom prometheus_metrics.go
- Use expfmt.TextParser for parsing and expfmt.NewEncoder for output
  (see the sketch after this list)
* Acquire/release the loader's lock
* I missed committing this
* Remove unneeded dependency
* Clean up
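A sketch of the parse/encode round trip with the official libraries
mentioned above (github.com/prometheus/common/expfmt); the surrounding
handler and error handling are simplified, and expfmt.FmtText may be
superseded by a format constructor in newer library versions:

```go
package metrics

import (
	"io"

	"github.com/prometheus/common/expfmt"
)

// reencode parses text-format metrics from r and writes them back out
// through the official encoder, mirroring the flow described above.
func reencode(r io.Reader, w io.Writer) error {
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(r)
	if err != nil {
		return err
	}
	enc := expfmt.NewEncoder(w, expfmt.FmtText)
	for _, mf := range families {
		if err := enc.Encode(mf); err != nil {
			return err
		}
	}
	return nil
}
```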
We need to allow users to configure the model runtime, whether to
control inference settings or low-level llama.cpp-specific settings.
In the interest of unblocking users quickly, this patch adds a very
simple mechanism to configure the runtime settings. A `_configure`
endpoint is added per-engine, and accepts POST requests to set
context-size and raw
runtime CLI flags. Those settings will be applied to any run of a given
model, until unload is called for that model or model-runner is
terminated.
This is a temporary solution and therefore subject to change once a
design for specifying runtime settings is finalised.
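A minimal sketch of such a handler; the payload fields, package name,
and in-memory storage are assumptions, not the final API:

```go
package routing

import (
	"encoding/json"
	"net/http"
	"sync"
)

// configureRequest sketches the payload the _configure endpoint might
// accept; the field names are illustrative.
type configureRequest struct {
	Model        string   `json:"model"`
	ContextSize  int64    `json:"context-size,omitempty"`
	RuntimeFlags []string `json:"runtime-flags,omitempty"`
}

// runtimeConfigs holds per-model settings until the model is unloaded
// or model-runner exits.
var (
	configMu       sync.Mutex
	runtimeConfigs = map[string]configureRequest{}
)

// handleConfigure accepts POST requests and records the settings to be
// applied on the next run of the named model.
func handleConfigure(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	var req configureRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	configMu.Lock()
	runtimeConfigs[req.Model] = req
	configMu.Unlock()
	w.WriteHeader(http.StatusAccepted)
}
```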
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
The llama-server exit status won't be sufficiently informative in most
cases. So gather llama-server's stderr in a tail buffer and include it
in the error returned from llamacpp.Run.
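A sketch of the idea with a simple bounded tail buffer; the actual
buffer type and the wiring inside llamacpp.Run may differ:

```go
package llamacpp

import (
	"fmt"
	"os/exec"
)

// tailBuffer keeps only the last max bytes written to it, so a long
// stderr stream stays bounded.
type tailBuffer struct {
	max int
	buf []byte
}

func (t *tailBuffer) Write(p []byte) (int, error) {
	t.buf = append(t.buf, p...)
	if len(t.buf) > t.max {
		t.buf = t.buf[len(t.buf)-t.max:]
	}
	return len(p), nil
}

// runWithStderrTail runs the given binary and, on failure, wraps the
// exit error with the tail of its stderr output for easier debugging.
func runWithStderrTail(binary string, args ...string) error {
	tail := &tailBuffer{max: 4096}
	cmd := exec.Command(binary, args...)
	cmd.Stderr = tail
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("%s exited: %w; stderr tail:\n%s", binary, err, tail.buf)
	}
	return nil
}
```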
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
If we don't return an error from `ensureLatestLlamaCpp` when update is
disabled, we'll fall into the wrong code path, and may end up trying to
execute a non-existent `llama-server`. So, make sure we return an error
from `ensureLatestLlamaCpp` if auto-update is disabled.
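A sketch of the guard, with an assumed updateDisabled field and error
message:

```go
package llamacpp

import (
	"context"
	"errors"
)

// llamaCpp is a pared-down stand-in for the backend type; only the
// field relevant to this guard is shown.
type llamaCpp struct {
	updateDisabled bool
}

// ensureLatestLlamaCpp sketches the early-error guard described above,
// so callers never assume a llama-server binary exists when
// auto-update is off. The real method also performs the update itself.
func (l *llamaCpp) ensureLatestLlamaCpp(ctx context.Context) error {
	if l.updateDisabled {
		return errors.New("llama.cpp auto-update is disabled")
	}
	// ... normal update path would continue here ...
	return nil
}
```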
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Being able to disable the inference server auto update process is useful
for testing. Eventually we should also provide it as an option to DD
users. So, add an option to disable the inference server auto update.
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Representing byte sizes as float64s can be confusing and potentially
inefficient. So, use an integer type to represent byte sizes.
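One way this can look, sketched with an assumed ByteSize type rather
than the exact type used in the repository:

```go
package inference

import "fmt"

// ByteSize stores a byte count exactly, avoiding the rounding and
// readability issues of float64.
type ByteSize uint64

const (
	KiB ByteSize = 1 << (10 * (iota + 1))
	MiB
	GiB
)

// String renders the size in a human-readable unit.
func (b ByteSize) String() string {
	switch {
	case b >= GiB:
		return fmt.Sprintf("%.2f GiB", float64(b)/float64(GiB))
	case b >= MiB:
		return fmt.Sprintf("%.2f MiB", float64(b)/float64(MiB))
	case b >= KiB:
		return fmt.Sprintf("%.2f KiB", float64(b)/float64(KiB))
	default:
		return fmt.Sprintf("%d B", uint64(b))
	}
}
```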
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
* Adds Makefile for local development
* Fix chat completions example request
* Added delete example
* Dockerize model-runner
* WIP Run container with host access to socket
* Debugging
* Run in Docker container with TCP port access
* mounted model storage
* - Remove duplication in .gitignore
- Do not use Alpine in the builder image
- NVIDIA seems to use Ubuntu in all of their CDI docs and publishes
  Ubuntu tags for nvidia/cuda but not Debian ones, so use Ubuntu for
  our final image
For more details: https://github.com/docker/model-runner/pull/22
* - Add MODELS_PATH environment variable to configure model storage location
- Default to $HOME/.docker/models when MODELS_PATH is not set
- Update Docker container to use /models as the default storage path
- Update Makefile to pass MODELS_PATH to container
- Update Dockerfile to create and set permissions for /models directory
This change allows users to:
- Override the model storage location via MODELS_PATH
- Maintain backward compatibility with default $HOME/.docker/models path
- Use the more idiomatic /models path inside the container (see the
  path-resolution sketch below)
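A sketch of the resolution order described above (MODELS_PATH first,
then the default under the home directory); the function name is
illustrative:

```go
package paths

import (
	"os"
	"path/filepath"
)

// modelStoragePath resolves the model storage directory: MODELS_PATH
// wins when set, otherwise fall back to $HOME/.docker/models.
func modelStoragePath() (string, error) {
	if p := os.Getenv("MODELS_PATH"); p != "" {
		return p, nil
	}
	home, err := os.UserHomeDir()
	if err != nil {
		return "", err
	}
	return filepath.Join(home, ".docker", "models"), nil
}
```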
* Removes unneeded logs