First-pass implementation of memory estimation logic in the model scheduler.
This change relies heavily on gguf-parser-go to calculate the estimated
peak memory required to run inference with a given model. It adds
GetRequiredMemoryForModel() to the Backend interface so that each
backend can take its own configuration into account when calculating
required memory.
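As a rough sketch of what the interface extension could look like: the
method name comes from this change, while the parameter and return
types below are assumptions for illustration only. Each implementation
would combine its own configuration with model metadata parsed by
gguf-parser-go to produce the estimate.

```go
package scheduling

import "context"

// RequiredMemory is a hypothetical container for the estimate; the real
// change may break the figure down differently (e.g. host RAM vs. VRAM
// per GPU).
type RequiredMemory struct {
	RAM  uint64 // estimated host memory, in bytes
	VRAM uint64 // estimated GPU memory, in bytes
}

// Backend is shown here with only the new method; existing methods are
// omitted.
type Backend interface {
	GetRequiredMemoryForModel(ctx context.Context, model string) (*RequiredMemory, error)
}
```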
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
We need to allow users to configure the model runtime, whether to
control inference settings or low-level llama.cpp-specific settings.
In the interest of unblocking users quickly, this patch adds a very
simple mechanism for configuring runtime settings: a `_configure`
endpoint is added per engine, which accepts POST requests to set the
context size and raw runtime CLI flags. Those settings are applied to
every run of a given model until unload is called for that model or
model-runner is terminated.
This is a temporary solution and therefore subject to change once a
design for specifying runtime settings is finalised.
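As a rough illustration only, a client call to the new endpoint might
look like the sketch below. The host, port, URL path, JSON field names,
model name, and flag values are all placeholder assumptions; only the
`_configure` endpoint name, the POST method, and the two settings
(context size and raw runtime CLI flags) come from this change.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// configureRequest mirrors the settings described above; the exact
// field names used by model-runner may differ.
type configureRequest struct {
	Model        string   `json:"model"`
	ContextSize  int      `json:"context-size"`
	RuntimeFlags []string `json:"runtime-flags"`
}

func main() {
	body, err := json.Marshal(configureRequest{
		Model:        "ai/example-model",
		ContextSize:  8192,
		RuntimeFlags: []string{"--threads", "8"},
	})
	if err != nil {
		panic(err)
	}
	// The host, port, and "llama.cpp" engine path segment are placeholders.
	resp, err := http.Post(
		"http://localhost:12434/engines/llama.cpp/v1/_configure",
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```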
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Representing byte sizes as float64 values can be confusing and
potentially inefficient, so use an integer type to represent byte sizes
instead.
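For illustration, an integer representation of byte sizes might look
like the following; the type and constant names here are illustrative,
not necessarily the ones used in the change.

```go
package main

import "fmt"

// ByteSize counts bytes exactly. A float64 has only 53 bits of
// mantissa, so very large sizes would lose precision, and fractional
// byte counts are never meaningful.
type ByteSize uint64

const (
	KiB ByteSize = 1 << (10 * (iota + 1))
	MiB
	GiB
)

func main() {
	required := 7*GiB + 512*MiB
	fmt.Printf("estimated requirement: %d bytes\n", required)
}
```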
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
This new design will (eventually) allow for concurrent runner operation
on systems that support it.
Signed-off-by: Jacob Howard <jacob.howard@docker.com>