12 Commits

Author SHA1 Message Date
Piotr Stankiewicz 880818f741 inference: Support memory estimation for remote models
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-08-22 10:15:03 +02:00
Piotr Stankiewicz a4dc5834d1 Implement basic memory estimation in scheduler
First-pass implementation of memory estimation logic in the model scheduler.
This change relies heavily on gguf-parser-go to calculate the estimated peak
memory requirement for running inference with a given model. It adds
GetRequiredMemoryForModel() to the Backend interface to allow each
backend to deal with its config and calculate required memory usage
based on it.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-07-23 13:50:20 +02:00
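
A minimal Go sketch of how a scheduler might use the interface method named in the commit above. Only the method name `GetRequiredMemoryForModel` comes from the commit message; its signature, the `RequiredMemory` type, `stubBackend`, and `canLoad` are hypothetical stand-ins, and the stub returns canned numbers rather than calling gguf-parser-go.

```go
package main

import (
	"errors"
	"fmt"
)

// RequiredMemory is a hypothetical result type holding byte counts.
type RequiredMemory struct {
	RAM  int64
	VRAM int64
}

// Backend is a pared-down stand-in for the real interface.
// The method signature is an assumption made for this sketch.
type Backend interface {
	GetRequiredMemoryForModel(model string) (RequiredMemory, error)
}

// stubBackend returns canned estimates instead of parsing a GGUF file.
type stubBackend struct {
	estimates map[string]RequiredMemory
}

func (b stubBackend) GetRequiredMemoryForModel(model string) (RequiredMemory, error) {
	est, ok := b.estimates[model]
	if !ok {
		return RequiredMemory{}, errors.New("unknown model: " + model)
	}
	return est, nil
}

// canLoad reports whether the backend's estimate fits into the
// memory that is currently available.
func canLoad(b Backend, model string, avail RequiredMemory) (bool, error) {
	need, err := b.GetRequiredMemoryForModel(model)
	if err != nil {
		return false, err
	}
	return need.RAM <= avail.RAM && need.VRAM <= avail.VRAM, nil
}

func main() {
	b := stubBackend{estimates: map[string]RequiredMemory{
		"example-model": {RAM: 2 << 30, VRAM: 6 << 30},
	}}
	avail := RequiredMemory{RAM: 16 << 30, VRAM: 8 << 30}
	ok, err := canLoad(b, "example-model", avail)
	fmt.Println(ok, err)
}
```

Keeping the estimate behind the interface lets each backend account for its own configuration, which is the motivation the commit message gives.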
Piotr Stankiewicz 099b12231d Support runner configuration (temporary solution)
We need to allow users to configure the model runtime, whether to
control inference settings or low-level llama.cpp-specific settings.

In the interest of unblocking users quickly, this patch adds a very simple
mechanism to configure the runtime settings. A `_configure` endpoint is
added per engine and accepts POST requests to set context-size and raw
runtime CLI flags. Those settings will be applied to any run of a given
model, until unload is called for that model or model-runner is
terminated.

This is a temporary solution and therefore subject to change once a
design for specifying runtime settings is finalised.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-06-13 10:36:29 +02:00
Piotr Stankiewicz 0343b8cbea Use int64 for size instead of float64
Representing byte sizes as float64 values can be confusing and potentially
inefficient, so use an integer type for byte sizes instead.

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
2025-05-29 09:09:14 +02:00
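
The commit above cites clarity and efficiency as its reasons. As one further, concrete illustration (not a reason stated in the message), the Go sketch below shows that a float64 cannot hold every integer byte count exactly, while int64 can. The helper `roundTrips` is hypothetical, and the sizes involved (around 8 PiB) are far beyond typical model files, so this is a boundary case rather than a practical failure.

```go
package main

import "fmt"

// roundTrips reports whether a byte count survives conversion to
// float64 and back without changing.
func roundTrips(size int64) bool {
	return int64(float64(size)) == size
}

func main() {
	// float64 has a 53-bit significand, so 2^53 is the last point up to
	// which every integer is exactly representable.
	const limit = int64(1) << 53

	fmt.Println(roundTrips(limit))     // true
	fmt.Println(roundTrips(limit + 1)) // false
}
```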
Dorin Geman b881521c88 Add /engines/df
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-05-21 09:59:37 +03:00
Dorin Geman e5d5ccf2dd Add Status to Backend interface
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
2025-04-17 19:11:20 +02:00
Jacob Howard 36ae1e3b30 inference: adjust for lack of logger and paths packages
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-03-28 18:01:42 -06:00
Jacob Howard f7cec84173 deps: remove errordef package references
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-03-28 17:53:13 -06:00
Jacob Howard ac5324bd3a [AIE-52] inference: add separate completion/embedding backend modes
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-03-28 17:53:08 -06:00
Jacob Howard d6b1191a01 inference: refactor scheduler to a more modular design
This new design will allow for concurrent runner operation (eventually)
on systems that support it.

Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-03-28 17:53:06 -06:00
Jacob Howard f8cdbc4d81 inference: refactor service and implement scheduling mechanism
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-03-28 17:53:05 -06:00
Jacob Howard 21e10c378a inference: move to modular backend structure and implement stubs
Signed-off-by: Jacob Howard <jacob.howard@docker.com>
2025-03-28 17:53:00 -06:00