For now, we ignore parse errors for models that gguf-parser-go can't
parse yet. This behavior regressed in the pre-pull memory estimation PR.
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
We will run into cases where our model runner is ahead of
gguf-parser-go and we want to load a model whose metadata it can't yet
parse. For now, ignore model parsing errors in such cases and assume
the model requires no resources. In the future we should come up with a
cleaner way of dealing with this (e.g. ship a model memory estimator
along with llama-server).
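
For illustration, a minimal sketch of the fallback; the function names
and signature are hypothetical, not the actual scheduler code:

    package scheduler

    import "log"

    // requiredMemory returns the estimated peak memory (in bytes) for a
    // model, falling back to zero when the model can't be parsed yet.
    // The estimate function is assumed to wrap gguf-parser-go.
    func requiredMemory(modelPath string, estimate func(string) (uint64, error)) (uint64, error) {
        bytes, err := estimate(modelPath)
        if err != nil {
            // gguf-parser-go may lag behind the model runner; rather than
            // refusing to load the model, assume it needs no resources.
            log.Printf("ignoring model parse error for %s: %v", modelPath, err)
            return 0, nil
        }
        return bytes, nil
    }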
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
- a Helm chart
- static Kubernetes configs for a few common setups
I put these under ./charts so we can expose
this as a Helm chart repo later if we want,
but for now we'll just tell people to install it
from source.
Signed-off-by: Nick Santos <nick.santos@docker.com>
First-pass implementation of memory estimation logic in the model
scheduler. This change relies heavily on gguf-parser-go to estimate the
peak memory required to run inference with a given model. It adds
GetRequiredMemoryForModel() to the Backend interface so that each
backend can take its own configuration into account when calculating
required memory.
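
Roughly, the interface change looks like the sketch below; the
parameter list and result type are illustrative assumptions, not the
exact signature:

    package inference

    import "context"

    // RequiredMemory is an illustrative result type; the real scheduler
    // may use something different.
    type RequiredMemory struct {
        RAM  uint64 // estimated peak host memory, in bytes
        VRAM uint64 // estimated peak GPU memory, in bytes
    }

    // Backend shows only the new method; existing methods are omitted.
    type Backend interface {
        // GetRequiredMemoryForModel estimates the peak memory needed to
        // run inference with the given model, taking the backend's own
        // configuration into account.
        GetRequiredMemoryForModel(ctx context.Context, model string) (RequiredMemory, error)
    }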
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
Global locks in ServeHTTP methods were preventing concurrent request processing. Individual handlers should manage their own synchronization.
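
A minimal sketch of the pattern, with hypothetical names (not the
actual handler code): hold a lock only while reading shared state, not
around the whole ServeHTTP body, and leave request-level
synchronization to the individual handlers.

    package routing

    import (
        "net/http"
        "sync"
    )

    // router is a hypothetical stand-in for the real handler type.
    type router struct {
        mu       sync.Mutex // previously held for the whole ServeHTTP call
        handlers map[string]http.Handler
    }

    func (rt *router) ServeHTTP(w http.ResponseWriter, r *http.Request) {
        // Lock only while touching shared state, not for the duration of
        // the request, so requests can be processed concurrently.
        rt.mu.Lock()
        h, ok := rt.handlers[r.URL.Path]
        rt.mu.Unlock()
        if !ok {
            http.NotFound(w, r)
            return
        }
        h.ServeHTTP(w, r) // each handler manages its own synchronization
    }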
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Prevents "superfluous response.WriteHeader" errors by checking if the request context was canceled or timed out before attempting to call http.Error().
Signed-off-by: Dorin Geman <dorin.geman@docker.com>
Using Shutdown with an already cancelled context will cause the method
to return almost immediately, with only idle connections being closed.
More problematically, it can't close active connections, which can
remain in flight indefinitely. At the moment, these active connections
(and their associated contexts) can cause loader.load() to block
loader.run() from exiting, especially if a backend is misbehaving, which
can cause shutdown to halt waiting on the request. Even if a backend
isn't misbehaving, an inference request can take many seconds. The
best solution would be to make loader.load() unblock if the context
passed to loader.run() is cancelled, but this is fairly complicated to
implement. The easier solution for now is just to use a hard server
Close() to cancel inflight requests (and their contexts) and then wait
for scheduler shutdown. This is what we do in Docker Desktop.
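
A rough sketch of the sequence, with placeholder names (server,
schedulerDone); the actual shutdown code may differ:

    package runner

    import "net/http"

    // stop hard-closes the HTTP server, then waits for the scheduler.
    func stop(server *http.Server, schedulerDone <-chan struct{}) error {
        // Close (unlike Shutdown with an already-cancelled context) also
        // closes active connections, cancelling in-flight request
        // contexts so loader.load() can't keep loader.run() from exiting.
        err := server.Close()

        // Then wait for the scheduler to finish shutting down.
        <-schedulerDone
        return err
    }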
Signed-off-by: Jacob Howard <jacob.howard@docker.com>