Blog post on Knative Functions with Llama stack for AI inference (#6287)
* 🦙 post on functions and Llama stack
Signed-off-by: Matthias Wessendorf <mwessend@redhat.com>
* saying init is a function
---------
Signed-off-by: Matthias Wessendorf <mwessend@redhat.com>
Co-authored-by: Calum Murray <cmurray@redhat.com>
parent 27a706ecdf, commit 50dff92978

@@ -51,6 +51,7 @@ nav:
       - releases/announcing-knative-v0-3-release.md
       - releases/announcing-knative-v0-2-release.md
   - Articles:
+      - articles/ai_functions_llama_stack.md
       - articles/configurable-runtimeclassnames.md
       - articles/CONXAIs_AI_inference_on_Amazon_EKS.md
       - articles/How_Conxai_uses_Knative_Eventing_to_provide_AI_APIs_in_the_construction_industry.md

@@ -0,0 +1,265 @@

# AI-powered Knative Functions for Inference with Llama Stack

**Author: Matthias Weßendorf, Senior Principal Software Engineer @ Red Hat**

_This post describes how to create Knative Functions for inference with [Llama Stack](https://llama-stack.readthedocs.io/en/latest/index.html){:target="_blank"}._

## Llama Stack

Llama Stack is an open-source framework for building generative AI applications. It aims to define and standardize the core building blocks for generative AI apps. To that end, it provides:

* A unified API layer for Inference, RAG, Agents, Tools, Safety, Evals, and Telemetry.
* A plugin architecture to support the rich ecosystem of API implementations in different environments, such as local development, on-premises, cloud, and mobile.
* Prepackaged, verified distributions that offer a one-stop solution for developers to get started quickly and reliably in any environment.
* Multiple developer interfaces, such as a CLI and SDKs for Python, Node, iOS, and Android.
* Standalone applications as examples of how to build production-grade AI applications with Llama Stack.

<img alt="Llama Stack" src="https://llama-stack.readthedocs.io/en/latest/_images/llama-stack.png" style="width: 400px;" />

### Llama Stack setup for Kubernetes

An easy way to get started with Llama Stack is to use [Ollama](https://github.com/ollama/ollama){:target="_blank"} as the inference provider for a Llama model. The [Llama Stack Quickstart](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html){:target="_blank"} shows how to do this in a local environment. But we are using Kubernetes! For Kubernetes, you need:

* [Kind](https://kind.sigs.k8s.io/){:target="_blank"}
* [Ollama](https://github.com/ollama/ollama){:target="_blank"}
* [Llama Stack Operator for Kubernetes](https://github.com/llamastack/llama-stack-k8s-operator){:target="_blank"}

For convenience, I have created a GitHub repository that contains scripts for an easy setup. See the [Llama Stack Stack repo](https://github.com/matzew/llama-stack-stack){:target="_blank"} for more details.

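Once the setup scripts have run, a quick sanity check is to confirm that the Llama Stack server's `Service` exists. The `Service` name below is the one used throughout this post; if your distribution is named differently, adjust it accordingly:

```shell
# Verify that the Llama Stack pod and its Service are up
kubectl get pods
kubectl get service llamastackdistribution-sample-service
```
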
#### Port-forwarding for local access to the Llama Stack

For local development, it is recommended to enable port-forwarding for the Llama Stack server:

```shell
kubectl port-forward service/llamastackdistribution-sample-service 8321:8321
```

Now your scripts can access it via `localhost:8321`:

```shell
http localhost:8321/v1/version

HTTP/1.1 200 OK
content-length: 19
content-type: application/json
date: Fri, 04 Jul 2025 08:33:10 GMT
server: uvicorn
x-trace-id: edc0436a6bc932434f34d3d3ce1be182

{
  "version": "0.2.9"
}
```

_**Note:** The APIs of Llama Stack are evolving quickly, but the server exposes a docs endpoint (in our case, `http://localhost:8321/docs`). You can also use that endpoint to invoke and experiment with the APIs._

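Besides `/v1/version`, you can explore other endpoints the same way. For example, the `/v1/models` endpoint (the same one the Python client calls later in this post) lists the registered models; the exact response shape may vary between Llama Stack versions:

```shell
http localhost:8321/v1/models
```
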
## Python runtime for Knative Functions

Once all of the above is running, you need to create your [Knative Functions](https://knative.dev/docs/functions/){:target="_blank"} project. We are using the CloudEvents template of the new [functions runtime for Python](https://github.com/knative-extensions/func-python){:target="_blank"}:

```shell
func create -l python -t cloudevents inference-func
```

This gives you a project that already works and can be tested locally with `func run`.

### My first AI function

Our generated function code was updated to use the `llama_stack_client` Python library. Below is the relevant snippet from the `pyproject.toml` file:

```toml
dependencies = [
  "httpx",
  "cloudevents",
  "pytest",
  "pytest-asyncio",
  "llama-stack-client"
]
```

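If you want to experiment with the `llama_stack_client` library outside of the function first, a local virtual environment is enough. This step is optional, and the environment name below is just an example for playing with the port-forwarded server:

```shell
# Optional: try the client library locally against the port-forwarded server
python -m venv .venv
source .venv/bin/activate
pip install llama-stack-client
```
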
### Accessing the Llama Stack Server

Inside the `__init__` function we get access to the URL of the _Llama Stack_ server and fetch the available models:

```python
...
self.client = LlamaStackClient(base_url=os.environ.get("LLSD_HOST", "http://localhost:8321"))
models = self.client.models.list()

# Select the first LLM
llm = next(m for m in models if m.model_type == "llm")
self.model_id = llm.identifier
...
```

The `LLSD_HOST` environment variable points to the address of the Llama Stack instance. If it is not set, the client defaults to `http://localhost:8321`, which matches our `port-forward` command. For Kubernetes, we configure it like this:

```shell
func config envs add --name=LLSD_HOST --value=http://llamastackdistribution-sample-service.default.svc.cluster.local:8321
```

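Afterwards you can double-check the configured value by listing the function's environment variables:

```shell
func config envs
```
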
### Inference request with Knative Functions

The main interaction for the actual inference happens in the `handle` function, which is executed on every request:

```python
...
# 1) Extract the CloudEvent from the scope
request = scope["event"]

# 2) Create a new CloudEvent as response
response = CloudEvent({
    "type": "dev.knative.function.response",
    "source": "my/ai/knative/function",
    "id": f"response-{request.get('id', 'unknown')}"
})

chat_response = self.client.inference.chat_completion(
    model_id=self.model_id,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": request.data["query"]},
    ],
)

# 3) Set the response's data field to the inference result
response.data = {"chat": chat_response.completion_message.content}

# 4) Send the response CloudEvent
# The 'send' method is already decorated with CloudEvent middleware
await send(response)
...
```

We extract the `query` field from the _JSON_ payload of the incoming CloudEvent and pass it as the user `content` to the `chat_completion` function of the Llama Stack inference client. Once the inference request has completed, we put the model's answer into the `chat` field of our HTTP CloudEvent response and return it.

Here is the entire code:

```python
# Function
import logging
import os

from cloudevents.http import CloudEvent
from llama_stack_client import LlamaStackClient


def new():
    return Function()


class Function:
    def __init__(self):
        logging.info("Connecting to LLama Stack")
        self.client = LlamaStackClient(base_url=os.environ.get("LLSD_HOST", "http://localhost:8321"))
        models = self.client.models.list()

        # Select the first LLM
        llm = next(m for m in models if m.model_type == "llm")
        self.model_id = llm.identifier

        print("Using Model:", self.model_id)

    async def handle(self, scope, receive, send):
        logging.info("Request Received")

        # 1) Extract the CloudEvent from the scope
        request = scope["event"]

        # 2) Create a new CloudEvent as response
        response = CloudEvent({
            "type": "dev.knative.function.response",
            "source": "my/ai/knative/function",
            "id": f"response-{request.get('id', 'unknown')}"
        })

        chat_response = self.client.inference.chat_completion(
            model_id=self.model_id,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": request.data["query"]},
            ],
        )

        # 3) Set the response's data field to the inference result
        response.data = {"chat": chat_response.completion_message.content}

        # 4) Send the response CloudEvent
        # The 'send' method is already decorated with CloudEvent middleware
        await send(response)

    def start(self, cfg):
        logging.info("Function starting")

    def stop(self):
        logging.info("Function stopping")

    def alive(self):
        return True, "Alive"

    def ready(self):
        return True, "Ready"
```

_**NOTE:** The docstrings were removed to keep the program compact._

### First contact

We can now run our function locally by issuing `func run` on the command line. Once it is running, you will see a log like the one below:

```shell
INFO:root:Functions middleware invoking user function
INFO:root:Connecting to LLama Stack
INFO:httpx:HTTP Request: GET http://localhost:8321/v1/models "HTTP/1.1 200 OK"
Using Model: llama3.2:1b
INFO:root:function starting on ['127.0.0.1:8080']
INFO:root:Function starting
[2025-07-04 14:22:05 +0200] [38829] [INFO] Running on http://127.0.0.1:8080 (CTRL + C to quit)
INFO:hypercorn.error:Running on http://127.0.0.1:8080 (CTRL + C to quit)
Running on host port 8080
```

Now we can send a CloudEvent to the function that contains our query for the AI model inference. In a new terminal, from the function project directory, we use `func invoke` for this:

```shell
func invoke -f=cloudevent --data='{"query":"Tell me a dad joke!"}'

Context Attributes,
  specversion: 1.0
  type: dev.knative.function.response
  source: https://knative.dev/python-function-response
  id: response-b2bf2fd6-600b-49a9-9644-a86e119d0873
  time: 2025-07-04T12:35:49.985942Z
Data,
  {"chat": "Here's one:\n\nWhy couldn't the bicycle stand up by itself?\n\n(wait for it...)\n\nBecause it was two-tired!\n\nHope that made you groan and laugh! Do you want another one?"}
```

The function returns a new CloudEvent that carries the `chat` object in its body: the joke we got from the Llama model.

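`func invoke` is a convenience wrapper; since the function speaks the CloudEvents HTTP protocol, you can also call it directly. Here is a sketch with `curl` against the locally running function, sending a binary-mode CloudEvent (assuming the function is listening on port 8080, as shown in the log above; the `type`, `source`, and `id` values are arbitrary examples):

```shell
curl -v http://localhost:8080 \
  -H "Content-Type: application/json" \
  -H "Ce-Specversion: 1.0" \
  -H "Ce-Type: my.ai.query" \
  -H "Ce-Source: curl/local-test" \
  -H "Ce-Id: request-1" \
  -d '{"query":"Tell me a dad joke!"}'
```
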
### Kubernetes

To deploy the function to our `kind` cluster you need to install Knative Serving. The [Llama Stack Stack repo](https://github.com/matzew/llama-stack-stack){:target="_blank"} has a script for this as well. Once it is installed, simply run:

```shell
func deploy --builder=host --build
```

This builds the function using the `host` builder, pushes it to the container registry, and finally deploys it as a Knative Service on Kubernetes:

```shell
🙌 Function built: quay.io/<my-func-org>/inference-func:latest
pushing 100% |████████████████████████████████████████| (175/121 MB, 24 MB/s) [7s]
✅ Function deployed in namespace "default" and exposed at URL:
http://inference-func.default.svc.cluster.local
```

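To confirm what was created on the cluster, you can look at the Knative Service backing the function (the resource name matches the function name):

```shell
# `ksvc` is the short name for Knative Serving Services
kubectl get ksvc inference-func
```
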
We can now use the exact same `func invoke` command to test the deployed Knative Function as well.

_Note: The container registry is configurable via the `$FUNC_REGISTRY` environment variable on your machine._

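For example, to have `func deploy` push to your own organization (the value below is a placeholder):

```shell
export FUNC_REGISTRY=quay.io/<my-func-org>
```
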
## Conclusion

We have seen that the new Python runtime for Knative Functions handles AI integrations very nicely. Functions have a well-defined lifecycle that fits the various aspects of AI applications, and the `llama_stack_client` library gives them access to a standardized AI stack: Llama Stack.

To learn more about Knative Functions visit the [documentation](docs/functions){:target="_blank"} on our website or join our CNCF Slack channel [#knative-functions](https://cloud-native.slack.com/archives/C04LKEZUXEE)!

Enjoy!