# ramalama-stack

An external provider for Llama Stack that allows the use of RamaLama for inference.
## Installing

You can install `ramalama-stack` from PyPI via `pip install ramalama-stack`.

This will also install Llama Stack and RamaLama if they are not already installed.
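For example, a minimal install into a fresh virtual environment might look like the following sketch (the directory name `.venv` is only an illustration):

```bash
# create and activate an isolated environment (optional but recommended)
python3 -m venv .venv
source .venv/bin/activate

# install the provider; Llama Stack and RamaLama are pulled in as dependencies
pip install ramalama-stack
```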
## Usage

> [!WARNING]
> The following workaround is currently needed to run this provider - see https://github.com/containers/ramalama-stack/issues/53 for more details.
>
> ```bash
> curl --create-dirs --output ~/.llama/providers.d/remote/inference/ramalama.yaml https://raw.githubusercontent.com/containers/ramalama-stack/refs/tags/v0.2.1/src/ramalama_stack/providers.d/remote/inference/ramalama.yaml
>
> curl --create-dirs --output ~/.llama/distributions/ramalama/ramalama-run.yaml https://raw.githubusercontent.com/containers/ramalama-stack/refs/tags/v0.2.1/src/ramalama_stack/ramalama-run.yaml
> ```
- First you will need a RamaLama server running - see the RamaLama project docs for more information.
- Ensure you set your `INFERENCE_MODEL` environment variable to the name of the model you have running via RamaLama.
- You can then run the RamaLama external provider via `llama stack run ~/.llama/distributions/ramalama/ramalama-run.yaml` (the full sequence is sketched below).
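Putting those steps together, a minimal end-to-end sketch might look like this. The model name `llama3.2:3b` is only an example - substitute whichever model you are serving with RamaLama:

```bash
# 1. serve a model with RamaLama (runs in the foreground; model name is an example)
ramalama serve llama3.2:3b

# 2. in a second shell, tell the provider which model is being served
export INFERENCE_MODEL=llama3.2:3b

# 3. start the Llama Stack server using the RamaLama run config
llama stack run ~/.llama/distributions/ramalama/ramalama-run.yaml
```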
> [!NOTE]
> You can also run the RamaLama external provider inside of a container via Podman:
>
> ```bash
> podman run \
>   --net=host \
>   --env RAMALAMA_URL=http://0.0.0.0:8080 \
>   --env INFERENCE_MODEL=$INFERENCE_MODEL \
>   quay.io/ramalama/llama-stack
> ```
This will start a Llama Stack server, which uses port 8321 by default. You can verify it works by configuring the Llama Stack Client to run against this server and sending a test request.
- If your client is running on the same machine as the server, you can run `llama-stack-client configure --endpoint http://0.0.0.0:8321 --api-key none`
- If your client is running on a different machine, you can run `llama-stack-client configure --endpoint http://<hostname>:8321 --api-key none`
- The client should give you a message similar to `Done! You can now use the Llama Stack Client CLI with endpoint <endpoint>`
- You can then test the server by running `llama-stack-client inference chat-completion --message "tell me a joke"`, which should return something like
```
ChatCompletionResponse(
    completion_message=CompletionMessage(
        content='A man walked into a library and asked the librarian, "Do you have any books on Pavlov\'s dogs
and Schrödinger\'s cat?" The librarian replied, "It rings a bell, but I\'m not sure if it\'s here or not."',
        role='assistant',
        stop_reason='end_of_turn',
        tool_calls=[]
    ),
    logprobs=None,
    metrics=[
        Metric(metric='prompt_tokens', value=14.0, unit=None),
        Metric(metric='completion_tokens', value=63.0, unit=None),
        Metric(metric='total_tokens', value=77.0, unit=None)
    ]
)
```
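As an optional sanity check - the exact output format varies by client version - you can also ask the server which models it has registered; the model you exported as `INFERENCE_MODEL` should appear under the RamaLama provider:

```bash
# list the models the Llama Stack server currently knows about
llama-stack-client models list
```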
## Llama Stack User Interface

Llama Stack includes an experimental user interface; check it out here.

To deploy the UI, run this:

```bash
podman run -d --rm --network=container:ramalama --name=streamlit quay.io/redhat-et/streamlit_client:0.1.0
```
> [!NOTE]
> If running on MacOS (not Linux), `--network=host` doesn't work. You'll need to publish the additional ports `8321:8321` and `8501:8501` with the `ramalama serve` command, then run with `--network=container:ramalama`.
>
> If running on Linux, use `--network=host` or `-p 8501:8501` instead. The Streamlit container will be able to access the RamaLama endpoint with either.
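For example, on Linux the host-network variant described in the note above would look like the following sketch (same `podman run` command as before, with only the networking flag changed):

```bash
# Linux only: share the host network so the UI container can reach the local
# RamaLama and Llama Stack endpoints directly
podman run -d --rm --network=host --name=streamlit quay.io/redhat-et/streamlit_client:0.1.0
```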