# ramalama-stack

An external provider for Llama Stack that allows the use of RamaLama for inference.
## Installing

You can install `ramalama-stack` from PyPI via `pip install ramalama-stack`.

This will also install Llama Stack and RamaLama if they are not already installed.
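For example, a minimal install into a fresh virtual environment might look like the following sketch (the directory name `.venv` is only an illustration):

```bash
# create and activate an isolated environment (optional but recommended)
python3 -m venv .venv
source .venv/bin/activate

# install the provider; Llama Stack and RamaLama are pulled in as dependencies
pip install ramalama-stack
```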
## Usage

> [!WARNING]
> The following workaround is currently needed to run this provider - see https://github.com/containers/ramalama-stack/issues/53 for more details.
>
> ```bash
> curl --create-dirs --output ~/.llama/providers.d/remote/inference/ramalama.yaml https://raw.githubusercontent.com/containers/ramalama-stack/refs/tags/v0.2.1/src/ramalama_stack/providers.d/remote/inference/ramalama.yaml
>
> curl --create-dirs --output ~/.llama/distributions/ramalama/ramalama-run.yaml https://raw.githubusercontent.com/containers/ramalama-stack/refs/tags/v0.2.1/src/ramalama_stack/ramalama-run.yaml
> ```
- First you will need a RamaLama server running - see the RamaLama project docs for more information.
- Ensure you set your `INFERENCE_MODEL` environment variable to the name of the model you have running via RamaLama.
- You can then run the RamaLama external provider via `llama stack run ~/.llama/distributions/ramalama/ramalama-run.yaml` (the full sequence is sketched below).
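Putting those steps together, a minimal end-to-end sketch might look like this. The model name `llama3.2:3b` is only an example - substitute whichever model you are serving with RamaLama:

```bash
# 1. serve a model with RamaLama (runs in the foreground; model name is an example)
ramalama serve llama3.2:3b

# 2. in a second shell, tell the provider which model is being served
export INFERENCE_MODEL=llama3.2:3b

# 3. start the Llama Stack server using the RamaLama run config
llama stack run ~/.llama/distributions/ramalama/ramalama-run.yaml
```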
> [!NOTE]
> You can also run the RamaLama external provider inside of a container via Podman:
>
> ```bash
> podman run \
>   --net=host \
>   --env RAMALAMA_URL=http://0.0.0.0:8080 \
>   --env INFERENCE_MODEL=$INFERENCE_MODEL \
>   quay.io/ramalama/llama-stack
> ```
This will start a Llama Stack server, which uses port 8321 by default. You can verify it works by configuring the Llama Stack Client to run against this server and sending a test request.
- If your client is running on the same machine as the server, you can run `llama-stack-client configure --endpoint http://0.0.0.0:8321 --api-key none`
- If your client is running on a different machine, you can run `llama-stack-client configure --endpoint http://<hostname>:8321 --api-key none`
- The client should give you a message similar to `Done! You can now use the Llama Stack Client CLI with endpoint <endpoint>`
- You can then test the server by running `llama-stack-client inference chat-completion --message "tell me a joke"`, which should return something like
```
ChatCompletionResponse(
    completion_message=CompletionMessage(
        content='A man walked into a library and asked the librarian, "Do you have any books on Pavlov\'s dogs
and Schrödinger\'s cat?" The librarian replied, "It rings a bell, but I\'m not sure if it\'s here or not."',
        role='assistant',
        stop_reason='end_of_turn',
        tool_calls=[]
    ),
    logprobs=None,
    metrics=[
        Metric(metric='prompt_tokens', value=14.0, unit=None),
        Metric(metric='completion_tokens', value=63.0, unit=None),
        Metric(metric='total_tokens', value=77.0, unit=None)
    ]
)
```
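As an optional sanity check - the exact output format varies by client version - you can also ask the server which models it has registered; the model you exported as `INFERENCE_MODEL` should appear under the RamaLama provider:

```bash
# list the models the Llama Stack server currently knows about
llama-stack-client models list
```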
## Llama Stack User Interface

Llama Stack includes an experimental user interface; check it out here.

To deploy the UI, run this:

```bash
podman run -d --rm --network=container:ramalama --name=streamlit quay.io/redhat-et/streamlit_client:0.1.0
```
> [!NOTE]
> If running on MacOS (not Linux), `--network=host` doesn't work. You'll need to publish the additional ports `8321:8321` and `8501:8501` with the `ramalama serve` command, then run with `--network=container:ramalama`.
>
> If running on Linux, use `--network=host` or `-p 8501:8501` instead. The Streamlit container will be able to access the RamaLama endpoint with either.
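For example, on Linux the host-network variant described in the note above would look like the following sketch (same `podman run` command as before, with only the networking flag changed):

```bash
# Linux only: share the host network so the UI container can reach the local
# RamaLama and Llama Stack endpoints directly
podman run -d --rm --network=host --name=streamlit quay.io/redhat-et/streamlit_client:0.1.0
```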