dapr-agents/cookbook/llm/openai_audio_basic.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LLM: OpenAI Audio Endpoint Basic Examples\n",
    "\n",
    "This notebook demonstrates how to use the `OpenAIAudioClient` in `dapr-agents` for basic tasks with the OpenAI Audio API. We will explore:\n",
    "\n",
    "* Generating speech from text and saving it as an MP3 file.\n",
    "* Transcribing audio to text.\n",
    "* Translating audio content to English."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install Required Libraries\n",
    "\n",
    "Ensure you have the required libraries installed:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install dapr-agents python-dotenv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Environment Variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from dotenv import load_dotenv\n",
    "load_dotenv()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialize OpenAIAudioClient"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dapr_agents import OpenAIAudioClient\n",
    "\n",
    "client = OpenAIAudioClient()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Speech from Text\n",
    "\n",
    "### Manual File Creation\n",
    "\n",
    "`create_speech` returns the generated audio as bytes. This example converts a text input to speech and writes those bytes to an MP3 file manually."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Audio saved to output_speech.mp3\n"
     ]
    }
   ],
   "source": [
    "from dapr_agents.types.llm import AudioSpeechRequest\n",
    "\n",
    "# Define the text to convert to speech\n",
    "text_to_speech = \"Hello Roberto! This is an example of text-to-speech generation.\"\n",
    "\n",
    "# Create a request for TTS\n",
    "tts_request = AudioSpeechRequest(\n",
    "    model=\"tts-1\",\n",
    "    input=text_to_speech,\n",
    "    voice=\"fable\",\n",
    "    response_format=\"mp3\"\n",
    ")\n",
    "\n",
    "# Generate the audio\n",
    "audio_bytes = client.create_speech(request=tts_request)\n",
    "\n",
    "# Save the audio to an MP3 file\n",
    "output_path = \"output_speech.mp3\"\n",
    "with open(output_path, \"wb\") as audio_file:\n",
    "    audio_file.write(audio_bytes)\n",
    "\n",
    "print(f\"Audio saved to {output_path}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Automatic File Creation\n",
    "\n",
    "When you pass the `file_name` parameter, `create_speech` saves the audio file directly, so no manual write step is needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dapr_agents.types.llm import AudioSpeechRequest\n",
    "\n",
    "# Define the text to convert to speech\n",
    "text_to_speech = \"Hola Roberto! Este es otro ejemplo de generación de voz desde texto.\"\n",
    "\n",
    "# Create a request for TTS\n",
    "tts_request = AudioSpeechRequest(\n",
    "    model=\"tts-1\",\n",
    "    input=text_to_speech,\n",
    "    voice=\"echo\",\n",
    "    response_format=\"mp3\"\n",
    ")\n",
    "\n",
    "# Generate the audio\n",
    "client.create_speech(request=tts_request, file_name=\"output_speech_spanish_auto.mp3\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transcribe Audio to Text\n",
    "\n",
    "This section demonstrates how to transcribe audio content into text."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using a File Path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Transcription: Hello Roberto, this is an example of text-to-speech generation.\n"
     ]
    }
   ],
   "source": [
    "from dapr_agents.types.llm import AudioTranscriptionRequest\n",
    "\n",
    "# Specify the audio file to transcribe\n",
    "audio_file_path = \"output_speech.mp3\"\n",
    "\n",
    "# Create a transcription request\n",
    "transcription_request = AudioTranscriptionRequest(\n",
    "    model=\"whisper-1\",\n",
    "    file=audio_file_path\n",
    ")\n",
    "\n",
    "# Generate transcription\n",
    "transcription_response = client.create_transcription(request=transcription_request)\n",
    "\n",
    "# Display the transcription result\n",
    "print(\"Transcription:\", transcription_response.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using Audio Bytes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Transcription: Hola Roberto, este es otro ejemplo de generación de voz desde texto.\n"
     ]
    }
   ],
   "source": [
    "# Read the audio file into memory as raw bytes\n",
    "\n",
    "with open(\"output_speech_spanish_auto.mp3\", \"rb\") as f:\n",
    "    audio_bytes = f.read()\n",
    "\n",
    "transcription_request = AudioTranscriptionRequest(\n",
    "    model=\"whisper-1\",\n",
    "    file=audio_bytes,  # File as bytes\n",
    "    language=\"es\"  # Optional: language of the audio (this file is in Spanish)\n",
    ")\n",
    "\n",
    "# Generate transcription\n",
    "transcription_response = client.create_transcription(request=transcription_request)\n",
    "\n",
    "# Display the transcription result\n",
    "print(\"Transcription:\", transcription_response.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using File-Like Objects (e.g., BufferedReader)\n",
    "\n",
    "You can pass file-like objects, such as a `BufferedReader`, directly for transcription or translation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Transcription: ¡Hola, Roberto! Este es otro ejemplo de generación de voz desde texto.\n"
     ]
    }
   ],
   "source": [
    "# open() in binary read mode already returns an io.BufferedReader,\n",
    "# so the file object can be passed to the request as-is\n",
    "audio_file_path = \"output_speech_spanish_auto.mp3\"\n",
    "with open(audio_file_path, \"rb\") as buffered_file:\n",
    "\n",
    "    # Create a transcription request\n",
    "    transcription_request = AudioTranscriptionRequest(\n",
    "        model=\"whisper-1\",\n",
    "        file=buffered_file,  # File as BufferedReader\n",
    "        language=\"es\"\n",
    "    )\n",
    "\n",
    "    # Generate transcription\n",
    "    transcription_response = client.create_transcription(request=transcription_request)\n",
    "\n",
    "    # Display the transcription result\n",
    "    print(\"Transcription:\", transcription_response.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Translate Audio to English\n",
    "\n",
    "This section demonstrates how to translate audio content into English."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using a File Path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Translation: Hola Roberto, este es otro ejemplo de generación de voz desde texto.\n"
     ]
    }
   ],
   "source": [
    "from dapr_agents.types.llm import AudioTranslationRequest\n",
    "\n",
    "# Specify the audio file to translate\n",
    "audio_file_path = \"output_speech_spanish_auto.mp3\"\n",
    "\n",
    "# Create a translation request\n",
    "translation_request = AudioTranslationRequest(\n",
    "    model=\"whisper-1\",\n",
    "    file=audio_file_path,\n",
    "    prompt=\"The following audio needs to be translated to English.\"\n",
    ")\n",
    "\n",
    "# Generate translation\n",
    "translation_response = client.create_translation(request=translation_request)\n",
    "\n",
    "# Display the translation result\n",
    "print(\"Translation:\", translation_response.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using Audio Bytes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Translation: Hola Roberto, este es otro ejemplo de generación de voz desde texto.\n"
     ]
    }
   ],
   "source": [
    "# Read the audio file into memory as raw bytes\n",
    "\n",
    "with open(\"output_speech_spanish_auto.mp3\", \"rb\") as f:\n",
    "    audio_bytes = f.read()\n",
    "\n",
    "translation_request = AudioTranslationRequest(\n",
    "    model=\"whisper-1\",\n",
    "    file=audio_bytes,  # File as bytes\n",
    "    prompt=\"The following audio needs to be translated to English.\"\n",
    ")\n",
    "\n",
    "# Generate translation\n",
    "translation_response = client.create_translation(request=translation_request)\n",
    "\n",
    "# Display the translation result\n",
    "print(\"Translation:\", translation_response.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using File-Like Objects (e.g., BufferedReader) for Translation\n",
    "\n",
    "You can pass a file-like object, such as a `BufferedReader`, directly for translating audio content."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Translation: Hola Roberto, este es otro ejemplo de generación de voz desde texto.\n"
     ]
    }
   ],
   "source": [
    "# open() in binary read mode already returns an io.BufferedReader,\n",
    "# so the file object can be passed to the request as-is\n",
    "audio_file_path = \"output_speech_spanish_auto.mp3\"\n",
    "with open(audio_file_path, \"rb\") as buffered_file:\n",
    "\n",
    "    # Create a translation request\n",
    "    translation_request = AudioTranslationRequest(\n",
    "        model=\"whisper-1\",\n",
    "        file=buffered_file,  # File as BufferedReader\n",
    "        prompt=\"The following audio needs to be translated to English.\"\n",
    "    )\n",
    "\n",
    "    # Generate translation\n",
    "    translation_response = client.create_translation(request=translation_request)\n",
    "\n",
    "    # Display the translation result\n",
    "    print(\"Translation:\", translation_response.text)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}