🗣️ Speech-LLM-Speech: Containerized Conversational AI Pipeline

Developed an end-to-end conversational AI pipeline that processes speech input through three modular components: Automatic Speech Recognition (ASR) using Whisper.cpp, an LLM decision maker integrating OpenAI/Gemini/Ollama APIs, and text-to-speech synthesis via Google Cloud. The system uses ROS2 for inter-node communication and Docker for deployment. GitHub repository: https://github.com/naren200/speech-llm-speech

%%{init: {'theme': 'dark', 'themeVariables': { 'background': 'transparent', 'primaryBorderColor': '#4FD1C550', 'lineColor': '#4FD1C580', 'textColor': '#FFFFFF', 'edgeLabelBackground': '#1d27383d', 'edgeLabelColor': '#FFFFFF' }}}%%
graph LR
    subgraph Docker_Network["Docker Network | ROS2 Domain"]
        direction TB
        A[["Whisper ASR<br/>(Docker Container)"]]:::asr -->|/recognized_speech| B[["Decision Maker<br/>(Docker Container)"]]:::llm
        B -->|/text_to_speak| C[["Google TTS<br/>(Docker Container)"]]:::tts
    end
    D[🎤 Audio Input]:::input --> A
    C --> E[🔈 Synthesized Speech]:::output
    F[[GPT-4]]:::openai -->|OpenAI API| B
    G[[Llama 2]]:::huggingface -->|Huggingface API| B
    H[[Ollama]]:::ollama -->|Local LLM| B
    classDef asr fill:#2B7A78,stroke:#38B2AC,color:#FFFFFF
    classDef llm fill:#7295d1b7,stroke:#718096,color:#FFFFFF
    classDef tts fill:#6B46c157,stroke:#9F7AEA,color:#FFFFFF
    classDef openai fill:#3182ce63,stroke:#63B3ED
    classDef huggingface fill:#38a1696e,stroke:#68D391
    classDef ollama fill:#dd6c2069,stroke:#F6AD55
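
The topic wiring in the diagram can be mimicked in plain Python to show how the three stages stay decoupled. This is a minimal sketch, not the project's code: `TopicBus` and `build_pipeline` are hypothetical names, and the bus is an in-process stand-in for ROS2 publishers and subscribers rather than rclpy. Only the two topic names come from the diagram.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class TopicBus:
    """In-process stand-in for ROS2 topics (NOT rclpy); delivery is synchronous."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> None:
        # Call every callback registered on this topic, in subscription order.
        for callback in self._subscribers[topic]:
            callback(message)


def build_pipeline(bus: TopicBus, decide: Callable[[str], str], spoken: List[str]) -> None:
    """Wire the decision maker and TTS stages to the two topics from the diagram."""
    # Decision maker: consumes recognized text and publishes the reply to speak.
    bus.subscribe(
        "/recognized_speech",
        lambda text: bus.publish("/text_to_speak", decide(text)),
    )
    # TTS stage: here it only records what it would synthesize.
    bus.subscribe("/text_to_speak", spoken.append)


bus = TopicBus()
spoken: List[str] = []
build_pipeline(bus, decide=lambda text: f"You said: {text}", spoken=spoken)

# The ASR stage would publish its transcript on this topic.
bus.publish("/recognized_speech", "hello robot")
```

Because each stage only knows topic names, any one of them can be swapped (for example a different LLM backend behind `decide`) without touching the others.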

Technical Highlights

Containerized ROS2 Nodes: Independently deployable Docker containers for ASR (C++), LLM decision maker (C++), and TTS (C++)
Multi-LLM Integration: Dynamic API selection between OpenAI, HuggingFace, and local Ollama models
Real-Time Audio Processing: Whisper.cpp optimization for WAV/MP3 parsing with 2.5s latency

Key Features

ROS2 /recognized_speech & /text_to_speak topics for modular communication
CMake integration for Whisper.cpp with custom audio preprocessing
Docker Compose orchestration for multi-container deployment
GPU acceleration support via NVIDIA Container Toolkit

Challenges Solved

Whisper.cpp Integration: Resolved CMake build issues and audio parsing challenges
ROS2-Docker Networking: Configured cross-container discovery using shared ROS domains
LLM Response Optimization: Implemented confidence-based API fallback mechanism
Audio Format Handling: Added resampling pipeline for MP3/WAV compatibility
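
A confidence-based fallback of the kind listed above could be sketched as follows. Everything here is assumed for illustration: the backend signature (prompt in, reply text plus a confidence score in [0, 1] out), the `ask_with_fallback` helper, the 0.6 threshold, and the stub backends. How the project actually scores confidence is not stated in this page.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical backend signature: prompt -> (reply text, confidence in [0, 1]).
Backend = Callable[[str], Tuple[str, float]]


def ask_with_fallback(
    prompt: str,
    backends: List[Tuple[str, Backend]],
    min_confidence: float = 0.6,  # assumed threshold
) -> Tuple[str, str, float]:
    """Try backends in order and return (name, reply, confidence).

    The first reply meeting min_confidence wins. If none does, the most
    confident reply seen is returned. Backends that raise are skipped.
    """
    best: Optional[Tuple[str, str, float]] = None
    for name, backend in backends:
        try:
            reply, confidence = backend(prompt)
        except Exception:
            continue  # e.g. network error or quota: fall through to the next backend
        if confidence >= min_confidence:
            return name, reply, confidence
        if best is None or confidence > best[2]:
            best = (name, reply, confidence)
    if best is None:
        raise RuntimeError("all LLM backends failed")
    return best


# Stub backends standing in for real API clients.
def flaky_openai(prompt: str) -> Tuple[str, float]:
    raise TimeoutError("simulated outage")


def unsure_huggingface(prompt: str) -> Tuple[str, float]:
    return "maybe?", 0.3


def local_ollama(prompt: str) -> Tuple[str, float]:
    return f"echo: {prompt}", 0.9


result = ask_with_fallback(
    "turn left",
    [("openai", flaky_openai), ("huggingface", unsure_huggingface), ("ollama", local_ollama)],
)
```

Ordering the list from preferred to last-resort backend keeps the policy in one place, so changing priorities does not require touching the calling code.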

Prerequisites

sudo apt install -y gnome-terminal 

For further instructions, follow here


Launch Full System

git clone https://github.com/naren200/speech-llm-speech.git
cd speech-llm-speech

# Start all services
./start_all_docker.sh

Demo Video: End-to-End System Walkthrough (https://youtu.be/7YaoBxjnQag)

Technical Deep Dive

Core Components

Component	Tech Stack	Optimization
ASR	C++17, Whisper.cpp	SIMD acceleration
LLM	Python3.9, FastAPI	Async API calls
TTS	gTTS, SoundFile	Audio buffering
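
To illustrate the "Async API calls" entry in the table, the sketch below issues two simulated LLM requests concurrently and bounds each with a timeout. It is a generic `asyncio` pattern under stated assumptions, not the project's code: `fake_llm_call` stands in for a real network request, and the delays, timeouts, and fallback sentence are made up.

```python
import asyncio
from typing import List


async def fake_llm_call(prompt: str, delay_s: float) -> str:
    """Stand-in for a network request to an LLM API."""
    await asyncio.sleep(delay_s)
    return f"reply to: {prompt}"


async def query_with_timeout(prompt: str, delay_s: float, timeout_s: float) -> str:
    """Await the call, but give up with a canned reply once timeout_s elapses."""
    try:
        return await asyncio.wait_for(fake_llm_call(prompt, delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        return "Sorry, I did not get an answer in time."


async def main() -> List[str]:
    # Both requests are in flight at once; gather returns results in call order.
    return await asyncio.gather(
        query_with_timeout("hello", delay_s=0.01, timeout_s=0.5),
        query_with_timeout("slow question", delay_s=0.5, timeout_s=0.05),
    )


results = asyncio.run(main())
```

A timeout matters in a speech loop because a stalled API call would otherwise leave the user waiting in silence.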

Troubleshooting

Common Issues:

Audio Device Permissions:

sudo usermod -aG audio $USER
sudo reboot

ROS2 Discovery:

export ROS_DOMAIN_ID=42
export ROS_LOCALHOST_ONLY=0
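
When discovery still fails, it can help to check these two variables inside each container. The helper below is a hypothetical diagnostic, not part of the repository. It relies on two ROS 2 conventions: an unset `ROS_DOMAIN_ID` means domain 0, and IDs from 0 to 101 are the range documented as safe on all platforms. All containers must also report the same domain ID to see each other.

```python
import os
from typing import List, Mapping


def check_ros_discovery_env(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems with ROS2 discovery settings (empty if none)."""
    problems: List[str] = []

    raw_domain = env.get("ROS_DOMAIN_ID", "0")  # ROS 2 uses domain 0 when unset
    try:
        domain_id = int(raw_domain)
    except ValueError:
        problems.append(f"ROS_DOMAIN_ID={raw_domain!r} is not an integer")
    else:
        if not 0 <= domain_id <= 101:
            problems.append(
                f"ROS_DOMAIN_ID={domain_id} is outside the 0-101 range that is safe on all platforms"
            )

    if env.get("ROS_LOCALHOST_ONLY", "0") == "1":
        problems.append(
            "ROS_LOCALHOST_ONLY=1 limits discovery to localhost, so nodes in other containers may not be found"
        )
    return problems


# Check the current process environment, e.g. from a shell inside a container.
for problem in check_ros_discovery_env(os.environ):
    print("WARNING:", problem)
```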

Debugging Modes:

# Developer mode with shell access
./start_docker.sh transcribe --developer=true

# Force rebuild containers
./start_docker.sh decide --build=true

Reference information:

GitHub Page: https://github.com/naren200/speech-llm-speech
Demo Video: https://youtu.be/7YaoBxjnQag
Docker Hub: https://hub.docker.com/u/naren200