---
title: VoxSum Studio
emoji: π
colorFrom: green
colorTo: yellow
sdk: docker
app_port: 7860
tags:
- fastapi
- web-app
pinned: false
short_description: 'VoxSum Studio: Transform Audio into Insightful Summaries'
license: apache-2.0
---
# VoxSum Studio
VoxSum Studio is a powerful web application built for Hugging Face Spaces, designed to transform audio into insightful summaries. This tool leverages advanced Automatic Speech Recognition (ASR) and Large Language Models (LLMs) to transcribe and summarize audio from podcasts, YouTube videos, or uploaded files. With an interactive transcript player and customizable settings, VoxSum Studio makes it easy to extract key insights from audio content in real time.
The application features a modern web interface built with HTML, CSS, and JavaScript, powered by a FastAPI backend for robust API handling.
## Features
- Podcast Search & Download: Search for podcast series, browse episodes, and download audio directly from the app.
- YouTube Audio Fetching: Extract audio from YouTube videos by providing a URL.
- Audio Upload: Upload your own audio files (MP3, WAV) for transcription and summarization.
- Interactive Transcript Player: View real-time transcripts synced with audio playback, with clickable timestamps for easy navigation and auto-scrolling highlights.
- Customizable Summarization: Choose from multiple LLMs and provide custom prompts to generate tailored summaries.
- Voice Activity Detection (VAD): Adjust the VAD threshold to optimize transcription accuracy.
- Web Interface: A user-friendly interface with settings for model selection and real-time status updates.
## Getting Started

### Usage
1. Launch the application: open the app via the URL provided by your Hugging Face Space.
2. Select an audio source: search for podcast series, browse episodes, and download audio; upload an MP3 or WAV file; or enter a YouTube video URL to extract its audio.
3. Transcribe: click the "Transcribe Audio" button. Transcript text appears in real time, and on completion an interactive player is generated for viewing the transcript synchronized with the audio.
4. Generate a summary: choose an LLM, optionally enter a custom prompt, and click the "Generate Summary" button to create a summary from the transcript.
5. Interact with results: click a transcript segment in the player to jump to the corresponding point in the audio and quickly locate key content. The summary is displayed below the transcript, presenting the core information from the audio.
## Configuration

- Settings:
  - VAD Threshold: Adjust the threshold (0.1 to 0.9) to fine-tune voice activity detection.
  - ASR Model: Select from available models for transcription.
  - LLM Model: Choose an LLM for summarization from the available options.
  - Custom Prompt: Input a custom prompt to guide the summarization process.
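As an illustration of the VAD setting, a minimal sketch of server-side validation (the function name is an assumption; only the 0.1 to 0.9 range comes from this README):

```python
# Hypothetical helper: clamp a user-supplied VAD threshold into the supported range.
# The 0.1-0.9 bounds match the Configuration section; the name is illustrative.
def clamp_vad_threshold(value, lo=0.1, hi=0.9):
    """Coerce a user-supplied threshold into [lo, hi]."""
    return max(lo, min(hi, float(value)))
```

A value outside the range is silently clamped rather than rejected; a real endpoint might instead return a validation error.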
## Project Structure
```
voxsum-studio/
├── Dockerfile                  # Docker configuration for building and running the app
├── README.md                   # Project documentation and setup instructions
├── requirements.txt            # Python dependencies for the project
├── src/                        # Source code directory
│   ├── __init__.py             # Makes src a Python package
│   ├── asr.py                  # Logic for Automatic Speech Recognition (ASR) transcription
│   ├── diarization.py          # Speaker diarization functionality
│   ├── export_utils.py         # Utilities for exporting transcripts and summaries
│   ├── improved_diarization.py # Enhanced diarization features
│   ├── podcast.py              # Functions for podcast search, episode fetching, and audio downloading
│   ├── summarization.py        # Logic for generating summaries using LLMs
│   ├── utils.py                # Utility functions and model configurations
│   ├── server/                 # FastAPI backend
│   │   ├── __init__.py
│   │   ├── main.py             # Main FastAPI application
│   │   ├── core/               # Core configuration
│   │   ├── models/             # Pydantic models for the API
│   │   ├── routers/            # API routes
│   │   └── services/           # Business logic services
│   └── frontend/               # Static frontend files
│       └── static/             # Static assets
├── frontend/                   # Frontend source files
│   ├── app.js                  # Main JavaScript application
│   ├── index.html              # Main HTML page
│   └── styles.css              # CSS styles
└── static/                     # Static assets directory
    └── audio/                  # Temporary storage for audio files (not tracked in git)
```
## Notes
- Architecture: The application uses a FastAPI backend for API endpoints and a vanilla JavaScript frontend for the user interface.
- Temporary Storage: Uploaded and downloaded audio files are stored in the `/tmp` directory (mapped to `static/audio/`) for Hugging Face Spaces compatibility.
- Audio Formats: Supports MP3 and WAV files for uploads and downloads.
- Error Handling: The app provides real-time status updates and error messages for transcription or summarization failures.
- Interactive Player: The player is implemented as a single HTML component with JavaScript for seamless audio-transcript synchronization.
- Docker Support: The `Dockerfile` ensures consistent environments on Hugging Face Spaces.
- The `__pycache__` directory (auto-generated) is excluded from version control.
## Limitations
- Transcription and summarization quality depend on the selected models and audio clarity.
- Large audio files may take longer to process, especially in a resource-constrained environment like Hugging Face Spaces.
- YouTube audio fetching requires a valid URL and may be subject to rate limits or availability.
## Tests

### Overview
LLM tests are now part of the default test run because multilingual summarization and title generation are core to VoxSum's value.
Test categories:
- LLM-dependent tests (default ON): multilingual summarization, title generation, language consistency.
- Lightweight diarization tests: fast heuristics & structural checks.
If you need a fast pass without loading models (e.g. in a tiny CI runner), you can explicitly skip LLM tests (see below).
### Running all tests (default, includes LLM)

Install dependencies, then run:

```shell
pip install -r requirements.txt
pytest -q
```
### Skipping LLM tests (opt-out)

If you only want the lightweight diarization tests:

```shell
export VOXSUM_SKIP_LLM_TESTS=1
pytest -q
```
This will module-skip:

- `test_multilingual.py`
- `test_multilingual_quick.py`
- `test_summary_language.py`
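The module-level skip can be expressed with a small guard like this (a hypothetical sketch, not the project's actual code; only the `VOXSUM_SKIP_LLM_TESTS` variable name comes from this README):

```python
# Hypothetical guard used at the top of an LLM test module to decide
# whether pytest.skip(..., allow_module_level=True) should be called.
import os

def should_skip_llm_tests(env=None):
    """Return True when the opt-out variable is set to '1'."""
    if env is None:
        env = os.environ
    return env.get("VOXSUM_SKIP_LLM_TESTS") == "1"
```

In a test module this would typically be paired with `pytest.skip(..., allow_module_level=True)` when the function returns `True`.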
These tests exercise:

- Multilingual summarization pipeline (`summarize_transcript`)
- Title generation (`generate_title`)
- Language consistency heuristics
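A language-consistency heuristic could look something like the following sketch (illustrative only; the project's actual check may differ, and both function names are assumptions):

```python
# Hypothetical heuristic: judge whether a summary stays in the
# transcript's language by comparing dominant scripts.
def dominant_script(text):
    """Classify text as 'cjk' or 'latin' by counting code points."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return "cjk" if cjk > latin else "latin"

def same_language(transcript, summary):
    """True when transcript and summary share a dominant script."""
    return dominant_script(transcript) == dominant_script(summary)
```

A script-counting check like this is deliberately coarse: it catches the common failure mode (summarizing a Chinese transcript in English, say) without needing a language-identification model.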
### Mocking strategy (opt-out mode)

`tests/conftest.py` activates a lightweight mock of the LLM interface only when `VOXSUM_SKIP_LLM_TESTS=1`:

- Replaces `get_llm()` with a dummy object.
- Avoids native model loading cost.
- Provides deterministic minimal outputs for structural assertions.
### Minimal diarization sanity test

File: `tests/test_diarization_minimal.py`
It validates four scenarios:
- Single segment
- Two very similar segments (should unify speaker identity)
- Two dissimilar segments (can diverge; heuristic tolerant)
- Three segments (granularity preservation path)
The test harness:

- Uses a mock embedding extractor (no external model downloads).
- Exercises the small-`n` heuristic path (fewer than 3 embeddings) and the adaptive clustering interface.
Run directly if desired:

```shell
python3 tests/test_diarization_minimal.py
```
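The "two similar segments unify, two dissimilar segments diverge" behavior can be sketched with a greedy similarity heuristic (illustrative only; function names and the 0.8 threshold are assumptions, not the project's actual clustering):

```python
# Hypothetical small-n diarization heuristic: reuse a speaker label when a
# segment embedding is similar enough to an existing speaker's embedding.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def assign_speakers(embeddings, threshold=0.8):
    """Greedy assignment: new speaker only when no existing one is similar."""
    labels, speakers = [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, ref in enumerate(speakers):
            sim = cosine(emb, ref)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            speakers.append(emb)
            labels.append(len(speakers) - 1)
        else:
            labels.append(best)
    return labels
```

With two near-identical embeddings the heuristic unifies them under one label, while an orthogonal embedding gets a new one, mirroring the sanity test's first three scenarios.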
### Troubleshooting
| Symptom | Likely Cause | Fix |
|---|---|---|
| Segmentation fault during tests | Native model resource issue | Temporarily `export VOXSUM_SKIP_LLM_TESTS=1` to isolate; verify the `llama_cpp` install / model size |
| LLM tests unexpectedly skipped | The skip variable is still set | `unset VOXSUM_SKIP_LLM_TESTS`; re-run tests |
| Slow startup | Large GGUF model download/load | Choose a smaller model in `available_gguf_llms` |
| Mock not applied (you wanted skip) | Forgot to set the skip variable | `export VOXSUM_SKIP_LLM_TESTS=1` |
### Adding new tests

When adding tests that touch summarization or title generation:

- Assume they run by default; only guard them with the skip variable if they're extremely slow or redundant.
- Keep logic deterministic: avoid external network calls beyond local model loading.
- For structure-only assertions, note that contributors can run with `VOXSUM_SKIP_LLM_TESTS=1` for speed.
### CI Recommendation

Two useful CI lanes:

- Full (default): `pytest -q` (includes LLM tests)
- Fast lane (optional): `VOXSUM_SKIP_LLM_TESTS=1 pytest -q` for quick structural feedback

Run the fast lane on every commit if startup time is critical; schedule the full lane on PRs and nightly builds.
## Contributing
Contributions are welcome! To contribute:
- Fork the repository on Hugging Face.
- Create a new branch for your feature or bug fix.
- Submit a pull request with a clear description of your changes.
## License

This project is licensed under the Apache 2.0 License (as declared in the Space metadata above). See the LICENSE file for details.
## Acknowledgments
- Built with FastAPI for the backend API and vanilla JavaScript for the frontend.
- Powered by Hugging Face Spaces for hosting and deployment.
- Inspired by advancements in ASR and LLM technologies for audio processing.