---
title: VoxSum Studio
emoji: 🚀
colorFrom: green
colorTo: yellow
sdk: docker
app_port: 7860
tags:
  - fastapi
  - web-app
pinned: false
short_description: 'VoxSum Studio: Transform Audio into Insightful Summaries'
license: apache-2.0
---

# VoxSum Studio

**VoxSum Studio** is a web application built for Hugging Face Spaces that transforms audio into insightful summaries. It uses Automatic Speech Recognition (ASR) and Large Language Models (LLMs) to transcribe and summarize audio from podcasts, YouTube videos, or uploaded files. With an interactive transcript player and customizable settings, VoxSum Studio makes it easy to extract key insights from audio content in real time.

The application features a modern web interface built with HTML, CSS, and JavaScript, powered by a FastAPI backend for robust API handling.

## Features

- **Podcast Search & Download**: Search for podcast series, browse episodes, and download audio directly from the app.
- **YouTube Audio Fetching**: Extract audio from YouTube videos by providing a URL.
- **Audio Upload**: Upload your own audio files (MP3, WAV) for transcription and summarization.
- **Interactive Transcript Player**: View real-time transcripts synced with audio playback, with clickable timestamps for easy navigation and auto-scrolling highlights.
- **Customizable Summarization**: Choose from multiple LLMs and provide custom prompts to generate tailored summaries.
- **Voice Activity Detection (VAD)**: Adjust the VAD threshold to optimize transcription accuracy.
- **Web Interface**: A user-friendly interface with settings for model selection and real-time status updates.

## Getting Started

### Usage

1. **Launch the application**: Open the app via the URL provided by your Hugging Face Space.
2. **Select an audio source**: Search for podcast series, browse episodes, and download audio; upload an MP3 or WAV file; or enter a YouTube video URL to extract its audio.
3. **Transcribe**: Click the "Transcribe Audio" button to start transcription. Text appears in real time, and on completion an interactive player is generated for viewing the transcript synchronized with the audio.
4. **Generate a summary**: Click the "Generate Summary" button to create a summary from the transcript. Choose an LLM and optionally provide a custom prompt to tailor the result.
5. **Interact with the results**: Click any segment in the transcript to jump to the corresponding point in the audio. The summary is displayed below the transcript.

## Configuration

- **Settings**:
  - **VAD Threshold**: Adjust the threshold (0.1 to 0.9) to fine-tune voice activity detection.
  - **ASR Model**: Select from the available transcription models.
  - **LLM Model**: Choose an LLM for summarization from the available options.
  - **Custom Prompt**: Input a custom prompt to guide the summarization process.
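The settings above map naturally onto a small validated configuration object. As a minimal sketch (the class and field names here are hypothetical, not the app's actual code), clamping the VAD threshold into its documented 0.1 to 0.9 range could look like:

```python
from dataclasses import dataclass

# Hypothetical settings container; names are illustrative, not VoxSum's actual API.
@dataclass
class TranscriptionSettings:
    vad_threshold: float = 0.5  # documented range: 0.1 to 0.9
    asr_model: str = "default"
    llm_model: str = "default"
    custom_prompt: str = ""     # optional prompt guiding summarization

    def __post_init__(self) -> None:
        # Clamp the VAD threshold into the documented 0.1-0.9 range.
        self.vad_threshold = min(0.9, max(0.1, self.vad_threshold))
```

A validation step like this keeps out-of-range values from reaching the VAD stage; the real app may handle this differently (for example, rejecting invalid values instead of clamping).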
## Project Structure

```
voxsum-studio/
├── Dockerfile                  # Docker configuration for building and running the app
├── README.md                   # Project documentation and setup instructions
├── requirements.txt            # Python dependencies for the project
├── src/                        # Source code directory
│   ├── __init__.py             # Makes src a Python package
│   ├── asr.py                  # Logic for Automatic Speech Recognition (ASR) transcription
│   ├── diarization.py          # Speaker diarization functionality
│   ├── export_utils.py         # Utilities for exporting transcripts and summaries
│   ├── improved_diarization.py # Enhanced diarization features
│   ├── podcast.py              # Podcast search, episode fetching, and audio downloading
│   ├── summarization.py        # Logic for generating summaries using LLMs
│   ├── utils.py                # Utility functions and model configurations
│   ├── server/                 # FastAPI backend
│   │   ├── __init__.py
│   │   ├── main.py             # Main FastAPI application
│   │   ├── core/               # Core configuration
│   │   ├── models/             # Pydantic models for the API
│   │   ├── routers/            # API routes
│   │   └── services/           # Business logic services
│   ├── frontend/               # Static frontend files
│   └── static/                 # Static assets
├── frontend/                   # Frontend source files
│   ├── app.js                  # Main JavaScript application
│   ├── index.html              # Main HTML page
│   └── styles.css              # CSS styles
└── static/                     # Static assets directory
    └── audio/                  # Temporary storage for audio files (not tracked in git)
```

## Notes

- **Architecture**: The application uses a FastAPI backend for API endpoints and a vanilla JavaScript frontend for the user interface.
- **Temporary Storage**: Uploaded and downloaded audio files are stored in the `/tmp` directory (mapped to `static/audio/`) for Hugging Face Spaces compatibility.
- **Audio Formats**: Supports MP3 and WAV files for uploads and downloads.
- **Error Handling**: The app provides real-time status updates and error messages for transcription or summarization failures.
- **Interactive Player**: The player is implemented as a single HTML component with JavaScript for seamless audio-transcript synchronization.
- **Docker Support**: The `Dockerfile` ensures consistent environments on Hugging Face Spaces.
- The `__pycache__` directory (auto-generated) is excluded from version control.

## Limitations

- Transcription and summarization quality depend on the selected models and audio clarity.
- Large audio files may take longer to process, especially in a resource-constrained environment like Hugging Face Spaces.
- YouTube audio fetching requires a valid URL and may be subject to rate limits or availability.

## Tests

### Overview

LLM tests are part of the default test run because multilingual summarization and title generation are core to VoxSum's value.

Test categories:

1. LLM-dependent tests (default ON): multilingual summarization, title generation, language consistency.
2. Lightweight diarization tests: fast heuristics and structural checks.

If you need a fast pass without loading models (e.g. in a small CI runner), you can explicitly skip the LLM tests (see below).

### Running all tests (default, includes LLM)

Install dependencies, then run:

```
pip install -r requirements.txt
pytest -q
```

### Skipping LLM tests (opt-out)

If you only want the lightweight diarization tests:

```
export VOXSUM_SKIP_LLM_TESTS=1
pytest -q
```

This will module-skip:

- `test_multilingual.py`
- `test_multilingual_quick.py`
- `test_summary_language.py`

These tests exercise:

- The multilingual summarization pipeline (`summarize_transcript`)
- Title generation (`generate_title`)
- Language consistency heuristics

### Mocking strategy (opt-out mode)

`tests/conftest.py` activates a lightweight mock of the LLM interface only when `VOXSUM_SKIP_LLM_TESTS=1`:

- Replaces `get_llm()` with a dummy object.
- Avoids the cost of loading the native model.
- Provides deterministic minimal outputs for structural assertions.

### Minimal diarization sanity test

File: `tests/test_diarization_minimal.py`

It validates four scenarios:

- A single segment
- Two very similar segments (should unify speaker identity)
- Two dissimilar segments (may diverge; the heuristic is tolerant)
- Three segments (granularity preservation path)

The test harness:

- Uses a mock embedding extractor (no external model downloads).
- Exercises the small-`n` heuristic path (fewer than 3 embeddings) and the adaptive clustering interface.

Run it directly if desired:

```
python3 tests/test_diarization_minimal.py
```

### Troubleshooting

| Symptom | Likely Cause | Fix |
|---------|--------------|-----|
| Segmentation fault during tests | Native model resource issue | Temporarily `export VOXSUM_SKIP_LLM_TESTS=1` to isolate; verify the `llama_cpp` install and model size |
| LLM tests unexpectedly skipped | The skip variable is still set | `unset VOXSUM_SKIP_LLM_TESTS` and re-run the tests |
| Slow startup | Large GGUF model download/load | Choose a smaller model in `available_gguf_llms` |
| Mock not applied (you wanted to skip) | Forgot to set the skip variable | `export VOXSUM_SKIP_LLM_TESTS=1` |

### Adding new tests

When adding tests that touch summarization or title generation:

1. Assume they run by default; only guard them with the skip variable if they are extremely slow or redundant.
2. Keep logic deterministic; avoid external network calls beyond local model loading.
3. For structure-only assertions, note that contributors can run with `VOXSUM_SKIP_LLM_TESTS=1` for speed.

### CI Recommendation

Two useful CI lanes:

1. Full (default): `pytest -q` (includes LLM tests)
2. Fast lane (optional): `VOXSUM_SKIP_LLM_TESTS=1 pytest -q` for quick structural feedback.

Run the fast lane on every commit if startup time is critical; schedule the full lane on pull requests and nightly builds.

## Contributing

Contributions are welcome! To contribute:

1. Fork the repository on Hugging Face.
2. Create a new branch for your feature or bug fix.
3. Submit a pull request with a clear description of your changes.

## License

This project is licensed under the Apache License 2.0. See the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Built with [FastAPI](https://fastapi.tiangolo.com/) for the backend API and vanilla JavaScript for the frontend.
- Powered by Hugging Face Spaces for hosting and deployment.
- Inspired by advancements in ASR and LLM technologies for audio processing.
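As a closing sketch for contributors, the opt-out mocking strategy described in the Tests section might be wired up roughly like this (`DummyLLM` and `select_llm_factory` are hypothetical names for illustration; the real `tests/conftest.py` may differ):

```python
import os

class DummyLLM:
    """Deterministic stand-in for the native model; nothing is loaded."""
    def __call__(self, prompt: str, **kwargs) -> str:
        # Fixed minimal output so structural assertions stay deterministic.
        return "MOCK OUTPUT"

def select_llm_factory(real_factory):
    """Return the real LLM loader unless VOXSUM_SKIP_LLM_TESTS=1 is set."""
    if os.environ.get("VOXSUM_SKIP_LLM_TESTS") == "1":
        return DummyLLM  # cheap constructor instead of a native model load
    return real_factory
```

In a pytest setup, a `conftest.py` would check the environment variable once at collection time and patch `get_llm()` accordingly, so skipped runs never touch the native model.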