---
title: VoxSum Studio
emoji: π
colorFrom: green
colorTo: yellow
sdk: docker
app_port: 7860
tags:
- fastapi
- web-app
pinned: false
short_description: 'VoxSum Studio: Transform Audio into Insightful Summaries'
license: apache-2.0
---
# VoxSum Studio
VoxSum Studio is a powerful web application built for Hugging Face Spaces, designed to transform audio into insightful summaries. This tool leverages advanced Automatic Speech Recognition (ASR) and Large Language Models (LLMs) to transcribe and summarize audio from podcasts, YouTube videos, or uploaded files. With an interactive transcript player and customizable settings, VoxSum Studio makes it easy to extract key insights from audio content in real time.
The application features a modern web interface built with HTML, CSS, and JavaScript, powered by a FastAPI backend for robust API handling.
## Features
- Podcast Search & Download: Search for podcast series, browse episodes, and download audio directly from the app.
- YouTube Audio Fetching: Extract audio from YouTube videos by providing a URL.
- Audio Upload: Upload your own audio files (MP3, WAV) for transcription and summarization.
- Interactive Transcript Player: View real-time transcripts synced with audio playback, with clickable timestamps for easy navigation and auto-scrolling highlights.
- Customizable Summarization: Choose from multiple LLMs and provide custom prompts to generate tailored summaries.
- Voice Activity Detection (VAD): Adjust the VAD threshold to optimize transcription accuracy.
- Web Interface: A user-friendly interface with settings for model selection and real-time status updates.
## Getting Started

### Usage
1. Launch the application: open the app via the URL provided by your Hugging Face Space.
2. Select an audio source: search for podcast series, browse episodes, and download audio; upload an MP3 or WAV file; or enter a YouTube video URL to extract its audio.
3. Transcribe: click the "Transcribe Audio" button. Transcript text appears in real time, and on completion an interactive player is generated for viewing the transcript synchronized with the audio.
4. Generate a summary: choose an LLM, optionally enter a custom prompt, and click the "Generate Summary" button to create a summary from the transcript.
5. Interact with results: click a transcript segment in the player to jump to the corresponding point in the audio and quickly locate key content. The summary is displayed below the transcript, presenting the core information from the audio.
## Configuration

- Settings:
  - VAD Threshold: Adjust the threshold (0.1 to 0.9) to fine-tune voice activity detection.
  - ASR Model: Select from available models for transcription.
  - LLM Model: Choose an LLM for summarization from the available options.
  - Custom Prompt: Input a custom prompt to guide the summarization process.
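As an illustration of the VAD setting, a minimal sketch of server-side validation (the function name is an assumption; only the 0.1 to 0.9 range comes from this README):

```python
# Hypothetical helper: clamp a user-supplied VAD threshold into the supported range.
# The 0.1-0.9 bounds match the Configuration section; the name is illustrative.
def clamp_vad_threshold(value, lo=0.1, hi=0.9):
    """Coerce a user-supplied threshold into [lo, hi]."""
    return max(lo, min(hi, float(value)))
```

A value outside the range is silently clamped rather than rejected; a real endpoint might instead return a validation error.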
## Project Structure
```
voxsum-studio/
├── Dockerfile                  # Docker configuration for building and running the app
├── README.md                   # Project documentation and setup instructions
├── requirements.txt            # Python dependencies for the project
├── src/                        # Source code directory
│   ├── __init__.py             # Makes src a Python package
│   ├── asr.py                  # Logic for Automatic Speech Recognition (ASR) transcription
│   ├── diarization.py          # Speaker diarization functionality
│   ├── export_utils.py         # Utilities for exporting transcripts and summaries
│   ├── improved_diarization.py # Enhanced diarization features
│   ├── podcast.py              # Functions for podcast search, episode fetching, and audio downloading
│   ├── summarization.py        # Logic for generating summaries using LLMs
│   ├── utils.py                # Utility functions and model configurations
│   ├── server/                 # FastAPI backend
│   │   ├── __init__.py
│   │   ├── main.py             # Main FastAPI application
│   │   ├── core/               # Core configuration
│   │   ├── models/             # Pydantic models for the API
│   │   ├── routers/            # API routes
│   │   └── services/           # Business logic services
│   └── frontend/               # Static frontend files
│       └── static/             # Static assets
├── frontend/                   # Frontend source files
│   ├── app.js                  # Main JavaScript application
│   ├── index.html              # Main HTML page
│   └── styles.css              # CSS styles
└── static/                     # Static assets directory
    └── audio/                  # Temporary storage for audio files (not tracked in git)
```
## Notes
- Architecture: The application uses a FastAPI backend for API endpoints and a vanilla JavaScript frontend for the user interface.
- Temporary Storage: Uploaded and downloaded audio files are stored in the `/tmp` directory (mapped to `static/audio/`) for Hugging Face Spaces compatibility.
- Audio Formats: Supports MP3 and WAV files for uploads and downloads.
- Error Handling: The app provides real-time status updates and error messages for transcription or summarization failures.
- Interactive Player: The player is implemented as a single HTML component with JavaScript for seamless audio-transcript synchronization.
- Docker Support: The `Dockerfile` ensures consistent environments on Hugging Face Spaces.
- The `__pycache__` directory (auto-generated) is excluded from version control.
## Limitations
- Transcription and summarization quality depend on the selected models and audio clarity.
- Large audio files may take longer to process, especially in a resource-constrained environment like Hugging Face Spaces.
- YouTube audio fetching requires a valid URL and may be subject to rate limits or availability.
## Tests

### Overview
LLM tests are now part of the default test run because multilingual summarization and title generation are core to VoxSum's value.
Test categories:
- LLM-dependent tests (default ON): multilingual summarization, title generation, language consistency.
- Lightweight diarization tests: fast heuristics & structural checks.
If you need a fast pass without loading models (e.g. in a tiny CI runner), you can explicitly skip LLM tests (see below).
### Running all tests (default, includes LLM)

Install dependencies, then run:

```shell
pip install -r requirements.txt
pytest -q
```
### Skipping LLM tests (opt-out)

If you only want the lightweight diarization tests:

```shell
export VOXSUM_SKIP_LLM_TESTS=1
pytest -q
```
This will module-skip:

- `test_multilingual.py`
- `test_multilingual_quick.py`
- `test_summary_language.py`
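The module-level skip can be expressed with a small guard like this (a hypothetical sketch, not the project's actual code; only the `VOXSUM_SKIP_LLM_TESTS` variable name comes from this README):

```python
# Hypothetical guard used at the top of an LLM test module to decide
# whether pytest.skip(..., allow_module_level=True) should be called.
import os

def should_skip_llm_tests(env=None):
    """Return True when the opt-out variable is set to '1'."""
    if env is None:
        env = os.environ
    return env.get("VOXSUM_SKIP_LLM_TESTS") == "1"
```

In a test module this would typically be paired with `pytest.skip(..., allow_module_level=True)` when the function returns `True`.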
These tests exercise:

- Multilingual summarization pipeline (`summarize_transcript`)
- Title generation (`generate_title`)
- Language consistency heuristics
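A language-consistency heuristic could look something like the following sketch (illustrative only; the project's actual check may differ, and both function names are assumptions):

```python
# Hypothetical heuristic: judge whether a summary stays in the
# transcript's language by comparing dominant scripts.
def dominant_script(text):
    """Classify text as 'cjk' or 'latin' by counting code points."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return "cjk" if cjk > latin else "latin"

def same_language(transcript, summary):
    """True when transcript and summary share a dominant script."""
    return dominant_script(transcript) == dominant_script(summary)
```

A script-counting check like this is deliberately coarse: it catches the common failure mode (summarizing a Chinese transcript in English, say) without needing a language-identification model.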
### Mocking strategy (opt-out mode)

`tests/conftest.py` activates a lightweight mock of the LLM interface only when `VOXSUM_SKIP_LLM_TESTS=1`:

- Replaces `get_llm()` with a dummy object.
- Avoids native model loading cost.
- Provides deterministic minimal outputs for structural assertions.
### Minimal diarization sanity test

File: `tests/test_diarization_minimal.py`
It validates four scenarios:
- Single segment
- Two very similar segments (should unify speaker identity)
- Two dissimilar segments (can diverge; heuristic tolerant)
- Three segments (granularity preservation path)
The test harness:

- Uses a mock embedding extractor (no external model downloads).
- Exercises the small-`n` heuristic path (fewer than 3 embeddings) and the adaptive clustering interface.
Run directly if desired:

```shell
python3 tests/test_diarization_minimal.py
```
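The "two similar segments unify, two dissimilar segments diverge" behavior can be sketched with a greedy similarity heuristic (illustrative only; function names and the 0.8 threshold are assumptions, not the project's actual clustering):

```python
# Hypothetical small-n diarization heuristic: reuse a speaker label when a
# segment embedding is similar enough to an existing speaker's embedding.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def assign_speakers(embeddings, threshold=0.8):
    """Greedy assignment: new speaker only when no existing one is similar."""
    labels, speakers = [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for idx, ref in enumerate(speakers):
            sim = cosine(emb, ref)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            speakers.append(emb)
            labels.append(len(speakers) - 1)
        else:
            labels.append(best)
    return labels
```

With two near-identical embeddings the heuristic unifies them under one label, while an orthogonal embedding gets a new one, mirroring the sanity test's first three scenarios.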
### Troubleshooting
| Symptom | Likely Cause | Fix |
|---|---|---|
| Segmentation fault during tests | Native model resource issue | Temporarily `export VOXSUM_SKIP_LLM_TESTS=1` to isolate; verify the `llama_cpp` install / model size |
| LLM tests unexpectedly skipped | The skip variable is still set | `unset VOXSUM_SKIP_LLM_TESTS`; re-run tests |
| Slow startup | Large GGUF model download/load | Choose a smaller model in `available_gguf_llms` |
| Mock not applied (you wanted skip) | Forgot to set the skip variable | `export VOXSUM_SKIP_LLM_TESTS=1` |
### Adding new tests

When adding tests that touch summarization or title generation:

- Assume they run by default; only guard them with the skip variable if they're extremely slow or redundant.
- Keep logic deterministic: avoid external network calls beyond local model loading.
- For structure-only assertions, note that contributors can run with `VOXSUM_SKIP_LLM_TESTS=1` for speed.
### CI Recommendation

Two useful CI lanes:

- Full (default): `pytest -q` (includes LLM tests)
- Fast lane (optional): `VOXSUM_SKIP_LLM_TESTS=1 pytest -q` for quick structural feedback

Run the fast lane on every commit if startup time is critical; schedule the full lane on PRs and nightly builds.
## Contributing
Contributions are welcome! To contribute:
- Fork the repository on Hugging Face.
- Create a new branch for your feature or bug fix.
- Submit a pull request with a clear description of your changes.
## License

This project is licensed under the Apache 2.0 License (as declared in the Space metadata above). See the LICENSE file for details.
## Acknowledgments
- Built with FastAPI for the backend API and vanilla JavaScript for the frontend.
- Powered by Hugging Face Spaces for hosting and deployment.
- Inspired by advancements in ASR and LLM technologies for audio processing.