Mahmoud Elsamadony committed on
Commit 988a3de · Parent: e1c6b8d
Files changed (5)
  1. API_USAGE.md +238 -0
  2. README.md +69 -2
  3. api_client.py +128 -0
  4. api_requirements.txt +1 -0
  5. app.py +30 -20
API_USAGE.md ADDED
@@ -0,0 +1,238 @@
+ # Using the VTT with Diarization Space via API
+
+ This guide shows how to call the Hugging Face Space programmatically.
+
+ ## Option 1: Using Python (Gradio Client)
+
+ ### Installation
+
+ ```bash
+ pip install gradio_client
+ ```
+
+ ### Quick Start
+
+ ```python
+ from gradio_client import Client
+
+ # Initialize the client
+ client = Client("MahmoudElsamadony/vtt-with-diariazation")
+
+ # Transcribe audio
+ result = client.predict(
+     audio_path="path/to/your/audio.mp3",
+     language="ar",  # or "en", "fr", etc., or "" for auto-detect
+     enable_diarization=False,
+     beam_size=5,
+     best_of=5,
+     api_name="/predict"
+ )
+
+ transcript, details = result
+ print(f"Transcript: {transcript}")
+ print(f"Language: {details['language']}")
+ print(f"Duration: {details['duration']} seconds")
+ ```
+
+ ### With Speaker Diarization
+
+ ```python
+ # Enable diarization to identify different speakers
+ result = client.predict(
+     audio_path="path/to/your/audio.mp3",
+     language="ar",
+     enable_diarization=True,  # enable speaker diarization
+     beam_size=5,
+     best_of=5,
+     api_name="/predict"
+ )
+
+ transcript, details = result
+
+ # Access speaker information
+ for segment in details['segments']:
+     speaker = segment.get('speaker', 'Unknown')
+     text = segment['text']
+     start = segment['start']
+     print(f"[{start:.2f}s] {speaker}: {text}")
+ ```
+
+ ### Full Example Script
+
+ See `api_client.py` for a complete example with multiple use cases.
+
+ ```bash
+ python api_client.py
+ ```
+
+ ## Option 2: Using JavaScript/TypeScript
+
+ ### Installation
+
+ ```bash
+ npm install @gradio/client
+ ```
+
+ ### Usage
+
+ ```javascript
+ import { client } from "@gradio/client";
+
+ const app = await client("MahmoudElsamadony/vtt-with-diariazation");
+
+ const result = await app.predict("/predict", [
+     "path/to/audio.mp3",  // audio_path (for local files, pass a Blob/File instead of a path string)
+     "ar",                 // language
+     false,                // enable_diarization
+     5,                    // beam_size
+     5                     // best_of
+ ]);
+
+ const [transcript, details] = result.data;
+ console.log("Transcript:", transcript);
+ console.log("Language:", details.language);
+ console.log("Duration:", details.duration);
+ ```
+
+ ## Option 3: Using cURL (REST API)
+
+ First, inspect your Space's API schema:
+
+ ```bash
+ curl https://mahmoudelsamadony-vtt-with-diariazation.hf.space/info
+ ```
+
+ Making a prediction with cURL is more involved: you must handle the file
+ upload yourself and then poll for the result, so the Python or JavaScript
+ clients are recommended instead.
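+
+ If you do need raw HTTP, the sketch below assumes a recent Gradio server
+ with the `/call/{api_name}` REST convention and references the audio by a
+ public URL; exact routes and payload shapes vary between Gradio versions,
+ so treat this as a starting point rather than a guaranteed recipe:
+
+ ```bash
+ # Step 1: submit the inputs; the server replies with an event id
+ EVENT_ID=$(curl -s -X POST \
+   https://mahmoudelsamadony-vtt-with-diariazation.hf.space/call/predict \
+   -H "Content-Type: application/json" \
+   -d '{"data": [{"path": "https://example.com/audio.mp3", "meta": {"_type": "gradio.FileData"}}, "ar", false, 5, 5]}' \
+   | jq -r .event_id)
+
+ # Step 2: stream the result for that event id (server-sent events)
+ curl -N https://mahmoudelsamadony-vtt-with-diariazation.hf.space/call/predict/$EVENT_ID
+ ```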
+
+ ## Parameters
+
+ | Parameter | Type | Default | Description |
+ |-----------|------|---------|-------------|
+ | `audio_path` | string | required | Path to the audio file (mp3, wav, m4a, etc.) |
+ | `language` | string | `"ar"` | Language code (`"ar"`, `"en"`, `"fr"`, `"de"`, `"es"`, `"ru"`, `"zh"`) or `""` for auto-detect |
+ | `enable_diarization` | boolean | `false` | Enable speaker diarization (identifies different speakers) |
+ | `beam_size` | integer | `5` | Beam size for Whisper (1-10; higher is more accurate but slower) |
+ | `best_of` | integer | `5` | Number of candidates Whisper samples at non-zero temperature (1-10) |
+
+ ## Response Format
+
+ The API returns a tuple `(transcript, details)`:
+
+ ### transcript (string)
+ The complete transcribed text.
+
+ ### details (object)
+ ```json
+ {
+   "text": "Complete transcript text",
+   "language": "ar",
+   "language_probability": 0.98,
+   "duration": 123.45,
+   "segments": [
+     {
+       "start": 0.0,
+       "end": 5.2,
+       "text": "Segment text",
+       "speaker": "SPEAKER_00",  // only present if diarization is enabled
+       "words": [
+         {
+           "start": 0.0,
+           "end": 0.5,
+           "word": "word",
+           "probability": 0.95
+         }
+       ]
+     }
+   ],
+   "speakers": [  // only present if diarization is enabled
+     {
+       "start": 0.0,
+       "end": 10.5,
+       "speaker": "SPEAKER_00"
+     }
+   ]
+ }
+ ```
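+
+ For example, once diarization is enabled, the segments can be folded into a
+ speaker-labelled transcript. `format_by_speaker` below is a hypothetical
+ helper built only on the fields shown above:
+
+ ```python
+ def format_by_speaker(details: dict) -> str:
+     """Render segments as 'SPEAKER: text' lines, merging consecutive turns."""
+     lines, last_speaker = [], None
+     for seg in details.get("segments", []):
+         speaker = seg.get("speaker", "UNKNOWN")
+         text = seg.get("text", "").strip()
+         if speaker == last_speaker and lines:
+             lines[-1] += " " + text  # same speaker keeps talking
+         else:
+             lines.append(f"{speaker}: {text}")
+             last_speaker = speaker
+     return "\n".join(lines)
+
+ print(format_by_speaker(details))
+ ```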
+
+ ## Error Handling
+
+ ```python
+ from gradio_client import Client
+
+ try:
+     client = Client("MahmoudElsamadony/vtt-with-diariazation")
+     result = client.predict(
+         audio_path="audio.mp3",
+         language="ar",
+         enable_diarization=False,
+         beam_size=5,
+         best_of=5,
+         api_name="/predict"
+     )
+     transcript, details = result
+     print(transcript)
+ except Exception as e:
+     print(f"Error: {e}")
+ ```
+
+ ## Tips
+
+ 1. **First run takes longer** - the Space needs to download its models (~1.2GB total)
+ 2. **Diarization requires an HF token** - make sure you've set `HF_TOKEN` in your Space secrets
+ 3. **Use an appropriate beam_size** - higher values (8-10) are more accurate but slower
+ 4. **Language auto-detection** - pass an empty string `""` as the language to auto-detect
+ 5. **Rate limits** - Hugging Face Spaces rate-limit free usage; see the retry sketch below
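+
+ For batch jobs, point 5 is the one that bites. `predict_with_retry` is a
+ hypothetical wrapper (not part of gradio_client) that retries transient
+ failures with exponential backoff:
+
+ ```python
+ import time
+
+ from gradio_client import Client
+
+ def predict_with_retry(client: Client, max_retries: int = 3, **kwargs):
+     """Retry client.predict with exponential backoff on transient failures."""
+     for attempt in range(max_retries):
+         try:
+             return client.predict(api_name="/predict", **kwargs)
+         except Exception as exc:
+             if attempt == max_retries - 1:
+                 raise  # out of retries; surface the error
+             wait = 2 ** attempt  # 1s, 2s, 4s, ...
+             print(f"Attempt {attempt + 1} failed ({exc}); retrying in {wait}s")
+             time.sleep(wait)
+ ```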
+
+ ## Local Testing
+
+ To test the API locally before deploying:
+
+ ```bash
+ # In your Space directory
+ python app.py
+ ```
+
+ Then access it via:
+
+ ```python
+ client = Client("http://127.0.0.1:7860")
+ ```
+
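+ Once connected, `view_api()` (part of gradio_client) prints the available
+ endpoints and their parameter order, which is handy for sanity-checking the
+ calls shown in this guide:
+
+ ```python
+ from gradio_client import Client
+
+ client = Client("http://127.0.0.1:7860")
+ client.view_api()  # prints endpoints, parameter names, and return types
+ ```
+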
+ ## Advanced: Async Usage
+
+ `client.submit()` returns a job handle, so you don't have to block on
+ `predict()`:
+
+ ```python
+ import asyncio
+
+ from gradio_client import Client
+
+ async def transcribe_async():
+     client = Client("MahmoudElsamadony/vtt-with-diariazation")
+
+     # Submit the job without blocking
+     job = client.submit(
+         audio_path="audio.mp3",
+         language="ar",
+         enable_diarization=False,
+         beam_size=5,
+         best_of=5,
+         api_name="/predict"
+     )
+
+     # Do other work while waiting...
+
+     # job.result() blocks, so run it in a worker thread to keep the event loop free
+     result = await asyncio.to_thread(job.result)
+     return result
+
+ result = asyncio.run(transcribe_async())
+ ```
+
+ ## Support
+
+ For issues with the API, check:
+ - Space logs: https://huggingface.co/spaces/MahmoudElsamadony/vtt-with-diariazation/logs
+ - Gradio Client docs: https://www.gradio.app/guides/getting-started-with-the-python-client
README.md CHANGED
@@ -1,6 +1,6 @@
  ---
- title: Vtt With Diariazation
- emoji: 🚀
+ title: VTT with Diarization
+ emoji: 🎙️
  colorFrom: purple
  colorTo: indigo
  sdk: gradio
@@ -10,4 +10,71 @@ pinned: false
  license: mit
  ---

+ # Voice-to-Text with Speaker Diarization
+
+ Powered by **faster-whisper** and **pyannote.audio**, running locally on this Space.
+
+ ## Features
+
+ - 🎯 **High-quality transcription** using faster-whisper (2-4x faster than OpenAI Whisper)
+ - 👥 **Speaker diarization** with pyannote.audio 3.1
+ - 🌍 **Multi-language support** (Arabic, English, French, German, Spanish, Russian, Chinese, etc.)
+ - ⚙️ **Configurable parameters** (beam size, best_of, model size)
+ - 🔧 **Optimized for Arabic customer service calls** with specialized prompts
+
+ ## Usage
+
+ 1. Upload an audio file (mp3, wav, m4a, flac, etc.)
+ 2. Select a language (or leave blank for auto-detect)
+ 3. Enable speaker diarization if needed (requires HF_TOKEN)
+ 4. Adjust quality parameters if desired
+ 5. Click "Transcribe"
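+
+ The Space can also be driven programmatically; a minimal sketch with
+ `gradio_client` (see `API_USAGE.md` in this repo for the full guide):
+
+ ```python
+ from gradio_client import Client
+
+ client = Client("MahmoudElsamadony/vtt-with-diariazation")
+ transcript, details = client.predict(
+     audio_path="audio.mp3",
+     language="ar",
+     enable_diarization=False,
+     beam_size=5,
+     best_of=5,
+     api_name="/predict",
+ )
+ print(transcript)
+ ```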
+
+ ## Configuration
+
+ Set these in Space Settings → Variables:
+
+ - `WHISPER_MODEL_SIZE`: Model size (`tiny`, `base`, `small`, `medium`, `large-v3`) - default: `small`
+ - `WHISPER_DEVICE`: Device (`cpu` or `cuda`) - default: `cpu`
+ - `WHISPER_COMPUTE_TYPE`: Compute type (`int8`, `int16`, `float32`) - default: `int8`
+ - `DEFAULT_LANGUAGE`: Default language code - default: `ar` (Arabic)
+ - `WHISPER_BEAM_SIZE`: Beam search size (1-10) - default: `5`
+ - `WHISPER_BEST_OF`: Best-of candidates (1-10) - default: `5`
+
+ ### Secrets (required for diarization)
+
+ - `HF_TOKEN`: Your Hugging Face token with access to `pyannote/speaker-diarization-3.1`
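+
+ These can also be set programmatically. A sketch using `huggingface_hub`,
+ assuming you own the Space and are logged in (check your installed version
+ for the exact method names):
+
+ ```python
+ from huggingface_hub import HfApi
+
+ api = HfApi()
+ repo = "MahmoudElsamadony/vtt-with-diariazation"
+
+ # Public variables (visible in Space settings)
+ api.add_space_variable(repo, "WHISPER_MODEL_SIZE", "small")
+ api.add_space_variable(repo, "WHISPER_DEVICE", "cpu")
+
+ # Secret (never exposed to visitors); placeholder value shown here
+ api.add_space_secret(repo, "HF_TOKEN", "YOUR_HF_TOKEN")
+ ```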
+
+ ## Model Information
+
+ ### Whisper Models
+
+ | Model | Size | RAM | Quality | Speed |
+ |-------|------|-----|---------|-------|
+ | tiny | 75MB | 1GB | ⭐⭐ | Very Fast |
+ | base | 150MB | 1GB | ⭐⭐⭐ | Fast |
+ | small | 500MB | 2GB | ⭐⭐⭐⭐ | Moderate |
+ | medium | 1.5GB | 5GB | ⭐⭐⭐⭐⭐ | Slow |
+ | large-v3 | 3GB | 10GB | ⭐⭐⭐⭐⭐⭐ | Very Slow |
+
+ ### First Run
+
+ - The first transcription downloads the selected Whisper model automatically
+ - Diarization downloads ~700MB on first use (cached afterward)
+ - Models are stored in the Space's persistent storage
+
+ ## Technical Details
+
+ - Uses the same model-loading approach as the Django backend
+ - faster-whisper automatically downloads models from Hugging Face
+ - The diarization pipeline is downloaded locally to avoid repeated API calls
+ - All processing happens on this Space (no external inference APIs)
+
+ ## Credits
+
+ - [faster-whisper](https://github.com/guillaumekln/faster-whisper) by Guillaume Klein
+ - [pyannote.audio](https://github.com/pyannote/pyannote-audio) by Hervé Bredin
+ - Original Django backend by IZI Techs
+
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
api_client.py ADDED
@@ -0,0 +1,128 @@
+ """
+ API client for the VTT with Diarization Hugging Face Space
+ Usage examples for calling the Space via the Gradio Client API
+ """
+
+ import os
+
+ from gradio_client import Client
+
+ # Your Hugging Face Space URL
+ SPACE_URL = "MahmoudElsamadony/vtt-with-diariazation"
+
+
+ def transcribe_audio(
+     audio_file_path: str,
+     language: str = "ar",
+     enable_diarization: bool = False,
+     beam_size: int = 5,
+     best_of: int = 5,
+ ):
+     """
+     Transcribe an audio file using the Hugging Face Space API
+
+     Args:
+         audio_file_path: Path to the audio file (mp3, wav, m4a, etc.)
+         language: Language code ("ar", "en", "fr", etc.) or "" for auto-detect
+         enable_diarization: Whether to enable speaker diarization
+         beam_size: Beam size for Whisper (1-10)
+         best_of: Best-of parameter for Whisper (1-10)
+
+     Returns:
+         tuple: (transcript_text, detailed_json)
+     """
+     # Initialize the client
+     client = Client(SPACE_URL)
+
+     # Call the transcribe endpoint
+     result = client.predict(
+         audio_path=audio_file_path,
+         language=language,
+         enable_diarization=enable_diarization,
+         beam_size=beam_size,
+         best_of=best_of,
+         api_name="/predict"
+     )
+
+     return result
+
+
+ def main():
+     """Example usage of the API client"""
+
+     # Example 1: Basic transcription (Arabic, no diarization)
+     print("=" * 60)
+     print("Example 1: Basic Arabic transcription")
+     print("=" * 60)
+
+     # Replace with your actual audio file path
+     audio_file = "path/to/your/audio.mp3"
+
+     if os.path.exists(audio_file):
+         transcript, details = transcribe_audio(
+             audio_file_path=audio_file,
+             language="ar",
+             enable_diarization=False,
+         )
+
+         print(f"\nTranscript:\n{transcript}\n")
+         print(f"Language: {details.get('language')}")
+         print(f"Duration: {details.get('duration')} seconds")
+         print(f"Number of segments: {len(details.get('segments', []))}")
+     else:
+         print(f"Audio file not found: {audio_file}")
+
+     print("\n" + "=" * 60)
+     print("Example 2: Transcription with speaker diarization")
+     print("=" * 60)
+
+     # Example 2: Transcription with diarization
+     if os.path.exists(audio_file):
+         transcript, details = transcribe_audio(
+             audio_file_path=audio_file,
+             language="ar",
+             enable_diarization=True,
+             beam_size=5,
+             best_of=5,
+         )
+
+         print(f"\nTranscript:\n{transcript}\n")
+
+         # Print speaker turns
+         if "speakers" in details:
+             print("\nSpeaker turns:")
+             for turn in details["speakers"][:5]:  # show the first 5 turns
+                 print(f"  {turn['speaker']}: {turn['start']:.2f}s - {turn['end']:.2f}s")
+
+         # Print segments with speakers
+         print("\nSegments with speakers:")
+         for segment in details.get("segments", [])[:3]:  # show the first 3 segments
+             speaker = segment.get("speaker", "Unknown")
+             text = segment.get("text", "")
+             start = segment.get("start", 0)
+             print(f"  [{start:.2f}s] {speaker}: {text}")
+     else:
+         print(f"Audio file not found: {audio_file}")
+
+     print("\n" + "=" * 60)
+     print("Example 3: Auto-detect language")
+     print("=" * 60)
+
+     # Example 3: Auto-detect language
+     if os.path.exists(audio_file):
+         transcript, details = transcribe_audio(
+             audio_file_path=audio_file,
+             language="",  # empty string for auto-detect
+             enable_diarization=False,
+         )
+
+         print(f"\nDetected language: {details.get('language')}")
+         print(f"Language probability: {details.get('language_probability', 0.0):.2%}")
+         print(f"\nTranscript:\n{transcript}")
+     else:
+         print(f"Audio file not found: {audio_file}")
+
+
+ if __name__ == "__main__":
+     # Install gradio_client first:
+     #   pip install gradio_client
+     main()
api_requirements.txt ADDED
@@ -0,0 +1 @@
+ gradio_client>=0.7.0
app.py CHANGED
@@ -14,16 +14,13 @@ load_dotenv()
  # ---------------------------------------------------------------------------
  # Configuration via environment variables (override inside HF Space settings)
  # ---------------------------------------------------------------------------
+ # Whisper model: use the same model names as the Django app (tiny, base, small, medium, large-v3)
+ # faster-whisper will download these automatically from Hugging Face on first run
  WHISPER_MODEL_SIZE = os.environ.get("WHISPER_MODEL_SIZE", "large-v3")
- WHISPER_REPO_ID = os.environ.get(
-     "WHISPER_REPO_ID", f"guillaumekln/faster-whisper-{WHISPER_MODEL_SIZE}"
- )
- WHISPER_LOCAL_DIR = os.environ.get(
-     "WHISPER_LOCAL_DIR", f"models/faster-whisper-{WHISPER_MODEL_SIZE}"
- )
  WHISPER_DEVICE = os.environ.get("WHISPER_DEVICE", "cpu")
  WHISPER_COMPUTE_TYPE = os.environ.get("WHISPER_COMPUTE_TYPE", "int8_float32")

+ # Diarization: download the pipeline locally to avoid repeated API calls
  DIARIZATION_REPO_ID = os.environ.get(
      "DIARIZATION_REPO_ID", "pyannote/speaker-diarization-3.1"
  )
@@ -67,15 +64,15 @@ def _ensure_snapshot(repo_id: str, local_dir: str, allow_patterns: Optional[List


  def _load_whisper_model() -> WhisperModel:
+     """Load the Faster Whisper model lazily (singleton) - same approach as the Django app"""
      global _whisper_model
      if _whisper_model is None:
-         local_dir = _ensure_snapshot(
-             WHISPER_REPO_ID,
-             WHISPER_LOCAL_DIR,
-             allow_patterns=["*.bin", "*.json", "*.txt", "*.onnx"],
-         )
+         print(f"Loading Faster Whisper model: {WHISPER_MODEL_SIZE} on {WHISPER_DEVICE} with compute_type={WHISPER_COMPUTE_TYPE}")
+
+         # Load the model by name - faster-whisper downloads it automatically from Hugging Face
          _whisper_model = WhisperModel(
-             model_size_or_path=local_dir,
+             WHISPER_MODEL_SIZE,  # model name: tiny, base, small, medium, large-v3
              device=WHISPER_DEVICE,
              compute_type=WHISPER_COMPUTE_TYPE,
          )
@@ -83,14 +80,21 @@ def _load_whisper_model() -> WhisperModel:


  def _load_diarization_pipeline() -> Optional[Pipeline]:
+     """Load the speaker diarization pipeline lazily (singleton)"""
      global _diarization_pipeline
      if _diarization_pipeline is None:
          if HF_TOKEN is None:
              raise gr.Error(
                  "HF_TOKEN secret is missing. Add it in Space settings to enable diarization."
              )
+
+         print("Loading diarization pipeline...")
+         # Download the pipeline locally to avoid repeated API calls
          local_dir = _ensure_snapshot(DIARIZATION_REPO_ID, DIARIZATION_LOCAL_DIR)
-         _diarization_pipeline = Pipeline.from_pretrained(local_dir)
+         _diarization_pipeline = Pipeline.from_pretrained(
+             local_dir,
+             use_auth_token=HF_TOKEN,  # note: newer pyannote versions use 'token' instead
+         )
          _diarization_pipeline.to(torch.device("cpu"))
      return _diarization_pipeline

@@ -110,15 +114,19 @@ def transcribe(

      model = _load_whisper_model()

+     # Transcription parameters matching the Django app configuration
      segments, info = model.transcribe(
          audio_path,
          language=language if language else None,
          beam_size=beam_size,
          best_of=best_of,
-         temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
+         temperature=[0.0, 0.2, 0.4, 0.6],  # matching the Django app
          vad_filter=True,
-         vad_parameters=dict(min_silence_duration_ms=300, speech_pad_ms=120),
-         condition_on_previous_text=False,
+         vad_parameters=dict(
+             min_silence_duration_ms=300,  # split sooner on short pauses
+             speech_pad_ms=120,
+         ),
+         condition_on_previous_text=False,  # key: stops cross-segment repetition
          initial_prompt=initial_prompt,
          compression_ratio_threshold=2.4,
          log_prob_threshold=-1.0,
@@ -242,11 +250,13 @@ def build_interface() -> gr.Blocks:
      )

      gr.Markdown(
-         """
+         f"""
          ## Tips
-         - First run downloads models (small ~500MB, diarization ~700MB). Please wait.
-         - Store your Hugging Face token in the Space Secrets as **HF_TOKEN**.
-         - Set `WHISPER_MODEL_SIZE` to `medium` or `large-v3` for higher accuracy (requires more RAM).
+         - **Current model**: `{WHISPER_MODEL_SIZE}` (the first run downloads it automatically)
+         - Diarization downloads ~700MB on first use (cached afterward)
+         - Store your Hugging Face token in Space Secrets as **HF_TOKEN** (required for diarization)
+         - Change `WHISPER_MODEL_SIZE` in Space Variables to `medium` or `large-v3` for higher accuracy
+         - Optimized for Arabic customer service calls with a specialized initial prompt
          """
      )