Streaming Keyword Spotting System

A real-time keyword spotting (KWS) system for detecting "你好真真" (Hello Zhen Zhen) using streaming inference with Zipformer and MLP verification.

Overview

This is a complete streaming KWS system featuring:

  • Two-stage detection pipeline: Fast Zipformer screening + MLP verification
  • ONNX models: All models in ONNX format for cross-platform deployment
  • Streaming inference: 100ms latency, real-time response
  • Quantized models: INT8 ONNX models for efficient inference
  • Multi-platform support: Works on Windows, Linux, macOS

Model Architecture

Audio Input (16kHz) 
    ↓
Audio Capture & Feature Extraction (MFCC)
    ↓
Streaming Buffer (with overlap)
    ↓
Zipformer KWS (Stage 1: Fast screening)
    ↓
[If keyword detected]
    ↓
MLP Verifier (Stage 2: Confirmation)
    ↓
Wake-up Event
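
The flow above can be read as a gate: the verifier only runs on chunks that the first stage flags. Below is a minimal sketch of that control flow, not the repository's implementation. `stage1_score` and `stage2_score` are hypothetical stand-ins for the Zipformer and MLP models, and the default thresholds simply mirror the documented `keywords_threshold` and `mlp_threshold` defaults.

```python
from typing import Callable, Optional, Sequence

Chunk = Sequence[float]


def two_stage_detect(
    chunk: Chunk,
    stage1_score: Callable[[Chunk], float],
    stage2_score: Callable[[Chunk], float],
    stage1_threshold: float = 0.25,
    stage2_threshold: float = 0.5,
) -> Optional[float]:
    """Return the verifier score if both stages fire, otherwise None."""
    # Stage 1: cheap screening, evaluated on every chunk.
    if stage1_score(chunk) < stage1_threshold:
        return None
    # Stage 2: evaluated only for candidates that passed stage 1.
    score = stage2_score(chunk)
    return score if score >= stage2_threshold else None
```

Because stage 2 is skipped for most chunks, its cost only matters when stage 1 has already raised a candidate.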

Performance

Metric                   Value
False Alarm Rate (FAR)   1.3% (Stage 2 with MLP)
Detection Latency        ~100ms
Model Size               ~4.2MB (Zipformer int8) + 12KB (MLP)
Memory Footprint         <50MB

Model Files

Zipformer V3 Models (Streaming KWS)

  • kws_finetune_v3/encoder.int8.onnx (4.03 MB) - Encoder module
  • kws_finetune_v3/decoder.int8.onnx (170 KB) - Decoder module
  • kws_finetune_v3/joiner.int8.onnx (63 KB) - Joiner module
  • kws_finetune_v3/tokens.txt - Vocabulary
  • kws_finetune_v3/keywords.txt - Keywords configuration

MLP Verifier

  • models/mlp_verifier.onnx (12 KB) - MLP verification model

Installation

# Install dependencies
pip install -r requirements.txt

# For HuggingFace integration
pip install huggingface_hub

# For ONNX inference
pip install onnxruntime
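
With `huggingface_hub` installed, the model files can be fetched programmatically. The sketch below uses `snapshot_download` and restricts the download to the model paths listed in this README. The helper name `download_models` is hypothetical, the repository id is left for you to supply, and a token is only needed if the repository is gated or private.

```python
from pathlib import Path
from typing import Optional

# Model paths as listed in this README.
MODEL_PATTERNS = ["kws_finetune_v3/*", "models/mlp_verifier.onnx"]


def download_models(repo_id: str, local_dir: str = ".", token: Optional[str] = None) -> Path:
    """Download only the model files into local_dir and return that directory."""
    # Imported lazily so this module can be imported without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        allow_patterns=MODEL_PATTERNS,
        token=token,  # access token string, if the repository requires one
    )
    return Path(path)
```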

Usage

Basic Usage

from src.pipeline.kws_stream import StreamingKWSPipeline
from src.audio.capture import AudioCapture
from src.utils.config import KWSConfig

# Create configuration
config = KWSConfig(
    encoder_path="kws_finetune_v3/encoder.int8.onnx",
    decoder_path="kws_finetune_v3/decoder.int8.onnx",
    joiner_path="kws_finetune_v3/joiner.int8.onnx",
    tokens_path="kws_finetune_v3/tokens.txt",
    keywords_file="kws_finetune_v3/keywords.txt",
    mlp_model_path="models/mlp_verifier.onnx",
    keywords=["你好真真"],
)

# Initialize the pipeline
pipeline = StreamingKWSPipeline(config)
pipeline.load()

# Capture audio and detect keywords
capture = AudioCapture(sample_rate=16000, chunk_duration_ms=100)
capture.start()

try:
    while True:
        audio_chunk = capture.read(timeout=1.0)  # None when no chunk is available
        if audio_chunk is None:
            continue
        detection = pipeline.process_chunk(audio_chunk)
        if detection:
            print(
                f"Wake-up detected! Keyword: {detection.keyword}, "
                f"Confidence: {detection.confidence:.2%}"
            )
except KeyboardInterrupt:
    pass  # Ctrl+C ends the loop without a traceback
finally:
    capture.stop()

Command Line Usage

# Run the main demo (interactive mode)
python main.py --model-dir ./kws_finetune_v3

# With custom thresholds
python main.py --model-dir ./kws_finetune_v3 \
               --stage1-threshold 0.5 \
               --stage2-threshold 0.7
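
For orientation, here is a sketch of how flags like these could be parsed with `argparse`. It is not the actual `main.py`. The defaults are an assumption that `--stage1-threshold` and `--stage2-threshold` map onto `keywords_threshold` (0.25) and `mlp_threshold` (0.5) from the API reference.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Streaming KWS demo (sketch)")
    parser.add_argument("--model-dir", required=True,
                        help="Directory containing the Zipformer ONNX files")
    # Assumed defaults, mirroring keywords_threshold and mlp_threshold.
    parser.add_argument("--stage1-threshold", type=float, default=0.25)
    parser.add_argument("--stage2-threshold", type=float, default=0.5)
    return parser


args = build_parser().parse_args(
    ["--model-dir", "./kws_finetune_v3", "--stage2-threshold", "0.7"]
)
print(args.model_dir, args.stage1_threshold, args.stage2_threshold)
```

`argparse` converts the dashes in flag names to underscores, which is why the values are read as `args.stage1_threshold` and `args.stage2_threshold`.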

API Reference

KWSConfig

Configuration dataclass for the KWS system.

Parameters:

  • encoder_path (str): Path to encoder ONNX model
  • decoder_path (str): Path to decoder ONNX model
  • joiner_path (str): Path to joiner ONNX model
  • tokens_path (str): Path to tokens.txt
  • keywords_file (str): Path to keywords.txt
  • mlp_model_path (str): Path to MLP verifier ONNX model
  • keywords (List[str]): List of keywords to detect (default: ["你好真真"])
  • keywords_threshold (float): Zipformer detection threshold (default: 0.25)
  • mlp_threshold (float): MLP verification threshold (default: 0.5)
  • mlp_enabled (bool): Enable MLP verification (default: True)
  • sample_rate (int): Audio sample rate (default: 16000)
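
The parameter list above maps naturally onto a dataclass. The following is an illustrative sketch under that assumption, named `KWSConfigSketch` to make clear it is not the class shipped in `src/utils/config.py`. The range check in `__post_init__` is an addition for illustration and assumes both thresholds are probabilities in [0, 1].

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KWSConfigSketch:
    # Required model and vocabulary paths.
    encoder_path: str
    decoder_path: str
    joiner_path: str
    tokens_path: str
    keywords_file: str
    mlp_model_path: str
    # Optional settings with the defaults documented above.
    keywords: List[str] = field(default_factory=lambda: ["你好真真"])
    keywords_threshold: float = 0.25
    mlp_threshold: float = 0.5
    mlp_enabled: bool = True
    sample_rate: int = 16000

    def __post_init__(self) -> None:
        # Assumed constraint: thresholds are probabilities.
        for name in ("keywords_threshold", "mlp_threshold"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
```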

StreamingKWSPipeline

The main inference pipeline class.

Constructor:

pipeline = StreamingKWSPipeline(config: KWSConfig)

Methods:

# Load all models
pipeline.load()

# Process an audio chunk (returns DetectionResult or None)
detection = pipeline.process_chunk(audio_chunk)
# Returns: DetectionResult(keyword, confidence, timestamp, verified, mlp_confidence) or None

# Reset pipeline state
pipeline.reset()

# Set detection callback
pipeline.set_on_detection(callback_function)

# Properties
pipeline.is_loaded  # bool: Check if models are loaded
pipeline.detection_count  # int: Number of detections
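
A callback registered with `set_on_detection` might look like the sketch below. It assumes the callback receives the same `DetectionResult` object that `process_chunk` returns. `DetectionResultSketch` is a stand-in with the five documented fields, and the field types are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DetectionResultSketch:
    keyword: str
    confidence: float
    timestamp: float
    verified: bool
    mlp_confidence: Optional[float] = None


events: List[str] = []


def on_detection(result: DetectionResultSketch) -> None:
    """Record verified detections only; ignore hits that stage 2 did not confirm."""
    if result.verified:
        events.append(
            f"{result.timestamp:.2f}s {result.keyword} ({result.confidence:.0%})"
        )


# With the real pipeline this would be registered as:
# pipeline.set_on_detection(on_detection)
```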

AudioCapture

Real-time microphone input handler.

capture = AudioCapture(sample_rate=16000, chunk_duration_ms=100)
capture.start()

# Read audio chunks
audio_chunk = capture.read(timeout=1.0)  # Returns np.ndarray or None

# Stop capture
capture.stop()

# List available devices
AudioCapture.list_devices()
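
At 16 kHz, a 100 ms chunk holds 16000 × 100 / 1000 = 1600 samples. The sketch below shows that arithmetic and a helper for slicing a pre-recorded buffer into chunks of the same size, which can be handy for testing without a microphone. Both function names are hypothetical and not part of `AudioCapture`.

```python
from typing import Iterator, Sequence


def samples_per_chunk(sample_rate: int = 16000, chunk_duration_ms: int = 100) -> int:
    """Number of samples in one chunk of the given duration."""
    return sample_rate * chunk_duration_ms // 1000


def iter_chunks(samples: Sequence[float], chunk_size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive full-size chunks; a trailing partial chunk is dropped."""
    for start in range(0, len(samples) - chunk_size + 1, chunk_size):
        yield samples[start:start + chunk_size]
```

Slicing works the same way on a NumPy array, so a loaded recording could be fed chunk by chunk to `pipeline.process_chunk`, assuming it accepts the same array type that `capture.read` returns.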

Model Details

Zipformer Encoder

  • Input: MFCC features (batch, time, 13 features)
  • Output: Encoded representations
  • Parameters: ~48M (quantized to ~4MB)

MLP Verifier

  • Input: Concatenated context (13 × 50 = 650 features)
  • Architecture: 650 → 256 → 128 → 64 → 1 (Sigmoid)
  • Output: Confidence score [0, 1]
  • Parameters: ~200K (quantized to 12KB)
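
The parameter count follows directly from the layer sizes. The sketch below assumes plain fully connected layers with one bias per output unit, which the source does not state explicitly.

```python
from typing import Sequence


def mlp_param_count(layer_sizes: Sequence[int]) -> int:
    """Weights plus biases for a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# 13 MFCC features x 50 context frames, then the hidden and output sizes above.
LAYERS = [13 * 50, 256, 128, 64, 1]
print(mlp_param_count(LAYERS))  # 207873
```

The first layer (650 → 256) accounts for 166,656 of those parameters, so the input context length dominates the verifier's size.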

File Structure

.
├── kws_finetune_v3/              # Zipformer models
│   ├── encoder.int8.onnx
│   ├── decoder.int8.onnx
│   ├── joiner.int8.onnx
│   ├── tokens.txt
│   └── keywords.txt
├── models/
│   └── mlp_verifier.onnx         # MLP verification model
├── src/                          # Python source code
│   ├── audio/                    # Audio processing
│   ├── models/                   # Model inference
│   ├── pipeline/                 # KWS pipeline
│   └── utils/                    # Configuration
├── main.py                       # Main entry point
└── requirements.txt              # Python dependencies

Requirements

  • Python 3.7+
  • ONNX Runtime 1.14+
  • NumPy 1.21+
  • For microphone input: PyAudio or sounddevice

Training & Fine-tuning

This model was fine-tuned on the WeNet speech corpus with a focus on detecting "你好真真". The two-stage architecture significantly reduces false alarms while maintaining low latency.

Performance Considerations

  • Latency: ~100ms end-to-end (50ms feature extraction + 50ms model inference)
  • CPU Usage: <5% on modern CPUs
  • Memory: ~50MB for models + buffers
  • Throughput: Can handle multiple concurrent streams
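
One way to check these figures on your own hardware is to measure a real-time factor: mean processing time per chunk divided by the chunk duration. Values below 1.0 mean processing keeps up with the incoming audio. This is a generic measurement sketch. `process` stands for any per-chunk callable, such as `pipeline.process_chunk`.

```python
import time
from typing import Any, Callable, Iterable


def real_time_factor(
    process: Callable[[Any], Any],
    chunks: Iterable[Any],
    chunk_duration_ms: float = 100.0,
) -> float:
    """Mean wall-clock time per chunk, divided by the chunk duration."""
    total = 0.0
    count = 0
    for chunk in chunks:
        start = time.perf_counter()
        process(chunk)
        total += time.perf_counter() - start
        count += 1
    if count == 0:
        raise ValueError("no chunks to measure")
    return (total / count) / (chunk_duration_ms / 1000.0)
```

A real-time factor of 0.5 on 100 ms chunks corresponds to about 50 ms of processing per chunk.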

Troubleshooting

No Detections

  • Check that the microphone is visible to sounddevice: python -c "import sounddevice; print(sounddevice.query_devices())"
  • Verify model files exist in correct paths
  • Try lowering the stage 1 threshold (--stage1-threshold) if the keyword is being missed (false negatives)
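
To rule out path problems quickly, a small check like the following lists any model files that are missing. It is a sketch that assumes the layout shown under File Structure, and `missing_model_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path
from typing import List

# Relative paths as listed under "Model Files".
EXPECTED_FILES = [
    "kws_finetune_v3/encoder.int8.onnx",
    "kws_finetune_v3/decoder.int8.onnx",
    "kws_finetune_v3/joiner.int8.onnx",
    "kws_finetune_v3/tokens.txt",
    "kws_finetune_v3/keywords.txt",
    "models/mlp_verifier.onnx",
]


def missing_model_files(root: str = ".") -> List[str]:
    """Return the expected files that do not exist under root."""
    base = Path(root)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]
```

An empty list means every expected file was found under the given root.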

High False Alarm Rate

  • Increase the stage 2 threshold (--stage2-threshold, default 0.5; try 0.7 or higher)
  • Verify keywords.txt matches your target phrase
  • Check audio quality and background noise levels
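
When tuning the stage 2 threshold, it helps to see how the false alarm rate moves on clips that do not contain the keyword. The sketch below assumes the rate is the fraction of such clips whose verifier score reaches the threshold. The scores are made-up illustration values, not measurements from this model.

```python
from typing import Sequence


def false_alarm_rate(negative_scores: Sequence[float], threshold: float) -> float:
    """Fraction of non-keyword clips whose score reaches the threshold."""
    if not negative_scores:
        raise ValueError("need at least one negative score")
    alarms = sum(1 for score in negative_scores if score >= threshold)
    return alarms / len(negative_scores)


# Hypothetical verifier scores for clips WITHOUT the keyword.
scores = [0.10, 0.62, 0.81, 0.30]
for threshold in (0.5, 0.7):
    print(threshold, false_alarm_rate(scores, threshold))
```

Raising the threshold can only keep the rate the same or lower it, at the cost of rejecting more genuine keywords.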

Performance Issues

  • Use INT8 models (provided) for better efficiency
  • Reduce frame size or buffer overlap if memory-constrained
  • Enable hardware acceleration if available (ONNX GPU providers)
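
ONNX Runtime selects hardware through execution providers. The sketch below orders the providers available on a machine by preference and always keeps the CPU provider as a fallback. The preference order is an assumption, and whether the pipeline in this repository exposes a provider option is not documented here, so the commented usage creates a session directly.

```python
from typing import List, Sequence

# Assumed preference order; adjust for your hardware.
PREFERRED = ("CUDAExecutionProvider", "CoreMLExecutionProvider", "CPUExecutionProvider")


def pick_providers(
    available: Sequence[str], preferred: Sequence[str] = PREFERRED
) -> List[str]:
    """Return the preferred providers that are available, ending with a CPU fallback."""
    chosen = [name for name in preferred if name in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


# Usage with ONNX Runtime installed:
# import onnxruntime as ort
# providers = pick_providers(ort.get_available_providers())
# session = ort.InferenceSession("models/mlp_verifier.onnx", providers=providers)
```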

License

Apache License 2.0

Citation

If you use this model, please cite:

@misc{streaming-kws-2024,
  title={Streaming Keyword Spotting with Zipformer and MLP Verification},
  author={KWS Project},
  year={2024}
}

Support

For issues, questions, or contributions, please visit the project repository.

Version History

  • v1.0 (2024-01): Initial release with Zipformer V3 and MLP verification