Streaming Keyword Spotting System

A real-time keyword spotting (KWS) system for detecting "你好真真" (Hello Zhen Zhen) using streaming inference with Zipformer and MLP verification.

Overview

This is a complete streaming KWS system featuring:

Two-stage detection pipeline: Fast Zipformer screening + MLP verification
ONNX models: All models in ONNX format for cross-platform deployment
Streaming inference: ~100ms detection latency for real-time response
Quantized models: INT8 ONNX models for efficient inference
Multi-platform support: Works on Windows, Linux, macOS

Model Architecture

Audio Input (16kHz) 
    ↓
Audio Capture & Feature Extraction (MFCC)
    ↓
Streaming Buffer (with overlap)
    ↓
Zipformer KWS (Stage 1: Fast screening)
    ↓
[If keyword detected]
    ↓
MLP Verifier (Stage 2: Confirmation)
    ↓
Wake-up Event
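
The gating between the two stages in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `two_stage_detect`, `stage1_score`, and `stage2_score` are hypothetical names, the scorers are passed in as plain callables, and the default thresholds are borrowed from the `KWSConfig` defaults listed under API Reference.

```python
from typing import Callable, Optional, Sequence


def two_stage_detect(
    features: Sequence[float],
    stage1_score: Callable[[Sequence[float]], float],
    stage2_score: Callable[[Sequence[float]], float],
    stage1_threshold: float = 0.25,
    stage2_threshold: float = 0.5,
) -> Optional[float]:
    """Return the Stage 2 confidence if both stages fire, otherwise None."""
    # Stage 1: cheap screening that runs on every chunk.
    if stage1_score(features) < stage1_threshold:
        return None
    # Stage 2: only evaluated for chunks that passed Stage 1.
    confidence = stage2_score(features)
    return confidence if confidence >= stage2_threshold else None


if __name__ == "__main__":
    # Stand-in scorers; real ones would wrap the Zipformer and MLP models.
    result = two_stage_detect([0.0], lambda f: 0.6, lambda f: 0.9)
    print("wake-up" if result is not None else "no detection")
```

Because Stage 2 is skipped whenever Stage 1 rejects a chunk, the verifier only adds cost on the chunks that look like the keyword.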

Performance

Metric	Value
False Alarm Rate (FAR)	1.3% (Stage 2 with MLP)
Detection Latency	~100ms
Model Size	~4.2MB (Zipformer int8) + 12KB (MLP)
Memory Footprint	<50MB

Model Files

Zipformer V3 Models (Streaming KWS)

kws_finetune_v3/encoder.int8.onnx (4.03 MB) - Encoder module
kws_finetune_v3/decoder.int8.onnx (170 KB) - Decoder module
kws_finetune_v3/joiner.int8.onnx (63 KB) - Joiner module
kws_finetune_v3/tokens.txt - Vocabulary
kws_finetune_v3/keywords.txt - Keywords configuration

MLP Verifier

models/mlp_verifier.onnx (12 KB) - MLP verification model

Installation

# Install dependencies
pip install -r requirements.txt

# For HuggingFace integration
pip install huggingface_hub

# For ONNX inference
pip install onnxruntime

Usage

Basic Usage

from src.pipeline.kws_stream import StreamingKWSPipeline
from src.audio.capture import AudioCapture
from src.utils.config import KWSConfig

# Create configuration
config = KWSConfig(
    encoder_path="kws_finetune_v3/encoder.int8.onnx",
    decoder_path="kws_finetune_v3/decoder.int8.onnx",
    joiner_path="kws_finetune_v3/joiner.int8.onnx",
    tokens_path="kws_finetune_v3/tokens.txt",
    keywords_file="kws_finetune_v3/keywords.txt",
    mlp_model_path="models/mlp_verifier.onnx",
    keywords=["你好真真"],
)

# Initialize the pipeline
pipeline = StreamingKWSPipeline(config)
pipeline.load()

# Capture audio and detect keywords
capture = AudioCapture(sample_rate=16000, chunk_duration_ms=100)
capture.start()

try:
    while True:
        audio_chunk = capture.read(timeout=1.0)
        if audio_chunk is not None:
            detection = pipeline.process_chunk(audio_chunk)
            if detection:
                print(f"Wake-up detected! Keyword: {detection.keyword}, Confidence: {detection.confidence:.2%}")
finally:
    capture.stop()

Command Line Usage

# Run the main demo (interactive mode)
python main.py --model-dir ./kws_finetune_v3

# With custom thresholds
python main.py --model-dir ./kws_finetune_v3 \
               --stage1-threshold 0.5 \
               --stage2-threshold 0.7
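
The command-line flags use stage names while `KWSConfig` uses `keywords_threshold` and `mlp_threshold`. The sketch below shows one plausible way the flags could map onto those fields. It is an assumption for illustration only: `build_parser` and `threshold_overrides` are hypothetical helpers, and the actual argument handling in `main.py` may differ.

```python
import argparse
from typing import Dict, List


def build_parser() -> argparse.ArgumentParser:
    """Parser for the three flags shown above (sketch only)."""
    parser = argparse.ArgumentParser(description="Streaming KWS demo (sketch)")
    parser.add_argument("--model-dir", required=True)
    # None means "keep whatever default the configuration provides".
    parser.add_argument("--stage1-threshold", type=float, default=None)
    parser.add_argument("--stage2-threshold", type=float, default=None)
    return parser


def threshold_overrides(argv: List[str]) -> Dict[str, float]:
    """Map CLI thresholds to the assumed KWSConfig field names."""
    args = build_parser().parse_args(argv)
    overrides: Dict[str, float] = {}
    if args.stage1_threshold is not None:
        overrides["keywords_threshold"] = args.stage1_threshold  # assumed mapping
    if args.stage2_threshold is not None:
        overrides["mlp_threshold"] = args.stage2_threshold  # assumed mapping
    return overrides


if __name__ == "__main__":
    print(threshold_overrides(
        ["--model-dir", "./kws_finetune_v3", "--stage2-threshold", "0.7"]
    ))
```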

API Reference

KWSConfig

Configuration dataclass for the KWS system.

Parameters:

encoder_path (str): Path to encoder ONNX model
decoder_path (str): Path to decoder ONNX model
joiner_path (str): Path to joiner ONNX model
tokens_path (str): Path to tokens.txt
keywords_file (str): Path to keywords.txt
mlp_model_path (str): Path to MLP verifier ONNX model
keywords (List[str]): List of keywords to detect (default: ["你好真真"])
keywords_threshold (float): Zipformer detection threshold (default: 0.25)
mlp_threshold (float): MLP verification threshold (default: 0.5)
mlp_enabled (bool): Enable MLP verification (default: True)
sample_rate (int): Audio sample rate (default: 16000)
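
As a rough illustration of the parameter list above, the following standalone dataclass mirrors the documented fields and defaults. The class name `KWSConfigSketch` and the checks in `__post_init__` are assumptions added for this example; the real `KWSConfig` in `src.utils.config` may be defined differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KWSConfigSketch:
    # Paths have no documented defaults, so they are required here.
    encoder_path: str
    decoder_path: str
    joiner_path: str
    tokens_path: str
    keywords_file: str
    mlp_model_path: str
    # Defaults copied from the parameter list above.
    keywords: List[str] = field(default_factory=lambda: ["你好真真"])
    keywords_threshold: float = 0.25
    mlp_threshold: float = 0.5
    mlp_enabled: bool = True
    sample_rate: int = 16000

    def __post_init__(self) -> None:
        # Validation is an addition for this sketch, not documented behavior.
        # The MLP emits a sigmoid score in [0, 1], so its threshold should too.
        if not 0.0 <= self.mlp_threshold <= 1.0:
            raise ValueError("mlp_threshold must be within [0, 1]")
        if self.sample_rate <= 0:
            raise ValueError("sample_rate must be positive")
```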

StreamingKWSPipeline

The main inference pipeline class.

Constructor:

pipeline = StreamingKWSPipeline(config: KWSConfig)

Methods:

# Load all models
pipeline.load()

# Process an audio chunk (returns DetectionResult or None)
detection = pipeline.process_chunk(audio_chunk)
# Returns: DetectionResult(keyword, confidence, timestamp, verified, mlp_confidence) or None

# Reset pipeline state
pipeline.reset()

# Set detection callback
pipeline.set_on_detection(callback_function)

# Properties
pipeline.is_loaded  # bool: Check if models are loaded
pipeline.detection_count  # int: Number of detections
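
A detection callback might look like the sketch below. It assumes the callback receives a single result object carrying the fields listed for `DetectionResult`; that signature is not confirmed by this document. `DetectionResultSketch` and `VerifiedDetectionLog` are hypothetical stand-ins so the example runs without the project code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DetectionResultSketch:
    """Stand-in with the fields documented for DetectionResult."""
    keyword: str
    confidence: float
    timestamp: float
    verified: bool
    mlp_confidence: Optional[float]


class VerifiedDetectionLog:
    """Callable that records only detections confirmed by Stage 2."""

    def __init__(self) -> None:
        self.messages: List[str] = []

    def __call__(self, detection: DetectionResultSketch) -> None:
        if not detection.verified:
            return  # ignore Stage 1 hits that the MLP did not confirm
        self.messages.append(
            f"{detection.timestamp:.2f}s {detection.keyword} "
            f"({detection.confidence:.2%})"
        )


if __name__ == "__main__":
    log = VerifiedDetectionLog()
    log(DetectionResultSketch("你好真真", 0.93, 1.5, True, 0.88))
    print(log.messages)
    # With the real pipeline this would presumably be registered as:
    # pipeline.set_on_detection(log)
```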

AudioCapture

Real-time microphone input handler.

capture = AudioCapture(sample_rate=16000, chunk_duration_ms=100)
capture.start()

# Read audio chunks
audio_chunk = capture.read(timeout=1.0)  # Returns np.ndarray or None

# Stop capture
capture.stop()

# List available devices
AudioCapture.list_devices()
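
With the settings shown above, each chunk holds 16000 × 100 / 1000 = 1600 samples. For offline testing without a microphone, a recorded signal can be cut into chunks of the same size. The helpers below are a hypothetical sketch (`samples_per_chunk` and `iter_chunks` are not part of the project) and use plain Python sequences rather than `np.ndarray` to stay dependency-free.

```python
from typing import Iterator, List, Sequence


def samples_per_chunk(sample_rate: int = 16000, chunk_duration_ms: int = 100) -> int:
    """Number of samples in one chunk of the given duration."""
    return sample_rate * chunk_duration_ms // 1000


def iter_chunks(
    samples: Sequence[float],
    sample_rate: int = 16000,
    chunk_duration_ms: int = 100,
) -> Iterator[List[float]]:
    """Yield consecutive full-size chunks; a trailing partial chunk is dropped."""
    size = samples_per_chunk(sample_rate, chunk_duration_ms)
    for start in range(0, len(samples) - size + 1, size):
        yield list(samples[start:start + size])


if __name__ == "__main__":
    silence = [0.0] * 4000  # 250 ms of audio at 16 kHz
    chunks = list(iter_chunks(silence))
    print(samples_per_chunk(), len(chunks))  # 1600 2
```

Each yielded chunk could then be converted to an array and passed to `pipeline.process_chunk` in place of `capture.read`.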

Model Details

Zipformer Encoder

Input: MFCC features (batch, time, 13 features)
Output: Encoded representations
Parameters: ~48M (quantized to ~4MB)

MLP Verifier

Input: Concatenated context (13 × 50 = 650 features)
Architecture: 650 → 256 → 128 → 64 → 1 (Sigmoid)
Output: Confidence score [0, 1]
Parameters: ~200K (quantized to 12KB)
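
The ~200K parameter figure can be checked from the layer sizes. The sketch below assumes plain fully connected layers with bias terms, which the description suggests but does not state; `dense_param_count` is a hypothetical helper.

```python
from typing import Sequence


def dense_param_count(layer_sizes: Sequence[int], bias: bool = True) -> int:
    """Count weights (and optionally biases) of a stack of dense layers."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out  # weight matrix
        if bias:
            total += fan_out  # one bias per output unit
    return total


# 13 MFCC features x 50 context frames, then the hidden sizes from above.
MLP_LAYERS = [13 * 50, 256, 128, 64, 1]

if __name__ == "__main__":
    print(dense_param_count(MLP_LAYERS))  # 207873
```

Most of the weights sit in the first layer (650 × 256 = 166,400), so the context length of 50 frames largely determines the model size.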

File Structure

.
├── kws_finetune_v3/              # Zipformer models
│   ├── encoder.int8.onnx
│   ├── decoder.int8.onnx
│   ├── joiner.int8.onnx
│   ├── tokens.txt
│   └── keywords.txt
├── models/
│   └── mlp_verifier.onnx         # MLP verification model
├── src/                          # Python source code
│   ├── audio/                    # Audio processing
│   ├── models/                   # Model inference
│   ├── pipeline/                 # KWS pipeline
│   └── utils/                    # Configuration
├── main.py                       # Main entry point
└── requirements.txt              # Python dependencies

Requirements

Python 3.7+
ONNX Runtime 1.14+
NumPy 1.21+
For microphone input: PyAudio or sounddevice

Training & Fine-tuning

This model was fine-tuned on the WeNet speech corpus with a focus on detecting "你好真真". The two-stage architecture reduces false alarms, since Stage 1 hits must also pass MLP verification, while keeping latency low.

Performance Considerations

Latency: ~100ms end-to-end (50ms feature extraction + 50ms model inference)
CPU Usage: <5% on modern CPUs
Memory: ~50MB for models + buffers
Throughput: Can handle multiple concurrent streams

Troubleshooting

No Detections

Check that the microphone is working: python -c "import sounddevice; print(sounddevice.query_devices())"
Verify model files exist in correct paths
Try lowering stage1_threshold if false negatives occur (a lower threshold lets Stage 1 trigger more readily)
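
To check the model paths quickly, a small script can report which of the files listed under Model Files are missing. This is a minimal sketch: `missing_model_files` is a hypothetical helper, and it assumes the directory layout shown under File Structure.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the Model Files section.
EXPECTED_FILES = [
    "kws_finetune_v3/encoder.int8.onnx",
    "kws_finetune_v3/decoder.int8.onnx",
    "kws_finetune_v3/joiner.int8.onnx",
    "kws_finetune_v3/tokens.txt",
    "kws_finetune_v3/keywords.txt",
    "models/mlp_verifier.onnx",
]


def missing_model_files(root: str = ".") -> List[str]:
    """Return the expected files that do not exist under root."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


if __name__ == "__main__":
    missing = missing_model_files(".")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All model files found.")
```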

High False Alarm Rate

Increase stage2_threshold (default 0.5, try 0.7+)
Verify keywords.txt matches your target phrase
Check audio quality and background noise levels

Performance Issues

Use INT8 models (provided) for better efficiency
Reduce frame size or buffer overlap if memory-constrained
Enable hardware acceleration if available (ONNX Runtime GPU execution providers)
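
Hardware acceleration in ONNX Runtime is selected through execution providers. The sketch below picks providers from whatever the installed build reports, falling back to the CPU. `pick_providers` and the preference order are assumptions for illustration; whether the pipeline exposes a way to pass providers through is not documented here.

```python
from typing import List, Sequence

# Preference order is an assumption; adjust it for the target hardware.
PREFERRED = (
    "CUDAExecutionProvider",
    "CoreMLExecutionProvider",
    "CPUExecutionProvider",
)


def pick_providers(available: Sequence[str]) -> List[str]:
    """Return preferred providers that are available, CPU as a last resort."""
    chosen = [name for name in PREFERRED if name in available]
    return chosen or ["CPUExecutionProvider"]


if __name__ == "__main__":
    print(pick_providers(["CPUExecutionProvider"]))
    # With onnxruntime installed, the list would come from the runtime:
    # import onnxruntime as ort
    # providers = pick_providers(ort.get_available_providers())
    # session = ort.InferenceSession(
    #     "kws_finetune_v3/encoder.int8.onnx", providers=providers
    # )
```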

License

Apache License 2.0

Citation

If you use this model, please cite:

@misc{streaming-kws-2024,
  title={Streaming Keyword Spotting with Zipformer and MLP Verification},
  author={KWS Project},
  year={2024}
}

Support

For issues, questions, or contributions, please visit the project repository.

Version History

v1.0 (2024-01): Initial release with Zipformer V3 and MLP verification
