
Streaming on Apple Silicon - Best Practices

#16 opened by ACloudCenter

Can anyone give some insight into running the streaming version on Apple silicon?

I'm running on an M2. I installed all the dependencies, but I see some mismatches in dependency versioning. The streaming example in the samples directory runs, but the output is very choppy, not smooth between chunks.

Before I dig in deep, I wanted to see if anyone has already gotten this running smoothly on Apple silicon. If so, would you mind sharing?

Neuphonic org

Hey, I just tried this on my M1 MacBook Air. I can get close to real time by shutting off all the browsers, Spotify, and so on, without fiddling with things too much. What's taking up more time for you at the moment, the speech LM or the codec decoder?

I think it may be the decoder, but I'm not sure; I'm digging in this morning and doing some troubleshooting. It does work, but again, it's pretty choppy. Are the ggml_metal_init messages below normal on Apple silicon when not using the GPU?

user@userHost neutts-air % python -m examples.basic_streaming_example \
    --input_text "My name is Dave, and um, I'm from London" \
    --ref_codes samples/dave.pt \
    --ref_text samples/dave.txt
/Users/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/perth/perth_net/init.py:1: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
from pkg_resources import resource_filename
Skipping import of cpp extensions due to incompatible torch version 2.8.0 for torchao version 0.14.0 Please see GitHub issue #2919 for more info
W1018 08:26:59.809000 73635 site-packages/torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
Loading phonemizer...
Loading backbone from: neuphonic/neutts-air-q8-gguf on cpu ...
llama_context: n_ctx_per_seq (2048) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_set_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_c4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
Loading codec from: neuphonic/neucodec-onnx-decoder on cpu ...
loaded PerthNet (Implicit) at step 250,000
Generating audio for input text: My name is Dave, and um, I'm from London
Streaming...
(12000,)
Wrote chunk to stream
[...the two lines above repeat for all 11 chunks of 12,000 samples...]
(12960,)
Wrote chunk to stream
Finished streaming.
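
(Aside for anyone reading the log: each (12000,) line is the numpy shape of one audio chunk, i.e. 12,000 samples. The diagnostics further down show 25 frames per chunk, a hop length of 480, and ~500 ms of audio per chunk, which implies a 24 kHz output rate. A quick sanity check of that arithmetic, with the sample rate as an inference rather than a documented value:)

```python
# Chunk-size arithmetic from the logs. The 24 kHz sample rate is inferred,
# not documented here: 12,000 samples per ~500 ms chunk implies
# 12_000 / 0.5 = 24_000 Hz.
frames_per_chunk, hop_length = 25, 480      # from the diagnostics below
samples_per_chunk = frames_per_chunk * hop_length
print(samples_per_chunk)                    # 12000, matching the (12000,) shapes
print(samples_per_chunk / 24_000)           # 0.5 s of audio per chunk
```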

I will close everything out and monitor CPU usage with a larger transcript. I'll also add some additional logging to see if I can narrow it down.

OK, good news! I am getting very close to real time now. I believe my issue was a combination of things: I created a fresh clone, made a new venv, and reinstalled all the requirements from scratch. I've also added some diagnostics, which I can include in a PR, to help users log the speech LM generation stats.
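
For context, the diagnostics in the log below are essentially a wall-clock timer around the streaming loop, comparing LM generation time per chunk against that chunk's audio duration. A minimal sketch of the idea, with hypothetical names (stream_with_stats and the chunk iterable are placeholders, not the actual PR code):

```python
import time

SAMPLE_RATE = 24_000  # assumption: inferred from the 12,000-sample ~500 ms chunks

def stream_with_stats(chunks):
    """Time an iterable of 1-D audio arrays and print per-chunk stats.

    `chunks` stands in for whatever streaming generator the example uses;
    the real diagnostics may hook in differently.
    """
    total_audio_s = 0.0
    t_start = time.perf_counter()
    t_prev = t_start
    for i, chunk in enumerate(chunks, start=1):
        lm_ms = (time.perf_counter() - t_prev) * 1000.0  # time to produce this chunk
        audio_ms = len(chunk) / SAMPLE_RATE * 1000.0     # chunk's audio duration
        pct_rt = lm_ms / audio_ms * 100.0                # >100% = slower than real time
        print(f"Chunk {i}: LM={lm_ms:6.1f}ms | Audio={audio_ms:.1f}ms | {pct_rt:.1f}% RT")
        total_audio_s += audio_ms / 1000.0
        # ... write `chunk` to the audio output stream here ...
        t_prev = time.perf_counter()
    wall_s = time.perf_counter() - t_start
    print(f"Streaming complete. Generated {total_audio_s:.2f}s of audio in {wall_s:.2f}s.")
    print(f"Real-Time Factor (RTF): {wall_s / total_audio_s:.2f}")
```

One caveat with this approach: the first chunk's timing also absorbs prompt prefill, which is presumably why the log reports chunk 1 as n/a.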

Loading codec from: neuphonic/neucodec-onnx-decoder on cpu ...
loaded PerthNet (Implicit) at step 250,000
Input text: 39 words / 193 chars
Reference codes shape: 372
Streaming frames per chunk: 25, hop length: 480
Each chunk ~500.0 ms of audio
Generating audio for input text: My name is Dave, and um, I'm from London. I'd like to discuss the idea of streaming audio from a small text to speech model. This text to speech model can be run on cpu only which is incredible
Streaming...

Chunk 1: LM= n/a │ Audio=500.0ms │ 0.0% RT
Chunk 2: LM= 271.8ms │ Audio=500.0ms │ 54.4% RT
Chunk 3: LM= 270.1ms │ Audio=500.0ms │ 54.0% RT
Chunk 4: LM= 362.5ms │ Audio=500.0ms │ 72.5% RT
Chunk 5: LM= 532.5ms │ Audio=500.0ms │ 106.5% RT
Chunk 6: LM= 503.6ms │ Audio=500.0ms │ 100.7% RT
Chunk 7: LM= 416.1ms │ Audio=500.0ms │ 83.2% RT
Chunk 8: LM= 531.6ms │ Audio=500.0ms │ 106.3% RT
Chunk 9: LM= 519.3ms │ Audio=500.0ms │ 103.9% RT
Chunk 10: LM= 503.9ms │ Audio=500.0ms │ 100.8% RT
Chunk 11: LM= 510.5ms │ Audio=500.0ms │ 102.1% RT
Chunk 12: LM= 511.2ms │ Audio=500.0ms │ 102.2% RT
Chunk 13: LM= 515.3ms │ Audio=500.0ms │ 103.1% RT
Chunk 14: LM= 417.9ms │ Audio=500.0ms │ 83.6% RT
Chunk 15: LM= 521.6ms │ Audio=500.0ms │ 104.3% RT
Chunk 16: LM= 510.3ms │ Audio=500.0ms │ 102.1% RT
Chunk 17: LM= 522.5ms │ Audio=500.0ms │ 104.5% RT
Chunk 18: LM= 500.5ms │ Audio=500.0ms │ 100.1% RT
Chunk 19: LM= 526.3ms │ Audio=500.0ms │ 105.3% RT
Chunk 20: LM= 507.8ms │ Audio=500.0ms │ 101.6% RT
Chunk 21: LM= 436.7ms │ Audio=500.0ms │ 87.3% RT
Chunk 22: LM= 501.1ms │ Audio=500.0ms │ 100.2% RT
Chunk 23: LM= 479.7ms │ Audio=500.0ms │ 95.9% RT

Streaming complete. Generated 11.50s of audio in 13.11s.
→ Average Speech LM time per chunk: 451.0ms
→ Real-Time Factor (RTF): 1.14
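
For anyone skimming the summary: RTF here is wall-clock generation time divided by audio duration, so values above 1.0 mean slower than real time. The reported numbers are consistent with that convention:

```python
# RTF as wall-clock time over audio duration, per the summary above.
wall_s, audio_s = 13.11, 11.50
print(f"RTF = {wall_s / audio_s:.2f}")  # 1.14, i.e. about 14% slower than real time
```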

SO COOL!
Nice Work!
