---
title: SingingSDS
emoji: 🎶
colorFrom: pink
colorTo: yellow
sdk: gradio
sdk_version: 5.4.0
app_file: app.py
pinned: false
---
# SingingSDS: Role-Playing Singing Spoken Dialogue System

A role-playing singing dialogue system that converts speech input into character-based singing output.

## Installation

### Requirements

- Python 3.11+
- CUDA (optional, for GPU acceleration)

### Install Dependencies
#### Option 1: Using Conda (Recommended)

```bash
conda create -n singingsds python=3.11
conda activate singingsds
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```

#### Option 2: Using pip only

```bash
pip install -r requirements.txt
```

#### Option 3: Using pip with virtual environment

```bash
python -m venv singingsds_env
# On Windows:
singingsds_env\Scripts\activate
# On macOS/Linux:
source singingsds_env/bin/activate
pip install -r requirements.txt
```
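After installing, it can help to confirm that the interpreter meets the stated requirements. The snippet below is a minimal sketch, not part of the repository: `check_environment` is a hypothetical helper that compares the Python version against 3.11 and, only if PyTorch is installed, asks `torch.cuda.is_available()` whether a GPU can be used.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 11)  # minimum version from the Requirements section


def check_environment(version_info=sys.version_info):
    """Return a small report on the Python version and optional CUDA support.

    Hypothetical helper for illustration; it is not shipped with SingingSDS.
    """
    report = {
        "python_ok": tuple(version_info[:2]) >= MIN_PYTHON,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        # Imported lazily so the check still runs when PyTorch is missing.
        import torch

        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {ok}")
```

CUDA is optional, so `cuda_available: False` only means inference would run on the CPU.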
## Usage

### Command Line Interface (CLI)

#### Example Usage

```bash
python cli.py \
  --query_audio tests/audio/hello.wav \
  --config_path config/cli/yaoyin_default.yaml \
  --output_audio outputs/yaoyin_hello.wav \
  --eval_results_csv outputs/yaoyin_test.csv
```

#### Inference-Only Mode

Run minimal inference without evaluation.

```bash
python cli.py \
  --query_audio tests/audio/hello.wav \
  --config_path config/cli/yaoyin_default_infer_only.yaml \
  --output_audio outputs/yaoyin_hello.wav
```
#### Parameter Description

- `--query_audio`: Input audio file path (required)
- `--config_path`: Configuration file path (default: `config/cli/yaoyin_default.yaml`)
- `--output_audio`: Output audio file path (required)
- `--eval_results_csv`: CSV file path for evaluation results, as used in the example above; omitted in inference-only mode
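To process several recordings, the CLI can be driven from a short script. The following is a sketch under assumptions: `build_cli_command` and `run_batch` are hypothetical helpers (not part of the repository) that only assemble the flags shown in the examples above and call `cli.py` once per `.wav` file via `subprocess.run`.

```python
import subprocess
import sys
from pathlib import Path

DEFAULT_CONFIG = "config/cli/yaoyin_default.yaml"  # default listed above


def build_cli_command(query_audio, output_audio,
                      config_path=DEFAULT_CONFIG, eval_results_csv=None):
    """Assemble the argument list for one cli.py invocation."""
    cmd = [
        sys.executable, "cli.py",
        "--query_audio", str(query_audio),
        "--config_path", str(config_path),
        "--output_audio", str(output_audio),
    ]
    if eval_results_csv is not None:
        # Only passed when evaluation results should be written.
        cmd += ["--eval_results_csv", str(eval_results_csv)]
    return cmd


def run_batch(input_dir, output_dir, config_path=DEFAULT_CONFIG):
    """Run cli.py once for every .wav file in input_dir (hypothetical helper)."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(input_dir).glob("*.wav")):
        cmd = build_cli_command(wav, output_dir / wav.name, config_path)
        subprocess.run(cmd, check=True)  # raises if cli.py exits non-zero
```

For example, `run_batch("tests/audio", "outputs")` run from the repository root would write one output file per input recording. Each call starts a new process, so the models are reloaded every time; that keeps the sketch simple at the cost of speed.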
### Web Interface (Gradio)

Start the web interface:

```bash
python app.py
```

Then open the address printed in the terminal in your browser to use the graphical interface.
## Configuration

### Character Configuration

The system supports multiple preset characters:

- **Yaoyin (遥音)**: Default timbre is `timbre2`
- **Limei (丽梅)**: Default timbre is `timbre1`
### Model Configuration

#### ASR Models

- `openai/whisper-large-v3-turbo`
- `openai/whisper-large-v3`
- `openai/whisper-medium`
- `openai/whisper-small`
- `funasr/paraformer-zh`

#### LLM Models

- `gemini-2.5-flash`
- `google/gemma-2-2b`
- `meta-llama/Llama-3.2-3B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`
- `Qwen/Qwen3-8B`
- `Qwen/Qwen3-30B-A3B`
- `MiniMaxAI/MiniMax-Text-01`

#### SVS Models

- `espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg` (Bilingual)
- `espnet/aceopencpop_svs_visinger2_40singer_pretrain` (Chinese)
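When editing a configuration, a quick check against the lists above can catch typos in model identifiers. This is a sketch under assumptions: the component names `asr`, `llm` and `svs` and the helper `find_unsupported` are hypothetical and do not necessarily match the keys used in the project's YAML files.

```python
# Model identifiers copied from the lists above.
SUPPORTED_MODELS = {
    "asr": {
        "openai/whisper-large-v3-turbo",
        "openai/whisper-large-v3",
        "openai/whisper-medium",
        "openai/whisper-small",
        "funasr/paraformer-zh",
    },
    "llm": {
        "gemini-2.5-flash",
        "google/gemma-2-2b",
        "meta-llama/Llama-3.2-3B-Instruct",
        "meta-llama/Llama-3.1-8B-Instruct",
        "Qwen/Qwen3-8B",
        "Qwen/Qwen3-30B-A3B",
        "MiniMaxAI/MiniMax-Text-01",
    },
    "svs": {
        "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg",
        "espnet/aceopencpop_svs_visinger2_40singer_pretrain",
    },
}


def find_unsupported(selection):
    """Return the {component: model_id} pairs that are not in the lists above.

    `selection` maps a component name (assumed: "asr", "llm", "svs") to a
    model identifier. Unknown component names are reported as unsupported.
    """
    return {
        component: model_id
        for component, model_id in selection.items()
        if model_id not in SUPPORTED_MODELS.get(component, set())
    }
```

An empty result means every selected model appears in the lists above. It does not guarantee that the model can actually be downloaded or loaded.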
## Project Structure

```
SingingSDS/
├── app.py, cli.py               # Entry points (demo app & CLI)
├── pipeline.py                  # Main orchestration pipeline
├── interface.py                 # Gradio interface
├── characters/                  # Virtual character definitions
├── modules/                     # Core modules
│   ├── asr/                     # ASR models (Whisper, Paraformer)
│   ├── llm/                     # LLMs (Gemini, LLaMA, etc.)
│   ├── svs/                     # Singing voice synthesis (ESPnet)
│   └── utils/                   # G2P, text normalization, resources
├── config/                      # YAML configuration files
├── data/                        # Dataset metadata and length info
├── data_handlers/               # Parsers for KiSing, Touhou, etc.
├── evaluation/                  # Evaluation metrics
├── resources/                   # Singer embeddings, phoneme dicts, MIDI
├── assets/                      # Character visuals
├── tests/                       # Unit tests and sample audios
└── README.md, requirements.txt
```
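The layout above suggests a three-stage flow: speech recognition, a language model that answers in character, and singing voice synthesis. The sketch below illustrates that flow with stand-in callables. `SketchPipeline` and its signatures are assumptions made for illustration and are not the actual API of `pipeline.py`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SketchPipeline:
    """Illustrative ASR -> LLM -> SVS chain (hypothetical, not pipeline.py)."""

    asr: Callable[[str], str]         # audio path -> transcript
    llm: Callable[[str, str], str]    # (character, transcript) -> reply text
    svs: Callable[[str, str], bytes]  # (reply text, timbre) -> audio bytes

    def run(self, query_audio: str, character: str, timbre: str) -> bytes:
        transcript = self.asr(query_audio)       # 1. recognize the spoken query
        reply = self.llm(character, transcript)  # 2. answer in character
        return self.svs(reply, timbre)           # 3. synthesize the sung reply


# Stand-in stages so the sketch runs without any model downloads.
pipeline = SketchPipeline(
    asr=lambda path: "hello",
    llm=lambda character, text: f"{character} sings: {text}",
    svs=lambda text, timbre: f"[{timbre}] {text}".encode("utf-8"),
)
audio = pipeline.run("tests/audio/hello.wav", character="Yaoyin", timbre="timbre2")
```

Passing the stages in as callables makes it easy to swap one model for another, or to test the flow with stubs as done here.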
## Contributing

Issues and Pull Requests are welcome!

## License