---
title: SingingSDS
emoji: 🎢
colorFrom: pink
colorTo: yellow
sdk: gradio
sdk_version: 5.4.0
app_file: app.py
pinned: false
---
# SingingSDS: Role-Playing Singing Spoken Dialogue System
A role-playing spoken dialogue system that takes speech input and responds with singing in the voice of a selected virtual character.
## Installation
### Requirements
- Python 3.11+
- CUDA (optional, for GPU acceleration)
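Before installing, it may help to verify the prerequisites (standard tooling, nothing project-specific):

```bash
python --version   # should report 3.11 or newer
nvidia-smi         # optional: confirms a CUDA-capable GPU is visible
```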
### Install Dependencies
#### Option 1: Using Conda (Recommended)
```bash
conda create -n singingsds python=3.11
conda activate singingsds
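# pytorch-cuda=11.8 selects the CUDA 11.8 build; CUDA is optional, so omit that package for a CPU-only install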
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
#### Option 2: Using pip only
```bash
pip install -r requirements.txt
```
#### Option 3: Using pip with virtual environment
```bash
python -m venv singingsds_env
# On Windows:
singingsds_env\Scripts\activate
# On macOS/Linux:
source singingsds_env/bin/activate
pip install -r requirements.txt
```
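After installation, a quick sanity check of the environment (a minimal sketch; assumes PyTorch was installed via one of the options above):

```bash
python -c "import torch; print(torch.__version__, 'CUDA:', torch.cuda.is_available())"
```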
## Usage
### Command Line Interface (CLI)
#### Example Usage
```bash
python cli.py \
--query_audio tests/audio/hello.wav \
--config_path config/cli/yaoyin_default.yaml \
--output_audio outputs/yaoyin_hello.wav \
--eval_results_csv outputs/yaoyin_test.csv
```
#### Inference-Only Mode
Run minimal inference without evaluation.
```bash
python cli.py \
--query_audio tests/audio/hello.wav \
--config_path config/cli/yaoyin_default_infer_only.yaml \
--output_audio outputs/yaoyin_hello.wav
```
#### Parameter Description
- `--query_audio`: Input audio file path (required)
- `--config_path`: Configuration file path (default: `config/cli/yaoyin_default.yaml`)
- `--output_audio`: Output audio file path (required)
- `--eval_results_csv`: CSV file path for evaluation results (used in the full example above; not needed in inference-only mode)
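The CLI processes one file per invocation, so batch runs can be scripted with a shell loop (a sketch; only `tests/audio/hello.wav` is confirmed above, and other files under `tests/audio/` are assumptions):

```bash
for f in tests/audio/*.wav; do
  python cli.py \
    --query_audio "$f" \
    --config_path config/cli/yaoyin_default_infer_only.yaml \
    --output_audio "outputs/yaoyin_$(basename "$f")"
done
```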
### Web Interface (Gradio)
Start the web interface:
```bash
python app.py
```
Then open the address printed in the terminal in your browser to use the graphical interface.
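To bind the server to a specific host or port, Gradio's standard environment variables can be set at launch (these are part of Gradio itself, not this project):

```bash
GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7861 python app.py
```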
## Configuration
### Character Configuration
The system supports multiple preset characters:
- **Yaoyin (遥音)**: Default timbre is `timbre2`
- **Limei (丽梅)**: Default timbre is `timbre1`
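Characters are switched through the config file passed to the CLI (a sketch; `limei_default.yaml` is a hypothetical name assumed to mirror the confirmed `yaoyin_default.yaml`):

```bash
# hypothetical config path, assuming Limei follows the Yaoyin naming scheme
python cli.py \
  --query_audio tests/audio/hello.wav \
  --config_path config/cli/limei_default.yaml \
  --output_audio outputs/limei_hello.wav
```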
### Model Configuration
#### ASR Models
- `openai/whisper-large-v3-turbo`
- `openai/whisper-large-v3`
- `openai/whisper-medium`
- `openai/whisper-small`
- `funasr/paraformer-zh`
#### LLM Models
- `gemini-2.5-flash`
- `google/gemma-2-2b`
- `meta-llama/Llama-3.2-3B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`
- `Qwen/Qwen3-8B`
- `Qwen/Qwen3-30B-A3B`
- `MiniMaxAI/MiniMax-Text-01`
#### SVS Models
- `espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg` (Bilingual)
- `espnet/aceopencpop_svs_visinger2_40singer_pretrain` (Chinese)
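The identifiers above look like Hugging Face Hub repo IDs, so checkpoints can be fetched ahead of time to avoid a download on first run (a sketch; assumes the `huggingface_hub` CLI is installed and each repo is publicly downloadable):

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg
huggingface-cli download openai/whisper-large-v3-turbo
```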
## Project Structure
```
SingingSDS/
β”œβ”€β”€ app.py, cli.py     # Entry points (demo app & CLI)
β”œβ”€β”€ pipeline.py        # Main orchestration pipeline
β”œβ”€β”€ interface.py       # Gradio interface
β”œβ”€β”€ characters/        # Virtual character definitions
β”œβ”€β”€ modules/           # Core modules
β”‚   β”œβ”€β”€ asr/           # ASR models (Whisper, Paraformer)
β”‚   β”œβ”€β”€ llm/           # LLMs (Gemini, LLaMA, etc.)
β”‚   β”œβ”€β”€ svs/           # Singing voice synthesis (ESPnet)
β”‚   └── utils/         # G2P, text normalization, resources
β”œβ”€β”€ config/            # YAML configuration files
β”œβ”€β”€ data/              # Dataset metadata and length info
β”œβ”€β”€ data_handlers/     # Parsers for KiSing, Touhou, etc.
β”œβ”€β”€ evaluation/        # Evaluation metrics
β”œβ”€β”€ resources/         # Singer embeddings, phoneme dicts, MIDI
β”œβ”€β”€ assets/            # Character visuals
β”œβ”€β”€ tests/             # Unit tests and sample audios
└── README.md, requirements.txt
```
## Contributing
Issues and Pull Requests are welcome!
## License