# RAPO++ Gradio App Documentation
## Overview
This Gradio app demonstrates **Stage 1 (RAPO)** of the RAPO++ framework: Retrieval-Augmented Prompt Optimization using knowledge graphs.
## What It Does
The app takes a simple text-to-video (T2V) generation prompt and enriches it with contextually relevant modifiers retrieved from a knowledge graph. This optimization helps create more detailed, coherent prompts that lead to better video generation results.
## How It Works
### Architecture
1. **Knowledge Graph Construction**
- Creates a graph with "places" as central nodes (e.g., forest, beach, city street)
- Places connect to relevant "actions/verbs" (e.g., "walking through", "exploring")
- Places also connect to "atmospheric descriptors" (e.g., "dense trees", "peaceful atmosphere")
2. **Retrieval Process**
- Input prompt is encoded using SentenceTransformer (all-MiniLM-L6-v2)
- Finds top-K most similar places via cosine similarity
- Samples connected actions and atmosphere descriptors from graph neighbors
- Filters modifiers by relevance to the input prompt
3. **Prompt Augmentation**
- Combines original prompt with retrieved modifiers
- Structures the output to maintain coherence
- Returns optimized prompt suitable for T2V generation
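The graph-construction step above can be sketched with networkx. The vocabulary and attribute names below are illustrative stand-ins, not the app's actual data:

```python
import networkx as nx

def create_demo_graph():
    """Toy knowledge graph: places link to actions and atmosphere words.

    The vocabulary here is a small illustrative subset; the real app
    uses a larger set of places and modifiers.
    """
    G = nx.Graph()
    graph_spec = {
        "forest": {
            "actions": ["walking through", "exploring"],
            "atmosphere": ["dense trees", "peaceful atmosphere"],
        },
        "beach": {
            "actions": ["strolling along", "swimming near"],
            "atmosphere": ["golden sand", "crashing waves"],
        },
    }
    for place, modifiers in graph_spec.items():
        G.add_node(place, kind="place")
        for verb in modifiers["actions"]:
            G.add_node(verb, kind="action")
            G.add_edge(place, verb)
        for desc in modifiers["atmosphere"]:
            G.add_node(desc, kind="atmosphere")
            G.add_edge(place, desc)
    return G

G = create_demo_graph()
print(sorted(G.neighbors("forest")))
# → ['dense trees', 'exploring', 'peaceful atmosphere', 'walking through']
```

Because places are plain nodes with typed neighbors, retrieval later reduces to ranking place nodes and reading off their adjacency lists.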
### Key Components
**app.py** (main application):
- `create_demo_graph()`: Builds a simplified knowledge graph with common T2V concepts
- `retrieve_and_augment_prompt()`: Core RAPO function, decorated with `@spaces.GPU`
- Gradio interface with examples and detailed documentation
**requirements.txt**:
- gradio 5.49.1 (pinned for compatibility)
- sentence-transformers + sentencepiece for embeddings
- torch 2.5.1 for tensor operations
- networkx for graph operations
- huggingface_hub for model downloads
## Model Downloads
The app automatically downloads the required model on first run:
- **all-MiniLM-L6-v2**: Sentence transformer for computing text embeddings (~80MB)
Downloaded to: `./ckpt/all-MiniLM-L6-v2/`
## Usage
### Basic Usage
1. Enter a simple prompt (e.g., "A person walking")
2. Click "Optimize Prompt"
3. View the enhanced prompt with contextual details
### Advanced Settings
- **Number of Places to Retrieve**: How many of the most similar places to retrieve from the graph (1-5, default: 2)
- **Modifiers per Place**: How many modifiers to sample from each place (1-10, default: 5)
### Example Prompts
Try these examples to see the optimization in action:
- "A person walking"
- "A car driving at night"
- "Someone cooking in a kitchen"
- "A group of people talking"
- "A bird flying"
- "Someone sitting and reading"
## Technical Details
### Graph Structure
**Places (central nodes):**
- forest, beach, city street, mountain, room, park, studio, kitchen, bridge, parking lot, desert, lake
**Edge Types:**
- Place β†’ Verb/Action edges (e.g., "forest" β†’ "walking through")
- Place β†’ Atmosphere edges (e.g., "forest" β†’ "dense trees")
**Retrieval Algorithm:**
1. Encode input prompt: `prompt_emb = model.encode(prompt)`
2. Compute similarities: `cosine_similarity(prompt_emb, place_embeddings)`
3. Select top-K places by similarity score
4. Sample neighbors from graph: `G.neighbors(place)`
5. Deduplicate and rank modifiers
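The five steps above can be sketched end to end. To keep the sketch runnable without a model download, a toy one-hot word embedding stands in for the app's SentenceTransformer encoder, and a plain dict stands in for the graph's neighbor lists; all names and the vocabulary are illustrative:

```python
import numpy as np

# Toy stand-ins (illustrative only): the real app encodes text with
# SentenceTransformer("all-MiniLM-L6-v2") and reads neighbors from a
# networkx graph via G.neighbors(place).
VOCAB = ["person", "walking", "forest", "beach", "trees", "sand", "waves"]
NEIGHBORS = {
    "forest": ["walking through", "dense trees", "peaceful atmosphere"],
    "beach": ["strolling along", "golden sand", "crashing waves"],
    "parking lot": ["driving into", "rows of cars"],
}

def toy_encode(text):
    """One-hot bag-of-words vector, L2-normalized so that a dot
    product between two vectors equals their cosine similarity."""
    words = set(text.lower().split())
    vec = np.array([float(w in words) for w in VOCAB])
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def optimize_prompt(prompt, top_k_places=2, modifiers_per_place=2):
    # Steps 1-3: encode the prompt, score every place by cosine
    # similarity, and keep the top-K places.
    prompt_emb = toy_encode(prompt)
    ranked = sorted(NEIGHBORS,
                    key=lambda p: float(prompt_emb @ toy_encode(p)),
                    reverse=True)
    chosen = ranked[:top_k_places]
    # Steps 4-5: gather neighbors per place, then deduplicate while
    # preserving retrieval order.
    modifiers = []
    for place in chosen:
        modifiers.extend(NEIGHBORS[place][:modifiers_per_place])
    modifiers = list(dict.fromkeys(modifiers))
    return f"{prompt}, {', '.join(modifiers)}"

print(optimize_prompt("a person walking through the forest", top_k_places=1))
```

With a real sentence embedding, semantically related places score well even without exact word overlap; the toy encoder only matches literal words, which is why the example prompt mentions "forest" directly.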
### ZeroGPU Integration
The `retrieve_and_augment_prompt()` function is decorated with `@spaces.GPU` to leverage the allocated ZeroGPU (NVIDIA H200, 70GB VRAM). This enables:
- Fast embedding computations
- Efficient cosine similarity calculations
- Scalability to larger graphs and batch processing
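A minimal sketch of the decorator pattern, with a fallback so it also runs outside a Space where the `spaces` package is unavailable. The function body is a placeholder, not the app's actual implementation:

```python
try:
    # On a HuggingFace Space, `spaces` must be imported before torch or
    # any other CUDA-touching library.
    import spaces
    gpu = spaces.GPU
except ImportError:
    # Off-Spaces fallback: a no-op decorator so the sketch still runs.
    def gpu(fn):
        return fn

@gpu
def retrieve_and_augment_prompt(prompt: str) -> str:
    # Placeholder body: the real function runs the embedding and
    # similarity computations described above on the allocated GPU.
    return prompt + " (optimized)"

print(retrieve_and_augment_prompt("A person walking"))
```

On a ZeroGPU Space the decorator allocates a GPU for the duration of each call and releases it afterward; off-Spaces the function simply runs on whatever device is available.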
### Differences from Full RAPO
This demo implements a **simplified version** of Stage 1 RAPO:
**Included:**
βœ… Knowledge graph with place-verb-scene relations
βœ… Embedding-based retrieval via SentenceTransformer
βœ… Cosine similarity ranking
βœ… Basic prompt augmentation
**Not Included (requires additional models/data):**
❌ Full relation graph from the paper (requires gigabytes of graph data)
❌ LLM-based sentence refactoring (Mistral-7B)
❌ Iterative merging with similarity thresholds
❌ Instruction-based rewriting (Llama3.1)
**Why This Approach:**
- Full RAPO requires 7B+ LLM downloads (~15GB+)
- Full graph data requires downloading preprocessed datasets
- This demo focuses on the **core concept**: retrieval-augmented prompt optimization
- Users can understand the methodology without waiting for large downloads
## Running the Full RAPO Pipeline
To run the complete Stage 1 RAPO from the paper:
```bash
cd examples/Stage1_RAPO
# 1. Retrieve modifiers from graph
sh retrieve_modifiers.sh
# 2. Word augmentation
sh word_augment.sh
# 3. Sentence refactoring
sh refactoring.sh
# 4. Instruction-based rewriting
sh rewrite_via_instruction.sh
```
**Requirements:**
- Download full relation graph data to `relation_graph/graph_data/`
- Download Mistral-7B-Instruct-v0.3 to `ckpt/`
- Download llama3_1_instruct_lora_rewrite to `ckpt/`
See README.md for full installation instructions.
## Integration with RAPO++ Stages
This demo showcases **Stage 1 only**. The complete RAPO++ framework includes:
**Stage 1 (RAPO)** - *Demonstrated Here*
- Retrieval-augmented prompt optimization via knowledge graphs
- Offline refinement using curated data
**Stage 2 (SSPO)**
- Self-supervised prompt optimization
- Iterative refinement based on generated video feedback
- Physics-aware consistency checks
- VLM-based alignment scoring
**Stage 3 (Fine-tuning)**
- LLM fine-tuning on collected feedback from Stage 2
- Model-specific prompt refiners
## Performance Notes
- First run: ~1-2 minutes (downloads model)
- Subsequent runs: <1 second per prompt
- GPU allocation: Automatic via ZeroGPU
- Memory usage: ~500MB (model + graph)
## Troubleshooting
**"No module named 'sentencepiece'"**
- Ensure `sentencepiece==0.2.1` is in requirements.txt
- sentence-transformers requires sentencepiece for tokenization
**"CUDA has been initialized before importing spaces"**
- The app correctly imports `spaces` FIRST before torch
- If you modify the code, maintain this import order
**Model download fails**
- Check internet connection
- HuggingFace Hub may be temporarily unavailable
- Model will retry on next run (cached after successful download)
## References
**Papers:**
- [RAPO (CVPR 2025)](https://arxiv.org/abs/2502.07516): The Devil is in the Prompts
- [RAPO++ (arXiv:2510.20206)](https://arxiv.org/abs/2510.20206): Cross-Stage Prompt Optimization
**Project Pages:**
- RAPO: https://whynothaha.github.io/Prompt_optimizer/RAPO.html
- RAPO++: https://whynothaha.github.io/RAPO_plus_github/
**Code:**
- GitHub: https://github.com/Vchitect/RAPO
## License
Please refer to the original repository for licensing information.
---
**Created for HuggingFace Spaces deployment**