RAPO++ Gradio App Documentation
Overview
This Gradio app demonstrates Stage 1 (RAPO) of the RAPO++ framework: Retrieval-Augmented Prompt Optimization using knowledge graphs.
What It Does
The app takes a simple text-to-video (T2V) generation prompt and enriches it with contextually relevant modifiers retrieved from a knowledge graph. This optimization helps create more detailed, coherent prompts that lead to better video generation results.
How It Works
Architecture
Knowledge Graph Construction
- Creates a graph with "places" as central nodes (e.g., forest, beach, city street)
- Places connect to relevant "actions/verbs" (e.g., "walking through", "exploring")
- Places also connect to "atmospheric descriptors" (e.g., "dense trees", "peaceful atmosphere")
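The structure above can be sketched with plain dictionaries as a dependency-free stand-in for the networkx graph the app builds. The helper name `build_toy_graph` and the beach entries are illustrative assumptions, not the actual contents of `create_demo_graph()`.

```python
from collections import defaultdict


def build_toy_graph(place_data):
    """Build an undirected adjacency map plus a 'kind' label for every node."""
    adjacency = defaultdict(set)
    kinds = {}
    for place, relations in place_data.items():
        kinds[place] = "place"
        for kind in ("action", "atmosphere"):
            for modifier in relations.get(kind, []):
                kinds[modifier] = kind
                # Store the edge in both directions (undirected graph)
                adjacency[place].add(modifier)
                adjacency[modifier].add(place)
    return adjacency, kinds


# The forest entries come from the examples above;
# the beach entries are made up for this sketch.
PLACE_DATA = {
    "forest": {
        "action": ["walking through", "exploring"],
        "atmosphere": ["dense trees", "peaceful atmosphere"],
    },
    "beach": {
        "action": ["walking along"],
        "atmosphere": ["gentle waves"],
    },
}

adjacency, kinds = build_toy_graph(PLACE_DATA)
forest_actions = sorted(n for n in adjacency["forest"] if kinds[n] == "action")
print(forest_actions)
```

Keeping a `kind` label per node lets the retrieval step sample actions and atmosphere descriptors separately from the same neighbor set.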
Retrieval Process
- Input prompt is encoded using SentenceTransformer (all-MiniLM-L6-v2)
- Finds top-K most similar places via cosine similarity
- Samples connected actions and atmosphere descriptors from graph neighbors
- Filters modifiers by relevance to the input prompt
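The similarity step above can be illustrated in pure Python. In the app the vectors would come from the SentenceTransformer model; here, toy 3-dimensional vectors stand in for real embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_places(prompt_emb, place_embs, k=2):
    """Return the k (place, score) pairs with the highest similarity."""
    scored = [
        (place, cosine_similarity(prompt_emb, emb))
        for place, emb in place_embs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors, not real model output
place_embs = {
    "forest": [0.9, 0.1, 0.0],
    "beach": [0.1, 0.9, 0.0],
    "kitchen": [0.0, 0.1, 0.9],
}
prompt_emb = [0.8, 0.3, 0.1]

top = top_k_places(prompt_emb, place_embs, k=2)
print([place for place, _ in top])
```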
Prompt Augmentation
- Combines original prompt with retrieved modifiers
- Structures the output to maintain coherence
- Returns optimized prompt suitable for T2V generation
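One simple way to combine a prompt with retrieved modifiers is a fixed template. The template and the function name `augment_prompt` below are assumptions made for illustration; the app's actual wording may differ.

```python
def augment_prompt(prompt, actions, atmospheres):
    """Append retrieved modifiers to the prompt using a fixed template."""
    base = prompt.strip().rstrip(".")
    parts = [base]
    if actions:
        parts.append(", ".join(actions))
    if atmospheres:
        parts.append("with " + ", ".join(atmospheres))
    return ", ".join(parts) + "."


optimized = augment_prompt(
    "A person walking.",
    actions=["exploring"],
    atmospheres=["dense trees", "peaceful atmosphere"],
)
print(optimized)
```

When no modifiers are retrieved, the function returns the original prompt with normalized punctuation, so the output is never worse than the input.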
Key Components
app.py (main application):
- create_demo_graph(): Builds a simplified knowledge graph with common T2V concepts
- retrieve_and_augment_prompt(): Core RAPO function decorated with @spaces.GPU
- Gradio interface with examples and detailed documentation
requirements.txt:
- gradio 5.49.1 (pinned for compatibility)
- sentence-transformers + sentencepiece for embeddings
- torch 2.5.1 for tensor operations
- networkx for graph operations
- huggingface_hub for model downloads
Model Downloads
The app automatically downloads the required model on first run:
- all-MiniLM-L6-v2: Sentence transformer for computing text embeddings (~80MB)
Downloaded to: ./ckpt/all-MiniLM-L6-v2/
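A download-if-missing step could look like the sketch below. It assumes the model is fetched with `snapshot_download` from `huggingface_hub` under the repo id `sentence-transformers/all-MiniLM-L6-v2`; the helper name `ensure_model` and its `download` parameter are hypothetical, and the app may fetch the model differently.

```python
from pathlib import Path


def ensure_model(
    local_dir="./ckpt/all-MiniLM-L6-v2",
    repo_id="sentence-transformers/all-MiniLM-L6-v2",
    download=None,
):
    """Return the model directory, downloading it only if it is missing or empty."""
    target = Path(local_dir)
    if target.is_dir() and any(target.iterdir()):
        return target  # already cached from an earlier run
    if download is None:
        # Imported lazily so the cached path works without network access
        from huggingface_hub import snapshot_download
        download = snapshot_download
    target.mkdir(parents=True, exist_ok=True)
    download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `ensure_model()` before constructing the SentenceTransformer keeps the first-run download and the cached case on the same code path.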
Usage
Basic Usage
- Enter a simple prompt (e.g., "A person walking")
- Click "Optimize Prompt"
- View the enhanced prompt with contextual details
Advanced Settings
- Number of Places to Retrieve: How many related places to search (1-5, default: 2)
- Modifiers per Place: How many modifiers to sample from each place (1-10, default: 5)
Example Prompts
Try these examples to see the optimization in action:
- "A person walking"
- "A car driving at night"
- "Someone cooking in a kitchen"
- "A group of people talking"
- "A bird flying"
- "Someone sitting and reading"
Technical Details
Graph Structure
Places (central nodes):
- forest, beach, city street, mountain, room, park, studio, kitchen, bridge, parking lot, desert, lake
Edge Types:
- Place → Verb/Action edges (e.g., "forest" → "walking through")
- Place → Atmosphere edges (e.g., "forest" → "dense trees")
Retrieval Algorithm:
- Encode input prompt: prompt_emb = model.encode(prompt)
- Compute similarities: cosine_similarity(prompt_emb, place_embeddings)
- Select top-K places by similarity score
- Sample neighbors from graph: G.neighbors(place)
- Deduplicate and rank modifiers
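The last two steps, sampling neighbors and then deduplicating and ranking, can be sketched as follows. The graph is a plain dictionary standing in for `G.neighbors(place)`, the park entries are invented for the example, and the optional `score` callable is a hypothetical hook, since the ranking criterion is not specified here.

```python
import random


def sample_modifiers(graph, places, per_place=5, seed=None, score=None):
    """Sample up to per_place neighbors per place, dropping duplicates."""
    rng = random.Random(seed)
    seen = set()
    modifiers = []
    for place in places:
        neighbors = sorted(graph.get(place, []))
        picked = rng.sample(neighbors, min(per_place, len(neighbors)))
        for modifier in picked:
            key = modifier.lower()
            if key not in seen:  # skip modifiers shared between places
                seen.add(key)
                modifiers.append(modifier)
    if score is not None:
        # Highest-scoring modifiers first
        modifiers.sort(key=score, reverse=True)
    return modifiers


toy_graph = {
    "forest": ["walking through", "dense trees", "exploring"],
    "park": ["walking through", "green lawns"],
}

# Phrase length is used as a placeholder score purely for demonstration
ranked = sample_modifiers(toy_graph, ["forest", "park"], per_place=5, seed=0, score=len)
print(ranked)
```

In practice the score would more plausibly be the cosine similarity between each modifier and the input prompt.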
ZeroGPU Integration
The retrieve_and_augment_prompt() function is decorated with @spaces.GPU to leverage the allocated ZeroGPU (NVIDIA H200, 70GB VRAM). This enables:
- Fast embedding computations
- Efficient cosine similarity calculations
- Scalability to larger graphs and batch processing
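A common pattern for code that should run both on ZeroGPU and on a local machine is to fall back to a no-op decorator when the `spaces` package is unavailable. This is a sketch under that assumption: the parameter names and the placeholder body are hypothetical and do not reproduce the app's function.

```python
try:
    # On Spaces, import this before torch or any other CUDA-touching library
    import spaces
    gpu = spaces.GPU
except ImportError:
    def gpu(fn):
        """No-op stand-in so the same code runs without the spaces package."""
        return fn


@gpu
def retrieve_and_augment_prompt(prompt, top_k=2, modifiers_per_place=5):
    # Placeholder body: the real function encodes the prompt, retrieves
    # places, and samples modifiers while the GPU is attached.
    return f"{prompt} (top_k={top_k}, modifiers_per_place={modifiers_per_place})"


print(retrieve_and_augment_prompt("A bird flying"))
```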
Differences from Full RAPO
This demo implements a simplified version of Stage 1 RAPO:
Included:
- Knowledge graph with place-verb-scene relations
- Embedding-based retrieval via SentenceTransformer
- Cosine similarity ranking
- Basic prompt augmentation
Not Included (requires additional models/data):
- Full relation graph from paper (requires ~GB of graph data)
- LLM-based sentence refactoring (Mistral-7B)
- Iterative merging with similarity thresholds
- Instruction-based rewriting (Llama3.1)
Why This Approach:
- Full RAPO requires 7B+ LLM downloads (~15GB+)
- Full graph data requires downloading preprocessed datasets
- This demo focuses on the core concept: retrieval-augmented prompt optimization
- Users can understand the methodology without waiting for large downloads
Running the Full RAPO Pipeline
To run the complete Stage 1 RAPO from the paper:
cd examples/Stage1_RAPO
# 1. Retrieve modifiers from graph
sh retrieve_modifiers.sh
# 2. Word augmentation
sh word_augment.sh
# 3. Sentence refactoring
sh refactoring.sh
# 4. Instruction-based rewriting
sh rewrite_via_instruction.sh
Requirements:
- Download full relation graph data to relation_graph/graph_data/
- Download Mistral-7B-Instruct-v0.3 to ckpt/
- Download llama3_1_instruct_lora_rewrite to ckpt/
See README.md for full installation instructions.
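Before launching the scripts, a small preflight check can report which assets are still missing. The sketch assumes each checkpoint lives in a subdirectory of ckpt/ named after the model; that layout and the helper name `missing_assets` are assumptions, so adjust the paths to match the repository README.

```python
from pathlib import Path

# Assumed layout: one subdirectory per checkpoint under ckpt/
REQUIRED_PATHS = [
    "relation_graph/graph_data",
    "ckpt/Mistral-7B-Instruct-v0.3",
    "ckpt/llama3_1_instruct_lora_rewrite",
]


def missing_assets(repo_root="."):
    """Return the required paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in REQUIRED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_assets()
    if missing:
        print("Missing assets:", ", ".join(missing))
    else:
        print("All required assets found.")
```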
Integration with RAPO++ Stages
This demo showcases Stage 1 only. The complete RAPO++ framework includes:
Stage 1 (RAPO) - Demonstrated Here
- Retrieval-augmented prompt optimization via knowledge graphs
- Offline refinement using curated data
Stage 2 (SSPO)
- Self-supervised prompt optimization
- Iterative refinement based on generated video feedback
- Physics-aware consistency checks
- VLM-based alignment scoring
Stage 3 (Fine-tuning)
- LLM fine-tuning on collected feedback from Stage 2
- Model-specific prompt refiners
Performance Notes
- First run: ~1-2 minutes (downloads model)
- Subsequent runs: <1 second per prompt
- GPU allocation: Automatic via ZeroGPU
- Memory usage: ~500MB (model + graph)
Troubleshooting
"No module named 'sentencepiece'"
- Ensure sentencepiece==0.2.1 is in requirements.txt
- sentence-transformers requires sentencepiece for tokenization
"CUDA has been initialized before importing spaces"
- The app correctly imports spaces FIRST before torch
- If you modify the code, maintain this import order
Model download fails
- Check internet connection
- HuggingFace Hub may be temporarily unavailable
- The download is retried on the next run; once it succeeds, the model is cached locally
References
Papers:
- RAPO (CVPR 2025): The Devil is in the Prompts
- RAPO++ (arXiv:2510.20206): Cross-Stage Prompt Optimization
Project Pages:
- RAPO: https://whynothaha.github.io/Prompt_optimizer/RAPO.html
- RAPO++: https://whynothaha.github.io/RAPO_plus_github/
Code:
- GitHub: https://github.com/Vchitect/RAPO
License
Please refer to the original repository for licensing information.
Created for HuggingFace Spaces deployment