# RAPO++ Gradio App Documentation

## Overview

This Gradio app demonstrates **Stage 1 (RAPO)** of the RAPO++ framework: Retrieval-Augmented Prompt Optimization using knowledge graphs.

## What It Does

The app takes a simple text-to-video (T2V) generation prompt and enriches it with contextually relevant modifiers retrieved from a knowledge graph. This optimization helps create more detailed, coherent prompts that lead to better video generation results.

## How It Works

### Architecture

1. **Knowledge Graph Construction**
   - Creates a graph with "places" as central nodes (e.g., forest, beach, city street)
   - Places connect to relevant "actions/verbs" (e.g., "walking through", "exploring")
   - Places also connect to "atmospheric descriptors" (e.g., "dense trees", "peaceful atmosphere")
2. **Retrieval Process**
   - The input prompt is encoded using SentenceTransformer (all-MiniLM-L6-v2)
   - Finds the top-K most similar places via cosine similarity
   - Samples connected actions and atmosphere descriptors from graph neighbors
   - Filters modifiers by relevance to the input prompt
3. **Prompt Augmentation**
   - Combines the original prompt with the retrieved modifiers
   - Structures the output to maintain coherence
   - Returns an optimized prompt suitable for T2V generation

### Key Components

**app.py** (main application):
- `create_demo_graph()`: Builds a simplified knowledge graph with common T2V concepts
- `retrieve_and_augment_prompt()`: Core RAPO function, decorated with `@spaces.GPU`
- Gradio interface with examples and detailed documentation

**requirements.txt**:
- gradio 5.49.1 (pinned for compatibility)
- sentence-transformers + sentencepiece for embeddings
- torch 2.5.1 for tensor operations
- networkx for graph operations
- huggingface_hub for model downloads

## Model Downloads

The app automatically downloads the required model on first run:
- **all-MiniLM-L6-v2**: Sentence transformer for computing text embeddings (~80MB)

Downloaded to: `./ckpt/all-MiniLM-L6-v2/`

## Usage

### Basic Usage
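Before the UI walkthrough: the graph construction and retrieval steps described under "How It Works" can be sketched dependency-free. This is an illustrative stand-in, not the app's actual code — the real `create_demo_graph()` builds a networkx graph, and the real retrieval uses all-MiniLM-L6-v2 sentence embeddings; here a plain dict and a bag-of-words cosine stand in so the logic is runnable anywhere, and the place/modifier data is made up for the example:

```python
import math
from collections import Counter

def create_demo_graph():
    # Places as central nodes, each linked to action and atmosphere
    # modifiers (the real app stores these as networkx edges).
    return {
        "forest": {
            "actions": ["walking through", "exploring"],
            "atmosphere": ["dense trees", "peaceful atmosphere"],
        },
        "city street": {
            "actions": ["crossing", "wandering down"],
            "atmosphere": ["neon lights", "busy crowds"],
        },
    }

def embed(text):
    # Toy stand-in for SentenceTransformer.encode(): bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def optimize_prompt(prompt, graph, top_k=2, per_place=5):
    # Rank places by similarity to the prompt, gather each top place's
    # modifiers, deduplicate, and append them to the original prompt.
    p_emb = embed(prompt)
    ranked = sorted(graph, key=lambda p: cosine(p_emb, embed(p)), reverse=True)
    modifiers = []
    for place in ranked[:top_k]:
        pool = graph[place]["actions"] + graph[place]["atmosphere"]
        modifiers += pool[:per_place]
    modifiers = list(dict.fromkeys(modifiers))  # dedupe, keep rank order
    return f"{prompt}, {', '.join(modifiers)}" if modifiers else prompt
```

For example, `optimize_prompt("A person walking through a forest", create_demo_graph(), top_k=1)` appends the forest-related action and atmosphere modifiers to the prompt.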
1. Enter a simple prompt (e.g., "A person walking")
2. Click "Optimize Prompt"
3. View the enhanced prompt with contextual details

### Advanced Settings

- **Number of Places to Retrieve**: How many related places to search (1-5, default: 2)
- **Modifiers per Place**: How many modifiers to sample from each place (1-10, default: 5)

### Example Prompts

Try these examples to see the optimization in action:
- "A person walking"
- "A car driving at night"
- "Someone cooking in a kitchen"
- "A group of people talking"
- "A bird flying"
- "Someone sitting and reading"

## Technical Details

### Graph Structure

**Places (central nodes):**
- forest, beach, city street, mountain, room, park, studio, kitchen, bridge, parking lot, desert, lake

**Edge Types:**
- Place → Verb/Action edges (e.g., "forest" → "walking through")
- Place → Atmosphere edges (e.g., "forest" → "dense trees")

**Retrieval Algorithm:**
1. Encode the input prompt: `prompt_emb = model.encode(prompt)`
2. Compute similarities: `cosine_similarity(prompt_emb, place_embeddings)`
3. Select the top-K places by similarity score
4. Sample neighbors from the graph: `G.neighbors(place)`
5. Deduplicate and rank modifiers

### ZeroGPU Integration

The `retrieve_and_augment_prompt()` function is decorated with `@spaces.GPU` to leverage the allocated ZeroGPU (NVIDIA H200, 70GB VRAM).
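A minimal sketch of that wiring (assumed structure, not the app's exact code; outside Spaces the `spaces` package is absent, so a no-op decorator stands in and the function simply runs on CPU):

```python
# On Spaces, `spaces` must be imported before torch so that CUDA is not
# initialized ahead of ZeroGPU's allocator (see Troubleshooting below).
try:
    import spaces
    gpu = spaces.GPU
except ImportError:
    # Local fallback: a no-op decorator, so the sketch runs anywhere.
    def gpu(fn):
        return fn

@gpu
def retrieve_and_augment_prompt(prompt, num_places=2, modifiers_per_place=5):
    # The real body encodes the prompt, retrieves modifiers from the
    # graph, and returns the augmented prompt; stubbed here to keep the
    # decorator wiring in focus.
    modifiers = ["in a forest", "peaceful atmosphere"]  # stand-in result
    return f"{prompt}, {', '.join(modifiers[:modifiers_per_place])}"
```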
This enables:
- Fast embedding computations
- Efficient cosine similarity calculations
- Scalability to larger graphs and batch processing

### Differences from Full RAPO

This demo implements a **simplified version** of Stage 1 RAPO.

**Included:**
- ✅ Knowledge graph with place-verb-scene relations
- ✅ Embedding-based retrieval via SentenceTransformer
- ✅ Cosine similarity ranking
- ✅ Basic prompt augmentation

**Not Included (requires additional models/data):**
- ❌ Full relation graph from the paper (requires ~GB of graph data)
- ❌ LLM-based sentence refactoring (Mistral-7B)
- ❌ Iterative merging with similarity thresholds
- ❌ Instruction-based rewriting (Llama3.1)

**Why This Approach:**
- Full RAPO requires 7B+ LLM downloads (~15GB+)
- The full graph data requires downloading preprocessed datasets
- This demo focuses on the **core concept**: retrieval-augmented prompt optimization
- Users can understand the methodology without waiting for large downloads

## Running the Full RAPO Pipeline

To run the complete Stage 1 RAPO pipeline from the paper:

```bash
cd examples/Stage1_RAPO

# 1. Retrieve modifiers from graph
sh retrieve_modifiers.sh

# 2. Word augmentation
sh word_augment.sh

# 3. Sentence refactoring
sh refactoring.sh

# 4. Instruction-based rewriting
sh rewrite_via_instruction.sh
```

**Requirements:**
- Download the full relation graph data to `relation_graph/graph_data/`
- Download Mistral-7B-Instruct-v0.3 to `ckpt/`
- Download llama3_1_instruct_lora_rewrite to `ckpt/`

See README.md for full installation instructions.

## Integration with RAPO++ Stages

This demo showcases **Stage 1 only**.
The complete RAPO++ framework includes:

**Stage 1 (RAPO)** - *Demonstrated Here*
- Retrieval-augmented prompt optimization via knowledge graphs
- Offline refinement using curated data

**Stage 2 (SSPO)**
- Self-supervised prompt optimization
- Iterative refinement based on generated-video feedback
- Physics-aware consistency checks
- VLM-based alignment scoring

**Stage 3 (Fine-tuning)**
- LLM fine-tuning on feedback collected in Stage 2
- Model-specific prompt refiners

## Performance Notes

- First run: ~1-2 minutes (downloads the model)
- Subsequent runs: <1 second per prompt
- GPU allocation: automatic via ZeroGPU
- Memory usage: ~500MB (model + graph)

## Troubleshooting

**"No module named 'sentencepiece'"**
- Ensure `sentencepiece==0.2.1` is in requirements.txt
- sentence-transformers requires sentencepiece for tokenization

**"CUDA has been initialized before importing spaces"**
- The app correctly imports `spaces` FIRST, before torch
- If you modify the code, maintain this import order

**Model download fails**
- Check your internet connection
- HuggingFace Hub may be temporarily unavailable
- The download is retried on the next run (and cached once it succeeds)

## References

**Papers:**
- [RAPO (CVPR 2025)](https://arxiv.org/abs/2502.07516): The Devil is in the Prompts
- [RAPO++ (arXiv:2510.20206)](https://arxiv.org/abs/2510.20206): Cross-Stage Prompt Optimization

**Project Pages:**
- RAPO: https://whynothaha.github.io/Prompt_optimizer/RAPO.html
- RAPO++: https://whynothaha.github.io/RAPO_plus_github/

**Code:**
- GitHub: https://github.com/Vchitect/RAPO

## License

Please refer to the original repository for licensing information.

---

**Created for HuggingFace Spaces deployment**