# RAPO++ Gradio App Documentation

## Overview

This Gradio app demonstrates **Stage 1 (RAPO)** of the RAPO++ framework: Retrieval-Augmented Prompt Optimization using knowledge graphs.

## What It Does

The app takes a simple text-to-video (T2V) generation prompt and enriches it with contextually relevant modifiers retrieved from a knowledge graph. This optimization helps create more detailed, coherent prompts that lead to better video generation results.
## How It Works

### Architecture

1. **Knowledge Graph Construction**
   - Creates a graph with "places" as central nodes (e.g., forest, beach, city street)
   - Places connect to relevant "actions/verbs" (e.g., "walking through", "exploring")
   - Places also connect to "atmospheric descriptors" (e.g., "dense trees", "peaceful atmosphere")

2. **Retrieval Process**
   - The input prompt is encoded using SentenceTransformer (all-MiniLM-L6-v2)
   - The top-K most similar places are found via cosine similarity
   - Connected actions and atmosphere descriptors are sampled from graph neighbors
   - Modifiers are filtered by relevance to the input prompt

3. **Prompt Augmentation**
   - Combines the original prompt with the retrieved modifiers
   - Structures the output to maintain coherence
   - Returns an optimized prompt suitable for T2V generation
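The graph-construction step can be sketched with networkx. The place/verb/atmosphere entries below are illustrative examples, not the app's actual `create_demo_graph()` data:

```python
import networkx as nx

# Minimal sketch of a create_demo_graph()-style builder (the node data here
# is illustrative; the real app uses a larger set of places and modifiers).
def build_demo_graph():
    G = nx.Graph()
    graph_data = {
        "forest": {
            "verbs": ["walking through", "exploring"],
            "atmosphere": ["dense trees", "peaceful atmosphere"],
        },
        "beach": {
            "verbs": ["strolling along", "swimming near"],
            "atmosphere": ["crashing waves", "golden sand"],
        },
    }
    for place, modifiers in graph_data.items():
        G.add_node(place, kind="place")
        for verb in modifiers["verbs"]:
            G.add_node(verb, kind="verb")
            G.add_edge(place, verb)       # place -> action edge
        for desc in modifiers["atmosphere"]:
            G.add_node(desc, kind="atmosphere")
            G.add_edge(place, desc)       # place -> atmosphere edge
    return G

G = build_demo_graph()
print(sorted(G.neighbors("forest")))
```

Keeping places as hubs means one `G.neighbors(place)` call later yields both action and atmosphere candidates for retrieval.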
### Key Components

**app.py** (main application):

- `create_demo_graph()`: Builds a simplified knowledge graph with common T2V concepts
- `retrieve_and_augment_prompt()`: Core RAPO function decorated with `@spaces.GPU`
- Gradio interface with examples and detailed documentation

**requirements.txt**:

- gradio 5.49.1 (pinned for compatibility)
- sentence-transformers + sentencepiece for embeddings
- torch 2.5.1 for tensor operations
- networkx for graph operations
- huggingface_hub for model downloads
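Assembled from the pins stated in this document, the file would look roughly like this (entries without a stated version are left unpinned here as an assumption):

```text
# requirements.txt (sketch; only versions stated in this document are pinned)
gradio==5.49.1
torch==2.5.1
sentence-transformers
sentencepiece==0.2.1
networkx
huggingface_hub
```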
## Model Downloads

The app automatically downloads the required model on first run:

- **all-MiniLM-L6-v2**: Sentence transformer for computing text embeddings (~80MB)
  - Downloaded to: `./ckpt/all-MiniLM-L6-v2/`
## Usage

### Basic Usage

1. Enter a simple prompt (e.g., "A person walking")
2. Click "Optimize Prompt"
3. View the enhanced prompt with contextual details

### Advanced Settings

- **Number of Places to Retrieve**: How many related places to search (1-5, default: 2)
- **Modifiers per Place**: How many modifiers to sample from each place (1-10, default: 5)

### Example Prompts

Try these examples to see the optimization in action:

- "A person walking"
- "A car driving at night"
- "Someone cooking in a kitchen"
- "A group of people talking"
- "A bird flying"
- "Someone sitting and reading"
## Technical Details

### Graph Structure

**Places (central nodes):**

- forest, beach, city street, mountain, room, park, studio, kitchen, bridge, parking lot, desert, lake

**Edge Types:**

- Place → Verb/Action edges (e.g., "forest" → "walking through")
- Place → Atmosphere edges (e.g., "forest" → "dense trees")

**Retrieval Algorithm:**

1. Encode the input prompt: `prompt_emb = model.encode(prompt)`
2. Compute similarities: `cosine_similarity(prompt_emb, place_embeddings)`
3. Select top-K places by similarity score
4. Sample neighbors from the graph: `G.neighbors(place)`
5. Deduplicate and rank modifiers
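Steps 2-5 can be sketched end to end. Toy 2-D embeddings and a plain-dict adjacency stand in for the SentenceTransformer model and the networkx graph, so only the ranking and deduplication logic is real:

```python
import math

# Toy stand-ins: real code would call model.encode(...) and G.neighbors(...).
place_embeddings = {"forest": [0.9, 0.1], "beach": [0.1, 0.9]}
neighbors = {
    "forest": ["walking through", "dense trees"],
    "beach": ["strolling along", "crashing waves"],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve_modifiers(prompt_emb, top_k=1):
    # Steps 2-3: rank places by cosine similarity, keep the top-K.
    ranked = sorted(place_embeddings,
                    key=lambda p: cosine_similarity(prompt_emb, place_embeddings[p]),
                    reverse=True)
    # Steps 4-5: gather neighbors, deduplicate while preserving rank order.
    seen, modifiers = set(), []
    for place in ranked[:top_k]:
        for mod in neighbors[place]:
            if mod not in seen:
                seen.add(mod)
                modifiers.append(mod)
    return modifiers

print(retrieve_modifiers([1.0, 0.0]))  # closest place: "forest"
```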
### ZeroGPU Integration

The `retrieve_and_augment_prompt()` function is decorated with `@spaces.GPU` to leverage the allocated ZeroGPU (NVIDIA H200, 70GB VRAM). This enables:

- Fast embedding computations
- Efficient cosine similarity calculations
- Scalability to larger graphs and batch processing
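A minimal sketch of that wiring, with a no-op fallback so the same file also runs off-Spaces (the fallback and the placeholder body are our assumptions, not necessarily what app.py does):

```python
try:
    import spaces                      # must be imported before torch on ZeroGPU
    gpu_decorator = spaces.GPU
except ImportError:                    # running locally, outside HF Spaces
    gpu_decorator = lambda fn: fn      # no-op stand-in

@gpu_decorator
def retrieve_and_augment_prompt(prompt: str) -> str:
    # Real code would encode the prompt, query the graph, and merge modifiers;
    # here we just echo to show where the GPU-bound work lives.
    return f"{prompt} (augmented)"

print(retrieve_and_augment_prompt("A person walking"))
```

On Spaces, `spaces.GPU` requests a GPU only for the duration of the decorated call, which is why the decorator sits on the retrieval function rather than on module-level setup.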
### Differences from Full RAPO

This demo implements a **simplified version** of Stage 1 RAPO:

**Included:**

- ✅ Knowledge graph with place-verb-scene relations
- ✅ Embedding-based retrieval via SentenceTransformer
- ✅ Cosine similarity ranking
- ✅ Basic prompt augmentation

**Not Included (requires additional models/data):**

- ❌ Full relation graph from the paper (requires ~GB of graph data)
- ❌ LLM-based sentence refactoring (Mistral-7B)
- ❌ Iterative merging with similarity thresholds
- ❌ Instruction-based rewriting (Llama3.1)

**Why This Approach:**

- Full RAPO requires 7B+ LLM downloads (~15GB+)
- Full graph data requires downloading preprocessed datasets
- This demo focuses on the **core concept**: retrieval-augmented prompt optimization
- Users can understand the methodology without waiting for large downloads
## Running the Full RAPO Pipeline

To run the complete Stage 1 RAPO from the paper:

```bash
cd examples/Stage1_RAPO

# 1. Retrieve modifiers from graph
sh retrieve_modifiers.sh

# 2. Word augmentation
sh word_augment.sh

# 3. Sentence refactoring
sh refactoring.sh

# 4. Instruction-based rewriting
sh rewrite_via_instruction.sh
```
**Requirements:**

- Download the full relation graph data to `relation_graph/graph_data/`
- Download Mistral-7B-Instruct-v0.3 to `ckpt/`
- Download llama3_1_instruct_lora_rewrite to `ckpt/`

See README.md for full installation instructions.
## Integration with RAPO++ Stages

This demo showcases **Stage 1 only**. The complete RAPO++ framework includes:

**Stage 1 (RAPO)** - *Demonstrated Here*

- Retrieval-augmented prompt optimization via knowledge graphs
- Offline refinement using curated data

**Stage 2 (SSPO)**

- Self-supervised prompt optimization
- Iterative refinement based on generated video feedback
- Physics-aware consistency checks
- VLM-based alignment scoring

**Stage 3 (Fine-tuning)**

- LLM fine-tuning on feedback collected in Stage 2
- Model-specific prompt refiners
## Performance Notes

- First run: ~1-2 minutes (downloads the model)
- Subsequent runs: <1 second per prompt
- GPU allocation: automatic via ZeroGPU
- Memory usage: ~500MB (model + graph)
## Troubleshooting

**"No module named 'sentencepiece'"**

- Ensure `sentencepiece==0.2.1` is in requirements.txt
- sentence-transformers requires sentencepiece for tokenization

**"CUDA has been initialized before importing spaces"**

- The app deliberately imports `spaces` first, before torch
- If you modify the code, maintain this import order

**Model download fails**

- Check your internet connection
- HuggingFace Hub may be temporarily unavailable
- The download is retried on the next run (and cached after a successful download)
## References

**Papers:**

- [RAPO (CVPR 2025)](https://arxiv.org/abs/2502.07516): The Devil is in the Prompts
- [RAPO++ (arXiv:2510.20206)](https://arxiv.org/abs/2510.20206): Cross-Stage Prompt Optimization

**Project Pages:**

- RAPO: https://whynothaha.github.io/Prompt_optimizer/RAPO.html
- RAPO++: https://whynothaha.github.io/RAPO_plus_github/

**Code:**

- GitHub: https://github.com/Vchitect/RAPO
## License

Please refer to the original repository for licensing information.

---

**Created for HuggingFace Spaces deployment**