# RAPO++ Gradio App Documentation

## Overview

This Gradio app demonstrates **Stage 1 (RAPO)** of the RAPO++ framework: Retrieval-Augmented Prompt Optimization using knowledge graphs.

## What It Does

The app takes a simple text-to-video (T2V) generation prompt and enriches it with contextually relevant modifiers retrieved from a knowledge graph. This optimization helps create more detailed, coherent prompts that lead to better video generation results.
## How It Works

### Architecture

1. **Knowledge Graph Construction**
   - Creates a graph with "places" as central nodes (e.g., forest, beach, city street)
   - Places connect to relevant "actions/verbs" (e.g., "walking through", "exploring")
   - Places also connect to "atmospheric descriptors" (e.g., "dense trees", "peaceful atmosphere")

2. **Retrieval Process**
   - The input prompt is encoded using SentenceTransformer (all-MiniLM-L6-v2)
   - The top-K most similar places are found via cosine similarity
   - Connected actions and atmosphere descriptors are sampled from graph neighbors
   - Modifiers are filtered by relevance to the input prompt

3. **Prompt Augmentation**
   - Combines the original prompt with the retrieved modifiers
   - Structures the output to maintain coherence
   - Returns an optimized prompt suitable for T2V generation
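The graph-construction step can be sketched with networkx. The place/verb/atmosphere entries below are illustrative examples, not the app's actual `create_demo_graph()` data:

```python
import networkx as nx

# Minimal sketch of a create_demo_graph()-style builder (the node data here
# is illustrative; the real app uses a larger set of places and modifiers).
def build_demo_graph():
    G = nx.Graph()
    graph_data = {
        "forest": {
            "verbs": ["walking through", "exploring"],
            "atmosphere": ["dense trees", "peaceful atmosphere"],
        },
        "beach": {
            "verbs": ["strolling along", "swimming near"],
            "atmosphere": ["crashing waves", "golden sand"],
        },
    }
    for place, modifiers in graph_data.items():
        G.add_node(place, kind="place")
        for verb in modifiers["verbs"]:
            G.add_node(verb, kind="verb")
            G.add_edge(place, verb)       # place -> action edge
        for desc in modifiers["atmosphere"]:
            G.add_node(desc, kind="atmosphere")
            G.add_edge(place, desc)       # place -> atmosphere edge
    return G

G = build_demo_graph()
print(sorted(G.neighbors("forest")))
```

Keeping places as hubs means one `G.neighbors(place)` call later yields both action and atmosphere candidates for retrieval.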
### Key Components

**app.py** (main application):

- `create_demo_graph()`: Builds a simplified knowledge graph with common T2V concepts
- `retrieve_and_augment_prompt()`: Core RAPO function decorated with `@spaces.GPU`
- Gradio interface with examples and detailed documentation

**requirements.txt**:

- gradio 5.49.1 (pinned for compatibility)
- sentence-transformers + sentencepiece for embeddings
- torch 2.5.1 for tensor operations
- networkx for graph operations
- huggingface_hub for model downloads
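Assembled from the pins stated in this document, the file would look roughly like this (entries without a stated version are left unpinned here as an assumption):

```text
# requirements.txt (sketch; only versions stated in this document are pinned)
gradio==5.49.1
torch==2.5.1
sentence-transformers
sentencepiece==0.2.1
networkx
huggingface_hub
```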
## Model Downloads

The app automatically downloads the required model on first run:

- **all-MiniLM-L6-v2**: Sentence transformer for computing text embeddings (~80MB)
  - Downloaded to: `./ckpt/all-MiniLM-L6-v2/`
## Usage

### Basic Usage

1. Enter a simple prompt (e.g., "A person walking")
2. Click "Optimize Prompt"
3. View the enhanced prompt with contextual details

### Advanced Settings

- **Number of Places to Retrieve**: How many related places to search (1-5, default: 2)
- **Modifiers per Place**: How many modifiers to sample from each place (1-10, default: 5)

### Example Prompts

Try these examples to see the optimization in action:

- "A person walking"
- "A car driving at night"
- "Someone cooking in a kitchen"
- "A group of people talking"
- "A bird flying"
- "Someone sitting and reading"
## Technical Details

### Graph Structure

**Places (central nodes):**

- forest, beach, city street, mountain, room, park, studio, kitchen, bridge, parking lot, desert, lake

**Edge Types:**

- Place → Verb/Action edges (e.g., "forest" → "walking through")
- Place → Atmosphere edges (e.g., "forest" → "dense trees")

**Retrieval Algorithm:**

1. Encode the input prompt: `prompt_emb = model.encode(prompt)`
2. Compute similarities: `cosine_similarity(prompt_emb, place_embeddings)`
3. Select top-K places by similarity score
4. Sample neighbors from the graph: `G.neighbors(place)`
5. Deduplicate and rank modifiers
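Steps 2-5 can be sketched end to end. Toy 2-D embeddings and a plain-dict adjacency stand in for the SentenceTransformer model and the networkx graph, so only the ranking and deduplication logic is real:

```python
import math

# Toy stand-ins: real code would call model.encode(...) and G.neighbors(...).
place_embeddings = {"forest": [0.9, 0.1], "beach": [0.1, 0.9]}
neighbors = {
    "forest": ["walking through", "dense trees"],
    "beach": ["strolling along", "crashing waves"],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve_modifiers(prompt_emb, top_k=1):
    # Steps 2-3: rank places by cosine similarity, keep the top-K.
    ranked = sorted(place_embeddings,
                    key=lambda p: cosine_similarity(prompt_emb, place_embeddings[p]),
                    reverse=True)
    # Steps 4-5: gather neighbors, deduplicate while preserving rank order.
    seen, modifiers = set(), []
    for place in ranked[:top_k]:
        for mod in neighbors[place]:
            if mod not in seen:
                seen.add(mod)
                modifiers.append(mod)
    return modifiers

print(retrieve_modifiers([1.0, 0.0]))  # closest place: "forest"
```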
### ZeroGPU Integration

The `retrieve_and_augment_prompt()` function is decorated with `@spaces.GPU` to leverage the allocated ZeroGPU (NVIDIA H200, 70GB VRAM). This enables:

- Fast embedding computations
- Efficient cosine similarity calculations
- Scalability to larger graphs and batch processing
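A minimal sketch of that wiring, with a no-op fallback so the same file also runs off-Spaces (the fallback and the placeholder body are our assumptions, not necessarily what app.py does):

```python
try:
    import spaces                      # must be imported before torch on ZeroGPU
    gpu_decorator = spaces.GPU
except ImportError:                    # running locally, outside HF Spaces
    gpu_decorator = lambda fn: fn      # no-op stand-in

@gpu_decorator
def retrieve_and_augment_prompt(prompt: str) -> str:
    # Real code would encode the prompt, query the graph, and merge modifiers;
    # here we just echo to show where the GPU-bound work lives.
    return f"{prompt} (augmented)"

print(retrieve_and_augment_prompt("A person walking"))
```

On Spaces, `spaces.GPU` requests a GPU only for the duration of the decorated call, which is why the decorator sits on the retrieval function rather than on module-level setup.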
### Differences from Full RAPO

This demo implements a **simplified version** of Stage 1 RAPO:

**Included:**

- ✅ Knowledge graph with place-verb-scene relations
- ✅ Embedding-based retrieval via SentenceTransformer
- ✅ Cosine similarity ranking
- ✅ Basic prompt augmentation

**Not Included (requires additional models/data):**

- ❌ Full relation graph from the paper (requires ~GB of graph data)
- ❌ LLM-based sentence refactoring (Mistral-7B)
- ❌ Iterative merging with similarity thresholds
- ❌ Instruction-based rewriting (Llama3.1)

**Why This Approach:**

- Full RAPO requires 7B+ LLM downloads (~15GB+)
- Full graph data requires downloading preprocessed datasets
- This demo focuses on the **core concept**: retrieval-augmented prompt optimization
- Users can understand the methodology without waiting for large downloads
## Running the Full RAPO Pipeline

To run the complete Stage 1 RAPO from the paper:

```bash
cd examples/Stage1_RAPO

# 1. Retrieve modifiers from graph
sh retrieve_modifiers.sh

# 2. Word augmentation
sh word_augment.sh

# 3. Sentence refactoring
sh refactoring.sh

# 4. Instruction-based rewriting
sh rewrite_via_instruction.sh
```
**Requirements:**

- Download the full relation graph data to `relation_graph/graph_data/`
- Download Mistral-7B-Instruct-v0.3 to `ckpt/`
- Download llama3_1_instruct_lora_rewrite to `ckpt/`

See README.md for full installation instructions.
## Integration with RAPO++ Stages

This demo showcases **Stage 1 only**. The complete RAPO++ framework includes:

**Stage 1 (RAPO)** - *Demonstrated Here*

- Retrieval-augmented prompt optimization via knowledge graphs
- Offline refinement using curated data

**Stage 2 (SSPO)**

- Self-supervised prompt optimization
- Iterative refinement based on generated video feedback
- Physics-aware consistency checks
- VLM-based alignment scoring

**Stage 3 (Fine-tuning)**

- LLM fine-tuning on feedback collected in Stage 2
- Model-specific prompt refiners
## Performance Notes

- First run: ~1-2 minutes (downloads the model)
- Subsequent runs: <1 second per prompt
- GPU allocation: automatic via ZeroGPU
- Memory usage: ~500MB (model + graph)
## Troubleshooting

**"No module named 'sentencepiece'"**

- Ensure `sentencepiece==0.2.1` is in requirements.txt
- sentence-transformers requires sentencepiece for tokenization

**"CUDA has been initialized before importing spaces"**

- The app deliberately imports `spaces` first, before torch
- If you modify the code, maintain this import order

**Model download fails**

- Check your internet connection
- HuggingFace Hub may be temporarily unavailable
- The download is retried on the next run (and cached after a successful download)
## References

**Papers:**

- [RAPO (CVPR 2025)](https://arxiv.org/abs/2502.07516): The Devil is in the Prompts
- [RAPO++ (arXiv:2510.20206)](https://arxiv.org/abs/2510.20206): Cross-Stage Prompt Optimization

**Project Pages:**

- RAPO: https://whynothaha.github.io/Prompt_optimizer/RAPO.html
- RAPO++: https://whynothaha.github.io/RAPO_plus_github/

**Code:**

- GitHub: https://github.com/Vchitect/RAPO
## License

Please refer to the original repository for licensing information.

---

**Created for HuggingFace Spaces deployment**