Upload repository for paper 2510.20206

RAPO++ Gradio App Documentation

Overview

This Gradio app demonstrates Stage 1 (RAPO) of the RAPO++ framework: Retrieval-Augmented Prompt Optimization using knowledge graphs.

What It Does

The app takes a simple text-to-video (T2V) generation prompt and enriches it with contextually relevant modifiers retrieved from a knowledge graph. This optimization helps create more detailed, coherent prompts that lead to better video generation results.

How It Works

Architecture

  1. Knowledge Graph Construction

    • Creates a graph with "places" as central nodes (e.g., forest, beach, city street)
    • Places connect to relevant "actions/verbs" (e.g., "walking through", "exploring")
    • Places also connect to "atmospheric descriptors" (e.g., "dense trees", "peaceful atmosphere")
  2. Retrieval Process

    • Input prompt is encoded using SentenceTransformer (all-MiniLM-L6-v2)
    • Finds top-K most similar places via cosine similarity
    • Samples connected actions and atmosphere descriptors from graph neighbors
    • Filters modifiers by relevance to the input prompt
  3. Prompt Augmentation

    • Combines original prompt with retrieved modifiers
    • Structures the output to maintain coherence
    • Returns optimized prompt suitable for T2V generation
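The three steps above can be sketched end-to-end in plain Python. This is a minimal, illustrative version: the toy graph, the `toy_embed` dummy embedder, and the sample data are all stand-ins (the real app uses a networkx graph and the all-MiniLM-L6-v2 SentenceTransformer).

```python
import numpy as np

# Toy stand-ins: the real app uses a networkx graph and model.encode().
GRAPH = {
    "forest": ["walking through", "exploring", "dense trees", "peaceful atmosphere"],
    "beach":  ["strolling along", "swimming near", "golden sand", "rolling waves"],
}

def toy_embed(text):
    """Deterministic dummy embedding (stand-in for model.encode)."""
    rng = np.random.default_rng(sum(map(ord, text)))
    return rng.standard_normal(16)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_and_augment(prompt, top_k=1, per_place=2):
    """1) Encode the prompt, 2) rank places by similarity, 3) append modifiers."""
    prompt_emb = toy_embed(prompt)
    ranked = sorted(GRAPH, key=lambda p: cosine(prompt_emb, toy_embed(p)), reverse=True)
    modifiers = []
    for place in ranked[:top_k]:
        modifiers.extend(GRAPH[place][:per_place])
    # Deduplicate while preserving order, then append to the original prompt
    return f"{prompt}, {', '.join(dict.fromkeys(modifiers))}"
```

For example, `retrieve_and_augment("A person walking")` returns the original prompt with two retrieved modifiers appended after it.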

Key Components

app.py (main application):

  • create_demo_graph(): Builds a simplified knowledge graph with common T2V concepts
  • retrieve_and_augment_prompt(): Core RAPO function decorated with @spaces.GPU
  • Gradio interface with examples and detailed documentation

requirements.txt:

  • gradio 5.49.1 (pinned for compatibility)
  • sentence-transformers + sentencepiece for embeddings
  • torch 2.5.1 for tensor operations
  • networkx for graph operations
  • huggingface_hub for model downloads

Model Downloads

The app automatically downloads the required model on first run:

  • all-MiniLM-L6-v2: Sentence transformer for computing text embeddings (~80MB)

Downloaded to: ./ckpt/all-MiniLM-L6-v2/

Usage

Basic Usage

  1. Enter a simple prompt (e.g., "A person walking")
  2. Click "Optimize Prompt"
  3. View the enhanced prompt with contextual details

Advanced Settings

  • Number of Places to Retrieve: How many related places to search (1-5, default: 2)
  • Modifiers per Place: How many modifiers to sample from each place (1-10, default: 5)

Example Prompts

Try these examples to see the optimization in action:

  • "A person walking"
  • "A car driving at night"
  • "Someone cooking in a kitchen"
  • "A group of people talking"
  • "A bird flying"
  • "Someone sitting and reading"

Technical Details

Graph Structure

Places (central nodes):

  • forest, beach, city street, mountain, room, park, studio, kitchen, bridge, parking lot, desert, lake

Edge Types:

  • Place β†’ Verb/Action edges (e.g., "forest" β†’ "walking through")
  • Place β†’ Atmosphere edges (e.g., "forest" β†’ "dense trees")
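A minimal way to represent these two edge types is a list of (place, relation, modifier) triples; the real app stores them in a networkx graph, and the sample edges below are illustrative, not the app's full data.

```python
# Each edge is (place, relation, modifier); the relations mirror the two
# edge types above. Sample data is illustrative only.
EDGES = [
    ("forest", "verb", "walking through"),
    ("forest", "atmosphere", "dense trees"),
    ("beach", "verb", "strolling along"),
    ("beach", "atmosphere", "golden sand"),
]

def neighbors(place, relation):
    """Return all modifiers connected to `place` via `relation` edges."""
    return [m for p, r, m in EDGES if p == place and r == relation]
```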

Retrieval Algorithm:

  1. Encode input prompt: prompt_emb = model.encode(prompt)
  2. Compute similarities: cosine_similarity(prompt_emb, place_embeddings)
  3. Select top-K places by similarity score
  4. Sample neighbors from graph: G.neighbors(place)
  5. Deduplicate and rank modifiers
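Steps 2-3 reduce to a cosine-similarity ranking over precomputed place embeddings; a sketch with numpy (the function name and 1e-9 stabilizer are assumptions, not the app's exact code):

```python
import numpy as np

def top_k_places(prompt_emb, place_embs, k=2):
    """Rank places by cosine similarity to the prompt embedding (steps 2-3).
    place_embs: dict mapping place name -> embedding vector (precomputed)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    scores = {place: cos(prompt_emb, emb) for place, emb in place_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

With toy 2-d embeddings, a prompt vector pointing mostly along the "forest" axis ranks "forest" first, then the nearest remaining place.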

ZeroGPU Integration

The retrieve_and_augment_prompt() function is decorated with @spaces.GPU to leverage the allocated ZeroGPU (NVIDIA H200, 70GB VRAM). This enables:

  • Fast embedding computations
  • Efficient cosine similarity calculations
  • Scalability to larger graphs and batch processing
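The decorator pattern can be sketched with a local fallback, since the spaces package only exists when running on HuggingFace Spaces; the no-op fallback and the placeholder function body are assumptions for illustration.

```python
try:
    import spaces  # only available on HuggingFace Spaces with ZeroGPU
    gpu = spaces.GPU
except ImportError:
    def gpu(fn):  # no-op fallback so the sketch also runs locally
        return fn

@gpu
def retrieve_and_augment_prompt(prompt: str) -> str:
    # Placeholder body; the real function runs embedding + retrieval on GPU.
    return prompt
```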

Differences from Full RAPO

This demo implements a simplified version of Stage 1 RAPO:

Included:

  • βœ… Knowledge graph with place-verb-scene relations
  • βœ… Embedding-based retrieval via SentenceTransformer
  • βœ… Cosine similarity ranking
  • βœ… Basic prompt augmentation

Not Included (requires additional models/data):

  • ❌ Full relation graph from the paper (requires ~GB of graph data)
  • ❌ LLM-based sentence refactoring (Mistral-7B)
  • ❌ Iterative merging with similarity thresholds
  • ❌ Instruction-based rewriting (Llama3.1)

Why This Approach:

  • Full RAPO requires 7B+ LLM downloads (~15GB+)
  • Full graph data requires downloading preprocessed datasets
  • This demo focuses on the core concept: retrieval-augmented prompt optimization
  • Users can understand the methodology without waiting for large downloads

Running the Full RAPO Pipeline

To run the complete Stage 1 RAPO from the paper:

cd examples/Stage1_RAPO

# 1. Retrieve modifiers from graph
sh retrieve_modifiers.sh

# 2. Word augmentation
sh word_augment.sh

# 3. Sentence refactoring
sh refactoring.sh

# 4. Instruction-based rewriting
sh rewrite_via_instruction.sh

Requirements:

  • Download full relation graph data to relation_graph/graph_data/
  • Download Mistral-7B-Instruct-v0.3 to ckpt/
  • Download llama3_1_instruct_lora_rewrite to ckpt/

See README.md for full installation instructions.

Integration with RAPO++ Stages

This demo showcases Stage 1 only. The complete RAPO++ framework includes:

Stage 1 (RAPO) - Demonstrated Here

  • Retrieval-augmented prompt optimization via knowledge graphs
  • Offline refinement using curated data

Stage 2 (SSPO)

  • Self-supervised prompt optimization
  • Iterative refinement based on generated video feedback
  • Physics-aware consistency checks
  • VLM-based alignment scoring

Stage 3 (Fine-tuning)

  • LLM fine-tuning on collected feedback from Stage 2
  • Model-specific prompt refiners

Performance Notes

  • First run: ~1-2 minutes (downloads model)
  • Subsequent runs: <1 second per prompt
  • GPU allocation: Automatic via ZeroGPU
  • Memory usage: ~500MB (model + graph)

Troubleshooting

"No module named 'sentencepiece'"

  • Ensure sentencepiece==0.2.1 is in requirements.txt
  • sentence-transformers requires sentencepiece for tokenization

"CUDA has been initialized before importing spaces"

  • The app deliberately imports spaces before torch
  • If you modify the code, maintain this import order

Model download fails

  • Check internet connection
  • HuggingFace Hub may be temporarily unavailable
  • Model will retry on next run (cached after successful download)

License

Please refer to the original repository for licensing information.


Created for HuggingFace Spaces deployment