Upload repository for paper 2510.20206

RAPO++ Gradio App Documentation

Overview

This Gradio app demonstrates Stage 1 (RAPO) of the RAPO++ framework: Retrieval-Augmented Prompt Optimization using knowledge graphs.

What It Does

The app takes a simple text-to-video (T2V) generation prompt and enriches it with contextually relevant modifiers retrieved from a knowledge graph. This optimization helps create more detailed, coherent prompts that lead to better video generation results.

How It Works

Architecture

  1. Knowledge Graph Construction

    • Creates a graph with "places" as central nodes (e.g., forest, beach, city street)
    • Places connect to relevant "actions/verbs" (e.g., "walking through", "exploring")
    • Places also connect to "atmospheric descriptors" (e.g., "dense trees", "peaceful atmosphere")
  2. Retrieval Process

    • Input prompt is encoded using SentenceTransformer (all-MiniLM-L6-v2)
    • Finds top-K most similar places via cosine similarity
    • Samples connected actions and atmosphere descriptors from graph neighbors
    • Filters modifiers by relevance to the input prompt
  3. Prompt Augmentation

    • Combines original prompt with retrieved modifiers
    • Structures the output to maintain coherence
    • Returns optimized prompt suitable for T2V generation
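The three steps above can be sketched end-to-end in plain Python. This is a minimal, illustrative version: the toy graph, the `toy_embed` dummy embedder, and the sample data are all stand-ins (the real app uses a networkx graph and the all-MiniLM-L6-v2 SentenceTransformer).

```python
import numpy as np

# Toy stand-ins: the real app uses a networkx graph and model.encode().
GRAPH = {
    "forest": ["walking through", "exploring", "dense trees", "peaceful atmosphere"],
    "beach":  ["strolling along", "swimming near", "golden sand", "rolling waves"],
}

def toy_embed(text):
    """Deterministic dummy embedding (stand-in for model.encode)."""
    rng = np.random.default_rng(sum(map(ord, text)))
    return rng.standard_normal(16)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_and_augment(prompt, top_k=1, per_place=2):
    """1) Encode the prompt, 2) rank places by similarity, 3) append modifiers."""
    prompt_emb = toy_embed(prompt)
    ranked = sorted(GRAPH, key=lambda p: cosine(prompt_emb, toy_embed(p)), reverse=True)
    modifiers = []
    for place in ranked[:top_k]:
        modifiers.extend(GRAPH[place][:per_place])
    # Deduplicate while preserving order, then append to the original prompt
    return f"{prompt}, {', '.join(dict.fromkeys(modifiers))}"
```

For example, `retrieve_and_augment("A person walking")` returns the original prompt with two retrieved modifiers appended after it.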

Key Components

app.py (main application):

  • create_demo_graph(): Builds a simplified knowledge graph with common T2V concepts
  • retrieve_and_augment_prompt(): Core RAPO function decorated with @spaces.GPU
  • Gradio interface with examples and detailed documentation

requirements.txt:

  • gradio 5.49.1 (pinned for compatibility)
  • sentence-transformers + sentencepiece for embeddings
  • torch 2.5.1 for tensor operations
  • networkx for graph operations
  • huggingface_hub for model downloads

Model Downloads

The app automatically downloads the required model on first run:

  • all-MiniLM-L6-v2: Sentence transformer for computing text embeddings (~80MB)

Downloaded to: ./ckpt/all-MiniLM-L6-v2/

Usage

Basic Usage

  1. Enter a simple prompt (e.g., "A person walking")
  2. Click "Optimize Prompt"
  3. View the enhanced prompt with contextual details

Advanced Settings

  • Number of Places to Retrieve: How many related places to search (1-5, default: 2)
  • Modifiers per Place: How many modifiers to sample from each place (1-10, default: 5)

Example Prompts

Try these examples to see the optimization in action:

  • "A person walking"
  • "A car driving at night"
  • "Someone cooking in a kitchen"
  • "A group of people talking"
  • "A bird flying"
  • "Someone sitting and reading"

Technical Details

Graph Structure

Places (central nodes):

  • forest, beach, city street, mountain, room, park, studio, kitchen, bridge, parking lot, desert, lake

Edge Types:

  • Place β†’ Verb/Action edges (e.g., "forest" β†’ "walking through")
  • Place β†’ Atmosphere edges (e.g., "forest" β†’ "dense trees")
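A minimal way to represent these two edge types is a list of (place, relation, modifier) triples; the real app stores them in a networkx graph, and the sample edges below are illustrative, not the app's full data.

```python
# Each edge is (place, relation, modifier); the relations mirror the two
# edge types above. Sample data is illustrative only.
EDGES = [
    ("forest", "verb", "walking through"),
    ("forest", "atmosphere", "dense trees"),
    ("beach", "verb", "strolling along"),
    ("beach", "atmosphere", "golden sand"),
]

def neighbors(place, relation):
    """Return all modifiers connected to `place` via `relation` edges."""
    return [m for p, r, m in EDGES if p == place and r == relation]
```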

Retrieval Algorithm:

  1. Encode input prompt: prompt_emb = model.encode(prompt)
  2. Compute similarities: cosine_similarity(prompt_emb, place_embeddings)
  3. Select top-K places by similarity score
  4. Sample neighbors from graph: G.neighbors(place)
  5. Deduplicate and rank modifiers
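Steps 2-3 reduce to a cosine-similarity ranking over precomputed place embeddings; a sketch with numpy (the function name and 1e-9 stabilizer are assumptions, not the app's exact code):

```python
import numpy as np

def top_k_places(prompt_emb, place_embs, k=2):
    """Rank places by cosine similarity to the prompt embedding (steps 2-3).
    place_embs: dict mapping place name -> embedding vector (precomputed)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    scores = {place: cos(prompt_emb, emb) for place, emb in place_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

With toy 2-d embeddings, a prompt vector pointing mostly along the "forest" axis ranks "forest" first, then the nearest remaining place.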

ZeroGPU Integration

The retrieve_and_augment_prompt() function is decorated with @spaces.GPU to leverage the allocated ZeroGPU (NVIDIA H200, 70GB VRAM). This enables:

  • Fast embedding computations
  • Efficient cosine similarity calculations
  • Scalability to larger graphs and batch processing
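The decorator pattern can be sketched with a local fallback, since the spaces package only exists when running on HuggingFace Spaces; the no-op fallback and the placeholder function body are assumptions for illustration.

```python
try:
    import spaces  # only available on HuggingFace Spaces with ZeroGPU
    gpu = spaces.GPU
except ImportError:
    def gpu(fn):  # no-op fallback so the sketch also runs locally
        return fn

@gpu
def retrieve_and_augment_prompt(prompt: str) -> str:
    # Placeholder body; the real function runs embedding + retrieval on GPU.
    return prompt
```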

Differences from Full RAPO

This demo implements a simplified version of Stage 1 RAPO:

Included:

  • βœ… Knowledge graph with place-verb-scene relations
  • βœ… Embedding-based retrieval via SentenceTransformer
  • βœ… Cosine similarity ranking
  • βœ… Basic prompt augmentation

Not Included (requires additional models/data):

  • ❌ Full relation graph from the paper (requires ~GB of graph data)
  • ❌ LLM-based sentence refactoring (Mistral-7B)
  • ❌ Iterative merging with similarity thresholds
  • ❌ Instruction-based rewriting (Llama3.1)

Why This Approach:

  • Full RAPO requires 7B+ LLM downloads (~15GB+)
  • Full graph data requires downloading preprocessed datasets
  • This demo focuses on the core concept: retrieval-augmented prompt optimization
  • Users can understand the methodology without waiting for large downloads

Running the Full RAPO Pipeline

To run the complete Stage 1 RAPO from the paper:

cd examples/Stage1_RAPO

# 1. Retrieve modifiers from graph
sh retrieve_modifiers.sh

# 2. Word augmentation
sh word_augment.sh

# 3. Sentence refactoring
sh refactoring.sh

# 4. Instruction-based rewriting
sh rewrite_via_instruction.sh

Requirements:

  • Download full relation graph data to relation_graph/graph_data/
  • Download Mistral-7B-Instruct-v0.3 to ckpt/
  • Download llama3_1_instruct_lora_rewrite to ckpt/

See README.md for full installation instructions.

Integration with RAPO++ Stages

This demo showcases Stage 1 only. The complete RAPO++ framework includes:

Stage 1 (RAPO) - Demonstrated Here

  • Retrieval-augmented prompt optimization via knowledge graphs
  • Offline refinement using curated data

Stage 2 (SSPO)

  • Self-supervised prompt optimization
  • Iterative refinement based on generated video feedback
  • Physics-aware consistency checks
  • VLM-based alignment scoring

Stage 3 (Fine-tuning)

  • LLM fine-tuning on collected feedback from Stage 2
  • Model-specific prompt refiners

Performance Notes

  • First run: ~1-2 minutes (downloads model)
  • Subsequent runs: <1 second per prompt
  • GPU allocation: Automatic via ZeroGPU
  • Memory usage: ~500MB (model + graph)

Troubleshooting

"No module named 'sentencepiece'"

  • Ensure sentencepiece==0.2.1 is in requirements.txt
  • sentence-transformers requires sentencepiece for tokenization

"CUDA has been initialized before importing spaces"

  • The app deliberately imports spaces before torch
  • If you modify the code, maintain this import order

Model download fails

  • Check internet connection
  • HuggingFace Hub may be temporarily unavailable
  • Model will retry on next run (cached after successful download)

License

Please refer to the original repository for licensing information.


Created for HuggingFace Spaces deployment