# Deployment Notes

## Hugging Face Spaces Deployment

### ZeroGPU Configuration

This MVP is optimized for ZeroGPU deployment on Hugging Face Spaces.

#### Key Settings

- GPU: None (CPU-only)
- Storage: Limited (~20GB)
- Memory: 32GB RAM
- Network: Shared infrastructure

### Environment Variables

Required environment variables for deployment:

```bash
HF_TOKEN=your_huggingface_token_here
HF_HOME=/tmp/huggingface
MAX_WORKERS=2
CACHE_TTL=3600
DB_PATH=sessions.db
FAISS_INDEX_PATH=embeddings.faiss
SESSION_TIMEOUT=3600
MAX_SESSION_SIZE_MB=10
MOBILE_MAX_TOKENS=800
MOBILE_TIMEOUT=15000
GRADIO_PORT=7860
GRADIO_HOST=0.0.0.0
LOG_LEVEL=INFO
```
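
For reference, these variables can be read into a typed settings object at startup. This is an illustrative sketch only; the `config.py` module name and `Settings` class are not part of the repository:

```python
# config.py - illustrative helper for reading the variables listed above
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    hf_token: str = os.getenv("HF_TOKEN", "")
    hf_home: str = os.getenv("HF_HOME", "/tmp/huggingface")
    max_workers: int = int(os.getenv("MAX_WORKERS", "2"))
    cache_ttl: int = int(os.getenv("CACHE_TTL", "3600"))
    db_path: str = os.getenv("DB_PATH", "sessions.db")
    faiss_index_path: str = os.getenv("FAISS_INDEX_PATH", "embeddings.faiss")
    session_timeout: int = int(os.getenv("SESSION_TIMEOUT", "3600"))
    max_session_size_mb: int = int(os.getenv("MAX_SESSION_SIZE_MB", "10"))
    mobile_max_tokens: int = int(os.getenv("MOBILE_MAX_TOKENS", "800"))
    mobile_timeout_ms: int = int(os.getenv("MOBILE_TIMEOUT", "15000"))
    gradio_port: int = int(os.getenv("GRADIO_PORT", "7860"))
    gradio_host: str = os.getenv("GRADIO_HOST", "0.0.0.0")
    log_level: str = os.getenv("LOG_LEVEL", "INFO")

settings = Settings()
```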

### Space Configuration

Create a README.md in the HF Space with:

```yaml
---
title: AI Research Assistant MVP
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
license: apache-2.0
---
```

## Deployment Steps

1. **Clone/set up the repository**

   ```bash
   git clone your-repo
   cd Research_Assistant
   ```

2. **Install dependencies**

   ```bash
   bash install.sh
   # or
   pip install -r requirements.txt
   ```

3. **Test the installation**

   ```bash
   python test_setup.py
   # or
   bash quick_test.sh
   ```

4. **Run locally** (a simplified launch sketch follows this list)

   ```bash
   python app.py
   ```

5. **Deploy to HF Spaces**

   - Push to GitHub
   - Connect the repository to HF Spaces
   - Select ZeroGPU hardware
   - Deploy
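
The entry point used in step 4 is `app.py`; its real contents are not reproduced here, but a minimal Gradio launch that honors `GRADIO_HOST` and `GRADIO_PORT` could look like the sketch below (the handler is a placeholder, not the actual research pipeline):

```python
# app.py (simplified sketch; the real entry point may differ)
import os
import gradio as gr

def answer(question: str) -> str:
    # Placeholder handler; the real app routes this through the research pipeline.
    return f"You asked: {question}"

demo = gr.Interface(fn=answer, inputs="text", outputs="text",
                    title="AI Research Assistant MVP")

if __name__ == "__main__":
    demo.queue(max_size=16)  # request queuing keeps memory use bounded
    demo.launch(server_name=os.getenv("GRADIO_HOST", "0.0.0.0"),
                server_port=int(os.getenv("GRADIO_PORT", "7860")))
```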

## Resource Management

### Memory Limits

- Base Python: ~100MB
- Gradio: ~50MB
- Models (loaded): ~200-500MB
- Cache: ~100MB max
- Buffer: ~100MB

**Total budget:** ~512MB (within HF Spaces limits)

### Strategies

- Lazy model loading (see the sketch after this list)
- Model offloading when not in use
- Aggressive cache eviction
- Stream responses to reduce memory
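
A minimal sketch of lazy model loading, assuming sentence-transformers is used for embeddings (the model name is an example, not necessarily the one the app loads):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_embedder():
    # Import and load only on first use so startup stays lightweight.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer("all-MiniLM-L6-v2")  # example model, roughly 80MB

def embed(texts):
    return get_embedder().encode(texts, show_progress_bar=False)
```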

## Performance Optimization

### For ZeroGPU

1. Use the HF Inference API for LLM calls, not local models (see the sketch after this list)
2. Use sentence-transformers for embeddings (lightweight)
3. Implement request queuing
4. Use faiss-cpu, not the GPU build
5. Implement response streaming
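
Items 1 and 5 can be combined with `huggingface_hub.InferenceClient`. The sketch below is illustrative; the model ID is an arbitrary example rather than the one this MVP actually calls:

```python
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # example model, not prescribed by this repo
    token=os.getenv("HF_TOKEN"),
)

def generate_stream(prompt: str, max_tokens: int = 800):
    # stream=True yields text chunks as they are produced, keeping memory flat.
    for chunk in client.text_generation(prompt, max_new_tokens=max_tokens, stream=True):
        yield chunk
```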

### Mobile Optimizations

- Reduce max tokens to 800 (MOBILE_MAX_TOKENS)
- Shorten the timeout to 15s (MOBILE_TIMEOUT)
- Implement progressive loading
- Use a touch-optimized UI

## Monitoring

### Health Checks

- Application health endpoint: `/health` (see the sketch after this list)
- Database connectivity check
- Cache hit rate monitoring
- Response time tracking
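
One way to expose a `/health` endpoint alongside the Gradio UI is to mount the interface on a FastAPI app that also serves the health check. This is a sketch of that pattern, not the repository's actual wiring:

```python
import sqlite3
import gradio as gr
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    # Basic liveness plus a database connectivity check.
    try:
        sqlite3.connect("sessions.db").execute("SELECT 1")
        db_ok = True
    except sqlite3.Error:
        db_ok = False
    return {"status": "ok" if db_ok else "degraded", "database": db_ok}

demo = gr.Interface(fn=lambda q: q, inputs="text", outputs="text")
app = gr.mount_gradio_app(app, demo, path="/")
# Run with: uvicorn main:app --host 0.0.0.0 --port 7860  (module name is an example)
```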

### Logging

- Use structured logging (structlog; see the sketch after this list)
- Log levels: DEBUG (dev), INFO (prod)
- Monitor error rates
- Track performance metrics
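
A possible structlog configuration matching the level convention above; the choice of processors and the fields in the example log call are illustrative:

```python
import logging
import os
import structlog

structlog.configure(
    wrapper_class=structlog.make_filtering_bound_logger(
        getattr(logging, os.getenv("LOG_LEVEL", "INFO"))
    ),
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
)

log = structlog.get_logger()
log.info("request_completed", session_id="abc123", latency_ms=245, cache_hit=True)
```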

## Troubleshooting

### Common Issues

**Issue:** Out of memory errors

- **Solution:** Reduce MAX_WORKERS and implement request queuing

**Issue:** Slow responses

- **Solution:** Enable aggressive caching and use response streaming

**Issue:** Model loading failures

- **Solution:** Use the HF Inference API instead of local models

**Issue:** Session data loss

- **Solution:** Implement proper persistence with SQLite backups (see the sketch after this list)
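
For the session-loss case, Python's built-in `sqlite3` backup API can copy the live database on a schedule; a minimal sketch (paths and the function name are examples):

```python
import sqlite3

def backup_sessions(src_path: str = "sessions.db",
                    dst_path: str = "sessions.backup.db") -> None:
    # The sqlite3 backup API copies the database safely even if other
    # connections are writing to it.
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        src.backup(dst)
    finally:
        src.close()
        dst.close()
```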

## Scaling Considerations

### For Production

1. **Horizontal scaling:** Deploy multiple instances
2. **Caching layer:** Add Redis for shared session data (see the sketch after this list)
3. **Load balancing:** Use the HF Spaces built-in load balancer
4. **CDN:** Serve static assets via a CDN
5. **Database:** Consider PostgreSQL for production
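
A rough sketch of the Redis caching layer for shared session data, assuming the redis-py client; the host variable, key scheme, and helper names are examples, not part of the current codebase:

```python
import json
import os
import redis

r = redis.Redis(host=os.getenv("REDIS_HOST", "localhost"), port=6379,
                decode_responses=True)

def save_session(session_id: str, data: dict, ttl: int = 3600) -> None:
    # Expire entries after SESSION_TIMEOUT so stale sessions are evicted automatically.
    r.setex(f"session:{session_id}", ttl, json.dumps(data))

def load_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```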

### Migration Path

- Phase 1: MVP on ZeroGPU (current)
- Phase 2: Upgrade to GPU hardware for local models
- Phase 3: Scale to multiple workers
- Phase 4: Enterprise deployment with managed infrastructure