Deployment Notes
Hugging Face Spaces Deployment
ZeroGPU Configuration
This MVP is optimized for ZeroGPU deployment on Hugging Face Spaces.
Key Settings
- GPU: None (CPU-only)
- Storage: Limited (~20GB)
- Memory: 32GB RAM
- Network: Shared infrastructure
Environment Variables
Required environment variables for deployment:
```
HF_TOKEN=your_huggingface_token_here
HF_HOME=/tmp/huggingface
MAX_WORKERS=2
CACHE_TTL=3600
DB_PATH=sessions.db
FAISS_INDEX_PATH=embeddings.faiss
SESSION_TIMEOUT=3600
MAX_SESSION_SIZE_MB=10
MOBILE_MAX_TOKENS=800
MOBILE_TIMEOUT=15000
GRADIO_PORT=7860
GRADIO_HOST=0.0.0.0
LOG_LEVEL=INFO
```
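A hedged sketch of how app.py might resolve these values, with the values above used as fallbacks for local runs (a config module like this is an assumption, not part of the current codebase):

```python
import os

# Runtime configuration resolved from the environment variables listed above.
# Defaults mirror that list and are assumptions intended for local development.
HF_TOKEN = os.getenv("HF_TOKEN", "")
HF_HOME = os.getenv("HF_HOME", "/tmp/huggingface")
MAX_WORKERS = int(os.getenv("MAX_WORKERS", "2"))
CACHE_TTL = int(os.getenv("CACHE_TTL", "3600"))
DB_PATH = os.getenv("DB_PATH", "sessions.db")
FAISS_INDEX_PATH = os.getenv("FAISS_INDEX_PATH", "embeddings.faiss")
SESSION_TIMEOUT = int(os.getenv("SESSION_TIMEOUT", "3600"))
MAX_SESSION_SIZE_MB = int(os.getenv("MAX_SESSION_SIZE_MB", "10"))
MOBILE_MAX_TOKENS = int(os.getenv("MOBILE_MAX_TOKENS", "800"))
MOBILE_TIMEOUT_MS = int(os.getenv("MOBILE_TIMEOUT", "15000"))
GRADIO_PORT = int(os.getenv("GRADIO_PORT", "7860"))
GRADIO_HOST = os.getenv("GRADIO_HOST", "0.0.0.0")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
```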
Space Configuration
Create a README.md in the HF Space with:
```yaml
---
title: AI Research Assistant MVP
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
license: apache-2.0
---
```
Deployment Steps
1. Clone/set up the repository: `git clone your-repo && cd Research_Assistant`
2. Install dependencies: `bash install.sh` (or `pip install -r requirements.txt`)
3. Test the installation: `python test_setup.py` (or `bash quick_test.sh`)
4. Run locally: `python app.py`
5. Deploy to HF Spaces:
   - Push to GitHub
   - Connect the repository to HF Spaces
   - Select ZeroGPU hardware
   - Deploy
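If you prefer pushing the files from a script instead of connecting through GitHub, huggingface_hub can upload the folder directly. A minimal sketch, assuming the Space already exists with ZeroGPU hardware selected (the Space id is a placeholder):

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment
api.upload_folder(
    folder_path=".",                      # local repository root
    repo_id="your-username/your-space",   # placeholder Space id
    repo_type="space",
    ignore_patterns=["*.db", "*.faiss", "__pycache__/*"],  # skip local state
)
```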
Resource Management
Memory Limits
- Base Python: ~100MB
- Gradio: ~50MB
- Models (loaded): ~200-500MB
- Cache: ~100MB max
- Buffer: ~100MB
Total Budget: ~512MB (within HF Spaces limits)
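To verify the budget at runtime, the process's resident set size can be checked with psutil (assumed as an extra dependency); a minimal sketch:

```python
import psutil

def rss_mb() -> float:
    """Current process resident memory in megabytes."""
    return psutil.Process().memory_info().rss / (1024 * 1024)

# Example: warn as usage approaches the ~512MB budget (threshold is an assumption).
if rss_mb() > 450:
    print(f"memory usage high: {rss_mb():.0f} MB")
```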
Strategies
- Lazy model loading (see the sketch after this list)
- Model offloading when not in use
- Aggressive cache eviction
- Stream responses to reduce memory
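A minimal sketch of the lazy-loading and offloading idea, assuming a sentence-transformers embedder (the model name and idle threshold are assumptions):

```python
import time
from sentence_transformers import SentenceTransformer

_model = None
_last_used = 0.0
IDLE_SECONDS = 300  # offload after 5 minutes of inactivity (assumed threshold)

def get_embedder() -> SentenceTransformer:
    """Load the embedding model on first use only."""
    global _model, _last_used
    if _model is None:
        _model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed lightweight model
    _last_used = time.time()
    return _model

def maybe_offload() -> None:
    """Drop the model reference once it has been idle, freeing memory."""
    global _model
    if _model is not None and time.time() - _last_used > IDLE_SECONDS:
        _model = None  # freed by garbage collection
```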
Performance Optimization
For ZeroGPU
- Use the HF Inference API for LLM calls rather than local models (see the sketch after this list)
- Use `sentence-transformers` for embeddings (lightweight)
- Implement request queuing
- Use FAISS-CPU (not GPU version)
- Implement response streaming
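For the first point combined with streaming, a sketch using huggingface_hub's InferenceClient might look like this (the model id is a placeholder, not the project's actual choice):

```python
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder hosted model
    token=os.getenv("HF_TOKEN"),
)

def stream_answer(prompt: str, max_new_tokens: int = 800):
    """Yield tokens as they arrive so the UI can render partial output."""
    for token in client.text_generation(prompt, max_new_tokens=max_new_tokens, stream=True):
        yield token
```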
Mobile Optimizations
- Reduce max tokens to 800
- Shorten timeout to 15s
- Implement progressive loading
- Use touch-optimized UI
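One way to apply the token and timeout limits is to branch on the client's User-Agent. The helper below is a sketch; the mobile detection heuristic and the desktop defaults are assumptions:

```python
import os

MOBILE_MAX_TOKENS = int(os.getenv("MOBILE_MAX_TOKENS", "800"))
MOBILE_TIMEOUT_S = int(os.getenv("MOBILE_TIMEOUT", "15000")) / 1000
DESKTOP_MAX_TOKENS = 1500   # assumed desktop default
DESKTOP_TIMEOUT_S = 30.0    # assumed desktop default

def generation_limits(user_agent: str) -> tuple[int, float]:
    """Return (max_tokens, timeout_seconds) based on a crude mobile check."""
    is_mobile = any(hint in user_agent.lower() for hint in ("mobile", "android", "iphone"))
    if is_mobile:
        return MOBILE_MAX_TOKENS, MOBILE_TIMEOUT_S
    return DESKTOP_MAX_TOKENS, DESKTOP_TIMEOUT_S
```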
Monitoring
Health Checks
- Application health endpoint: `/health`
- Database connectivity check
- Cache hit rate monitoring
- Response time tracking
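Since Gradio apps can be mounted onto a FastAPI application, the `/health` route can live alongside the UI. A minimal sketch (the placeholder interface and the SQLite check against DB_PATH are assumptions):

```python
import os
import sqlite3

import gradio as gr
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health() -> dict:
    """Report basic liveness plus database connectivity."""
    try:
        with sqlite3.connect(os.getenv("DB_PATH", "sessions.db")) as conn:
            conn.execute("SELECT 1")
        db_ok = True
    except sqlite3.Error:
        db_ok = False
    return {"status": "ok" if db_ok else "degraded", "database": db_ok}

demo = gr.Interface(fn=lambda x: x, inputs="text", outputs="text")  # placeholder UI
app = gr.mount_gradio_app(app, demo, path="/")
# Run with: uvicorn app:app --host 0.0.0.0 --port 7860
```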
Logging
- Use structured logging (structlog)
- Log levels: DEBUG (dev), INFO (prod)
- Monitor error rates
- Track performance metrics
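A typical structlog setup matching these points, with JSON output and the level taken from LOG_LEVEL, might look like this sketch (the example event and fields are illustrative):

```python
import logging
import os

import structlog

level = getattr(logging, os.getenv("LOG_LEVEL", "INFO").upper(), logging.INFO)

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.make_filtering_bound_logger(level),
)

log = structlog.get_logger()
log.info("request_completed", latency_ms=420, cache_hit=True)  # example metric fields
```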
Troubleshooting
Common Issues
Issue: Out of memory errors
- Solution: Reduce max_workers, implement request queuing
Issue: Slow responses
- Solution: Enable aggressive caching, use streaming
Issue: Model loading failures
- Solution: Use HF Inference API instead of local models
Issue: Session data loss
- Solution: Implement proper persistence with a SQLite backup (see the sketch below)
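For the last point, Python's sqlite3 module includes an online backup API; a minimal sketch, with the backup path as an assumption:

```python
import sqlite3

def backup_sessions(db_path: str = "sessions.db", backup_path: str = "sessions.backup.db") -> None:
    """Copy the live session database to a backup file while the app keeps running."""
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(backup_path)
    try:
        src.backup(dst)  # sqlite3's online backup API (Python 3.7+)
    finally:
        src.close()
        dst.close()
```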
Scaling Considerations
For Production
- Horizontal Scaling: Deploy multiple instances
- Caching Layer: Add Redis for shared session data (see the sketch after this list)
- Load Balancing: Use HF Spaces built-in load balancer
- CDN: Static assets via CDN
- Database: Consider PostgreSQL for production
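For the caching layer, a shared session store with redis-py could look like the following sketch; the host, key prefix, and reuse of SESSION_TIMEOUT as the TTL are assumptions:

```python
import json
import os

import redis

r = redis.Redis(host=os.getenv("REDIS_HOST", "localhost"), port=6379, decode_responses=True)
SESSION_TTL = int(os.getenv("SESSION_TIMEOUT", "3600"))

def save_session(session_id: str, data: dict) -> None:
    """Store session data shared across instances, expiring after SESSION_TTL seconds."""
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps(data))

def load_session(session_id: str) -> dict | None:
    """Return the stored session, or None if it has expired or never existed."""
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```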
Migration Path
- Phase 1: MVP on ZeroGPU (current)
- Phase 2: Upgrade to GPU for local models
- Phase 3: Scale to multiple workers
- Phase 4: Enterprise deployment with managed infrastructure