# Deployment Notes
## Hugging Face Spaces Deployment
### ZeroGPU Configuration
This MVP is optimized for **ZeroGPU** deployment on Hugging Face Spaces.
#### Key Settings
- **GPU**: None (CPU-only)
- **Storage**: Limited (~20GB)
- **Memory**: 32GB RAM
- **Network**: Shared infrastructure
### Environment Variables
Required environment variables for deployment:
```bash
HF_TOKEN=your_huggingface_token_here
HF_HOME=/tmp/huggingface
MAX_WORKERS=2
CACHE_TTL=3600
DB_PATH=sessions.db
FAISS_INDEX_PATH=embeddings.faiss
SESSION_TIMEOUT=3600
MAX_SESSION_SIZE_MB=10
MOBILE_MAX_TOKENS=800
MOBILE_TIMEOUT=15000
GRADIO_PORT=7860
GRADIO_HOST=0.0.0.0
LOG_LEVEL=INFO
```
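A minimal sketch of how the app could read these variables at startup; the `CONFIG` dict and its defaults are illustrative, not the actual loader in `app.py`:
```python
import os

# Illustrative config loader; keys mirror the variables above, defaults match the sample values.
CONFIG = {
    "hf_token": os.environ.get("HF_TOKEN", ""),
    "hf_home": os.environ.get("HF_HOME", "/tmp/huggingface"),
    "max_workers": int(os.environ.get("MAX_WORKERS", "2")),
    "cache_ttl": int(os.environ.get("CACHE_TTL", "3600")),
    "db_path": os.environ.get("DB_PATH", "sessions.db"),
    "faiss_index_path": os.environ.get("FAISS_INDEX_PATH", "embeddings.faiss"),
    "session_timeout": int(os.environ.get("SESSION_TIMEOUT", "3600")),
    "max_session_size_mb": int(os.environ.get("MAX_SESSION_SIZE_MB", "10")),
    "mobile_max_tokens": int(os.environ.get("MOBILE_MAX_TOKENS", "800")),
    "mobile_timeout_ms": int(os.environ.get("MOBILE_TIMEOUT", "15000")),
    "gradio_port": int(os.environ.get("GRADIO_PORT", "7860")),
    "gradio_host": os.environ.get("GRADIO_HOST", "0.0.0.0"),
    "log_level": os.environ.get("LOG_LEVEL", "INFO"),
}
```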
### Space Configuration
Create a `README.md` at the root of the HF Space with the following YAML front matter:
```yaml
---
title: AI Research Assistant MVP
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
license: apache-2.0
---
```
### Deployment Steps
1. **Clone/Setup Repository**
```bash
git clone your-repo
cd Research_Assistant
```
2. **Install Dependencies**
```bash
bash install.sh
# or
pip install -r requirements.txt
```
3. **Test Installation**
```bash
python test_setup.py
# or
bash quick_test.sh
```
4. **Run Locally** (see the launch sketch after these steps)
```bash
python app.py
```
5. **Deploy to HF Spaces**
- Push to GitHub
- Connect to HF Spaces
- Select ZeroGPU hardware
- Deploy
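For step 4, a minimal launch sketch showing how the `GRADIO_HOST`/`GRADIO_PORT` variables might be wired into Gradio; the `respond` handler is a placeholder, not the real assistant logic:
```python
import os
import gradio as gr

def respond(message: str) -> str:
    # Placeholder handler; the real app.py builds the full research-assistant pipeline.
    return f"You asked: {message}"

demo = gr.Interface(fn=respond, inputs="text", outputs="text",
                    title="AI Research Assistant MVP")

if __name__ == "__main__":
    demo.launch(
        server_name=os.environ.get("GRADIO_HOST", "0.0.0.0"),
        server_port=int(os.environ.get("GRADIO_PORT", "7860")),
    )
```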
### Resource Management
#### Memory Limits
- **Base Python**: ~100MB
- **Gradio**: ~50MB
- **Models (loaded)**: ~200-500MB
- **Cache**: ~100MB max
- **Buffer**: ~100MB
**Total Budget**: ~512MB (within HF Spaces limits)
#### Strategies
- Lazy model loading
- Model offloading when not in use
- Aggressive cache eviction
- Stream responses to reduce memory
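A sketch of the lazy-loading strategy above; the helper name and the choice of embedding model are assumptions:
```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_embedder():
    """Load the embedding model on first use only, then reuse the single cached instance."""
    from sentence_transformers import SentenceTransformer  # deferred import keeps startup memory low
    return SentenceTransformer("all-MiniLM-L6-v2")  # model choice is illustrative

def embed(texts):
    # The model is loaded here, lazily, rather than at import time.
    return get_embedder().encode(texts)
```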
### Performance Optimization
#### For ZeroGPU
1. Use HF Inference API for LLM calls (not local models)
2. Use `sentence-transformers` for embeddings (lightweight)
3. Implement request queuing
4. Use FAISS-CPU (not GPU version)
5. Implement response streaming
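Illustrating points 2 and 4 above, a small sketch of building a CPU-only FAISS index over sentence-transformers embeddings; the model name and flat inner-product index are assumptions, not the project's confirmed choices:
```python
import faiss  # provided by the faiss-cpu package
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # lightweight embedder (illustrative)

def build_index(passages: list[str]) -> faiss.IndexFlatIP:
    """Embed passages and store them in an in-memory CPU index."""
    vectors = model.encode(passages, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on normalized vectors
    index.add(np.asarray(vectors, dtype="float32"))
    return index

def search(index: faiss.IndexFlatIP, query: str, k: int = 5):
    query_vec = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(query_vec, dtype="float32"), k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```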
#### Mobile Optimizations
- Reduce max tokens to 800
- Shorten timeout to 15s
- Implement progressive loading
- Use touch-optimized UI
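A hedged sketch of applying the mobile limits above per request; how a mobile client is detected (the `is_mobile` flag) and the desktop defaults are assumptions:
```python
import os

def generation_params(is_mobile: bool) -> dict:
    """Return reduced generation limits for mobile clients (values are illustrative)."""
    if is_mobile:
        return {
            "max_tokens": int(os.environ.get("MOBILE_MAX_TOKENS", "800")),
            "timeout_s": int(os.environ.get("MOBILE_TIMEOUT", "15000")) / 1000,
        }
    return {"max_tokens": 1500, "timeout_s": 30}  # desktop defaults (assumed, not from the config above)
```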
### Monitoring
#### Health Checks
- Application health endpoint: `/health`
- Database connectivity check
- Cache hit rate monitoring
- Response time tracking
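One way to expose `/health` next to the Gradio UI is to mount the app on FastAPI; this is a sketch under that assumption, with the checks stubbed out:
```python
import gradio as gr
from fastapi import FastAPI

api = FastAPI()

@api.get("/health")
def health():
    # Stub: extend with real database-connectivity and cache checks as listed above.
    return {"status": "ok"}

demo = gr.Interface(fn=lambda text: text, inputs="text", outputs="text")
app = gr.mount_gradio_app(api, demo, path="/")
# Run with: uvicorn app_module:app --host 0.0.0.0 --port 7860  (module name is illustrative)
```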
#### Logging
- Use structured logging (structlog)
- Log levels: DEBUG (dev), INFO (prod)
- Monitor error rates
- Track performance metrics
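A minimal structlog setup matching the points above; the processor chain shown is illustrative, not the project's confirmed configuration:
```python
import logging
import os
import structlog

# Map LOG_LEVEL from the environment (DEBUG in dev, INFO in prod) onto structlog's filtering.
level = getattr(logging, os.environ.get("LOG_LEVEL", "INFO").upper(), logging.INFO)

structlog.configure(
    wrapper_class=structlog.make_filtering_bound_logger(level),
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),
    ],
)

log = structlog.get_logger()
log.info("request_served", latency_ms=120, cache_hit=True)  # example structured event
```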
### Troubleshooting
#### Common Issues
**Issue**: Out of memory errors
- **Solution**: Reduce max_workers, implement request queuing
**Issue**: Slow responses
- **Solution**: Enable aggressive caching, use streaming
**Issue**: Model loading failures
- **Solution**: Use HF Inference API instead of local models
**Issue**: Session data loss
- **Solution**: Implement proper persistence with SQLite backup
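For the session-loss issue, a minimal sketch of SQLite-backed persistence with a simple backup copy; the table schema and function names are assumptions:
```python
import shutil
import sqlite3
import time

DB_PATH = "sessions.db"  # matches DB_PATH in the environment variables above

def init_db() -> None:
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions "
            "(id TEXT PRIMARY KEY, data TEXT, updated_at REAL)"
        )

def save_session(session_id: str, data_json: str) -> None:
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            "INSERT OR REPLACE INTO sessions VALUES (?, ?, ?)",
            (session_id, data_json, time.time()),
        )

def backup_db() -> None:
    # Plain file copy; on Spaces the filesystem is ephemeral, so a persistent
    # target (e.g. a dataset repo or mounted storage) would be needed in practice.
    shutil.copy2(DB_PATH, DB_PATH + ".bak")
```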
### Scaling Considerations
#### For Production
1. **Horizontal Scaling**: Deploy multiple instances
2. **Caching Layer**: Add Redis for shared session data (see the sketch after this list)
3. **Load Balancing**: Use HF Spaces built-in load balancer
4. **CDN**: Static assets via CDN
5. **Database**: Consider PostgreSQL for production
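For the caching layer in item 2, a sketch of a shared Redis session store; the host, key scheme, and TTL are placeholders:
```python
import json
import redis  # redis-py client

# Hypothetical shared store; any app instance behind the load balancer can read it.
store = redis.Redis(host="redis", port=6379, db=0, decode_responses=True)

def save_session(session_id: str, data: dict, ttl_s: int = 3600) -> None:
    store.setex(f"session:{session_id}", ttl_s, json.dumps(data))

def load_session(session_id: str) -> dict | None:
    raw = store.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```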
#### Migration Path
- **Phase 1**: MVP on ZeroGPU (current)
- **Phase 2**: Upgrade to GPU for local models
- **Phase 3**: Scale to multiple workers
- **Phase 4**: Enterprise deployment with managed infrastructure