# Deployment Notes
## Hugging Face Spaces Deployment
### NVIDIA T4 Medium Configuration
This MVP is optimized for **NVIDIA T4 Medium** GPU deployment on Hugging Face Spaces.
#### Hardware Specifications
- **GPU**: NVIDIA T4 (persistent, always available)
- **vCPU**: 8 cores
- **RAM**: 30GB
- **vRAM**: 16GB
- **Storage**: ~20GB
- **Network**: Shared infrastructure
#### Resource Capacity
- **GPU Memory**: 16GB vRAM (enough for the FP16 7B model plus auxiliary models)
- **System Memory**: 30GB RAM (excellent for caching and processing)
- **CPU**: 8 vCPU (good for parallel operations)
### Environment Variables
Required environment variables for deployment:
```bash
HF_TOKEN=your_huggingface_token_here
HF_HOME=/tmp/huggingface
MAX_WORKERS=4
CACHE_TTL=3600
DB_PATH=sessions.db
FAISS_INDEX_PATH=embeddings.faiss
SESSION_TIMEOUT=3600
MAX_SESSION_SIZE_MB=10
MOBILE_MAX_TOKENS=800
MOBILE_TIMEOUT=15000
GRADIO_PORT=7860
GRADIO_HOST=0.0.0.0
LOG_LEVEL=INFO
```
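The application is expected to read these values at startup. A minimal loader sketch (the variable names come from the list above; the loader itself is illustrative, not the project's actual code):
```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer environment variable, falling back to a default."""
    try:
        return int(os.getenv(name, default))
    except ValueError:
        return default

HF_TOKEN = os.getenv("HF_TOKEN")  # required; fail fast if missing
if not HF_TOKEN:
    raise RuntimeError("HF_TOKEN must be set before deployment")

MAX_WORKERS = env_int("MAX_WORKERS", 4)
CACHE_TTL = env_int("CACHE_TTL", 3600)
SESSION_TIMEOUT = env_int("SESSION_TIMEOUT", 3600)
DB_PATH = os.getenv("DB_PATH", "sessions.db")
```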
### Space Configuration
Create a `README.md` in the HF Space with:
```yaml
---
title: AI Research Assistant MVP
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
---
```
### Deployment Steps
1. **Clone/Setup Repository**
```bash
git clone your-repo
cd Research_Assistant
```
2. **Install Dependencies**
```bash
bash install.sh
# or
pip install -r requirements.txt
```
3. **Test Installation** (a smoke-test sketch follows these steps)
```bash
python test_setup.py
# or
bash quick_test.sh
```
4. **Run Locally**
```bash
python app.py
```
5. **Deploy to HF Spaces**
- Push to GitHub
- Connect to HF Spaces
- Select NVIDIA T4 Medium GPU hardware
- Deploy
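The test script itself isn't reproduced here; a minimal smoke test in the spirit of step 3 might look like this (illustrative only):
```python
import os
import torch

def main() -> None:
    assert os.getenv("HF_TOKEN"), "HF_TOKEN is not set"
    # On T4 Medium hardware this should report a CUDA device.
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GiB VRAM")
    else:
        print("No GPU detected; inference will fall back to the HF Inference API")

if __name__ == "__main__":
    main()
```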
### Resource Management
#### Memory Limits
- **Base Python**: ~100MB
- **Gradio**: ~50MB
- **Models (loaded on GPU)**: ~14-16GB vRAM
- Primary model (Qwen/Qwen2.5-7B-Instruct): ~14GB
- Embedding model: ~500MB
- Classification models: ~500MB each
- **System RAM**: ~2-4GB for caching and processing
- **Cache**: ~500MB-1GB max
**GPU Memory Budget**: ~16GB vRAM (the FP16 models fit, with limited headroom)
**System RAM Budget**: 30GB (plenty of headroom)
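A hypothetical helper for verifying this budget at runtime with PyTorch's built-in memory counters:
```python
import torch

def log_vram_usage() -> None:
    """Print current GPU memory usage against the card's total."""
    if not torch.cuda.is_available():
        return
    total = torch.cuda.get_device_properties(0).total_memory
    reserved = torch.cuda.memory_reserved(0)    # held by the CUDA allocator
    allocated = torch.cuda.memory_allocated(0)  # actually in use by tensors
    gib = 1024 ** 3
    print(f"VRAM: {allocated / gib:.1f} GiB allocated, "
          f"{reserved / gib:.1f} GiB reserved, {total / gib:.1f} GiB total")
```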
#### Strategies
- **Local GPU Model Loading**: Models loaded on GPU for faster inference
- **Lazy Loading**: Models loaded on-demand to speed up startup
- **GPU Memory Management**: Automatic device placement with FP16 precision (see the sketch below)
- **Caching**: Aggressive caching with 30GB RAM available
- **Response Streaming**: Reduces peak memory during generation
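A minimal sketch of the lazy FP16 loading strategy using the `transformers` API (the module-level cache and function name are illustrative, not the project's actual code):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

_MODEL_CACHE: dict = {}

def get_model(model_id: str = "Qwen/Qwen2.5-7B-Instruct"):
    """Load the model on first use only (lazy loading), in FP16 on the GPU."""
    if model_id not in _MODEL_CACHE:
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16,  # FP16 halves VRAM vs. FP32
            device_map="auto",          # automatic placement (requires accelerate)
        )
        _MODEL_CACHE[model_id] = (tokenizer, model)
    return _MODEL_CACHE[model_id]
```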
### Performance Optimization
#### For NVIDIA T4 GPU
1. **Local Model Loading**: Models run locally on GPU (faster than API)
- Primary model: Qwen/Qwen2.5-7B-Instruct (~14GB vRAM)
- Embedding model: sentence-transformers/all-MiniLM-L6-v2 (~500MB)
2. **GPU Acceleration**: All inference runs on GPU
3. **Parallel Processing**: 4 workers (MAX_WORKERS=4) for concurrent requests
4. **Fallback to API**: Automatically falls back to the HF Inference API if local models fail (see the sketch after this list)
5. **Request Queuing**: Built-in async request handling
6. **Response Streaming**: Implemented for efficient memory usage
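A sketch of the local-first / API-fallback pattern in item 4, using `huggingface_hub.InferenceClient`; the `generate_locally` helper is a hypothetical stand-in for the local GPU path:
```python
import os
from huggingface_hub import InferenceClient

def generate_locally(prompt: str, max_new_tokens: int) -> str:
    # Hypothetical stand-in for local GPU inference (see the loading sketch above).
    raise NotImplementedError

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    try:
        return generate_locally(prompt, max_new_tokens)
    except Exception:
        # Fall back to the hosted HF Inference API when local inference fails.
        client = InferenceClient(
            model="Qwen/Qwen2.5-7B-Instruct",
            token=os.getenv("HF_TOKEN"),
        )
        return client.text_generation(prompt, max_new_tokens=max_new_tokens)
```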
#### Mobile Optimizations
- Reduce max tokens to 800 (`MOBILE_MAX_TOKENS=800`)
- Shorten timeout to 15s (`MOBILE_TIMEOUT=15000`); a sketch of applying these limits follows this list
- Implement progressive loading
- Use touch-optimized UI
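A hypothetical helper applying these limits (the user-agent heuristic and the desktop defaults are assumptions):
```python
import os

def generation_params(user_agent: str) -> dict:
    """Pick generation limits based on a simple mobile user-agent check."""
    is_mobile = any(token in user_agent for token in ("Mobile", "Android", "iPhone"))
    if is_mobile:
        return {
            "max_new_tokens": int(os.getenv("MOBILE_MAX_TOKENS", 800)),
            "timeout": int(os.getenv("MOBILE_TIMEOUT", 15000)) / 1000,  # ms -> s
        }
    # Desktop defaults below are illustrative, not from the project config.
    return {"max_new_tokens": 2048, "timeout": 60.0}
```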
### Monitoring
#### Health Checks
- Application health endpoint: `/health` (see the sketch after this list)
- Database connectivity check
- Cache hit rate monitoring
- Response time tracking
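Assuming the Gradio app is mounted on FastAPI (a common pattern via `gr.mount_gradio_app`), a `/health` route with a database connectivity check might look like this; `demo` stands in for the app's actual Blocks object:
```python
import sqlite3
import gradio as gr
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health() -> dict:
    # Shallow database connectivity check: open and close the session store.
    try:
        sqlite3.connect("sessions.db").close()
        db_ok = True
    except sqlite3.Error:
        db_ok = False
    return {"status": "ok" if db_ok else "degraded", "database": db_ok}

with gr.Blocks() as demo:  # placeholder UI
    gr.Markdown("AI Research Assistant")

app = gr.mount_gradio_app(app, demo, path="/")
# run with: uvicorn app:app --host 0.0.0.0 --port 7860
```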
#### Logging
- Use structured logging (structlog); a configuration sketch follows this list
- Log levels: DEBUG (dev), INFO (prod)
- Monitor error rates
- Track performance metrics
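One common `structlog` setup consistent with these points (the processor chain is an assumption, not the project's actual configuration):
```python
import logging
import os
import structlog

log_level = getattr(logging, os.getenv("LOG_LEVEL", "INFO"))

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),  # machine-parseable for monitoring
    ],
    wrapper_class=structlog.make_filtering_bound_logger(log_level),
)

log = structlog.get_logger()
log.info("request_completed", route="/health", duration_ms=12)
```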
### Troubleshooting
#### Common Issues
**Issue**: Out of memory errors
- **Solution**: Reduce max_workers, implement request queuing
**Issue**: Slow responses
- **Solution**: Enable aggressive caching, use streaming
**Issue**: Model loading failures
- **Solution**: Use HF Inference API instead of local models
**Issue**: Session data loss
- **Solution**: Implement proper persistence with SQLite backups (see the sketch below)
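For the session-persistence fix, the standard library's `sqlite3` backup API can copy the live database; a minimal sketch (the paths are illustrative):
```python
import sqlite3

def backup_sessions(src_path: str = "sessions.db",
                    dst_path: str = "sessions.backup.db") -> None:
    """Copy the live session database to a backup file."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    with dst:
        src.backup(dst)  # copies the full database, safe on a live connection
    src.close()
    dst.close()
```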
### Scaling Considerations
#### For Production
1. **Horizontal Scaling**: Deploy multiple instances
2. **Caching Layer**: Add Redis for shared session data (see the sketch after this list)
3. **Load Balancing**: Use HF Spaces built-in load balancer
4. **CDN**: Static assets via CDN
5. **Database**: Consider PostgreSQL for production
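For item 2, a minimal Redis session store sketch using `redis-py` (the key naming and the reuse of `SESSION_TIMEOUT` as the TTL are assumptions):
```python
import json
import os
import redis

r = redis.Redis(host=os.getenv("REDIS_HOST", "localhost"), port=6379)
TTL = int(os.getenv("SESSION_TIMEOUT", 3600))

def save_session(session_id: str, data: dict) -> None:
    """Store session data with an expiry so stale sessions self-clean."""
    r.setex(f"session:{session_id}", TTL, json.dumps(data))

def load_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```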
#### Migration Path
- **Phase 1**: MVP on T4 Medium with local models (current)
- **Phase 2**: Upgrade to larger GPU hardware for larger models
- **Phase 3**: Scale to multiple workers
- **Phase 4**: Enterprise deployment with managed infrastructure