# Deployment Notes
## Hugging Face Spaces Deployment
### NVIDIA T4 Medium Configuration
This MVP is optimized for **NVIDIA T4 Medium** GPU deployment on Hugging Face Spaces.
#### Hardware Specifications
- **GPU**: NVIDIA T4 (persistent, always available)
- **vCPU**: 8 cores
- **RAM**: 30GB
- **vRAM**: 16GB
- **Storage**: ~20GB
- **Network**: Shared infrastructure
#### Resource Capacity
- **GPU Memory**: 16GB vRAM (sufficient for local model loading)
- **System Memory**: 30GB RAM (ample headroom for caching and processing)
- **CPU**: 8 vCPU (supports parallel request handling)
### Environment Variables
Required environment variables for deployment:
```bash
HF_TOKEN=your_huggingface_token_here
HF_HOME=/tmp/huggingface
MAX_WORKERS=4
CACHE_TTL=3600
DB_PATH=sessions.db
FAISS_INDEX_PATH=embeddings.faiss
SESSION_TIMEOUT=3600
MAX_SESSION_SIZE_MB=10
MOBILE_MAX_TOKENS=800
MOBILE_TIMEOUT=15000
GRADIO_PORT=7860
GRADIO_HOST=0.0.0.0
LOG_LEVEL=INFO
```
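One lightweight way to consume these variables in the application is a small settings object read once at startup. The sketch below is a hypothetical helper (the `Settings` class is not part of the existing codebase); it only makes the expected types and defaults explicit. On HF Spaces, `HF_TOKEN` should be set as a Space secret rather than committed to the repository.
```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings holder mirroring the variables listed above."""
    hf_token: str = os.getenv("HF_TOKEN", "")
    hf_home: str = os.getenv("HF_HOME", "/tmp/huggingface")
    max_workers: int = int(os.getenv("MAX_WORKERS", "4"))
    cache_ttl: int = int(os.getenv("CACHE_TTL", "3600"))
    db_path: str = os.getenv("DB_PATH", "sessions.db")
    faiss_index_path: str = os.getenv("FAISS_INDEX_PATH", "embeddings.faiss")
    session_timeout: int = int(os.getenv("SESSION_TIMEOUT", "3600"))
    max_session_size_mb: int = int(os.getenv("MAX_SESSION_SIZE_MB", "10"))
    mobile_max_tokens: int = int(os.getenv("MOBILE_MAX_TOKENS", "800"))
    mobile_timeout_ms: int = int(os.getenv("MOBILE_TIMEOUT", "15000"))
    gradio_port: int = int(os.getenv("GRADIO_PORT", "7860"))
    gradio_host: str = os.getenv("GRADIO_HOST", "0.0.0.0")
    log_level: str = os.getenv("LOG_LEVEL", "INFO")


settings = Settings()
```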
### Space Configuration
Create a `README.md` in the HF Space with:
```yaml
---
title: AI Research Assistant MVP
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: apache-2.0
---
```
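The Space can also be created programmatically instead of through the web UI. This is a sketch using `huggingface_hub`; the repo ID is a placeholder, and `space_hardware="t4-medium"` is assumed to be the identifier for the T4 Medium tier.
```python
from huggingface_hub import HfApi

api = HfApi(token="your_huggingface_token_here")  # same token as HF_TOKEN above

api.create_repo(
    repo_id="your-username/ai-research-assistant-mvp",  # placeholder Space name
    repo_type="space",
    space_sdk="docker",          # matches `sdk: docker` in the front matter above
    space_hardware="t4-medium",  # assumed identifier for the T4 Medium tier
    exist_ok=True,
)
```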
### Deployment Steps
1. **Clone/Setup Repository**
```bash
git clone your-repo
cd Research_Assistant
```
2. **Install Dependencies**
```bash
bash install.sh
# or
pip install -r requirements.txt
```
3. **Test Installation**
```bash
python test_setup.py
# or
bash quick_test.sh
```
4. **Run Locally**
```bash
python app.py
```
5. **Deploy to HF Spaces** (a scripted alternative is sketched after these steps)
   - Push to GitHub
   - Connect to HF Spaces
   - Select NVIDIA T4 Medium GPU hardware
   - Deploy
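As an alternative to pushing through GitHub, the working directory can be uploaded straight to the Space with `huggingface_hub`. The sketch assumes the Space already exists (see the previous section); the repo ID is again a placeholder and the ignore patterns are only examples.
```python
from huggingface_hub import HfApi

api = HfApi(token="your_huggingface_token_here")
repo_id = "your-username/ai-research-assistant-mvp"  # placeholder Space name

# Upload the repository contents as a single commit to the Space.
api.upload_folder(
    folder_path=".",
    repo_id=repo_id,
    repo_type="space",
    ignore_patterns=["*.db", "*.faiss", "__pycache__/*"],  # example exclusions
)

# Request the GPU tier; "t4-medium" is assumed to be the T4 Medium identifier.
api.request_space_hardware(repo_id=repo_id, hardware="t4-medium")
```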
### Resource Management
#### Memory Limits
- **Base Python**: ~100MB
- **Gradio**: ~50MB
- **Models (loaded on GPU)**: ~14-16GB vRAM
  - Primary model (Qwen/Qwen2.5-7B): ~14GB
  - Embedding model: ~500MB
  - Classification models: ~500MB each
- **System RAM**: ~2-4GB for caching and processing
- **Cache**: ~500MB-1GB max
**GPU Memory Budget**: ~16GB vRAM (the FP16 primary model plus the auxiliary models fit, with limited headroom)
**System RAM Budget**: 30GB (plenty of headroom)
#### Strategies
- **Local GPU Model Loading**: Models are loaded on the GPU for faster inference
- **Lazy Loading**: Models are loaded on demand to speed up startup (see the sketch after this list)
- **GPU Memory Management**: Automatic device placement with FP16 precision
- **Caching**: Aggressive caching, with 30GB of system RAM available
- **Response Streaming**: Reduces memory use during generation
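A minimal sketch of the lazy-loading and FP16 placement strategy, assuming `transformers`, `torch`, and `accelerate` are installed; the function name is illustrative rather than the project's actual loader.
```python
from functools import lru_cache

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # primary model named in this document


@lru_cache(maxsize=1)
def get_model_and_tokenizer():
    """Load the primary model once, on first use, in FP16 on the GPU."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,  # roughly halves vRAM vs FP32 (~14GB for the 7B model)
        device_map="auto",          # let accelerate place the weights on the T4
    )
    return model, tokenizer
```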
### Performance Optimization
#### For the NVIDIA T4 GPU
1. **Local Model Loading**: Models run locally on the GPU (faster than the API)
   - Primary model: Qwen/Qwen2.5-7B-Instruct (~14GB vRAM)
   - Embedding model: sentence-transformers/all-MiniLM-L6-v2 (~500MB)
2. **GPU Acceleration**: All inference runs on the GPU
3. **Parallel Processing**: 4 workers (`MAX_WORKERS=4`) for concurrent requests
4. **Fallback to API**: Automatically falls back to the HF Inference API if local models fail
5. **Request Queuing**: Built-in async request handling
6. **Response Streaming**: Implemented for efficient memory usage (see the sketch after this list)
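One possible shape for response streaming using `transformers`' `TextIteratorStreamer`, assuming the model and tokenizer from the loader sketched earlier; the helper name is illustrative.
```python
from threading import Thread

from transformers import TextIteratorStreamer


def stream_reply(model, tokenizer, prompt: str, max_new_tokens: int = 800):
    """Yield the response incrementally instead of buffering it all in memory."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    worker = Thread(
        target=model.generate,
        kwargs=dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens),
    )
    worker.start()
    for chunk in streamer:
        yield chunk  # hand each text fragment to the UI as it is produced
    worker.join()
```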
#### Mobile Optimizations
- Reduce max tokens to 800 (`MOBILE_MAX_TOKENS`; see the sketch after this list)
- Shorten the request timeout to 15s (`MOBILE_TIMEOUT`)
- Implement progressive loading
- Use a touch-optimized UI
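A small sketch of wiring the mobile limits to the environment variables listed earlier; client detection and the desktop defaults are assumptions, not part of the existing code.
```python
import os

MOBILE_MAX_TOKENS = int(os.getenv("MOBILE_MAX_TOKENS", "800"))
MOBILE_TIMEOUT_S = int(os.getenv("MOBILE_TIMEOUT", "15000")) / 1000


def generation_limits(is_mobile: bool) -> tuple[int, float]:
    """Return (max_new_tokens, timeout_seconds) for the current client."""
    if is_mobile:
        return MOBILE_MAX_TOKENS, MOBILE_TIMEOUT_S
    return 2048, 60.0  # assumed desktop defaults; not specified in this document
```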
### Monitoring
#### Health Checks
- Application health endpoint: `/health` (see the sketch after this list)
- Database connectivity check
- Cache hit rate monitoring
- Response time tracking
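One way to expose `/health` next to the Gradio UI is to mount the Gradio app on a FastAPI instance (FastAPI and uvicorn are already Gradio dependencies). The checks below are placeholders; the real Blocks object from `app.py` would replace `demo`.
```python
import sqlite3

import gradio as gr
import uvicorn
from fastapi import FastAPI

app = FastAPI()


@app.get("/health")
def health():
    """Liveness probe: verify the session DB is reachable; extend with cache and latency checks."""
    try:
        sqlite3.connect("sessions.db").execute("SELECT 1")
        return {"status": "ok", "database": "up"}
    except sqlite3.Error:
        return {"status": "degraded", "database": "down"}


demo = gr.Interface(fn=lambda text: text, inputs="text", outputs="text")  # placeholder UI
app = gr.mount_gradio_app(app, demo, path="/")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=7860)
```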
#### Logging
- Use structured logging (structlog; see the sketch after this list)
- Log levels: DEBUG (dev), INFO (prod)
- Monitor error rates
- Track performance metrics
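A minimal structlog setup that respects the `LOG_LEVEL` variable; the processor chain is one reasonable choice, not the project's existing configuration.
```python
import logging
import os

import structlog

LOG_LEVEL = getattr(logging, os.getenv("LOG_LEVEL", "INFO"))

structlog.configure(
    wrapper_class=structlog.make_filtering_bound_logger(LOG_LEVEL),
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer(),  # machine-readable lines for later analysis
    ],
)

log = structlog.get_logger()
log.info("request_served", latency_ms=120, cache_hit=True)  # example structured event
```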
### Troubleshooting
#### Common Issues
**Issue**: Out-of-memory errors
- **Solution**: Reduce `MAX_WORKERS` and queue incoming requests
**Issue**: Slow responses
- **Solution**: Enable aggressive caching and stream responses
**Issue**: Model loading failures
- **Solution**: Fall back to the HF Inference API instead of local models
**Issue**: Session data loss
- **Solution**: Persist sessions properly and back up the SQLite database (see the sketch after this list)
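For the session-loss case, Python's built-in `sqlite3` online backup API can copy the live database without stopping the app; the function and file names below are illustrative, following the `DB_PATH` default above.
```python
import sqlite3


def backup_sessions(db_path: str = "sessions.db", backup_path: str = "sessions.backup.db") -> None:
    """Copy the live session database to a backup file while the app keeps running."""
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(backup_path)
    try:
        with dst:
            src.backup(dst)  # SQLite's online backup API; safe against concurrent writes
    finally:
        dst.close()
        src.close()
```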
### Scaling Considerations
#### For Production
1. **Horizontal Scaling**: Deploy multiple instances
2. **Caching Layer**: Add Redis for shared session data (see the sketch after this list)
3. **Load Balancing**: Use HF Spaces built-in load balancing
4. **CDN**: Serve static assets via a CDN
5. **Database**: Consider PostgreSQL for production
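A sketch of what the shared Redis session cache could look like with `redis-py`; the host name is a placeholder and the TTL mirrors `SESSION_TIMEOUT`.
```python
import json

import redis

# Placeholder connection details; a managed Redis instance would live outside the Space.
r = redis.Redis(host="redis.example.internal", port=6379, decode_responses=True)

SESSION_TTL = 3600  # mirrors SESSION_TIMEOUT


def save_session(session_id: str, data: dict) -> None:
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps(data))


def load_session(session_id: str) -> dict | None:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```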
#### Migration Path
- **Phase 1**: MVP on a single NVIDIA T4 Medium Space (current)
- **Phase 2**: Upgrade to larger GPU hardware as model requirements grow
- **Phase 3**: Scale to multiple workers
- **Phase 4**: Enterprise deployment with managed infrastructure