🧠 Antonio Gemma3 Evo Q4 – Self-Learning AI for Raspberry Pi
Antonio Gemma3 Evo Q4 is not just another quantized LLM. It's a self-learning micro-intelligence with EvoMemory™, RAG-Lite, and auto-evolution capabilities, optimized for Raspberry Pi 4 & 5 and tested for 24/7 production deployment.
- Version: v0.5.0 (NEW: Adaptive Prompting)
- Author: Antonio Consales (antconsales)
- Base Model: Google Gemma 3 1B IT
💻 Production-Ready for Raspberry Pi
- ✅ Tested on Raspberry Pi 4 (4GB) – 3.32 t/s sustained (100% reliable over 60 minutes)
- ✅ Fully offline – no external APIs, no internet required
- ✅ Self-learning – EvoMemory™ saves neurons from every conversation
- ✅ Bilingual – seamlessly switches between Italian and English
- ✅ 24/7 deployment tested – zero failures in a 60-minute soak test
🌟 What Makes It Special
Unlike traditional quantized models, Antonio Gemma3 Evo Q4 learns and evolves:
- 🧬 EvoMemory™ – Saves "neurons" with input, output, confidence, and mood
- 🔍 RAG-Lite – Retrieves past experiences using BM25 (no FAISS!)
- 🎯 Self-evaluation – Assigns a confidence score (0-1) to every response
- 🌱 Auto-evolution – Generates new reasoning rules from accumulated neurons
- 🔒 100% Offline – Runs completely locally on a Raspberry Pi 4 (4GB RAM)
- 🌍 Bilingual – Auto-detects IT/EN and responds in the same language
- ⚡ Fast – 3.32 tokens/s sustained on a Pi 4 with Q4_K_M quantization
- 🎯 NEW: Adaptive Prompting – Smart question classification (SIMPLE/COMPLEX/CODE/CREATIVE) for a 3.6x speedup on simple queries (a minimal sketch follows below)
"The little brain that grows with you" π§
📊 Benchmark Results (Updated Oct 21, 2025)
Complete 60-minute soak test on Raspberry Pi 4 (4GB RAM) with Ollama.
Production Metrics
| Metric | Value | Status | 
|---|---|---|
| Sustained throughput | 3.32 t/s (256 tokens) | ✅ Production-ready |
| Reliability | 100% (455/455 requests) | ✅ Perfect |
| Avg response time | 7.92 s | ✅ Consistent |
| Thermal stability | 70.2 °C avg (max 73.5 °C) | ✅ No throttling |
| Memory usage | 42% (1.6 GB) | ✅ No leaks |
| Uptime tested | 60+ minutes continuous | ✅ 24/7-ready |
Performance by Token Count
| Tokens | Run 1 (t/s) | Run 2 (t/s) | Run 3 (t/s) | Average (t/s)* |
|---|---|---|---|---|
| 128 | 0.24 | 3.44 | 3.45 | 3.45 |
| 256 | 3.43 | 3.09 | 3.43 | 3.32 |
| 512 | 2.76 | 2.77 | 2.11 | 2.55 |
*Averages exclude the cold start (the very first run of the session: run 1 of the 128-token row)
Model Comparison
| Model | Size | Sustained Speed | Recommended For | 
|---|---|---|---|
| Q4_K_M ⭐ | 769 MB | 3.32 t/s | Production (tested 60 min, 100% reliable) |
| Q4_0 | 687 MB | 3.45 t/s | Development (faster but less stable) | 
📊 View Complete Benchmark Report – Full performance, reliability, and stability analysis
Benchmark Methodology
- Duration: 60.1 minutes (3,603 seconds)
- Total requests: 455 (one request every 2 seconds; a reproduction sketch follows this list)
- Platform: Raspberry Pi 4 (4GB RAM, ARM Cortex-A72 @ 1.5GHz)
- Runtime: Ollama 0.3.x + llama.cpp backend
- Monitoring: CPU temp, RAM usage, load average (sampled every 5s)
- Tasks: Performance (128/256/512 tokens), Quality (HellaSwag, ARC, TruthfulQA), Robustness (soak test)
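A soak test along these lines can be reproduced against Ollama's local REST API. This is a hedged sketch of such a harness, not the original benchmark script; the 2-second cadence, 455-request count, and temperature sampling mirror the methodology above.

```python
# Minimal soak-test sketch (not the original harness): hit Ollama's local
# /api/generate endpoint every 2 seconds and log the Pi's SoC temperature.
import json, time, urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "antconsales/antonio-gemma3-evo-q4"

def generate(prompt: str, num_predict: int = 256) -> dict:
    body = json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": num_predict},
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def cpu_temp_c() -> float:
    # Raspberry Pi exposes the SoC temperature in millidegrees Celsius.
    with open("/sys/class/thermal/thermal_zone0/temp") as f:
        return int(f.read()) / 1000.0

for i in range(455):  # ~60 minutes at a 2-second cadence
    r = generate("Describe the Raspberry Pi in one sentence.")
    tps = r["eval_count"] / (r["eval_duration"] / 1e9)  # tokens per second
    print(f"req {i}: {tps:.2f} t/s, {cpu_temp_c():.1f} °C")
    time.sleep(2)
```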
Recommendation: Use Q4_K_M for production deployments (proven 100% reliability over 60 minutes). Use Q4_0 for development/testing if you need slightly faster inference.
🧩 Available Models
This repository contains two quantization variants:
- gemma3-1b-q4_0.gguf (≈687 MB) – faster, 3% higher throughput, suitable for development
- gemma3-1b-q4_k_m.gguf (≈769 MB) – better quality, production-tested for 60+ minutes
🎯 Important: Two Usage Modes
Mode 1: Ollama Only (Simple Inference) ⚡
Download the GGUF model and run with Ollama:
```bash
ollama pull antconsales/antonio-gemma3-evo-q4
ollama run antconsales/antonio-gemma3-evo-q4
```
What you get:
- ✅ Fast inference (3.32 t/s on Pi 4)
- ✅ Bilingual chat (IT/EN)
- ✅ Offline, privacy-first
- ❌ NO EvoMemory (doesn't save conversations)
- ❌ NO RAG (doesn't retrieve past experiences)
- ❌ NO auto-evolution (doesn't generate rules)
Best for: Quick tests, one-off questions, simple chatbot
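If you would rather call Mode 1 from Python than from the shell, the official `ollama` package (`pip install ollama`) talks to the same local server. A quick sketch:

```python
# Quick chat call through the ollama Python package (pip install ollama).
# Assumes the model has already been pulled as shown above.
import ollama

reply = ollama.chat(
    model="antconsales/antonio-gemma3-evo-q4",
    messages=[{"role": "user", "content": "Spiegami cos'è un LED"}],
)
print(reply["message"]["content"])  # the model should answer in Italian
```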
Mode 2: Full Evolution Stack (Self-Learning) 🧠
For EvoMemory™, RAG-Lite, and auto-evolution, use the full Python stack from GitHub:
```bash
git clone https://github.com/antconsales/antonio-gemma3-evo-q4.git
cd antonio-gemma3-evo-q4
bash scripts/install.sh
uvicorn api.server:app --host 0.0.0.0 --port 8000
```
What you get:
- ✅ EvoMemory™ – Saves neurons from every conversation
- ✅ RAG-Lite – Retrieves past experiences (BM25)
- ✅ Auto-evolution – Generates reasoning rules over time
- ✅ Confidence scoring – Knows when it's uncertain
- ✅ FastAPI server – REST + WebSocket endpoints
Comparison:
| Feature | Ollama Only | Full Stack | 
|---|---|---|
| Inference speed | 3.32 t/s | 3.32 t/s | 
| Learns from chats | ❌ | ✅ EvoMemory |
| Retrieves memories | ❌ | ✅ RAG-Lite |
| Generates rules | ❌ | ✅ Auto-evolution |
| API endpoints | ❌ | ✅ FastAPI |
| Setup time | 1 min | 5 min | 
🛠️ Quick Start Options
Option 1: Ollama Only (see Mode 1 above)
Option 2: Load Directly from GGUF
```bash
# Download model from HuggingFace
wget https://huggingface.co/chill123/antonio-gemma3-evo-q4/resolve/main/gemma3-1b-q4_k_m.gguf

# Create Modelfile
cat > Modelfile <<'EOF'
FROM ./gemma3-1b-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 1024
PARAMETER num_thread 4
PARAMETER repeat_penalty 1.05
PARAMETER stop "<end_of_turn>"
PARAMETER stop "</s>"
SYSTEM """You are Antonio, an offline AI assistant running on a Raspberry Pi. You MUST detect the user's language and respond in the SAME language:
- If the user writes in Italian, respond ONLY in Italian
- If the user writes in English, respond ONLY in English
You are helpful, friendly, and concise. When you're uncertain, you admit it instead of guessing."""
EOF

# Create and run model
ollama create antonio-evo -f Modelfile
ollama run antonio-evo
```
🚀 Quick Start with Full Evolution Stack
For the complete self-learning system with EvoMemory™, RAG-Lite, and auto-evolution:
```bash
# Clone the full project
git clone https://github.com/antconsales/antonio-gemma3-evo-q4.git
cd antonio-gemma3-evo-q4

# Install and run
bash scripts/install.sh
python -m api.server
```
Visit http://localhost:8000/docs for the interactive API documentation (a hedged request sketch follows the feature list below).
Features of the full stack:
- EvoMemory™ SQLite database
- RAG-Lite with BM25 search
- Confidence auto-evaluation
- Rule regeneration (auto-evolution)
- FastAPI server with WebSocket support
- MCP-compatible tool system
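The concrete routes are whatever your instance exposes at /docs. Purely as an illustration, here is what a call might look like; the /chat path and the payload fields are assumptions, not the project's documented API:

```python
# Hypothetical REST call to the full-stack server started above.
# The /chat route and field names are assumptions; consult
# http://localhost:8000/docs for the actual schema.
import json, urllib.request

body = json.dumps({"message": "Accendi il LED rosso"}).encode()
req = urllib.request.Request(
    "http://localhost:8000/chat",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))  # e.g. answer text plus a confidence score
```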
💡 Key Features
1️⃣ EvoMemory™ – Living Memory
Every conversation creates a neuron:
```json
{
  "id": 123,
  "input_text": "Accendi il LED rosso",
  "output_text": "OK, attivo GPIO 17 su HIGH",
  "confidence": 0.85,
  "mood": "positive",
  "user_feedback": 1,
  "skill_id": "gpio_control",
  "timestamp": "2025-10-21T14:30:00Z"
}
```
Features:
- Auto-pruning of low-confidence old neurons
- Neuron compression (similar patterns → meta-neurons)
- Context-aware retrieval via hash matching
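On disk, a neuron like the one above maps naturally onto a single SQLite table. The schema below is a sketch inferred from the JSON fields, not the project's actual DDL:

```python
# EvoMemory-style storage sketch, inferred from the neuron JSON above.
# The real schema in the repository may differ.
import sqlite3

conn = sqlite3.connect("evomemory.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS neurons (
    id            INTEGER PRIMARY KEY,
    input_text    TEXT NOT NULL,
    output_text   TEXT NOT NULL,
    confidence    REAL,           -- self-evaluated, 0 to 1
    mood          TEXT,
    user_feedback INTEGER,        -- e.g. -1 / 0 / 1
    skill_id      TEXT,
    timestamp     TEXT            -- ISO 8601
)""")
conn.execute(
    "INSERT INTO neurons (input_text, output_text, confidence, mood,"
    " user_feedback, skill_id, timestamp) VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("Accendi il LED rosso", "OK, attivo GPIO 17 su HIGH",
     0.85, "positive", 1, "gpio_control", "2025-10-21T14:30:00Z"),
)
# Auto-pruning idea: drop old, low-confidence neurons.
conn.execute(
    "DELETE FROM neurons WHERE confidence < 0.3"
    " AND timestamp < datetime('now', '-30 days')"
)
conn.commit()
```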
2️⃣ RAG-Lite – Pure Python BM25
No FAISS, no ChromaDB. Just:
- BM25 scoring (Okapi formula)
- SQLite full-text search (FTS5)
- Top-K retrieval with confidence threshold
- Zero external dependencies
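For reference, the Okapi BM25 score mentioned above fits comfortably in pure Python. A self-contained sketch with the textbook k1/b defaults follows; the project's exact scorer and tokenizer may differ:

```python
# Self-contained Okapi BM25 scorer: pure Python, no FAISS or ChromaDB.
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    corpus = [d.lower().split() for d in docs]   # naive whitespace tokens
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    terms = query.lower().split()
    df = {t: sum(1 for d in corpus if t in d) for t in terms}
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        s = 0.0
        for t in terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = k1 * (1 - b + b * len(doc) / avgdl)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        scores.append(s)
    return scores

docs = ["turn on the red LED", "what time is it", "set GPIO 17 HIGH"]
print(bm25_scores("red LED GPIO", docs))
```

Top-K retrieval is then just an argsort over these scores, with a minimum-score cutoff playing the role of the confidence threshold.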
3️⃣ Auto-Evolution – Rule Regeneration
Every N conversations (configurable), Antonio:
- Analyzes high-confidence neurons
- Extracts reasoning patterns
- Generates new rules (e.g., "If the user asks a time-sensitive question → check recency")
- Saves rules to instinct.json
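Conceptually, one regeneration pass looks like the sketch below; the confidence threshold, support count, and instinct.json layout are assumptions for illustration, not the project's code:

```python
# Hypothetical rule-regeneration pass: distill high-confidence neurons
# into one coarse rule per skill and persist them to instinct.json.
import json
from collections import defaultdict

def regenerate_rules(neurons, min_confidence=0.8, min_support=3):
    by_skill = defaultdict(list)
    for n in neurons:
        if n["confidence"] >= min_confidence and n.get("user_feedback", 0) >= 0:
            by_skill[n["skill_id"]].append(n)
    return {
        "rules": [
            {"skill": skill, "support": len(group),
             "rule": f"Reuse proven high-confidence patterns for '{skill}'"}
            for skill, group in by_skill.items() if len(group) >= min_support
        ]
    }

neurons = [
    {"skill_id": "gpio_control", "confidence": 0.90, "user_feedback": 1},
    {"skill_id": "gpio_control", "confidence": 0.85, "user_feedback": 1},
    {"skill_id": "gpio_control", "confidence": 0.88, "user_feedback": 0},
]
with open("instinct.json", "w") as f:
    json.dump(regenerate_rules(neurons), f, indent=2)
```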
🏗️ Architecture
```
┌─────────────────────────────────────────────┐
│  Antonio Gemma3 Evo Q4 - Evolution Layer    │
├─────────────────────────────────────────────┤
│                                             │
│  ┌──────────────┐      ┌─────────────────┐  │
│  │ EvoMemory™   │◄────►│  RAG-Lite       │  │
│  │ (SQLite)     │      │  BM25 Search    │  │
│  └──────────────┘      └─────────────────┘  │
│         ▲                      ▲            │
│         │                      │            │
│  ┌──────┴──────────────────────┴─────────┐  │
│  │    Inference Engine (llama.cpp)       │  │
│  │    • Q4_0 / Q4_K_M                    │  │
│  │    • Optimized for Pi 4               │  │
│  └───────────────────────────────────────┘  │
│         ▲                      │            │
│         │                      ▼            │
│  ┌──────┴─────────┐    ┌──────────────────┐ │
│  │ Action Broker  │    │  Confidence      │ │
│  │ (MCP-ready)    │    │  Auto-Eval       │ │
│  └────────────────┘    └──────────────────┘ │
│                                             │
│  FastAPI Server (REST + WebSocket)          │
└─────────────────────────────────────────────┘
```
🎯 Use Cases
Recommended for:
- ✅ Home AI assistants (24/7 operation)
- ✅ IoT edge inference (low power budget)
- ✅ Offline chatbots (privacy-first)
- ✅ Educational projects (affordable hardware)
- ✅ Voice assistants (bilingual IT/EN)
- ✅ Self-learning experiments (neuron/rule evolution)
Not recommended for:
- ❌ Real-time applications (<500 ms latency)
- ❌ Batch processing (CPU-bound, single-threaded)
- ❌ High concurrency (>5 simultaneous users)
🔗 Links
- GitHub (Full Stack): https://github.com/antconsales/antonio-gemma3-evo-q4
- Ollama: https://ollama.com/antconsales/antonio-gemma3-evo-q4
- HuggingFace: https://huggingface.co/chill123/antonio-gemma3-evo-q4
- Donate: https://www.paypal.com/donate/?business=58ML44FNPK66Y&currency_code=EUR
📄 License
This model is licensed under the Gemma License Agreement (inherits from base model).
The evolution stack code (EvoMemory™, RAG-Lite, etc.) is dual-licensed:
- Gemma License (model weights)
- MIT License (Python code)
See LICENSE for details.
Built with ❤️ for offline AI and edge computing
"Il piccolo cervello che cresce insieme a te" β Antonio Gemma3 Evo
Support ethical, local, and independent AI. Every donation helps Antonio Gemma grow and evolve. 🙏