🧠 Antonio Gemma3 Evo Q4 – Self-Learning AI for Raspberry Pi

Antonio Gemma3 Evo Q4 is not just another quantized LLM. It's a self-learning micro-intelligence with EvoMemory™, RAG-Lite, and auto-evolution capabilities, optimized for Raspberry Pi 4 & 5 and tested for 24/7 production deployment.

Version: v0.5.0 (NEW: Adaptive Prompting)
Author: Antonio Consales (antconsales)
Base Model: Google Gemma 3 1B IT


💻 Production-Ready for Raspberry Pi

✅ Tested on Raspberry Pi 4 (4GB) – 3.32 t/s sustained (100% reliable over 60 minutes)
✅ Fully offline – no external APIs, no internet required
✅ Self-learning – EvoMemory™ saves neurons from every conversation
✅ Bilingual – seamlessly switches between Italian and English
✅ 24/7 deployment tested – zero failures in the 60-minute soak test


πŸ” What Makes It Special

Unlike traditional quantized models, Antonio Gemma3 Evo Q4 learns and evolves:

  • 🧬 EvoMemory™ – Saves "neurons" with input, output, confidence, and mood
  • 🔍 RAG-Lite – Retrieves past experiences using BM25 (no FAISS!)
  • 🎯 Self-evaluation – Assigns a confidence score (0-1) to every response
  • 🌱 Auto-evolution – Generates new reasoning rules from accumulated neurons
  • 🔒 100% Offline – Runs completely locally on a Raspberry Pi 4 (4GB RAM)
  • 🌐 Bilingual – Auto-detects IT/EN and responds in the same language
  • ⚡ Fast – 3.32 tokens/s sustained on Pi 4 with Q4_K_M quantization
  • 🎯 NEW: Adaptive Prompting – Smart question classification (SIMPLE/COMPLEX/CODE/CREATIVE) for a 3.6x speedup on simple queries; see the sketch after this list
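
To make the adaptive-prompting idea concrete, here is a minimal Python sketch of the kind of classifier it implies. The keyword heuristics, token budgets, and function names are illustrative assumptions, not the project's actual implementation.

```python
# Hypothetical sketch of adaptive prompting: classify the question, then cap
# the token budget accordingly. Category names come from the feature list
# above; the heuristics and budgets below are illustrative assumptions.
import re

BUDGETS = {"SIMPLE": 64, "COMPLEX": 256, "CODE": 256, "CREATIVE": 192}

def classify_question(text: str) -> str:
    t = text.lower()
    if re.search(r"\b(code|function|script|debug|python|bash)\b", t):
        return "CODE"
    if re.search(r"\b(story|poem|imagine|racconta)\b", t):
        return "CREATIVE"
    # Short factual questions take the fast path; the claimed 3.6x speedup
    # on simple queries comes from emitting far fewer tokens for them.
    if len(t.split()) <= 12 and not re.search(r"\b(why|how|explain|perche)\b", t):
        return "SIMPLE"
    return "COMPLEX"

def token_budget(text: str) -> int:
    return BUDGETS[classify_question(text)]

print(classify_question("Che ore sono?"))  # -> SIMPLE (fast path)
```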

"The little brain that grows with you" 🧠


📊 Benchmark Results (Updated Oct 21, 2025)

Complete 60-minute soak test on Raspberry Pi 4 (4GB RAM) with Ollama.

Production Metrics

| Metric | Value | Status |
|--------|-------|--------|
| Sustained throughput | 3.32 t/s (256 tokens) | ✅ Production-ready |
| Reliability | 100% (455/455 requests) | ✅ Perfect |
| Avg response time | 7.92 s | ✅ Consistent |
| Thermal stability | 70.2°C avg (max 73.5°C) | ✅ No throttling |
| Memory usage | 42% (1.6 GB) | ✅ No leaks |
| Uptime tested | 60+ minutes continuous | ✅ 24/7 ready |

Performance by Token Count

| Tokens | Run 1 (t/s) | Run 2 (t/s) | Run 3 (t/s) | Average* |
|--------|-------------|-------------|-------------|----------|
| 128 | 0.24 | 3.44 | 3.45 | 3.45 t/s |
| 256 | 3.43 | 3.09 | 3.43 | 3.32 t/s |
| 512 | 2.76 | 2.77 | 2.11 | 2.55 t/s |

*The 128-token average excludes the cold-start first run (0.24 t/s); the 256- and 512-token rows average all three runs.

Model Comparison

| Model | Size | Sustained Speed | Recommended For |
|-------|------|-----------------|-----------------|
| Q4_K_M ⭐ | 769 MB | 3.32 t/s | Production (tested 60 min, 100% reliable) |
| Q4_0 | 687 MB | 3.45 t/s | Development (faster but less stable) |

📊 View Complete Benchmark Report – Full performance, reliability, and stability analysis

Benchmark Methodology

  • Duration: 60.1 minutes (3,603 seconds)
  • Total requests: 455 (2-second interval; see the loop sketch below)
  • Platform: Raspberry Pi 4 (4GB RAM, ARM Cortex-A72 @ 1.5GHz)
  • Runtime: Ollama 0.3.x + llama.cpp backend
  • Monitoring: CPU temp, RAM usage, load average (sampled every 5s)
  • Tasks: Performance (128/256/512 tokens), Quality (HellaSwag, ARC, TruthfulQA), Robustness (soak test)
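
A minimal reproduction of this request loop might look like the sketch below. It assumes Ollama's documented /api/generate endpoint on localhost:11434 and the Raspberry Pi's standard thermal-zone sysfs file; the prompt and logging details are placeholders.

```python
# Minimal soak-test sketch: one request every 2 seconds against the local
# Ollama API, logging throughput and SoC temperature. /api/generate and its
# eval_count/eval_duration fields follow Ollama's documented schema.
import json, time, urllib.request

MODEL = "antconsales/antonio-gemma3-evo-q4"
URL = "http://localhost:11434/api/generate"

def generate(prompt: str, num_predict: int = 256) -> dict:
    body = json.dumps({"model": MODEL, "prompt": prompt, "stream": False,
                       "options": {"num_predict": num_predict}}).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def cpu_temp_c() -> float:
    # Raspberry Pi OS reports the SoC temperature in millidegrees Celsius.
    with open("/sys/class/thermal/thermal_zone0/temp") as f:
        return int(f.read()) / 1000.0

for _ in range(455):  # 455 requests, as in the 60-minute run above
    r = generate("Spiega in una frase cos'è un LED.")
    tps = r["eval_count"] / (r["eval_duration"] / 1e9)  # ns -> tokens/s
    print(f"{tps:.2f} t/s  temp={cpu_temp_c():.1f} C")
    time.sleep(2)  # 2-second interval between requests
```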

Recommendation: Use Q4_K_M for production deployments (proven 100% reliability over 60 minutes). Use Q4_0 for development/testing if you need slightly faster inference.


🧩 Available Models

This repository contains two quantization variants:

  • gemma3-1b-q4_0.gguf (≈687 MB) – Faster, ≈4% higher throughput (3.45 vs 3.32 t/s), suitable for development
  • gemma3-1b-q4_k_m.gguf (≈769 MB) – Better quality, production-tested for 60+ minutes

🎯 Important: Two Usage Modes

Mode 1: Ollama Only (Simple Inference) ⚡

Download the GGUF model and run with Ollama:

```bash
ollama pull antconsales/antonio-gemma3-evo-q4
ollama run antconsales/antonio-gemma3-evo-q4
```

What you get:

  • ✅ Fast inference (3.32 t/s on Pi 4)
  • ✅ Bilingual chat (IT/EN)
  • ✅ Offline, privacy-first
  • ❌ NO EvoMemory (doesn't save conversations)
  • ❌ NO RAG (doesn't retrieve past experiences)
  • ❌ NO auto-evolution (doesn't generate rules)

Best for: quick tests, one-off questions, and simple chatbots


Mode 2: Full Evolution Stack (Self-Learning) 🧠

For EvoMemory™, RAG-Lite, and auto-evolution, use the full Python stack from GitHub:

```bash
git clone https://github.com/antconsales/antonio-gemma3-evo-q4.git
cd antonio-gemma3-evo-q4
bash scripts/install.sh
uvicorn api.server:app --host 0.0.0.0 --port 8000
```

What you get:

  • ✅ EvoMemory™ – Saves neurons from every conversation
  • ✅ RAG-Lite – Retrieves past experiences (BM25)
  • ✅ Auto-evolution – Generates reasoning rules over time
  • ✅ Confidence scoring – Knows when it's uncertain
  • ✅ FastAPI server – REST + WebSocket endpoints (hedged client sketch below)
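
If you want to call the running server programmatically, something like the following should work. The /chat route and the field names are assumptions for illustration only; the real routes and schemas are listed at http://localhost:8000/docs.

```python
# Hedged client sketch for the full-stack FastAPI server. The "/chat" path
# and the request/response fields are hypothetical; check /docs for the
# actual API before using this.
import json, urllib.request

def ask(message: str) -> dict:
    body = json.dumps({"message": message}).encode()
    req = urllib.request.Request("http://localhost:8000/chat", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

reply = ask("Accendi il LED rosso")
print(reply)  # hypothetically: {"output": "...", "confidence": 0.85, "mood": "positive"}
```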

Comparison:

| Feature | Ollama Only | Full Stack |
|---------|-------------|------------|
| Inference speed | 3.32 t/s | 3.32 t/s |
| Learns from chats | ❌ | ✅ EvoMemory |
| Retrieves memories | ❌ | ✅ RAG-Lite |
| Generates rules | ❌ | ✅ Auto-evolution |
| API endpoints | ❌ | ✅ FastAPI |
| Setup time | 1 min | 5 min |

πŸ› οΈ Quick Start Options

Option 1: Ollama Only (see Mode 1 above)

Option 2: Load Directly from GGUF

```bash
# Download model from HuggingFace
wget https://huggingface.co/chill123/antonio-gemma3-evo-q4/resolve/main/gemma3-1b-q4_k_m.gguf

# Create Modelfile
cat > Modelfile <<'EOF'
FROM ./gemma3-1b-q4_k_m.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 1024
PARAMETER num_thread 4
PARAMETER repeat_penalty 1.05
PARAMETER stop "<end_of_turn>"
PARAMETER stop "</s>"

SYSTEM """You are Antonio, an offline AI assistant running on a Raspberry Pi. You MUST detect the user's language and respond in the SAME language:

- If the user writes in Italian, respond ONLY in Italian
- If the user writes in English, respond ONLY in English

You are helpful, friendly, and concise. When you're uncertain, you admit it instead of guessing."""
EOF

# Create and run model
ollama create antonio-evo -f Modelfile
ollama run antonio-evo
```

🚀 Quick Start with Full Evolution Stack

For the complete self-learning system with EvoMemory™, RAG-Lite, and auto-evolution:

```bash
# Clone the full project
git clone https://github.com/antconsales/antonio-gemma3-evo-q4.git
cd antonio-gemma3-evo-q4

# Install and run
bash scripts/install.sh
python -m api.server
```

Visit http://localhost:8000/docs for interactive API documentation.

Features of the full stack:

  • EvoMemory™ SQLite database
  • RAG-Lite with BM25 search
  • Confidence auto-evaluation
  • Rule regeneration (auto-evolution)
  • FastAPI server with WebSocket support
  • MCP-compatible tool system

💡 Key Features

1️⃣ EvoMemory™ – Living Memory

Every conversation creates a neuron (the example below stores an Italian exchange: "Turn on the red LED" / "OK, setting GPIO 17 HIGH"):

```json
{
  "id": 123,
  "input_text": "Accendi il LED rosso",
  "output_text": "OK, attivo GPIO 17 su HIGH",
  "confidence": 0.85,
  "mood": "positive",
  "user_feedback": 1,
  "skill_id": "gpio_control",
  "timestamp": "2025-10-21T14:30:00Z"
}
```

Features (a minimal storage sketch follows this list):

  • Auto-pruning of low-confidence old neurons
  • Neuron compression (similar patterns → meta-neurons)
  • Context-aware retrieval via hash matching
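
A minimal sketch of such a neuron store, using only the standard library's sqlite3 and mirroring the JSON fields above; the actual schema, pruning thresholds, and function names in the repo may differ.

```python
# EvoMemory-style neuron store sketch (schema and thresholds are assumptions).
import sqlite3

db = sqlite3.connect("evomemory.db")
db.execute("""CREATE TABLE IF NOT EXISTS neurons (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    input_text TEXT NOT NULL,
    output_text TEXT NOT NULL,
    confidence REAL,
    mood TEXT,
    user_feedback INTEGER,
    skill_id TEXT,
    timestamp TEXT DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ', 'now')))""")

def save_neuron(inp, out, confidence, mood="neutral", feedback=0, skill=None):
    db.execute("INSERT INTO neurons (input_text, output_text, confidence, mood,"
               " user_feedback, skill_id) VALUES (?, ?, ?, ?, ?, ?)",
               (inp, out, confidence, mood, feedback, skill))
    db.commit()

def prune(min_confidence=0.3, max_age_days=90):
    # Auto-pruning: drop neurons that are both old and low-confidence
    # (both thresholds here are illustrative assumptions).
    db.execute("DELETE FROM neurons WHERE confidence < ?"
               " AND datetime(timestamp) < datetime('now', ?)",
               (min_confidence, f"-{max_age_days} days"))
    db.commit()

save_neuron("Accendi il LED rosso", "OK, attivo GPIO 17 su HIGH",
            0.85, "positive", 1, "gpio_control")
```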

2️⃣ RAG-Lite – Pure Python BM25

No FAISS, no ChromaDB. Just:

  • BM25 scoring (Okapi formula)
  • SQLite full-text search (FTS5)
  • Top-K retrieval with confidence threshold
  • Zero external dependencies (see the retrieval sketch below)
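
Since SQLite's bundled FTS5 extension already implements Okapi BM25 via its built-in bm25() ranking function, the whole retrieval path can stay dependency-free. A minimal sketch, with table and column names as assumptions:

```python
# RAG-Lite retrieval sketch using SQLite's FTS5 full-text index and its
# bm25() ranking function (lower score = better match).
import sqlite3

db = sqlite3.connect("evomemory.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS neuron_fts"
           " USING fts5(input_text, output_text)")
db.execute("INSERT INTO neuron_fts VALUES"
           " ('Accendi il LED rosso', 'OK, attivo GPIO 17 su HIGH')")

def retrieve(query: str, k: int = 3):
    # Top-K past experiences, best BM25 score first.
    return db.execute(
        "SELECT input_text, output_text, bm25(neuron_fts) AS score"
        " FROM neuron_fts WHERE neuron_fts MATCH ? ORDER BY score LIMIT ?",
        (query, k)).fetchall()

print(retrieve("LED rosso"))
```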

3️⃣ Auto-Evolution – Rule Regeneration

Every N conversations (configurable), Antonio:

  1. Analyzes high-confidence neurons
  2. Extracts reasoning patterns
  3. Generates new rules (e.g., "If the user asks a time-sensitive question → check recency")
  4. Saves the rules to instinct.json (a hedged sketch of this step follows)
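
One way this step could look, reusing the neurons table from the EvoMemory sketch above. In the real system the model itself extracts the patterns; the confidence threshold and rule format here are illustrative assumptions.

```python
# Hedged sketch of rule regeneration: mine high-confidence neurons for
# recurring skills and append rules to instinct.json.
import json, sqlite3
from collections import Counter

def regenerate_rules(db_path="evomemory.db", out_path="instinct.json",
                     threshold=0.8, min_evidence=3):
    db = sqlite3.connect(db_path)
    rows = db.execute("SELECT skill_id FROM neurons"
                      " WHERE confidence >= ? AND skill_id IS NOT NULL",
                      (threshold,)).fetchall()
    # A skill that keeps producing confident answers becomes an "instinct".
    counts = Counter(skill for (skill,) in rows)
    rules = [{"if_skill": skill, "then": "prefer the cached strategy",
              "evidence": n}
             for skill, n in counts.most_common() if n >= min_evidence]
    with open(out_path, "w") as f:
        json.dump({"rules": rules}, f, indent=2)
    return rules
```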

πŸ—οΈ Architecture

```
┌─────────────────────────────────────────────┐
│  Antonio Gemma3 Evo Q4 - Evolution Layer    │
├─────────────────────────────────────────────┤
│                                             │
│  ┌──────────────┐      ┌─────────────────┐  │
│  │ EvoMemory™   │◄────►│ RAG-Lite        │  │
│  │ (SQLite)     │      │ BM25 Search     │  │
│  └──────────────┘      └─────────────────┘  │
│         ▲                      ▲            │
│         │                      │            │
│  ┌──────┴──────────────────────┴─────────┐  │
│  │    Inference Engine (llama.cpp)       │  │
│  │    • Q4_0 / Q4_K_M                    │  │
│  │    • Optimized for Pi 4               │  │
│  └───────────────────────────────────────┘  │
│         ▲                      │            │
│         │                      ▼            │
│  ┌──────┴─────────┐    ┌──────────────────┐ │
│  │ Action Broker  │    │ Confidence       │ │
│  │ (MCP-ready)    │    │ Auto-Eval        │ │
│  └────────────────┘    └──────────────────┘ │
│                                             │
│  FastAPI Server (REST + WebSocket)          │
└─────────────────────────────────────────────┘
```

🎯 Use Cases

Recommended for:

  • ✅ Home AI assistants (24/7 operation)
  • ✅ IoT edge inference (low power budget)
  • ✅ Offline chatbots (privacy-first)
  • ✅ Educational projects (affordable hardware)
  • ✅ Voice assistants (bilingual IT/EN)
  • ✅ Self-learning experiments (neuron/rule evolution)

Not recommended for:

  • ❌ Real-time applications (<500 ms latency)
  • ❌ Batch processing (CPU-bound, single-threaded)
  • ❌ High concurrency (>5 simultaneous users)


📄 License

This model is licensed under the Gemma License Agreement (inherits from base model).

The evolution stack code (EvoMemory™, RAG-Lite, etc.) is dual-licensed:

  • Gemma License (model weights)
  • MIT License (Python code)

See LICENSE for details.


Built with ❤️ for offline AI and edge computing

"Il piccolo cervello che cresce insieme a te" β€” Antonio Gemma3 Evo


Support ethical, local, and independent AI. Every donation helps Antonio Gemma grow and evolve. 💙
