# AIAlphaBet: Canaanite Semantic Compression

Interpretable 22-dimensional embeddings based on Canaanite pictographic letters.

Created by RockTalk Holdings.
## What This Does
AIAlphaBet compresses 384-dimensional sentence embeddings into 22 interpretable dimensions. Each dimension corresponds to one of the 22 Canaanite letters, which are ancient pictographs with concrete semantic meanings.
| Dimension | Letter | Name | Semantic Field |
|---|---|---|---|
| 0 | א | Aleph | ox, strength, leader, first |
| 1 | ב | Bet | house, inside, family, dwelling |
| 2 | ג | Gimel | camel, lift up, pride, benefit |
| 3 | ד | Dalet | door, pathway, movement, decision |
| 4 | ה | Hey | window, behold, reveal, breath |
| 5 | ו | Vav | hook, nail, and, connection |
| 6 | ז | Zayin | weapon, cut, nourish, time |
| 7 | ח | Chet | fence, protect, private, wall |
| 8 | ט | Tet | basket, surround, contain, good |
| 9 | י | Yod | hand, work, make, deed |
| 10 | כ | Kaf | palm, open, cover, allow, bless |
| 11 | ל | Lamed | staff, goad, teach, toward |
| 12 | מ | Mem | water, chaos, massive, from |
| 13 | נ | Nun | seed, fish, continue, offspring |
| 14 | ס | Samekh | prop, support, twist, cycle |
| 15 | ע | Ayin | eye, see, know, experience |
| 16 | פ | Pey | mouth, speak, edge, word |
| 17 | צ | Tsade | hunt, catch, righteous, desire |
| 18 | ק | Qoph | back of head, behind, horizon |
| 19 | ר | Resh | head, first, top, person |
| 20 | ש | Shin | teeth, fire, consume, repeat |
| 21 | ת | Tav | cross, mark, covenant, sign |
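The dimension table above is easy to turn into a lookup for reading raw model output. A minimal sketch, assuming nothing beyond the table itself; the variable name `CANAANITE_LETTERS` and the `gloss` helper are illustrative, not part of the AIAlphaBet API:

```python
# Illustrative encoding of the 22-dimension table: (letter, name, semantic field).
# Index order matches the dimension numbers above.
CANAANITE_LETTERS = [
    ("א", "Aleph",  "ox, strength, leader, first"),
    ("ב", "Bet",    "house, inside, family, dwelling"),
    ("ג", "Gimel",  "camel, lift up, pride, benefit"),
    ("ד", "Dalet",  "door, pathway, movement, decision"),
    ("ה", "Hey",    "window, behold, reveal, breath"),
    ("ו", "Vav",    "hook, nail, and, connection"),
    ("ז", "Zayin",  "weapon, cut, nourish, time"),
    ("ח", "Chet",   "fence, protect, private, wall"),
    ("ט", "Tet",    "basket, surround, contain, good"),
    ("י", "Yod",    "hand, work, make, deed"),
    ("כ", "Kaf",    "palm, open, cover, allow, bless"),
    ("ל", "Lamed",  "staff, goad, teach, toward"),
    ("מ", "Mem",    "water, chaos, massive, from"),
    ("נ", "Nun",    "seed, fish, continue, offspring"),
    ("ס", "Samekh", "prop, support, twist, cycle"),
    ("ע", "Ayin",   "eye, see, know, experience"),
    ("פ", "Pey",    "mouth, speak, edge, word"),
    ("צ", "Tsade",  "hunt, catch, righteous, desire"),
    ("ק", "Qoph",   "back of head, behind, horizon"),
    ("ר", "Resh",   "head, first, top, person"),
    ("ש", "Shin",   "teeth, fire, consume, repeat"),
    ("ת", "Tav",    "cross, mark, covenant, sign"),
]

def gloss(dim: int) -> str:
    """Return a human-readable label for one of the 22 dimensions."""
    letter, name, field = CANAANITE_LETTERS[dim]
    return f"{letter} {name}: {field}"

print(gloss(3))  # ד Dalet: door, pathway, movement, decision
```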
## Why This Matters

Standard embeddings (OpenAI, Cohere, etc.) are black boxes: 1536 dimensions with no interpretable meaning. AIAlphaBet produces embeddings where you can explain why two texts are similar by examining which Canaanite letters activate.
Use cases:
- Regulatory compliance: EU AI Act requires explainable AI for high-risk applications
- RAG debugging: Understand why documents were retrieved, not just that they were
- Semantic search: 70x compression vs. OpenAI embeddings (22D vs 1536D)
- Interpretable clustering: Group texts by their semantic letter patterns
## Benchmarks
| Metric | Value |
|---|---|
| Letter Prediction Accuracy | 94.7% |
| Reconstruction MSE | 0.0025 |
| Reconstruction Cosine Similarity | 0.28 |
| Training Examples | 4,590 |
| Compression Ratio | 17.5x (384D → 22D) |
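The two compression figures quoted in this card come from simple dimension ratios; the variable names below are just for illustration:

```python
# Arithmetic behind the quoted compression ratios.
base_ratio = 384 / 22     # vs. the 384D base embedder; ~17.45, reported as 17.5x
openai_ratio = 1536 / 22  # vs. a 1536D OpenAI embedding; ~69.8, reported as ~70x
print(round(base_ratio, 1), round(openai_ratio))
```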
## Interpretability Examples

**"blood of the covenant"**

- ב Bet (0.97) - house, dwelling, covenant
- ר Resh (0.85) - head, first
- → Matches Canaanite: ברית (brit, covenant)

**"fire"**

- ש Shin (0.96) - fire, consume
- א Aleph (0.75) - strength
- → Matches Canaanite: אש (esh, fire)

**"water"**

- מ Mem (0.89) - water, chaos
- → Matches Canaanite: מים (mayim, water)

**"father"**

- א Aleph (0.91) - strength, first
- ב Bet (0.87) - house
- → Matches Canaanite: אב (av, father)
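Readouts like the ones above are just the strongest entries of a 22-D activation vector. A library-free sketch of that ranking step; the `father` vector is invented for illustration and only its two leading activations mirror the example above:

```python
# Rank the 22 dimensions of an activation vector and keep the strongest few.
NAMES = ["Aleph", "Bet", "Gimel", "Dalet", "Hey", "Vav", "Zayin", "Chet",
         "Tet", "Yod", "Kaf", "Lamed", "Mem", "Nun", "Samekh", "Ayin",
         "Pey", "Tsade", "Qoph", "Resh", "Shin", "Tav"]

def top_letters(coords, k=2):
    """Return the k strongest (name, activation) pairs from a 22-D vector."""
    ranked = sorted(enumerate(coords), key=lambda p: p[1], reverse=True)
    return [(NAMES[i], round(v, 2)) for i, v in ranked[:k]]

# Hypothetical activation pattern for "father": Aleph and Bet dominate.
father = [0.91, 0.87] + [0.05] * 20
print(top_letters(father))  # [('Aleph', 0.91), ('Bet', 0.87)]
```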
## Installation

```bash
pip install torch sentence-transformers
```
## Usage

```python
import torch
from aialphabet import AIAlphaBet

# Load model
model = AIAlphaBet()

# Encode text to 22D Canaanite space
coords = model.encode("wisdom and understanding")
print(coords.shape)  # torch.Size([1, 22])

# Get interpretable letter activations
activations = model.interpret(coords, top_k=5)
for act in activations:
    print(f"{act.letter} {act.name}: {act.activation:.2f}")

# Full analysis with visualization
model.analyze("blood of the covenant")

# Semantic similarity in Canaanite space
sim = model.similarity("father", "house")
print(f"Similarity: {sim:.3f}")

# Reconstruct back to 384D
reconstructed = model.decode(coords)
print(reconstructed.shape)  # torch.Size([1, 384])
```
## Model Architecture

```
Text Input
    ↓
all-MiniLM-L6-v2 (384D)
    ↓
┌─────────────────────────┐
│         ENCODER         │
│   Linear(384, 128)      │
│   LayerNorm + GELU      │
│   Dropout(0.1)          │
│   Linear(128, 22)       │
│   Sigmoid               │
└─────────────────────────┘
    ↓
22D Canaanite Space [0, 1]
    ↓
┌─────────────────────────┐
│         DECODER         │
│   Linear(22, 128)       │
│   LayerNorm + GELU      │
│   Dropout(0.1)          │
│   Linear(128, 256)      │
│   LayerNorm + GELU      │
│   Dropout(0.1)          │
│   Linear(256, 384)      │
└─────────────────────────┘
    ↓
384D Reconstructed
```
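The diagram above maps directly onto PyTorch modules. A minimal sketch, assuming layers are applied in the listed order; the class names `CanaaniteEncoder`/`CanaaniteDecoder` are illustrative and the released `model.pt`/`decoder.pt` weights may use different module names:

```python
import torch
import torch.nn as nn

class CanaaniteEncoder(nn.Module):
    """384D sentence embedding -> 22D Canaanite space in [0, 1]."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(384, 128), nn.LayerNorm(128), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(128, 22), nn.Sigmoid(),  # squash activations into [0, 1]
        )

    def forward(self, x):
        return self.net(x)

class CanaaniteDecoder(nn.Module):
    """22D Canaanite space -> reconstructed 384D embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(22, 128), nn.LayerNorm(128), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(128, 256), nn.LayerNorm(256), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(256, 384),
        )

    def forward(self, z):
        return self.net(z)

enc, dec = CanaaniteEncoder().eval(), CanaaniteDecoder().eval()
z = enc(torch.randn(1, 384))
print(z.shape, dec(z).shape)  # torch.Size([1, 22]) torch.Size([1, 384])
```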
## Training Data
Trained on 4,590 examples extracted from Benner's Lexicon (793 pages):
- 380 parent roots
- 585 child roots
- 2,786 words
Each Canaanite root was paired with its English semantic field (action, object, abstract meanings), creating supervised training data for letter prediction.
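One natural target for the letter-prediction task described above is a 22-D multi-hot vector marking which letters spell a root. The exact training format is not published, so this is an assumed encoding; the `ALPHABET` string follows the dimension table's letter order:

```python
# Hypothetical letter-prediction target: 1s at the dimensions of the letters
# that appear in a Canaanite root, 0s elsewhere.
ALPHABET = "אבגדהוזחטיכלמנסעפצקרשת"  # 22 letters, in dimension order

def letter_target(root: str) -> list:
    """22-D multi-hot vector marking the letters present in a root."""
    target = [0] * 22
    for ch in root:
        target[ALPHABET.index(ch)] = 1
    return target

# אב (av, "father") should light up Aleph (dim 0) and Bet (dim 1).
print(letter_target("אב")[:4])  # [1, 1, 0, 0]
```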
## Files

| File | Size | Description |
|---|---|---|
| `model.pt` | 208 KB | Encoder weights (384D → 22D) |
| `decoder.pt` | 534 KB | Decoder weights (22D → 384D) |
| `config.json` | 1 KB | Architecture configuration |
| `aialphabet.py` | 8 KB | Inference class |
| `rag_demo.py` | 31 KB | RAG interpretability demo with causal tests |
| `demo_corpus.json` | 4 KB | Sample corpus (20 documents) |
## Limitations

- **English-centric:** Trained on English glosses of Canaanite roots. Works best on concepts with Canaanite etymological connections.
- **Modern concepts:** Abstract modern concepts ("computer", "blockchain") spread activation across many letters without a clear mapping.
- **Base embedder locked:** Uses all-MiniLM-L6-v2; changing embedders requires retraining.
## License
MIT License - use freely for any purpose.
## Citation

```bibtex
@software{aialphabet2025,
  author = {RockTalk Holdings},
  title  = {AIAlphaBet: Canaanite Semantic Compression},
  year   = {2025},
  url    = {https://huggingface.co/rocktalk/aialphabet}
}
```
## RAG Interpretability Demo
The package includes a full RAG demo with causal interpretability tests:
```bash
# Basic retrieval with explanations
python rag_demo.py --query "blood of the covenant"

# Side-by-side: Black Box vs AIAlphaBet
python rag_demo.py --compare --query "living water"

# Causal interpretability tests
python rag_demo.py --ablation      # What breaks when letters are removed?
python rag_demo.py --intervention  # Do rankings change with letter manipulation?
python rag_demo.py --causal        # Full causal interpretability report
```
## Causal Proof
When you query "I am the door", Dalet (ד - door) activates at 0.99. Remove Dalet from the query vector and similarity to door-related documents crashes from 0.67 to 0.11.
The letter isn't describing the match. It's causing it.
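The ablation mechanism is generic: zero out one dimension of the query vector and recompute cosine similarity. A self-contained sketch of that mechanism; the 22-D vectors here are invented for illustration, so only the direction of the effect mirrors the demo, not its exact 0.67 → 0.11 numbers:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def ablate(vec, dim):
    """Return a copy of vec with one Canaanite dimension zeroed out."""
    out = list(vec)
    out[dim] = 0.0
    return out

DALET = 3                                 # ד, the "door" dimension
query = [0.1] * 22; query[DALET] = 0.99   # invented vector for "I am the door"
doc   = [0.1] * 22; doc[DALET] = 0.95     # invented door-related document

before = cosine(query, doc)
after = cosine(ablate(query, DALET), doc)
print(before > after)  # True: removing Dalet collapses the similarity
```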
## Domain Fit
| Query Type | Retrieval | Explanation |
|---|---|---|
| Biblical/theological | Excellent | Strong, causal |
| Modern with Semitic roots | Good | Present but abstract |
| Pure modern (blockchain) | Degrades | Flags low confidence |
Best used as an interpretability layer on top of standard embeddings, or as a standalone for biblical/theological RAG.
## Links
- RockTalk Holdings
- GitHub (coming soon)