AIAlphaBet: Canaanite Semantic Compression

Interpretable 22-dimensional embeddings based on Canaanite pictographic letters.

Created by RockTalk Holdings

What This Does

AIAlphaBet compresses 384-dimensional sentence embeddings into 22 interpretable dimensions. Each dimension corresponds to one of the 22 Canaanite letters, which are ancient pictographs with concrete semantic meanings.

Dimension Letter Name Semantic Field
0 א Aleph ox, strength, leader, first
1 ב Bet house, inside, family, dwelling
2 ג Gimel camel, lift up, pride, benefit
3 ד Dalet door, pathway, movement, decision
4 ה Hey window, behold, reveal, breath
5 ו Vav hook, nail, and, connection
6 ז Zayin weapon, cut, nourish, time
7 ח Chet fence, protect, private, wall
8 ט Tet basket, surround, contain, good
9 י Yod hand, work, make, deed
10 כ Kaf palm, open, cover, allow, bless
11 ל Lamed staff, goad, teach, toward
12 מ Mem water, chaos, massive, from
13 נ Nun seed, fish, continue, offspring
14 ס Samekh prop, support, twist, cycle
15 ע Ayin eye, see, know, experience
16 פ Pey mouth, speak, edge, word
17 צ Tsade hunt, catch, righteous, desire
18 ק Qoph back of head, behind, horizon
19 ר Resh head, first, top, person
20 ש Shin teeth, fire, consume, repeat
21 ת Tav cross, mark, covenant, sign
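For programmatic use, the table above can be captured as a plain index-to-letter mapping. This is an illustrative data structure, not the package's own API; the aialphabet package may expose an equivalent mapping under a different name.

```python
# Dimension index -> (letter, name, semantic field) for the 22D Canaanite space.
# Illustrative sketch; the aialphabet package may ship its own mapping.
CANAANITE_LETTERS = [
    ("א", "Aleph", "ox, strength, leader, first"),
    ("ב", "Bet", "house, inside, family, dwelling"),
    ("ג", "Gimel", "camel, lift up, pride, benefit"),
    ("ד", "Dalet", "door, pathway, movement, decision"),
    ("ה", "Hey", "window, behold, reveal, breath"),
    ("ו", "Vav", "hook, nail, and, connection"),
    ("ז", "Zayin", "weapon, cut, nourish, time"),
    ("ח", "Chet", "fence, protect, private, wall"),
    ("ט", "Tet", "basket, surround, contain, good"),
    ("י", "Yod", "hand, work, make, deed"),
    ("כ", "Kaf", "palm, open, cover, allow, bless"),
    ("ל", "Lamed", "staff, goad, teach, toward"),
    ("מ", "Mem", "water, chaos, massive, from"),
    ("נ", "Nun", "seed, fish, continue, offspring"),
    ("ס", "Samekh", "prop, support, twist, cycle"),
    ("ע", "Ayin", "eye, see, know, experience"),
    ("פ", "Pey", "mouth, speak, edge, word"),
    ("צ", "Tsade", "hunt, catch, righteous, desire"),
    ("ק", "Qoph", "back of head, behind, horizon"),
    ("ר", "Resh", "head, first, top, person"),
    ("ש", "Shin", "teeth, fire, consume, repeat"),
    ("ת", "Tav", "cross, mark, covenant, sign"),
]

def describe_dimension(i: int) -> str:
    """Human-readable label for one coordinate of a 22D vector."""
    letter, name, field = CANAANITE_LETTERS[i]
    return f"dim {i}: {letter} {name} ({field})"

print(describe_dimension(3))  # dim 3: ד Dalet (door, pathway, movement, decision)
```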

Why This Matters

Standard embeddings (OpenAI, Cohere, etc.) are black boxes: hundreds or thousands of opaque dimensions (1536 for OpenAI's default models) with no interpretable meaning. AIAlphaBet produces embeddings where you can explain why two texts are similar by examining which Canaanite letters activate.

Use cases:

  • Regulatory compliance: the EU AI Act imposes transparency and explainability requirements on high-risk AI systems
  • RAG debugging: Understand why documents were retrieved, not just that they were
  • Semantic search: 70x compression vs. OpenAI embeddings (22D vs 1536D)
  • Interpretable clustering: Group texts by their semantic letter patterns
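The "explain why two texts are similar" claim reduces to simple arithmetic: in a low-dimensional space with named axes, the per-dimension products that sum to the dot product tell you which letters drive a similarity score. A minimal sketch with mock 22D activations (not real model output):

```python
import numpy as np

# Mock sketch: attribute a cosine similarity score to named dimensions.
# The activation values below are invented for illustration.
LETTER_NAMES = ["Aleph", "Bet", "Gimel", "Dalet", "Hey", "Vav", "Zayin",
                "Chet", "Tet", "Yod", "Kaf", "Lamed", "Mem", "Nun", "Samekh",
                "Ayin", "Pey", "Tsade", "Qoph", "Resh", "Shin", "Tav"]

def explain_similarity(a, b, top_k=3):
    """Cosine similarity plus the per-letter contributions behind it."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    contrib = a * b  # element-wise products; their sum is the dot product
    cos = contrib.sum() / (np.linalg.norm(a) * np.linalg.norm(b))
    top = np.argsort(contrib)[::-1][:top_k]
    return cos, [(LETTER_NAMES[i], float(contrib[i])) for i in top]

# "father" and "house" might both activate Bet (house) strongly:
father = np.zeros(22); father[[0, 1]] = [0.91, 0.87]  # Aleph, Bet
house = np.zeros(22); house[[1, 7]] = [0.95, 0.40]    # Bet, Chet
cos, top = explain_similarity(father, house)
print(f"cosine = {cos:.2f}, driven mostly by {top[0][0]}")
```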

Benchmarks

Metric Value
Letter Prediction Accuracy 94.7%
Reconstruction MSE 0.0025
Reconstruction Cosine Similarity 0.28
Training Examples 4,590
Compression Ratio 17.5x (384D → 22D)
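For reference, this is how the compression and reconstruction metrics in the table are conventionally defined (a sketch with toy data; the reported values come from the authors' evaluation set, not from this code):

```python
import numpy as np

def compression_ratio(d_in: int, d_out: int) -> float:
    """Input dimensionality over output dimensionality."""
    return d_in / d_out

def reconstruction_mse(original, reconstructed) -> float:
    """Mean squared error between an embedding and its decoded reconstruction."""
    original, reconstructed = np.asarray(original), np.asarray(reconstructed)
    return float(np.mean((original - reconstructed) ** 2))

print(f"{compression_ratio(384, 22):.1f}x")  # 17.5x, matching the table
```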

Interpretability Examples

"blood of the covenant"
  ב Bet (0.97)  - house, dwelling, covenant
  ר Resh (0.85) - head, first
  → Matches Canaanite: ברית (brit, covenant)

"fire"
  ש Shin (0.96) - fire, consume
  א Aleph (0.75) - strength
  → Matches Canaanite: אש (esh, fire)

"water"
  מ Mem (0.89) - water, chaos
  → Matches Canaanite: מים (mayim, water)

"father"
  א Aleph (0.91) - strength, first
  ב Bet (0.87) - house
  → Matches Canaanite: אב (av, father)

Installation

pip install torch sentence-transformers

Usage

import torch
from aialphabet import AIAlphaBet

# Load model
model = AIAlphaBet()

# Encode text to 22D Canaanite space
coords = model.encode("wisdom and understanding")
print(coords.shape)  # torch.Size([1, 22])

# Get interpretable letter activations
activations = model.interpret(coords, top_k=5)
for act in activations:
    print(f"{act.letter} {act.name}: {act.activation:.2f}")

# Full analysis with visualization
model.analyze("blood of the covenant")

# Semantic similarity in Canaanite space
sim = model.similarity("father", "house")
print(f"Similarity: {sim:.3f}")

# Reconstruct back to 384D
reconstructed = model.decode(coords)
print(reconstructed.shape)  # torch.Size([1, 384])

Model Architecture

Text Input
    ↓
all-MiniLM-L6-v2 (384D)
    ↓
┌─────────────────────────┐
│ ENCODER                 │
│ Linear(384, 128)        │
│ LayerNorm + GELU        │
│ Dropout(0.1)            │
│ Linear(128, 22)         │
│ Sigmoid                 │
└─────────────────────────┘
    ↓
22D Canaanite Space [0, 1]
    ↓
┌─────────────────────────┐
│ DECODER                 │
│ Linear(22, 128)         │
│ LayerNorm + GELU        │
│ Dropout(0.1)            │
│ Linear(128, 256)        │
│ LayerNorm + GELU        │
│ Dropout(0.1)            │
│ Linear(256, 384)        │
└─────────────────────────┘
    ↓
384D Reconstructed
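The diagram above maps directly onto a small PyTorch autoencoder. The sketch below follows the layer sizes and ordering shown; the shipped model.pt/decoder.pt weights may differ in module naming or implementation detail.

```python
import torch
import torch.nn as nn

class AIAlphaBetSketch(nn.Module):
    """Sketch of the encoder/decoder stack from the architecture diagram."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(384, 128), nn.LayerNorm(128), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(128, 22), nn.Sigmoid(),  # activations land in [0, 1]
        )
        self.decoder = nn.Sequential(
            nn.Linear(22, 128), nn.LayerNorm(128), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(128, 256), nn.LayerNorm(256), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(256, 384),
        )

    def forward(self, x):
        coords = self.encoder(x)           # 384D -> 22D Canaanite space
        return self.decoder(coords), coords

model = AIAlphaBetSketch().eval()
with torch.no_grad():
    recon, coords = model(torch.randn(1, 384))
print(coords.shape, recon.shape)  # torch.Size([1, 22]) torch.Size([1, 384])
```

The Sigmoid on the encoder output is what guarantees the [0, 1] range of the Canaanite coordinates shown in the diagram.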

Training Data

Trained on 4,590 examples extracted from Benner's Lexicon (793 pages):

  • 380 parent roots
  • 585 child roots
  • 2,786 words

Each Canaanite root was paired with its English semantic field (action, object, abstract meanings), creating supervised training data for letter prediction.
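A supervised example of this kind might pair a root's English gloss with a multi-hot target over the 22 letters the root contains. The field names and target encoding below are illustrative guesses, not the authors' actual data schema:

```python
# Hypothetical sketch of one training example: root gloss -> multi-hot letters.
ALPHABET = "אבגדהוזחטיכלמנסעפצקרשת"  # the 22 letters in dimension order

def letters_to_target(root: str) -> list[float]:
    """Multi-hot vector: 1.0 for each alphabet letter present in the root."""
    return [1.0 if letter in root else 0.0 for letter in ALPHABET]

example = {
    "root": "אב",                        # av, "father"
    "gloss": "father, strength of the house",
    "target": letters_to_target("אב"),   # 1s at Aleph (dim 0) and Bet (dim 1)
}
print(sum(example["target"]))  # 2.0
```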

Files

File Size Description
model.pt 208 KB Encoder weights (384D → 22D)
decoder.pt 534 KB Decoder weights (22D → 384D)
config.json 1 KB Architecture configuration
aialphabet.py 8 KB Inference class
rag_demo.py 31 KB RAG interpretability demo with causal tests
demo_corpus.json 4 KB Sample corpus (20 documents)

Limitations

  1. English-centric: Trained on English glosses of Canaanite roots. Works best on concepts with Canaanite etymological connections.

  2. Modern concepts: Abstract modern concepts ("computer", "blockchain") spread activation across many letters without clear mapping.

  3. Base embedder locked: Uses all-MiniLM-L6-v2. Changing embedders requires retraining.

License

MIT License - use freely for any purpose.

Citation

@software{aialphabet2025,
  author = {RockTalk Holdings},
  title = {AIAlphaBet: Canaanite Semantic Compression},
  year = {2025},
  url = {https://huggingface.co/rocktalk/aialphabet}
}

RAG Interpretability Demo

The package includes a full RAG demo with causal interpretability tests:

# Basic retrieval with explanations
python rag_demo.py --query "blood of the covenant"

# Side-by-side: Black Box vs AIAlphaBet
python rag_demo.py --compare --query "living water"

# Causal interpretability tests
python rag_demo.py --ablation        # What breaks when letters removed?
python rag_demo.py --intervention    # Do rankings change with letter manipulation?
python rag_demo.py --causal          # Full causal interpretability report

Causal Proof

When you query "I am the door", Dalet (ד - door) activates at 0.99. Remove Dalet from the query vector and similarity to door-related documents crashes from 0.67 to 0.11.

The letter isn't describing the match. It's causing it.
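The ablation itself is mechanically simple: zero one coordinate of the query vector and recompute similarity. A sketch with mock vectors (the 0.67/0.11 figures above come from the demo corpus, not from these invented numbers):

```python
import numpy as np

def cosine(a, b) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Mock activations: query "I am the door" and a door-related document,
# both dominated by Dalet (dimension index 3).
query = np.zeros(22); query[3] = 0.99; query[16] = 0.30  # Dalet, Pey
doc = np.zeros(22); doc[3] = 0.95; doc[10] = 0.20        # Dalet, Kaf

before = cosine(query, doc)
ablated = query.copy(); ablated[3] = 0.0                 # remove Dalet
after = cosine(ablated, doc)
print(f"before: {before:.2f}, after Dalet ablation: {after:.2f}")
```

If the letter were merely descriptive, removing it would leave the ranking intact; the collapse in similarity is the causal evidence.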

Domain Fit

Query Type Retrieval Explanation
Biblical/theological Excellent Strong, causal
Modern with Semitic roots Good Present but abstract
Pure modern (blockchain) Degrades Flags low confidence

Best used as an interpretability layer on top of standard embeddings, or as a standalone for biblical/theological RAG.
