AIAlphaBet: Canaanite Semantic Compression

Interpretable 22-dimensional embeddings based on Canaanite pictographic letters.

Created by RockTalk Holdings

What This Does

AIAlphaBet compresses 384-dimensional sentence embeddings into 22 interpretable dimensions. Each dimension corresponds to one of the 22 Canaanite letters, which are ancient pictographs with concrete semantic meanings.

Dimension Letter Name Semantic Field
0 א Aleph ox, strength, leader, first
1 ב Bet house, inside, family, dwelling
2 ג Gimel camel, lift up, pride, benefit
3 ד Dalet door, pathway, movement, decision
4 ה Hey window, behold, reveal, breath
5 ו Vav hook, nail, and, connection
6 ז Zayin weapon, cut, nourish, time
7 ח Chet fence, protect, private, wall
8 ט Tet basket, surround, contain, good
9 י Yod hand, work, make, deed
10 כ Kaf palm, open, cover, allow, bless
11 ל Lamed staff, goad, teach, toward
12 מ Mem water, chaos, massive, from
13 נ Nun seed, fish, continue, offspring
14 ס Samekh prop, support, twist, cycle
15 ע Ayin eye, see, know, experience
16 פ Pey mouth, speak, edge, word
17 צ Tsade hunt, catch, righteous, desire
18 ק Qoph back of head, behind, horizon
19 ר Resh head, first, top, person
20 ש Shin teeth, fire, consume, repeat
21 ת Tav cross, mark, covenant, sign
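For programmatic use, the table above can be captured as a plain index-to-letter mapping. This is an illustrative data structure, not the package's own API; the aialphabet package may expose an equivalent mapping under a different name.

```python
# Dimension index -> (letter, name, semantic field) for the 22D Canaanite space.
# Illustrative sketch; the aialphabet package may ship its own mapping.
CANAANITE_LETTERS = [
    ("א", "Aleph", "ox, strength, leader, first"),
    ("ב", "Bet", "house, inside, family, dwelling"),
    ("ג", "Gimel", "camel, lift up, pride, benefit"),
    ("ד", "Dalet", "door, pathway, movement, decision"),
    ("ה", "Hey", "window, behold, reveal, breath"),
    ("ו", "Vav", "hook, nail, and, connection"),
    ("ז", "Zayin", "weapon, cut, nourish, time"),
    ("ח", "Chet", "fence, protect, private, wall"),
    ("ט", "Tet", "basket, surround, contain, good"),
    ("י", "Yod", "hand, work, make, deed"),
    ("כ", "Kaf", "palm, open, cover, allow, bless"),
    ("ל", "Lamed", "staff, goad, teach, toward"),
    ("מ", "Mem", "water, chaos, massive, from"),
    ("נ", "Nun", "seed, fish, continue, offspring"),
    ("ס", "Samekh", "prop, support, twist, cycle"),
    ("ע", "Ayin", "eye, see, know, experience"),
    ("פ", "Pey", "mouth, speak, edge, word"),
    ("צ", "Tsade", "hunt, catch, righteous, desire"),
    ("ק", "Qoph", "back of head, behind, horizon"),
    ("ר", "Resh", "head, first, top, person"),
    ("ש", "Shin", "teeth, fire, consume, repeat"),
    ("ת", "Tav", "cross, mark, covenant, sign"),
]

def describe_dimension(i: int) -> str:
    """Human-readable label for one coordinate of a 22D vector."""
    letter, name, field = CANAANITE_LETTERS[i]
    return f"dim {i}: {letter} {name} ({field})"

print(describe_dimension(3))  # dim 3: ד Dalet (door, pathway, movement, decision)
```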

Why This Matters

Standard embeddings (OpenAI, Cohere, etc.) are black boxes: hundreds or thousands of opaque dimensions (1536 for OpenAI's default models) with no interpretable meaning. AIAlphaBet produces embeddings where you can explain why two texts are similar by examining which Canaanite letters activate.

Use cases:

  • Regulatory compliance: the EU AI Act imposes transparency and explainability requirements on high-risk AI systems
  • RAG debugging: Understand why documents were retrieved, not just that they were
  • Semantic search: 70x compression vs. OpenAI embeddings (22D vs 1536D)
  • Interpretable clustering: Group texts by their semantic letter patterns
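The "explain why two texts are similar" claim reduces to simple arithmetic: in a low-dimensional space with named axes, the per-dimension products that sum to the dot product tell you which letters drive a similarity score. A minimal sketch with mock 22D activations (not real model output):

```python
import numpy as np

# Mock sketch: attribute a cosine similarity score to named dimensions.
# The activation values below are invented for illustration.
LETTER_NAMES = ["Aleph", "Bet", "Gimel", "Dalet", "Hey", "Vav", "Zayin",
                "Chet", "Tet", "Yod", "Kaf", "Lamed", "Mem", "Nun", "Samekh",
                "Ayin", "Pey", "Tsade", "Qoph", "Resh", "Shin", "Tav"]

def explain_similarity(a, b, top_k=3):
    """Cosine similarity plus the per-letter contributions behind it."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    contrib = a * b  # element-wise products; their sum is the dot product
    cos = contrib.sum() / (np.linalg.norm(a) * np.linalg.norm(b))
    top = np.argsort(contrib)[::-1][:top_k]
    return cos, [(LETTER_NAMES[i], float(contrib[i])) for i in top]

# "father" and "house" might both activate Bet (house) strongly:
father = np.zeros(22); father[[0, 1]] = [0.91, 0.87]  # Aleph, Bet
house = np.zeros(22); house[[1, 7]] = [0.95, 0.40]    # Bet, Chet
cos, top = explain_similarity(father, house)
print(f"cosine = {cos:.2f}, driven mostly by {top[0][0]}")
```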

Benchmarks

Metric Value
Letter Prediction Accuracy 94.7%
Reconstruction MSE 0.0025
Reconstruction Cosine Similarity 0.28
Training Examples 4,590
Compression Ratio 17.5x (384D → 22D)
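For reference, this is how the compression and reconstruction metrics in the table are conventionally defined (a sketch with toy data; the reported values come from the authors' evaluation set, not from this code):

```python
import numpy as np

def compression_ratio(d_in: int, d_out: int) -> float:
    """Input dimensionality over output dimensionality."""
    return d_in / d_out

def reconstruction_mse(original, reconstructed) -> float:
    """Mean squared error between an embedding and its decoded reconstruction."""
    original, reconstructed = np.asarray(original), np.asarray(reconstructed)
    return float(np.mean((original - reconstructed) ** 2))

print(f"{compression_ratio(384, 22):.1f}x")  # 17.5x, matching the table
```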

Interpretability Examples

"blood of the covenant"
  ב Bet (0.97)  - house, dwelling, covenant
  ר Resh (0.85) - head, first
  → Matches Canaanite: ברית (brit, covenant)

"fire"
  ש Shin (0.96) - fire, consume
  א Aleph (0.75) - strength
  → Matches Canaanite: אש (esh, fire)

"water"
  מ Mem (0.89) - water, chaos
  → Matches Canaanite: מים (mayim, water)

"father"
  א Aleph (0.91) - strength, first
  ב Bet (0.87) - house
  → Matches Canaanite: אב (av, father)

Installation

pip install torch sentence-transformers

Usage

import torch
from aialphabet import AIAlphaBet

# Load model
model = AIAlphaBet()

# Encode text to 22D Canaanite space
coords = model.encode("wisdom and understanding")
print(coords.shape)  # torch.Size([1, 22])

# Get interpretable letter activations
activations = model.interpret(coords, top_k=5)
for act in activations:
    print(f"{act.letter} {act.name}: {act.activation:.2f}")

# Full analysis with visualization
model.analyze("blood of the covenant")

# Semantic similarity in Canaanite space
sim = model.similarity("father", "house")
print(f"Similarity: {sim:.3f}")

# Reconstruct back to 384D
reconstructed = model.decode(coords)
print(reconstructed.shape)  # torch.Size([1, 384])

Model Architecture

Text Input
    ↓
all-MiniLM-L6-v2 (384D)
    ↓
┌─────────────────────────┐
│ ENCODER                 │
│ Linear(384, 128)        │
│ LayerNorm + GELU        │
│ Dropout(0.1)            │
│ Linear(128, 22)         │
│ Sigmoid                 │
└─────────────────────────┘
    ↓
22D Canaanite Space [0, 1]
    ↓
┌─────────────────────────┐
│ DECODER                 │
│ Linear(22, 128)         │
│ LayerNorm + GELU        │
│ Dropout(0.1)            │
│ Linear(128, 256)        │
│ LayerNorm + GELU        │
│ Dropout(0.1)            │
│ Linear(256, 384)        │
└─────────────────────────┘
    ↓
384D Reconstructed
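The diagram above maps directly onto a small PyTorch autoencoder. The sketch below follows the layer sizes and ordering shown; the shipped model.pt/decoder.pt weights may differ in module naming or implementation detail.

```python
import torch
import torch.nn as nn

class AIAlphaBetSketch(nn.Module):
    """Sketch of the encoder/decoder stack from the architecture diagram."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(384, 128), nn.LayerNorm(128), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(128, 22), nn.Sigmoid(),  # activations land in [0, 1]
        )
        self.decoder = nn.Sequential(
            nn.Linear(22, 128), nn.LayerNorm(128), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(128, 256), nn.LayerNorm(256), nn.GELU(), nn.Dropout(0.1),
            nn.Linear(256, 384),
        )

    def forward(self, x):
        coords = self.encoder(x)           # 384D -> 22D Canaanite space
        return self.decoder(coords), coords

model = AIAlphaBetSketch().eval()
with torch.no_grad():
    recon, coords = model(torch.randn(1, 384))
print(coords.shape, recon.shape)  # torch.Size([1, 22]) torch.Size([1, 384])
```

The Sigmoid on the encoder output is what guarantees the [0, 1] range of the Canaanite coordinates shown in the diagram.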

Training Data

Trained on 4,590 examples extracted from Benner's Lexicon (793 pages):

  • 380 parent roots
  • 585 child roots
  • 2,786 words

Each Canaanite root was paired with its English semantic field (action, object, abstract meanings), creating supervised training data for letter prediction.
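A supervised example of this kind might pair a root's English gloss with a multi-hot target over the 22 letters the root contains. The field names and target encoding below are illustrative guesses, not the authors' actual data schema:

```python
# Hypothetical sketch of one training example: root gloss -> multi-hot letters.
ALPHABET = "אבגדהוזחטיכלמנסעפצקרשת"  # the 22 letters in dimension order

def letters_to_target(root: str) -> list[float]:
    """Multi-hot vector: 1.0 for each alphabet letter present in the root."""
    return [1.0 if letter in root else 0.0 for letter in ALPHABET]

example = {
    "root": "אב",                        # av, "father"
    "gloss": "father, strength of the house",
    "target": letters_to_target("אב"),   # 1s at Aleph (dim 0) and Bet (dim 1)
}
print(sum(example["target"]))  # 2.0
```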

Files

File Size Description
model.pt 208 KB Encoder weights (384D → 22D)
decoder.pt 534 KB Decoder weights (22D → 384D)
config.json 1 KB Architecture configuration
aialphabet.py 8 KB Inference class
rag_demo.py 31 KB RAG interpretability demo with causal tests
demo_corpus.json 4 KB Sample corpus (20 documents)

Limitations

  1. English-centric: Trained on English glosses of Canaanite roots. Works best on concepts with Canaanite etymological connections.

  2. Modern concepts: Abstract modern concepts ("computer", "blockchain") spread activation across many letters without clear mapping.

  3. Base embedder locked: Uses all-MiniLM-L6-v2. Changing embedders requires retraining.

License

MIT License - use freely for any purpose.

Citation

@software{aialphabet2025,
  author = {RockTalk Holdings},
  title = {AIAlphaBet: Canaanite Semantic Compression},
  year = {2025},
  url = {https://huggingface.co/rocktalk/aialphabet}
}

RAG Interpretability Demo

The package includes a full RAG demo with causal interpretability tests:

# Basic retrieval with explanations
python rag_demo.py --query "blood of the covenant"

# Side-by-side: Black Box vs AIAlphaBet
python rag_demo.py --compare --query "living water"

# Causal interpretability tests
python rag_demo.py --ablation        # What breaks when letters removed?
python rag_demo.py --intervention    # Do rankings change with letter manipulation?
python rag_demo.py --causal          # Full causal interpretability report

Causal Proof

When you query "I am the door", Dalet (ד - door) activates at 0.99. Remove Dalet from the query vector and similarity to door-related documents crashes from 0.67 to 0.11.

The letter isn't describing the match. It's causing it.
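The ablation itself is mechanically simple: zero one coordinate of the query vector and recompute similarity. A sketch with mock vectors (the 0.67/0.11 figures above come from the demo corpus, not from these invented numbers):

```python
import numpy as np

def cosine(a, b) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Mock activations: query "I am the door" and a door-related document,
# both dominated by Dalet (dimension index 3).
query = np.zeros(22); query[3] = 0.99; query[16] = 0.30  # Dalet, Pey
doc = np.zeros(22); doc[3] = 0.95; doc[10] = 0.20        # Dalet, Kaf

before = cosine(query, doc)
ablated = query.copy(); ablated[3] = 0.0                 # remove Dalet
after = cosine(ablated, doc)
print(f"before: {before:.2f}, after Dalet ablation: {after:.2f}")
```

If the letter were merely descriptive, removing it would leave the ranking intact; the collapse in similarity is the causal evidence.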

Domain Fit

Query Type Retrieval Explanation
Biblical/theological Excellent Strong, causal
Modern with Semitic roots Good Present but abstract
Pure modern (blockchain) Degrades Flags low confidence

Best used as an interpretability layer on top of standard embeddings, or as a standalone for biblical/theological RAG.
