# ByT5 Leetspeak Decoder V3 (Production)
The definitive byte-level translator for leetspeak, internet slang, and visual character obfuscation.
Built on google/byt5-base, V3 represents a major architectural shift from previous versions. It uses Curriculum Learning and Adversarial Filtering to resolve the context ambiguity between leetspeak numbers (e.g., "2" meaning "to") and literal quantities (e.g., "2 cats").
## Key Improvements in V3
| Feature | V2 (Legacy) | V3 (Current) |
|---|---|---|
| Mixed-Number Context | Struggled (~74%) | 100.0% Accuracy |
| Basic Leet Decoding | 85% | 100.0% Accuracy |
| Visual Obfuscation | Moderate | High (handles visual look-alike characters) |
| Output Style | Casual/Slang-heavy | Formal/Standard English |
| Final Eval Loss | 0.84 | 0.3812 |
The "Number Problem" Solved
V3 is the first model in this series to perfectly distinguish between numbers used as letters and numbers used as quantities within the same sentence.
- Input: `1t5 2 l8 4 2 people`
- V2 Output: `It's to late for to people.` (Fail)
- V3 Output: `It is too late for 2 people.` (Pass)
## Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "ilyyeees/byt5-leetspeak-decoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def decode_leet(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=256,
        num_beams=4,
        early_stopping=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test cases
print(decode_leet("1t5 2 l8 4 th4t"))
# Output: It is too late for that.

print(decode_leet("1 g0t 100 p01nt5 0n 1t"))
# Output: I got 100 points on it. (preserves the '100' but decodes the rest)

print(decode_leet("idk wh4t 2 d0 tbh"))
# Output: I don't know what to do to be honest. (expands abbreviations)
```
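For decoding many strings at once, the same setup can be batched and moved to a GPU when one is available. The sketch below reuses the `model` and `tokenizer` loaded above; the helper name `decode_leet_batch`, the batch size, and the device handling are illustrative additions, not part of the released code.

```python
import torch

def decode_leet_batch(texts, batch_size=16):
    """Decode a list of leetspeak strings in batches (illustrative sketch)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # ByT5 is byte-level, so no special preprocessing is needed;
        # padding aligns the variable-length byte sequences in the batch.
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=256,
                num_beams=4,
                early_stopping=True,
            )
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results

print(decode_leet_batch(["1t5 2 l8 4 th4t", "idk wh4t 2 d0 tbh"]))
```

The generation settings mirror the single-input example; dropping to `num_beams=1` is usually faster if throughput matters more than the last bit of quality.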
## Training Methodology

V3 was trained on 2x NVIDIA RTX 5090s using a custom Reverse-Corruption Pipeline (sketched after the list below):
1. **Clean Base:** High-quality English from WikiText and ELI5 to ground the model in correct grammar.
2. **LLM Adversarial Corruption:** Qwen 2.5 72B generated "Hard Negatives", specific leetspeak patterns that previous model versions failed to decode.
3. **Curriculum Learning:** The model was trained in phases of increasing difficulty, starting with simple character swaps and ending with complex visual noise and mixed-number ambiguity.
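To illustrate the reverse-corruption idea, the sketch below takes a clean sentence as the training target and programmatically corrupts it into a leetspeak input. The substitution maps, probabilities, and two-tier difficulty split are assumptions for illustration only; the actual pipeline used Qwen-generated adversarial corruptions and the multi-phase curriculum described above.

```python
import random

# Assumed substitution tiers for this sketch; not the actual training maps.
EASY_SWAPS = {"e": "3", "a": "4", "o": "0", "i": "1", "s": "5", "t": "7"}
HARD_SWAPS = {"to": "2", "too": "2", "for": "4", "late": "l8", "you": "u"}

def corrupt(sentence: str, p_char: float = 0.6, p_word: float = 0.8) -> str:
    """Corrupt clean English into leetspeak; the clean sentence stays as the label."""
    out = []
    for word in sentence.lower().split():
        if word in HARD_SWAPS and random.random() < p_word:
            out.append(HARD_SWAPS[word])  # word-level swap introduces number ambiguity
            continue
        out.append("".join(
            EASY_SWAPS[c] if c in EASY_SWAPS and random.random() < p_char else c
            for c in word
        ))
    return " ".join(out)

clean = "it is too late for that"
pair = {"input": corrupt(clean), "target": clean}  # one (corrupted, clean) training pair
print(pair)
```

In a curriculum setting, early phases would draw only from the easy character map, with the word-level ambiguities and visual noise introduced in later phases.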
#Limitations & Bias Formalization Bias: Because V3 was trained on high-quality datasets (Wiki/ELI5), it has a bias toward formal English. It may expand casual slang into formal prose (e.g., converting ngl to not gonna lie or idk to I don't know). It generally avoids outputting slang words like gonna or wanna unless strongly prompted.
Short Inputs: Extremely short, ambiguous inputs (1-2 characters) may be interpreted as standard English rather than leetspeak due to the conservative decoding threshold.
## Links

- GitHub Repository: ilyyeees/leet-speak-decoder
- V2 Model (Legacy): byt5-leetspeak-decoder-v2