# ByT5 Leetspeak Decoder V3 (Production)
The definitive byte-level translator for leetspeak, internet slang, and visual character obfuscation.
Built on google/byt5-base, V3 represents a major architectural shift from previous versions. It uses Curriculum Learning and Adversarial Filtering to resolve the context ambiguity between leetspeak numbers (e.g., "2" meaning "to") and literal quantities (e.g., "2 cats").
## Key Improvements in V3
| Feature | V2 (Legacy) | V3 (Current) |
|---|---|---|
| Mixed-Number Context | Struggled (~74%) | 100.0% Accuracy |
| Basic Leet Decoding | 85% | 100.0% Accuracy |
| Visual Obfuscation | Moderate | High (handles visual look-alike characters) |
| Output Style | Casual/Slang-heavy | Formal/Standard English |
| Final Eval Loss | 0.84 | 0.3812 |
The "Number Problem" Solved
V3 is the first model in this series to perfectly distinguish between numbers used as letters and numbers used as quantities within the same sentence.
- Input: `1t5 2 l8 4 2 people`
- V2 Output: `It's to late for to people.` (Fail)
- V3 Output: `It is too late for 2 people.` (Pass)
## Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "ilyyeees/byt5-leetspeak-decoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def decode_leet(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_length=256,
        num_beams=4,
        early_stopping=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Test cases
print(decode_leet("1t5 2 l8 4 th4t"))
# Output: It is too late for that.

print(decode_leet("1 g0t 100 p01nt5 0n 1t"))
# Output: I got 100 points on it. (preserves the '100' but decodes the rest)

print(decode_leet("idk wh4t 2 d0 tbh"))
# Output: I don't know what to do to be honest. (expands abbreviations)
```
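For decoding many strings at once, the same setup can be batched and moved to a GPU when one is available. The sketch below reuses the `model` and `tokenizer` loaded above; the helper name `decode_leet_batch`, the batch size, and the device handling are illustrative additions, not part of the released code.

```python
import torch

def decode_leet_batch(texts, batch_size=16):
    """Decode a list of leetspeak strings in batches (illustrative sketch)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # ByT5 is byte-level, so no special preprocessing is needed;
        # padding aligns the variable-length byte sequences in the batch.
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_length=256,
                num_beams=4,
                early_stopping=True,
            )
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results

print(decode_leet_batch(["1t5 2 l8 4 th4t", "idk wh4t 2 d0 tbh"]))
```

The generation settings mirror the single-input example; dropping to `num_beams=1` is usually faster if throughput matters more than the last bit of quality.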
## Training Methodology

V3 was trained on 2x NVIDIA RTX 5090s using a custom Reverse-Corruption Pipeline (sketched after the list below):
1. **Clean Base:** High-quality English from WikiText and ELI5 to ground the model in correct grammar.
2. **LLM Adversarial Corruption:** Qwen 2.5 72B generated "Hard Negatives", specific leetspeak patterns that previous model versions failed to decode.
3. **Curriculum Learning:** The model was trained in phases of increasing difficulty, starting with simple character swaps and ending with complex visual noise and mixed-number ambiguity.
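To illustrate the reverse-corruption idea, the sketch below takes a clean sentence as the training target and programmatically corrupts it into a leetspeak input. The substitution maps, probabilities, and two-tier difficulty split are assumptions for illustration only; the actual pipeline used Qwen-generated adversarial corruptions and the multi-phase curriculum described above.

```python
import random

# Assumed substitution tiers for this sketch; not the actual training maps.
EASY_SWAPS = {"e": "3", "a": "4", "o": "0", "i": "1", "s": "5", "t": "7"}
HARD_SWAPS = {"to": "2", "too": "2", "for": "4", "late": "l8", "you": "u"}

def corrupt(sentence: str, p_char: float = 0.6, p_word: float = 0.8) -> str:
    """Corrupt clean English into leetspeak; the clean sentence stays as the label."""
    out = []
    for word in sentence.lower().split():
        if word in HARD_SWAPS and random.random() < p_word:
            out.append(HARD_SWAPS[word])  # word-level swap introduces number ambiguity
            continue
        out.append("".join(
            EASY_SWAPS[c] if c in EASY_SWAPS and random.random() < p_char else c
            for c in word
        ))
    return " ".join(out)

clean = "it is too late for that"
pair = {"input": corrupt(clean), "target": clean}  # one (corrupted, clean) training pair
print(pair)
```

In a curriculum setting, early phases would draw only from the easy character map, with the word-level ambiguities and visual noise introduced in later phases.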
#Limitations & Bias Formalization Bias: Because V3 was trained on high-quality datasets (Wiki/ELI5), it has a bias toward formal English. It may expand casual slang into formal prose (e.g., converting ngl to not gonna lie or idk to I don't know). It generally avoids outputting slang words like gonna or wanna unless strongly prompted.
Short Inputs: Extremely short, ambiguous inputs (1-2 characters) may be interpreted as standard English rather than leetspeak due to the conservative decoding threshold.
## Links

- GitHub Repository: ilyyeees/leet-speak-decoder
- V2 Model (Legacy): byt5-leetspeak-decoder-v2