DKSplit
BiLSTM-CRF model for splitting concatenated strings into words. Trained on millions of domain names, brand names, personal names, and multilingual phrases.
Quick Start
pip install dksplit
import dksplit
dksplit.split("chatgptlogin")
# ['chatgpt', 'login']
dksplit.split("spotifywrapped")
# ['spotify', 'wrapped']
dksplit.split("mercibeaucoup")
# ['merci', 'beaucoup']
dksplit.split_batch(["openaikey", "microsoftoffice", "bitcoinprice"])
# [['openai', 'key'], ['microsoft', 'office'], ['bitcoin', 'price']]
# Top-k candidates, best first
dksplit.split3("noranite")
# [['nora', 'nite'], ['noranite'], ['nor', 'anite']]
dksplit.split5("pikahug")
# [['pikahug'], ['pika', 'hug'], ['pik', 'ahug'], ['pikah', 'ug'], ['pi', 'kahug']]
dksplit.split_topk("chatgptlogin", k=3)
# [['chatgpt', 'login'], ['chatgptlogin'], ['chatgpt', 'log', 'in']]
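When a pipeline has extra knowledge, such as a list of known brand or dictionary terms, the top-k candidates can be re-ranked before one is chosen. The sketch below is illustrative only: pick_candidate and the known set are hypothetical and not part of dksplit, and the candidate list is copied from the split5 output above rather than produced by calling the library.

```python
def pick_candidate(candidates, known):
    """Return the candidate with the largest share of characters in known words.

    Ties keep the model's ranking, because max() returns the first best
    item and the candidates arrive best first.
    """
    def coverage(words):
        total = sum(len(w) for w in words)
        hit = sum(len(w) for w in words if w in known)
        return hit / total if total else 0.0

    return max(candidates, key=coverage)


# Candidate list copied from the split5("pikahug") example above.
candidates = [["pikahug"], ["pika", "hug"], ["pik", "ahug"], ["pikah", "ug"], ["pi", "kahug"]]
known = {"pika", "hug"}  # hypothetical lexicon supplied by the caller

print(pick_candidate(candidates, known))  # ['pika', 'hug']
```

With an empty lexicon every candidate scores zero, so the function falls back to the model's first choice.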
Model Details
| Property | Value |
|---|---|
| Architecture | BiLSTM-CRF |
| Parameters | 9.47M |
| Embedding | 384 |
| Hidden | 768 |
| Layers | 3 |
| Vocab | a-z, 0-9 plus 2 reserved IDs (38 tokens) |
| Max length | 64 characters |
| Format | ONNX INT8 quantized |
| Size | 9 MB |
| Inference | CPU only, no GPU required |
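The figures in the table are mutually consistent under one plausible reading, which the back-of-the-envelope count below checks. It rests on assumptions not stated in the table: that Hidden = 768 is the concatenated width of both directions (384 each), that the LSTM uses a PyTorch-style layout with two bias vectors per gate, that a single linear layer projects to 2 tags (B and I), and that the CRF holds transition, start, and end scores. It is a sanity check, not a description of the actual training code.

```python
def lstm_params(input_size, hidden_size):
    # One direction of one LSTM layer: 4 gates, each with input weights,
    # recurrent weights, and two bias vectors (PyTorch-style layout, assumed).
    return 4 * hidden_size * (input_size + hidden_size + 2)


VOCAB, EMBED, HIDDEN_TOTAL, LAYERS, TAGS = 38, 384, 768, 3, 2
per_dir = HIDDEN_TOTAL // 2  # assumption: 768 is both directions concatenated

total = VOCAB * EMBED  # embedding table
for layer in range(LAYERS):
    in_size = EMBED if layer == 0 else HIDDEN_TOTAL
    total += 2 * lstm_params(in_size, per_dir)  # forward + backward
total += HIDDEN_TOTAL * TAGS + TAGS  # linear projection to tag scores (assumed)
total += TAGS * TAGS + 2 * TAGS      # CRF transitions, start, end

print(total)  # 9471754, i.e. about 9.47M
```

At one byte per weight, INT8 storage of roughly 9.47M parameters also lines up with the 9 MB file size.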
Training
- Infrastructure: Leonardo Booster supercomputer at CINECA, Italy (NVIDIA A100)
- Compute: EuroHPC Joint Undertaking, project AIFAC_P02_281
- Data: Millions of labeled samples covering domain names, brand names, tech terms, personal names, and multilingual phrases
- Labels: Character-level B/I tags (B = word boundary, I = continuation)
- Optimizer: Adam, cosine LR schedule with warmup
- Epochs: 15
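The B/I labeling can be illustrated with a small round trip between a segmentation and its character-level tags. This is a minimal sketch, assuming B marks the first character of every word (including the first word) and that words are non-empty; the helper names are hypothetical and the actual training pipeline may encode labels differently.

```python
def words_to_tags(words):
    """Character-level tags: 'B' on the first character of each word, 'I' elsewhere."""
    tags = []
    for word in words:
        tags.extend(["B"] + ["I"] * (len(word) - 1))
    return tags


def tags_to_words(text, tags):
    """Rebuild the segmentation by starting a new word at every 'B'."""
    words = []
    for ch, tag in zip(text, tags):
        if tag == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


tags = words_to_tags(["chatgpt", "login"])
print("".join(tags))                        # BIIIIIIBIIII
print(tags_to_words("chatgptlogin", tags))  # ['chatgpt', 'login']
```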
Benchmark
1,000 hand-audited domain prefixes drawn from the Newly Registered Domains Database (NRDS) (.com feed). No filtering or cherry-picking on segmentation difficulty. Ground truth was established through multi-model cross-validation (BiLSTM, Qwen 9B LoRA, Gemma 31B) and human audit. Each row provides a primary truth and an optional might_right field for genuinely ambiguous cases.
| Model | Strict EM | Lenient EM |
|---|---|---|
| DKSplit | 86.5% | 91.5% |
| WordSegment | 65.2% | 69.5% |
| WordNinja | 51.0% | 54.0% |
Strict EM counts only exact matches against truth. Lenient EM also accepts the might_right alternative when present.
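The two metrics can be computed as in the sketch below. The truth and might_right field names come from the benchmark description; representing each segmentation as a space-joined string is an assumption, and the rows are made-up toy data, not benchmark entries.

```python
def exact_match(rows, predictions):
    """Return (strict, lenient) exact-match rates as fractions in [0, 1]."""
    strict = lenient = 0
    for row, pred in zip(rows, predictions):
        if pred == row["truth"]:
            strict += 1
            lenient += 1
        elif row.get("might_right") and pred == row["might_right"]:
            # Counts only toward the lenient score.
            lenient += 1
    return strict / len(rows), lenient / len(rows)


# Toy rows for illustration only.
rows = [
    {"truth": "chatgpt login"},
    {"truth": "nora nite", "might_right": "noranite"},
    {"truth": "pika hug"},
    {"truth": "spotify wrapped"},
]
predictions = ["chatgpt login", "noranite", "pik ahug", "spotify wrapped"]

print(exact_match(rows, predictions))  # (0.5, 0.75)
```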
Top-k coverage (an acceptable segmentation is present within the candidates):
| Benchmark | top-1 | top-3 | top-5 |
|---|---|---|---|
| 1,000 samples | 91.5% | 98.5% | 99.3% |
| 5,000 samples | 90.4% | 97.8% | 99.0% |
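Top-k coverage can be computed along these lines. The sketch assumes "acceptable" means the truth value or, when present, the might_right alternative, which is consistent with the top-1 figure for 1,000 samples matching the Lenient EM score. Segmentations are space-joined strings here by assumption, and the rows are toy data rather than benchmark entries.

```python
def topk_coverage(rows, candidate_lists, k):
    """Fraction of rows where an accepted answer is among the first k candidates."""
    hits = 0
    for row, candidates in zip(rows, candidate_lists):
        accepted = {row["truth"]}
        if row.get("might_right"):
            accepted.add(row["might_right"])
        if any(c in accepted for c in candidates[:k]):
            hits += 1
    return hits / len(rows)


# Toy data for illustration only; candidates are ordered best first.
rows = [{"truth": "nora nite"}, {"truth": "pika hug"}]
candidate_lists = [
    ["noranite", "nora nite", "nor anite"],
    ["pikahug", "pik ahug", "pika hug"],
]

for k in (1, 2, 3):
    print(k, topk_coverage(rows, candidate_lists, k))  # 0.0, then 0.5, then 1.0
```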
Both benchmark sets ship in the GitHub repo's /benchmark directory and on Hugging Face as ABTdomain/dksplit-benchmark.
To explore domain data yourself, register at domainkits.com. Fresh .com NRD downloads are free.
Comparison
| Input | DKSplit | WordSegment | WordNinja |
|---|---|---|---|
| chatgptprompts | chatgpt prompts | chat gpt prompts | chat gp t prompts |
| spotifywrapped | spotify wrapped | spot if y wrapped | spot if y wrapped |
| ethereumwallet | ethereum wallet | e there um wallet | e there um wallet |
| kubernetescluster | kubernetes cluster | ku bernet es cluster | ku berne tes cluster |
| whatsappstatus | whatsapp status | what sapp status | what s app status |
| drwatsonai | dr watson ai | dr watson a i | dr watson a i |
| escribirenvozalta | escribir en voz alta | escribir env oz alta | es crib ire nv oz alta |
| tuvasou | tu vas ou | tuva sou | tuva so u |
| candidiasenuncamais | candidiase nunca mais | candid iase nunca mais | can didi as e nun cama is |
Using the ONNX Model Directly
The model outputs emission scores. CRF decoding is done separately using the parameters in dksplit.npz.
import numpy as np
import onnxruntime as ort

# Load model
sess = ort.InferenceSession("dksplit-int8.onnx")
crf = np.load("dksplit.npz")

# Encode input
CHAR_MAP = {c: i+2 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz0123456789")}
text = "chatgptlogin"
ids = np.array([[CHAR_MAP.get(c, 1) for c in text]], dtype=np.int64)

# Get emissions
emissions = sess.run(["emissions"], {"chars": ids})[0]

# CRF Viterbi decode
trans = crf["transitions"]
start_t = crf["start_transitions"]
end_t = crf["end_transitions"]
score = start_t + emissions[0, 0]
history = []
for t in range(1, emissions.shape[1]):
    ns = score[:, None] + trans + emissions[0, t, None, :]
    history.append(np.argmax(ns, axis=0))
    score = np.max(ns, axis=0)
best = [np.argmax(score + end_t)]
for h in reversed(history):
    best.append(h[best[-1]])
best.reverse()

# Decode to words
words, cur = [], []
for ch, lb in zip(text, best):
    if lb == 1 and cur:
        words.append("".join(cur))
        cur = [ch]
    else:
        cur.append(ch)
if cur:
    words.append("".join(cur))
print(words)  # ['chatgpt', 'login']
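The Viterbi recurrence used above can be checked without the model files. The sketch below restates it in plain Python and compares the result against a brute-force search over all tag paths. The scores are made-up toy values, the function names are hypothetical, and trans[i][j] is taken to mean the score of moving from tag i to tag j, which matches how the NumPy code above indexes it.

```python
from itertools import product


def viterbi(emissions, trans, start, end):
    """Best tag path for a linear-chain CRF; emissions holds one score list per step."""
    n_tags = len(start)
    score = [start[j] + emissions[0][j] for j in range(n_tags)]
    history = []
    for em in emissions[1:]:
        back, new = [], []
        for j in range(n_tags):
            # Best previous tag i for landing on tag j at this step.
            i = max(range(n_tags), key=lambda p: score[p] + trans[p][j])
            back.append(i)
            new.append(score[i] + trans[i][j] + em[j])
        history.append(back)
        score = new
    path = [max(range(n_tags), key=lambda j: score[j] + end[j])]
    for back in reversed(history):
        path.append(back[path[-1]])
    return path[::-1]


def path_score(path, emissions, trans, start, end):
    """Total CRF score of one complete tag path."""
    s = start[path[0]] + emissions[0][path[0]] + end[path[-1]]
    for t in range(1, len(path)):
        s += trans[path[t - 1]][path[t]] + emissions[t][path[t]]
    return s


# Toy scores, made up for illustration (2 tags, 4 characters).
emissions = [[0.2, 1.5], [1.0, 0.1], [0.3, 0.9], [1.2, 0.4]]
trans = [[0.5, -0.2], [0.1, -1.0]]
start, end = [0.0, 0.3], [0.2, 0.0]

best = viterbi(emissions, trans, start, end)
brute = max(product(range(2), repeat=4),
            key=lambda p: path_score(p, emissions, trans, start, end))
print(best, list(brute))  # both are [1, 0, 0, 0]
```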
Files
- dksplit-int8.onnx - BiLSTM emissions model (INT8 quantized, 9 MB)
- dksplit.npz - CRF parameters (transitions, start_transitions, end_transitions)
Intended Use
- Domain name analysis and segmentation
- Hashtag splitting
- URL component extraction
- Compound string decomposition
- Any concatenated text without spaces
Limitations
- a-z and 0-9 only (auto-lowercased), Latin script. For best results, pass letter-only runs: split off digits and separators with simple rules first.
- Max 64 characters
- Accuracy is highest on English and major European languages
- Some inputs are genuinely ambiguous; use the top-k API when your pipeline can handle multiple candidates
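The "simple rules" pre-processing mentioned above could look like the following. It is a minimal sketch: pre_split is a hypothetical helper, not part of dksplit, and the rule (letters and digits become separate runs, everything else is a separator) is one reasonable choice among several. Non-ASCII letters are treated as separators by this pattern.

```python
import re

MAX_LEN = 64  # model limit stated above


def pre_split(raw):
    """Lowercase the input and split it into letter-only and digit-only runs."""
    return re.findall(r"[a-z]+|[0-9]+", raw.lower())


runs = pre_split("Best-Pizza2024NYC")
print(runs)  # ['best', 'pizza', '2024', 'nyc']

# Only letter runs within the length limit would be passed to the splitter.
to_model = [r for r in runs if r.isalpha() and len(r) <= MAX_LEN]
print(to_model)  # ['best', 'pizza', 'nyc']
```

Each entry in to_model can then be passed to dksplit.split, with digit runs kept as-is.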
Links
- Read more about DKSplit: DKSplit on EuroHPC
- PyPI: pypi.org/project/dksplit
- GitHub: github.com/ABTdomain/dksplit
- Go version: github.com/ABTdomain/dksplit-go
- LLM variant: ABTdomain/dksplit-qwen-lora
- Website: ABTdomain.com, DomainKits.com
Acknowledgements
The model was trained on the Leonardo Booster supercomputer at CINECA, Italy, with computing resources provided by the EuroHPC Joint Undertaking through the Playground Access program (EHPC-AIF-2026PG01-281). We thank EuroHPC JU for enabling SMEs to explore new possibilities with world-class HPC infrastructure.
License
CC BY 4.0. Attribution required: credit "DKSplit by ABTdomain" in your README, documentation, about page, or API response metadata.