DKSplit

BiLSTM-CRF model for splitting concatenated strings into words. Trained on millions of domain names, brand names, personal names, and multilingual phrases.

Quick Start

pip install dksplit

import dksplit

dksplit.split("chatgptlogin")
# ['chatgpt', 'login']

dksplit.split("spotifywrapped")
# ['spotify', 'wrapped']

dksplit.split("mercibeaucoup")
# ['merci', 'beaucoup']

dksplit.split_batch(["openaikey", "microsoftoffice", "bitcoinprice"])
# [['openai', 'key'], ['microsoft', 'office'], ['bitcoin', 'price']]

# Top-k candidates, best first
dksplit.split3("noranite")
# [['nora', 'nite'], ['noranite'], ['nor', 'anite']]

dksplit.split5("pikahug")
# [['pikahug'], ['pika', 'hug'], ['pik', 'ahug'], ['pikah', 'ug'], ['pi', 'kahug']]

dksplit.split_topk("chatgptlogin", k=3)
# [['chatgpt', 'login'], ['chatgptlogin'], ['chatgpt', 'log', 'in']]
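
Because the top-k calls return whole candidate segmentations, best first, a pipeline with extra knowledge can choose among them. The sketch below is an illustration only and is not part of dksplit: `pick_candidate` and the `known` word set are hypothetical, and the candidate list is copied from the `split5("pikahug")` output above rather than produced by a live call.

```python
def pick_candidate(candidates, known):
    """Return the first candidate whose words are all in `known`.

    `candidates` is a list of word lists, best first, in the shape the
    top-k calls return. Falls back to the top-ranked candidate when no
    candidate qualifies.
    """
    for words in candidates:
        if all(w in known for w in words):
            return words
    return candidates[0]


candidates = [['pikahug'], ['pika', 'hug'], ['pik', 'ahug'], ['pikah', 'ug'], ['pi', 'kahug']]
known = {"pika", "hug"}  # hypothetical application vocabulary

print(pick_candidate(candidates, known))  # ['pika', 'hug']
```

With an empty vocabulary the function simply returns the model's first choice, so the model ranking stays the default.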

Model Details

Property	Value
Architecture	BiLSTM-CRF
Parameters	9.47M
Embedding	384
Hidden	768
Layers	3
Vocab	a-z, 0-9 (36 characters plus 2 reserved IDs, 38 tokens)
Max length	64 characters
Format	ONNX INT8 quantized
Size	9 MB
Inference	CPU only, no GPU required

Training

Infrastructure: Leonardo Booster supercomputer at CINECA, Italy (NVIDIA A100)
Compute: EuroHPC Joint Undertaking, project AIFAC_P02_281
Data: Millions of labeled samples covering domain names, brand names, tech terms, personal names, and multilingual phrases
Labels: Character-level B/I tags (B = first character of a word, I = continuation of the current word)
Optimizer: Adam, cosine LR schedule with warmup
Epochs: 15
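
The B/I labeling can be made concrete with a small round trip. This is a sketch that assumes B marks the first character of each word and I marks every other character; the helper names are illustrative and are not taken from the training code.

```python
def words_to_tags(words):
    """Turn a segmentation into one B/I tag per character."""
    tags = []
    for w in words:
        tags.extend(["B"] + ["I"] * (len(w) - 1))
    return tags


def tags_to_words(text, tags):
    """Rebuild the words from the unsplit text and its B/I tags."""
    words = []
    for ch, tag in zip(text, tags):
        if tag == "B" or not words:
            words.append(ch)   # start a new word
        else:
            words[-1] += ch    # extend the current word
    return words


tags = words_to_tags(["chatgpt", "login"])
print("".join(tags))                        # BIIIIIIBIIII
print(tags_to_words("chatgptlogin", tags))  # ['chatgpt', 'login']
```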

Benchmark

1,000 hand-audited domain prefixes drawn from the .com feed of the Newly Registered Domains Database (NRDS), with no filtering or cherry-picking on segmentation difficulty. Ground truth was established through multi-model cross-validation (BiLSTM, Qwen 9B LoRA, Gemma 31B) and human audit. Each row provides a primary truth field and an optional might_right field for genuinely ambiguous cases.

Model	Strict EM	Lenient EM
DKSplit	86.5%	91.5%
WordSegment	65.2%	69.5%
WordNinja	51.0%	54.0%

Strict EM counts only exact matches against truth. Lenient EM also accepts the might_right alternative when present.
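
The two metrics can be sketched as follows. The row layout here is an assumption made for illustration: each row is a dict with a `truth` string of space-separated words and an optional `might_right` string; the benchmark's actual file format may differ. The rows in the usage example are invented and are not benchmark data.

```python
def exact_match(rows, predictions):
    """Return (strict_em, lenient_em) as fractions between 0 and 1."""
    strict = lenient = 0
    for row, pred in zip(rows, predictions):
        guess = " ".join(pred)
        if guess == row["truth"]:
            strict += 1
            lenient += 1
        elif row.get("might_right") and guess == row["might_right"]:
            lenient += 1  # only the lenient metric accepts the alternative
    n = len(rows)
    return strict / n, lenient / n


# Invented rows, for illustration only
rows = [
    {"truth": "chatgpt login"},
    {"truth": "nora nite", "might_right": "noranite"},
    {"truth": "bitcoin price"},
    {"truth": "spotify wrapped"},
]
preds = [["chatgpt", "login"], ["noranite"], ["bit", "coin", "price"], ["spotify", "wrapped"]]

print(exact_match(rows, preds))  # (0.5, 0.75)
```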

Top-k coverage (share of samples where an acceptable segmentation appears among the top k candidates):

Benchmark	top-1	top-3	top-5
1,000 samples	91.5%	98.5%	99.3%
5,000 samples	90.4%	97.8%	99.0%
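
A minimal sketch of how top-k coverage could be computed. It assumes that "acceptable" means either the `truth` value or the optional `might_right` value, that both are stored as space-separated strings, and that candidates arrive best first; the function name and the sample rows are invented for illustration.

```python
def topk_coverage(rows, candidate_lists, k):
    """Fraction of rows with an accepted segmentation in the first k candidates."""
    hits = 0
    for row, candidates in zip(rows, candidate_lists):
        accepted = {row["truth"]}
        if row.get("might_right"):
            accepted.add(row["might_right"])
        if any(" ".join(c) in accepted for c in candidates[:k]):
            hits += 1
    return hits / len(rows)


# Invented rows, for illustration only
rows = [
    {"truth": "pika hug"},
    {"truth": "chatgpt login"},
]
candidate_lists = [
    [["pikahug"], ["pika", "hug"], ["pik", "ahug"]],
    [["chatgpt", "login"], ["chatgptlogin"], ["chatgpt", "log", "in"]],
]

print(topk_coverage(rows, candidate_lists, k=1))  # 0.5
print(topk_coverage(rows, candidate_lists, k=3))  # 1.0
```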

Both benchmark sets ship in the GitHub repo's /benchmark directory and on Hugging Face as ABTdomain/dksplit-benchmark. To explore domain data yourself, register at domainkits.com, where fresh .com NRD downloads are free.

Comparison

Input	DKSplit	WordSegment	WordNinja
`chatgptprompts`	chatgpt prompts	chat gpt prompts	chat gp t prompts
`spotifywrapped`	spotify wrapped	spot if y wrapped	spot if y wrapped
`ethereumwallet`	ethereum wallet	e there um wallet	e there um wallet
`kubernetescluster`	kubernetes cluster	ku bernet es cluster	ku berne tes cluster
`whatsappstatus`	whatsapp status	what sapp status	what s app status
`drwatsonai`	dr watson ai	dr watson a i	dr watson a i
`escribirenvozalta`	escribir en voz alta	escribir env oz alta	es crib ire nv oz alta
`tuvasou`	tu vas ou	tuva sou	tuva so u
`candidiasenuncamais`	candidiase nunca mais	candid iase nunca mais	can didi as e nun cama is

Using the ONNX Model Directly

The ONNX model outputs per-character emission scores only. CRF (Viterbi) decoding is done separately in NumPy, using the parameters stored in dksplit.npz.

import numpy as np
import onnxruntime as ort

# Load model
sess = ort.InferenceSession("dksplit-int8.onnx")
crf = np.load("dksplit.npz")

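# Encode input: a-z and 0-9 get IDs 2-37; any other character maps to ID 1
CHAR_MAP = {c: i+2 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz0123456789")}
text = "chatgptlogin"  # lowercase input, at most 64 characters
ids = np.array([[CHAR_MAP.get(c, 1) for c in text]], dtype=np.int64)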

# Get emissions
emissions = sess.run(["emissions"], {"chars": ids})[0]

# CRF Viterbi decode
trans = crf["transitions"]  # trans[i, j]: score of moving from label i to label j
start_t = crf["start_transitions"]
end_t = crf["end_transitions"]

score = start_t + emissions[0, 0]
history = []
for t in range(1, emissions.shape[1]):
    # ns[i, j]: score of reaching label j at step t via label i at step t-1
    ns = score[:, None] + trans + emissions[0, t, None, :]
    history.append(np.argmax(ns, axis=0))  # best previous label for each j
    score = np.max(ns, axis=0)
# Backtrack from the best final label
best = [np.argmax(score + end_t)]
for h in reversed(history):
    best.append(h[best[-1]])
best.reverse()

# Decode to words (label 1 starts a new word)
words, cur = [], []
for ch, lb in zip(text, best):
    if lb == 1 and cur:
        words.append("".join(cur))
        cur = [ch]
    else:
        cur.append(ch)
if cur:
    words.append("".join(cur))
print(words)  # ['chatgpt', 'login']
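
To follow the decoding step without the model files, the same recurrence can be run on made-up scores. In this sketch the function name `viterbi` and all numbers are illustrative; real values come from the ONNX emissions and dksplit.npz. The second call uses a transition matrix that penalizes two consecutive label-1 steps, which changes the result from what a per-character argmax would give.

```python
import numpy as np


def viterbi(emissions, trans, start_t, end_t):
    """Best label path for emissions of shape (T, K).

    trans[i, j] is the score of moving from label i to label j.
    """
    score = start_t + emissions[0]
    history = []
    for t in range(1, emissions.shape[0]):
        # ns[i, j]: score of reaching label j at step t via label i
        ns = score[:, None] + trans + emissions[t][None, :]
        history.append(np.argmax(ns, axis=0))
        score = np.max(ns, axis=0)
    # Backtrack from the best final label
    best = [int(np.argmax(score + end_t))]
    for h in reversed(history):
        best.append(int(h[best[-1]]))
    best.reverse()
    return best


# Toy scores: every step prefers label 1 when looked at in isolation
emissions = np.array([[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
zeros = np.zeros(2)
no_trans = np.zeros((2, 2))
penalty = np.array([[0.0, 0.0], [0.0, -10.0]])  # discourage label 1 -> label 1

print(viterbi(emissions, no_trans, zeros, zeros))  # [1, 1, 1]
print(viterbi(emissions, penalty, zeros, zeros))   # [1, 0, 1]
```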

Files

dksplit-int8.onnx - BiLSTM emissions model (INT8 quantized, 9 MB)
dksplit.npz - CRF parameters (transitions, start_transitions, end_transitions)

Intended Use

Domain name analysis and segmentation
Hashtag splitting
URL component extraction
Compound string decomposition
Any concatenated text without spaces

Limitations

Input is limited to a-z and 0-9 (Latin script; uppercase is lowercased automatically). For best results, pass letter-only runs: split off digits and separators with simple rules first.
Max 64 characters
Accuracy is highest on English and major European languages
Some inputs are genuinely ambiguous; use the top-k API when your pipeline can handle multiple candidates
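
The rule-based pre-pass suggested in the first limitation could look like the sketch below, which uses only the standard library. It lowercases the input, extracts letter runs and digit runs with a regular expression, and treats every other character as a separator. The function name `pre_split` is hypothetical; the letter runs would then be passed to `dksplit.split`, a step left out here so the snippet runs without the model.

```python
import re


def pre_split(text):
    """Split raw input into lowercase letter runs and digit runs.

    Returns a list of (kind, run) pairs, where kind is "letters" or
    "digits". Characters outside a-z and 0-9 act as separators and
    are dropped.
    """
    runs = []
    for m in re.finditer(r"[a-z]+|[0-9]+", text.lower()):
        run = m.group()
        kind = "digits" if run[0].isdigit() else "letters"
        runs.append((kind, run))
    return runs


print(pre_split("Best-Deals2024online"))
# [('letters', 'best'), ('letters', 'deals'), ('digits', '2024'), ('letters', 'online')]
```

Keeping the kind alongside each run lets the caller send only letter runs to the model and reinsert the digit runs afterwards in their original order.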

Acknowledgements

The model was trained on the Leonardo Booster supercomputer at CINECA, Italy, with computing resources provided by the EuroHPC Joint Undertaking through the Playground Access program (EHPC-AIF-2026PG01-281). We thank EuroHPC JU for enabling SMEs to explore new possibilities with world-class HPC infrastructure.

License

CC BY 4.0. Attribution required: credit "DKSplit by ABTdomain" in your README, documentation, about page, or API response metadata.
