DKSplit

BiLSTM-CRF model for splitting concatenated strings into words. Trained on millions of domain names, brand names, personal names, and multilingual phrases.

Quick Start

pip install dksplit

import dksplit

dksplit.split("chatgptlogin")
# ['chatgpt', 'login']

dksplit.split("spotifywrapped")
# ['spotify', 'wrapped']

dksplit.split("mercibeaucoup")
# ['merci', 'beaucoup']

dksplit.split_batch(["openaikey", "microsoftoffice", "bitcoinprice"])
# [['openai', 'key'], ['microsoft', 'office'], ['bitcoin', 'price']]

# Top-k candidates, best first
dksplit.split3("noranite")
# [['nora', 'nite'], ['noranite'], ['nor', 'anite']]

dksplit.split5("pikahug")
# [['pikahug'], ['pika', 'hug'], ['pik', 'ahug'], ['pikah', 'ug'], ['pi', 'kahug']]

dksplit.split_topk("chatgptlogin", k=3)
# [['chatgpt', 'login'], ['chatgptlogin'], ['chatgpt', 'log', 'in']]
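
Because the top-k calls return whole candidate segmentations ordered best first, a downstream step can re-rank them with knowledge of its own. The sketch below is not part of DKSplit: `pick_candidate` and `KNOWN_WORDS` are hypothetical names, and the candidate list is copied from the `split5("pikahug")` output above so the snippet runs without the package installed.

```python
# Hypothetical application-side vocabulary (an assumption, not shipped with DKSplit).
KNOWN_WORDS = {"pika", "hug", "nora", "nite"}

def pick_candidate(candidates, known_words):
    """Return the first candidate whose words are all known, else the top-ranked one."""
    for words in candidates:
        if all(w in known_words for w in words):
            return words
    return candidates[0]

# Candidates copied from the split5("pikahug") example, best first.
candidates = [["pikahug"], ["pika", "hug"], ["pik", "ahug"], ["pikah", "ug"], ["pi", "kahug"]]
print(pick_candidate(candidates, KNOWN_WORDS))  # ['pika', 'hug']
```

Falling back to `candidates[0]` keeps the model's own top choice whenever the vocabulary has nothing to say.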

Model Details

| Property | Value |
| --- | --- |
| Architecture | BiLSTM-CRF |
| Parameters | 9.47M |
| Embedding | 384 |
| Hidden | 768 |
| Layers | 3 |
| Vocab | a-z, 0-9 (38 tokens) |
| Max length | 64 characters |
| Format | ONNX INT8 quantized |
| Size | 9 MB |
| Inference | CPU only, no GPU required |

Training

  • Infrastructure: Leonardo Booster supercomputer at CINECA, Italy (NVIDIA A100)
  • Compute: EuroHPC Joint Undertaking, project AIFAC_P02_281
  • Data: Millions of labeled samples covering domain names, brand names, tech terms, personal names, and multilingual phrases
  • Labels: Character-level B/I tags (B = first character of a word, I = continuation)
  • Optimizer: Adam, cosine LR schedule with warmup
  • Epochs: 15
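
To make the labeling scheme concrete, here is a minimal sketch of how a segmentation could be converted to character-level B/I tags and back. It assumes the first character of every word is tagged B; the helper names `words_to_tags` and `tags_to_words` are illustrative and do not come from the DKSplit training code.

```python
def words_to_tags(words):
    """Map a segmentation to one tag per character: B starts a word, I continues it."""
    return "".join("B" + "I" * (len(w) - 1) for w in words if w)

def tags_to_words(text, tags):
    """Invert words_to_tags: open a new word at every B."""
    words = []
    for ch, tag in zip(text, tags):
        if tag == "B" or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words

tags = words_to_tags(["chatgpt", "login"])
print(tags)                                 # BIIIIIIBIIII
print(tags_to_words("chatgptlogin", tags))  # ['chatgpt', 'login']
```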

Benchmark

1,000 hand-audited domain prefixes drawn from the .com feed of the Newly Registered Domains Database (NRDS). Samples were not filtered or cherry-picked by segmentation difficulty. Ground truth was established through multi-model cross-validation (BiLSTM, Qwen 9B LoRA, Gemma 31B) and human audit. Each row provides a primary truth and an optional might_right field for genuinely ambiguous cases.

| Model | Strict EM | Lenient EM |
| --- | --- | --- |
| DKSplit | 86.5% | 91.5% |
| WordSegment | 65.2% | 69.5% |
| WordNinja | 51.0% | 54.0% |

Strict EM counts only exact matches against truth. Lenient EM also accepts the might_right alternative when present.
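
The two metrics can be sketched as follows. This is an illustration of the definitions above, not the benchmark's actual scoring script: it assumes each row stores `truth` and `might_right` as space-separated strings, and the toy rows are made up rather than taken from the benchmark files.

```python
def exact_match_rates(rows, predictions):
    """Return (strict, lenient) exact-match rates as fractions in [0, 1]."""
    strict = lenient = 0
    for row, pred in zip(rows, predictions):
        if pred == row["truth"]:
            strict += 1
            lenient += 1
        elif row.get("might_right") and pred == row["might_right"]:
            lenient += 1  # only the lenient metric accepts the alternative
    n = len(rows)
    return strict / n, lenient / n

# Toy rows for illustration only.
rows = [
    {"truth": "chatgpt login", "might_right": None},
    {"truth": "nora nite", "might_right": "noranite"},
    {"truth": "bitcoin price", "might_right": None},
    {"truth": "pika hug", "might_right": "pikahug"},
]
predictions = ["chatgpt login", "noranite", "bit coin price", "pika hug"]

strict, lenient = exact_match_rates(rows, predictions)
print(strict, lenient)  # 0.5 0.75
```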

Top-k coverage (an acceptable segmentation is present within the candidates):

| Benchmark | top-1 | top-3 | top-5 |
| --- | --- | --- | --- |
| 1,000 samples | 91.5% | 98.5% | 99.3% |
| 5,000 samples | 90.4% | 97.8% | 99.0% |
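
A coverage figure of this kind could be computed along the following lines. The sketch assumes "acceptable" means either the `truth` value or the optional `might_right` value, which is consistent with top-1 coverage on the 1,000-sample set equalling the Lenient EM score. The function name, row layout, and toy data are illustrative assumptions.

```python
def coverage_at_k(rows, candidate_lists, k):
    """Fraction of rows with an acceptable segmentation among the first k candidates."""
    hits = 0
    for row, candidates in zip(rows, candidate_lists):
        acceptable = {row["truth"]}
        if row.get("might_right"):
            acceptable.add(row["might_right"])
        if any(" ".join(words) in acceptable for words in candidates[:k]):
            hits += 1
    return hits / len(rows)

# Toy rows and candidate lists, made up for illustration.
rows = [
    {"truth": "chatgpt login"},
    {"truth": "pika hug", "might_right": "pikahug"},
    {"truth": "nor anite"},
]
candidate_lists = [
    [["chatgpt", "login"], ["chatgptlogin"], ["chatgpt", "log", "in"]],
    [["pikahug"], ["pika", "hug"], ["pik", "ahug"]],
    [["nora", "nite"], ["noranite"], ["nor", "anite"]],
]

for k in (1, 3):
    print(k, coverage_at_k(rows, candidate_lists, k))
```

On this toy data two of three rows are covered at k=1 and all three at k=3.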

Both benchmark sets ship in the GitHub repo's /benchmark directory and on Hugging Face as ABTdomain/dksplit-benchmark. To explore domain data yourself, register at domainkits.com — fresh .com NRD downloads are free.

Comparison

| Input | DKSplit | WordSegment | WordNinja |
| --- | --- | --- | --- |
| chatgptprompts | chatgpt prompts | chat gpt prompts | chat gp t prompts |
| spotifywrapped | spotify wrapped | spot if y wrapped | spot if y wrapped |
| ethereumwallet | ethereum wallet | e there um wallet | e there um wallet |
| kubernetescluster | kubernetes cluster | ku bernet es cluster | ku berne tes cluster |
| whatsappstatus | whatsapp status | what sapp status | what s app status |
| drwatsonai | dr watson ai | dr watson a i | dr watson a i |
| escribirenvozalta | escribir en voz alta | escribir env oz alta | es crib ire nv oz alta |
| tuvasou | tu vas ou | tuva sou | tuva so u |
| candidiasenuncamais | candidiase nunca mais | candid iase nunca mais | can didi as e nun cama is |

Using the ONNX Model Directly

The model outputs emission scores. CRF decoding is done separately using the parameters in dksplit.npz.

import numpy as np
import onnxruntime as ort

# Load model
sess = ort.InferenceSession("dksplit-int8.onnx")
crf = np.load("dksplit.npz")

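# Encode input: ids 2-37 cover a-z then 0-9; any other character maps to id 1.
# Lowercase the text yourself and keep it within 64 characters before encoding.
CHAR_MAP = {c: i+2 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz0123456789")}
text = "chatgptlogin"
ids = np.array([[CHAR_MAP.get(c, 1) for c in text]], dtype=np.int64)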

# Get emissions
emissions = sess.run(["emissions"], {"chars": ids})[0]

# CRF Viterbi decode
trans = crf["transitions"]
start_t = crf["start_transitions"]
end_t = crf["end_transitions"]

score = start_t + emissions[0, 0]
history = []
for t in range(1, emissions.shape[1]):
    ns = score[:, None] + trans + emissions[0, t, None, :]
    history.append(np.argmax(ns, axis=0))
    score = np.max(ns, axis=0)
best = [np.argmax(score + end_t)]
for h in reversed(history):
    best.append(h[best[-1]])
best.reverse()

# Decode to words: label 1 marks the first character of a new word
words, cur = [], []
for ch, lb in zip(text, best):
    if lb == 1 and cur:
        words.append("".join(cur))
        cur = [ch]
    else:
        cur.append(ch)
if cur:
    words.append("".join(cur))
print(words)  # ['chatgpt', 'login']
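
To experiment with the decoding step without the model files, the same Viterbi recurrence can be wrapped in a function and run on made-up scores. In this sketch the function name `viterbi_decode` and all numbers are illustrative assumptions; only the recurrence mirrors the snippet above, with `trans[i, j]` read as the score for moving from label i to label j.

```python
import numpy as np

def viterbi_decode(emissions, trans, start_t, end_t):
    """Return the highest-scoring label sequence for one input.

    emissions: (T, L) per-position label scores
    trans:     (L, L), trans[i, j] scores moving from label i to label j
    start_t, end_t: (L,) scores for starting / ending in each label
    """
    score = start_t + emissions[0]
    history = []
    for t in range(1, emissions.shape[0]):
        ns = score[:, None] + trans + emissions[t][None, :]
        history.append(np.argmax(ns, axis=0))  # best previous label per current label
        score = np.max(ns, axis=0)
    best = [int(np.argmax(score + end_t))]
    for h in reversed(history):                # follow back-pointers from the end
        best.append(int(h[best[-1]]))
    best.reverse()
    return best

# Toy scores (label 0 = continue, label 1 = start a new word).
emissions = np.array([[0.0, 2.0], [2.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
zeros = np.zeros(2)

# With no transition scores, decoding reduces to a per-position argmax.
print(viterbi_decode(emissions, np.zeros((2, 2)), zeros, zeros))  # [1, 0, 1, 0]

# Penalizing the 0 -> 1 transition suppresses the weakly supported second split.
penalized = np.array([[0.0, -5.0], [0.0, 0.0]])
print(viterbi_decode(emissions, penalized, zeros, zeros))  # [1, 0, 0, 0]
```

The second call shows why the CRF parameters matter: transition scores can overrule a locally preferred label when the overall path scores higher without it.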

Files

  • dksplit-int8.onnx - BiLSTM emissions model (INT8 quantized, 9 MB)
  • dksplit.npz - CRF parameters (transitions, start_transitions, end_transitions)

Intended Use

  • Domain name analysis and segmentation
  • Hashtag splitting
  • URL component extraction
  • Compound string decomposition
  • Any concatenated text without spaces

Limitations

  • a-z and 0-9 only (auto-lowercased), Latin script. For best results pass letter-only runs: split off digits and separators with simple rules first.
  • Max 64 characters
  • Accuracy is highest on English and major European languages
  • Some inputs are genuinely ambiguous; use the top-k API when your pipeline can handle multiple candidates
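
The "simple rules" for digits and separators mentioned above might look like the sketch below. It is one possible preprocessing step, not part of DKSplit: `letter_runs` and `segment` are hypothetical helpers, and `split_fn` stands in for a splitter such as `dksplit.split` so the snippet runs without the package installed.

```python
import re

def letter_runs(raw):
    """Lowercase, then return alternating a-z runs and digit runs; separators are dropped."""
    return re.findall(r"[a-z]+|[0-9]+", raw.lower())

def segment(raw, split_fn):
    """Apply split_fn to letter-only runs and keep digit runs as single tokens."""
    out = []
    for run in letter_runs(raw):
        out.extend(split_fn(run) if run.isalpha() else [run])
    return out

print(letter_runs("Best-VPN2024deals"))  # ['best', 'vpn', '2024', 'deals']

# Stand-in splitter that returns its input unchanged; in practice pass dksplit.split.
print(segment("Best-VPN2024deals", lambda s: [s]))  # ['best', 'vpn', '2024', 'deals']
```

Note that the regular expression also discards characters outside a-z and 0-9, such as accented letters, so inputs that rely on them need separate handling.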

Acknowledgements

The model was trained on the Leonardo Booster supercomputer at CINECA, Italy, with computing resources provided by the EuroHPC Joint Undertaking through the Playground Access program (EHPC-AIF-2026PG01-281). We thank EuroHPC JU for enabling SMEs to explore new possibilities with world-class HPC infrastructure.

License

CC BY 4.0. Attribution required: credit "DKSplit by ABTdomain" in your README, documentation, about page, or API response metadata.
