---
license: apache-2.0
---

# Dynamic Reinforcement Tokenizer (DRT-01)

**Author:** SofiTesfay2010
**Model repo:** DRT-01 on the Hugging Face Hub

## Overview

DRT-01 is a dynamic, context-aware tokenizer trained with reinforcement learning (RL) on large text datasets. Unlike static tokenizers, it segments text based on context embeddings and learns to produce tokenizations that maximize downstream semantic consistency.

It can be applied to any Transformer-based language model for improved compression, for semantic segmentation, or as a preprocessing step in RL or state-of-the-art LLM pipelines.

## Features

- Context-aware segmentation using a Transformer-based policy network.
- Continuous memory of previous segments for dynamic tokenization.
- A reinforcement-learning reward that optimizes embedding variance and smoothness.
- Automatic checkpoint saving and pushing to the Hugging Face Hub.
- Compatible with any LLM, including large instruction-tuned models such as Qwen/Qwen3.

## Installation

```bash
pip install datasets torch transformers huggingface_hub
```

## How It Works

1. **Streaming dataset:** loads large corpora (e.g., WikiText-103) as a stream for online training.
2. **Policy network:** a Transformer-based RL policy predicts segment lengths for byte sequences.
3. **Embedder:** converts segments into embedding vectors that combine local bytes and context.
4. **Context memory:** maintains a rolling memory of embeddings to inform the segmentation of new text.
5. **Reward function:** encourages embedding diversity and segment coherence while penalizing the use of too many segments (see the sketch below).
6. **Training loop:** uses RL to optimize the segmentation policy, auto-saves checkpoints, and can push them to the Hugging Face Hub.
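
The exact reward lives in the training script; the snippet below is only a minimal, illustrative sketch that follows the description above. The `count_penalty` weight, the variance term, and the smoothness term are assumptions for the example, not the shipped implementation.

```python
import torch

def reward(segs, count_penalty=0.01):
    """Illustrative reward over one sampled segmentation.

    `segs` is the list of (seg_bytes, seg_emb) pairs returned by
    DCTokenizer.sample_segmentation; the real DRT-01 reward may differ.
    """
    embs = torch.stack([seg_emb.flatten() for _, seg_emb in segs])  # (n_segments, dim)
    diversity = embs.var(dim=0, unbiased=False).mean()              # encourage diverse segment embeddings
    smoothness = torch.tensor(0.0)
    if len(segs) > 1:
        smoothness = -((embs[1:] - embs[:-1]) ** 2).mean()          # encourage coherent neighbouring segments
    # Treated as a plain scalar in the REINFORCE update, hence the detach().
    return (diversity + smoothness - count_penalty * len(segs)).detach()
```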

## Quick Start

```python
import torch

from drt import DCTokenizer  # assumes drt.py contains the DCTokenizer code

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the dynamic tokenizer
dct = DCTokenizer()

text = "The quick brown fox jumps over the lazy dog."
segs, logps = dct.sample_segmentation(text)

# Display the segmented byte chunks
for seg_bytes, seg_emb in segs:
    print(seg_bytes.tolist())
```

## Integrating with LLMs

You can feed the segments into any Hugging Face model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Example: Qwen3-Next-80B-A3B-Instruct
model_name = "Qwen/Qwen3-Next-80B-A3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Convert the byte segments back to strings
text_segments = [bytes(seg_bytes.tolist()).decode("utf-8", errors="ignore") for seg_bytes, _ in segs]
segmented_text = " ".join(text_segments)

# Tokenize for the LLM
inputs = tokenizer(segmented_text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

> ⚡ **Tip:** You can also embed the segments and use them for retrieval-augmented generation (RAG), semantic search, or RL-based language modeling.
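
As a rough illustration of the semantic-search idea, the sketch below pools each document's segment embeddings into one vector and ranks documents against a query by cosine similarity. It reuses the `dct` instance from the Quick Start; the `embed` helper, the mean pooling, and the toy corpus are assumptions for the example, not part of the DRT-01 API.

```python
import torch
import torch.nn.functional as F

def embed(dct, text):
    # Pool the segment embeddings of one text into a single vector (illustrative choice).
    segs, _ = dct.sample_segmentation(text)
    return torch.stack([seg_emb.flatten() for _, seg_emb in segs]).mean(dim=0)

corpus = [
    "The fox jumps over the dog.",
    "Stock markets rallied today.",
    "Dogs are loyal animals.",
]
corpus_embs = torch.stack([embed(dct, doc) for doc in corpus])

query_emb = embed(dct, "animals jumping")
scores = F.cosine_similarity(query_emb.unsqueeze(0), corpus_embs)  # one score per document
print(corpus[int(scores.argmax())])                                # best-matching document
```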

## Training

The included RL loop trains the policy to segment text well. It assumes that `dct`, the list of training documents `docs`, the `reward` function, and an `optimizer` over the policy/embedder parameters have already been set up:

```python
import random

import torch

EPOCHS = 50
BATCH_SIZE = 8

for ep in range(EPOCHS):
    batch = random.sample(docs, BATCH_SIZE)
    for doc in batch:
        segs, logps = dct.sample_segmentation(doc[:512])
        R = reward(segs)
        logp_sum = torch.stack(logps).sum()
        loss = -R * logp_sum          # REINFORCE-style policy-gradient loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

Checkpoints are auto-saved locally (to `dct_hf_tokenizer`) and can be pushed automatically to your Hugging Face repository.
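
If you want to push a saved checkpoint directory to the Hub yourself, one option is `huggingface_hub`'s `upload_folder`; the repository id below is a placeholder, and the training script's own auto-push logic may differ:

```python
from huggingface_hub import HfApi

api = HfApi()  # expects a prior `huggingface-cli login` or an access token in the environment
api.upload_folder(
    folder_path="dct_hf_tokenizer",   # local checkpoint directory
    repo_id="your-username/DRT-01",   # placeholder repository id
    repo_type="model",
)
```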

## Saving & Loading the Tokenizer

```python
# Save
torch.save(dct.policy.state_dict(), "policy.pt")
torch.save(dct.embedder.state_dict(), "embedder.pt")

# Reload
dct.policy.load_state_dict(torch.load("policy.pt", map_location=device))
dct.embedder.load_state_dict(torch.load("embedder.pt", map_location=device))
```

## Notes

- Works best with UTF-8 encoded text.
- Can be used as a dynamic preprocessing layer in front of any LLM; it is particularly useful for very large instruction-tuned models.
- Designed for streaming datasets: there is no need to load the full corpus into memory (see the sketch below).
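
As an illustration of the streaming setup, this is roughly how a corpus such as WikiText-103 can be streamed with the `datasets` library; the dataset configuration and the buffer size are assumptions for the example, not necessarily what the training script uses:

```python
from datasets import load_dataset

# Stream WikiText-103 instead of materializing the full corpus in memory.
stream = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)

docs = []
for example in stream:
    text = example["text"].strip()
    if text:
        docs.append(text)
    if len(docs) >= 1_000:   # keep only a small buffer of documents for RL batches
        break
```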

This setup allows you to experiment with context-aware tokenization and integrate it seamlessly with any Hugging Face model, from GPT-2 to Qwen3-Next-80B-A3B-Instruct.