---
license: apache-2.0
---

# Dynamic Reinforcement Tokenizer (DRT-01)

**Author:** SofiTesfay2010

**Model Repo:** HuggingFace DRT-01

## Overview

DRT-01 is a dynamic, context-aware tokenizer trained with reinforcement learning (RL) on large text datasets. Unlike static tokenizers, it segments text based on context embeddings and learns to produce tokenizations that maximize downstream semantic consistency.

It can be applied to any Transformer-based language model for improved compression, semantic segmentation, or as a preprocessing step in RL or SOTA LLM pipelines.
## Features

- Context-aware segmentation using a Transformer-based policy network.
- Continuous memory of previous segments for dynamic tokenization.
- Reinforcement-learning-based reward optimizing embedding variance and smoothness.
- Auto-save and push of checkpoints to the HuggingFace Hub.
- Compatible with any LLM, including large instruction-tuned models like Qwen/Qwen3.

## Installation

```bash
pip install datasets torch transformers huggingface_hub
```
## How It Works

1. **Streaming Dataset:** Loads large corpora (e.g., WikiText-103) for online training.
2. **Policy Network:** A Transformer-based RL policy predicts segment lengths for byte sequences.
3. **Embedder:** Converts segments into embedding vectors combining local bytes and context.
4. **Context Memory:** Maintains a rolling memory of embeddings to inform segmentation of new text.
5. **Reward Function:** Encourages embedding diversity and segment coherence while penalizing too many segments (a rough sketch follows this list).
6. **Training Loop:** Uses RL to optimize the segmentation policy, auto-saves checkpoints, and can push to the HuggingFace Hub.
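The exact reward is defined in the training code; purely as an illustration of the idea in step 5, a minimal sketch might look like the following. The weighting, the `length_penalty` value, and the function body are assumptions for this example, not the shipped implementation.

```python
import torch
import torch.nn.functional as F

def reward(segs, length_penalty=0.01):
    """Toy reward in the spirit of step 5 (illustrative only, not the DRT-01 reward).

    segs: list of (seg_bytes, seg_emb) pairs as returned by sample_segmentation.
    """
    embs = torch.stack([seg_emb.flatten() for _, seg_emb in segs])   # (n_segs, dim)
    diversity = embs.var(dim=0, unbiased=False).mean()               # embedding spread
    if len(segs) > 1:
        # coherence: neighbouring segments should stay close in embedding space
        smoothness = F.cosine_similarity(embs[:-1], embs[1:], dim=-1).mean()
    else:
        smoothness = torch.tensor(0.0)
    return diversity + smoothness - length_penalty * len(segs)       # fewer segments preferred
```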
## Quick Start

```python
import torch

from drt import DCTokenizer  # assume drt.py contains your DCTokenizer code

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the dynamic tokenizer
dct = DCTokenizer()

text = "The quick brown fox jumps over the lazy dog."
segs, logps = dct.sample_segmentation(text)

# Display the segmented byte chunks
for seg_bytes, seg_emb in segs:
    print(seg_bytes.tolist())
```
## Integrating with LLMs

You can feed the segments into any HuggingFace model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Example: Qwen3-Next-80B-A3B-Instruct
model_name = "Qwen/Qwen3-Next-80B-A3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Convert the byte segments back to strings
text_segments = [
    bytes(seg_bytes.tolist()).decode("utf-8", errors="ignore")
    for seg_bytes, _ in segs
]
segmented_text = " ".join(text_segments)

# Tokenize for the LLM and generate
inputs = tokenizer(segmented_text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
⚡ **Tip:** You can also embed the segments and use them for retrieval-augmented generation (RAG), semantic search, or RL-based language modeling.
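As one illustration of the semantic-search idea, the sketch below mean-pools the segment embeddings returned by `sample_segmentation` into a document vector and ranks a small corpus by cosine similarity. `embed_document` is a hypothetical helper written for this example (not part of DRT-01), and it assumes the `(seg_bytes, seg_emb)` output shown in the Quick Start.

```python
import torch
import torch.nn.functional as F

def embed_document(dct, text):
    # Hypothetical helper: mean-pool DRT-01 segment embeddings into one document vector
    segs, _ = dct.sample_segmentation(text)
    embs = torch.stack([seg_emb.flatten() for _, seg_emb in segs])
    return embs.mean(dim=0)

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Reinforcement learning optimizes a policy from rewards.",
]
corpus_embs = torch.stack([embed_document(dct, doc) for doc in corpus])

query_emb = embed_document(dct, "How does RL training work?")
scores = F.cosine_similarity(query_emb.unsqueeze(0), corpus_embs, dim=-1)
print(corpus[int(scores.argmax())])  # best-matching document
```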
## Training

The included RL loop trains the policy to segment text optimally:

```python
import random

import torch

EPOCHS = 50
BATCH_SIZE = 8

# `docs`, `dct`, `reward`, and `optimizer` are defined in the training script
for ep in range(EPOCHS):
    batch = random.sample(docs, BATCH_SIZE)
    for doc in batch:
        segs, logps = dct.sample_segmentation(doc[:512])

        # REINFORCE-style update: scale the summed log-probabilities by the reward
        R = reward(segs)
        logp_sum = torch.stack(logps).sum()
        loss = -R * logp_sum

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```
Checkpoints are auto-saved locally (`dct_hf_tokenizer`) and can be auto-pushed to your HuggingFace repository; a manual push is sketched below.
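If you want to push a saved checkpoint yourself rather than relying on the auto-push, a minimal sketch with `huggingface_hub` looks like this (the repo id below is a placeholder; replace it with your own):

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes you are already logged in, e.g. via `huggingface-cli login`
api.upload_folder(
    folder_path="dct_hf_tokenizer",    # local checkpoint directory
    repo_id="your-username/DRT-01",    # placeholder repo id
    repo_type="model",
)
```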
## Saving & Loading Tokenizer

```python
torch.save(dct.policy.state_dict(), "policy.pt")
torch.save(dct.embedder.state_dict(), "embedder.pt")

# Reload
dct.policy.load_state_dict(torch.load("policy.pt", map_location=device))
dct.embedder.load_state_dict(torch.load("embedder.pt", map_location=device))
```
## Notes

- Works best with UTF-8 encoded text.
- Can be used as a dynamic preprocessing layer before any LLM; particularly useful for very large instruction-tuned models.
- Designed for streaming datasets: there is no need to load the full corpus into memory (see the streaming sketch below).
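For reference, a minimal streaming setup with the `datasets` library might look like the sketch below. The dataset and config names are the standard WikiText-103 identifiers and may differ from what the training script actually uses.

```python
from datasets import load_dataset

# Stream WikiText-103 instead of loading the full corpus into memory
stream = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)

docs = []
for example in stream:
    text = example["text"].strip()
    if text:
        docs.append(text)
    if len(docs) >= 1000:   # collect a small pool of documents for RL batches
        break
```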
This setup allows you to experiment with context-aware tokenization and integrate seamlessly with any HuggingFace model, from GPT-2 to Qwen3-Next-80B-A3B-Instruct.