pratyushrt committed on
Commit 4a1c1f5 · verified · 1 Parent(s): 0cb055c

Update README.md

Files changed (1)
  1. README.md +170 -31
README.md CHANGED
@@ -28,56 +28,195 @@ It is the most accurate variant available and powers advanced anonymization in [
28
  * Largest model in the series, not suitable for mobile inference as of August 2025.
29
  * Requires MacBook-class hardware or above for real-time use.
30
 
31
- ## Usage example
32
 
33
  ```json
34
- <tool_call>
35
  {"name": "replace_entities", "arguments": {"replacements": [
36
- {"original": "Marc", "replacement": "Robert"},
37
- {"original": "cloud infrastructure", "replacement": "enterprise software"}
38
  ]}}
39
- </tool_call>
40
  ```
41
 
42
- ## Usage prompt template
43
 
44
- The models expect input in this specific format:
45
 
46
  ```
47
  [BEGIN OF TASK INSTRUCTION]
48
- You are an anonymizer. Your task is to identify and replace personally identifiable information (PII) in the given text.
49
- Replace PII entities with semantically equivalent alternatives that preserve the context needed for a good response.
50
- If no PII is found or replacement is not needed, return an empty replacements list.
51
-
52
- REPLACEMENT RULES:
53
- - Personal names: Replace private or small-group individuals. Pick same culture + gender + era; keep surnames aligned across family members. DO NOT replace globally recognised public figures (heads of state, Nobel laureates, A-list entertainers, Fortune-500 CEOs, etc.).
54
- - Companies / organisations: Replace private, niche, employer & partner orgs. Invent a fictitious org in the same industry & size tier; keep legal suffix. Keep major public companies (anonymity set ≥ 1,000,000).
55
- - Projects / codenames / internal tools: Always replace with a neutral two-word alias of similar length.
56
- - Locations: Replace street addresses, buildings, villages & towns < 100k pop with a same-level synthetic location inside the same state/country. Keep big cities (≥ 1M), states, provinces, countries, iconic landmarks.
57
- - Dates & times: Replace birthdays, meeting invites, exact timestamps. Shift day/month by small amounts while KEEPING THE SAME YEAR to maintain temporal context. DO NOT shift public holidays or famous historic dates ("July 4 1776", "Christmas Day", "9/11/2001", etc.). Keep years, fiscal quarters, decade references unchanged.
58
- - Identifiers: (emails, phone #s, IDs, URLs, account #s) Always replace with format-valid dummies; keep domain class (.com big-tech, .edu, .gov).
59
- - Monetary values: Replace personal income, invoices, bids by × [0.8 – 1.25] to keep order-of-magnitude. Keep public list prices & market caps.
60
- - Quotes / text snippets: If the quote contains PII, swap only the embedded tokens; keep the rest verbatim.
61
- /no_think
62
  [END OF TASK INSTRUCTION]
63
 
64
  [BEGIN OF AVAILABLE TOOLS]
65
- [{"type": "function", "function": {"name": "replace_entities", "description": "Replace PII entities with anonymized versions", "parameters": {"type": "object", "properties": {"replacements": {"type": "array", "items": {"type": "object", "properties": {"original": {"type": "string"}, "replacement": {"type": "string"}}, "required": ["original", "replacement"]}}}, "required": ["replacements"]}}}]
66
  [END OF AVAILABLE TOOLS]
67
 
68
  [BEGIN OF FORMAT INSTRUCTION]
69
- Use the replace_entities tool to specify replacements. Your response must use the tool call wrapper format:
70
-
71
- <|tool_call|>{"name": "replace_entities", "arguments": {"replacements": [{"original": "PII_TEXT", "replacement": "ANONYMIZED_TEXT"}, ...]}}</|tool_call|>
72
-
73
- If no replacements are needed, use:
74
- <|tool_call|>{"name": "replace_entities", "arguments": {"replacements": []}}</|tool_call|>
75
-
76
- Remember to wrap your entire tool call in <|tool_call|> and </|tool_call|> tags.
77
  [END OF FORMAT INSTRUCTION]
78
 
79
  [BEGIN OF QUERY]
80
  Your text to anonymize goes here
81
  /no_think
82
  [END OF QUERY]
83
- ```
28
  * Largest model in the series, not suitable for mobile inference as of August 2025.
29
  * Requires MacBook-class hardware or above for real-time use.
30
 
31
+ ## Usage Example
32
 
33
+ ⚠️ **Important**: This model requires specific formatting using the tokenizer's chat template. Do not use raw prompts directly.
34
+
35
+ ### Quick Start
36
+
37
+ ```python
38
+ from transformers import AutoModelForCausalLM, AutoTokenizer
39
+ import torch
40
+ import json
41
+
42
+ # Load model and tokenizer
43
+ model_name = "eternisai/Anonymizer-4B"
44
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
45
+ model = AutoModelForCausalLM.from_pretrained(
46
+     model_name,
47
+     torch_dtype=torch.float16,
48
+     device_map="auto",
49
+     trust_remote_code=True
50
+ )
51
+
52
+ # Define the task instruction
53
+ TASK_INSTRUCTION = """You are an anonymizer. Your task is to identify and replace personally identifiable information (PII) in the given text.
54
+ Replace PII entities with semantically equivalent alternatives that preserve the context needed for a good response.
55
+ If no PII is found or replacement is not needed, return an empty replacements list.
56
+
57
+ REPLACEMENT RULES:
58
+ • Personal names: Replace private or small-group individuals. Pick same culture + gender + era; keep surnames aligned across family members. DO NOT replace globally recognised public figures (heads of state, Nobel laureates, A-list entertainers, Fortune-500 CEOs, etc.).
59
+ • Companies / organisations: Replace private, niche, employer & partner orgs. Invent a fictitious org in the same industry & size tier; keep legal suffix. Keep major public companies (anonymity set ≥ 1,000,000).
60
+ • Projects / codenames / internal tools: Always replace with a neutral two-word alias of similar length.
61
+ • Locations: Replace street addresses, buildings, villages & towns < 100k pop with a same-level synthetic location inside the same state/country. Keep big cities (≥ 1M), states, provinces, countries, iconic landmarks.
62
+ • Dates & times: Replace birthdays, meeting invites, exact timestamps. Shift day/month by small amounts while KEEPING THE SAME YEAR to maintain temporal context. DO NOT shift public holidays or famous historic dates ("July 4 1776", "Christmas Day", "9/11/2001", etc.). Keep years, fiscal quarters, decade references unchanged.
63
+ • Identifiers: (emails, phone #s, IDs, URLs, account #s) Always replace with format-valid dummies; keep domain class (.com big-tech, .edu, .gov).
64
+ • Monetary values: Replace personal income, invoices, bids by × [0.8 – 1.25] to keep order-of-magnitude. Keep public list prices & market caps.
65
+ • Quotes / text snippets: If the quote contains PII, swap only the embedded tokens; keep the rest verbatim."""
66
+
67
+ # Define tool schema (required!)
68
+ tools = [{
69
+     "type": "function",
70
+     "function": {
71
+         "name": "replace_entities",
72
+         "description": "Replace PII entities with anonymized versions",
73
+         "parameters": {
74
+             "type": "object",
75
+             "properties": {
76
+                 "replacements": {
77
+                     "type": "array",
78
+                     "items": {
79
+                         "type": "object",
80
+                         "properties": {
81
+                             "original": {"type": "string"},
82
+                             "replacement": {"type": "string"}
83
+                         },
84
+                         "required": ["original", "replacement"]
85
+                     }
86
+                 }
87
+             },
88
+             "required": ["replacements"]
89
+         }
90
+     }
91
+ }]
92
+
93
+ # Your query to anonymize
94
+ query = "Hi, my son Elijah works at TechStartup Inc and makes $85,000 per year."
95
+
96
+ # Format messages properly (critical step!)
97
+ messages = [
98
+     {"role": "system", "content": TASK_INSTRUCTION},
99
+     {"role": "user", "content": query + "\n/no_think"}
100
+ ]
101
+
102
+ # Apply chat template with tools
103
+ formatted_prompt = tokenizer.apply_chat_template(
104
+     messages,
105
+     tools=tools,
106
+     tokenize=False,
107
+     add_generation_prompt=True
108
+ )
109
+
110
+ # Tokenize and generate
111
+ inputs = tokenizer(formatted_prompt, return_tensors="pt", truncation=True).to(model.device)
112
+ outputs = model.generate(**inputs, max_new_tokens=250, temperature=0.3, do_sample=True, top_p=0.9)
113
+
114
+ # Decode and extract response
115
+ response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False)  # decode only the newly generated tokens
116
+ assistant_response = response.split("<|im_end|>")[0].strip()
117
+
118
+ print("Response:", assistant_response)
119
+ # Expected output format:
120
+ # <|tool_call|>{"name": "replace_entities", "arguments": {"replacements": [{"original": "Elijah", "replacement": "Nathan"}, {"original": "TechStartup Inc", "replacement": "DataSoft LLC"}, {"original": "$85,000", "replacement": "$72,000"}]}}</|tool_call|>
121
+ ```
122
+
123
+ ### Parsing the Response
124
+
125
+ ```python
126
+ def parse_replacements(response):
127
+     """Extract the replacements list from a model response, or None if it cannot be parsed."""
128
+     try:
129
+         if '<|tool_call|>' in response:
130
+             start = response.find('<|tool_call|>') + len('<|tool_call|>')
131
+             end = response.find('</|tool_call|>', start)
132
+         elif '<tool_call>' in response:
133
+             start = response.find('<tool_call>') + len('<tool_call>')
134
+             end = response.find('</tool_call>', start)
135
+         else:
136
+             return None
137
+
138
+         if end != -1:
139
+             json_str = response[start:end].strip()
140
+             tool_data = json.loads(json_str)
141
+             return tool_data.get('arguments', {}).get('replacements', [])
142
+     except (json.JSONDecodeError, AttributeError):
143
+         return None
144
+
145
+ # Parse the response
146
+ replacements = parse_replacements(assistant_response)
147
+ if replacements:
148
+     for r in replacements:
149
+         print(f"Replace '{r['original']}' with '{r['replacement']}'")
150
+ ```
151
+
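Once the replacements are parsed, they still have to be applied to the original text. The following is a minimal sketch of one way to do that, not part of the README itself: `apply_replacements` is a hypothetical helper name, and it assumes each entry has the `original` and `replacement` keys defined in the tool schema. It substitutes in a single regex pass so that one replacement's output is never rewritten by another entry.

```python
import re


def apply_replacements(text, replacements):
    """Return `text` with every `original` span swapped for its `replacement`.

    Hypothetical helper: assumes `replacements` is the list returned by the
    parsing step, i.e. dicts with "original" and "replacement" keys.
    """
    # Skip empty originals, which would otherwise match at every position.
    mapping = {r["original"]: r["replacement"] for r in replacements if r["original"]}
    if not mapping:
        return text
    # Longest originals first, so a full name wins over a first name alone.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # One pass over the text: substituted output is not scanned again.
    return pattern.sub(lambda m: mapping[m.group(0)], text)


query = "Hi, my son Elijah works at TechStartup Inc and makes $85,000 per year."
example_replacements = [
    {"original": "Elijah", "replacement": "Nathan"},
    {"original": "TechStartup Inc", "replacement": "DataSoft LLC"},
    {"original": "$85,000", "replacement": "$72,000"},
]
anonymized = apply_replacements(query, example_replacements)
print(anonymized)
```

Plain string matching is used here because the model returns literal spans; whether matching should be case-sensitive or word-bounded depends on your data.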
152
+ ### Output Format
153
+
154
+ The model outputs tool calls in this format:
155
+
156
+ **With PII detected:**
157
  ```json
158
+ <|tool_call|>
159
  {"name": "replace_entities", "arguments": {"replacements": [
160
+ {"original": "John", "replacement": "Marcus"},
161
+ {"original": "Microsoft", "replacement": "TechCorp"},
162
+ {"original": "$5000", "replacement": "$4200"}
163
  ]}}
164
+ </|tool_call|>
165
  ```
166
 
167
+ **No PII detected:**
168
+ ```json
169
+ <|tool_call|>
170
+ {"name": "replace_entities", "arguments": {"replacements": []}}
171
+ </|tool_call|>
172
+ ```
173
+
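The monetary rule in the task instruction scales personal amounts by a factor in [0.8, 1.25]. As a rough sanity check on model output, the ratio between a replacement and its original can be compared against that band. This is only a sketch: `parse_amount` and `within_band` are hypothetical helpers, and the parsing assumes simple amounts such as `$85,000` with no currency conversion or unit suffixes.

```python
import re


def parse_amount(value):
    """Strip currency symbols and thousands separators, e.g. "$85,000" -> 85000.0."""
    return float(re.sub(r"[^\d.]", "", value))


def within_band(original, replacement, low=0.8, high=1.25):
    """True if replacement / original falls inside the [low, high] scaling band."""
    ratio = parse_amount(replacement) / parse_amount(original)
    return low <= ratio <= high


print(within_band("$85,000", "$72,000"))  # ratio is about 0.85
print(within_band("$5000", "$4200"))      # ratio is 0.84
print(within_band("$5000", "$10000"))     # ratio is 2.0, outside the band
```

A check like this can flag replacements that drift out of the intended order of magnitude, but it says nothing about whether the amount should have been replaced in the first place.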
174
+ ## Important Notes
175
+
176
+ 1. **Chat Template Required**: The model will NOT work with raw prompts. You must use `tokenizer.apply_chat_template()` with the tools parameter.
177
+
178
+ 2. **Tool Schema Required**: The tools schema must be provided to the chat template for proper formatting.
179
 
180
+ 3. **Special Marker**: User queries need the `/no_think` marker appended.
181
+
182
+ 4. **Response Format**: The model outputs structured tool calls wrapped in `<|tool_call|>` tags (or `<tool_call>` in some versions).
183
+
184
+ ## Common Issues
185
+
186
+ **Issue**: Model outputs gibberish or doesn't follow the format
187
+ **Solution**: Ensure you're using `apply_chat_template` with the tools parameter
188
+
189
+ **Issue**: Model doesn't detect obvious PII
190
+ **Solution**: Make sure to append `/no_think` to the user query
191
+
192
+ **Issue**: Getting errors about missing tools
193
+ **Solution**: The tools schema is required - see the example above
194
+
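Two of the issues above come down to how the messages are assembled. A small wrapper can make the `/no_think` marker hard to forget; the sketch below uses a hypothetical `build_messages` helper and assumes the system/user layout from the Quick Start example. The returned list would then be passed to `tokenizer.apply_chat_template` together with the tools schema.

```python
NO_THINK = "/no_think"


def build_messages(task_instruction, query):
    """Build the system/user message pair, appending /no_think exactly once.

    Hypothetical helper mirroring the message layout in the Quick Start.
    """
    user_content = query.rstrip()
    # Avoid a doubled marker if the caller already appended it.
    if not user_content.endswith(NO_THINK):
        user_content += "\n" + NO_THINK
    return [
        {"role": "system", "content": task_instruction},
        {"role": "user", "content": user_content},
    ]


messages = build_messages("You are an anonymizer.", "My son Elijah works at TechStartup Inc.")
print(messages[1]["content"])
```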
195
+ ## Technical Details
196
+
197
+ The model was trained using the Qwen3 chat template format with tool calling capabilities. The internal prompt structure (shown below for reference) is automatically generated by the tokenizer - **do not construct this manually**:
198
+
199
+ <details>
200
+ <summary>Internal prompt structure (auto-generated, for reference only)</summary>
201
 
202
  ```
203
  [BEGIN OF TASK INSTRUCTION]
204
+ You are an anonymizer. Your task is to identify and replace personally identifiable information (PII)...
205
  [END OF TASK INSTRUCTION]
206
 
207
  [BEGIN OF AVAILABLE TOOLS]
208
+ [{"type": "function", "function": {"name": "replace_entities", ...}}]
209
  [END OF AVAILABLE TOOLS]
210
 
211
  [BEGIN OF FORMAT INSTRUCTION]
212
+ Use the replace_entities tool to specify replacements...
213
  [END OF FORMAT INSTRUCTION]
214
 
215
  [BEGIN OF QUERY]
216
  Your text to anonymize goes here
217
  /no_think
218
  [END OF QUERY]
219
+ ```
220
+
221
+ This structure is created automatically when you use `tokenizer.apply_chat_template()` - never construct it manually.
222
+ </details>