BaoLocTown committed
Commit 85cf26b · verified · 1 Parent(s): d14baa0

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +480 -0
README.md ADDED
@@ -0,0 +1,480 @@
+ ---
+ license: apache-2.0
+ language:
+ - vi
+ - en
+ library_name: gliner
+ pipeline_tag: token-classification
+ tags:
+ - NER
+ - GLiNER
+ - information extraction
+ - PII
+ - PHI
+ - PCI
+ - entity recognition
+ - multilingual
+ ---
+ # Entity Types Classification
+
+ ## Personal Information
+ - Date of birth
+ - Age
+ - Gender
+ - Last name
+ - Occupation
+ - Education level
+ - Phone number
+ - Email
+ - Street address
+ - City
+ - Country
+ - Postcode
+ - User name
+ - Password
+ - Tax ID
+ - License plate
+ - CVV
+ - Bank routing number
+ - Account number
+ - SWIFT BIC
+ - Biometric identifier
+ - Device identifier
+ - Location
+
+ ## Financial Information
+ - Account number
+ - Bank routing number
+ - SWIFT BIC
+ - CVV
+ - Tax ID
+ - API key
+
+ ## Health and Medical Information
+ - Blood type
+ - Biometric identifier
+ - Organ
+ - Diseases symptom
+ - Diagnostics
+ - Preventive medicine
+ - Treatment
+ - Surgery
+ - Drug chemical
+ - Medical device technique
+ - Personal care
+
+ ## Online and Web-related Information
+ - URL
+ - IP address
+ - Email
+ - User name
+ - API key
+
+ ## Professional Information
+ - Occupation
+ - Skill
+ - Organization
+ - Company name
+
+ ## Location Information
+ - City
+ - Country
+ - Postcode
+ - Street address
+ - Location
+
+ ## Time-Related Information
+ - Date
+ - Date time
+
+ ## Miscellaneous
+ - Event
+ - Miscellaneous
+
+ ## Product and Goods Information
+ - Product
+ - Quantity
+ - Food drink
+ - Transportation
+
+ ## Identifiers
+ - Device identifier
+ - Biometric identifier
+ - User name
+ - Email
+ - Phone number
+ - URL
+ - License plate
+
+
+ # GLiNER-PII: Zero-shot PII model
+
+ A production-grade open-source model for privacy-focused PII, PHI, and PCI detection with zero-shot entity recognition capabilities.
+ This model was developed in collaboration between [Wordcab](https://wordcab.com/) and [Knowledgator](https://www.knowledgator.com/). For enterprise-ready, specialized PII/PHI/PCI models, contact us at [email protected].
+
+ ## 🧠 What is GLiNER?
+
+ GLiNER (Generalist and Lightweight Named Entity Recognition) is a bidirectional transformer model that can identify **any entity type** without predefined categories. Unlike traditional NER models that are limited to specific entity classes, GLiNER allows you to specify exactly what entities you want to extract at runtime.
+
+ ### Key Advantages
+
+ - **Zero-shot recognition**: Extract any entity type without retraining
+ - **Privacy-first**: Process sensitive data locally without API calls
+ - **Lightweight**: Much faster than large language models for NER tasks
+ - **Production-ready**: Quantization-aware training with FP16 and UINT8 ONNX models
+ - **Comprehensive**: 60+ predefined PII categories with custom entity support
+
+ ### How GLiNER Works
+
+ Instead of predicting from a fixed set of entity classes, GLiNER takes both text and a list of desired entity types as input, then identifies spans that match those categories:
+
+ ```python
+ text = "John Smith called from 415-555-1234 to discuss his account."
+ entities = ["name", "phone number", "account number"]
+ # GLiNER finds: "John Smith" → name, "415-555-1234" → phone number
+ ```
+
+ ## 🐍 Python Implementation
+
+ The primary GLiNER implementation provides comprehensive PII detection with 60+ entity categories, fine-tuned specifically for privacy and compliance use cases.
+
+ ### Installation
+
+ ```bash
+ pip install gliner
+ ```
+
+ ### Quick Start
+
+ ```python
+ from gliner import GLiNER
+
+ # Load the model (downloads automatically on first use)
+ model = GLiNER.from_pretrained("knowledgator/gliner-pii-base-v1.0")
+
+ text = "John Smith called from 415-555-1234 to discuss his account number 12345678."
+ labels = ["name", "phone number", "account number"]
+
+ entities = model.predict_entities(text, labels, threshold=0.3)
+
+ for entity in entities:
+     print(f"{entity['text']} => {entity['label']} (confidence: {entity['score']:.2f})")
+ ```
+
+ Output:
+ ```
+ John Smith => name (confidence: 0.95)
+ 415-555-1234 => phone number (confidence: 0.92)
+ 12345678 => account number (confidence: 0.88)
+ ```
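
Each prediction is a dictionary; the snippets in this README read its `text`, `label`, `score`, `start`, and `end` keys. A minimal sketch of grouping predictions by label for downstream handling, assuming that dictionary shape. The helper name `group_by_label` and the sample values are hypothetical stand-ins written by hand, not library code or real model output:

```python
from collections import defaultdict


def group_by_label(entities):
    """Group predicted entities by label, highest-scoring first within each label."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["label"]].append(entity)
    for items in grouped.values():
        items.sort(key=lambda e: e["score"], reverse=True)
    return dict(grouped)


# Hand-written sample shaped like the Quick Start output (not real model output).
sample_entities = [
    {"text": "John Smith", "label": "name", "score": 0.95, "start": 0, "end": 10},
    {"text": "415-555-1234", "label": "phone number", "score": 0.92, "start": 23, "end": 35},
    {"text": "12345678", "label": "account number", "score": 0.88, "start": 66, "end": 74},
]

grouped = group_by_label(sample_entities)
for label, items in grouped.items():
    print(label, [item["text"] for item in items])
```

With real predictions, `sample_entities` would be replaced by the list returned from `model.predict_entities(...)`.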
+
+ ### Advanced Usage Examples
+
+ #### Multi-Category Detection
+ ```python
+ text = """
+ Patient Mary Johnson, DOB 01/15/1980, was discharged on March 10, 2024
+ from St. Mary's Hospital. Contact: [email protected], (555) 123-4567.
+ Insurance policy: POL-789456123.
+ """
+
+ labels = [
+     "name", "dob", "discharge date", "organization medical facility",
+     "email address", "phone number", "policy number"
+ ]
+
+ entities = model.predict_entities(text, labels, threshold=0.3)
+
+ for entity in entities:
+     print(f"Found '{entity['text']}' as {entity['label']}")
+ ```
+
+ #### Batch Processing for High Throughput
+ ```python
+ documents = [
+     "Customer John called about his credit card ending in 4532.",
+     "Sarah's SSN 123-45-6789 needs verification.",
+     "Email [email protected] for account 987654321 issues."
+ ]
+
+ labels = ["name", "credit card", "ssn", "email address", "account number"]
+
+ # Process multiple documents efficiently
+ results = model.run(documents, labels, threshold=0.3, batch_size=8)
+
+ for doc_idx, entities in enumerate(results):
+     print(f"\nDocument {doc_idx + 1}:")
+     for entity in entities:
+         print(f" {entity['text']} => {entity['label']}")
+ ```
+
+ #### Custom Entity Detection
+ ```python
+ # GLiNER isn't limited to PII - you can detect any entities
+ text = "The MacBook Pro with M2 chip costs $1,999 at the Apple Store in Manhattan."
+ custom_labels = ["product", "processor", "price", "store", "location"]
+
+ entities = model.predict_entities(text, custom_labels, threshold=0.3)
+ ```
+
+ #### Threshold Optimization
+ ```python
+ # Lower threshold: Higher recall, more false positives
+ high_recall = model.predict_entities(text, labels, threshold=0.2)
+
+ # Higher threshold: Higher precision, fewer false positives
+ high_precision = model.predict_entities(text, labels, threshold=0.6)
+
+ # Recommended starting point for production
+ balanced = model.predict_entities(text, labels, threshold=0.3)
+ ```
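
One way to choose between these thresholds is to score predictions against a small hand-labeled sample. The following is a rough sketch of that idea, not part of the GLiNER API: `precision_recall` is a hypothetical helper, and both the predictions and the gold annotations are made-up values standing in for a low-threshold model run and your own labels.

```python
def precision_recall(predictions, gold, threshold):
    """Score (text, label) predictions kept at `threshold` against a gold set."""
    kept = {(p["text"], p["label"]) for p in predictions if p["score"] >= threshold}
    true_positives = len(kept & gold)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical low-threshold predictions and hand-labeled gold annotations.
predictions = [
    {"text": "John Smith", "label": "name", "score": 0.95},
    {"text": "415-555-1234", "label": "phone number", "score": 0.55},
    {"text": "account", "label": "account number", "score": 0.25},
]
gold = {("John Smith", "name"), ("415-555-1234", "phone number")}

for threshold in (0.2, 0.3, 0.6):
    p, r = precision_recall(predictions, gold, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Running the model once at a low threshold and filtering by score afterwards avoids repeating inference for every candidate threshold.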
+
+ ## 💡 Use Cases
+
+ GLiNER excels in privacy-focused applications where traditional cloud-based NER services pose compliance risks.
+
+ ### 🎯 **Primary Applications**
+
+ #### Privacy-First Voice & Transcription
+ ```python
+ # Automatically redact PII from voice transcriptions
+ transcription = "Hi, my name is Sarah Johnson and my phone number is 415-555-0123"
+ pii_labels = ["name", "phone number", "email address", "ssn"]
+
+ entities = model.predict_entities(transcription, pii_labels)
+ # Redact or anonymize detected PII before storage
+ ```
+
+ #### Compliance-Ready Document Processing
+ ```python
+ # Healthcare: HIPAA-compliant note processing
+ medical_note = "Patient John Doe, MRN 123456, diagnosed with diabetes..."
+ phi_labels = ["name", "medical record number", "condition", "dob"]
+
+ # Finance: PCI-DSS compliant transaction logs
+ transaction_log = "Card ****4532 charged $299.99 to John Smith"
+ pci_labels = ["credit card", "money", "name"]
+
+ # Legal: Attorney-client privilege protection
+ legal_doc = "Client Jane Doe vs. Corporation ABC, case #2024-CV-001"
+ legal_labels = ["name", "organization", "case number"]
+ ```
+
+ #### Real-Time Data Anonymization
+ ```python
+ def anonymize_text(text, entity_types):
+     """Anonymize PII in real-time"""
+     entities = model.predict_entities(text, entity_types)
+
+     # Sort by position to replace from end to start
+     entities.sort(key=lambda x: x['start'], reverse=True)
+
+     anonymized = text
+     for entity in entities:
+         placeholder = f"<{entity['label'].upper()}>"
+         anonymized = anonymized[:entity['start']] + placeholder + anonymized[entity['end']:]
+
+     return anonymized
+
+ original = "John Smith's SSN is 123-45-6789"
+ anonymized = anonymize_text(original, ["name", "ssn"])
+ print(anonymized) # "<NAME>'s SSN is <SSN>"
+ ```
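
Replacing spans from end to start works as long as the spans do not overlap. If they ever do, later replacements can cut into earlier placeholders. A defensive sketch, assuming entities carry the `start`, `end`, `label`, and `score` keys used elsewhere in this README; `redact_spans` and the sample spans are hypothetical and hand-written, so no model is needed to run it:

```python
def redact_spans(text, entities):
    """Replace entity spans with <LABEL> placeholders, dropping overlapping spans."""
    # When two spans overlap, keep the higher-scoring one.
    chosen = []
    for entity in sorted(entities, key=lambda e: e["score"], reverse=True):
        if all(entity["end"] <= c["start"] or entity["start"] >= c["end"] for c in chosen):
            chosen.append(entity)

    # Rebuild the text left to right from untouched slices and placeholders.
    pieces, cursor = [], 0
    for entity in sorted(chosen, key=lambda e: e["start"]):
        pieces.append(text[cursor:entity["start"]])
        pieces.append(f"<{entity['label'].upper()}>")
        cursor = entity["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


original = "John Smith's SSN is 123-45-6789"
spans = [
    {"label": "name", "start": 0, "end": 10, "score": 0.95},
    {"label": "last name", "start": 5, "end": 10, "score": 0.60},  # overlaps "name"
    {"label": "ssn", "start": 20, "end": 31, "score": 0.90},
]
print(redact_spans(original, spans))  # <NAME>'s SSN is <SSN>
```

Building the result from slices also avoids repeatedly copying the whole string, which matters for long transcripts.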
+
+ ### 🌟 **Extended Applications**
+
+ #### Enhanced Search & Content Understanding
+ ```python
+ # Extract key entities from user queries for better search
+ query = "Find restaurants near Stanford University in Palo Alto"
+ search_entities = ["organization", "location city", "business type"]
+
+ # Intelligent document tagging
+ document = "This quarterly report discusses Microsoft's Azure growth..."
+ doc_entities = ["organization", "product", "time period"]
+ ```
+
+ #### GDPR-Compliant Chatbot Logs
+ ```python
+ def sanitize_chat_log(message):
+     """Remove PII from chat logs per GDPR requirements"""
+     sensitive_types = [
+         "name", "email address", "phone number", "location address",
+         "credit card", "ssn", "passport number"
+     ]
+
+     entities = model.predict_entities(message, sensitive_types)
+     if entities:
+         # Log anonymized version, alert compliance team
+         return anonymize_text(message, sensitive_types)
+     return message
+ ```
+
+ #### Secure Mobile & Edge Processing
+ ```python
+ # Process sensitive data entirely on-device
+ def process_locally(user_input):
+     """Process PII detection without cloud APIs"""
+     pii_types = ["name", "phone number", "email address", "ssn", "credit card"]
+
+     # All processing happens locally - no data leaves device
+     detected_pii = model.predict_entities(user_input, pii_types)
+
+     if detected_pii:
+         return "⚠️ Sensitive information detected - proceed with caution"
+     return "✅ No PII detected - safe to share"
+ ```
+
+ ## 📊 Performance Benchmarks
+
+ ### Accuracy Evaluation
+
+ The following benchmarks were run on the **synthetic-multi-pii-ner-v1** dataset.
+ We compare multiple GLiNER-based PII models, including our new **Knowledgator GLiNER PII Edge v1.0**.
+
+ | Model Path | Precision | Recall | F1 Score |
+ | ---------------------------------------------------------------------- | --------- | ------ | ---------- |
+ | **knowledgator/gliner-pii-edge-v1.0** | 78.96% | 72.34% | **75.50%** |
+ | **knowledgator/gliner-pii-small-v1.0** | 78.99% | 74.80% | **76.84%** |
+ | **knowledgator/gliner-pii-base-v1.0** | 79.28% | 82.78% | **80.99%** |
+ | **knowledgator/gliner-pii-large-v1.0** | 87.42% | 79.40% | **83.25%** |
+ | **urchade/gliner\_multi\_pii-v1** | 79.19% | 74.67% | **76.86%** |
+ | **E3-JSI/gliner-multi-pii-domains-v1** | 78.35% | 74.46% | **76.36%** |
+ | **gravitee-io/gliner-pii-detection** | 81.27% | 56.76% | **66.84%** |
+
+ ### Key Takeaways
+
+ * **Large Model** (`knowledgator/gliner-pii-large-v1.0`) achieves the **highest F1 score (83.25%)** and the highest precision (87.42%), indicating the strongest overall performance.
+ * **Base Model** (`knowledgator/gliner-pii-base-v1.0`) achieves the **highest recall (82.78%)** and the second-highest F1 score (80.99%).
+ * **Knowledgator Edge Model** (`knowledgator/gliner-pii-edge-v1.0`) is optimized for **edge environments**, trading some recall (72.34%) for lower latency and footprint.
+ * **Gravitee-io Model** shows strong precision (81.27%) but the lowest recall (56.76%), indicating it is tuned for high confidence but misses more entities.
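
The F1 column is the harmonic mean of precision and recall, which is why a model with high precision but low recall still scores poorly. A short sketch that recomputes two rows from the table above; the function name `f1_score` is our own choice for illustration:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) copied from the benchmark table.
reported = {
    "knowledgator/gliner-pii-base-v1.0": (79.28, 82.78),
    "gravitee-io/gliner-pii-detection": (81.27, 56.76),
}

for model_name, (precision, recall) in reported.items():
    print(f"{model_name}: F1 = {f1_score(precision, recall):.2f}%")
```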
+
+
+ ### Comparison with Alternatives
+
+ | Solution | Speed | Privacy | Accuracy | Flexibility | Cost |
+ | --------------------- | ----- | ------- | -------- | ----------- | -------- |
+ | **GLiNER** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Free |
+ | Cloud NER APIs | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | \$\$\$ |
+ | Large Language Models | ⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | \$\$\$\$ |
+ | Traditional NER | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐ | Free |
+
+
+ ## 🚀 Alternative Implementations
+
+ While Python provides the most comprehensive PII detection capabilities, GLiNER is available across multiple languages for different deployment scenarios.
+
+ ### 🦀 Rust Implementation (gline-rs)
+
+ **Best for**: High-performance backend services, microservices
+
+ ```toml
+ [dependencies]
+ "gline-rs" = "1"
+ ```
+
+ ```rust
+ use gline_rs::{GLiNER, TextInput, Parameters, RuntimeParameters};
+
+ let model = GLiNER::<TokenMode>::new(
+     Parameters::default(),
+     RuntimeParameters::default(),
+     "tokenizer.json",
+     "model.onnx",
+ )?;
+
+ let input = TextInput::from_str(
+     &["My name is James Bond."],
+     &["person"],
+ )?;
+
+ let output = model.inference(input)?;
+ ```
+
+ **Performance**: 4x faster than Python on CPU, 37x faster with GPU acceleration.
+
+ ### ⚡ C++ Implementation (GLiNER.cpp)
+
+ **Best for**: Embedded systems, mobile apps, edge devices
+
+ ```cpp
+ #include "GLiNER/model.hpp"
+
+ gliner::Config config{12, 512};
+ gliner::Model model("./model.onnx", "./tokenizer.json", config);
+
+ std::vector<std::string> texts = {"John works at Microsoft"};
+ std::vector<std::string> entities = {"person", "organization"};
+
+ auto output = model.inference(texts, entities);
+ ```
+
+ ### 🌐 JavaScript Implementation (GLiNER.js)
+
+ **Best for**: Web applications, browser-based processing
+
+ ```bash
+ npm install gliner
+ ```
+
+ ```javascript
+ import { Gliner } from 'gliner';
+
+ const gliner = new Gliner({
+   tokenizerPath: "onnx-community/gliner_small-v2",
+   onnxSettings: {
+     modelPath: "public/model.onnx",
+     executionProvider: "webgpu",
+   }
+ });
+
+ await gliner.initialize();
+
+ const results = await gliner.inference({
+   texts: ["John Smith works at Microsoft"],
+   entities: ["person", "organization"],
+   threshold: 0.1,
+ });
+ ```
+
+ ## 🏗️ Model Architecture & Training
+
+ ### Quantization-Aware Pretraining
+
+ GLiNER models use quantization-aware pretraining, which optimizes performance while maintaining accuracy. This allows efficient inference even with quantized models.
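
For readers unfamiliar with the term, quantization stores values as small integers plus a shared scale. The sketch below shows generic affine UINT8 quantization on a handful of made-up numbers. It illustrates the general idea only and makes no claim about the scheme actually used to train or export these models; the function names and the `weights` list are hypothetical.

```python
def quantize_uint8(values):
    """Map floats to integers in [0, 255] using a shared scale and zero point."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant input
    zero_point = round(-lo / scale)
    quantized = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize_uint8(quantized, scale, zero_point):
    """Recover approximate floats from the quantized integers."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # hypothetical weight values
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize_uint8(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max round-trip error: {max_error:.4f}")
```

The round-trip error stays within one quantization step, which is the precision loss a quantization-aware training procedure is meant to tolerate.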
+
+ ### Available ONNX Formats
+
+ | Format | Size | Use Case |
+ |--------|------|----------|
+ | **FP16** | 330MB | Balanced performance/accuracy |
+ | **UINT8** | 197MB | Maximum efficiency |
+
+ ### Model Conversion
+
+ ```bash
+ python convert_to_onnx.py \
+     --model_path knowledgator/gliner-pii-base-v1.0 \
+     --save_path ./model \
+     --quantize True # For UINT8 quantization
+ ```
+
+
+ ## 📄 References
+
+ - [GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer](https://arxiv.org/abs/2311.08526)
+ - [GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks](https://arxiv.org/abs/2406.12925)
+ - [Named Entity Recognition as Structured Span Prediction](https://arxiv.org/abs/2212.13415)
+
+ ## 🙏 Acknowledgments
+
+ Special thanks to all GLiNER contributors and the Wordcab team, and additional thanks to the maintainers of the Rust, C++, and JavaScript implementations.
+
+ ## 📞 Support
+
+ - **Hugging Face**: [Ihor/gliner-pii-small](https://huggingface.co/Ihor/gliner-pii-small)
+ - **GitHub Issues**: [Report bugs and request features](https://github.com/info-wordcab/wordcab-pii)
+ - **Discord**: [Join community discussions](https://discord.gg/wRF7tuY9)
+
+ ---
+
+ *GLiNER: Open-source privacy-first entity recognition for production applications.*