Entity Types Classification
Personal Information
- Date of birth
- Age
- Gender
- Last name
- Occupation
- Education level
- Phone number
- Street address
- City
- Country
- Postcode
- User name
- Password
- Tax ID
- License plate
- CVV
- Bank routing number
- Account number
- SWIFT BIC
- Biometric identifier
- Device identifier
- Location
Financial Information
- Account number
- Bank routing number
- SWIFT BIC
- CVV
- Tax ID
- API key
Health and Medical Information
- Blood type
- Biometric identifier
- Organ
- Diseases symptom
- Diagnostics
- Preventive medicine
- Treatment
- Surgery
- Drug chemical
- Medical device technique
- Personal care
Online and Web-related Information
- URL
- IP address
- User name
- API key
Professional Information
- Occupation
- Skill
- Organization
- Company name
Location Information
- City
- Country
- Postcode
- Street address
- Location
Time-Related Information
- Date
- Date time
Miscellaneous
- Event
- Miscellaneous
Product and Goods Information
- Product
- Quantity
- Food drink
- Transportation
Identifiers
- Device identifier
- Biometric identifier
- User name
- Phone number
- URL
- License plate
GLiNER-PII: Zero-shot PII model
A production-grade open-source model for privacy-focused PII, PHI, and PCI detection with zero-shot entity recognition capabilities. This model was developed in collaboration between Wordcab and Knowledgator. For enterprise-ready, specialized PII/PHI/PCI models, contact us at [email protected].
What is GLiNER?
GLiNER (Generalist and Lightweight Named Entity Recognition) is a bidirectional transformer model that can identify any entity type without predefined categories. Unlike traditional NER models that are limited to specific entity classes, GLiNER allows you to specify exactly what entities you want to extract at runtime.
Key Advantages
- Zero-shot recognition: Extract any entity type without retraining
- Privacy-first: Process sensitive data locally without API calls
- Lightweight: Much faster than large language models for NER tasks
- Production-ready: Quantization-aware training with FP16 and UINT8 ONNX models
- Comprehensive: 60+ predefined PII categories with custom entity support
How GLiNER Works
Instead of predicting from a fixed set of entity classes, GLiNER takes both text and a list of desired entity types as input, then identifies spans that match those categories:
text = "John Smith called from 415-555-1234 to discuss his account."
entities = ["name", "phone number", "account number"]
# GLiNER finds: "John Smith" → name, "415-555-1234" → phone number
Python Implementation
The primary GLiNER implementation provides comprehensive PII detection with 60+ entity categories, fine-tuned specifically for privacy and compliance use cases.
Installation
pip install gliner
Quick Start
from gliner import GLiNER
# Load the model (downloads automatically on first use)
model = GLiNER.from_pretrained("knowledgator/gliner-pii-base-v1.0")
text = "John Smith called from 415-555-1234 to discuss his account number 12345678."
labels = ["name", "phone number", "account number"]
entities = model.predict_entities(text, labels, threshold=0.3)
for entity in entities:
    print(f"{entity['text']} => {entity['label']} (confidence: {entity['score']:.2f})")
Output:
John Smith => name (confidence: 0.95)
415-555-1234 => phone number (confidence: 0.92)
12345678 => account number (confidence: 0.88)
Advanced Usage Examples
Multi-Category Detection
text = """
Patient Mary Johnson, DOB 01/15/1980, was discharged on March 10, 2024
from St. Mary's Hospital. Contact: [email protected], (555) 123-4567.
Insurance policy: POL-789456123.
"""
labels = [
    "name", "dob", "discharge date", "organization medical facility",
    "email address", "phone number", "policy number"
]
entities = model.predict_entities(text, labels, threshold=0.3)
for entity in entities:
    print(f"Found '{entity['text']}' as {entity['label']}")
Batch Processing for High Throughput
documents = [
    "Customer John called about his credit card ending in 4532.",
    "Sarah's SSN 123-45-6789 needs verification.",
    "Email [email protected] for account 987654321 issues."
]
labels = ["name", "credit card", "ssn", "email address", "account number"]
# Process multiple documents efficiently
results = model.run(documents, labels, threshold=0.3, batch_size=8)
for doc_idx, entities in enumerate(results):
    print(f"\nDocument {doc_idx + 1}:")
    for entity in entities:
        print(f"  {entity['text']} => {entity['label']}")
Custom Entity Detection
# GLiNER isn't limited to PII - you can detect any entities
text = "The MacBook Pro with M2 chip costs $1,999 at the Apple Store in Manhattan."
custom_labels = ["product", "processor", "price", "store", "location"]
entities = model.predict_entities(text, custom_labels, threshold=0.3)
Threshold Optimization
# Lower threshold: Higher recall, more false positives
high_recall = model.predict_entities(text, labels, threshold=0.2)
# Higher threshold: Higher precision, fewer false positives
high_precision = model.predict_entities(text, labels, threshold=0.6)
# Recommended starting point for production
balanced = model.predict_entities(text, labels, threshold=0.3)
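One way to pick a threshold is to sweep a few candidate values over a small hand-labeled sample and compare precision and recall at each. The sketch below assumes predictions in the dict format returned by `predict_entities` (`start`, `end`, `label`, `score`) and a gold set of `(start, end, label)` tuples. The helper `precision_recall_at` and the sample data are hypothetical and for illustration only.

```python
def precision_recall_at(predictions, gold, threshold):
    """Exact-match precision and recall for predictions scoring >= threshold."""
    kept = {
        (p["start"], p["end"], p["label"])
        for p in predictions
        if p["score"] >= threshold
    }
    true_positives = len(kept & gold)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical predictions collected with a very low threshold,
# so that every candidate span is available for the sweep.
predictions = [
    {"start": 0, "end": 10, "label": "name", "score": 0.95},
    {"start": 23, "end": 35, "label": "phone number", "score": 0.92},
    {"start": 51, "end": 58, "label": "account number", "score": 0.25},  # not in gold
]
gold = {(0, 10, "name"), (23, 35, "phone number")}

for threshold in (0.2, 0.3, 0.6):
    precision, recall = precision_recall_at(predictions, gold, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On real data, the sample should be large enough that a single borderline prediction does not dominate the comparison.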
Use Cases
GLiNER excels in privacy-focused applications where traditional cloud-based NER services pose compliance risks.
Primary Applications
Privacy-First Voice & Transcription
# Automatically redact PII from voice transcriptions
transcription = "Hi, my name is Sarah Johnson and my phone number is 415-555-0123"
pii_labels = ["name", "phone number", "email address", "ssn"]
entities = model.predict_entities(transcription, pii_labels)
# Redact or anonymize detected PII before storage
Compliance-Ready Document Processing
# Healthcare: HIPAA-compliant note processing
medical_note = "Patient John Doe, MRN 123456, diagnosed with diabetes..."
phi_labels = ["name", "medical record number", "condition", "dob"]
# Finance: PCI-DSS compliant transaction logs
transaction_log = "Card ****4532 charged $299.99 to John Smith"
pci_labels = ["credit card", "money", "name"]
# Legal: Attorney-client privilege protection
legal_doc = "Client Jane Doe vs. Corporation ABC, case #2024-CV-001"
legal_labels = ["name", "organization", "case number"]
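The label lists above are defined but not yet passed to a model. One possible way to organize them is a small registry keyed by compliance domain. This is a sketch under stated assumptions: `LABEL_SETS`, `detect_for_domain`, and `stub_predict` are hypothetical names, and `predict` stands for any callable that takes `(text, labels)` the way `model.predict_entities` does.

```python
# Hypothetical registry reusing the label lists from the snippets above
LABEL_SETS = {
    "healthcare": ["name", "medical record number", "condition", "dob"],
    "finance": ["credit card", "money", "name"],
    "legal": ["name", "organization", "case number"],
}


def detect_for_domain(text, domain, predict):
    """Call predict(text, labels) with the label set registered for the domain."""
    if domain not in LABEL_SETS:
        raise ValueError(f"Unknown domain: {domain!r}")
    return predict(text, LABEL_SETS[domain])


def stub_predict(text, labels):
    """Stand-in for model.predict_entities: tags 'John Smith' as a name if asked."""
    name = "John Smith"
    start = text.find(name)
    if "name" in labels and start != -1:
        return [{"start": start, "end": start + len(name),
                 "text": name, "label": "name", "score": 1.0}]
    return []


transaction_log = "Card ****4532 charged $299.99 to John Smith"
print(detect_for_domain(transaction_log, "finance", stub_predict))
```

With a loaded model, `model.predict_entities` can be passed in place of `stub_predict`.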
Real-Time Data Anonymization
def anonymize_text(text, entity_types):
    """Anonymize PII in real-time"""
    entities = model.predict_entities(text, entity_types)
    # Sort by position to replace from end to start
    entities.sort(key=lambda x: x['start'], reverse=True)
    anonymized = text
    for entity in entities:
        placeholder = f"<{entity['label'].upper()}>"
        anonymized = anonymized[:entity['start']] + placeholder + anonymized[entity['end']:]
    return anonymized

original = "John Smith's SSN is 123-45-6789"
anonymized = anonymize_text(original, ["name", "ssn"])
print(anonymized)  # "<NAME>'s SSN is <SSN>"
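The scheme above gives every detected name the same `<NAME>` tag. Where readers of the anonymized text need to tell individuals apart, one option is numbered placeholders that stay consistent for repeated values. The following is a minimal sketch, assuming entity dicts with the `start`, `end`, and `label` keys used above. The function `pseudonymize` and the hand-written sample entities are hypothetical.

```python
def pseudonymize(text, entities):
    """Replace entities with numbered placeholders, reusing numbers for repeated values."""
    numbering = {}  # (label, surface text) -> placeholder
    counters = {}   # label -> number of distinct values seen so far

    # Assign placeholders in reading order so the numbering is stable.
    for entity in sorted(entities, key=lambda e: e["start"]):
        label = entity["label"]
        key = (label, text[entity["start"]:entity["end"]])
        if key not in numbering:
            counters[label] = counters.get(label, 0) + 1
            tag = label.upper().replace(" ", "_")
            numbering[key] = f"<{tag}_{counters[label]}>"

    # Replace from the end so earlier offsets remain valid.
    result = text
    for entity in sorted(entities, key=lambda e: e["start"], reverse=True):
        key = (entity["label"], text[entity["start"]:entity["end"]])
        result = result[:entity["start"]] + numbering[key] + result[entity["end"]:]
    return result, numbering


sample = "John Smith emailed Jane Doe. John Smith's SSN is 123-45-6789."
# Hand-written spans standing in for model output
sample_entities = [
    {"start": 0, "end": 10, "label": "name"},
    {"start": 19, "end": 27, "label": "name"},
    {"start": 29, "end": 39, "label": "name"},
    {"start": 49, "end": 60, "label": "ssn"},
]
redacted, mapping = pseudonymize(sample, sample_entities)
print(redacted)  # <NAME_1> emailed <NAME_2>. <NAME_1>'s SSN is <SSN_1>.
```

The returned mapping links placeholders back to the original values, so it is as sensitive as the source text and should be stored accordingly or discarded.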
Extended Applications
Enhanced Search & Content Understanding
# Extract key entities from user queries for better search
query = "Find restaurants near Stanford University in Palo Alto"
search_entities = ["organization", "location city", "business type"]
# Intelligent document tagging
document = "This quarterly report discusses Microsoft's Azure growth..."
doc_entities = ["organization", "product", "time period"]
GDPR-Compliant Chatbot Logs
def sanitize_chat_log(message):
    """Remove PII from chat logs per GDPR requirements"""
    sensitive_types = [
        "name", "email address", "phone number", "location address",
        "credit card", "ssn", "passport number"
    ]
    entities = model.predict_entities(message, sensitive_types)
    if entities:
        # Log anonymized version, alert compliance team
        return anonymize_text(message, sensitive_types)
    return message
Secure Mobile & Edge Processing
# Process sensitive data entirely on-device
def process_locally(user_input):
    """Process PII detection without cloud APIs"""
    pii_types = ["name", "phone number", "email address", "ssn", "credit card"]
    # All processing happens locally - no data leaves device
    detected_pii = model.predict_entities(user_input, pii_types)
    if detected_pii:
        return "⚠️ Sensitive information detected - proceed with caution"
    return "✅ No PII detected - safe to share"
Performance Benchmarks
Accuracy Evaluation
The following benchmarks were run on the synthetic-multi-pii-ner-v1 dataset. We compare multiple GLiNER-based PII models, including our new Knowledgator GLiNER PII Edge v1.0.
| Model Path | Precision | Recall | F1 Score |
|---|---|---|---|
| knowledgator/gliner-pii-edge-v1.0 | 78.96% | 72.34% | 75.50% |
| knowledgator/gliner-pii-small-v1.0 | 78.99% | 74.80% | 76.84% |
| knowledgator/gliner-pii-base-v1.0 | 79.28% | 82.78% | 80.99% |
| knowledgator/gliner-pii-large-v1.0 | 87.42% | 79.40% | 83.25% |
| urchade/gliner_multi_pii-v1 | 79.19% | 74.67% | 76.86% |
| E3-JSI/gliner-multi-pii-domains-v1 | 78.35% | 74.46% | 76.36% |
| gravitee-io/gliner-pii-detection | 81.27% | 56.76% | 66.84% |
Key Takeaways
- Large model (knowledgator/gliner-pii-large-v1.0) achieves the highest F1 score (83.25%) and the highest precision (87.42%) in the table.
- Base model (knowledgator/gliner-pii-base-v1.0) has the highest recall (82.78%) and the second-highest F1 score (80.99%).
- Knowledgator Edge model (knowledgator/gliner-pii-edge-v1.0) is optimized for edge environments, trading a slight decrease in recall for lower latency and footprint.
- Gravitee-io model shows strong precision (81.27%) but much lower recall (56.76%), indicating it is tuned for high confidence but misses more entities.
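The F1 column can be sanity-checked against the precision and recall columns, since F1 is their harmonic mean. The sketch below assumes the reported F1 values derive from the listed precision and recall. Small gaps can come from rounding or from how scores were averaged, which the table does not state. The names `f1_score` and `REPORTED` are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the benchmark table
REPORTED = {
    "knowledgator/gliner-pii-edge-v1.0": (78.96, 72.34, 75.50),
    "knowledgator/gliner-pii-small-v1.0": (78.99, 74.80, 76.84),
    "knowledgator/gliner-pii-base-v1.0": (79.28, 82.78, 80.99),
    "knowledgator/gliner-pii-large-v1.0": (87.42, 79.40, 83.25),
    "urchade/gliner_multi_pii-v1": (79.19, 74.67, 76.86),
    "E3-JSI/gliner-multi-pii-domains-v1": (78.35, 74.46, 76.36),
    "gravitee-io/gliner-pii-detection": (81.27, 56.76, 66.84),
}

for name, (precision, recall, reported_f1) in REPORTED.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed F1 {computed:.2f}% vs reported {reported_f1:.2f}%")

# Model with the highest reported F1
best = max(REPORTED, key=lambda name: REPORTED[name][2])
print("Highest reported F1:", best)
```

Every recomputed value lands within 0.05 percentage points of the reported figure.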
Comparison with Alternatives
| Solution | Speed | Privacy | Accuracy | Flexibility | Cost |
|---|---|---|---|---|---|
| GLiNER | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Free |
| Cloud NER APIs | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | $$$ |
| Large Language Models | ⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | $$$$ |
| Traditional NER | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐ | Free |
Alternative Implementations
While Python provides the most comprehensive PII detection capabilities, GLiNER is available across multiple languages for different deployment scenarios.
Rust Implementation (gline-rs)
Best for: High-performance backend services, microservices
[dependencies]
"gline-rs" = "1"
use gline_rs::{GLiNER, TextInput, Parameters, RuntimeParameters};
let model = GLiNER::<TokenMode>::new(
    Parameters::default(),
    RuntimeParameters::default(),
    "tokenizer.json",
    "model.onnx",
)?;
let input = TextInput::from_str(
    &["My name is James Bond."],
    &["person"],
)?;
let output = model.inference(input)?;
Performance: 4x faster than Python on CPU, 37x faster with GPU acceleration.
C++ Implementation (GLiNER.cpp)
Best for: Embedded systems, mobile apps, edge devices
#include "GLiNER/model.hpp"
gliner::Config config{12, 512};
gliner::Model model("./model.onnx", "./tokenizer.json", config);
std::vector<std::string> texts = {"John works at Microsoft"};
std::vector<std::string> entities = {"person", "organization"};
auto output = model.inference(texts, entities);
JavaScript Implementation (GLiNER.js)
Best for: Web applications, browser-based processing
npm install gliner
import { Gliner } from 'gliner';
const gliner = new Gliner({
  tokenizerPath: "onnx-community/gliner_small-v2",
  onnxSettings: {
    modelPath: "public/model.onnx",
    executionProvider: "webgpu",
  }
});
await gliner.initialize();
const results = await gliner.inference({
  texts: ["John Smith works at Microsoft"],
  entities: ["person", "organization"],
  threshold: 0.1,
});
Model Architecture & Training
Quantization-Aware Pretraining
GLiNER models use quantization-aware pretraining, which optimizes performance while maintaining accuracy. This allows efficient inference even with quantized models.
Available ONNX Formats
| Format | Size | Use Case |
|---|---|---|
| FP16 | 330MB | Balanced performance/accuracy |
| UINT8 | 197MB | Maximum efficiency |
Model Conversion
python convert_to_onnx.py \
--model_path knowledgator/gliner-pii-base-v1.0 \
--save_path ./model \
--quantize True # For UINT8 quantization
References
- GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer
- GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks
- Named Entity Recognition as Structured Span Prediction
Acknowledgments
Special thanks to all GLiNER contributors and the Wordcab team, and additional thanks to the maintainers of the Rust, C++, and JavaScript implementations.
Support
- Hugging Face: Ihor/gliner-pii-small
- GitHub Issues: Report bugs and request features
- Discord: Join community discussions
GLiNER: Open-source privacy-first entity recognition for production applications.