---
language:
- en
license: mit
tags:
- text-classification
- url-classification
- bert
- domain-classification
pipeline_tag: text-classification
widget:
- text: https://acmewidgets.com
- text: https://store.myshopify.com
- text: https://example.wixsite.com/store
- text: https://business.com
datasets:
- custom
metrics:
- accuracy
- f1
- precision
- recall
---
# URL Classifier
A fine-tuned BERT model for binary classification of URLs as either **platform listings** (e.g., `*.myshopify.com`, `*.wixsite.com`) or **official websites** (e.g., `acmewidgets.com`).
## Model Description
This model is a fine-tuned version of [amahdaouy/DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT) trained to distinguish between:
- **LABEL_0 (official_website)**: Direct company/brand websites
- **LABEL_1 (platform)**: Third-party platform listings (Shopify, Wix, etc.)
## Training Details
### Base Model
- **Architecture**: BERT for Sequence Classification
- **Base Model**: `amahdaouy/DomURLs_BERT`
- **Tokenizer**: `CrabInHoney/urlbert-tiny-base-v4`
### Training Configuration
| Parameter | Value |
|-----------|-------|
| **Epochs** | 20 |
| **Learning Rate** | 2e-5 |
| **Batch Size** | 32 |
| **Max Sequence Length** | 64 tokens |
| **Optimizer** | AdamW |
| **Weight Decay** | 0.01 |
| **LR Scheduler** | ReduceLROnPlateau |
| **Early Stopping** | Patience: 3, Threshold: 0.001 |
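To make the early-stopping row concrete, here is a minimal sketch of one common formulation of patience-based early stopping with an improvement threshold. It is not the training code used for this model: the `should_stop` helper and the validation-loss values are hypothetical, and it assumes the monitored quantity is a loss that should decrease.

```python
def should_stop(val_losses, patience=3, threshold=0.001):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if it lowers the best loss
    seen so far by more than `threshold`.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best - loss > threshold:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses, for illustration only.
losses = [0.40, 0.20, 0.10, 0.0995, 0.0993, 0.0992, 0.05]
stop_epoch = should_stop(losses)
print(stop_epoch)  # 6: epochs 4-6 each improve by less than 0.001
```

With these settings, a run configured for 20 epochs can end earlier if the monitored metric plateaus for three consecutive evaluations.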
### Training Data
- Custom curated dataset of platform and official website URLs
- Balanced training set with equal representation of both classes
- Domain-specific preprocessing and data augmentation
## Performance
### Test Set Metrics
| Metric | Threshold | Achieved |
|--------|-----------|----------|
| **Accuracy** | ≥ 0.80 | **≥ 0.99** ✅ |
| **F1 Score** | ≥ 0.80 | **≥ 0.99** ✅ |
| **Precision** | ≥ 0.80 | **≥ 0.99** ✅ |
| **Recall** | ≥ 0.80 | **≥ 0.99** ✅ |
| **False Positive Rate** | ≤ 0.15 | **< 0.01** ✅ |
| **False Negative Rate** | ≤ 0.15 | **< 0.01** ✅ |
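For readers who want to see how these six metrics relate, the sketch below derives them from binary confusion-matrix counts. Treating `platform` (LABEL_1) as the positive class is an assumption, the `binary_metrics` helper is hypothetical, and the counts are illustrative only; they are not this model's test results.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from raw counts.

    Assumes every denominator is non-zero (both classes are present
    and at least one positive prediction was made).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# Illustrative counts only, not taken from the model's evaluation.
metrics = binary_metrics(tp=90, fp=10, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Note that the false negative rate is always `1 - recall`, so the two corresponding rows of the table carry the same information.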
### Example Predictions
- `https://acmewidgets.com` → **official_website** (99.98% confidence)
- `https://store.myshopify.com` → **platform** (75.96% confidence)
- `https://example.wixsite.com/store` → **platform** (high confidence)
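Because confidence varies across predictions (compare 99.98% with 75.96% above), a downstream pipeline may want to route low-confidence results to manual review. The following is a minimal sketch of that idea; the `route_prediction` helper, the `needs_review` label, and the 0.90 cutoff are all hypothetical choices, not part of the model.

```python
def route_prediction(label, confidence, min_confidence=0.90):
    """Return the label if it is confident enough, else 'needs_review'."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    return label if confidence >= min_confidence else "needs_review"


# Confidence values taken from the example predictions above.
print(route_prediction("official_website", 0.9998))
print(route_prediction("platform", 0.7596))
```

The appropriate cutoff depends on the cost of a wrong label in your application and is best tuned on held-out data.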
## Usage
### Direct Inference with Transformers
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "DiligentAI/urlbert-url-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify URL (inputs longer than 64 tokens are truncated)
url = "https://acmewidgets.com"
inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=64)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

predicted_class = torch.argmax(predictions, dim=1).item()
confidence = predictions[0][predicted_class].item()
label_map = {0: "official_website", 1: "platform"}
print(f"Prediction: {label_map[predicted_class]} ({confidence:.2%})")
```
### Using Hugging Face Pipeline
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
# Returns a list with one dict per input, each with "label" and "score" keys
result = classifier("https://store.myshopify.com")
```
### Pydantic Integration (Production-Ready)
```python
from typing import Literal

from pydantic import BaseModel, Field
from transformers import pipeline

# Load the pipeline once at import time rather than on every call
classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
label_map = {"LABEL_0": "official_website", "LABEL_1": "platform"}


class URLClassificationResult(BaseModel):
    url: str
    label: Literal["official_website", "platform"]
    confidence: float = Field(..., ge=0.0, le=1.0)


def classify_url(url: str) -> URLClassificationResult:
    # Truncate at the token level (not by characters) to the 64-token training length
    result = classifier(url, truncation=True, max_length=64)[0]
    return URLClassificationResult(
        url=url,
        label=label_map[result["label"]],
        confidence=result["score"],
    )
```
## Limitations and Bias
- **Max URL Length**: The model was trained on 64-token sequences. Longer URLs are truncated.
- **Domain Focus**: Optimized for e-commerce and business websites.
- **Platform Coverage**: Best performance on common platforms (Shopify, Wix, etc.).
- **Language**: Primarily trained on English-language domains.
- **Edge Cases**: May have lower confidence on:
  - Uncommon TLDs
  - Very short URLs
  - Internationalized domain names
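One way to act on the edge cases listed above is to flag such URLs before classification so their predictions can be treated with extra caution. The sketch below uses only the Python standard library. The `edge_case_flags` helper, the `COMMON_TLDS` set, and the 15-character minimum are hypothetical and should be adapted to your own data; it also assumes URLs include a scheme such as `https://`, as in the examples in this card.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list of "common" TLDs; not derived from the training data.
COMMON_TLDS = {"com", "org", "net", "io", "co"}


def edge_case_flags(url, min_length=15):
    """Return a list of edge-case flags that apply to the given URL."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    flags = []
    if tld not in COMMON_TLDS:
        flags.append("uncommon_tld")
    if len(url) < min_length:
        flags.append("very_short_url")
    # IDNs appear either as raw non-ASCII labels or as punycode ("xn--") labels.
    has_non_ascii = any(ord(ch) > 127 for ch in host)
    has_punycode = any(part.startswith("xn--") for part in host.split("."))
    if has_non_ascii or has_punycode:
        flags.append("internationalized_domain")
    return flags


print(edge_case_flags("https://acmewidgets.com"))
print(edge_case_flags("https://a.xyz"))
print(edge_case_flags("https://xn--mnchen-3ya.com"))
```

A non-empty result does not mean the prediction is wrong, only that the URL falls into a category where the model may be less confident.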
## Intended Use
**Primary Use Cases:**
- URL filtering and categorization pipelines
- Lead qualification systems
- Web scraping and data collection workflows
- Business intelligence and market research

**Out of Scope:**
- Content classification (only URL structure is analyzed)
- Malicious URL detection (use dedicated security models)
- Language detection
- Spam filtering
## Model Card Authors
DiligentAI Team
## Citation
```bibtex
@misc{urlbert-classifier-2025,
  author = {DiligentAI},
  title = {URL Classifier - Platform vs Official Website Detection},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/DiligentAI/urlbert-url-classifier}}
}
```
## License
MIT License
## Framework Versions
- **Transformers**: 4.57.0+
- **PyTorch**: 2.0.0+
- **Python**: 3.10+
## Training Infrastructure
- **Framework**: PyTorch + Hugging Face Transformers
- **Pipeline Orchestration**: DVC (Data Version Control)
- **CI/CD**: GitHub Actions
- **Model Format**: Safetensors
- **Dependencies**: See repository
## Model Versioning
This model is automatically versioned and deployed via GitHub Actions. Each release includes:
- Model checkpoint (`.safetensors`)
- Tokenizer configuration
- Label mapping (`label_map.json`)
- Performance metrics (`metrics.json`)
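As an illustration of how a release's metrics file could be checked against the thresholds in the Test Set Metrics table, here is a minimal sketch of a release gate. The schema of `metrics.json` is not documented in this card, so the key names below are assumptions, as are the `failed_checks` helper and the sample values.

```python
import json

# Thresholds from the Test Set Metrics table; the key names are assumed.
MIN_THRESHOLDS = {"accuracy": 0.80, "f1": 0.80, "precision": 0.80, "recall": 0.80}
MAX_THRESHOLDS = {"false_positive_rate": 0.15, "false_negative_rate": 0.15}


def failed_checks(metrics):
    """Return the names of metrics that are missing or violate a threshold."""
    failures = []
    for name, minimum in MIN_THRESHOLDS.items():
        if name not in metrics or metrics[name] < minimum:
            failures.append(name)
    for name, maximum in MAX_THRESHOLDS.items():
        if name not in metrics or metrics[name] > maximum:
            failures.append(name)
    return failures


# Hypothetical metrics.json content, for illustration only.
sample_json = """
{"accuracy": 0.95, "f1": 0.94, "precision": 0.96, "recall": 0.92,
 "false_positive_rate": 0.04, "false_negative_rate": 0.20}
"""
failures = failed_checks(json.loads(sample_json))
print(failures)  # ['false_negative_rate']
```

In a CI job, a non-empty list would typically cause the release step to fail.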
## Contact
For issues, questions, or feedback:
- **GitHub**: DiligentAI/url-classifier
- **Organization**: DiligentAI