URL Classifier
A fine-tuned BERT model for binary classification of URLs as either platform listings (e.g., *.myshopify.com, *.wixsite.com) or official websites (e.g., acmewidgets.com).
Model Description
This model is a fine-tuned version of amahdaouy/DomURLs_BERT trained to distinguish between:
- LABEL_0 (official_website): Direct company/brand websites
- LABEL_1 (platform): Third-party platform listings (Shopify, Wix, etc.)
Training Details
Base Model
- Architecture: BERT for Sequence Classification
- Base Model: amahdaouy/DomURLs_BERT
- Tokenizer: CrabInHoney/urlbert-tiny-base-v4
Training Configuration
| Parameter | Value |
|---|---|
| Epochs | 20 |
| Learning Rate | 2e-5 |
| Batch Size | 32 |
| Max Sequence Length | 64 tokens |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| LR Scheduler | ReduceLROnPlateau |
| Early Stopping | Patience: 3, Threshold: 0.001 |
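One way to read the early-stopping row: training halts once the monitored validation metric has failed to improve by more than 0.001 for 3 consecutive evaluations. The sketch below illustrates that rule in plain Python. The class name `EarlyStopper`, the helper `epochs_run`, the use of validation loss as the monitored metric, and one evaluation per epoch are all assumptions for illustration, not the project's actual training code.

```python
class EarlyStopper:
    """Hypothetical helper: signal a stop when validation loss stalls."""

    def __init__(self, patience: int = 3, threshold: float = 0.001):
        self.patience = patience
        self.threshold = threshold
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best - self.threshold:
            # Improvement larger than the threshold: reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, max_epochs: int = 20) -> int:
    """Return how many epochs would run before stopping (assumed loop shape)."""
    stopper = EarlyStopper()
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.step(loss):
            return epoch
    return min(len(val_losses), max_epochs)
```

With a patience of 3, a run can end well before the 20-epoch cap if the validation loss plateaus early.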
Training Data
- Custom curated dataset of platform and official website URLs
- Balanced training set with equal representation of both classes
- Domain-specific preprocessing and data augmentation
Performance
Test Set Metrics
| Metric | Threshold | Achieved |
|---|---|---|
| Accuracy | ≥ 0.80 | ≥ 0.99 ✓ |
| F1 Score | ≥ 0.80 | ≥ 0.99 ✓ |
| Precision | ≥ 0.80 | ≥ 0.99 ✓ |
| Recall | ≥ 0.80 | ≥ 0.99 ✓ |
| False Positive Rate | ≤ 0.15 | < 0.01 ✓ |
| False Negative Rate | ≤ 0.15 | < 0.01 ✓ |
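For readers who want to reproduce these metrics on their own labeled URLs, the following sketch derives all six from confusion-matrix counts. It assumes platform (LABEL_1) is treated as the positive class, which the card does not state; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the table's metrics from confusion-matrix counts.

    Assumption: "positive" means platform (LABEL_1).
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```

Note that the false negative rate is simply one minus recall, so the last row of the table follows directly from the recall row.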
Example Predictions
- https://acmewidgets.com → official_website (99.98% confidence)
- https://store.myshopify.com → platform (75.96% confidence)
- https://example.wixsite.com/store → platform (high confidence)
Usage
Direct Inference with Transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "DiligentAI/urlbert-url-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify URL
url = "https://acmewidgets.com"
inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=64)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

predicted_class = torch.argmax(predictions, dim=1).item()
confidence = predictions[0][predicted_class].item()

label_map = {0: "official_website", 1: "platform"}
print(f"Prediction: {label_map[predicted_class]} ({confidence:.2%})")
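The confidence printed above is the softmax probability of the highest-scoring class. As a dependency-free illustration of that step, the sketch below applies softmax to a pair of logits in plain Python. The function names `softmax` and `predict` are hypothetical, and any logit values passed in are made-up inputs, not outputs of this model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, label_map):
    """Return (label, confidence) for the highest-probability class."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return label_map[idx], probs[idx]


label_map = {0: "official_website", 1: "platform"}
# Illustrative logits only
label, confidence = predict([2.0, -1.0], label_map)
print(f"{label} ({confidence:.2%})")
```

Because there are only two classes, the confidence can never fall below 50%, which is worth keeping in mind when choosing a cutoff.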
Using Hugging Face Pipeline
from transformers import pipeline
classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
result = classifier("https://store.myshopify.com")
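A text-classification pipeline returns a list of dictionaries with `label` and `score` keys. Assuming the model config exposes the generic `LABEL_0`/`LABEL_1` names (as the label list at the top of this card suggests), a small mapping step makes the output readable. The helper `to_readable` is hypothetical, and the sample value below only mirrors the example prediction listed earlier rather than a fresh model run.

```python
LABEL_MAP = {"LABEL_0": "official_website", "LABEL_1": "platform"}


def to_readable(result):
    """Convert pipeline output dicts into (label, score) pairs.

    Unknown label strings are passed through unchanged.
    """
    return [(LABEL_MAP.get(r["label"], r["label"]), r["score"]) for r in result]


# Same shape as the pipeline output; values are illustrative
sample = [{"label": "LABEL_1", "score": 0.7596}]
print(to_readable(sample))
```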
Pydantic Integration (Production-Ready)
from functools import lru_cache
from typing import Literal

from pydantic import BaseModel, Field
from transformers import pipeline

LABEL_MAP = {"LABEL_0": "official_website", "LABEL_1": "platform"}


class URLClassificationResult(BaseModel):
    url: str
    label: Literal["official_website", "platform"]
    confidence: float = Field(..., ge=0.0, le=1.0)


@lru_cache(maxsize=1)
def get_classifier():
    # Load the pipeline once and reuse it across calls
    return pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")


def classify_url(url: str) -> URLClassificationResult:
    # Truncate at the tokenizer level (64 tokens) rather than slicing characters
    result = get_classifier()(url, truncation=True, max_length=64)[0]
    return URLClassificationResult(
        url=url,
        label=LABEL_MAP[result["label"]],
        confidence=result["score"],
    )
Limitations and Bias
- Max URL Length: The model was trained on 64-token sequences; longer URLs are truncated.
- Domain Focus: Optimized for e-commerce and business websites.
- Platform Coverage: Best performance on common platforms (Shopify, Wix, etc.).
- Language: Primarily trained on English-language domains.
- Edge Cases: May have lower confidence on:
  - Uncommon TLDs
  - Very short URLs
  - Internationalized domain names
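Callers may want to flag URLs that fall into the edge cases listed above before trusting a prediction. The sketch below is one possible pre-check using only the standard library. The function name `edge_case_flags`, the `COMMON_TLDS` allow-list, and the 15-character cutoff for "very short" are all hypothetical choices, not values taken from the model's training setup.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; extend it to match your own traffic
COMMON_TLDS = {"com", "net", "org", "io", "co"}


def edge_case_flags(url: str, min_length: int = 15) -> list[str]:
    """Return the names of edge-case categories that a URL falls into."""
    # urlsplit only fills in the hostname when a scheme or "//" is present
    host = urlsplit(url if "://" in url else "//" + url).hostname or ""
    flags = []
    if len(url) < min_length:
        flags.append("very_short_url")
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    if tld not in COMMON_TLDS:
        flags.append("uncommon_tld")
    # Internationalized names appear either as raw Unicode or as punycode labels
    if not host.isascii() or any(p.startswith("xn--") for p in host.split(".")):
        flags.append("internationalized_domain")
    return flags
```

A non-empty result does not mean the prediction is wrong, only that its confidence score deserves a closer look.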
Intended Use
Primary Use Cases:
- URL filtering and categorization pipelines
- Lead qualification systems
- Web scraping and data collection workflows
- Business intelligence and market research

Out of Scope:
- Content classification (only URL structure is analyzed)
- Malicious URL detection (use dedicated security models)
- Language detection
- Spam filtering
Model Card Authors
DiligentAI Team
Citation
@misc{urlbert-classifier-2025,
author = {DiligentAI},
title = {URL Classifier - Platform vs Official Website Detection},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/DiligentAI/urlbert-url-classifier}}
}
License
MIT License
Framework Versions
- Transformers: 4.57.0+
- PyTorch: 2.0.0+
- Python: 3.10+
Training Infrastructure
- Framework: PyTorch + Hugging Face Transformers
- Pipeline Orchestration: DVC (Data Version Control)
- CI/CD: GitHub Actions
- Model Format: Safetensors
- Dependencies: See repository
Model Versioning
This model is automatically versioned and deployed via GitHub Actions. Each release includes:
- Model checkpoint (.safetensors)
- Tokenizer configuration
- Label mapping (label_map.json)
- Performance metrics (metrics.json)
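A release pipeline could use the shipped metrics file to confirm that a checkpoint meets the thresholds in the Test Set Metrics table before deployment. The sketch below shows one such gate. The schema of `metrics.json` is not documented in this card, so the flat key names used here (`accuracy`, `f1`, `false_positive_rate`, and so on) are assumptions, as are the function names `failed_checks` and `load_metrics`.

```python
import json

# Thresholds from the Test Set Metrics table; key names are assumed
MIN_THRESHOLDS = {"accuracy": 0.80, "f1": 0.80, "precision": 0.80, "recall": 0.80}
MAX_THRESHOLDS = {"false_positive_rate": 0.15, "false_negative_rate": 0.15}


def failed_checks(metrics: dict) -> list[str]:
    """Return the metric names that miss their threshold (missing keys fail)."""
    failures = [
        name for name, floor in MIN_THRESHOLDS.items()
        if metrics.get(name, 0.0) < floor
    ]
    failures += [
        name for name, ceiling in MAX_THRESHOLDS.items()
        if metrics.get(name, 1.0) > ceiling
    ]
    return failures


def load_metrics(path: str = "metrics.json") -> dict:
    """Read a metrics file, assuming a flat JSON object of floats."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

A caller would run `failed_checks(load_metrics())` and block the release if the returned list is non-empty.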
Contact
For issues, questions, or feedback:
- GitHub: DiligentAI/url-classifier
- Organization: DiligentAI