URL Classifier

A fine-tuned BERT model for binary classification of URLs as either platform listings (e.g., *.myshopify.com, *.wixsite.com) or official websites (e.g., acmewidgets.com).

Model Description

This model is a fine-tuned version of amahdaouy/DomURLs_BERT trained to distinguish between:

LABEL_0 (official_website): Direct company/brand websites
LABEL_1 (platform): Third-party platform listings (Shopify, Wix, etc.)

Training Details

Base Model

Architecture: BERT for Sequence Classification
Base Model: amahdaouy/DomURLs_BERT
Tokenizer: CrabInHoney/urlbert-tiny-base-v4

Training Configuration

Parameter	Value
Epochs	20
Learning Rate	2e-5
Batch Size	32
Max Sequence Length	64 tokens
Optimizer	AdamW
Weight Decay	0.01
LR Scheduler	ReduceLROnPlateau
Early Stopping	Patience: 3, Threshold: 0.001
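
The early-stopping row can be read as "stop once the monitored value has failed to improve by at least 0.001 for 3 consecutive epochs". The sketch below illustrates that rule in plain Python. It assumes the monitored value is validation loss (lower is better) and that the threshold is an absolute difference; the card does not state either, and the `EarlyStopping` class and `epochs_run` helper are hypothetical names, not part of the released training code.

```python
class EarlyStopping:
    """Signal a stop after `patience` epochs without a meaningful improvement."""

    def __init__(self, patience: int = 3, threshold: float = 0.001):
        self.patience = patience
        self.threshold = threshold
        self.best = float("inf")  # assumption: lower is better (validation loss)
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.threshold:
            # Improved by more than the threshold: reset the counter.
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, max_epochs: int = 20) -> int:
    """Return how many epochs would run before stopping, capped at `max_epochs`."""
    stopper = EarlyStopping()
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.step(loss):
            return epoch
    return min(len(val_losses), max_epochs)
```

With this rule, the 20-epoch setting acts as an upper bound: a run whose loss plateaus early ends well before epoch 20.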

Training Data

Custom curated dataset of platform and official website URLs
Balanced training set with equal representation of both classes
Domain-specific preprocessing and data augmentation

Performance

Test Set Metrics

Metric	Threshold	Achieved
Accuracy	≥ 0.80	≥ 0.99 ✅
F1 Score	≥ 0.80	≥ 0.99 ✅
Precision	≥ 0.80	≥ 0.99 ✅
Recall	≥ 0.80	≥ 0.99 ✅
False Positive Rate	≤ 0.15	< 0.01 ✅
False Negative Rate	≤ 0.15	< 0.01 ✅
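
For readers who want to reproduce this table on their own test split, the six metrics all follow from the four confusion-matrix counts. The sketch below assumes LABEL_1 (platform) is treated as the positive class, which the card does not specify; `binary_metrics` and `meets_thresholds` are hypothetical helper names, and the threshold values are copied from the table above.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the table's metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "f1": f1,
        "precision": precision,
        "recall": recall,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


# Thresholds from the table: lower bounds for quality, upper bounds for error rates.
MIN_THRESHOLDS = {"accuracy": 0.80, "f1": 0.80, "precision": 0.80, "recall": 0.80}
MAX_THRESHOLDS = {"false_positive_rate": 0.15, "false_negative_rate": 0.15}


def meets_thresholds(metrics: dict) -> bool:
    """Return True only if every metric satisfies its threshold."""
    return all(metrics[k] >= v for k, v in MIN_THRESHOLDS.items()) and all(
        metrics[k] <= v for k, v in MAX_THRESHOLDS.items()
    )
```

Note that the false negative rate is always 1 minus recall, so the last row of the table is implied by the recall row.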

Example Predictions

https://acmewidgets.com → official_website (99.98% confidence)
https://store.myshopify.com → platform (75.96% confidence)
https://example.wixsite.com/store → platform (high confidence)
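
Confidence varies noticeably across these examples, so a downstream consumer may want to treat low-confidence predictions differently. The sketch below shows one possible approach: predictions under a cutoff are routed to manual review. The `route_prediction` helper, the `needs_review` label, and the 0.90 cutoff are illustrative assumptions, not recommendations from the model authors.

```python
def route_prediction(label: str, confidence: float, min_confidence: float = 0.90) -> str:
    """Return the label if confident enough, otherwise a review marker."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    # The 0.90 default is an arbitrary example value; tune it on held-out data.
    return label if confidence >= min_confidence else "needs_review"
```

Under this assumed cutoff, the first example above would be accepted as-is, while the second would be sent for review.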

Usage

Direct Inference with Transformers

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "DiligentAI/urlbert-url-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify URL
url = "https://acmewidgets.com"
inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=64)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()
    confidence = predictions[0][predicted_class].item()

label_map = {0: "official_website", 1: "platform"}
print(f"Prediction: {label_map[predicted_class]} ({confidence:.2%})")

Using Hugging Face Pipeline

from transformers import pipeline

classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
result = classifier("https://store.myshopify.com")

# `result` is a list with one dict holding the predicted "label" and its "score"
print(result)


Pydantic Integration (Production-Ready)

from functools import lru_cache
from typing import Literal

from pydantic import BaseModel, Field
from transformers import pipeline

MODEL_NAME = "DiligentAI/urlbert-url-classifier"
LABEL_MAP = {"LABEL_0": "official_website", "LABEL_1": "platform"}

class URLClassificationResult(BaseModel):
    url: str
    label: Literal["official_website", "platform"]
    confidence: float = Field(..., ge=0.0, le=1.0)

@lru_cache(maxsize=1)
def get_classifier():
    # Load the pipeline once and reuse it; reloading the model on every call is slow
    return pipeline("text-classification", model=MODEL_NAME)

def classify_url(url: str) -> URLClassificationResult:
    # Truncate by tokens (not characters) to match the 64-token training length
    result = get_classifier()(url, truncation=True, max_length=64)[0]

    return URLClassificationResult(
        url=url,
        label=LABEL_MAP[result["label"]],
        confidence=result["score"],
    )

Limitations and Bias

Max URL Length: The model was trained on 64-token sequences. Longer URLs are truncated.
Domain Focus: Optimized for e-commerce and business websites
Platform Coverage: Best performance on common platforms (Shopify, Wix, etc.)
Language: Primarily trained on English-language domains
Edge Cases: May have lower confidence on uncommon TLDs, very short URLs, and internationalized domain names
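
Because confidence may drop on these edge cases, a caller might flag such URLs before trusting the prediction. The sketch below is one way to do that with the standard library. The `edge_case_flags` helper, the `COMMON_TLDS` set, and the 12-character cutoff are all hypothetical; the card does not define what counts as "uncommon" or "very short".

```python
from urllib.parse import urlsplit

# Hypothetical allow-list and cutoff; neither comes from the model card.
COMMON_TLDS = {"com", "org", "net", "io", "co"}
MIN_URL_LENGTH = 12


def edge_case_flags(url: str) -> list[str]:
    """Return the names of the documented edge cases that `url` appears to hit."""
    # Scheme-less input such as "acmewidgets.com" needs a "//" prefix to parse a host.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""
    flags = []
    if len(url) < MIN_URL_LENGTH:
        flags.append("short_url")
    tld = host.rsplit(".", 1)[-1]
    if tld not in COMMON_TLDS:
        flags.append("uncommon_tld")
    # Non-ASCII hosts and punycode labels (xn--) both indicate an IDN.
    if not host.isascii() or any(label.startswith("xn--") for label in host.split(".")):
        flags.append("idn")
    return flags
```

An empty list means none of the assumed checks fired; it does not guarantee a high-confidence prediction.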

Intended Use

Primary Use Cases:
URL filtering and categorization pipelines
Lead qualification systems
Web scraping and data collection workflows
Business intelligence and market research

Out of Scope:
Content classification (only URL structure is analyzed)
Malicious URL detection (use dedicated security models)
Language detection
Spam filtering

Model Card Authors

DiligentAI Team

Citation

@misc{urlbert-classifier-2025,
  author = {DiligentAI},
  title = {URL Classifier - Platform vs Official Website Detection},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/DiligentAI/urlbert-url-classifier}}
}

License

MIT License

Framework Versions

Transformers: 4.57.0+
PyTorch: 2.0.0+
Python: 3.10+

Training Infrastructure

Framework: PyTorch + Hugging Face Transformers
Pipeline Orchestration: DVC (Data Version Control)
CI/CD: GitHub Actions
Model Format: Safetensors
Dependencies: See repository

Model Versioning

This model is automatically versioned and deployed via GitHub Actions. Each release includes:

Model checkpoint (.safetensors)
Tokenizer configuration
Label mapping (label_map.json)
Performance metrics (metrics.json)
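
When consuming the label mapping from a release, note that JSON object keys are always strings, while `argmax` over the logits yields an integer class index. The sketch below converts the keys on load. It assumes `label_map.json` has the shape `{"0": "official_website", "1": "platform"}`; the actual schema is not documented in this card, and `load_label_map` is a hypothetical helper.

```python
import json


def load_label_map(text: str) -> dict[int, str]:
    """Parse a label map and convert its string keys to integer class indices."""
    raw = json.loads(text)
    # Assumed schema: {"<class index>": "<label name>"}
    return {int(index): name for index, name in raw.items()}
```

For example, `load_label_map(open("label_map.json").read())[1]` would then return the label for class index 1, assuming the file follows the schema above.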

Contact

For issues, questions, or feedback:

GitHub: DiligentAI/url-classifier
Organization: DiligentAI

Model size: 0.1B params (Safetensors)
Tensor type: F32