URL Classifier

A fine-tuned BERT model for binary classification of URLs as either platform listings (e.g., *.myshopify.com, *.wixsite.com) or official websites (e.g., acmewidgets.com).

Model Description

This model is a fine-tuned version of amahdaouy/DomURLs_BERT trained to distinguish between:

  • LABEL_0 (official_website): Direct company/brand websites
  • LABEL_1 (platform): Third-party platform listings (Shopify, Wix, etc.)

Training Details

Base Model

  • Architecture: BERT for Sequence Classification
  • Base Model: amahdaouy/DomURLs_BERT
  • Tokenizer: CrabInHoney/urlbert-tiny-base-v4

Training Configuration

Parameter             Value
Epochs                20
Learning Rate         2e-5
Batch Size            32
Max Sequence Length   64 tokens
Optimizer             AdamW
Weight Decay          0.01
LR Scheduler          ReduceLROnPlateau
Early Stopping        Patience: 3, Threshold: 0.001
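
The early-stopping rule in the configuration above can be illustrated with a short sketch. This card does not state which metric is monitored or how the threshold is applied, so the code assumes a higher-is-better validation metric and an absolute improvement threshold. The names EarlyStopping and epochs_run are hypothetical and are not taken from the training code.

```python
class EarlyStopping:
    """Stop once the metric fails to improve by more than `threshold`
    for `patience` consecutive evaluations (assumed semantics)."""

    def __init__(self, patience: int = 3, threshold: float = 0.001) -> None:
        self.patience = patience
        self.threshold = threshold
        self.best = None      # best metric value seen so far
        self.bad_evals = 0    # consecutive evaluations without improvement

    def step(self, metric: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if self.best is None or metric > self.best + self.threshold:
            self.best = metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


def epochs_run(metrics_per_epoch, patience: int = 3, threshold: float = 0.001) -> int:
    """Return the 1-based epoch at which training would stop."""
    stopper = EarlyStopping(patience, threshold)
    for epoch, metric in enumerate(metrics_per_epoch, start=1):
        if stopper.step(metric):
            return epoch
    return len(metrics_per_epoch)
```

With these settings a run can end well before the 20-epoch limit: three evaluations in a row that each gain 0.001 or less over the best value trigger the stop.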

Training Data

  • Custom curated dataset of platform and official website URLs
  • Balanced training set with equal representation of both classes
  • Domain-specific preprocessing and data augmentation

Performance

Test Set Metrics

Metric                Threshold   Achieved
Accuracy              ≥ 0.80      ≥ 0.99 ✅
F1 Score              ≥ 0.80      ≥ 0.99 ✅
Precision             ≥ 0.80      ≥ 0.99 ✅
Recall                ≥ 0.80      ≥ 0.99 ✅
False Positive Rate   ≤ 0.15      < 0.01 ✅
False Negative Rate   ≤ 0.15      < 0.01 ✅
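
For readers who want to apply the same acceptance thresholds to their own evaluation data, the metrics in the table can be derived from confusion-matrix counts as in the sketch below. It assumes that "platform" (LABEL_1) is treated as the positive class, which this card does not state. The function names are hypothetical and the counts in the usage note are made up for illustration; they are not this model's results.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute the table's metrics from confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # Guard against empty classes instead of dividing by zero.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }


# Thresholds copied from the table above.
MIN_THRESHOLDS = {"accuracy": 0.80, "f1": 0.80, "precision": 0.80, "recall": 0.80}
MAX_THRESHOLDS = {"false_positive_rate": 0.15, "false_negative_rate": 0.15}


def passes_thresholds(metrics: dict[str, float]) -> bool:
    """Return True when every metric meets its threshold."""
    return all(metrics[k] >= v for k, v in MIN_THRESHOLDS.items()) and all(
        metrics[k] <= v for k, v in MAX_THRESHOLDS.items()
    )
```

For example, passes_thresholds(binary_metrics(tp=90, fp=5, tn=95, fn=10)) accepts the run, while raising fp to 20 pushes the false positive rate to 0.20 and fails the check.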

Example Predictions

  • https://acmewidgets.com → official_website (99.98% confidence)
  • https://store.myshopify.com → platform (75.96% confidence)
  • https://example.wixsite.com/store → platform (high confidence)
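
The confidence values are the softmax probability of the predicted class, as in the inference code in the Usage section. The sketch below reproduces that calculation in plain Python to show how the gap between the two logits determines the reported confidence. The logit values in the usage note are hypothetical, not outputs of this model.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits: list[float], labels=("official_website", "platform")):
    """Return (label, confidence) for a pair of class logits."""
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)
    return labels[index], probs[index]
```

A logit pair such as [-1.0, 2.0] gives "platform" at roughly 95% confidence. A narrower gap gives a lower figure, which is how a correct prediction can still carry a moderate confidence.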

Usage

Direct Inference with Transformers

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "DiligentAI/urlbert-url-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify URL
url = "https://acmewidgets.com"
inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=64)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()
    confidence = predictions[0][predicted_class].item()

label_map = {0: "official_website", 1: "platform"}
print(f"Prediction: {label_map[predicted_class]} ({confidence:.2%})")

Using Hugging Face Pipeline

from transformers import pipeline

classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
result = classifier("https://store.myshopify.com")
print(result)  # a list with one dict containing "label" and "score"
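
The pipeline returns one dict per input with a "label" and a "score". A small post-processing step can map the raw label to a readable name and flag low-confidence predictions for manual review. The sketch below assumes the raw labels are LABEL_0 and LABEL_1 as listed in the Model Description. The 0.90 cut-off and the name interpret are hypothetical choices, not recommendations from the model authors.

```python
LABEL_MAP = {"LABEL_0": "official_website", "LABEL_1": "platform"}


def interpret(result: dict, min_confidence: float = 0.90) -> dict:
    """Convert one pipeline result dict into a readable record.

    `min_confidence` is an illustrative cut-off; tune it on your own data.
    """
    score = float(result["score"])
    return {
        "label": LABEL_MAP[result["label"]],
        "confidence": score,
        "needs_review": score < min_confidence,
    }
```

Applied to a result such as {"label": "LABEL_1", "score": 0.7596}, this yields the "platform" label with needs_review set to True.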


Pydantic Integration (Production-Ready)

from functools import lru_cache
from typing import Literal

from pydantic import BaseModel, Field
from transformers import pipeline

MODEL_NAME = "DiligentAI/urlbert-url-classifier"
LABEL_MAP = {"LABEL_0": "official_website", "LABEL_1": "platform"}


class URLClassificationResult(BaseModel):
    url: str
    label: Literal["official_website", "platform"]
    confidence: float = Field(..., ge=0.0, le=1.0)


@lru_cache(maxsize=1)
def get_classifier():
    # Load the model once and reuse it, instead of reloading on every call.
    return pipeline("text-classification", model=MODEL_NAME)


def classify_url(url: str) -> URLClassificationResult:
    # Truncate at the token level (64 tokens), not by slicing characters.
    result = get_classifier()(url, truncation=True, max_length=64)[0]

    return URLClassificationResult(
        url=url,
        label=LABEL_MAP[result["label"]],
        confidence=result["score"],
    )

Limitations and Bias

  • Max URL Length: Model trained on 64-token sequences. Longer URLs are truncated.
  • Domain Focus: Optimized for e-commerce and business websites
  • Platform Coverage: Best performance on common platforms (Shopify, Wix, etc.)
  • Language: Primarily trained on English-language domains
  • Edge Cases: May have lower confidence on:
      • Uncommon TLDs
      • Very short URLs
      • Internationalized domain names
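
Callers who want to treat these edge cases separately can screen inputs before classification. The sketch below uses only the standard library and flags the three cases listed above. The set of "common" TLDs, the minimum host length, and the name edge_case_flags are all hypothetical; this card does not define what counts as uncommon or very short.

```python
from urllib.parse import urlsplit

# Illustrative assumption: anything outside this set counts as "uncommon".
COMMON_TLDS = {"com", "net", "org", "io", "co"}


def edge_case_flags(url: str, min_host_length: int = 6) -> list[str]:
    """Return the edge cases from the list above that apply to a URL."""
    # urlsplit only finds the host when a scheme or leading "//" is present.
    host = urlsplit(url if "://" in url else "//" + url).hostname or ""
    labels = host.split(".")
    flags = []
    if labels[-1] not in COMMON_TLDS:
        flags.append("uncommon_tld")
    if len(host) < min_host_length:
        flags.append("very_short_url")
    # IDNs appear either as raw non-ASCII text or as punycode ("xn--") labels.
    if not host.isascii() or any(label.startswith("xn--") for label in labels):
        flags.append("internationalized_domain")
    return flags
```

URLs that come back with one or more flags could be routed to manual review or given a stricter confidence cut-off.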

Intended Use

Primary Use Cases:

  • URL filtering and categorization pipelines
  • Lead qualification systems
  • Web scraping and data collection workflows
  • Business intelligence and market research

Out of Scope:

  • Content classification (only URL structure is analyzed)
  • Malicious URL detection (use dedicated security models)
  • Language detection
  • Spam filtering

Model Card Authors

DiligentAI Team

Citation

@misc{urlbert-classifier-2025,
  author = {DiligentAI},
  title = {URL Classifier - Platform vs Official Website Detection},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/DiligentAI/urlbert-url-classifier}}
}

License

MIT License

Framework Versions

  • Transformers: 4.57.0+
  • PyTorch: 2.0.0+
  • Python: 3.10+

Training Infrastructure

  • Framework: PyTorch + Hugging Face Transformers
  • Pipeline Orchestration: DVC (Data Version Control)
  • CI/CD: GitHub Actions
  • Model Format: Safetensors
  • Dependencies: See repository

Model Versioning

This model is automatically versioned and deployed via GitHub Actions. Each release includes:

  • Model checkpoint (.safetensors)
  • Tokenizer configuration
  • Label mapping (label_map.json)
  • Performance metrics (metrics.json)

Contact

For issues, questions, or feedback:

  • GitHub: DiligentAI/url-classifier
  • Organization: DiligentAI

Model size: 0.1B params (Safetensors, tensor type F32)