---
language:
- en
license: mit
tags:
- text-classification
- url-classification
- bert
- domain-classification
pipeline_tag: text-classification
widget:
- text: https://acmewidgets.com
- text: https://store.myshopify.com
- text: https://example.wixsite.com/store
- text: https://business.com
datasets:
- custom
metrics:
- accuracy
- f1
- precision
- recall
---

# URL Classifier

A fine-tuned BERT model for binary classification of URLs as either **platform listings** (e.g., `*.myshopify.com`, `*.wixsite.com`) or **official websites** (e.g., `acmewidgets.com`).

## Model Description

This model is a fine-tuned version of [amahdaouy/DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT) trained to distinguish between:

- **LABEL_0 (official_website)**: Direct company/brand websites
- **LABEL_1 (platform)**: Third-party platform listings (Shopify, Wix, etc.)

## Training Details

### Base Model
- **Architecture**: BERT for Sequence Classification
- **Base Model**: `amahdaouy/DomURLs_BERT`
- **Tokenizer**: `CrabInHoney/urlbert-tiny-base-v4`

### Training Configuration
| Parameter | Value |
|-----------|-------|
| **Epochs** | 20 |
| **Learning Rate** | 2e-5 |
| **Batch Size** | 32 |
| **Max Sequence Length** | 64 tokens |
| **Optimizer** | AdamW |
| **Weight Decay** | 0.01 |
| **LR Scheduler** | ReduceLROnPlateau |
| **Early Stopping** | Patience: 3, Threshold: 0.001 |
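
The early-stopping row can be read as "stop once the monitored metric has failed to improve by at least 0.001 for 3 consecutive evaluations". The sketch below illustrates that rule in plain Python. The names `EarlyStopper` and `epochs_run`, the choice of validation loss as the monitored metric, and the reading of the threshold as a minimum absolute improvement are all assumptions made for illustration; the card does not confirm them.

```python
class EarlyStopper:
    """Signal a stop when the loss has not improved by `threshold`
    for `patience` consecutive evaluations (assumed semantics)."""

    def __init__(self, patience: int = 3, threshold: float = 0.001):
        self.patience = patience
        self.threshold = threshold
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best - self.threshold:
            # Meaningful improvement: remember it and reset the counter
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


def epochs_run(val_losses, patience: int = 3, threshold: float = 0.001) -> int:
    """Return how many epochs run before early stopping triggers."""
    stopper = EarlyStopper(patience, threshold)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.step(loss):
            return epoch
    return len(val_losses)
```

With this rule, the 20-epoch setting acts as an upper bound: a run whose validation loss plateaus ends earlier.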

### Training Data
- Custom curated dataset of platform and official website URLs
- Balanced training set with equal representation of both classes
- Domain-specific preprocessing and data augmentation
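
One common way to obtain a balanced training set is to downsample the larger class. The sketch below shows that technique only as an illustration; the card does not say how the dataset was balanced, and the function name `balance_by_downsampling` and the `(url, label)` tuple format are assumptions.

```python
import random
from collections import defaultdict


def balance_by_downsampling(examples, seed: int = 0):
    """Return a shuffled subset of (url, label) pairs with equal counts per label."""
    by_label = defaultdict(list)
    for url, label in examples:
        by_label[label].append((url, label))

    # Every class is cut down to the size of the smallest class
    n = min(len(items) for items in by_label.values())

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, n))
    rng.shuffle(balanced)
    return balanced
```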

## Performance

### Test Set Metrics
| Metric | Threshold | Achieved |
|--------|-----------|----------|
| **Accuracy** | ≥ 0.80 | **≥ 0.99** ✅ |
| **F1 Score** | ≥ 0.80 | **≥ 0.99** ✅ |
| **Precision** | ≥ 0.80 | **≥ 0.99** ✅ |
| **Recall** | ≥ 0.80 | **≥ 0.99** ✅ |
| **False Positive Rate** | ≤ 0.15 | **< 0.01** ✅ |
| **False Negative Rate** | ≤ 0.15 | **< 0.01** ✅ |
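
For reference, all six metrics in the table can be derived from the four confusion-matrix counts. The sketch below assumes `platform` (LABEL_1) is the positive class, which the card does not state; the function names are hypothetical, and the counts in the usage note are made-up numbers, not this model's test results.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute the six reported metrics from confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # Guard against empty classes instead of dividing by zero
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "f1": ratio(2 * precision * recall, precision + recall),
        "precision": precision,
        "recall": recall,
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }


# Thresholds copied from the table above
MIN_THRESHOLDS = {"accuracy": 0.80, "f1": 0.80, "precision": 0.80, "recall": 0.80}
MAX_THRESHOLDS = {"false_positive_rate": 0.15, "false_negative_rate": 0.15}


def passes_thresholds(metrics: dict[str, float]) -> bool:
    """Check every metric against its minimum or maximum threshold."""
    return all(metrics[k] >= v for k, v in MIN_THRESHOLDS.items()) and all(
        metrics[k] <= v for k, v in MAX_THRESHOLDS.items()
    )
```

For example, `passes_thresholds(binary_metrics(tp=495, fp=3, tn=497, fn=5))` evaluates the gate for a hypothetical 1,000-URL test set.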

### Example Predictions
- `https://acmewidgets.com` → **official_website** (99.98% confidence)
- `https://store.myshopify.com` → **platform** (75.96% confidence)
- `https://example.wixsite.com/store` → **platform** (high confidence)
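
Confidence varies noticeably between these examples (99.98% versus 75.96%), so a downstream system may want to handle low-confidence predictions separately. The sketch below shows one way to do that. The function name `route_prediction`, the `needs_review` bucket, and the 0.90 cutoff are hypothetical choices, not recommendations from the model authors.

```python
def route_prediction(label: str, confidence: float, min_confidence: float = 0.90) -> str:
    """Return the label when confidence is high enough, else a review bucket."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    # The 0.90 default is an illustrative cutoff, to be tuned on held-out data
    return label if confidence >= min_confidence else "needs_review"
```

With the default cutoff, the first example above keeps its label, while the `store.myshopify.com` prediction would be sent for review.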

## Usage

### Direct Inference with Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "DiligentAI/urlbert-url-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify URL
url = "https://acmewidgets.com"
inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=64)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()
    confidence = predictions[0][predicted_class].item()

label_map = {0: "official_website", 1: "platform"}
print(f"Prediction: {label_map[predicted_class]} ({confidence:.2%})")
```

### Using Hugging Face Pipeline

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
result = classifier("https://store.myshopify.com")
print(result)  # a list with one dict holding "label" and "score"
```

### Pydantic Integration (Production-Ready)

```python
from functools import lru_cache
from typing import Literal

from pydantic import BaseModel, Field
from transformers import pipeline

MODEL_NAME = "DiligentAI/urlbert-url-classifier"
LABEL_MAP = {"LABEL_0": "official_website", "LABEL_1": "platform"}


class URLClassificationResult(BaseModel):
    url: str
    label: Literal["official_website", "platform"]
    confidence: float = Field(..., ge=0.0, le=1.0)


@lru_cache(maxsize=1)
def get_classifier():
    # Load the pipeline once and reuse it, instead of reloading on every call
    return pipeline("text-classification", model=MODEL_NAME)


def classify_url(url: str) -> URLClassificationResult:
    # Truncate to 64 tokens in the tokenizer; slicing the string would cut characters, not tokens
    result = get_classifier()(url, truncation=True, max_length=64)[0]

    return URLClassificationResult(
        url=url,
        label=LABEL_MAP[result["label"]],
        confidence=result["score"],
    )
```

## Limitations and Bias

- **Max URL Length**: The model was trained on 64-token sequences; longer URLs are truncated.
- **Domain Focus**: Optimized for e-commerce and business websites.
- **Platform Coverage**: Best performance on common platforms (Shopify, Wix, etc.).
- **Language**: Primarily trained on English-language domains.
- **Edge Cases**: May have lower confidence on:
  - Uncommon TLDs
  - Very short URLs
  - Internationalized domain names
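
The edge cases listed above can be flagged before a URL is sent to the model, so that those predictions are treated with extra caution. The sketch below uses only the standard library. The function name `edge_case_flags`, the set of "common" TLDs, and the 12-character length cutoff are hypothetical and should be adapted to your own data.

```python
from urllib.parse import urlsplit

# Illustrative list only; not derived from the model's training data
COMMON_TLDS = frozenset({"com", "net", "org", "io", "co"})


def edge_case_flags(url: str, min_length: int = 12) -> list[str]:
    """Return the names of the edge cases that a URL appears to fall into."""
    host = urlsplit(url).hostname or ""
    flags = []

    # IDNs appear either as raw non-ASCII or as punycode labels ("xn--...")
    labels = host.split(".")
    if not host.isascii() or any(label.startswith("xn--") for label in labels):
        flags.append("internationalized_domain")

    tld = labels[-1] if len(labels) > 1 else ""
    if tld not in COMMON_TLDS:
        flags.append("uncommon_tld")

    if len(url) < min_length:
        flags.append("very_short_url")

    return flags
```

An empty list means none of the heuristics fired; it says nothing about whether the model's prediction is correct.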

## Intended Use

### Primary Use Cases

- URL filtering and categorization pipelines
- Lead qualification systems
- Web scraping and data collection workflows
- Business intelligence and market research

### Out of Scope

- Content classification (only the URL structure is analyzed)
- Malicious URL detection (use dedicated security models)
- Language detection
- Spam filtering

## Model Card Authors

DiligentAI Team

## Citation

```bibtex
@misc{urlbert-classifier-2025,
  author = {DiligentAI},
  title = {URL Classifier - Platform vs Official Website Detection},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/DiligentAI/urlbert-url-classifier}}
}
```

## License

MIT License

## Framework Versions

- **Transformers**: 4.57.0+
- **PyTorch**: 2.0.0+
- **Python**: 3.10+

## Training Infrastructure

- **Framework**: PyTorch + Hugging Face Transformers
- **Pipeline Orchestration**: DVC (Data Version Control)
- **CI/CD**: GitHub Actions
- **Model Format**: Safetensors
- **Dependencies**: See repository

## Model Versioning

This model is automatically versioned and deployed via GitHub Actions. Each release includes:

- Model checkpoint (`.safetensors`)
- Tokenizer configuration
- Label mapping (`label_map.json`)
- Performance metrics (`metrics.json`)

## Contact

For issues, questions, or feedback:

- **GitHub**: DiligentAI/url-classifier
- **Organization**: DiligentAI