---
language:
- en
license: mit
tags:
- text-classification
- url-classification
- bert
- domain-classification
pipeline_tag: text-classification
widget:
- text: https://acmewidgets.com
- text: https://store.myshopify.com
- text: https://example.wixsite.com/store
- text: https://business.com
datasets:
- custom
metrics:
- accuracy
- f1
- precision
- recall
---

# URL Classifier

A fine-tuned BERT model for binary classification of URLs as either **platform listings** (e.g., `*.myshopify.com`, `*.wixsite.com`) or **official websites** (e.g., `acmewidgets.com`).

## Model Description

This model is a fine-tuned version of [amahdaouy/DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT) trained to distinguish between:

- **LABEL_0 (official_website)**: Direct company/brand websites
- **LABEL_1 (platform)**: Third-party platform listings (Shopify, Wix, etc.)

## Training Details

### Base Model
- **Architecture**: BERT for Sequence Classification
- **Base Model**: `amahdaouy/DomURLs_BERT`
- **Tokenizer**: `CrabInHoney/urlbert-tiny-base-v4`

### Training Configuration
| Parameter | Value |
|-----------|-------|
| **Epochs** | 20 |
| **Learning Rate** | 2e-5 |
| **Batch Size** | 32 |
| **Max Sequence Length** | 64 tokens |
| **Optimizer** | AdamW |
| **Weight Decay** | 0.01 |
| **LR Scheduler** | ReduceLROnPlateau |
| **Early Stopping** | Patience: 3, Threshold: 0.001 |
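
The table gives the early-stopping settings (patience 3, threshold 0.001) but not the monitored metric or the implementation. The following is a minimal sketch of how such a rule could behave, assuming the monitored value is a validation loss (lower is better) and the threshold is a minimum absolute improvement. `EarlyStopper` and the loss values are hypothetical and not taken from the training code.

```python
class EarlyStopper:
    """Stop training when the monitored loss stops improving.

    Assumption: the metric is a validation loss (lower is better) and
    `threshold` is the minimum absolute improvement that counts.
    """

    def __init__(self, patience: int = 3, threshold: float = 0.001) -> None:
        self.patience = patience
        self.threshold = threshold
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's loss; return True when training should stop."""
        if val_loss < self.best - self.threshold:
            # Improvement larger than the threshold: reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical loss curve: improvement stalls after the third epoch
losses = [0.50, 0.30, 0.20, 0.1995, 0.1993, 0.1992, 0.1991]
stopper = EarlyStopper(patience=3, threshold=0.001)
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break
print(f"Stopped after epoch {stopped_at}")
```

With these settings a run capped at 20 epochs can end much earlier once three consecutive epochs fail to improve on the best loss by more than the threshold.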

### Training Data
- Custom curated dataset of platform and official website URLs
- Balanced training set with equal representation of both classes
- Domain-specific preprocessing and data augmentation
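
The card does not say how the balanced training set was produced. One common way to get equal class representation is to downsample the larger class, sketched below under that assumption. The function name `balance_by_downsampling` and the sample rows are hypothetical.

```python
import random


def balance_by_downsampling(rows, seed=0):
    """Return a shuffled subset with the same number of rows per label.

    `rows` is a list of (url, label) pairs with label 0 (official_website)
    or 1 (platform). Downsampling is an assumed strategy, not necessarily
    the one used to build the original dataset.
    """
    by_label = {0: [], 1: []}
    for url, label in rows:
        by_label[label].append((url, label))

    # Keep as many rows per class as the smaller class contains
    n = min(len(by_label[0]), len(by_label[1]))
    rng = random.Random(seed)
    balanced = rng.sample(by_label[0], n) + rng.sample(by_label[1], n)
    rng.shuffle(balanced)
    return balanced


# Hypothetical, deliberately imbalanced sample
rows = [
    ("https://acmewidgets.com", 0),
    ("https://business.com", 0),
    ("https://example.org", 0),
    ("https://store.myshopify.com", 1),
    ("https://example.wixsite.com/store", 1),
]
balanced = balance_by_downsampling(rows)
counts = {label: sum(1 for _, l in balanced if l == label) for label in (0, 1)}
print(counts)
```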

## Performance

### Test Set Metrics
| Metric | Threshold | Achieved |
|--------|-----------|----------|
| **Accuracy** | ≥ 0.80 | **≥ 0.99** ✅ |
| **F1 Score** | ≥ 0.80 | **≥ 0.99** ✅ |
| **Precision** | ≥ 0.80 | **≥ 0.99** ✅ |
| **Recall** | ≥ 0.80 | **≥ 0.99** ✅ |
| **False Positive Rate** | ≤ 0.15 | **< 0.01** ✅ |
| **False Negative Rate** | ≤ 0.15 | **< 0.01** ✅ |
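
For readers who want to reproduce the threshold check on their own data, the six metrics in the table can be derived from confusion-matrix counts. This is a sketch that assumes `platform` (LABEL_1) is the positive class; the counts in the example are hypothetical and are not the model's actual test results.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the six metrics from the table out of confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # Guard against empty classes
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "f1": ratio(2 * precision * recall, precision + recall),
        "precision": precision,
        "recall": recall,
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }


# Thresholds copied from the table above
MINIMUMS = {"accuracy": 0.80, "f1": 0.80, "precision": 0.80, "recall": 0.80}
MAXIMUMS = {"false_positive_rate": 0.15, "false_negative_rate": 0.15}


def meets_thresholds(metrics: dict) -> bool:
    """Return True when every metric satisfies its threshold."""
    return all(metrics[k] >= v for k, v in MINIMUMS.items()) and all(
        metrics[k] <= v for k, v in MAXIMUMS.items()
    )


# Hypothetical counts for illustration only
metrics = binary_metrics(tp=495, fp=3, tn=497, fn=5)
print(metrics, meets_thresholds(metrics))
```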

### Example Predictions
- `https://acmewidgets.com` → **official_website** (99.98% confidence)
- `https://store.myshopify.com` → **platform** (75.96% confidence)
- `https://example.wixsite.com/store` → **platform** (high confidence)

## Usage

### Direct Inference with Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "DiligentAI/urlbert-url-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify URL
url = "https://acmewidgets.com"
inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=64)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()
    confidence = predictions[0][predicted_class].item()

label_map = {0: "official_website", 1: "platform"}
print(f"Prediction: {label_map[predicted_class]} ({confidence:.2%})")
```

### Using Hugging Face Pipeline

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
result = classifier("https://store.myshopify.com")
print(result)  # list with one dict holding "label" and "score"
```

### Pydantic Integration (Production-Ready)

```python
from typing import Literal

from pydantic import BaseModel, Field
from transformers import pipeline

# Raw pipeline labels mapped to readable names (see Model Description)
LABEL_MAP = {"LABEL_0": "official_website", "LABEL_1": "platform"}

# Load the pipeline once at import time rather than on every call
classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")


class URLClassificationResult(BaseModel):
    url: str
    label: Literal["official_website", "platform"]
    confidence: float = Field(..., ge=0.0, le=1.0)


def classify_url(url: str) -> URLClassificationResult:
    # Truncate to 64 tokens in the tokenizer; slicing the string cuts characters, not tokens
    result = classifier(url, truncation=True, max_length=64)[0]

    return URLClassificationResult(
        url=url,
        label=LABEL_MAP[result["label"]],
        confidence=result["score"],
    )
```

## Limitations and Bias

- **Max URL Length**: The model was trained on 64-token sequences; longer URLs are truncated.
- **Domain Focus**: Optimized for e-commerce and business websites.
- **Platform Coverage**: Best performance on common platforms (Shopify, Wix, etc.).
- **Language**: Primarily trained on English-language domains.
- **Edge Cases**: May have lower confidence on:
  - Uncommon TLDs
  - Very short URLs
  - Internationalized domain names
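
Because confidence can drop on these edge cases, a caller may want to flag such URLs for manual review instead of trusting the label outright. The sketch below is one possible pre-check using only the standard library. The names `edge_case_flags` and `needs_review`, the `COMMON_TLDS` set, the host-length cutoff, and the 0.90 confidence cutoff are all hypothetical choices, not part of the model or its repository.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; extend it for your own data
COMMON_TLDS = {"com", "org", "net", "io", "co"}


def edge_case_flags(url: str, min_host_len: int = 6) -> list:
    """Return the edge cases from the list above that apply to `url`."""
    host = urlsplit(url).hostname or ""
    flags = []

    # Non-ASCII hosts or punycode labels indicate an internationalized domain
    if not host.isascii() or any(p.startswith("xn--") for p in host.split(".")):
        flags.append("internationalized_domain")
    if len(host) < min_host_len:
        flags.append("very_short_url")
    if host.rsplit(".", 1)[-1] not in COMMON_TLDS:
        flags.append("uncommon_tld")
    return flags


def needs_review(url: str, confidence: float, min_confidence: float = 0.90) -> bool:
    """Flag a prediction when confidence is low or the URL is an edge case."""
    return confidence < min_confidence or bool(edge_case_flags(url))


# Confidence values taken from the example predictions in this card
print(needs_review("https://acmewidgets.com", 0.9998))
print(needs_review("https://store.myshopify.com", 0.7596))
```

The cutoffs are starting points only; they should be tuned against a labeled sample of the URLs a given pipeline actually sees.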

## Intended Use

### Primary Use Cases
- URL filtering and categorization pipelines
- Lead qualification systems
- Web scraping and data collection workflows
- Business intelligence and market research

### Out of Scope
- Content classification (only URL structure is analyzed)
- Malicious URL detection (use dedicated security models)
- Language detection
- Spam filtering

## Model Card Authors

DiligentAI Team

## Citation

```bibtex
@misc{urlbert-classifier-2025,
  author = {DiligentAI},
  title = {URL Classifier - Platform vs Official Website Detection},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/DiligentAI/urlbert-url-classifier}}
}
```

## License

MIT License

## Framework Versions
- **Transformers**: 4.57.0+
- **PyTorch**: 2.0.0+
- **Python**: 3.10+

## Training Infrastructure
- **Framework**: PyTorch + Hugging Face Transformers
- **Pipeline Orchestration**: DVC (Data Version Control)
- **CI/CD**: GitHub Actions
- **Model Format**: Safetensors
- **Dependencies**: See repository

## Model Versioning

This model is automatically versioned and deployed via GitHub Actions. Each release includes:
- Model checkpoint (`.safetensors`)
- Tokenizer configuration
- Label mapping (`label_map.json`)
- Performance metrics (`metrics.json`)

## Contact

For issues, questions, or feedback:
- **GitHub**: DiligentAI/url-classifier
- **Organization**: DiligentAI