---
language:
- en
license: mit
tags:
- text-classification
- url-classification
- bert
- domain-classification
pipeline_tag: text-classification
widget:
- text: https://acmewidgets.com
- text: https://store.myshopify.com
- text: https://example.wixsite.com/store
- text: https://business.com
datasets:
- custom
metrics:
- accuracy
- f1
- precision
- recall
---

# URL Classifier

A fine-tuned BERT model for binary classification of URLs as either **platform listings** (e.g., `*.myshopify.com`, `*.wixsite.com`) or **official websites** (e.g., `acmewidgets.com`).

## Model Description

This model is a fine-tuned version of [amahdaouy/DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT) trained to distinguish between:

- **LABEL_0 (official_website)**: Direct company/brand websites
- **LABEL_1 (platform)**: Third-party platform listings (Shopify, Wix, etc.)

## Training Details

### Base Model
- **Architecture**: BERT for Sequence Classification
- **Base Model**: `amahdaouy/DomURLs_BERT`
- **Tokenizer**: `CrabInHoney/urlbert-tiny-base-v4`

### Training Configuration
| Parameter | Value |
|-----------|-------|
| **Epochs** | 20 |
| **Learning Rate** | 2e-5 |
| **Batch Size** | 32 |
| **Max Sequence Length** | 64 tokens |
| **Optimizer** | AdamW |
| **Weight Decay** | 0.01 |
| **LR Scheduler** | ReduceLROnPlateau |
| **Early Stopping** | Patience: 3, Threshold: 0.001 |
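
The hyperparameters above map directly onto Hugging Face `TrainingArguments`. The sketch below is illustrative rather than the project's actual training script; the output directory and the metric used for early stopping are assumptions.

```python
from transformers import EarlyStoppingCallback, TrainingArguments

# Mirrors the table above; output_dir and metric_for_best_model are assumptions.
# AdamW is the Trainer's default optimizer, so it needs no explicit argument.
training_args = TrainingArguments(
    output_dir="urlbert-url-classifier",
    num_train_epochs=20,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    weight_decay=0.01,
    lr_scheduler_type="reduce_lr_on_plateau",  # ReduceLROnPlateau
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for early stopping
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

# Early stopping with patience 3 and threshold 0.001, as in the table.
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=3,
    early_stopping_threshold=0.001,
)

# Both objects are then passed to a transformers.Trainer together with the
# tokenized train/validation splits (not shown here).
```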

### Training Data
- Custom curated dataset of platform and official website URLs
- Balanced training set with equal representation of both classes
- Domain-specific preprocessing and data augmentation

## Performance

### Test Set Metrics
| Metric | Threshold | Achieved |
|--------|-----------|----------|
| **Accuracy** | ≥ 0.80 | **≥ 0.99** ✅ |
| **F1 Score** | ≥ 0.80 | **≥ 0.99** ✅ |
| **Precision** | ≥ 0.80 | **≥ 0.99** ✅ |
| **Recall** | ≥ 0.80 | **≥ 0.99** ✅ |
| **False Positive Rate** | ≤ 0.15 | **< 0.01** ✅ |
| **False Negative Rate** | ≤ 0.15 | **< 0.01** ✅ |
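
For reference, a minimal sketch (not the project's evaluation script) of how these metrics, including the false positive and false negative rates, can be computed from binary predictions with scikit-learn; `y_true` and `y_pred` are placeholder arrays:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

y_true = [0, 0, 1, 1, 1, 0]  # placeholder ground-truth labels (0 = official, 1 = platform)
y_pred = [0, 0, 1, 1, 0, 0]  # placeholder predicted labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print("accuracy :", accuracy_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("FPR      :", fp / (fp + tn))  # false positive rate
print("FNR      :", fn / (fn + tp))  # false negative rate
```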

### Example Predictions
- `https://acmewidgets.com` → **official_website** (99.98% confidence)
- `https://store.myshopify.com` → **platform** (75.96% confidence)
- `https://example.wixsite.com/store` → **platform** (high confidence)

## Usage

### Direct Inference with Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "DiligentAI/urlbert-url-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify URL
url = "https://acmewidgets.com"
inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=64)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()
    confidence = predictions[0][predicted_class].item()

label_map = {0: "official_website", 1: "platform"}
print(f"Prediction: {label_map[predicted_class]} ({confidence:.2%})")
```

### Using Hugging Face Pipeline

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
result = classifier("https://store.myshopify.com")
```
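
The pipeline also accepts a list of URLs, which is convenient for batch scoring. A minimal sketch using the example URLs from this card:

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")

urls = [
    "https://acmewidgets.com",
    "https://store.myshopify.com",
    "https://example.wixsite.com/store",
]

# One {"label": ..., "score": ...} dict is returned per input URL.
for url, prediction in zip(urls, classifier(urls, batch_size=8)):
    print(url, prediction["label"], f"{prediction['score']:.2%}")
```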

### Pydantic Integration (Production-Ready)

```python
from typing import Literal

from pydantic import BaseModel, Field
from transformers import pipeline


class URLClassificationResult(BaseModel):
    url: str
    label: Literal["official_website", "platform"]
    confidence: float = Field(..., ge=0.0, le=1.0)


# Load the pipeline once at import time rather than on every call.
classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")

label_map = {"LABEL_0": "official_website", "LABEL_1": "platform"}


def classify_url(url: str) -> URLClassificationResult:
    # The tokenizer truncates inputs to the model's 64-token limit.
    result = classifier(url, truncation=True, max_length=64)[0]

    return URLClassificationResult(
        url=url,
        label=label_map[result["label"]],
        confidence=result["score"],
    )
```
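
A hypothetical call, assuming Pydantic v2 for `model_dump()`; the printed values are illustrative only:

```python
result = classify_url("https://store.myshopify.com")
print(result.model_dump())
# e.g. {'url': 'https://store.myshopify.com', 'label': 'platform', 'confidence': 0.76}
```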

## Limitations and Bias

- **Max URL Length**: Model trained on 64-token sequences; longer URLs are truncated.
- **Domain Focus**: Optimized for e-commerce and business websites.
- **Platform Coverage**: Best performance on common platforms (Shopify, Wix, etc.).
- **Language**: Primarily trained on English-language domains.
- **Edge Cases**: May have lower confidence on:
  - Uncommon TLDs
  - Very short URLs
  - Internationalized domain names
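
To make the 64-token limit concrete, the sketch below tokenizes an artificially long URL with the tokenizer named in this card and shows that the input is capped at 64 tokens:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CrabInHoney/urlbert-tiny-base-v4")

# Build an artificially long URL to demonstrate truncation.
long_url = "https://example.com/" + "segment/" * 50

input_ids = tokenizer(long_url, truncation=True, max_length=64)["input_ids"]
print(len(input_ids))  # capped at 64 tokens; the tail of the URL is dropped
```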

## Intended Use

**Primary Use Cases:**
- URL filtering and categorization pipelines
- Lead qualification systems
- Web scraping and data collection workflows
- Business intelligence and market research

**Out of Scope:**
- Content classification (only URL structure is analyzed)
- Malicious URL detection (use dedicated security models)
- Language detection
- Spam filtering

## Model Card Authors

DiligentAI Team

## Citation

```bibtex
@misc{urlbert-classifier-2025,
  author = {DiligentAI},
  title = {URL Classifier - Platform vs Official Website Detection},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/DiligentAI/urlbert-url-classifier}}
}
```

## License

MIT License

## Framework Versions

- Transformers: 4.57.0+
- PyTorch: 2.0.0+
- Python: 3.10+

## Training Infrastructure

- **Framework**: PyTorch + Hugging Face Transformers
- **Pipeline Orchestration**: DVC (Data Version Control)
- **CI/CD**: GitHub Actions
- **Model Format**: Safetensors
- **Dependencies**: See repository

## Model Versioning

This model is automatically versioned and deployed via GitHub Actions. Each release includes:

- Model checkpoint (`.safetensors`)
- Tokenizer configuration
- Label mapping (`label_map.json`)
- Performance metrics (`metrics.json`)
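
As an illustration of consuming these release artifacts, the sketch below downloads `label_map.json` with `huggingface_hub`; the exact contents of the file are an assumption:

```python
import json

from huggingface_hub import hf_hub_download

# Fetch the label mapping shipped with each release (file name from the list above).
path = hf_hub_download(
    repo_id="DiligentAI/urlbert-url-classifier",
    filename="label_map.json",
)

with open(path) as f:
    label_map = json.load(f)

print(label_map)  # expected to map class indices to "official_website"/"platform"
```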

## Contact

For issues, questions, or feedback:

- **GitHub**: DiligentAI/url-classifier
- **Organization**: DiligentAI