---
language:
- en
license: mit
tags:
- text-classification
- url-classification
- bert
- domain-classification
pipeline_tag: text-classification
widget:
- text: https://acmewidgets.com
- text: https://store.myshopify.com
- text: https://example.wixsite.com/store
- text: https://business.com
datasets:
- custom
metrics:
- accuracy
- f1
- precision
- recall
---

# URL Classifier

A fine-tuned BERT model for binary classification of URLs as either **platform listings** (e.g., `*.myshopify.com`, `*.wixsite.com`) or **official websites** (e.g., `acmewidgets.com`).

## Model Description

This model is a fine-tuned version of [amahdaouy/DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT) trained to distinguish between two classes:

- **LABEL_0 (official_website)**: Direct company/brand websites
- **LABEL_1 (platform)**: Third-party platform listings (Shopify, Wix, etc.)

## Training Details

### Base Model
- **Architecture**: BERT for Sequence Classification
- **Base Model**: `amahdaouy/DomURLs_BERT`
- **Tokenizer**: `CrabInHoney/urlbert-tiny-base-v4`

### Training Configuration
| Parameter | Value |
|-----------|-------|
| **Epochs** | 20 |
| **Learning Rate** | 2e-5 |
| **Batch Size** | 32 |
| **Max Sequence Length** | 64 tokens |
| **Optimizer** | AdamW |
| **Weight Decay** | 0.01 |
| **LR Scheduler** | ReduceLROnPlateau |
| **Early Stopping** | Patience: 3, Threshold: 0.001 |

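
To make the early-stopping row concrete, here is a minimal, framework-free sketch of how a patience/threshold rule of this kind typically behaves: training stops once the monitored metric has failed to improve by more than the threshold for `patience` consecutive evaluations. The function name `should_stop`, the choice of validation loss as the monitored metric, and the loss values are illustrative assumptions, not the training code used for this model.

```python
def should_stop(val_losses, patience=3, threshold=0.001):
    """Return the 1-based epoch at which training would stop, or None.

    An epoch counts as an improvement only if it lowers the best
    validation loss seen so far by more than `threshold`.
    """
    best = float("inf")
    stale_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best - loss > threshold:
            best = loss
            stale_epochs = 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                return epoch
    return None


# Hypothetical validation losses, one per epoch
losses = [0.40, 0.21, 0.12, 0.1195, 0.1199, 0.1193, 0.05]
stop_epoch = should_stop(losses)
print(stop_epoch)
```

With these made-up losses, the run halts well before the 20-epoch cap because three consecutive epochs improve on the best loss by less than 0.001.
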
### Training Data
- Custom curated dataset of platform and official website URLs
- Balanced training set with equal representation of both classes
- Domain-specific preprocessing and data augmentation

## Performance

### Test Set Metrics
| Metric | Threshold | Achieved |
|--------|-----------|----------|
| **Accuracy** | ≥ 0.80 | **≥ 0.99** ✅ |
| **F1 Score** | ≥ 0.80 | **≥ 0.99** ✅ |
| **Precision** | ≥ 0.80 | **≥ 0.99** ✅ |
| **Recall** | ≥ 0.80 | **≥ 0.99** ✅ |
| **False Positive Rate** | ≤ 0.15 | **< 0.01** ✅ |
| **False Negative Rate** | ≤ 0.15 | **< 0.01** ✅ |

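
For reference, each metric in the table can be derived from the four confusion-matrix counts. The sketch below treats `platform` (LABEL_1) as the positive class and uses made-up counts; the positive-class choice, the counts, and the helper name `binary_metrics` are assumptions for illustration and do not come from the model's actual evaluation.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute the table's metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# Hypothetical counts for a balanced 2,000-URL test set
metrics = binary_metrics(tp=995, fp=4, tn=996, fn=5)

# Thresholds from the table: the first four are minimums, the error rates maximums
minimums = {"accuracy": 0.80, "f1": 0.80, "precision": 0.80, "recall": 0.80}
maximums = {"false_positive_rate": 0.15, "false_negative_rate": 0.15}

passed = all(metrics[k] >= v for k, v in minimums.items()) and all(
    metrics[k] <= v for k, v in maximums.items()
)
print(passed)
```
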
### Example Predictions
- `https://acmewidgets.com` → **official_website** (99.98% confidence)
- `https://store.myshopify.com` → **platform** (75.96% confidence)
- `https://example.wixsite.com/store` → **platform** (high confidence)

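
Because confidence varies considerably between inputs (99.98% versus 75.96% in the examples above), downstream code may want to handle low-confidence predictions differently. The following is a minimal sketch assuming a cutoff of 0.90 and a hypothetical `route_prediction` helper; both are illustrative choices, not part of the model or its repository.

```python
REVIEW_THRESHOLD = 0.90  # assumed cutoff; tune on your own validation data


def route_prediction(label, confidence, threshold=REVIEW_THRESHOLD):
    """Accept confident predictions and flag the rest for manual review."""
    if confidence >= threshold:
        return label
    return "needs_review"


# Confidence values taken from the example predictions above
examples = [
    ("https://acmewidgets.com", "official_website", 0.9998),
    ("https://store.myshopify.com", "platform", 0.7596),
]
routed = {url: route_prediction(label, conf) for url, label, conf in examples}
print(routed)
```
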
## Usage

### Direct Inference with Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "DiligentAI/urlbert-url-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify URL
url = "https://acmewidgets.com"
inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=64)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=1).item()
    confidence = predictions[0][predicted_class].item()

label_map = {0: "official_website", 1: "platform"}
print(f"Prediction: {label_map[predicted_class]} ({confidence:.2%})")
```

### Using Hugging Face Pipeline

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
result = classifier("https://store.myshopify.com")
print(result)  # a list with one dict holding "label" and "score"
```

### Pydantic Integration (Production-Ready)

```python
from typing import Literal

from pydantic import BaseModel, Field
from transformers import pipeline

# Load the pipeline once at import time rather than on every call
classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")

# Map the raw labels to the class names described in this card
LABEL_MAP = {"LABEL_0": "official_website", "LABEL_1": "platform"}


class URLClassificationResult(BaseModel):
    url: str
    label: Literal["official_website", "platform"]
    confidence: float = Field(..., ge=0.0, le=1.0)


def classify_url(url: str) -> URLClassificationResult:
    # Truncate at the tokenizer level (64 tokens) instead of slicing characters
    result = classifier(url, truncation=True, max_length=64)[0]

    return URLClassificationResult(
        url=url,
        label=LABEL_MAP[result["label"]],
        confidence=result["score"],
    )
```

## Limitations and Bias
- **Max URL Length**: The model was trained on sequences of up to 64 tokens; longer URLs are truncated.
- **Domain Focus**: Optimized for e-commerce and business websites
- **Platform Coverage**: Best performance on common platforms (Shopify, Wix, etc.)
- **Language**: Primarily trained on English-language domains
- **Edge Cases**: May have lower confidence on:
  - Uncommon TLDs
  - Very short URLs
  - Internationalized domain names

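
One way to act on the edge cases listed above is to flag them before inference so that the corresponding predictions can be treated with extra caution. The sketch below uses only the standard library; the helper name `edge_case_flags`, the 15-character cutoff for "very short", and the set of "common" TLDs are all assumptions chosen for illustration, not values taken from the training setup.

```python
from urllib.parse import urlsplit

COMMON_TLDS = {"com", "net", "org", "io", "co"}  # assumed, not exhaustive
SHORT_URL_CHARS = 15  # assumed cutoff for a "very short" URL


def edge_case_flags(url):
    """Return the names of the listed edge cases that apply to `url`."""
    host = urlsplit(url).hostname or ""
    flags = []
    if host.rsplit(".", 1)[-1] not in COMMON_TLDS:
        flags.append("uncommon_tld")
    if len(url) < SHORT_URL_CHARS:
        flags.append("very_short")
    # Non-ASCII characters or a punycode label indicate an internationalized domain
    if not host.isascii() or any(part.startswith("xn--") for part in host.split(".")):
        flags.append("internationalized_domain")
    return flags


print(edge_case_flags("https://acmewidgets.com"))
print(edge_case_flags("https://a.xyz"))
print(edge_case_flags("https://xn--bcher-kva.example"))
```

A URL with any flag could, for example, be sent to manual review regardless of the confidence the model reports.
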
## Intended Use

### Primary Use Cases
- URL filtering and categorization pipelines
- Lead qualification systems
- Web scraping and data collection workflows
- Business intelligence and market research

### Out of Scope
- Content classification (only URL structure is analyzed)
- Malicious URL detection (use dedicated security models)
- Language detection
- Spam filtering

## Model Card Authors
DiligentAI Team

## Citation
```bibtex
@misc{urlbert-classifier-2025,
  author = {DiligentAI},
  title = {URL Classifier - Platform vs Official Website Detection},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/DiligentAI/urlbert-url-classifier}}
}
```

## License
MIT License

## Framework Versions
- **Transformers**: 4.57.0+
- **PyTorch**: 2.0.0+
- **Python**: 3.10+

## Training Infrastructure
- **Framework**: PyTorch + Hugging Face Transformers
- **Pipeline Orchestration**: DVC (Data Version Control)
- **CI/CD**: GitHub Actions
- **Model Format**: Safetensors
- **Dependencies**: See repository

## Model Versioning
This model is automatically versioned and deployed via GitHub Actions. Each release includes:
- Model checkpoint (`.safetensors`)
- Tokenizer configuration
- Label mapping (`label_map.json`)
- Performance metrics (`metrics.json`)

## Contact
For issues, questions, or feedback:
- **GitHub**: DiligentAI/url-classifier
- **Organization**: DiligentAI