---
language:
- en
license: mit
tags:
- text-classification
- url-classification
- bert
- domain-classification
pipeline_tag: text-classification
widget:
- text: https://acmewidgets.com
- text: https://store.myshopify.com
- text: https://example.wixsite.com/store
- text: https://business.com
datasets:
- custom
metrics:
- accuracy
- f1
- precision
- recall
---
# URL Classifier
A fine-tuned BERT model for binary classification of URLs as either **platform listings** (e.g., `*.myshopify.com`, `*.wixsite.com`) or **official websites** (e.g., `acmewidgets.com`).
## Model Description
This model is a fine-tuned version of [amahdaouy/DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT) trained to distinguish between:
- **LABEL_0 (official_website)**: Direct company/brand websites
- **LABEL_1 (platform)**: Third-party platform listings (Shopify, Wix, etc.)
## Training Details
### Base Model
- **Architecture**: BERT for Sequence Classification
- **Base Model**: `amahdaouy/DomURLs_BERT`
- **Tokenizer**: `CrabInHoney/urlbert-tiny-base-v4`
### Training Configuration
| Parameter | Value |
|-----------|-------|
| **Epochs** | 20 |
| **Learning Rate** | 2e-5 |
| **Batch Size** | 32 |
| **Max Sequence Length** | 64 tokens |
| **Optimizer** | AdamW |
| **Weight Decay** | 0.01 |
| **LR Scheduler** | ReduceLROnPlateau |
| **Early Stopping** | Patience: 3, Threshold: 0.001 |
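
The early-stopping row can be read as "stop once the monitored metric has not improved by more than 0.001 for 3 consecutive evaluations". The card does not say which metric was monitored or whether the threshold is absolute or relative, so the sketch below assumes validation loss (lower is better) and an absolute threshold. `EarlyStopper` and `epochs_run` are hypothetical helpers for illustration, not the project's training code.

```python
class EarlyStopper:
    """Hypothetical helper: stop when validation loss stops improving."""

    def __init__(self, patience: int = 3, threshold: float = 0.001):
        self.patience = patience      # evaluations to wait without improvement
        self.threshold = threshold    # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best - self.threshold:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


def epochs_run(val_losses, max_epochs=20):
    """Return how many epochs would run for a given sequence of validation losses."""
    stopper = EarlyStopper()
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.step(loss):
            return epoch
    return min(len(val_losses), max_epochs)
```

With these assumptions, a run whose loss plateaus after the second epoch stops three evaluations later, well before the 20-epoch cap.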
### Training Data
- Custom curated dataset of platform and official website URLs
- Balanced training set with equal representation of both classes
- Domain-specific preprocessing and data augmentation
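
The card does not describe how the class balance was obtained. One common approach is to downsample the larger class; the sketch below shows that idea only. `balance_by_downsampling` is a hypothetical function, and the `(url, label)` pair layout is an assumption about the data format.

```python
import random
from collections import defaultdict


def balance_by_downsampling(samples, seed=0):
    """Return a shuffled subset of (url, label) pairs with equal counts per label."""
    by_label = defaultdict(list)
    for url, label in samples:
        by_label[label].append((url, label))
    if not by_label:
        return []

    # Every class is cut down to the size of the smallest class.
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced
```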
## Performance
### Test Set Metrics
| Metric | Threshold | Achieved |
|--------|-----------|----------|
| **Accuracy** | ≥ 0.80 | **≥ 0.99** ✅ |
| **F1 Score** | ≥ 0.80 | **≥ 0.99** ✅ |
| **Precision** | ≥ 0.80 | **≥ 0.99** ✅ |
| **Recall** | ≥ 0.80 | **≥ 0.99** ✅ |
| **False Positive Rate** | ≤ 0.15 | **< 0.01** ✅ |
| **False Negative Rate** | ≤ 0.15 | **< 0.01** ✅ |
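
For reference, all six metrics in the table follow from the four confusion-matrix counts. The sketch below assumes `platform` (LABEL_1) is the positive class, which the card does not state, and assumes no denominator is zero. The counts are illustrative only and are not the model's test results; `binary_metrics` and `passes_thresholds` are hypothetical helper names.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the table's metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


def passes_thresholds(m: dict, min_score: float = 0.80, max_error_rate: float = 0.15) -> bool:
    """Check the metrics against the thresholds listed in the table."""
    scores_ok = all(m[k] >= min_score for k in ("accuracy", "precision", "recall", "f1"))
    rates_ok = all(
        m[k] <= max_error_rate for k in ("false_positive_rate", "false_negative_rate")
    )
    return scores_ok and rates_ok


# Illustrative counts only; these are not the model's actual test results.
metrics = binary_metrics(tp=90, fp=10, tn=90, fn=10)
print(metrics, passes_thresholds(metrics))
```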
### Example Predictions
- `https://acmewidgets.com` → **official_website** (99.98% confidence)
- `https://store.myshopify.com` → **platform** (75.96% confidence)
- `https://example.wixsite.com/store` → **platform** (high confidence)
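
Assuming the confidences above are softmax probabilities of the predicted class, as in the inference snippet in this card, the relationship between the model's two logits and the reported confidence can be sketched in plain Python. The logits below are made up for illustration and are not outputs of the model; `softmax` and `predict` are hypothetical helpers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, labels=("official_website", "platform")):
    """Return the most likely label and its softmax probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return labels[idx], probs[idx]


# Hypothetical logits: a modest gap between the two classes gives a moderate confidence.
label, confidence = predict([-0.5, 0.65])
print(f"{label} ({confidence:.2%})")
```

A small gap between the two logits yields a confidence well below 99%, which is one way a correct prediction can still carry a score in the 70s.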
## Usage
### Direct Inference with Transformers
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "DiligentAI/urlbert-url-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify URL (inputs longer than 64 tokens are truncated)
url = "https://acmewidgets.com"
inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=64)
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to probabilities and pick the most likely class
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1).item()
confidence = predictions[0][predicted_class].item()

label_map = {0: "official_website", 1: "platform"}
print(f"Prediction: {label_map[predicted_class]} ({confidence:.2%})")
```
### Using Hugging Face Pipeline
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
result = classifier("https://store.myshopify.com")
```
### Pydantic Integration (Production-Ready)
```python
from typing import Literal

from pydantic import BaseModel, Field
from transformers import pipeline

# Load the pipeline once at import time instead of on every call
classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
label_map = {"LABEL_0": "official_website", "LABEL_1": "platform"}


class URLClassificationResult(BaseModel):
    url: str
    label: Literal["official_website", "platform"]
    confidence: float = Field(..., ge=0.0, le=1.0)


def classify_url(url: str) -> URLClassificationResult:
    # Truncate to 64 tokens in the tokenizer; slicing the string would cut characters, not tokens
    result = classifier(url, truncation=True, max_length=64)[0]
    return URLClassificationResult(
        url=url,
        label=label_map[result["label"]],
        confidence=result["score"],
    )
```
## Limitations and Bias
- **Max URL Length**: The model was trained on sequences of up to 64 tokens; longer URLs are truncated.
- **Domain Focus**: Optimized for e-commerce and business websites.
- **Platform Coverage**: Best performance on common platforms (Shopify, Wix, etc.).
- **Language**: Primarily trained on English-language domains.
- **Edge Cases**: May have lower confidence on:
  - Uncommon TLDs
  - Very short URLs
  - Internationalized domain names
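
Callers who want to treat these edge cases with extra care could flag them before classification. The sketch below is one possible pre-check using only the standard library. The `COMMON_TLDS` set, the `min_length` cutoff, and the function name `edge_case_flags` are all assumptions chosen for illustration and do not come from the model's training setup. It also assumes each URL includes a scheme such as `https://`.

```python
from urllib.parse import urlsplit

# Assumption: a tiny illustrative set, not the TLD distribution of the training data.
COMMON_TLDS = {"com", "org", "net", "io", "co"}


def edge_case_flags(url: str, min_length: int = 15) -> list:
    """Return the edge-case categories from the list above that a URL falls into."""
    host = urlsplit(url).hostname or ""
    flags = []

    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    if tld not in COMMON_TLDS:
        flags.append("uncommon_tld")

    if len(url) < min_length:
        flags.append("very_short_url")

    # IDNs appear either as raw non-ASCII hosts or as punycode labels (xn--...).
    if not host.isascii() or any(part.startswith("xn--") for part in host.split(".")):
        flags.append("internationalized_domain")

    return flags
```

Flagged URLs can still be classified; the flags only mark predictions whose confidence may deserve a second look.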
## Intended Use
### Primary Use Cases
- URL filtering and categorization pipelines
- Lead qualification systems
- Web scraping and data collection workflows
- Business intelligence and market research
### Out of Scope
- Content classification (only URL structure is analyzed)
- Malicious URL detection (use dedicated security models)
- Language detection
- Spam filtering
## Model Card Authors
DiligentAI Team
## Citation
```bibtex
@misc{urlbert-classifier-2025,
  author = {DiligentAI},
  title = {URL Classifier - Platform vs Official Website Detection},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/DiligentAI/urlbert-url-classifier}}
}
```
## License
MIT License
## Framework Versions
- **Transformers**: 4.57.0+
- **PyTorch**: 2.0.0+
- **Python**: 3.10+
## Training Infrastructure
- **Framework**: PyTorch + Hugging Face Transformers
- **Pipeline Orchestration**: DVC (Data Version Control)
- **CI/CD**: GitHub Actions
- **Model Format**: Safetensors
- **Dependencies**: See repository
## Model Versioning
This model is automatically versioned and deployed via GitHub Actions. Each release includes:
- Model checkpoint (`.safetensors`)
- Tokenizer configuration
- Label mapping (`label_map.json`)
- Performance metrics (`metrics.json`)
## Contact
For issues, questions, or feedback:
- **GitHub**: DiligentAI/url-classifier
- **Organization**: DiligentAI