---
language:
- en
license: mit
tags:
- text-classification
- url-classification
- bert
- domain-classification
pipeline_tag: text-classification
widget:
- text: https://acmewidgets.com
- text: https://store.myshopify.com
- text: https://example.wixsite.com/store
- text: https://business.com
datasets:
- custom
metrics:
- accuracy
- f1
- precision
- recall
---

# URL Classifier

A fine-tuned BERT model for binary classification of URLs as either **platform listings** (e.g., `*.myshopify.com`, `*.wixsite.com`) or **official websites** (e.g., `acmewidgets.com`).

## Model Description

This model is a fine-tuned version of [amahdaouy/DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT) trained to distinguish between:

- **LABEL_0 (official_website)**: Direct company/brand websites
- **LABEL_1 (platform)**: Third-party platform listings (Shopify, Wix, etc.)

## Training Details

### Base Model

- **Architecture**: BERT for Sequence Classification
- **Base Model**: `amahdaouy/DomURLs_BERT`
- **Tokenizer**: `CrabInHoney/urlbert-tiny-base-v4`

### Training Configuration

| Parameter | Value |
|-----------|-------|
| **Epochs** | 20 |
| **Learning Rate** | 2e-5 |
| **Batch Size** | 32 |
| **Max Sequence Length** | 64 tokens |
| **Optimizer** | AdamW |
| **Weight Decay** | 0.01 |
| **LR Scheduler** | ReduceLROnPlateau |
| **Early Stopping** | Patience: 3, Threshold: 0.001 |

### Training Data

- Custom curated dataset of platform and official website URLs
- Balanced training set with equal representation of both classes
- Domain-specific preprocessing and data augmentation

## Performance

### Test Set Metrics

| Metric | Threshold | Achieved |
|--------|-----------|----------|
| **Accuracy** | ≥ 0.80 | **≥ 0.99** ✅ |
| **F1 Score** | ≥ 0.80 | **≥ 0.99** ✅ |
| **Precision** | ≥ 0.80 | **≥ 0.99** ✅ |
| **Recall** | ≥ 0.80 | **≥ 0.99** ✅ |
| **False Positive Rate** | ≤ 0.15 | **< 0.01** ✅ |
| **False Negative Rate** | ≤ 0.15 | **< 0.01** ✅ |

### Example Predictions

- `https://acmewidgets.com` → **official_website** (99.98% confidence)
- `https://store.myshopify.com` → **platform** (75.96% confidence)
- `https://example.wixsite.com/store` → **platform** (high confidence)

## Usage

### Direct Inference with Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "DiligentAI/urlbert-url-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Classify URL
url = "https://acmewidgets.com"
inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=64)

with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to class probabilities and pick the most likely class
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=1).item()
confidence = predictions[0][predicted_class].item()

label_map = {0: "official_website", 1: "platform"}
print(f"Prediction: {label_map[predicted_class]} ({confidence:.2%})")
```

### Using Hugging Face Pipeline

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
result = classifier("https://store.myshopify.com")
print(result)
```

### Pydantic Integration (Production-Ready)

```python
from typing import Literal

from pydantic import BaseModel, Field
from transformers import pipeline


class URLClassificationResult(BaseModel):
    url: str
    label: Literal["official_website", "platform"]
    confidence: float = Field(..., ge=0.0, le=1.0)


# Build the pipeline once and reuse it; creating it per call reloads the model.
classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")

LABEL_MAP = {"LABEL_0": "official_website", "LABEL_1": "platform"}


def classify_url(url: str) -> URLClassificationResult:
    # Truncate at the token level (max_length=64) rather than slicing characters.
    result = classifier(url, truncation=True, max_length=64)[0]
    return URLClassificationResult(
        url=url,
        label=LABEL_MAP[result["label"]],
        confidence=result["score"],
    )
```

## Limitations and Bias

- **Max URL Length**: The model was trained on sequences of at most 64 tokens; longer URLs are truncated.
- **Domain Focus**: Optimized for e-commerce and business websites
- **Platform Coverage**: Best performance on common platforms (Shopify, Wix, etc.)
- **Language**: Primarily trained on English-language domains
- **Edge Cases**: May have lower confidence on:
  - Uncommon TLDs
  - Very short URLs
  - Internationalized domain names

## Intended Use

**Primary Use Cases:**

- URL filtering and categorization pipelines
- Lead qualification systems
- Web scraping and data collection workflows
- Business intelligence and market research

**Out of Scope:**

- Content classification (only URL structure is analyzed)
- Malicious URL detection (use dedicated security models)
- Language detection
- Spam filtering

## Model Card Authors

DiligentAI Team

## Citation

```bibtex
@misc{urlbert-classifier-2025,
  author = {DiligentAI},
  title = {URL Classifier - Platform vs Official Website Detection},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/DiligentAI/urlbert-url-classifier}}
}
```

## License

MIT License

## Framework Versions

- **Transformers**: 4.57.0+
- **PyTorch**: 2.0.0+
- **Python**: 3.10+

## Training Infrastructure

- **Framework**: PyTorch + Hugging Face Transformers
- **Pipeline Orchestration**: DVC (Data Version Control)
- **CI/CD**: GitHub Actions
- **Model Format**: Safetensors
- **Dependencies**: See repository

## Model Versioning

This model is automatically versioned and deployed via GitHub Actions. Each release includes:

- Model checkpoint (`.safetensors`)
- Tokenizer configuration
- Label mapping (`label_map.json`)
- Performance metrics (`metrics.json`)

## Contact

For issues, questions, or feedback:

- **GitHub**: DiligentAI/url-classifier
- **Organization**: DiligentAI
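
## Appendix: Routing Low-Confidence Predictions

The example predictions and the limitations listed earlier indicate that some URLs are classified with lower confidence (for instance, `https://store.myshopify.com` at 75.96%). One possible way to handle this downstream is to send predictions below a cutoff to manual review. The snippet below is a minimal sketch under stated assumptions, not part of the released code: the `triage` function, the `needs_review` label, and the 0.90 cutoff are hypothetical and chosen only for illustration. The stand-in results mimic the `label`/`score` dictionaries returned by the text-classification pipeline, so the snippet runs without downloading the model.

```python
from typing import TypedDict

# Label names as documented in this model card.
LABEL_MAP = {"LABEL_0": "official_website", "LABEL_1": "platform"}

# Hypothetical cutoff for illustration; tune it on your own validation data.
REVIEW_THRESHOLD = 0.90


class PipelineOutput(TypedDict):
    """Shape of a single text-classification pipeline result."""

    label: str
    score: float


def triage(result: PipelineOutput, threshold: float = REVIEW_THRESHOLD) -> str:
    """Return the mapped label, or "needs_review" if the score is below the threshold."""
    if result["score"] < threshold:
        return "needs_review"
    return LABEL_MAP[result["label"]]


# Stand-in results shaped like pipeline output, using the scores quoted in this card.
examples = {
    "https://acmewidgets.com": {"label": "LABEL_0", "score": 0.9998},
    "https://store.myshopify.com": {"label": "LABEL_1", "score": 0.7596},
}

for url, result in examples.items():
    print(f"{url} -> {triage(result)}")
```

In a real deployment, each dictionary would come from calling the pipeline on a URL, and the cutoff would be chosen by weighing the cost of manual review against the cost of a misclassification.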