	---
	language:
	- en
	license: mit
	tags:
	- text-classification
	- url-classification
	- bert
	- domain-classification
	pipeline_tag: text-classification
	widget:
	- text: https://acmewidgets.com
	- text: https://store.myshopify.com
	- text: https://example.wixsite.com/store
	- text: https://business.com
	datasets:
	- custom
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	---

	# URL Classifier

	A fine-tuned BERT model for binary classification of URLs as either platform listings (e.g., `.myshopify.com`, `.wixsite.com`) or official websites (e.g., `acmewidgets.com`).

	## Model Description

	This model is a fine-tuned version of [amahdaouy/DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT) trained to distinguish between:

	- LABEL_0 (official_website): Direct company/brand websites
	- LABEL_1 (platform): Third-party platform listings (Shopify, Wix, etc.)

	## Training Details

	### Base Model
	- Architecture: BERT for Sequence Classification
	- Base Model: `amahdaouy/DomURLs_BERT`
	- Tokenizer: `CrabInHoney/urlbert-tiny-base-v4`

	### Training Configuration
	| Parameter | Value |
	|-----------|-------|
	| Epochs | 20 |
	| Learning Rate | 2e-5 |
	| Batch Size | 32 |
	| Max Sequence Length | 64 tokens |
	| Optimizer | AdamW |
	| Weight Decay | 0.01 |
	| LR Scheduler | ReduceLROnPlateau |
	| Early Stopping | Patience: 3, Threshold: 0.001 |
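
The early-stopping row can be read as: stop once the monitored validation metric has failed to improve by more than 0.001 for 3 consecutive evaluations. The following is a minimal, framework-free sketch of that rule, assuming a lower-is-better metric such as validation loss (the card does not say which metric was monitored). `EarlyStopper` and the loss values are hypothetical and are not taken from the actual training code.

```python
from dataclasses import dataclass


@dataclass
class EarlyStopper:
    """Stop training when a lower-is-better metric stops improving."""

    patience: int = 3         # evaluations to wait without improvement
    threshold: float = 0.001  # minimum decrease that counts as improvement
    best: float = float("inf")
    bad_evals: int = 0

    def step(self, metric: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if metric < self.best - self.threshold:
            self.best = metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Hypothetical validation losses, one per epoch (illustration only).
val_losses = [0.40, 0.21, 0.12, 0.1195, 0.1199, 0.1192, 0.05]

stopper = EarlyStopper()
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best loss {stopper.best}")
```

With these made-up losses the last three evaluations before stopping improve on the best value by less than the threshold, so training halts well before the 20-epoch budget.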

	### Training Data
	- Custom curated dataset of platform and official website URLs
	- Balanced training set with equal representation of both classes
	- Domain-specific preprocessing and data augmentation
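
The card does not say how the class balance was achieved. One common option is to downsample the larger class to the size of the smaller one; the sketch below illustrates that option only, and `balance_by_downsampling` and the sample URLs are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_downsampling(samples, seed=0):
    """Return a shuffled subset with the same number of samples per label.

    `samples` is a list of (url, label) pairs. Every class is cut down to
    the size of the smallest class.
    """
    by_label = defaultdict(list)
    for url, label in samples:
        by_label[label].append((url, label))

    target = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced


# Hypothetical data: 5 official websites (label 0), 2 platform listings (label 1).
samples = [(f"https://brand{i}.com", 0) for i in range(5)]
samples += [(f"https://shop{i}.myshopify.com", 1) for i in range(2)]

balanced = balance_by_downsampling(samples)
print(len(balanced))
```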

	## Performance

	### Test Set Metrics
	| Metric | Threshold | Achieved |
	|--------|-----------|----------|
	| Accuracy | ≥ 0.80 | ≥ 0.99 ✅ |
	| F1 Score | ≥ 0.80 | ≥ 0.99 ✅ |
	| Precision | ≥ 0.80 | ≥ 0.99 ✅ |
	| Recall | ≥ 0.80 | ≥ 0.99 ✅ |
	| False Positive Rate | ≤ 0.15 | < 0.01 ✅ |
	| False Negative Rate | ≤ 0.15 | < 0.01 ✅ |
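
For readers who want to reproduce this kind of check on their own test set, all six metrics follow from a binary confusion matrix. The sketch below assumes `platform` (LABEL_1) is the positive class, which the card does not state. The counts are invented for illustration and are not the model's actual confusion matrix.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute the table's metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "f1": f1,
        "precision": precision,
        "recall": recall,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negative_rate": fn / (fn + tp) if fn + tp else 0.0,
    }


# Acceptance thresholds from the table: "min" is a lower bound, "max" an upper bound.
THRESHOLDS = {
    "accuracy": (0.80, "min"),
    "f1": (0.80, "min"),
    "precision": (0.80, "min"),
    "recall": (0.80, "min"),
    "false_positive_rate": (0.15, "max"),
    "false_negative_rate": (0.15, "max"),
}


def passes_thresholds(metrics: dict[str, float]) -> bool:
    """Return True when every metric satisfies its bound."""
    for name, (bound, kind) in THRESHOLDS.items():
        value = metrics[name]
        if kind == "min" and value < bound:
            return False
        if kind == "max" and value > bound:
            return False
    return True


# Hypothetical counts for a 1,000-URL test set.
metrics = binary_metrics(tp=497, fp=3, tn=497, fn=3)
print(metrics["accuracy"], passes_thresholds(metrics))
```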

	### Example Predictions
	- `https://acmewidgets.com` → official_website (99.98% confidence)
	- `https://store.myshopify.com` → platform (75.96% confidence)
	- `https://example.wixsite.com/store` → platform (high confidence)
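
Confidence varies noticeably between these examples, so a downstream pipeline may want to accept only high-confidence predictions and send the rest to manual review. A minimal sketch of that idea follows; the 0.90 cutoff, the `Prediction` type, and `split_by_confidence` are assumptions for illustration and are not part of the model or its repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    url: str
    label: str
    confidence: float


def split_by_confidence(predictions, cutoff=0.90):
    """Split predictions into (accepted, needs_review) around a cutoff."""
    accepted = [p for p in predictions if p.confidence >= cutoff]
    needs_review = [p for p in predictions if p.confidence < cutoff]
    return accepted, needs_review


# Confidence values copied from the first two examples above.
predictions = [
    Prediction("https://acmewidgets.com", "official_website", 0.9998),
    Prediction("https://store.myshopify.com", "platform", 0.7596),
]

accepted, needs_review = split_by_confidence(predictions)
print([p.url for p in needs_review])
```

The cutoff should be tuned on held-out data for the target use case rather than taken from this sketch.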

	## Usage

	### Direct Inference with Transformers

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Load model and tokenizer
	model_name = "DiligentAI/urlbert-url-classifier"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	# Classify URL
	url = "https://acmewidgets.com"
	inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=64)

	# Run inference without tracking gradients
	with torch.no_grad():
	    outputs = model(**inputs)

	# Convert logits to probabilities and pick the most likely class
	predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
	predicted_class = torch.argmax(predictions, dim=1).item()
	confidence = predictions[0][predicted_class].item()

	label_map = {0: "official_website", 1: "platform"}
	print(f"Prediction: {label_map[predicted_class]} ({confidence:.2%})")
	```

	### Using Hugging Face Pipeline

	```python
	from transformers import pipeline

	classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
	result = classifier("https://store.myshopify.com")
	print(result)  # a list with one dict holding "label" and "score"
	```

	### Pydantic Integration (Production-Ready)

	```python
	from transformers import pipeline
	from pydantic import BaseModel, Field
	from typing import Literal

	class URLClassificationResult(BaseModel):
	    url: str
	    label: Literal["official_website", "platform"]
	    confidence: float = Field(..., ge=0.0, le=1.0)

	# Load the pipeline once instead of on every call
	classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
	label_map = {"LABEL_0": "official_website", "LABEL_1": "platform"}

	def classify_url(url: str) -> URLClassificationResult:
	    # Truncate by tokens (not characters) to the 64-token training length
	    result = classifier(url, truncation=True, max_length=64)[0]

	    return URLClassificationResult(
	        url=url,
	        label=label_map[result["label"]],
	        confidence=result["score"],
	    )
	```

	## Limitations and Bias

	- Max URL Length: Model trained on 64-token sequences. Longer URLs are truncated.
	- Domain Focus: Optimized for e-commerce and business websites
	- Platform Coverage: Best performance on common platforms (Shopify, Wix, etc.)
	- Language: Primarily trained on English-language domains
	- Edge Cases: May have lower confidence on:
	  - Uncommon TLDs
	  - Very short URLs
	  - Internationalized domain names
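
Because these edge cases can be spotted from the URL alone, a caller could flag them before trusting a prediction. The sketch below is one heuristic way to do that using only the standard library. The TLD allow-list, the length cutoff, and `edge_case_flags` are all assumptions chosen for illustration and do not reflect how the model was trained or evaluated.

```python
from urllib.parse import urlsplit

# Illustrative allow-list only; which TLDs count as "common" is an assumption.
COMMON_TLDS = {"com", "net", "org", "io", "co"}
MIN_HOST_LENGTH = 6  # hypothetical cutoff for a "very short" host


def edge_case_flags(url: str) -> list[str]:
    """Return heuristic flags for an absolute URL (scheme included)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    flags = []

    # Internationalized names are either non-ASCII or punycode ("xn--") labels.
    if not host.isascii() or any(label.startswith("xn--") for label in labels):
        flags.append("internationalized_domain")
    if labels[-1] not in COMMON_TLDS:
        flags.append("uncommon_tld")
    if len(host) < MIN_HOST_LENGTH:
        flags.append("very_short")
    return flags


for candidate in ("https://acmewidgets.com", "https://t.co", "https://shop.xyz"):
    print(candidate, edge_case_flags(candidate))
```

Flagged URLs can still be classified; the flags only mark predictions that deserve a second look.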

	## Intended Use

	### Primary Use Cases
	- URL filtering and categorization pipelines
	- Lead qualification systems
	- Web scraping and data collection workflows
	- Business intelligence and market research

	### Out of Scope
	- Content classification (only URL structure is analyzed)
	- Malicious URL detection (use dedicated security models)
	- Language detection
	- Spam filtering

	## Model Card Authors

	DiligentAI Team

	## Citation

	@misc{urlbert-classifier-2025,
	  author = {DiligentAI},
	  title = {URL Classifier - Platform vs Official Website Detection},
	  year = {2025},
	  publisher = {HuggingFace},
	  howpublished = {\url{https://huggingface.co/DiligentAI/urlbert-url-classifier}}
	}

	## License

	MIT License

	## Framework Versions

	- Transformers: 4.57.0+
	- PyTorch: 2.0.0+
	- Python: 3.10+

	## Training Infrastructure

	- Framework: PyTorch + Hugging Face Transformers
	- Pipeline Orchestration: DVC (Data Version Control)
	- CI/CD: GitHub Actions
	- Model Format: Safetensors
	- Dependencies: See repository

	## Model Versioning

	This model is automatically versioned and deployed via GitHub Actions. Each release includes:

	- Model checkpoint (`.safetensors`)
	- Tokenizer configuration
	- Label mapping (`label_map.json`)
	- Performance metrics (`metrics.json`)
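
A consumer of these releases might check that a downloaded release directory is complete before loading it. The sketch below does that with the standard library only. It assumes the tokenizer configuration is stored as `tokenizer_config.json` (the usual Transformers file name, not confirmed by this card), and `missing_release_files` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

# label_map.json and metrics.json are named in the list above;
# tokenizer_config.json is an assumed file name.
REQUIRED_FILES = ("tokenizer_config.json", "label_map.json", "metrics.json")


def missing_release_files(release_dir: Path) -> list[str]:
    """Return the expected release artifacts that are absent from a directory."""
    missing = [name for name in REQUIRED_FILES if not (release_dir / name).is_file()]
    # The checkpoint file name is not specified, so accept any .safetensors file.
    if not any(release_dir.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing


# Demo against a throwaway directory containing empty stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    release_dir = Path(tmp)
    for name in ("model.safetensors", "label_map.json", "metrics.json"):
        (release_dir / name).touch()
    missing = missing_release_files(release_dir)

print(missing)
```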

	## Contact

	For issues, questions, or feedback:

	- GitHub: DiligentAI/url-classifier
	- Organization: DiligentAI