michael-sigamani committed on
Commit a4dd5ac · verified · 1 Parent(s): fa11683

Update README.md

Files changed (1):
  1. README.md +200 -188
README.md CHANGED
@@ -1,199 +1,211 @@
  ---
- library_name: transformers
- tags: []
  ---
 
- # Model Card for Model ID
 
- <!-- Provide a quick summary of what the model is/does. -->
 
 
- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
 
  ## Training Details
 
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
 
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
  ---
+ language:
+ - en
+ license: mit
+ tags:
+ - text-classification
+ - url-classification
+ - bert
+ - domain-classification
+ pipeline_tag: text-classification
+ widget:
+ - text: "https://acmewidgets.com"
+ - text: "https://store.myshopify.com"
+ - text: "https://example.wixsite.com/store"
+ - text: "https://business.com"
+ datasets:
+ - custom
+ metrics:
+ - accuracy
+ - f1
+ - precision
+ - recall
+ model-index:
+ - name: urlbert-url-classifier
+   results:
+   - task:
+       type: text-classification
+       name: URL Classification
+     metrics:
+     - type: accuracy
+       value: 0.99
+       name: Test Accuracy
+     - type: f1
+       value: 0.99
+       name: Test F1 Score
+     - type: precision
+       value: 0.99
+       name: Test Precision
+     - type: recall
+       value: 0.99
+       name: Test Recall
  ---
 
+ # URL Classifier
+
+ A fine-tuned BERT model for binary classification of URLs as either **platform listings** (e.g., `*.myshopify.com`, `*.wixsite.com`) or **official websites** (e.g., `acmewidgets.com`).
+
+ ## Model Description
+
+ This model is a fine-tuned version of [amahdaouy/DomURLs_BERT](https://huggingface.co/amahdaouy/DomURLs_BERT), trained to distinguish between:
+
+ - **LABEL_0 (official_website)**: direct company/brand websites
+ - **LABEL_1 (platform)**: third-party platform listings (Shopify, Wix, etc.)
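+
+ The checkpoint itself ships the default `LABEL_0`/`LABEL_1` names (see the raw pipeline output under Usage), so the readable names above are applied client-side. A quick sanity check, shown here as a sketch:
+
+ ```python
+ # Inspect the raw label ids this card refers to; mapping them to
+ # "official_website"/"platform" is done in the usage examples below.
+ from transformers import AutoConfig
+
+ config = AutoConfig.from_pretrained("DiligentAI/urlbert-url-classifier")
+ print(config.id2label)  # expected: {0: 'LABEL_0', 1: 'LABEL_1'}
+ ```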
 
  ## Training Details
 
+ ### Base Model
+
+ - **Architecture**: BERT for Sequence Classification
+ - **Base Model**: `amahdaouy/DomURLs_BERT`
+ - **Tokenizer**: `CrabInHoney/urlbert-tiny-base-v4`
+
+ ### Training Configuration
+
+ | Parameter | Value |
+ |-----------|-------|
+ | **Epochs** | 20 |
+ | **Learning Rate** | 2e-5 |
+ | **Batch Size** | 32 |
+ | **Max Sequence Length** | 64 tokens |
+ | **Optimizer** | AdamW |
+ | **Weight Decay** | 0.01 |
+ | **LR Scheduler** | ReduceLROnPlateau |
+ | **Early Stopping** | Patience: 3, Threshold: 0.001 |
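+
+ A minimal sketch of how these hyperparameters might be wired together in plain PyTorch. This is illustrative only (a hypothetical two-example batch, no validation split); the actual training runs in the DVC pipeline described under Training Infrastructure:
+
+ ```python
+ import torch
+ from torch.optim import AdamW
+ from torch.optim.lr_scheduler import ReduceLROnPlateau
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ tokenizer = AutoTokenizer.from_pretrained("CrabInHoney/urlbert-tiny-base-v4")
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "amahdaouy/DomURLs_BERT", num_labels=2
+ )
+
+ optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
+ scheduler = ReduceLROnPlateau(optimizer, mode="min", patience=3, threshold=0.001)
+
+ # Hypothetical toy batch; the real dataset is the curated URL set below.
+ urls = ["https://acmewidgets.com", "https://store.myshopify.com"]
+ labels = torch.tensor([0, 1])  # 0 = official_website, 1 = platform
+ batch = tokenizer(urls, padding=True, truncation=True, max_length=64,
+                   return_tensors="pt")
+
+ for epoch in range(20):
+     model.train()
+     loss = model(**batch, labels=labels).loss
+     loss.backward()
+     optimizer.step()
+     optimizer.zero_grad()
+     scheduler.step(loss.item())  # validation loss in the real setup
+ ```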
+
+ ### Training Data
+
+ - Custom curated dataset of platform and official-website URLs
+ - Balanced training set with equal representation of both classes
+ - Domain-specific preprocessing and data augmentation
+
+ ## Performance
+
+ ### Test Set Metrics
+
+ | Metric | Quality Gate | Achieved |
+ |--------|--------------|----------|
+ | **Accuracy** | ≥ 0.80 | **≥ 0.99** ✅ |
+ | **F1 Score** | ≥ 0.80 | **≥ 0.99** ✅ |
+ | **Precision** | ≥ 0.80 | **≥ 0.99** ✅ |
+ | **Recall** | ≥ 0.80 | **≥ 0.99** ✅ |
+ | **False Positive Rate** | ≤ 0.15 | **< 0.01** ✅ |
+ | **False Negative Rate** | ≤ 0.15 | **< 0.01** ✅ |
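+
+ The gate column above describes release quality gates. A sketch of the kind of check this implies (the dictionaries are illustrative; the real gates run in the GitHub Actions pipeline described below, against the released `metrics.json`):
+
+ ```python
+ # Fail the release if any test-set metric falls below its gate.
+ GATES = {"accuracy": 0.80, "f1": 0.80, "precision": 0.80, "recall": 0.80}
+ metrics = {"accuracy": 0.99, "f1": 0.99, "precision": 0.99, "recall": 0.99}
+
+ for name, minimum in GATES.items():
+     assert metrics[name] >= minimum, f"{name}={metrics[name]} below gate {minimum}"
+ ```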
+
+ ### Example Predictions
+
+ - `https://acmewidgets.com` → **official_website** (99.98% confidence)
+ - `https://store.myshopify.com` → **platform** (75.96% confidence)
+ - `https://example.wixsite.com/store` → **platform** (high confidence)
+
+ ## Usage
+
+ ### Direct Inference with Transformers
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model and tokenizer
+ model_name = "DiligentAI/urlbert-url-classifier"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Classify a URL
+ url = "https://acmewidgets.com"
+ inputs = tokenizer(url, return_tensors="pt", truncation=True, max_length=64)
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+ predicted_class = torch.argmax(predictions, dim=1).item()
+ confidence = predictions[0][predicted_class].item()
+
+ label_map = {0: "official_website", 1: "platform"}
+ print(f"Prediction: {label_map[predicted_class]} ({confidence:.2%})")
+ ```
+
+ ### Using the Hugging Face Pipeline
+
+ ```python
+ from transformers import pipeline
+
+ classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
+ result = classifier("https://store.myshopify.com")
+ print(result)
+ # [{'label': 'LABEL_1', 'score': 0.7596}]
+ ```
+
+ ### Pydantic Integration (Production-Ready)
+
+ ```python
+ from transformers import pipeline
+ from pydantic import BaseModel, Field
+ from typing import Literal
+
+ class URLClassificationResult(BaseModel):
+     url: str
+     label: Literal["official_website", "platform"]
+     confidence: float = Field(..., ge=0.0, le=1.0)
+
+ # Load the pipeline once at import time rather than on every call
+ classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
+
+ def classify_url(url: str) -> URLClassificationResult:
+     # The tokenizer truncates inputs to the model's 64-token limit
+     result = classifier(url, truncation=True, max_length=64)[0]
+     label_map = {"LABEL_0": "official_website", "LABEL_1": "platform"}
+     return URLClassificationResult(
+         url=url,
+         label=label_map[result["label"]],
+         confidence=result["score"],
+     )
+ ```
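+
+ Example call (the printed score is illustrative, not a measured value):
+
+ ```python
+ print(classify_url("https://example.wixsite.com/store"))
+ # illustrative: url='https://example.wixsite.com/store' label='platform' confidence=0.98
+ ```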
+
+ ## Limitations and Bias
+
+ - **Max URL length**: the model was trained on 64-token sequences; longer URLs are truncated.
+ - **Domain focus**: optimized for e-commerce and business websites.
+ - **Platform coverage**: best performance on common platforms (Shopify, Wix, etc.).
+ - **Language**: primarily trained on English-language domains.
+ - **Edge cases**: may have lower confidence on (see the confidence-gating sketch after this list):
+   - uncommon TLDs
+   - very short URLs
+   - internationalized domain names
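+
+ Where those edge cases matter, one option is to gate on the classifier's confidence and route uncertain URLs to manual review. A minimal sketch; the 0.90 cutoff is an illustrative choice, not a tuned value:
+
+ ```python
+ from transformers import pipeline
+
+ classifier = pipeline("text-classification", model="DiligentAI/urlbert-url-classifier")
+
+ def classify_with_review_queue(url: str, threshold: float = 0.90):
+     """Return (label, score), or flag the URL for manual review."""
+     result = classifier(url, truncation=True, max_length=64)[0]
+     if result["score"] < threshold:
+         return ("needs_review", result["score"])
+     return (result["label"], result["score"])
+ ```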
+
+ ## Intended Use
+
+ **Primary use cases:**
+
+ - URL filtering and categorization pipelines
+ - Lead qualification systems
+ - Web scraping and data collection workflows
+ - Business intelligence and market research
+
+ **Out of scope:**
+
+ - Content classification (only URL structure is analyzed)
+ - Malicious URL detection (use dedicated security models)
+ - Language detection
+ - Spam filtering
+
+ ## Model Card Authors
+
+ DiligentAI Team
+
+ ## Citation
+
+ ```bibtex
+ @misc{urlbert-classifier-2025,
+   author       = {DiligentAI},
+   title        = {URL Classifier - Platform vs Official Website Detection},
+   year         = {2025},
+   publisher    = {HuggingFace},
+   howpublished = {\url{https://huggingface.co/DiligentAI/urlbert-url-classifier}}
+ }
+ ```
+
+ ## License
+
+ MIT License
+
+ ## Framework Versions
+
+ - Transformers: 4.57.0+
+ - PyTorch: 2.0.0+
+ - Python: 3.10+
+
+ ## Training Infrastructure
+
+ - **Framework**: PyTorch + Hugging Face Transformers
+ - **Pipeline Orchestration**: DVC (Data Version Control)
+ - **CI/CD**: GitHub Actions
+ - **Model Format**: Safetensors
+ - **Dependencies**: see repository
+
+ ## Model Versioning
+
+ This model is automatically versioned and deployed via GitHub Actions. Each release includes:
+
+ - Model checkpoint (`.safetensors`)
+ - Tokenizer configuration
+ - Label mapping (`label_map.json`; see the fetch sketch after this list)
+ - Performance metrics (`metrics.json`)
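+
+ A minimal sketch of pulling one of these artifacts from the Hub with `huggingface_hub`. It assumes `label_map.json` is published at the root of this model repo, as the release list above suggests:
+
+ ```python
+ import json
+ from huggingface_hub import hf_hub_download
+
+ # Download the label mapping shipped with each release
+ path = hf_hub_download("DiligentAI/urlbert-url-classifier", "label_map.json")
+ with open(path) as f:
+     label_map = json.load(f)
+ print(label_map)  # expected to map LABEL_0/LABEL_1 to the readable names
+ ```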
+
+ ## Contact
+
+ For issues, questions, or feedback:
+
+ - GitHub: DiligentAI/url-classifier
+ - Organization: DiligentAI