DistilBERT Message Parser
A fine-tuned DistilBERT model for parsing natural language queries to extract receiver (person) and content (message) information from user requests.
Model Description
This model performs token-level classification to identify:
- person: The recipient/receiver of the message
- content: The message content to be sent
- O: Other tokens (Outside)
Use Cases
Perfect for virtual assistants, chatbots, and messaging applications that need to understand commands like:
- "Send a message to Mom telling her I'll be home late"
- "Ask the python teacher when is the next class"
- "Text John about tomorrow's meeting"
Quick Start
Installation
pip install transformers torch
Basic Usage
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
# Load model and tokenizer
model_name = "AbdellatifZ/distilbert-message-parser" # Replace with your model name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
# Helper function for word-level predictions
def predict_at_word_level(words, model, tokenizer):
    """Predict labels at word level (not subword tokens)"""
    inputs = tokenizer(words, return_tensors="pt", is_split_into_words=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    predictions = torch.argmax(logits, dim=2)
    word_labels = []
    word_ids = inputs.word_ids()
    previous_word_idx = None
    for idx, word_idx in enumerate(word_ids):
        if word_idx is None:  # Special tokens
            continue
        if word_idx != previous_word_idx:  # First subtoken of each word
            word_labels.append(predictions[0][idx].item())
        previous_word_idx = word_idx
    return word_labels
# Main parsing function
def parse_message(query, model, tokenizer):
    """
    Parse a query to extract receiver and content.

    Args:
        query (str): User query in natural language
        model: Token classification model
        tokenizer: Tokenizer

    Returns:
        dict: {"receiver": str, "content": str}
    """
    words = query.split()
    label_ids = predict_at_word_level(words, model, tokenizer)
    id2label = model.config.id2label
    labels = [id2label[label_id] for label_id in label_ids]
    person_tokens = [word for word, label in zip(words, labels) if label == 'person']
    content_tokens = [word for word, label in zip(words, labels) if label == 'content']
    return {
        'receiver': ' '.join(person_tokens) if person_tokens else None,
        'content': ' '.join(content_tokens) if content_tokens else None
    }
# Example usage
query = "Ask the python teacher when is the next class"
result = parse_message(query, model, tokenizer)
print(result)
# Output: {'receiver': 'the python teacher', 'content': 'when is the next class'}
More Examples
# Example 1: Simple message
query = "Send a message to Mom telling her I'll be home late"
result = parse_message(query, model, tokenizer)
print(result)
# {'receiver': 'Mom', 'content': "telling her I'll be home late"}
# Example 2: Professional context
query = "Write to the professor asking about the exam format"
result = parse_message(query, model, tokenizer)
print(result)
# {'receiver': 'the professor', 'content': 'asking about the exam format'}
# Example 3: Casual context
query = "Text John asking if he's available for a meeting tomorrow"
result = parse_message(query, model, tokenizer)
print(result)
# {'receiver': 'John', 'content': "asking if he's available for a meeting tomorrow"}
Advanced Usage: Batch Processing
def parse_messages_batch(queries, model, tokenizer):
    """Parse multiple queries, one model call per query (no tensor batching)"""
    results = []
    for query in queries:
        result = parse_message(query, model, tokenizer)
        results.append(result)
    return results
# Batch example
queries = [
    "Ask the python teacher when is the next class",
    "Message the customer support about my order status",
    "Text my friend to see if they're coming tonight"
]
results = parse_messages_batch(queries, model, tokenizer)
for query, result in zip(queries, results):
    print(f"Query: {query}")
    print(f"Result: {result}\n")
Detailed Token-Level Analysis
def visualize_parsing(query, model, tokenizer):
    """Show word-by-word label predictions"""
    words = query.split()
    label_ids = predict_at_word_level(words, model, tokenizer)
    id2label = model.config.id2label
    labels = [id2label[label_id] for label_id in label_ids]
    print(f"\nQuery: {query}\n")
    print(f"{'Word':<25} {'Label':<10}")
    print("-" * 35)
    for word, label in zip(words, labels):
        print(f"{word:<25} {label:<10}")
    result = parse_message(query, model, tokenizer)
    print(f"\n{'='*35}")
    print(f"Receiver: {result['receiver']}")
    print(f"Content: {result['content']}")
    print(f"{'='*35}")
# Example
visualize_parsing("Ask the python teacher when is the next class", model, tokenizer)
Output:
Query: Ask the python teacher when is the next class

Word                      Label
-----------------------------------
Ask                       O
the                       person
python                    person
teacher                   person
when                      content
is                        content
the                       content
next                      content
class                     content

===================================
Receiver: the python teacher
Content: when is the next class
===================================
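The word-level labels shown above can also be grouped into contiguous spans, which helps when a query yields more than one receiver or content segment. The following is a minimal sketch, not part of the model's API; `group_spans` is a hypothetical helper name, and the labels are supplied by hand here so the example runs without loading the model.

```python
from itertools import groupby


def group_spans(words, labels, outside="O"):
    """Group consecutive words that share a label into (label, text) spans.

    Hypothetical helper: words labeled `outside` are dropped.
    """
    spans = []
    for label, group in groupby(zip(words, labels), key=lambda pair: pair[1]):
        if label == outside:
            continue
        spans.append((label, " ".join(word for word, _ in group)))
    return spans


words = "Ask the python teacher when is the next class".split()
labels = ["O", "person", "person", "person",
          "content", "content", "content", "content", "content"]
print(group_spans(words, labels))
# [('person', 'the python teacher'), ('content', 'when is the next class')]
```

Unlike joining every word with the same label, this keeps two separated `person` spans apart instead of merging them into one string.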
API Integration Example
from flask import Flask, request, jsonify
from transformers import AutoTokenizer, AutoModelForTokenClassification
app = Flask(__name__)
# Load model once at startup
# (parse_message is the function defined under Basic Usage)
model = AutoModelForTokenClassification.from_pretrained("AbdellatifZ/distilbert-message-parser")
tokenizer = AutoTokenizer.from_pretrained("AbdellatifZ/distilbert-message-parser")
@app.route('/parse', methods=['POST'])
def parse():
    # silent=True returns None instead of raising on a missing or invalid JSON body
    data = request.get_json(silent=True) or {}
    query = data.get('query', '')
    if not query:
        return jsonify({'error': 'No query provided'}), 400
    try:
        result = parse_message(query, model, tokenizer)
        return jsonify({
            'success': True,
            'query': query,
            'parsed': result
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 500
if __name__ == '__main__':
    # debug=True is for local development only
    app.run(debug=True)
Model Details
| Property | Value |
|---|---|
| Base Model | distilbert-base-uncased |
| Task | Token Classification (NER-style) |
| Number of Labels | 3 (O, content, person) |
| Training Framework | Transformers (Hugging Face) |
| Parameters | ~67M (DistilBERT) |
| Max Sequence Length | 128 tokens |
Training Details
Dataset
- Source: Custom Presto-based dataset
- Task: Send_message queries
- Labels: person, content, O
- Split: 70% train, 15% validation, 15% test
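A 70/15/15 split like the one listed above could be produced as follows. This is a minimal sketch under assumptions: the card does not say whether the split was shuffled or seeded, so the shuffle, the `seed` value, and the `split_dataset` name are all illustrative.

```python
import random


def split_dataset(examples, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle and split examples into train/validation/test lists.

    Assumption: a seeded random shuffle; the remainder goes to the test set.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```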
Training Configuration
- Epochs: 15
- Batch Size: 16
- Learning Rate: 2e-5
- Optimizer: AdamW
- Weight Decay: 0.01
- Warmup Steps: 100
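To illustrate how the learning rate and warmup settings interact, here is a small sketch of a linear warmup followed by linear decay. The schedule type is an assumption (it is the Hugging Face `Trainer` default, but the card does not state which scheduler was used), and `total_steps=1000` is a made-up value since the dataset size is not given.

```python
def lr_at_step(step, total_steps, base_lr=2e-5, warmup_steps=100):
    """Learning rate at a given optimizer step.

    Assumption: linear warmup from 0 to base_lr over warmup_steps,
    then linear decay to 0 at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# total_steps=1000 is hypothetical; it depends on dataset size, batch size, and epochs
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```

The rate climbs to its peak of 2e-5 at step 100 and then falls linearly to zero at the final step.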
Label Alignment
The model uses special label alignment to handle subword tokenization:
- Only the first subtoken of each word receives a label
- Subsequent subtokens are marked with -100 (ignored in loss computation)
- Special tokens ([CLS], [SEP], [PAD]) are also ignored
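The alignment rules above can be sketched as a small function over the `word_ids()` list that a fast tokenizer returns. This is an illustrative reconstruction, not the training code itself: `align_labels_with_tokens` is a hypothetical name, and the `word_ids` and label ids in the example are made up rather than produced by the real tokenizer.

```python
def align_labels_with_tokens(word_ids, word_labels, ignore_index=-100):
    """Expand word-level label ids to token-level label ids.

    The first subtoken of each word keeps the word's label; later subtokens
    and special tokens (word id None) get ignore_index.
    """
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(ignore_index)           # special token
        elif word_idx != previous_word_idx:
            aligned.append(word_labels[word_idx])  # first subtoken of a word
        else:
            aligned.append(ignore_index)           # continuation subtoken
        previous_word_idx = word_idx
    return aligned


# Hypothetical encoding: special token, word 0, word 1 split in two, special token
word_ids = [None, 0, 1, 1, None]
word_labels = [0, 2]  # made-up label ids for the two words
print(align_labels_with_tokens(word_ids, word_labels))
# [-100, 0, 2, -100, -100]
```

The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, which is why positions marked this way do not contribute to training.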
Performance
| Metric | Value |
|---|---|
| Accuracy | >0.90 |
| Precision | >0.88 |
| Recall | >0.88 |
| F1-Score | >0.88 |
Note: Actual metrics may vary depending on your specific use case and dataset.
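To check these figures on your own data, per-label precision, recall, and F1 can be computed from word-level labels. The sketch below is one possible way to do it; the card does not say how its metrics were averaged or whether they are word-level or span-level, and both `per_label_scores` and the example labels are illustrative.

```python
def per_label_scores(true_labels, pred_labels, label):
    """Word-level precision, recall, and F1 for a single label."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up gold and predicted labels for a five-word query
true_labels = ["O", "person", "person", "content", "content"]
pred_labels = ["O", "person", "content", "content", "content"]
for label in ("person", "content"):
    p, r, f = per_label_scores(true_labels, pred_labels, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# person: precision=1.00 recall=0.50 f1=0.67
# content: precision=0.67 recall=1.00 f1=0.80
```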
Limitations
- Language: Optimized for English queries only
- Domain: Best performance on message-sending commands
- Structure: May struggle with highly unusual or complex sentence structures
- Context: Limited to single-turn queries (no conversation context)
Error Handling
def safe_parse_message(query, model, tokenizer):
    """Parse with error handling"""
    try:
        if not query or not query.strip():
            return {'error': 'Empty query', 'receiver': None, 'content': None}
        result = parse_message(query, model, tokenizer)
        # Validate results
        if not result['receiver'] and not result['content']:
            return {'warning': 'No entities found', **result}
        return result
    except Exception as e:
        return {'error': str(e), 'receiver': None, 'content': None}
# Example
result = safe_parse_message("", model, tokenizer)
print(result) # {'error': 'Empty query', 'receiver': None, 'content': None}
Citation
If you use this model in your research, please cite:
@misc{distilbert-message-parser,
author = {Your Name},
title = {DistilBERT Message Parser: Token Classification for Message Intent Extraction},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/AbdellatifZ/distilbert-message-parser}}
}
License
This model is released under the Apache 2.0 License.
Contact & Feedback
For questions, issues, or feedback:
- Open an issue on the model repository
- Contact: [Your contact information]
Acknowledgments
- Base model: DistilBERT by Hugging Face
- Framework: Transformers by Hugging Face
- Dataset inspiration: Presto benchmark
Built with Transformers