DistilBERT Message Parser πŸ€–πŸ’¬

A fine-tuned DistilBERT model for parsing natural language queries to extract receiver (person) and content (message) information from user requests.

Model Description

This model performs token-level classification to identify:

  • person: The recipient/receiver of the message
  • content: The message content to be sent
  • O: all other tokens (outside either entity)
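
The label-to-id mapping is stored in the model config and can be inspected directly (the index order shown in the comment is illustrative; read it from the config rather than hard-coding it):

from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("AbdellatifZ/distilbert-message-parser")
print(model.config.id2label)
# e.g. {0: 'O', 1: 'content', 2: 'person'} -- the exact order depends on training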

Use Cases

Perfect for virtual assistants, chatbots, and messaging applications that need to understand commands like:

  • "Send a message to Mom telling her I'll be home late"
  • "Ask the python teacher when is the next class"
  • "Text John about tomorrow's meeting"

Quick Start

Installation

pip install transformers torch

Basic Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "AbdellatifZ/distilbert-message-parser"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Helper function for word-level predictions
def predict_at_word_level(words, model, tokenizer):
    """Predict labels at word level (not subword tokens)"""
    inputs = tokenizer(words, return_tensors="pt", is_split_into_words=True, truncation=True)

    with torch.no_grad():
        logits = model(**inputs).logits
    predictions = torch.argmax(logits, dim=2)

    word_labels = []
    word_ids = inputs.word_ids()
    previous_word_idx = None

    for idx, word_idx in enumerate(word_ids):
        if word_idx is None:  # Special tokens
            continue
        if word_idx != previous_word_idx:  # First subtoken of each word
            word_labels.append(predictions[0][idx].item())
            previous_word_idx = word_idx

    return word_labels

# Main parsing function
def parse_message(query, model, tokenizer):
    """
    Parse a query to extract receiver and content.

    Args:
        query (str): User query in natural language
        model: Token classification model
        tokenizer: Tokenizer

    Returns:
        dict: {"receiver": str, "content": str}
    """
    words = query.split()
    label_ids = predict_at_word_level(words, model, tokenizer)

    id2label = model.config.id2label
    labels = [id2label[label_id] for label_id in label_ids]

    person_tokens = [word for word, label in zip(words, labels) if label == 'person']
    content_tokens = [word for word, label in zip(words, labels) if label == 'content']

    return {
        'receiver': ' '.join(person_tokens) if person_tokens else None,
        'content': ' '.join(content_tokens) if content_tokens else None
    }

# Example usage
query = "Ask the python teacher when is the next class"
result = parse_message(query, model, tokenizer)
print(result)
# Output: {'receiver': 'the python teacher', 'content': 'when is the next class'}

More Examples

# Example 1: Simple message
query = "Send a message to Mom telling her I'll be home late"
result = parse_message(query, model, tokenizer)
print(result)
# {'receiver': 'Mom', 'content': "telling her I'll be home late"}

# Example 2: Professional context
query = "Write to the professor asking about the exam format"
result = parse_message(query, model, tokenizer)
print(result)
# {'receiver': 'the professor', 'content': 'asking about the exam format'}

# Example 3: Casual context
query = "Text John asking if he's available for a meeting tomorrow"
result = parse_message(query, model, tokenizer)
print(result)
# {'receiver': 'John', 'content': "asking if he's available for a meeting tomorrow"}

Advanced Usage: Batch Processing

def parse_messages_batch(queries, model, tokenizer):
    """Parse multiple queries efficiently"""
    results = []
    for query in queries:
        result = parse_message(query, model, tokenizer)
        results.append(result)
    return results

# Batch example
queries = [
    "Ask the python teacher when is the next class",
    "Message the customer support about my order status",
    "Text my friend to see if they're coming tonight"
]

results = parse_messages_batch(queries, model, tokenizer)
for query, result in zip(queries, results):
    print(f"Query: {query}")
    print(f"Result: {result}\n")

Detailed Token-Level Analysis

def visualize_parsing(query, model, tokenizer):
    """Show word-by-word label predictions"""
    words = query.split()
    label_ids = predict_at_word_level(words, model, tokenizer)

    id2label = model.config.id2label
    labels = [id2label[label_id] for label_id in label_ids]

    print(f"\nQuery: {query}\n")
    print(f"{'Word':<25} {'Label':<10}")
    print("-" * 35)

    for word, label in zip(words, labels):
        print(f"{word:<25} {label:<10}")

    result = parse_message(query, model, tokenizer)
    print(f"\n{'='*35}")
    print(f"Receiver: {result['receiver']}")
    print(f"Content:  {result['content']}")
    print(f"{'='*35}")

# Example
visualize_parsing("Ask the python teacher when is the next class", model, tokenizer)

Output:

Query: Ask the python teacher when is the next class

Word                      Label
-----------------------------------
Ask                       O
the                       person
python                    person
teacher                   person
when                      content
is                        content
the                       content
next                      content
class                     content

===================================
Receiver: the python teacher
Content:  when is the next class
===================================

API Integration Example

from flask import Flask, request, jsonify
from transformers import AutoTokenizer, AutoModelForTokenClassification

app = Flask(__name__)

# Load model and tokenizer once at startup
# (parse_message is the function defined in Basic Usage above)
model = AutoModelForTokenClassification.from_pretrained("AbdellatifZ/distilbert-message-parser")
tokenizer = AutoTokenizer.from_pretrained("AbdellatifZ/distilbert-message-parser")

@app.route('/parse', methods=['POST'])
def parse():
    data = request.get_json(silent=True) or {}
    query = data.get('query', '')

    if not query:
        return jsonify({'error': 'No query provided'}), 400

    try:
        result = parse_message(query, model, tokenizer)
        return jsonify({
            'success': True,
            'query': query,
            'parsed': result
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(debug=True)
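
A quick client-side check, assuming the server above is running locally on Flask's default port:

import requests

resp = requests.post("http://127.0.0.1:5000/parse",
                     json={"query": "Text John about tomorrow's meeting"})
print(resp.json())
# e.g. {'success': True, 'query': '...', 'parsed': {'receiver': 'John', 'content': "about tomorrow's meeting"}}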

Model Details

  • Base Model: distilbert-base-uncased
  • Task: Token Classification (NER-style)
  • Number of Labels: 3 (O, content, person)
  • Training Framework: Transformers (Hugging Face)
  • Parameters: ~66M (DistilBERT)
  • Max Sequence Length: 128 tokens

Training Details

Dataset

  • Source: Custom Presto-based dataset
  • Task: Send_message queries
  • Labels: person, content, O
  • Split: 70% train, 15% validation, 15% test
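
An illustrative (hypothetical) training example in the word/label format this scheme implies:

words  = ["Send", "a", "message", "to", "Mom", "telling", "her", "I'll", "be", "home", "late"]
labels = ["O", "O", "O", "O", "person", "content", "content", "content", "content", "content", "content"]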

Training Configuration

  • Epochs: 15
  • Batch Size: 16
  • Learning Rate: 2e-5
  • Optimizer: AdamW
  • Weight Decay: 0.01
  • Warmup Steps: 100
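
A minimal sketch of how these hyperparameters map onto Hugging Face TrainingArguments (the dataset variables and output directory are placeholders; AdamW is the Trainer default optimizer):

from transformers import Trainer, TrainingArguments, DataCollatorForTokenClassification

args = TrainingArguments(
    output_dir="distilbert-message-parser",  # placeholder
    num_train_epochs=15,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: tokenized, label-aligned train split
    eval_dataset=val_dataset,     # placeholder: tokenized, label-aligned validation split
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()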

Label Alignment

The model uses label alignment to handle subword tokenization (a sketch follows the list):

  • Only the first subtoken of each word receives a label
  • Subsequent subtokens are marked with -100 (ignored in loss computation)
  • Special tokens ([CLS], [SEP], [PAD]) are also ignored
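
A minimal sketch of the alignment described above, following the standard Hugging Face token-classification recipe (the function name and label2id argument are our own):

def tokenize_and_align(words, word_labels, tokenizer, label2id):
    """Label only the first subtoken of each word; mask the rest with -100."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned, prev = [], None
    for word_idx in enc.word_ids():
        if word_idx is None:        # special tokens: [CLS], [SEP], [PAD]
            aligned.append(-100)
        elif word_idx != prev:      # first subtoken of a word
            aligned.append(label2id[word_labels[word_idx]])
        else:                       # subsequent subtokens of the same word
            aligned.append(-100)
        prev = word_idx
    enc["labels"] = aligned
    return enc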

Performance

  • Accuracy: >0.90
  • Precision: >0.88
  • Recall: >0.88
  • F1-Score: >0.88

Note: Actual metrics may vary depending on your specific use case and dataset.
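
A sketch of how such token-level metrics can be reproduced with scikit-learn, assuming a test set in the words/labels format shown under Dataset (test_examples is a placeholder iterable of (words, labels) pairs):

from sklearn.metrics import classification_report

y_true, y_pred = [], []
for words, gold_labels in test_examples:  # placeholder
    pred_ids = predict_at_word_level(words, model, tokenizer)
    y_pred.extend(model.config.id2label[i] for i in pred_ids)
    y_true.extend(gold_labels)

print(classification_report(y_true, y_pred, digits=3))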

Limitations

  • Language: Optimized for English queries only
  • Domain: Best performance on message-sending commands
  • Structure: May struggle with highly unusual or complex sentence structures
  • Context: Limited to single-turn queries (no conversation context)

Error Handling

def safe_parse_message(query, model, tokenizer):
    """Parse with error handling"""
    try:
        if not query or not query.strip():
            return {'error': 'Empty query', 'receiver': None, 'content': None}

        result = parse_message(query, model, tokenizer)

        # Validate results
        if not result['receiver'] and not result['content']:
            return {'warning': 'No entities found', **result}

        return result

    except Exception as e:
        return {'error': str(e), 'receiver': None, 'content': None}

# Example
result = safe_parse_message("", model, tokenizer)
print(result)  # {'error': 'Empty query', 'receiver': None, 'content': None}

Citation

If you use this model in your research, please cite:

@misc{distilbert-message-parser,
  author = {Your Name},
  title = {DistilBERT Message Parser: Token Classification for Message Intent Extraction},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/AbdellatifZ/distilbert-message-parser}}
}

License

This model is released under the Apache 2.0 License.

Contact & Feedback

For questions, issues, or feedback:

  • Open an issue on the model repository
  • Contact: [Your contact information]

Acknowledgments

  • Base model: DistilBERT by Hugging Face
  • Framework: Transformers by Hugging Face
  • Dataset inspiration: Presto benchmark

Built with Transformers πŸ€—
