	---
	language: en
	license: apache-2.0
	tags:
	- token-classification
	- distilbert
	- ner
	- message-parsing
	- natural-language-understanding
	datasets:
	- custom
	metrics:
	- accuracy
	- f1
	pipeline_tag: token-classification
	---

	# DistilBERT Message Parser 🤖💬

	A fine-tuned DistilBERT model that parses natural-language messaging requests and extracts two fields: the receiver (person) and the message content.

	## Model Description

	This model performs token-level classification to identify:
	- `person`: The recipient/receiver of the message
	- `content`: The message content to be sent
	- `O`: Other tokens (Outside)

	## Use Cases

	Suited to virtual assistants, chatbots, and messaging applications that need to understand commands such as:
	- "Send a message to Mom telling her I'll be home late"
	- "Ask the python teacher when is the next class"
	- "Text John about tomorrow's meeting"

	## Quick Start

	### Installation

	```bash
	pip install transformers torch
	```

	### Basic Usage

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	import torch

	# Load model and tokenizer
	model_name = "AbdellatifZ/distilbert-message-parser"  # Replace with your model name
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForTokenClassification.from_pretrained(model_name)

	# Helper function for word-level predictions
	def predict_at_word_level(words, model, tokenizer):
	    """Predict labels at word level (not subword tokens)"""
	    inputs = tokenizer(words, return_tensors="pt", is_split_into_words=True)

	    with torch.no_grad():
	        logits = model(**inputs).logits
	    predictions = torch.argmax(logits, dim=2)

	    word_labels = []
	    word_ids = inputs.word_ids()
	    previous_word_idx = None

	    for idx, word_idx in enumerate(word_ids):
	        if word_idx is None:  # Special tokens
	            continue
	        if word_idx != previous_word_idx:  # First subtoken of each word
	            word_labels.append(predictions[0][idx].item())
	        previous_word_idx = word_idx

	    return word_labels

	# Main parsing function
	def parse_message(query, model, tokenizer):
	    """
	    Parse a query to extract receiver and content.

	    Args:
	        query (str): User query in natural language
	        model: Token classification model
	        tokenizer: Tokenizer

	    Returns:
	        dict: {"receiver": str or None, "content": str or None}
	    """
	    words = query.split()
	    label_ids = predict_at_word_level(words, model, tokenizer)

	    id2label = model.config.id2label
	    labels = [id2label[label_id] for label_id in label_ids]

	    person_tokens = [word for word, label in zip(words, labels) if label == 'person']
	    content_tokens = [word for word, label in zip(words, labels) if label == 'content']

	    return {
	        'receiver': ' '.join(person_tokens) if person_tokens else None,
	        'content': ' '.join(content_tokens) if content_tokens else None
	    }

	# Example usage
	query = "Ask the python teacher when is the next class"
	result = parse_message(query, model, tokenizer)
	print(result)
	# Output: {'receiver': 'the python teacher', 'content': 'when is the next class'}
	```

	## More Examples

	```python
	# Example 1: Simple message
	query = "Send a message to Mom telling her I'll be home late"
	result = parse_message(query, model, tokenizer)
	print(result)
	# {'receiver': 'Mom', 'content': "telling her I'll be home late"}

	# Example 2: Professional context
	query = "Write to the professor asking about the exam format"
	result = parse_message(query, model, tokenizer)
	print(result)
	# {'receiver': 'the professor', 'content': 'asking about the exam format'}

	# Example 3: Casual context
	query = "Text John asking if he's available for a meeting tomorrow"
	result = parse_message(query, model, tokenizer)
	print(result)
	# {'receiver': 'John', 'content': "asking if he's available for a meeting tomorrow"}
	```

	## Advanced Usage: Batch Processing

	```python
	def parse_messages_batch(queries, model, tokenizer):
	    """Parse multiple queries one at a time (sequential calls, not batched inference)"""
	    results = []
	    for query in queries:
	        result = parse_message(query, model, tokenizer)
	        results.append(result)
	    return results

	# Batch example
	queries = [
	    "Ask the python teacher when is the next class",
	    "Message the customer support about my order status",
	    "Text my friend to see if they're coming tonight"
	]

	results = parse_messages_batch(queries, model, tokenizer)
	for query, result in zip(queries, results):
	    print(f"Query: {query}")
	    print(f"Result: {result}\n")
	```

	## Detailed Token-Level Analysis

	```python
	def visualize_parsing(query, model, tokenizer):
	    """Show word-by-word label predictions"""
	    words = query.split()
	    label_ids = predict_at_word_level(words, model, tokenizer)

	    id2label = model.config.id2label
	    labels = [id2label[label_id] for label_id in label_ids]

	    print(f"\nQuery: {query}\n")
	    print(f"{'Word':<25} {'Label':<10}")
	    print("-" * 35)

	    for word, label in zip(words, labels):
	        print(f"{word:<25} {label:<10}")

	    result = parse_message(query, model, tokenizer)
	    print(f"\n{'='*35}")
	    print(f"Receiver: {result['receiver']}")
	    print(f"Content: {result['content']}")
	    print(f"{'='*35}")

	# Example
	visualize_parsing("Ask the python teacher when is the next class", model, tokenizer)
	```

	Output:
	```
	Query: Ask the python teacher when is the next class

	Word                      Label
	-----------------------------------
	Ask                       O
	the                       person
	python                    person
	teacher                   person
	when                      content
	is                        content
	the                       content
	next                      content
	class                     content

	===================================
	Receiver: the python teacher
	Content: when is the next class
	===================================
	```
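
The word/label pairs shown above can also be grouped into contiguous spans. This matters when the same label appears in more than one place in a query, because `parse_message` joins all words with a given label whether or not they are adjacent. The following is a minimal sketch, not part of the model's API: `group_spans` is a hypothetical helper, and the labels are copied from the output above rather than produced by the model.

```python
from itertools import groupby

def group_spans(words, labels):
    """Group consecutive words that share a non-O label into (label, text) spans."""
    spans = []
    for label, group in groupby(zip(words, labels), key=lambda pair: pair[1]):
        if label == "O":
            continue  # tokens outside any entity are dropped
        spans.append((label, " ".join(word for word, _ in group)))
    return spans

words = "Ask the python teacher when is the next class".split()
labels = ["O", "person", "person", "person",
          "content", "content", "content", "content", "content"]
spans = group_spans(words, labels)
print(spans)
# [('person', 'the python teacher'), ('content', 'when is the next class')]
```

With this grouping, two separated `person` runs come back as two spans instead of one joined string, so the caller can decide how to handle them.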

	## API Integration Example

	```python
	from flask import Flask, request, jsonify
	from transformers import AutoTokenizer, AutoModelForTokenClassification

	# parse_message() is the function from Basic Usage; define or import it in this module.

	app = Flask(__name__)

	# Load model once at startup
	model = AutoModelForTokenClassification.from_pretrained("AbdellatifZ/distilbert-message-parser")
	tokenizer = AutoTokenizer.from_pretrained("AbdellatifZ/distilbert-message-parser")

	@app.route('/parse', methods=['POST'])
	def parse():
	    # silent=True yields None instead of raising on a missing or invalid JSON body
	    data = request.get_json(silent=True) or {}
	    query = data.get('query', '')

	    if not query:
	        return jsonify({'error': 'No query provided'}), 400

	    try:
	        result = parse_message(query, model, tokenizer)
	        return jsonify({
	            'success': True,
	            'query': query,
	            'parsed': result
	        })
	    except Exception as e:
	        return jsonify({'error': str(e)}), 500

	if __name__ == '__main__':
	    app.run(debug=True)  # debug mode is for local development only
	```

	## Model Details

	| Property | Value |
	|----------|-------|
	| Base Model | `distilbert-base-uncased` |
	| Task | Token Classification (NER-style) |
	| Number of Labels | 3 (O, content, person) |
	| Training Framework | Transformers (Hugging Face) |
	| Parameters | ~67M (DistilBERT) |
	| Max Sequence Length | 128 tokens |

	## Training Details

	### Dataset
	- Source: Custom Presto-based dataset
	- Task: Send_message queries
	- Labels: `person`, `content`, `O`
	- Split: 70% train, 15% validation, 15% test
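
The source does not say how the 70/15/15 split was produced. As an illustration only, one common approach is to shuffle with a fixed seed and slice; `split_dataset` and the seed value below are assumptions, not the procedure used for this model.

```python
import random

def split_dataset(examples, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle a copy of `examples` and slice it into train/validation/test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains, about 15%
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```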

	### Training Configuration
	- Epochs: 15
	- Batch Size: 16
	- Learning Rate: 2e-5
	- Optimizer: AdamW
	- Weight Decay: 0.01
	- Warmup Steps: 100
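
For intuition about the warmup setting: with 100 warmup steps, the learning rate ramps linearly from 0 to 2e-5 over the first 100 optimizer steps. What happens afterwards depends on the scheduler, which is not named here. The sketch below assumes linear decay to zero (the default scheduler of the Hugging Face `Trainer`) and a hypothetical total of 1,000 training steps; `linear_warmup_lr` is an illustrative name.

```python
def linear_warmup_lr(step, peak_lr=2e-5, warmup_steps=100, total_steps=1000):
    """Learning rate at `step` for linear warmup followed by linear decay.

    total_steps=1000 is a made-up value for illustration; the real number
    depends on dataset size, batch size, and epoch count.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

# Print the rate at a few points: start, mid-warmup, peak, mid-decay, end.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step))
```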

	### Label Alignment
	During training, word-level labels are aligned to DistilBERT's subword tokens as follows:
	- Only the first subtoken of each word receives a label
	- Subsequent subtokens are marked with `-100` (ignored in loss computation)
	- Special tokens ([CLS], [SEP], [PAD]) are also ignored
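
The alignment rules above can be sketched in plain Python. This is an illustration, not the training code used for this model: `align_labels` is a hypothetical name, `word_ids` is assumed to have the shape returned by a fast tokenizer's `word_ids()` (one entry per token, `None` for special tokens), and the subword split in the example is made up.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def align_labels(word_ids, word_labels):
    """Expand word-level label ids to token level.

    Only the first subtoken of each word keeps its label; later subtokens
    and special tokens (word id None) get IGNORE_INDEX.
    """
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous_word_id:
            aligned.append(word_labels[word_id])  # first subtoken of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subtoken
        previous_word_id = word_id
    return aligned

# Three words; the second is split into two subtokens (hypothetical split).
word_ids = [None, 0, 1, 1, 2, None]   # [CLS] w0 w1a w1b w2 [SEP]
word_labels = [0, 2, 1]               # label ids; the numbering is illustrative
print(align_labels(word_ids, word_labels))
# [-100, 0, 2, -100, 1, -100]
```

The same first-subtoken rule is what `predict_at_word_level` applies at inference time when it reads one prediction per word.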

	## Performance

	| Metric | Value |
	|--------|-------|
	| Accuracy | >0.90 |
	| Precision | >0.88 |
	| Recall | >0.88 |
	| F1-Score | >0.88 |

	Note: Actual metrics may vary depending on your specific use case and dataset.
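
The table does not state how these metrics were computed (word level or span level, or which averaging was used). As one possible reading, the sketch below computes precision, recall, and F1 for a single label at the word level, which is a simple way to check the model on your own data. `per_label_prf` and the example labels are illustrative only.

```python
def per_label_prf(gold, pred, label):
    """Word-level precision, recall, and F1 for one label."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["O", "person", "person", "content", "content"]
pred = ["O", "person", "O", "content", "content"]
print(per_label_prf(gold, pred, "person"))   # precision 1.0, recall 0.5, F1 about 0.667
print(per_label_prf(gold, pred, "content"))  # (1.0, 1.0, 1.0)
```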

	## Limitations

	- Language: Optimized for English queries only
	- Domain: Best performance on message-sending commands
	- Structure: May struggle with highly unusual or complex sentence structures
	- Context: Limited to single-turn queries (no conversation context)

	## Error Handling

	```python
	def safe_parse_message(query, model, tokenizer):
	    """Parse with error handling"""
	    try:
	        if not query or not query.strip():
	            return {'error': 'Empty query', 'receiver': None, 'content': None}

	        result = parse_message(query, model, tokenizer)

	        # Validate results
	        if not result['receiver'] and not result['content']:
	            return {'warning': 'No entities found', **result}

	        return result

	    except Exception as e:
	        return {'error': str(e), 'receiver': None, 'content': None}

	# Example
	result = safe_parse_message("", model, tokenizer)
	print(result)  # {'error': 'Empty query', 'receiver': None, 'content': None}
	```

	## Citation

	If you use this model in your research, please cite:

	```bibtex
	@misc{distilbert-message-parser,
	author = {Your Name},
	title = {DistilBERT Message Parser: Token Classification for Message Intent Extraction},
	year = {2025},
	publisher = {Hugging Face},
	howpublished = {\url{https://huggingface.co/AbdellatifZ/distilbert-message-parser}}
	}
	```

	## License

	This model is released under the Apache 2.0 License.

	## Contact & Feedback

	For questions, issues, or feedback:
	- Open an issue on the model repository
	- Contact: [Your contact information]

	## Acknowledgments

	- Base model: [DistilBERT](https://huggingface.co/distilbert-base-uncased) by Hugging Face
	- Framework: [Transformers](https://github.com/huggingface/transformers) by Hugging Face
	- Dataset inspiration: Presto benchmark

	---

	Built with Transformers 🤗