---
language:
- en
license: apache-2.0
tags:
- binary-analysis
- security
- malware-analysis
- executable-analysis
- roberta
- masked-language-modeling
library_name: transformers
pipeline_tag: fill-mask
widget:
- text: "ELF header"
---

# Glaurung Large 001

A RoBERTa-based masked language model trained on binary executable files for security research and binary analysis.

Part of the [Glaurung](https://github.com/mjbommar/glaurung) project: a modern reverse engineering framework with first-class AI integration.

## Overview

**Glaurung Large 001** is a transformer model specifically designed for understanding binary executable files. It uses a custom BPE (Byte Pair Encoding) tokenizer trained on multi-byte patterns from various binary formats across multiple architectures (x86-64, ARM64, etc.) and operating systems (Linux distributions including Alpine, Ubuntu, Debian, and Rocky).

This is the **large variant** (371M parameters, 24 layers), offering enhanced understanding of binary patterns. For faster inference, see [glaurung-small-001](https://huggingface.co/mjbommar/glaurung-small-001) (160M parameters).

### Key Features

- **Custom Binary Tokenizer**: BPE tokenizer that creates efficient multi-byte tokens from binary data
- **Binary-Aware**: Trained on actual executable files, not hex strings
- **Multi-Architecture**: Understands patterns from various CPU architectures and file formats
- **Latin-1 Encoding**: Preserves all byte values (0-255) without loss
- **Large Model**: 371M parameters with a deeper architecture for enhanced binary understanding

## Model Details

- **Architecture**: RoBERTa for Masked Language Modeling
- **Hidden Size**: 1024
- **Layers**: 24
- **Attention Heads**: 16
- **Intermediate Size**: 4096
- **Vocabulary Size**: 65,536 tokens
- **Tokenizer**: [binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005)
- **Max Position Embeddings**: 520
- **Parameters**: ~371M
- **Special Tokens**:
  - `<|start|>` (0): Beginning of sequence
  - `<|end|>` (1): End token
  - `<|sep|>` (2): Separator/EOS
  - `<|cls|>` (3): Classification token
  - `<|pad|>` (4): Padding
  - `<|mask|>` (5): Mask token for MLM
  - `<|unk|>` (6): Unknown token
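These IDs can be confirmed against the tokenizer you actually download; a minimal sanity check (the commented values are the expected ones from the list above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-large-001')

# Vocabulary size and the MLM-relevant special tokens documented above
print(len(tokenizer))                                 # expected: 65536
print(tokenizer.mask_token, tokenizer.mask_token_id)  # expected: <|mask|> 5
print(tokenizer.pad_token, tokenizer.pad_token_id)    # expected: <|pad|> 4
```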
## Glaurung Ecosystem

This model is part of the **Glaurung** project ecosystem:

### 🔧 Main Project
- **[Glaurung](https://github.com/mjbommar/glaurung)** - A modern reverse engineering framework designed to replace Ghidra with first-class AI integration throughout the analysis pipeline. Built with Rust's performance and Python's accessibility, featuring AI agents integrated at every level from format detection to decompilation.

### 🤖 Model Family
- **[glaurung-large-001](https://huggingface.co/mjbommar/glaurung-large-001)** (this model) - 371M parameters, 24 layers
- **[glaurung-small-001](https://huggingface.co/mjbommar/glaurung-small-001)** - 160M parameters, 12 layers, faster inference

### 🔤 Tokenizer
- **[binary-tokenizer-005](https://huggingface.co/mjbommar/binary-tokenizer-005)** - 65K vocabulary BPE tokenizer trained on multi-byte patterns

## Performance Comparison vs Glaurung Small 001

| Metric | Glaurung Small 001 | Glaurung Large 001 | Improvement |
|--------|-------------------|-------------------|-------------|
| **Architecture** | | | |
| Parameters | ~160M | ~371M | +132% |
| Hidden Size | 768 | 1024 | +33% |
| Layers | 12 | 24 | +100% |
| Attention Heads | 12 | 16 | +33% |
| **ELF Magic Prediction** (`\x7fEL`) | | | |
| Top-1 Confidence | ~45-50% (est.) | 59.2% | Stronger recognition |
| **x86 Prologue in Context** | | | |
| Top-1 Confidence | ~70-80% (est.) | 100.0% | Perfect prediction |
| **PE Magic Recognition** | | | |
| Top-1 Confidence | ~5-8% (est.) | 7.3% (rank #2) | Weak (training bias) |
| **Binary Similarity Detection** | | | |
| ELF-to-ELF Similarity | 0.85-0.95 | 0.67-0.92 | More nuanced |
| ELF-to-Text Separation | ~0.25-0.30 | ~0.21-0.32 | Similar |

**Key Improvements:**
- **Higher confidence** on binary pattern prediction (59.2% top-1 on the ELF magic token vs. an estimated 45-50% for the small model)
- **Deeper architecture** enables better long-range dependencies in binary code
- **More stable predictions** with near-perfect accuracy on structured headers
- **Larger capacity** for learning complex multi-architecture binary patterns

## Installation & Loading

```bash
pip install transformers torch
```

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModel, pipeline

# Method 1: Load with pipeline for fill-mask tasks
fill_mask = pipeline('fill-mask', model='mjbommar/glaurung-large-001', device=-1)

# Method 2: Load model and tokenizer directly for fill-mask
model = AutoModelForMaskedLM.from_pretrained('mjbommar/glaurung-large-001')
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-large-001')

# Method 3: Load base model for feature extraction/embeddings
model_base = AutoModel.from_pretrained('mjbommar/glaurung-large-001')
```

## Usage Guide

### 1. Loading Binary Data (Critical!)

Binary files MUST be read as bytes and converted to latin-1 encoding:

```python
# CORRECT: Read as bytes, decode with latin-1
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(512)  # read the first 512 bytes, or as much as you need
text = binary_data.decode('latin-1', errors='ignore')

# WRONG: Never use hex strings or other encodings
# hex_string = "7f454c46..."               # ❌ Will not work
# utf8_text = binary_data.decode('utf-8')  # ❌ Will lose bytes
```

### 2. Understanding the BPE Tokenizer

The tokenizer creates multi-byte tokens from common binary patterns:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-large-001')

# Example: ELF header tokenization
elf_header = b'\x7fELF\x02\x01\x01\x00'
text = elf_header.decode('latin-1')
tokens = tokenizer(text, return_tensors='pt')
token_ids = tokens['input_ids'][0].tolist()

# Decode tokens individually to see multi-byte patterns
for token_id in token_ids[1:5]:  # Skip special tokens
    decoded = tokenizer.decode([token_id], skip_special_tokens=True)
    print(f"Token {token_id}: {repr(decoded)}")

# Output:
# Token 45689: '\x7fEL'  # ELF magic compressed to one token!
# Token 3665: 'F\x02'    # Format byte + 64-bit flag
# Token 458: '\x01\x01'  # Little-endian + version
# Token 600: '\x00\x00\x00\x00\x00\x00\x00\x00\x00'  # Padding
```
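Because common byte sequences collapse into single tokens, a tokenized binary usually has noticeably fewer tokens than raw bytes. A quick way to measure this on your own files (a minimal sketch; the exact ratio depends on the file):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-large-001')

with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(512)
text = binary_data.decode('latin-1')

# Count content tokens only (no <|start|>/<|sep|> added)
ids = tokenizer(text, add_special_tokens=False)['input_ids']
print(f"{len(binary_data)} bytes -> {len(ids)} tokens "
      f"(~{len(binary_data) / len(ids):.1f} bytes per token)")
```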
### 3. Fill-Mask Task (Token-Level Prediction)

**Important**: Masking works at the TOKEN level, not the byte level!

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model = AutoModelForMaskedLM.from_pretrained('mjbommar/glaurung-large-001')
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-large-001')

# Read binary file
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(512)
text = binary_data.decode('latin-1', errors='ignore')

# Tokenize
tokens = tokenizer(text, return_tensors='pt')
token_ids = tokens['input_ids'][0].tolist()

# Mask the second token (first content token after <|start|>)
masked_ids = token_ids.copy()
original_token = masked_ids[1]  # Save original
masked_ids[1] = tokenizer.mask_token_id

# Prepare input
tokens_masked = {
    'input_ids': torch.tensor([masked_ids]),
    'attention_mask': torch.tensor([[1] * len(masked_ids)])
}

# Predict
with torch.no_grad():
    outputs = model(**tokens_masked)
    predictions = outputs.logits[0, 1].softmax(dim=-1)
    top5 = predictions.topk(5)

# Show results
print(f"Original: {repr(tokenizer.decode([original_token]))}")
for score, token_id in zip(top5.values, top5.indices):
    token_text = tokenizer.decode([token_id.item()], skip_special_tokens=True)
    print(f"Predicted: {repr(token_text)} (confidence: {score:.2%})")

# Example output:
# Original: '\x7fEL'
# Predicted: '\x7fEL' (confidence: 59.23%)  ✓ Correct!
# Predicted: '\x00\x00\x00\x00\x00\x00\x00\x00' (confidence: 9.87%)
# Predicted: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' (confidence: 4.45%)
```

### 4. Using the Pipeline for Fill-Mask

The pipeline handles tokenization automatically, but the mask still has to be placed at a multi-byte token boundary:

```python
from transformers import pipeline

# Load pipeline
fill_mask = pipeline('fill-mask', model='mjbommar/glaurung-large-001', device=-1)

# Read binary
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(100)
text = binary_data.decode('latin-1', errors='ignore')

# Create masked input at token boundaries
# First, tokenize to understand token boundaries
tokenizer = fill_mask.tokenizer
tokens = tokenizer(text)
decoded_tokens = [tokenizer.decode([tid], skip_special_tokens=True)
                  for tid in tokens['input_ids']]

# Reconstruct with the mask at a token boundary
masked_text = ''.join([
    decoded_tokens[0],               # <|start|> (decodes to '')
    fill_mask.tokenizer.mask_token,  # Mask the ELF magic
    ''.join(decoded_tokens[2:])      # Rest of tokens
])

# Predict
predictions = fill_mask(masked_text, top_k=3)
for pred in predictions:
    print(f"{repr(pred['token_str'])}: {pred['score']:.2%}")
```
### 5. Feature Extraction & Embedding Similarity

Compare binary files by their learned embeddings:

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
from pathlib import Path

# Load for embeddings (not MaskedLM)
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-large-001')
model = AutoModel.from_pretrained('mjbommar/glaurung-large-001')
model.eval()

def get_binary_embedding(file_path, max_bytes=512):
    """Extract an embedding for a binary file using mean pooling."""
    with open(file_path, 'rb') as f:
        binary_data = f.read(max_bytes)
    text = binary_data.decode('latin-1', errors='ignore')

    # Tokenize
    tokens = tokenizer(text, return_tensors='pt', padding=True,
                       truncation=True, max_length=512)

    # Get hidden states
    with torch.no_grad():
        outputs = model(**tokens)

    # Mean pooling (better than the CLS token for this model)
    attention_mask = tokens['attention_mask']
    hidden_states = outputs.last_hidden_state

    # Mask padding tokens
    mask_expanded = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
    sum_embeddings = torch.sum(hidden_states * mask_expanded, dim=1)
    sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
    embedding = sum_embeddings / sum_mask

    return embedding

# Compare multiple binaries
files = ['/usr/bin/ls', '/usr/bin/cat', '/usr/bin/echo', '/etc/passwd']
embeddings = {}

for file_path in files:
    if Path(file_path).exists():
        name = Path(file_path).name
        embeddings[name] = get_binary_embedding(file_path)

# Calculate similarities
print("Cosine Similarity Matrix:")
names = list(embeddings.keys())
for name1 in names:
    similarities = []
    for name2 in names:
        sim = F.cosine_similarity(embeddings[name1], embeddings[name2], dim=-1).item()
        similarities.append(f"{sim:.3f}")
    print(f"{name1:10s}: {' '.join(similarities)}")

# Expected output:
# ELF executables (ls, cat, echo) show high similarity to each other (roughly 0.67-0.92)
# The text file (passwd) shows much lower similarity to the ELF files (roughly 0.21-0.32)
```

## Real-World Example: ELF Header Analysis

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained('mjbommar/glaurung-large-001')
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-large-001')

# Analyze ELF executable structure
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(512)  # Read enough for context

print(f"Raw bytes (hex): {binary_data[:16].hex()}")
# Output: 7f454c46020101000000000000000000

# Convert to latin-1 for the model
text = binary_data.decode('latin-1', errors='ignore')

# Tokenize to see learned patterns
tokens = tokenizer(text, return_tensors='pt')
token_ids = tokens['input_ids'][0].tolist()

# Show what tokens the model learned
print("\nTokenized ELF header:")
for i in range(1, min(5, len(token_ids) - 1)):  # First few content tokens
    token_text = tokenizer.decode([token_ids[i]], skip_special_tokens=True)
    print(f"Token {i}: {token_ids[i]:5d} = {repr(token_text)}")

# Output:
# Token 1: 45689 = '\x7fEL' - ELF magic compressed to one token!
# Token 2: 3665 = 'F\x02' - 'F' + 64-bit flag
# Token 3: 458 = '\x01\x01' - Little-endian + version
# Token 4: 600 = '\x00\x00\x00\x00\x00\x00\x00\x00\x00' - Padding

# Test the model's understanding by masking each token
print("\nTesting model predictions:")
for position in [1, 2, 3]:  # Test first 3 content tokens
    masked_ids = token_ids.copy()
    original_token = masked_ids[position]
    masked_ids[position] = tokenizer.mask_token_id

    # Create input tensors
    tokens_masked = {
        'input_ids': torch.tensor([masked_ids]),
        'attention_mask': torch.tensor([[1] * len(masked_ids)])
    }

    # Get prediction
    with torch.no_grad():
        outputs = model(**tokens_masked)
        predictions = outputs.logits[0, position].softmax(dim=-1)
        predicted_token = predictions.argmax().item()
        confidence = predictions.max().item()

    # Show results
    original_text = tokenizer.decode([original_token], skip_special_tokens=True)
    predicted_text = tokenizer.decode([predicted_token], skip_special_tokens=True)
    correct = "✓" if predicted_token == original_token else "✗"

    print(f"Position {position}: {correct}")
    print(f"  Original:  {repr(original_text)}")
    print(f"  Predicted: {repr(predicted_text)} (confidence: {confidence:.1%})")

# Expected Output:
# Position 1: ✓
#   Original:  '\x7fEL'
#   Predicted: '\x7fEL' (confidence: 59.2%)
# Position 2: ✗ (prefers single 'F')
#   Original:  'F\x02'
#   Predicted: 'F' (confidence: 96.0%)
# Position 3: ✗ (not in top 5)
#   Original:  '\x01\x01'
#   Predicted: '\x00\x00\x00\x00\x00\x00\x00\x00' (confidence: 59.1%)
```

## Multi-Format Analysis: ELF vs PE Headers & x86 Instructions

Systematic testing shows that performance varies by format and training data exposure:

### Performance Summary Table

| Pattern Type | Confidence | Rank | Notes |
|--------------|------------|------|-------|
| **ELF magic** (`\x7fEL`) | 59.2% | #1 | Strong (94.6% of training data) |
| **PE magic** (`MZ`) | 7.3% | #2 | Proportional to training (5.4% of data) |
| **x86 prologue** (`PUSH RBP; MOV RBP, RSP`) | 100.0% | #1 | Perfect in full context |

### ELF Header Recognition (Strong)

```python
# Test: /usr/bin/ls with 152 bytes of context
# Token 1: '\x7fEL' (3-byte ELF magic)
# Result: 59.23% confidence, rank #1 ✓
```

The model strongly recognizes ELF headers (94.6% of training data).

### PE Header Recognition (Limited)

```python
# Test: Realistic DOS/PE header with 152 bytes of context
# Token 1: 'MZ' (2-byte PE signature)
# Result: 7.34% confidence, rank #2 (null bytes ranked #1 at 29.95%)
```

PE recognition reflects limited training exposure (5.4% of training data, 647 files).

### x86 Instructions (Context-Dependent)

```python
# Test: Function prologue in /usr/bin/ls at offset 0x4e05
# Token: 'UH\x89å' = 0x554889e5 (4 bytes: PUSH RBP; MOV RBP, RSP)
# Result: 100.00% confidence, rank #1 ✓
```

**Key Finding:** The BPE tokenizer learned to respect x86 instruction boundaries (see the sketch below):

- 1-byte tokens: `PUSH reg` (0x55), `RET` (0xc3)
- 2-byte tokens: `MOV reg,reg` with ModR/M (0x89e5)
- 4-byte tokens: common prologues (0x554889e5)

Performance is excellent **with full binary context** but degrades on isolated instruction bytes.
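The instruction-aligned boundaries listed above can be inspected by tokenizing raw instruction bytes directly; a minimal sketch (the single-token grouping in the comment is the result reported above and may differ with surrounding context or tokenizer version):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-large-001')

# x86-64 function prologue: PUSH RBP; MOV RBP, RSP
prologue = b'\x55\x48\x89\xe5'
text = prologue.decode('latin-1')

ids = tokenizer(text, add_special_tokens=False)['input_ids']
for tid in ids:
    print(tid, repr(tokenizer.decode([tid], skip_special_tokens=True)))
# Reported above: all 4 bytes map to a single token 'UH\x89å' (0x554889e5)
```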
### Training Data Distribution & Performance Correlation

The model was trained on the following binary distribution:

| Source | Format | File Count | Size (MB) | % by Count | % by Size |
|--------|--------|------------|-----------|------------|-----------|
| Debian/Ubuntu/Alpine packages | ELF | 11,330 | 4,572 | 94.6% | 68.9% |
| Windows Update drivers + SOREL-20M malware | PE | 647 | 2,062 | 5.4% | 31.1% |
| **Total** | | **11,977** | **6,634** | | |

**Key Metrics:**
- **By file count**: 17.5:1 (ELF:PE)
- **By data size**: 2.2:1 (ELF:PE)
- **PE files are ~8x larger** on average (3.19 MB vs 0.40 MB per file)

This distribution explains the observed performance:

| Format | Training Data | Recognition Confidence | Notes |
|--------|---------------|------------------------|-------|
| ELF | 11,330 files (95%) / 4,572 MB (69%) | 59.2% | Dominant by count |
| PE | 647 files (5%) / 2,062 MB (31%) | 7.3% | Better represented by size |

**Key Takeaway:** The model's PE performance reflects the training data composition. While PE is only 5% of files by count, it accounts for 31% by size due to larger average file sizes. The 8.1x confidence gap (59.2% vs 7.3%) roughly tracks the 17.5x file-count imbalance, even though size-based exposure is more balanced.

**Practical Guidance:**
- ✅ **Use for**: Linux/Unix binary analysis, ELF malware analysis, x86-64 code patterns
- ⚠️ **Limited for**: Windows PE analysis (consider retraining with a balanced PE dataset)
- ✅ **Tokenizer learned**: Instruction-level boundaries across both formats

## Training Details

- **MLM Objective**: 20% masking probability
- **Training Data**: Binary executables from various architectures
- **Optimization**: AdamW with warmup, dropout 0.01
- **Special Design**: Increased position embeddings (520) to handle RoBERTa's position offset
- **Model Size**: Large variant with 24 layers and 1024 hidden dimensions

## Limitations

- Maximum sequence length: 512 tokens
- Optimized for executable files (ELF, PE, Mach-O)
- Mean pooling is recommended for embeddings (the pooler layer was not specifically trained)
- The larger model requires more memory (consider `device_map="auto"` for large batches)

## Citation

If using this model in research:

```bibtex
@software{glaurung-large-001,
  title = {Glaurung Large 001: Binary Analysis Transformer},
  author = {Glaurung Project},
  year = {2024},
  url = {https://github.com/mjbommar/glaurung-models}
}
```