---
language:
- en
license: apache-2.0
tags:
- binary-analysis
- security
- malware-analysis
- executable-analysis
- roberta
- masked-language-modeling
library_name: transformers
pipeline_tag: fill-mask
widget:
- text: "ELF <|mask|> header"
---

# Glaurung Large 001

A RoBERTa-based masked language model trained on binary executable files for security research and binary analysis.

## Overview

**Glaurung Large 001** is a transformer model designed specifically for understanding binary executable files. It uses a custom BPE (Byte Pair Encoding) tokenizer trained on multi-byte patterns from various binary formats across multiple CPU architectures (x86-64, ARM64, etc.) and Linux distributions (Alpine, Ubuntu, Debian, Rocky).

### Key Features
- **Custom Binary Tokenizer**: BPE tokenizer that creates efficient multi-byte tokens from binary data
- **Binary-Aware**: Trained on actual executable files, not hex strings
- **Multi-Architecture**: Understands patterns from various CPU architectures and file formats
- **Latin-1 Encoding**: Preserves all byte values (0-255) without loss (see the round-trip check after this list)
- **Large Model**: 371M parameters with deeper architecture for enhanced binary understanding
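
The Latin-1 point is easy to verify: every byte value 0-255 maps to exactly one code point and back, so the decode/encode round trip is lossless. A minimal check:

```python
# Latin-1 is a bijection between bytes 0-255 and code points U+0000-U+00FF,
# so decoding binary data with it never drops or alters bytes.
data = bytes(range(256))
assert data.decode('latin-1').encode('latin-1') == data
```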

## Model Details

- **Architecture**: RoBERTa for Masked Language Modeling
- **Hidden Size**: 1024
- **Layers**: 24
- **Attention Heads**: 16
- **Intermediate Size**: 4096
- **Vocabulary Size**: 65,536 tokens
- **Max Position Embeddings**: 520
- **Parameters**: ~371M
- **Special Tokens**:
  - `<|start|>` (0): Beginning of sequence
  - `<|end|>` (1): End token
  - `<|sep|>` (2): Separator/EOS
  - `<|cls|>` (3): Classification token
  - `<|pad|>` (4): Padding
  - `<|mask|>` (5): Mask token for MLM
  - `<|unk|>` (6): Unknown token
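
A quick sanity check against this table (assuming the tokenizer loads as shown in the Installation section; the expected values in the comments come from the table above):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-large-001')
print(tokenizer.mask_token, tokenizer.mask_token_id)  # expect: <|mask|> 5
print(tokenizer.pad_token, tokenizer.pad_token_id)    # expect: <|pad|> 4
print(len(tokenizer))                                 # expect: 65536
```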

## Performance Comparison vs Glaurung Small 001

| Metric | Glaurung Small 001 | Glaurung Large 001 | Improvement |
|--------|-------------------|-------------------|-------------|
| **Architecture** | | | |
| Parameters | ~160M | ~371M | +132% |
| Hidden Size | 768 | 1024 | +33% |
| Layers | 12 | 24 | +100% |
| Attention Heads | 12 | 16 | +33% |
| **ELF Magic Prediction** (`\x7fEL`) | | | |
| Top-1 Confidence | ~45-50% (est.) | 59.2% | Stronger recognition |
| **x86 Prologue in Context** | | | |
| Top-1 Confidence | ~70-80% (est.) | 100.0% | Perfect prediction |
| **PE Magic Recognition** | | | |
| Top-1 Confidence | ~5-8% (est.) | 7.3% (rank #2) | Weak (training bias) |
| **Binary Similarity Detection** | | | |
| ELF-to-ELF Similarity | 0.85-0.95 | 0.67-0.92 | More nuanced |
| ELF-to-Text Separation | ~0.25-0.30 | ~0.21-0.32 | Similar |

**Key Improvements:**
- **Dramatically improved confidence** on binary pattern prediction (+21pp on ELF magic)
- **Deeper architecture** enables better long-range dependencies in binary code
- **More stable predictions** with near-perfect accuracy on structured headers
- **Larger capacity** for learning complex multi-architecture binary patterns

## Installation & Loading

```bash
pip install transformers torch
```

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModel, pipeline

# Method 1: Load with pipeline for fill-mask tasks
fill_mask = pipeline('fill-mask', model='mjbommar/glaurung-large-001', device=-1)

# Method 2: Load model and tokenizer directly for fill-mask
model = AutoModelForMaskedLM.from_pretrained('mjbommar/glaurung-large-001')
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-large-001')

# Method 3: Load base model for feature extraction/embeddings
model_base = AutoModel.from_pretrained('mjbommar/glaurung-large-001')
```

## Usage Guide

### 1. Loading Binary Data (Critical!)

Binary files MUST be read as bytes and decoded with latin-1:

```python
# CORRECT: Read as bytes, decode with latin-1
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(512)  # Read the first 512 bytes, or as much as needed
    text = binary_data.decode('latin-1', errors='ignore')

# WRONG: Never use hex strings or other encodings
# hex_string = "7f454c46..."  # ❌ Will not work
# utf8_text = binary_data.decode('utf-8')  # ❌ Raises on invalid bytes (or drops them with errors='ignore')
```

### 2. Understanding the BPE Tokenizer

The tokenizer creates multi-byte tokens from common binary patterns:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-large-001')

# Example: ELF header tokenization
elf_header = b'\x7fELF\x02\x01\x01\x00'
text = elf_header.decode('latin-1')

tokens = tokenizer(text, return_tensors='pt')
token_ids = tokens['input_ids'][0].tolist()

# Decode tokens individually to see multi-byte patterns
for token_id in token_ids[1:5]:  # Skip <|start|>; show the first four content tokens
    decoded = tokenizer.decode([token_id], skip_special_tokens=True)
    print(f"Token {token_id}: {repr(decoded)}")

# Output:
# Token 45689: '\x7fEL'    # ELF magic compressed to one token!
# Token 3665:  'F\x02'     # Format byte + 64-bit flag
# Token 458:   '\x01\x01'  # Little-endian + version
# Token 600:   '\x00\x00\x00\x00\x00\x00\x00\x00\x00'  # Padding
```

### 3. Fill-Mask Task (Token-Level Prediction)

**Important**: Masking works at the TOKEN level, not byte level!

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model = AutoModelForMaskedLM.from_pretrained('mjbommar/glaurung-large-001')
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-large-001')

# Read binary file
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(512)
    text = binary_data.decode('latin-1', errors='ignore')

# Tokenize
tokens = tokenizer(text, return_tensors='pt')
token_ids = tokens['input_ids'][0].tolist()

# Mask the second token (first content token after <|start|>)
masked_ids = token_ids.copy()
original_token = masked_ids[1]  # Save original
masked_ids[1] = tokenizer.mask_token_id

# Prepare input
tokens_masked = {
    'input_ids': torch.tensor([masked_ids]),
    'attention_mask': torch.tensor([[1]*len(masked_ids)])
}

# Predict
with torch.no_grad():
    outputs = model(**tokens_masked)
    predictions = outputs.logits[0, 1].softmax(dim=-1)
    top5 = predictions.topk(5)

# Show results
print(f"Original: {repr(tokenizer.decode([original_token]))}")
for score, token_id in zip(top5.values, top5.indices):
    token_text = tokenizer.decode([token_id.item()], skip_special_tokens=True)
    print(f"Predicted: {repr(token_text)} (confidence: {score:.2%})")

# Example output:
# Original: '\x7fEL'
# Predicted: '\x7fEL' (confidence: 59.23%)  ✓ Correct!
# Predicted: '\x00\x00\x00\x00\x00\x00\x00\x00' (confidence: 9.87%)
# Predicted: '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00' (confidence: 4.45%)
```

### 4. Using Pipeline for Fill-Mask

The pipeline handles tokenization automatically but requires understanding multi-byte tokens:

```python
from transformers import pipeline

# Load pipeline
fill_mask = pipeline('fill-mask', model='mjbommar/glaurung-large-001', device=-1)

# Read binary
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(100)
    text = binary_data.decode('latin-1', errors='ignore')

# Create masked input at token boundaries
# First, tokenize to understand token boundaries
tokenizer = fill_mask.tokenizer
tokens = tokenizer(text)
decoded_tokens = [tokenizer.decode([tid], skip_special_tokens=True) for tid in tokens['input_ids']]

# Reconstruct with mask at token boundary
masked_text = ''.join([
    decoded_tokens[0],               # '' -- special tokens decode to empty strings
    fill_mask.tokenizer.mask_token,  # mask replaces the first content token (the ELF magic)
    ''.join(decoded_tokens[2:])      # remaining content tokens
])

# Predict
predictions = fill_mask(masked_text, top_k=3)
for pred in predictions:
    print(f"{repr(pred['token_str'])}: {pred['score']:.2%}")
```

### 5. Feature Extraction & Embedding Similarity

Compare binary files by their learned embeddings:

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
from pathlib import Path

# Load for embeddings (not MaskedLM)
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-large-001')
model = AutoModel.from_pretrained('mjbommar/glaurung-large-001')
model.eval()

def get_binary_embedding(file_path, max_bytes=512):
    """Extract embedding for a binary file using mean pooling"""
    with open(file_path, 'rb') as f:
        binary_data = f.read(max_bytes)
        text = binary_data.decode('latin-1', errors='ignore')

    # Tokenize
    tokens = tokenizer(text, return_tensors='pt',
                      padding=True, truncation=True, max_length=512)

    # Get embeddings with mean pooling
    with torch.no_grad():
        outputs = model(**tokens)
        # Mean pooling (better than CLS token for this model)
        attention_mask = tokens['attention_mask']
        hidden_states = outputs.last_hidden_state

        # Mask padding tokens
        mask_expanded = attention_mask.unsqueeze(-1).expand(hidden_states.size()).float()
        sum_embeddings = torch.sum(hidden_states * mask_expanded, dim=1)
        sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
        embedding = sum_embeddings / sum_mask

    return embedding

# Compare multiple binaries
files = ['/usr/bin/ls', '/usr/bin/cat', '/usr/bin/echo', '/etc/passwd']
embeddings = {}

for file_path in files:
    if Path(file_path).exists():
        name = Path(file_path).name
        embeddings[name] = get_binary_embedding(file_path)

# Calculate similarities
print("Cosine Similarity Matrix:")
names = list(embeddings.keys())
for name1 in names:
    similarities = []
    for name2 in names:
        sim = F.cosine_similarity(embeddings[name1], embeddings[name2], dim=-1).item()
        similarities.append(f"{sim:.3f}")
    print(f"{name1:10s}: {' '.join(similarities)}")

# Expected output (see the comparison table above):
# ELF executables (ls, cat, echo) show high pairwise similarity (~0.67-0.92)
# The text file (passwd) shows much lower similarity (~0.21-0.32) to the ELF binaries
```

## Real-World Example: ELF Header Analysis

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained('mjbommar/glaurung-large-001')
tokenizer = AutoTokenizer.from_pretrained('mjbommar/glaurung-large-001')

# Analyze ELF executable structure
with open('/usr/bin/ls', 'rb') as f:
    binary_data = f.read(512)  # Read enough for context

print(f"Raw bytes (hex): {binary_data[:16].hex()}")
# Output: 7f454c46020101000000000000000000

# Convert to latin-1 for model
text = binary_data.decode('latin-1', errors='ignore')

# Tokenize to see learned patterns
tokens = tokenizer(text, return_tensors='pt')
token_ids = tokens['input_ids'][0].tolist()

# Show what tokens the model learned
print("\nTokenized ELF header:")
for i in range(1, min(5, len(token_ids)-1)):  # First few content tokens
    token_text = tokenizer.decode([token_ids[i]], skip_special_tokens=True)
    print(f"Token {i}: {token_ids[i]:5d} = {repr(token_text)}")

# Output:
# Token 1: 45689 = '\x7fEL'  - ELF magic compressed to one token!
# Token 2:  3665 = 'F\x02'   - 'F' + 64-bit flag
# Token 3:   458 = '\x01\x01' - Little-endian + version
# Token 4:   600 = '\x00\x00\x00\x00\x00\x00\x00\x00\x00' - Padding

# Test model's understanding by masking each token
print("\nTesting model predictions:")
for position in [1, 2, 3]:  # Test first 3 content tokens
    masked_ids = token_ids.copy()
    original_token = masked_ids[position]
    masked_ids[position] = tokenizer.mask_token_id

    # Create input tensors
    tokens_masked = {
        'input_ids': torch.tensor([masked_ids]),
        'attention_mask': torch.tensor([[1]*len(masked_ids)])
    }

    # Get prediction
    with torch.no_grad():
        outputs = model(**tokens_masked)
        predictions = outputs.logits[0, position].softmax(dim=-1)
        predicted_token = predictions.argmax().item()
        confidence = predictions.max().item()

    # Show results
    original_text = tokenizer.decode([original_token], skip_special_tokens=True)
    predicted_text = tokenizer.decode([predicted_token], skip_special_tokens=True)
    correct = "βœ“" if predicted_token == original_token else "βœ—"

    print(f"Position {position}: {correct}")
    print(f"  Original:  {repr(original_text)}")
    print(f"  Predicted: {repr(predicted_text)} (confidence: {confidence:.1%})")

# Expected Output:
# Position 1: ✓
#   Original:  '\x7fEL'
#   Predicted: '\x7fEL' (confidence: 59.2%)
# Position 2: ✗ (prefers single 'F')
#   Original:  'F\x02'
#   Predicted: 'F' (confidence: 96.0%)
# Position 3: ✗ (correct token not in top 5)
#   Original:  '\x01\x01'
#   Predicted: '\x00\x00\x00\x00\x00\x00\x00\x00' (confidence: 59.1%)
```

## Multi-Format Analysis: ELF vs PE Headers & x86 Instructions

Systematic testing reveals performance varies by format and training data exposure:

### Performance Summary Table

| Pattern Type | Confidence | Rank | Notes |
|--------------|------------|------|-------|
| **ELF magic** (`\x7fEL`) | 59.2% | #1 | Strong (94.6% of training data) |
| **PE magic** (`MZ`) | 7.3% | #2 | Proportional to training (5.4% of data) |
| **x86 prologue** (`PUSH RBP; MOV RBP, RSP`) | 100.0% | #1 | Perfect in full context |
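
The spot checks below can be reproduced with a small helper that masks the first content token of a file and prints the model's top predictions. This is a sketch reusing the `model` and `tokenizer` loaded in the fill-mask section; the byte count and file path are illustrative:

```python
import torch

def score_first_token(path, n_bytes=152, top_k=5):
    """Mask the first content token (index 1, after <|start|>) and show top-k guesses."""
    with open(path, 'rb') as f:
        text = f.read(n_bytes).decode('latin-1')
    ids = tokenizer(text, return_tensors='pt')['input_ids'][0].tolist()
    original = ids[1]
    ids[1] = tokenizer.mask_token_id
    inputs = {
        'input_ids': torch.tensor([ids]),
        'attention_mask': torch.ones(1, len(ids), dtype=torch.long),
    }
    with torch.no_grad():
        probs = model(**inputs).logits[0, 1].softmax(dim=-1)
    top = probs.topk(top_k)
    print(f"original: {tokenizer.decode([original], skip_special_tokens=True)!r}")
    for p, tid in zip(top.values.tolist(), top.indices.tolist()):
        print(f"  {tokenizer.decode([tid], skip_special_tokens=True)!r}: {p:.2%}")

score_first_token('/usr/bin/ls')  # the ELF magic test below
```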

### ELF Header Recognition (Strong)

```python
# Test: /usr/bin/ls with 152 bytes of context
# Token 1: '\x7fEL' (3-byte ELF magic)
# Result: 59.23% confidence, rank #1 ✓
```

The model strongly recognizes ELF headers (94.6% of training data).

### PE Header Recognition (Limited)

```python
# Test: Realistic DOS/PE header with 152 bytes of context
# Token 1: 'MZ' (2-byte PE signature)
# Result: 7.34% confidence, rank #2 (null bytes ranked #1 at 29.95%)
```

PE recognition reflects limited training exposure (5.4% of training data, 647 files).

### x86 Instructions (Context-Dependent)

```python
# Test: Function prologue in /usr/bin/ls at offset 0x4e05
# Token: 'UH\x89\xe5' = 0x554889e5 (4 bytes: PUSH RBP; MOV RBP, RSP)
# Result: 100.00% confidence, rank #1 ✓
```

**Key Finding:** The BPE tokenizer learned to respect x86 instruction boundaries!
- 1-byte tokens: `PUSH reg` (0x55), `RET` (0xc3)
- 2-byte tokens: `MOV reg,reg` with ModR/M (0x89e5)
- 4-byte tokens: Common prologues (0x554889e5)

Performance is excellent **with full binary context** but degrades on isolated instruction bytes.
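
To see these boundaries directly, tokenize a short prologue-plus-return sequence (a sketch using the `tokenizer` loaded earlier; the exact token IDs and merges may differ):

```python
# PUSH RBP; MOV RBP, RSP; RET -> 0x55 0x48 0x89 0xe5 0xc3
code = b'\x55\x48\x89\xe5\xc3'.decode('latin-1')
for tid in tokenizer(code)['input_ids']:
    print(tid, repr(tokenizer.decode([tid], skip_special_tokens=True)))
# If the prologue merged as described above, one token should decode to 'UH\x89\xe5'.
```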

### Training Data Distribution & Performance Correlation

The model was trained on the following binary distribution:

| Source | Format | File Count | Size (MB) | % by Count | % by Size |
|--------|--------|------------|-----------|------------|-----------|
| Debian/Ubuntu/Alpine packages | ELF | 11,330 | 4,572 | 94.6% | 68.9% |
| Windows Update drivers + SOREL-20M malware | PE | 647 | 2,062 | 5.4% | 31.1% |
| **Total** | | **11,977** | **6,634** | | |

**Key Metrics:**
- **By file count**: 17.5:1 (ELF:PE)
- **By data size**: 2.2:1 (ELF:PE)
- **PE files are 8x larger** on average (3.19 MB vs 0.40 MB per file)

This distribution explains the observed performance:

| Format | Training Data | Recognition Confidence | Notes |
|--------|---------------|----------------------|-------|
| ELF | 11,330 files (95%) / 4,572 MB (69%) | 59.2% | Dominant by count |
| PE | 647 files (5%) / 2,062 MB (31%) | 7.3% | Better represented by size |

**Key Takeaway:** The model's PE performance reflects training data composition. While PE is only 5% by file count, it represents 31% by size due to larger average file sizes. The 8.1x performance gap (59.2% vs 7.3%) roughly correlates with the 17.5x file count imbalance, though size-based exposure is more balanced.

**Practical Guidance:**
- ✅ **Use for**: Linux/Unix binary analysis, ELF malware analysis, x86-64 code patterns
- ⚠️ **Limited for**: Windows PE analysis (consider retraining with a balanced PE dataset)
- ✅ **Tokenizer learned**: Instruction-level boundaries across both formats

## Training Details

- **MLM Objective**: 20% masking probability (see the collator sketch after this list)
- **Training Data**: Binary executables from various architectures
- **Optimization**: AdamW with warmup, dropout 0.01
- **Special Design**: Increased position embeddings (520) to handle RoBERTa's position offset
- **Model Size**: Large variant with 24 layers and 1024 hidden dimensions
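
For reference, the standard Hugging Face collator reproduces this masking objective; this is a sketch of the typical setup, not the project's actual training script:

```python
from transformers import DataCollatorForLanguageModeling

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,    # the Glaurung tokenizer loaded earlier
    mlm=True,
    mlm_probability=0.20,   # 20% masking, matching the objective above
)
```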

## Limitations

- Maximum sequence length: 512 tokens
- Optimized for executable files (ELF, PE, Mach-O)
- Mean pooling recommended for embeddings (pooler layer not specifically trained)
- Larger model size requires more memory (~371M parameters, roughly 1.5 GB of weights in fp32); half precision or `device_map="auto"` can help (see the sketch below)
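
One way to cut the memory footprint at load time (a sketch: fp16 assumes CUDA-class hardware, and `device_map="auto"` requires the `accelerate` package):

```python
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(
    'mjbommar/glaurung-large-001',
    torch_dtype=torch.float16,  # halves weight memory vs. fp32
    device_map='auto',          # places layers automatically (needs `pip install accelerate`)
)
```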

## Citation

If using this model in research:
```bibtex
@software{glaurung-large-001,
  title = {Glaurung Large 001: Binary Analysis Transformer},
  author = {Glaurung Project},
  year = {2024},
  url = {https://github.com/mjbommar/glaurung-models}
}
```