ChessLM Qwen3 - Neuron Traced (AWS Format Structure)
This is a Neuron-traced version of karanps/ChessLM_Qwen3 optimized for AWS Trainium (trn1) and Inferentia (inf2) instances using vLLM with continuous batching enabled.
This model follows the AWS Neuron repository structure with separate directories for compiled artifacts.
Model Details
- Base Model: Qwen3-8B fine-tuned for chess
- Compilation: optimum-neuron[vllm]==0.3.0
- Compiler Version: neuronxcc 2.21.33363.0
- Target Hardware: AWS Trainium (trn1) / Inferentia (inf2)
- Precision: BF16
- Tensor Parallelism: 2 cores
- Batch Size: 4 (continuous batching enabled)
- Max Sequence Length: 2048
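These settings are recorded in the bundled neuron_config.json. A quick way to confirm them on a local copy (the key names below are illustrative; inspect the file for the exact schema used by your optimum-neuron release):

```python
import json

# Print the recorded compilation settings from the bundled config.
# Key names are illustrative and may differ between optimum-neuron releases.
with open("neuron_config.json") as f:
    cfg = json.load(f)

for key in ("batch_size", "sequence_length", "tp_degree", "auto_cast_type"):
    print(f"{key}: {cfg.get(key)}")
```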
Repository Structure
This repository follows the AWS Neuron format with organized directories:
```
├── context_encoding_model/
│   └── _tp0_bk0/
│       ├── graph.neff
│       ├── model.MODULE_*.neff
│       ├── model.MODULE_*.hlo_module.pb
│       ├── compile_flags.*.json
│       ├── neuron_config.json
│       └── log-neuron-cc.txt
├── token_generation_model/
│   └── _tp0_bk0/
│       ├── graph.neff
│       ├── model.MODULE_*.neff
│       ├── model.MODULE_*.hlo_module.pb
│       ├── wrapped_neff.hlo
│       ├── compile_flags.*.json
│       ├── neuron_config.json
│       └── log-neuron-cc.txt
├── layout_opt/
│   ├── graph.neff
│   ├── log-neuron-cc.txt
│   └── model/
│       └── graph.hlo
├── model.pt (17GB - contains compiled graphs + weights)
├── config.json
├── neuron_config.json
└── tokenizer files
```
Key Files
- context_encoding_model/: Compiled NEFF files for processing initial prompt sequences (up to 2048 tokens)
- token_generation_model/: Compiled NEFF files for autoregressive token generation
- layout_opt/: Layout optimization artifacts from compilation
- model.pt: Main model file containing compiled graphs and embedded weights (17GB)
- neuron_config.json: Neuron compilation configuration
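To see which compiled graphs a downloaded copy actually contains, a small sketch that walks the repository root and reports every NEFF artifact:

```python
import os

# Walk the downloaded repository and report each compiled NEFF graph.
for root, _dirs, files in os.walk("."):
    for name in sorted(files):
        if name.endswith(".neff"):
            path = os.path.join(root, name)
            print(f"{path}: {os.path.getsize(path) / 1e6:.1f} MB")
```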
Difference from AWS Reference Format
The AWS Neuron reference models (e.g., aws-neuron/Qwen3-1.7B-TP2-BS8-SEQ4096) typically have:
- A weights/ directory with separate safetensors files (e.g., tp0_sharded_checkpoint.safetensors)
- A smaller model.pt (e.g., ~100MB) containing just the model structure

This model has:
- Weights embedded within model.pt (17GB)
- An empty weights/ directory (preserved for format compatibility)
This is because Neuron-compiled models with optimum-neuron[vllm]==0.3.0 bundle weights within the compiled format. The weights are optimized and embedded in the NEFF (Neuron Executable File Format) during compilation. This is a valid alternative implementation that provides the same functionality.
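A quick heuristic check that a local copy uses the embedded-weights layout rather than the sharded reference layout (paths assume the repository root as the working directory):

```python
import os

# Embedded layout: multi-GB model.pt and an empty weights/ directory.
# Sharded reference layout: small model.pt plus safetensors under weights/.
model_pt_gb = os.path.getsize("model.pt") / 1e9
weight_entries = os.listdir("weights") if os.path.isdir("weights") else []
print(f"model.pt: {model_pt_gb:.1f} GB, weights/ entries: {len(weight_entries)}")
```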
Requirements
```
pip install optimum-neuron[vllm]==0.3.0
pip install neuronx-distributed --extra-index-url=https://pip.repos.neuron.amazonaws.com
```
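Because traced artifacts are tied to the toolchain that produced them, it is worth confirming the installed versions before loading the model:

```python
from importlib.metadata import version

# Verify the pinned toolchain; artifacts traced with one optimum-neuron
# release are not guaranteed to load with another.
for pkg in ("optimum-neuron", "neuronx-distributed"):
    print(pkg, version(pkg))
```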
Usage
Loading the Model
```python
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Load the traced model and tokenizer
model = NeuronModelForCausalLM.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium_AWS_Format")
tokenizer = AutoTokenizer.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium_AWS_Format")

# Run inference
prompt = "e2e4"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
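Since the graph was traced with batch_size=4, several prompts can be submitted together. A sketch continuing from the snippet above (the padding setup is an assumption; adjust to your tokenizer):

```python
# Batched inference: the traced graph supports up to 4 concurrent sequences.
# model and tokenizer are loaded as in the snippet above.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = ["e2e4", "d2d4", "c2c4", "g1f3"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**batch, max_new_tokens=20)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```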
Hardware Requirements
- AWS Trainium (trn1.32xlarge, trn1.2xlarge) or Inferentia (inf2) instances
- At least 2 Neuron cores (as configured during tracing)
- Minimum 32GB RAM recommended
Compilation Details
This model was traced with the following parameters:
```python
batch_size=4
sequence_length=2048
num_cores=2
auto_cast_type="bf16"
continuous_batching=True
```
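For reference, reproducing the trace with optimum-neuron looks roughly like the following (a sketch; the exact export keyword arguments can vary between optimum-neuron releases, so verify against the installed version):

```python
from optimum.neuron import NeuronModelForCausalLM

# Re-trace the base model with the parameters listed above.
# Keyword names follow optimum-neuron's export API; verify against
# the installed release before use.
model = NeuronModelForCausalLM.from_pretrained(
    "karanps/ChessLM_Qwen3",
    export=True,
    batch_size=4,
    sequence_length=2048,
    num_cores=2,
    auto_cast_type="bf16",
)
model.save_pretrained("ChessLM_Qwen3_neuron")
```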
Compilation Artifacts
The separate directories contain all compilation artifacts:
- NEFF files: Neuron Executable File Format - the compiled compute graphs
- HLO files: High-Level Operations - intermediate representation
- Compilation logs: Detailed logs from neuronx-cc compiler
- Metadata: Configuration and metrics from compilation
Continuous Batching
This model is compiled with continuous batching enabled, which allows vLLM to:
- Process multiple requests simultaneously with dynamic batch sizes up to 4
- Optimize throughput by batching requests with different sequence lengths
- Reduce latency for concurrent inference workloads
Note: On-device sampling is disabled due to a known Neuron runtime limitation when using tensor parallelism with 2 cores. Sampling is handled on the host instead.
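A sketch of offline batched inference through vLLM's Neuron backend (the device and parallelism arguments are assumptions based on vLLM's Neuron integration and may differ across vLLM versions):

```python
from vllm import LLM, SamplingParams

# Offline batched inference via vLLM's Neuron backend (argument names
# assumed from vLLM's Neuron integration docs).
llm = LLM(
    model="kunhunjon/ChessLM_Qwen3_Trainium_AWS_Format",
    max_num_seqs=4,          # matches the compiled batch size
    max_model_len=2048,      # matches the compiled sequence length
    tensor_parallel_size=2,  # matches the 2-core tensor parallelism
    device="neuron",
)

params = SamplingParams(temperature=0.0, max_tokens=20)  # sampling runs on the host
for out in llm.generate(["e2e4", "d2d4"], params):
    print(out.prompt, "->", out.outputs[0].text)
```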
Compilation Metrics
- Total compilation time: ~8.1 minutes
- Token generation model: 219 seconds
- Context encoding model: 165 seconds
- Compiler: neuronxcc 2.21.33363.0
- Model size: 17GB (with embedded weights)
Model Files
| File | Purpose |
|---|---|
| model.pt | Main model with embedded weights (17GB) |
| config.json | Base model configuration |
| neuron_config.json | Neuron compilation settings |
| tokenizer* | Tokenizer files for text processing |
| context_encoding_model/ | Compiled graphs for prompt processing |
| token_generation_model/ | Compiled graphs for token generation |
| layout_opt/ | Weight layout optimization artifacts |
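Given the 17GB model.pt, fetching everything up front with huggingface_hub avoids partial downloads at load time:

```python
from huggingface_hub import snapshot_download

# Pre-download all artifacts (ensure at least ~20GB of free disk space).
local_dir = snapshot_download("kunhunjon/ChessLM_Qwen3_Trainium_AWS_Format")
print("Repository available at:", local_dir)
```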
License
This model inherits the license from the base model karanps/ChessLM_Qwen3.
Citation
If you use this model, please cite the original ChessLM model and AWS Neuron tools.
See Also
- Sharded version: kunhunjon/ChessLM_Qwen3_Trainium_Sharded - Model split into 9x2GB shards for easier downloading
- Standard version: kunhunjon/ChessLM_Qwen3_Trainium - Single model.pt file