ChessLM Qwen3 - Neuron Traced (AWS Format Structure)
This is a Neuron-traced version of karanps/ChessLM_Qwen3 optimized for AWS Trainium (trn1) and Inferentia (inf2) instances using vLLM with continuous batching enabled.
This model follows the AWS Neuron repository structure with separate directories for compiled artifacts.
Model Details
- Base Model: Qwen3-8B fine-tuned for chess
- Compilation: optimum-neuron[vllm]==0.3.0
- Compiler Version: neuronxcc 2.21.33363.0
- Target Hardware: AWS Trainium (trn1) / Inferentia (inf2)
- Precision: BF16
- Tensor Parallelism: 2 cores
- Batch Size: 4 (continuous batching enabled)
- Max Sequence Length: 2048
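These settings are recorded in the bundled neuron_config.json. A quick way to confirm them on a local copy (the key names below are illustrative; inspect the file for the exact schema used by your optimum-neuron release):

```python
import json

# Print the recorded compilation settings from the bundled config.
# Key names are illustrative and may differ between optimum-neuron releases.
with open("neuron_config.json") as f:
    cfg = json.load(f)

for key in ("batch_size", "sequence_length", "tp_degree", "auto_cast_type"):
    print(f"{key}: {cfg.get(key)}")
```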
Repository Structure
This repository follows the AWS Neuron format with organized directories:
```
├── context_encoding_model/
│   └── _tp0_bk0/
│       ├── graph.neff
│       ├── model.MODULE_*.neff
│       ├── model.MODULE_*.hlo_module.pb
│       ├── compile_flags.*.json
│       ├── neuron_config.json
│       └── log-neuron-cc.txt
├── token_generation_model/
│   └── _tp0_bk0/
│       ├── graph.neff
│       ├── model.MODULE_*.neff
│       ├── model.MODULE_*.hlo_module.pb
│       ├── wrapped_neff.hlo
│       ├── compile_flags.*.json
│       ├── neuron_config.json
│       └── log-neuron-cc.txt
├── layout_opt/
│   ├── graph.neff
│   ├── log-neuron-cc.txt
│   └── model/
│       └── graph.hlo
├── model.pt (17GB - contains compiled graphs + weights)
├── config.json
├── neuron_config.json
└── tokenizer files
```
Key Files
- context_encoding_model/: Compiled NEFF files for processing initial prompt sequences (up to 2048 tokens)
- token_generation_model/: Compiled NEFF files for autoregressive token generation
- layout_opt/: Layout optimization artifacts from compilation
- model.pt: Main model file containing compiled graphs and embedded weights (17GB)
- neuron_config.json: Neuron compilation configuration
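To see which compiled graphs a downloaded copy actually contains, a small sketch that walks the repository root and reports every NEFF artifact:

```python
import os

# Walk the downloaded repository and report each compiled NEFF graph.
for root, _dirs, files in os.walk("."):
    for name in sorted(files):
        if name.endswith(".neff"):
            path = os.path.join(root, name)
            print(f"{path}: {os.path.getsize(path) / 1e6:.1f} MB")
```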
Difference from AWS Reference Format
The AWS Neuron reference models (e.g., aws-neuron/Qwen3-1.7B-TP2-BS8-SEQ4096) typically have:
- A weights/ directory with separate safetensors files (e.g., tp0_sharded_checkpoint.safetensors)
- A smaller model.pt (e.g., ~100MB) containing just the model structure

This model has:
- Weights embedded within model.pt (17GB)
- An empty weights/ directory (preserved for format compatibility)
This is because Neuron-compiled models with optimum-neuron[vllm]==0.3.0 bundle weights within the compiled format. The weights are optimized and embedded in the NEFF (Neuron Executable File Format) during compilation. This is a valid alternative implementation that provides the same functionality.
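A quick heuristic check that a local copy uses the embedded-weights layout rather than the sharded reference layout (paths assume the repository root as the working directory):

```python
import os

# Embedded layout: multi-GB model.pt and an empty weights/ directory.
# Sharded reference layout: small model.pt plus safetensors under weights/.
model_pt_gb = os.path.getsize("model.pt") / 1e9
weight_entries = os.listdir("weights") if os.path.isdir("weights") else []
print(f"model.pt: {model_pt_gb:.1f} GB, weights/ entries: {len(weight_entries)}")
```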
Requirements
```
pip install optimum-neuron[vllm]==0.3.0
pip install neuronx-distributed --extra-index-url=https://pip.repos.neuron.amazonaws.com
```
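Because traced artifacts are tied to the toolchain that produced them, it is worth confirming the installed versions before loading the model:

```python
from importlib.metadata import version

# Verify the pinned toolchain; artifacts traced with one optimum-neuron
# release are not guaranteed to load with another.
for pkg in ("optimum-neuron", "neuronx-distributed"):
    print(pkg, version(pkg))
```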
Usage
Loading the Model
```python
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Load the traced model and tokenizer
model = NeuronModelForCausalLM.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium_AWS_Format")
tokenizer = AutoTokenizer.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium_AWS_Format")

# Run inference
prompt = "e2e4"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
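Since the graph was traced with batch_size=4, several prompts can be submitted together. A sketch continuing from the snippet above (the padding setup is an assumption; adjust to your tokenizer):

```python
# Batched inference: the traced graph supports up to 4 concurrent sequences.
# model and tokenizer are loaded as in the snippet above.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = ["e2e4", "d2d4", "c2c4", "g1f3"]
batch = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(**batch, max_new_tokens=20)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```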
Hardware Requirements
- AWS Trainium (trn1.32xlarge, trn1.2xlarge) or Inferentia (inf2) instances
- At least 2 Neuron cores (as configured during tracing)
- Minimum 32GB RAM recommended
Compilation Details
This model was traced with the following parameters:
```python
batch_size=4
sequence_length=2048
num_cores=2
auto_cast_type="bf16"
continuous_batching=True
```
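For reference, reproducing the trace with optimum-neuron looks roughly like the following (a sketch; the exact export keyword arguments can vary between optimum-neuron releases, so verify against the installed version):

```python
from optimum.neuron import NeuronModelForCausalLM

# Re-trace the base model with the parameters listed above.
# Keyword names follow optimum-neuron's export API; verify against
# the installed release before use.
model = NeuronModelForCausalLM.from_pretrained(
    "karanps/ChessLM_Qwen3",
    export=True,
    batch_size=4,
    sequence_length=2048,
    num_cores=2,
    auto_cast_type="bf16",
)
model.save_pretrained("ChessLM_Qwen3_neuron")
```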
Compilation Artifacts
The separate directories contain all compilation artifacts:
- NEFF files: Neuron Executable File Format - the compiled compute graphs
- HLO files: High-Level Operations - intermediate representation
- Compilation logs: Detailed logs from neuronx-cc compiler
- Metadata: Configuration and metrics from compilation
Continuous Batching
This model is compiled with continuous batching enabled, which allows vLLM to:
- Process multiple requests simultaneously with dynamic batch sizes up to 4
- Optimize throughput by batching requests with different sequence lengths
- Reduce latency for concurrent inference workloads
Note: On-device sampling is disabled due to a known Neuron runtime limitation when using tensor parallelism with 2 cores. Sampling is handled on the host instead.
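A sketch of offline batched inference through vLLM's Neuron backend (the device and parallelism arguments are assumptions based on vLLM's Neuron integration and may differ across vLLM versions):

```python
from vllm import LLM, SamplingParams

# Offline batched inference via vLLM's Neuron backend (argument names
# assumed from vLLM's Neuron integration docs).
llm = LLM(
    model="kunhunjon/ChessLM_Qwen3_Trainium_AWS_Format",
    max_num_seqs=4,          # matches the compiled batch size
    max_model_len=2048,      # matches the compiled sequence length
    tensor_parallel_size=2,  # matches the 2-core tensor parallelism
    device="neuron",
)

params = SamplingParams(temperature=0.0, max_tokens=20)  # sampling runs on the host
for out in llm.generate(["e2e4", "d2d4"], params):
    print(out.prompt, "->", out.outputs[0].text)
```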
Compilation Metrics
- Total compilation time: ~8.1 minutes
- Token generation model: 219 seconds
- Context encoding model: 165 seconds
- Compiler: neuronxcc 2.21.33363.0
- Model size: 17GB (with embedded weights)
Model Files
| File | Purpose |
|---|---|
| model.pt | Main model with embedded weights (17GB) |
| config.json | Base model configuration |
| neuron_config.json | Neuron compilation settings |
| tokenizer* | Tokenizer files for text processing |
| context_encoding_model/ | Compiled graphs for prompt processing |
| token_generation_model/ | Compiled graphs for token generation |
| layout_opt/ | Weight layout optimization artifacts |
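Given the 17GB model.pt, fetching everything up front with huggingface_hub avoids partial downloads at load time:

```python
from huggingface_hub import snapshot_download

# Pre-download all artifacts (ensure at least ~20GB of free disk space).
local_dir = snapshot_download("kunhunjon/ChessLM_Qwen3_Trainium_AWS_Format")
print("Repository available at:", local_dir)
```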
License
This model inherits the license from the base model karanps/ChessLM_Qwen3.
Citation
If you use this model, please cite the original ChessLM model and AWS Neuron tools.
See Also
- Sharded version: kunhunjon/ChessLM_Qwen3_Trainium_Sharded - Model split into 9x2GB shards for easier downloading
- Standard version: kunhunjon/ChessLM_Qwen3_Trainium - Single model.pt file