ChessLM Qwen3 - Neuron Traced (AWS Format Structure)

This is a Neuron-traced version of karanps/ChessLM_Qwen3, optimized for AWS Trainium (trn1) and Inferentia (inf2) instances and for serving with vLLM with continuous batching enabled.

This model follows the AWS Neuron repository structure with separate directories for compiled artifacts.

Model Details

  • Base Model: Qwen3-8B fine-tuned for chess
  • Compilation: optimum-neuron[vllm]==0.3.0
  • Compiler Version: neuronxcc 2.21.33363.0
  • Target Hardware: AWS Trainium (trn1) / Inferentia (inf2)
  • Precision: BF16
  • Tensor Parallelism: 2 cores
  • Batch Size: 4 (continuous batching enabled)
  • Max Sequence Length: 2048

Repository Structure

This repository follows the AWS Neuron format with organized directories:

├── context_encoding_model/
│   └── _tp0_bk0/
│       ├── graph.neff
│       ├── model.MODULE_*.neff
│       ├── model.MODULE_*.hlo_module.pb
│       ├── compile_flags.*.json
│       ├── neuron_config.json
│       └── log-neuron-cc.txt
├── token_generation_model/
│   └── _tp0_bk0/
│       ├── graph.neff
│       ├── model.MODULE_*.neff
│       ├── model.MODULE_*.hlo_module.pb
│       ├── wrapped_neff.hlo
│       ├── compile_flags.*.json
│       ├── neuron_config.json
│       └── log-neuron-cc.txt
├── layout_opt/
│   ├── graph.neff
│   ├── log-neuron-cc.txt
│   └── model/
│       └── graph.hlo
├── model.pt (17GB - contains compiled graphs + weights)
├── config.json
├── neuron_config.json
└── tokenizer files

Key Files

  • context_encoding_model/: Compiled NEFF files for processing initial prompt sequences (up to 2048 tokens)
  • token_generation_model/: Compiled NEFF files for autoregressive token generation
  • layout_opt/: Layout optimization artifacts from compilation
  • model.pt: Main model file containing compiled graphs and embedded weights (17GB)
  • neuron_config.json: Neuron compilation configuration
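The compilation settings baked into these artifacts can be checked directly from neuron_config.json. A minimal sketch (the exact key names depend on the optimum-neuron version, so inspect the file to confirm):

import json

# Read the top-level Neuron compilation configuration.
with open("neuron_config.json") as f:
    cfg = json.load(f)

# Key names such as "batch_size" and "sequence_length" are assumptions
# based on typical optimum-neuron exports; missing keys print None.
for key in ("batch_size", "sequence_length", "tp_degree", "auto_cast_type"):
    print(key, "=", cfg.get(key))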

Differences from the AWS Reference Format

The AWS Neuron reference models (e.g., aws-neuron/Qwen3-1.7B-TP2-BS8-SEQ4096) typically have:

  • A weights/ directory with separate safetensors files (e.g., tp0_sharded_checkpoint.safetensors)
  • A smaller model.pt (e.g., ~100MB) containing just the model structure

This model has:

  • Weights embedded within model.pt (17GB)
  • An empty weights/ directory (preserved for format compatibility)

This is because Neuron models compiled with optimum-neuron[vllm]==0.3.0 bundle the weights within the compiled format: during compilation the weights are optimized and embedded alongside the NEFF (Neuron Executable File Format) graphs in model.pt. The two layouts are functionally equivalent; the model loads and runs the same way.
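A quick way to see this layout locally (a minimal sketch using huggingface_hub; the weights/ check assumes the empty directory may not be materialized on download):

from pathlib import Path
from huggingface_hub import snapshot_download

# Download (or reuse the cached copy of) the repository.
local = Path(snapshot_download("kunhunjon/ChessLM_Qwen3_Trainium_AWS_Format"))

# model.pt carries the compiled graphs plus the embedded weights (~17GB).
print(f"model.pt: {(local / 'model.pt').stat().st_size / 1e9:.1f} GB")

# weights/ is kept only for format compatibility; if materialized at all,
# it should contain no sharded checkpoints.
weights = local / "weights"
print("weights/:", list(weights.iterdir()) if weights.exists() else "not materialized")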

Requirements

pip install "optimum-neuron[vllm]==0.3.0"
pip install neuronx-distributed --extra-index-url=https://pip.repos.neuron.amazonaws.com

Usage

Loading the Model

from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Load the model
model = NeuronModelForCausalLM.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium_AWS_Format")
tokenizer = AutoTokenizer.from_pretrained("kunhunjon/ChessLM_Qwen3_Trainium_AWS_Format")

# Run inference
prompt = "e2e4"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

Hardware Requirements

  • AWS Trainium (trn1.32xlarge, trn1.2xlarge) or Inferentia (inf2) instances
  • At least 2 Neuron cores (as configured during tracing)
  • At least 32GB of host RAM recommended
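Before loading the model, you can confirm that Neuron devices are visible (a minimal sketch; neuron-ls ships with the AWS Neuron tools package):

import subprocess

# neuron-ls lists the Neuron devices and cores visible on this instance;
# at least 2 cores must be available for this TP=2 trace.
result = subprocess.run(["neuron-ls"], capture_output=True, text=True)
print(result.stdout)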

Compilation Details

This model was traced with the following parameters:

  • batch_size=4
  • sequence_length=2048
  • num_cores=2
  • auto_cast_type="bf16"
  • continuous_batching=True
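A re-export along these lines would look roughly as follows (a sketch, not the exact command used; kwarg names, and how continuous batching is requested, can vary across optimum-neuron releases):

from optimum.neuron import NeuronModelForCausalLM

# export=True triggers Neuron compilation of the base checkpoint with the
# trace parameters listed above; continuous_batching is assumed to be
# accepted as a flag in this release.
model = NeuronModelForCausalLM.from_pretrained(
    "karanps/ChessLM_Qwen3",
    export=True,
    batch_size=4,
    sequence_length=2048,
    num_cores=2,
    auto_cast_type="bf16",
    continuous_batching=True,
)
model.save_pretrained("./ChessLM_Qwen3_neuron")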

Compilation Artifacts

The separate directories contain all compilation artifacts:

  • NEFF files: Neuron Executable File Format - the compiled compute graphs
  • HLO files: High-Level Operations - intermediate representation
  • Compilation logs: Detailed logs from neuronx-cc compiler
  • Metadata: Configuration and metrics from compilation

Continuous Batching

This model is compiled with continuous batching enabled, which allows vLLM to:

  • Process multiple requests simultaneously with dynamic batch sizes up to 4
  • Optimize throughput by batching requests with different sequence lengths
  • Reduce latency for concurrent inference workloads

Note: On-device sampling is disabled due to a known Neuron runtime limitation when using tensor parallelism with 2 cores. Sampling is handled on the host instead.
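With optimum-neuron[vllm] installed, serving through vLLM looks roughly like this (a sketch; argument names follow the vLLM Neuron backend and may differ by version):

from vllm import LLM, SamplingParams

# max_num_seqs matches the compiled batch size (4) and max_model_len the
# compiled sequence length (2048); device="neuron" selects the Neuron backend.
llm = LLM(
    model="kunhunjon/ChessLM_Qwen3_Trainium_AWS_Format",
    device="neuron",
    tensor_parallel_size=2,
    max_num_seqs=4,
    max_model_len=2048,
)

# Sampling parameters are applied on the host (see the note above).
params = SamplingParams(temperature=0.8, max_tokens=20)
print(llm.generate(["e2e4"], params)[0].outputs[0].text)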

Compilation Metrics

  • Total compilation time: ~8.1 minutes
  • Token generation model: 219 seconds
  • Context encoding model: 165 seconds
  • Compiler: neuronxcc 2.21.33363.0
  • Model size: 17GB (with embedded weights)

Model Files

File                        Purpose
model.pt                    Main model with embedded weights (17GB)
config.json                 Base model configuration
neuron_config.json          Neuron compilation settings
tokenizer*                  Tokenizer files for text processing
context_encoding_model/     Compiled graphs for prompt processing
token_generation_model/     Compiled graphs for token generation
layout_opt/                 Weight layout optimization artifacts

License

This model inherits the license from the base model karanps/ChessLM_Qwen3.

Citation

If you use this model, please cite the original ChessLM model and AWS Neuron tools.
