ONNX Format
Hi, I was wondering if anyone has already converted the safetensors into a format like ONNX (preferably a quantized version). I’d like to run it on edge devices. Thanks! ✌️
Doesn't seem to be supported by the optimum CLI yet; see the supported-architectures list here:
https://huggingface.co/docs/transformers/en/serialization
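For reference, a plain export attempt along these lines (the model ID and output path are placeholders of mine):

optimum-cli export onnx --model openai/gpt-oss-20b gpt-oss-20b-onnx/

fails with: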
/home/Alphag0/Documents/TestProjects/onnx/.venv/lib/python3.11/site-packages/torch/onnx/_internal/registration.py:159: OnnxExporterWarning: Symbolic function 'aten::scaled_dot_product_attention' already registered for opset 14. Replacing the existing function with new function. This is unexpected. Please report it on https://github.com/pytorch/pytorch/issues.
  warnings.warn(
MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16
Loading checkpoint shards: 100%|██████████| 3/3 [00:31<00:00, 10.35s/it]
Traceback (most recent call last):
  File "/home/Alphag0/Documents/TestProjects/onnx/.venv/bin/optimum-cli", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/Alphag0/Documents/TestProjects/onnx/.venv/lib/python3.11/site-packages/optimum/commands/optimum_cli.py", line 208, in main
    service.run()
  File "/home/Alphag0/Documents/TestProjects/onnx/.venv/lib/python3.11/site-packages/optimum/commands/export/onnx.py", line 276, in run
    main_export(
  File "/home/Alphag0/Documents/TestProjects/onnx/.venv/lib/python3.11/site-packages/optimum/exporters/onnx/__main__.py", line 418, in main_export
    onnx_export_from_model(
  File "/home/Alphag0/Documents/TestProjects/onnx/.venv/lib/python3.11/site-packages/optimum/exporters/onnx/convert.py", line 1044, in onnx_export_from_model
    raise ValueError(
ValueError: Trying to export a gpt_oss model, that is a custom or unsupported architecture, but no custom onnx configuration was passed as `custom_onnx_configs`. Please refer to https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#custom-export-of-transformers-models for an example on how to export custom models. Please open an issue at https://github.com/huggingface/optimum/issues if you would like the model type gpt_oss to be supported natively in the ONNX export.
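In the meantime, the error message points at the custom-export escape hatch. A rough sketch of that route, following the optimum custom-export guide (the config class, its settings, and the model ID are my assumptions, untested for gpt_oss, whose MoE experts and MXFP4 weights may well need real upstream support):

from transformers import AutoConfig
from optimum.exporters.onnx import main_export
from optimum.exporters.onnx.config import TextDecoderOnnxConfig
from optimum.utils import NormalizedTextConfig

# Hypothetical minimal config for an unsupported decoder-only architecture.
class GptOssOnnxConfig(TextDecoderOnnxConfig):
    DEFAULT_ONNX_OPSET = 14
    NORMALIZED_CONFIG_CLASS = NormalizedTextConfig

model_id = "openai/gpt-oss-20b"
config = AutoConfig.from_pretrained(model_id)
onnx_config = GptOssOnnxConfig(config, task="text-generation", use_past=True)

main_export(
    model_id,
    output="gpt-oss-20b-onnx",
    task="text-generation-with-past",
    custom_onnx_configs={"model": onnx_config},  # key name per the guide's examples
    no_post_process=True,
)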
An optimized and quantized ONNX model is available through Foundry Local and the AI Toolkit for VS Code; please see the official OpenAI announcement for more details. I have also uploaded the model to Hugging Face here.
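If you have Foundry Local installed, pulling and running it should be a one-liner along these lines (the exact model alias is my assumption; foundry model list shows what's available):

foundry model run gpt-oss-20b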
Any chance of getting the 120b model in ONNX, too?
You can create your own ONNX variants for gpt-oss-20b and gpt-oss-120b with ONNX Runtime GenAI's model builder. Here is the PR to enable that.
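If that PR isn't in a released package yet, one way to try it is to run the builder from a checkout of the PR branch (a sketch; fill in the PR number from the link above and adjust the model, paths, and execution provider):

git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai
git fetch origin pull/<PR_NUMBER>/head:gpt-oss-builder
git checkout gpt-oss-builder
python src/python/py/models/builder.py -m openai/gpt-oss-20b -o ./gpt-oss-20b-onnx -p int4 -e cpu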
I ran this, but I'm getting an error: python builder.py -i d:\models\gpt-oss-120b -o d:\models\gpt-oss-120b-onnx -p int4 -e dml --extra_options int4_op_types_to_quantize=MatMul/Gather
Reading final norm
Reading LM head
Saving ONNX model in d:\models\gpt-oss-120b-onnx
Traceback (most recent call last):
  File "C:\projects\onnxruntime-genai\venv\Lib\site-packages\onnx_ir\serde.py", line 99, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\onnxruntime-genai\venv\Lib\site-packages\onnx_ir\serde.py", line 1694, in serialize_tensor_into
    tensor_proto.raw_data = from_.tobytes()
                            ^^^^^^^^^^^^^^^
  File "C:\projects\onnxruntime-genai\venv\Lib\site-packages\onnx_ir_core.py", line 970, in tobytes
    return self._evaluate().tobytes()
           ^^^^^^^^^^^^^^^^
  File "C:\projects\onnxruntime-genai\venv\Lib\site-packages\onnx_ir_core.py", line 931, in _evaluate
    return self._func()
           ^^^^^^^^^^^^
  File "C:\projects\onnxruntime-genai\src\python\py\models\builder.py", line 547, in tensor_func
    tensor = tensor.to(to_torch_dtype(to))
             ^^^^^^^^^
AttributeError: 'Tensor' object has no attribute 'to'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "C:\projects\onnxruntime-genai\venv\Lib\site-packages\onnx_ir\serde.py", line 99, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\onnxruntime-genai\venv\Lib\site-packages\onnx_ir\serde.py", line 1489, in serialize_graph_into
    serialize_tensor_into(graph_proto.initializer.add(), from_=value.const_value)
  File "C:\projects\onnxruntime-genai\venv\Lib\site-packages\onnx_ir\serde.py", line 101, in wrapper
    raise SerdeError(
onnx_ir.serde.SerdeError: Error calling serialize_tensor_into with: LazyTensor<FLOAT16,[128,2880,5760]>(func=<function Model.make_initializer.<locals>.tensor_func at 0x0000016A47F0CE00>, name='model.layers.0.moe.experts.gate_up_proj.weight')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "C:\projects\onnxruntime-genai\venv\Lib\site-packages\onnx_ir\serde.py", line 99, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\onnxruntime-genai\venv\Lib\site-packages\onnx_ir\serde.py", line 1294, in serialize_model_into
    serialize_graph_into(model_proto.graph, from_.graph)
  File "C:\projects\onnxruntime-genai\venv\Lib\site-packages\onnx_ir\serde.py", line 101, in wrapper
    raise SerdeError(
onnx_ir.serde.SerdeError: Error calling serialize_graph_into with: name=main_graph, doc_string=None, len(inputs)=75, len(initializers)=581, len(nodes)=1824, len(outputs)=73, metadata_props={}
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "C:\projects\onnxruntime-genai\src\python\py\models\builder.py", line 4514, in 
    create_model(args.model_name, args.input, args.output, args.precision, args.execution_provider, args.cache_dir, **extra_options)
  File "C:\projects\onnxruntime-genai\venv\Lib\site-packages\torch\utils_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\onnxruntime-genai\src\python\py\models\builder.py", line 4377, in create_model
    onnx_model.save_model(output_dir)
  File "C:\projects\onnxruntime-genai\src\python\py\models\builder.py", line 501, in save_model
    model = self.to_int4()
            ^^^^^^^^^^^^^^
  File "C:\projects\onnxruntime-genai\src\python\py\models\builder.py", line 484, in to_int4
    model=ir.to_proto(self.model),
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\onnxruntime-genai\venv\Lib\site-packages\onnx_ir\serde.py", line 276, in to_proto
    return serialize_model(ir_object)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\onnxruntime-genai\venv\Lib\site-packages\onnx_ir\serde.py", line 1266, in serialize_model
    return serialize_model_into(onnx.ModelProto(), from_=model)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\projects\onnxruntime-genai\venv\Lib\site-packages\onnx_ir\serde.py", line 101, in wrapper
    raise SerdeError(
onnx_ir.serde.SerdeError: Error calling serialize_model_into with: ir_version=10, producer_name=onnxruntime-genai, producer_version=None, domain=None,
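For what it's worth, my read of the trace is that to_int4() builds LazyTensor initializers whose tensor_func calls .to() on whatever it wraps, and for the MoE expert weights that object is an onnx_ir tensor rather than a torch.Tensor, so there is no .to() to call. A minimal illustration of the type mismatch (my interpretation only, not a verified fix):

import numpy as np
import onnx_ir as ir
import torch

# Stand-in for an initializer like model.layers.0.moe.experts.gate_up_proj.weight
t = ir.Tensor(np.zeros((2, 2), dtype=np.float16))
print(hasattr(t, "to"))                 # False -> the AttributeError above
bridged = torch.from_numpy(t.numpy())   # one plausible bridge back to torch
print(bridged.to(torch.float32).dtype)  # torch.float32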
@thaddeusk Could you share the PyTorch and onnx-ir version you are using? You can use this script https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py to collect information about your environment.
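If PyTorch is already installed, that script can also be run directly as a module, no download needed:

python -m torch.utils.collect_env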
I guess I missed this notification; I apologize for the delayed response. I've tried it on two machines, one with a 5090 and another with a Ryzen AI Max+ 395. It loads much faster on the 395 due to its 110 GB of available VRAM, but both hit the same error. I just updated everything and built onnxruntime 1.23.0 from source, because the pre-release version didn't seem to be working for me, but it didn't help.
I should note that I get the same error on the 20b model.
Collecting environment information...
PyTorch version: 2.8.0.dev20250402+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 11 Pro (10.0.26100 64-bit)
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 4.1.0
Libc version: N/A
Python version: 3.12.8 (tags/v3.12.8:2dc476b, Dec  3 2024, 19:30:04) [MSC v.1942 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-11-10.0.26100-SP0
Is CUDA available: True
CUDA runtime version: 12.8.93
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 5090
Nvidia driver version: 581.29
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Name: AMD Ryzen 7 5700X3D 8-Core Processor
Manufacturer: AuthenticAMD
Family: 107
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 3001
MaxClockSpeed: 3001
L2CacheSize: 4096
L2CacheSpeed: None
Revision: 8450
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] onnx==1.17.0
[pip3] onnx-ir==0.1.9
[pip3] onnxmltools==1.14.0
[pip3] onnxruntime==1.23.0
[pip3] onnxruntime-directml==1.22.0
[pip3] onnxruntime-genai==0.9.2
[pip3] onnxruntime-genai-cuda==0.7.0
[pip3] onnxruntime-genai-directml==0.9.2
[pip3] onnxruntime-gpu==1.21.0
[pip3] torch==2.8.0.dev20250402+cu128
[pip3] torchaudio==2.6.0.dev20250403+cu128
[pip3] torchvision==0.22.0.dev20250403+cu128
[pip3] triton-windows==3.4.0.post20
[conda] numpy                     2.1.2                    pypi_0    pypi
[conda] torch                     2.7.1+cu128              pypi_0    pypi
[conda] torchaudio                2.7.1+cu128              pypi_0    pypi
[conda] torchvision               0.22.1+cu128             pypi_0    pypi
Collecting environment information...
PyTorch version: 2.10.0a0+rocm7.0.0rc20250919
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 7.1.25375-6fc27805ef
OS: Microsoft Windows 11 Pro (10.0.26100 64-bit)
GCC version: (MinGW-W64 x86_64-ucrt-posix-seh, built by Brecht Sanders, r8) 13.2.0
Clang version: 19.1.5
CMake version: version 4.1.0
Libc version: N/A
Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-11-10.0.26100-SP0
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to:
GPU models and configuration: Radeon 8060S Graphics (gfx1151)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Is XPU available: False
HIP runtime version: 7.1.25375
MIOpen runtime version: 3.5.0
Is XNNPACK available: True
CPU:
Name: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
Manufacturer: AuthenticAMD
Family: 107
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 3000
MaxClockSpeed: 3000
L2CacheSize: 16384
L2CacheSpeed: None
Revision: 28672
Versions of relevant libraries:
[pip3] numpy==2.3.3
[pip3] onnx==1.19.0
[pip3] onnx-ir==0.1.9
[pip3] onnxruntime==1.23.0
[pip3] onnxruntime-directml==1.22.0
[pip3] onnxruntime-genai-directml==0.9.2
[pip3] torch==2.10.0a0+rocm7.0.0rc20250919
[pip3] torchaudio==2.8.0a0+rocm7.0.0rc20250919
[pip3] torchvision==0.25.0a0+rocm7.0.0rc20250919
[pip3] triton-windows==3.4.0.post20
[conda] numpy                         2.2.6                      pypi_0              pypi
[conda] rotary-embedding-torch        0.8.9                      pypi_0              pypi
[conda] torchmetrics                  1.8.2                      pypi_0              pypi
[conda] torchvision                   0.23.0                     pypi_0              pypi