Add fusion (#3)

* vectorized and optimized block reduce
* added benchmark tests (without a README update)
* implemented fused_mul_poly_norm
Signed-off-by: taehyun <[email protected]>
* added fused_add_rms_norm
* removed the backward pass from fused add RMS norm; split tests and benchmarks
Signed-off-by: taehyun <[email protected]>
* refactored benchmarks
* added README
* fixed README
* added build artifacts
* fixed README again
* added MI250 results
* highlighted that our own kernels serve as the non-fused baseline in the fused performance plots
* applied yapf
---------
Signed-off-by: taehyun <[email protected]>
Co-authored-by: taehyun <[email protected]>
- README.md +178 -7
- activation/block_reduce.h +0 -21
- activation/fused_add_rms_norm.cu +157 -0
- activation/fused_mul_poly_norm.cu +642 -0
- activation/poly_norm.cu +88 -61
- activation/rms_norm.cu +243 -51
- benchmarks/README.md +35 -0
- benchmarks/cases/__init__.py +1 -0
- benchmarks/cases/add_rms.py +55 -0
- benchmarks/cases/mul_poly.py +53 -0
- benchmarks/cases/poly.py +58 -0
- benchmarks/cases/rms.py +35 -0
- benchmarks/common/__init__.py +1 -0
- benchmarks/common/bench_framework.py +220 -0
- benchmarks/common/diff_engine.py +85 -0
- benchmarks/plots/h100/add_rms/plot_add_rms-bwd-perf.png +0 -0
- benchmarks/plots/h100/add_rms/plot_add_rms-fwd-perf.png +0 -0
- benchmarks/plots/h100/mul_poly/plot_mul_poly-bwd-perf.png +0 -0
- benchmarks/plots/h100/mul_poly/plot_mul_poly-fwd-perf.png +0 -0
- benchmarks/plots/h100/poly/plot_poly-bwd-perf.png +0 -0
- benchmarks/plots/h100/poly/plot_poly-fwd-perf.png +0 -0
- benchmarks/plots/h100/rms/plot_rms-bwd-perf.png +0 -0
- benchmarks/plots/h100/rms/plot_rms-fwd-perf.png +0 -0
- benchmarks/plots/mi250/add_rms/plot_add_rms-bwd-perf.png +0 -0
- benchmarks/plots/mi250/add_rms/plot_add_rms-fwd-perf.png +0 -0
- benchmarks/plots/mi250/mul_poly/plot_mul_poly-bwd-perf.png +0 -0
- benchmarks/plots/mi250/mul_poly/plot_mul_poly-fwd-perf.png +0 -0
- benchmarks/plots/mi250/poly/plot_poly-bwd-perf.png +0 -0
- benchmarks/plots/mi250/poly/plot_poly-fwd-perf.png +0 -0
- benchmarks/plots/mi250/rms/plot_rms-bwd-perf.png +0 -0
- benchmarks/plots/mi250/rms/plot_rms-fwd-perf.png +0 -0
- benchmarks/run_cases.py +143 -0
- build.toml +4 -2
- build/torch27-cxx11-cu118-x86_64-linux/activation/__init__.py +24 -2
- tests/perf.png → build/torch27-cxx11-cu118-x86_64-linux/activation/_activation_20250907180255.abi3.so +2 -2
- build/torch27-cxx11-cu118-x86_64-linux/activation/_ops.py +3 -3
- build/torch27-cxx11-cu118-x86_64-linux/activation/layers.py +48 -2
- build/torch27-cxx11-cu118-x86_64-linux/activation/poly_norm.py +37 -0
- build/torch27-cxx11-cu118-x86_64-linux/activation/rms_norm.py +47 -0
- build/torch27-cxx11-cu126-x86_64-linux/activation/__init__.py +24 -2
- build/torch27-cxx11-cu126-x86_64-linux/activation/_activation_20250907180255.abi3.so +3 -0
- build/torch27-cxx11-cu126-x86_64-linux/activation/_ops.py +3 -3
- build/torch27-cxx11-cu126-x86_64-linux/activation/layers.py +48 -2
- build/torch27-cxx11-cu126-x86_64-linux/activation/poly_norm.py +37 -0
- build/torch27-cxx11-cu126-x86_64-linux/activation/rms_norm.py +47 -0
- build/torch27-cxx11-cu128-x86_64-linux/activation/__init__.py +24 -2
- build/torch27-cxx11-cu128-x86_64-linux/activation/_activation_20250907180255.abi3.so +3 -0
- build/torch27-cxx11-cu128-x86_64-linux/activation/_ops.py +3 -3
- build/torch27-cxx11-cu128-x86_64-linux/activation/layers.py +48 -2
- build/torch27-cxx11-cu128-x86_64-linux/activation/poly_norm.py +37 -0
README.md
CHANGED

@@ -11,6 +11,37 @@ Activation is a python package that contains custom CUDA-based activation kernels
 Currently implemented
 - [PolyNorm](https://arxiv.org/html/2411.03884v1)
 - [RMSNorm](https://docs.pytorch.org/docs/stable/generated/torch.nn.RMSNorm.html)
+- **FusedAddRMSNorm**
+
+  A fused operator that combines **residual addition** (`x + residual`) with **RMSNorm** in a single kernel.
+  Instead of:
+
+  ```python
+  y = x + residual
+  out = rms_norm(y, weight, eps)
+  ```
+
+  Fused as:
+
+  ```python
+  out = fused_add_rms_norm(x, residual, weight, eps)
+  ```
+
+- **FusedMulPolyNorm**
+
+  A fused operator that combines **PolyNorm** with an **element-wise multiplication** by a Tensor.
+  Instead of:
+
+  ```python
+  y = poly_norm(x, weight, bias, eps)
+  out = y * a
+  ```
+
+  Fused as:
+
+  ```python
+  out = fused_mul_poly_norm(x, a, weight, bias, eps)
+  ```

 ## Usage

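A side note on the two fused operators added above: their semantics can be pinned down with a small PyTorch reference. The sketch below is illustrative only — the `*_ref` helper names are mine, not part of this PR — but it mirrors the unfused compositions shown in the README and is handy for comparing against the kernels numerically:

```python
import torch

def rms_norm_ref(x, weight, eps):
    # RMSNorm: scale x by the reciprocal RMS over the last dimension.
    return weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

def fused_add_rms_norm_ref(x, residual, weight, eps):
    # The fused kernel produces both the normalized output and the
    # residual sum (add_out), which feeds the next layer.
    y = x + residual
    return rms_norm_ref(y, weight, eps), y

def poly_norm_ref(x, weight, bias, eps):
    # PolyNorm: a weighted sum of RMS-normalized powers of x, plus a bias.
    def norm(u):
        return u * torch.rsqrt(u.pow(2).mean(-1, keepdim=True) + eps)
    return (weight[0] * norm(x**3) + weight[1] * norm(x**2)
            + weight[2] * norm(x) + bias)

def fused_mul_poly_norm_ref(x, a, weight, bias, eps):
    # PolyNorm followed by the element-wise multiplication.
    return poly_norm_ref(x, weight, bias, eps) * a
```

Note that `fused_add_rms_norm` also materializes the residual sum (`add_out`); the reference returns it as the second value for the same reason.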
@@ -28,18 +59,158 @@ print(poly_norm(x))
 ```

 ## Performance
+
+- Test cases are from the Motif LLM.
+- The results can be reproduced using the provided benchmarking tools.
+- For details on how to use the benchmarking tools, please refer to the [benchmarks README](./benchmarks/README.md).
+- The benchmark results may show fluctuations, especially in the backward pass and when the dimension size is small.
+
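On the fluctuation caveat: short normalization kernels are dominated by launch overhead and clock behavior, so the timing method matters. Below is a minimal event-based timing sketch with warmup, for illustration only; the PR's actual harness lives in benchmarks/common/bench_framework.py and may measure differently:

```python
import torch

def time_cuda(fn, warmup=10, iters=100):
    # Warm up to exclude one-time costs (JIT, caching allocator, clocks).
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call
```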
+### RMSNorm
+
+#### H100 Results
+
+<details>
+<summary>Forward Performance</summary>
+
+![RMS Forward]()
+
+</details>
+
+<details>
+<summary>Backward Performance</summary>
+
+![RMS Backward]()
+
+</details>
+
+#### MI250 Results
+
+<details>
+<summary>Forward Performance</summary>
+
+![RMS Forward]()
+
+</details>
+
+<details>
+<summary>Backward Performance</summary>
+
+![RMS Backward]()
+
+</details>
+
+---
+
+### FusedAddRMSNorm
+
+> [!NOTE]
+> For fusion case performance, the **non-fused baseline** was implemented with our **custom kernels**.
+
+#### H100 Results
+
+<details>
+<summary>Forward Performance</summary>
+
+![Fused Add RMS Forward]()
+
+</details>
+
+<details>
+<summary>Backward Performance</summary>
+
+![Fused Add RMS Backward]()
+
+</details>
+
+#### MI250 Results
+
+<details>
+<summary>Forward Performance</summary>
+
+![Fused Add RMS Forward]()
+
+</details>
+
+<details>
+<summary>Backward Performance</summary>
+
+![Fused Add RMS Backward]()
+
+</details>
+
+---

 ### PolyNorm

-You can reproduce the results with:
+#### H100 Results
+
+<details>
+<summary>Forward Performance</summary>
+
+![Poly Forward]()
+
+</details>
+
+<details>
+<summary>Backward Performance</summary>
+
+![Poly Backward]()
+
+</details>
+
+#### MI250 Results
+
+<details>
+<summary>Forward Performance</summary>
+
+![Poly Forward]()
+
+</details>
+
+<details>
+<summary>Backward Performance</summary>
+
+![Poly Backward]()
+
+</details>
+
+---
+
+### FusedMulPolyNorm
+
+> [!NOTE]
+> For fusion case performance, the **non-fused baseline** was implemented with our **custom kernels**.
+
+#### H100 Results
+
+<details>
+<summary>Forward Performance</summary>
+
+![Fused Mul Poly Forward]()
+
+</details>
+
+<details>
+<summary>Backward Performance</summary>
+
+![Fused Mul Poly Backward]()
+
+</details>
+
+#### MI250 Results
+
+<details>
+<summary>Forward Performance</summary>
+
+![Fused Mul Poly Forward]()
+
+</details>
+
+<details>
+<summary>Backward Performance</summary>
+
+![Fused Mul Poly Backward]()
+
+</details>

 ## Pre-commit Hooks

activation/block_reduce.h
DELETED

@@ -1,21 +0,0 @@
-namespace motif {
-
-template <typename acc_t, int BLOCK_SIZE>
-__device__ acc_t _block_reduce_sum(acc_t *shared, const float val,
-                                   const int d) {
-  // TODO: Optimize with warp-level primitives
-  __syncthreads();
-
-  shared[threadIdx.x] = threadIdx.x < d ? val : 0.0f;
-  __syncthreads();
-  for (int stride = BLOCK_SIZE / 2; stride > 0; stride /= 2) {
-    if (threadIdx.x < stride) {
-      shared[threadIdx.x] += shared[threadIdx.x + stride];
-    }
-    __syncthreads();
-  }
-
-  return shared[0];
-}
-
-} // namespace motif
activation/fused_add_rms_norm.cu
ADDED

@@ -0,0 +1,157 @@
#include <ATen/Functions.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/all.h>

#include <cmath>

#include "assert_utils.h"
#include "atomic_utils.h"
#include "cuda_compat.h"
#include "dispatch_utils.h"

namespace motif {

template <typename type, int N> struct alignas(sizeof(type) * N) type_vec_t {
  type data[N];
};

template <typename scalar_t, typename acc_t, int width>
__global__ std::enable_if_t<(width > 0)>
fused_add_rms_norm_kernel(scalar_t *__restrict__ out,            // [..., d]
                          scalar_t *__restrict__ add_out,        // [..., d]
                          const scalar_t *__restrict__ input,    // [..., d]
                          const scalar_t *__restrict__ residual, // [..., d]
                          const scalar_t *__restrict__ weight,   // [d]
                          const float eps, const int d) {
  using vec_t = type_vec_t<scalar_t, width>;

  const int vec_d = d / width;
  const int64_t vec_offset = blockIdx.x * vec_d;
  const vec_t *__restrict__ input_vec = reinterpret_cast<const vec_t *>(input);
  const vec_t *__restrict__ residual_vec =
      reinterpret_cast<const vec_t *>(residual);
  vec_t *__restrict__ add_out_vec = reinterpret_cast<vec_t *>(add_out);
  acc_t sum_square = 0.0f;

  for (int64_t idx = threadIdx.x; idx < vec_d; idx += blockDim.x) {
    vec_t x_vec = input_vec[vec_offset + idx];
    vec_t res_vec = residual_vec[vec_offset + idx];
    vec_t add_vec;

#pragma unroll
    for (int i = 0; i < width; ++i) {
      acc_t x = x_vec.data[i] + res_vec.data[i];
      sum_square += x * x;
      add_vec.data[i] = x;
    }
    add_out_vec[vec_offset + idx] = add_vec;
  }

  using BlockReduce = cub::BlockReduce<float, 1024>;
  __shared__ typename BlockReduce::TempStorage reduceStore;

  sum_square = BlockReduce(reduceStore).Sum(sum_square, blockDim.x);

  __shared__ acc_t s_scale;

  if (threadIdx.x == 0) {
    s_scale = rsqrtf(sum_square / d + eps);
  }
  __syncthreads();

  const vec_t *__restrict__ weight_vec =
      reinterpret_cast<const vec_t *>(weight);
  vec_t *__restrict__ output_vec = reinterpret_cast<vec_t *>(out);

  for (int64_t idx = threadIdx.x; idx < vec_d; idx += blockDim.x) {
    vec_t x_vec = add_out_vec[vec_offset + idx];
    vec_t w_vec = weight_vec[idx];
    vec_t y_vec;

#pragma unroll
    for (int i = 0; i < width; ++i) {
      acc_t x = x_vec.data[i];
      acc_t w = w_vec.data[i];

      y_vec.data[i] = w * x * s_scale;
    }
    output_vec[vec_offset + idx] = y_vec;
  }
}

template <typename scalar_t, typename acc_t, int width>
__global__ std::enable_if_t<(width == 0)>
fused_add_rms_norm_kernel(scalar_t *__restrict__ out,            // [..., d]
                          scalar_t *__restrict__ add_out,        // [..., d]
                          const scalar_t *__restrict__ input,    // [..., d]
                          const scalar_t *__restrict__ residual, // [..., d]
                          const scalar_t *__restrict__ weight,   // [d]
                          const float eps, const int d) {
  const int64_t token_idx = blockIdx.x;
  const int64_t vec_idx = threadIdx.x;
  acc_t sum_square = 0.0f;

  for (int64_t idx = vec_idx; idx < d; idx += blockDim.x) {
    acc_t x = input[token_idx * d + idx] + residual[token_idx * d + idx];
    sum_square += x * x;
    add_out[token_idx * d + idx] = x;
  }

  using BlockReduce = cub::BlockReduce<float, 1024>;
  __shared__ typename BlockReduce::TempStorage reduceStore;

  sum_square = BlockReduce(reduceStore).Sum(sum_square, blockDim.x);

  __shared__ acc_t s_scale;

  if (vec_idx == 0) {
    s_scale = rsqrtf(sum_square / d + eps);
  }
  __syncthreads();

  for (int64_t idx = vec_idx; idx < d; idx += blockDim.x) {
    acc_t x = add_out[token_idx * d + idx];
    acc_t w = weight[idx];
    out[token_idx * d + idx] = w * x * s_scale;
  }
}

} // namespace motif

#define LAUNCH_RMS_NORM(width)                                                 \
  MOTIF_DISPATCH_FLOATING_TYPES(                                               \
      input.scalar_type(), "fused_add_rms_norm_kernel", [&] {                  \
        motif::fused_add_rms_norm_kernel<scalar_t, float, width>               \
            <<<grid, block, 0, stream>>>(                                      \
                out.data_ptr<scalar_t>(), add_out.data_ptr<scalar_t>(),        \
                input.data_ptr<scalar_t>(), residual.data_ptr<scalar_t>(),     \
                weight.data_ptr<scalar_t>(), eps, d);                          \
      });

void fused_add_rms_norm(torch::Tensor &out,            // [..., d]
                        torch::Tensor &add_out,        // [..., d]
                        const torch::Tensor &input,    // [..., d]
                        const torch::Tensor &residual, // [..., d]
                        const torch::Tensor &weight,   // [d]
                        double eps) {
  AssertTensorShapeEqual(input, residual, "input", "residual");
  AssertTensorShapeEqual(input, out, "input", "out");
  AssertTensorShapeEqual(input, add_out, "input", "result");
  AssertTensorNotNull(weight, "weight");
  // TODO shape check

  int d = input.size(-1);
  int64_t num_tokens = input.numel() / input.size(-1);
  dim3 grid(num_tokens);
  const int max_block_size = (num_tokens < 256) ? 1024 : 256;
  dim3 block(std::min(d, max_block_size));

  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  if (d % 8 == 0) {
    LAUNCH_RMS_NORM(8);
  } else {
    LAUNCH_RMS_NORM(0);
  }
}
activation/fused_mul_poly_norm.cu
ADDED

@@ -0,0 +1,642 @@
#include <ATen/Functions.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/all.h>

#include <cmath>

#include "assert_utils.h"
#include "atomic_utils.h"
#include "cuda_compat.h"
#include "dispatch_utils.h"

namespace motif {

template <typename type, int N> struct alignas(sizeof(type) * N) type_vec_t {
  type data[N];
};

struct SumOp {
  __device__ float3 operator()(const float3 &a, const float3 &b) const {
    return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
  }
};

struct SumOp4 {
  __device__ float4 operator()(const float4 &a, const float4 &b) const {
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
  }
};

template <typename scalar_t, typename acc_t, int width>
__global__ std::enable_if_t<(width > 0)>
fused_mul_poly_norm_kernel(scalar_t *__restrict__ out,          // [..., d]
                           const scalar_t *__restrict__ input,  // [..., d]
                           const scalar_t *__restrict__ mul,    // [..., d]
                           const scalar_t *__restrict__ weight, // [3]
                           const scalar_t *__restrict__ bias,   // [1]
                           const float eps, const int d) {
  using vec_t = type_vec_t<scalar_t, width>;

  const int vec_d = d / width;
  const int64_t vec_offset = blockIdx.x * vec_d;
  const vec_t *__restrict__ input_vec = reinterpret_cast<const vec_t *>(input);

  acc_t sum2 = 0.0f;
  acc_t sum4 = 0.0f;
  acc_t sum6 = 0.0f;

  for (int64_t idx = threadIdx.x; idx < vec_d; idx += blockDim.x) {
    vec_t x_vec = input_vec[vec_offset + idx];

#pragma unroll
    for (int i = 0; i < width; ++i) {
      acc_t x1 = x_vec.data[i];
      acc_t x2 = x1 * x1;
      acc_t x4 = x2 * x2;
      acc_t x6 = x4 * x2;

      sum2 += x2;
      sum4 += x4;
      sum6 += x6;
    }
  }

  using BlockReduce = cub::BlockReduce<float3, 1024>;
  __shared__ typename BlockReduce::TempStorage reduceStore;

  float3 thread_sums = make_float3(sum2, sum4, sum6);
  float3 block_sums =
      BlockReduce(reduceStore).Reduce(thread_sums, SumOp{}, blockDim.x);

  sum2 = block_sums.x;
  sum4 = block_sums.y;
  sum6 = block_sums.z;

  __shared__ acc_t s_bias;

  __shared__ acc_t s_w2_inv_std1;
  __shared__ acc_t s_w1_inv_std2;
  __shared__ acc_t s_w0_inv_std3;

  if (threadIdx.x == 0) {
    acc_t w0 = weight[0];
    acc_t w1 = weight[1];
    acc_t w2 = weight[2];
    s_bias = bias[0];

    s_w2_inv_std1 = rsqrtf(sum2 / d + eps) * w2;
    s_w1_inv_std2 = rsqrtf(sum4 / d + eps) * w1;
    s_w0_inv_std3 = rsqrtf(sum6 / d + eps) * w0;
  }
  __syncthreads();

  acc_t w2_inv_std1 = s_w2_inv_std1;
  acc_t w1_inv_std2 = s_w1_inv_std2;
  acc_t w0_inv_std3 = s_w0_inv_std3;
  acc_t bias_reg = s_bias;

  vec_t *__restrict__ output_vec = reinterpret_cast<vec_t *>(out);
  const vec_t *__restrict__ mul_vec = reinterpret_cast<const vec_t *>(mul);

  for (int64_t idx = threadIdx.x; idx < vec_d; idx += blockDim.x) {
    vec_t x_vec = input_vec[vec_offset + idx];
    vec_t m_vec = mul_vec[vec_offset + idx];
    vec_t y_vec;

#pragma unroll
    for (int i = 0; i < width; ++i) {
      acc_t x1 = x_vec.data[i];
      scalar_t m = m_vec.data[i];
      acc_t x2 = x1 * x1;
      acc_t x3 = x2 * x1;
      scalar_t poly_norm_result =
          x1 * w2_inv_std1 + x2 * w1_inv_std2 + x3 * w0_inv_std3 + bias_reg;
      y_vec.data[i] = poly_norm_result * m;
    }
    output_vec[vec_offset + idx] = y_vec;
  }
}

template <typename scalar_t, typename acc_t, int width>
__global__ std::enable_if_t<(width == 0)>
fused_mul_poly_norm_kernel(scalar_t *__restrict__ out,          // [..., d]
                           const scalar_t *__restrict__ input,  // [..., d]
                           const scalar_t *__restrict__ mul,    // [..., d]
                           const scalar_t *__restrict__ weight, // [3]
                           const scalar_t *__restrict__ bias,   // [1]
                           const float eps, const int d) {
  const int64_t token_idx = blockIdx.x;

  acc_t sum2 = 0.0f;
  acc_t sum4 = 0.0f;
  acc_t sum6 = 0.0f;

  for (int64_t idx = threadIdx.x; idx < d; idx += blockDim.x) {
    acc_t x1 = input[token_idx * d + idx];
    acc_t x2 = x1 * x1;
    acc_t x4 = x2 * x2;
    acc_t x6 = x4 * x2;

    sum2 += x2;
    sum4 += x4;
    sum6 += x6;
  }

  using BlockReduce = cub::BlockReduce<float3, 1024>;
  __shared__ typename BlockReduce::TempStorage reduceStore;

  float3 thread_sums = make_float3(sum2, sum4, sum6);
  float3 block_sums =
      BlockReduce(reduceStore).Reduce(thread_sums, SumOp{}, blockDim.x);

  sum2 = block_sums.x;
  sum4 = block_sums.y;
  sum6 = block_sums.z;

  __shared__ acc_t s_bias;

  __shared__ acc_t s_w2_inv_std1;
  __shared__ acc_t s_w1_inv_std2;
  __shared__ acc_t s_w0_inv_std3;

  if (threadIdx.x == 0) {
    acc_t w0 = weight[0];
    acc_t w1 = weight[1];
    acc_t w2 = weight[2];
    s_bias = bias[0];

    s_w2_inv_std1 = rsqrtf(sum2 / d + eps) * w2;
    s_w1_inv_std2 = rsqrtf(sum4 / d + eps) * w1;
    s_w0_inv_std3 = rsqrtf(sum6 / d + eps) * w0;
  }
  __syncthreads();

  acc_t w2_inv_std1 = s_w2_inv_std1;
  acc_t w1_inv_std2 = s_w1_inv_std2;
  acc_t w0_inv_std3 = s_w0_inv_std3;
  acc_t bias_reg = s_bias;

  for (int64_t idx = threadIdx.x; idx < d; idx += blockDim.x) {
    acc_t x1 = input[token_idx * d + idx];
    scalar_t m = mul[token_idx * d + idx];
    acc_t x2 = x1 * x1;
    acc_t x3 = x2 * x1;
    scalar_t poly_norm_result =
        x1 * w2_inv_std1 + x2 * w1_inv_std2 + x3 * w0_inv_std3 + bias_reg;
    out[token_idx * d + idx] = poly_norm_result * m;
  }
}

template <typename scalar_t, typename acc_t, int width>
__global__ std::enable_if_t<(width > 0)> fused_mul_poly_norm_backward_kernel(
    scalar_t *__restrict__ input_grad,        // [..., d]
    scalar_t *__restrict__ mul_grad,          // [..., d]
    acc_t *__restrict__ temp_weight_grad,     // [..., 3]
    acc_t *__restrict__ temp_bias_grad,       // [..., 1]
    const scalar_t *__restrict__ output_grad, // [..., d]
    const scalar_t *__restrict__ input,       // [..., d]
    const scalar_t *__restrict__ mul,         // [..., d]
    const scalar_t *__restrict__ weight,      // [3]
    const scalar_t *__restrict__ bias,        // [1]
    const float eps, const int d) {
  using vec_t = type_vec_t<scalar_t, width>;

  const int vec_d = d / width;
  const int64_t vec_offset = blockIdx.x * vec_d;
  const vec_t *__restrict__ input_vec = reinterpret_cast<const vec_t *>(input);
  const vec_t *__restrict__ mul_vec = reinterpret_cast<const vec_t *>(mul);
  const vec_t *__restrict__ output_grad_vec =
      reinterpret_cast<const vec_t *>(output_grad);

  acc_t sum2 = 0.0f;
  acc_t sum4 = 0.0f;
  acc_t sum6 = 0.0f;

  acc_t sum_dx1 = 0.0f;
  acc_t sum_dx2 = 0.0f;
  acc_t sum_dx3 = 0.0f;

  for (int64_t idx = threadIdx.x; idx < vec_d; idx += blockDim.x) {
    vec_t x_vec = input_vec[vec_offset + idx];
    vec_t dy_fused_vec = output_grad_vec[vec_offset + idx];
    vec_t m_vec = mul_vec[vec_offset + idx];

#pragma unroll
    for (int i = 0; i < width; ++i) {
      acc_t x1 = x_vec.data[i];
      acc_t x2 = x1 * x1;
      acc_t x3 = x2 * x1;
      acc_t x4 = x2 * x2;
      acc_t x6 = x3 * x3;

      sum2 += x2;
      sum4 += x4;
      sum6 += x6;

      acc_t dy = dy_fused_vec.data[i] * m_vec.data[i];

      sum_dx1 += dy * x1;
      sum_dx2 += dy * x2;
      sum_dx3 += dy * x3;
    }
  }

  using BlockReduce = cub::BlockReduce<float3, 1024>;
  __shared__ typename BlockReduce::TempStorage reduceStore;

  float3 thread_sums = make_float3(sum2, sum4, sum6);
  float3 block_sums =
      BlockReduce(reduceStore).Reduce(thread_sums, SumOp{}, blockDim.x);

  sum2 = block_sums.x;
  sum4 = block_sums.y;
  sum6 = block_sums.z;

  float3 thread_dxs = make_float3(sum_dx1, sum_dx2, sum_dx3);
  __syncthreads();
  float3 block_sum_dxs =
      BlockReduce(reduceStore).Reduce(thread_dxs, SumOp{}, blockDim.x);

  sum_dx1 = block_sum_dxs.x;
  sum_dx2 = block_sum_dxs.y;
  sum_dx3 = block_sum_dxs.z;

  __shared__ acc_t s_mean2;
  __shared__ acc_t s_mean4;
  __shared__ acc_t s_mean6;
  __shared__ acc_t s_sdx1;
  __shared__ acc_t s_sdx2;
  __shared__ acc_t s_sdx3;

  const acc_t inv_d = acc_t(1) / d;

  if (threadIdx.x == 0) {
    s_mean2 = sum2 * inv_d + eps;
    s_mean4 = sum4 * inv_d + eps;
    s_mean6 = sum6 * inv_d + eps;

    s_sdx1 = sum_dx1 * inv_d;
    s_sdx2 = sum_dx2 * inv_d;
    s_sdx3 = sum_dx3 * inv_d;
  }
  __syncthreads();

  acc_t w0 = weight[0];
  acc_t w1 = weight[1];
  acc_t w2 = weight[2];
  acc_t bias_reg = bias[0];

  acc_t mean2 = s_mean2;
  acc_t mean4 = s_mean4;
  acc_t mean6 = s_mean6;
  acc_t sdx1 = s_sdx1;
  acc_t sdx2 = s_sdx2;
  acc_t sdx3 = s_sdx3;

  acc_t inv_std1 = rsqrtf(mean2);
  acc_t inv_std2 = rsqrtf(mean4);
  acc_t inv_std3 = rsqrtf(mean6);

  acc_t w2_inv_std1 = inv_std1 * w2;
  acc_t w1_inv_std2 = inv_std2 * w1;
  acc_t w0_inv_std3 = inv_std3 * w0;

  // inv_std / mean == powf(mean, -1.5)
  acc_t c1 = w2_inv_std1 / mean2;
  acc_t c2 = acc_t(2) * w1_inv_std2 / mean4;
  acc_t c3 = acc_t(3) * w0_inv_std3 / mean6;

  acc_t sum_dy = 0;
  acc_t sum_dw0 = 0;
  acc_t sum_dw1 = 0;
  acc_t sum_dw2 = 0;

  vec_t *__restrict__ input_grad_vec = reinterpret_cast<vec_t *>(input_grad);
  vec_t *__restrict__ mul_grad_vec = reinterpret_cast<vec_t *>(mul_grad);

  for (int64_t idx = threadIdx.x; idx < vec_d; idx += blockDim.x) {
    vec_t x_vec = input_vec[vec_offset + idx];
    vec_t dy_fused_vec = output_grad_vec[vec_offset + idx];
    vec_t m_vec = mul_vec[vec_offset + idx];
    vec_t dx_vec;
    vec_t dm_vec;

#pragma unroll
    for (int i = 0; i < width; ++i) {
      acc_t x1 = x_vec.data[i];
      acc_t x2 = x1 * x1;
      acc_t x3 = x2 * x1;
      acc_t dy = dy_fused_vec.data[i] * m_vec.data[i];

      // For register optimization, the order of the following logic matters.
      // The input_grad related logic must be placed at the very end.
      sum_dy += dy;
      sum_dw0 += dy * (x3 * inv_std3);
      sum_dw1 += dy * (x2 * inv_std2);
      sum_dw2 += dy * (x1 * inv_std1);

      if (mul_grad) {
        scalar_t poly_norm_result =
            x1 * w2_inv_std1 + x2 * w1_inv_std2 + x3 * w0_inv_std3 + bias_reg;
        dm_vec.data[i] = poly_norm_result * dy_fused_vec.data[i];
      }

      if (input_grad) {
        acc_t dx3 = c3 * x2 * (dy * mean6 - x3 * sdx3);
        acc_t dx2 = c2 * x1 * (dy * mean4 - x2 * sdx2);
        acc_t dx1 = c1 * (dy * mean2 - x1 * sdx1);
        dx_vec.data[i] = dx1 + dx2 + dx3;
      }
    }

    if (input_grad) {
      input_grad_vec[vec_offset + idx] = dx_vec;
    }
    if (mul_grad) {
      mul_grad_vec[vec_offset + idx] = dm_vec;
    }
  }

  using BlockReduce4 = cub::BlockReduce<float4, 1024>;
  __shared__ typename BlockReduce4::TempStorage reduceStore4;

  float4 thread_sum_ds = make_float4(sum_dy, sum_dw0, sum_dw1, sum_dw2);
  float4 block_sum_ds =
      BlockReduce4(reduceStore4).Reduce(thread_sum_ds, SumOp4{}, blockDim.x);

  sum_dy = block_sum_ds.x;
  sum_dw0 = block_sum_ds.y;
  sum_dw1 = block_sum_ds.z;
  sum_dw2 = block_sum_ds.w;

  if (threadIdx.x == 0) {
    temp_bias_grad[blockIdx.x] = sum_dy;
    temp_weight_grad[blockIdx.x * 3 + 0] = sum_dw0;
    temp_weight_grad[blockIdx.x * 3 + 1] = sum_dw1;
    temp_weight_grad[blockIdx.x * 3 + 2] = sum_dw2;
  }
}

template <typename scalar_t, typename acc_t, int width>
__global__ std::enable_if_t<(width == 0)> fused_mul_poly_norm_backward_kernel(
    scalar_t *__restrict__ input_grad,        // [..., d]
    scalar_t *__restrict__ mul_grad,          // [..., d]
    acc_t *__restrict__ temp_weight_grad,     // [..., 3]
    acc_t *__restrict__ temp_bias_grad,       // [..., 1]
    const scalar_t *__restrict__ output_grad, // [..., d]
    const scalar_t *__restrict__ input,       // [..., d]
    const scalar_t *__restrict__ mul,         // [..., d]
    const scalar_t *__restrict__ weight,      // [3]
    const scalar_t *__restrict__ bias,        // [1]
    const float eps, const int d) {
  const int64_t token_idx = blockIdx.x;

  acc_t sum2 = 0.0f;
  acc_t sum4 = 0.0f;
  acc_t sum6 = 0.0f;

  acc_t sum_dx1 = 0.0f;
  acc_t sum_dx2 = 0.0f;
  acc_t sum_dx3 = 0.0f;

  for (int64_t idx = threadIdx.x; idx < d; idx += blockDim.x) {
    acc_t dy = output_grad[token_idx * d + idx] * mul[token_idx * d + idx];

    acc_t x1 = input[token_idx * d + idx];
    acc_t x2 = x1 * x1;
    acc_t x3 = x2 * x1;
    acc_t x4 = x2 * x2;
    acc_t x6 = x3 * x3;

    sum2 += x2;
    sum4 += x4;
    sum6 += x6;

    sum_dx1 += dy * x1;
    sum_dx2 += dy * x2;
    sum_dx3 += dy * x3;
  }

  using BlockReduce = cub::BlockReduce<float3, 1024>;
  __shared__ typename BlockReduce::TempStorage reduceStore;

  float3 thread_sums = make_float3(sum2, sum4, sum6);
  float3 block_sums =
      BlockReduce(reduceStore).Reduce(thread_sums, SumOp{}, blockDim.x);

  sum2 = block_sums.x;
  sum4 = block_sums.y;
  sum6 = block_sums.z;

  float3 thread_dxs = make_float3(sum_dx1, sum_dx2, sum_dx3);
  __syncthreads();
  float3 block_sum_dxs =
      BlockReduce(reduceStore).Reduce(thread_dxs, SumOp{}, blockDim.x);

  sum_dx1 = block_sum_dxs.x;
  sum_dx2 = block_sum_dxs.y;
  sum_dx3 = block_sum_dxs.z;

  __shared__ acc_t s_mean2;
  __shared__ acc_t s_mean4;
  __shared__ acc_t s_mean6;
  __shared__ acc_t s_sdx1;
  __shared__ acc_t s_sdx2;
  __shared__ acc_t s_sdx3;

  const acc_t inv_d = acc_t(1) / d;

  if (threadIdx.x == 0) {
    s_mean2 = sum2 * inv_d + eps;
    s_mean4 = sum4 * inv_d + eps;
    s_mean6 = sum6 * inv_d + eps;

    s_sdx1 = sum_dx1 * inv_d;
    s_sdx2 = sum_dx2 * inv_d;
    s_sdx3 = sum_dx3 * inv_d;
  }
  __syncthreads();

  acc_t w0 = weight[0];
  acc_t w1 = weight[1];
  acc_t w2 = weight[2];
  acc_t bias_reg = bias[0];

  acc_t mean2 = s_mean2;
  acc_t mean4 = s_mean4;
  acc_t mean6 = s_mean6;
  acc_t sdx1 = s_sdx1;
  acc_t sdx2 = s_sdx2;
  acc_t sdx3 = s_sdx3;

  acc_t inv_std1 = rsqrtf(mean2);
  acc_t inv_std2 = rsqrtf(mean4);
  acc_t inv_std3 = rsqrtf(mean6);

  acc_t w2_inv_std1 = inv_std1 * w2;
  acc_t w1_inv_std2 = inv_std2 * w1;
  acc_t w0_inv_std3 = inv_std3 * w0;

  // inv_std / mean == powf(mean, -1.5)
  acc_t c1 = w2_inv_std1 / mean2;
  acc_t c2 = acc_t(2) * w1_inv_std2 / mean4;
  acc_t c3 = acc_t(3) * w0_inv_std3 / mean6;

  acc_t sum_dy = 0;
  acc_t sum_dw0 = 0;
  acc_t sum_dw1 = 0;
  acc_t sum_dw2 = 0;

  for (int64_t idx = threadIdx.x; idx < d; idx += blockDim.x) {
    scalar_t dy_fused = output_grad[token_idx * d + idx];
    acc_t dy = dy_fused * mul[token_idx * d + idx];
    acc_t x1 = input[token_idx * d + idx];
    acc_t x2 = x1 * x1;
    acc_t x3 = x2 * x1;

    if (input_grad) {
      acc_t dx3 = c3 * x2 * (dy * mean6 - x3 * sdx3);
      acc_t dx2 = c2 * x1 * (dy * mean4 - x2 * sdx2);
      acc_t dx1 = c1 * (dy * mean2 - x1 * sdx1);
      input_grad[token_idx * d + idx] = dx1 + dx2 + dx3;
    }

    if (mul_grad) {
      scalar_t poly_norm_result =
          x1 * w2_inv_std1 + x2 * w1_inv_std2 + x3 * w0_inv_std3 + bias_reg;
      mul_grad[token_idx * d + idx] = poly_norm_result * dy_fused;
    }

    sum_dy += dy;
    sum_dw0 += dy * (x3 * inv_std3);
    sum_dw1 += dy * (x2 * inv_std2);
    sum_dw2 += dy * (x1 * inv_std1);
  }

  using BlockReduce4 = cub::BlockReduce<float4, 1024>;
  __shared__ typename BlockReduce4::TempStorage reduceStore4;

  float4 thread_sum_ds = make_float4(sum_dy, sum_dw0, sum_dw1, sum_dw2);
  float4 block_sum_ds =
      BlockReduce4(reduceStore4).Reduce(thread_sum_ds, SumOp4{}, blockDim.x);

  sum_dy = block_sum_ds.x;
  sum_dw0 = block_sum_ds.y;
  sum_dw1 = block_sum_ds.z;
  sum_dw2 = block_sum_ds.w;

  if (threadIdx.x == 0) {
    temp_bias_grad[token_idx] = sum_dy;
    temp_weight_grad[token_idx * 3 + 0] = sum_dw0;
    temp_weight_grad[token_idx * 3 + 1] = sum_dw1;
    temp_weight_grad[token_idx * 3 + 2] = sum_dw2;
  }
}

} // namespace motif

#define LAUNCH_FUSED_MUL_POLY_NORM(width)                                      \
  MOTIF_DISPATCH_FLOATING_TYPES(                                               \
      input.scalar_type(), "fused_mul_poly_norm_kernel", [&] {                 \
        motif::fused_mul_poly_norm_kernel<scalar_t, float, width>              \
            <<<grid, block, 0, stream>>>(                                      \
                out.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(),          \
                mul.data_ptr<scalar_t>(), weight.data_ptr<scalar_t>(),         \
                bias.data_ptr<scalar_t>(), eps, d);                            \
      });

void fused_mul_poly_norm(torch::Tensor &out,          // [..., d]
                         const torch::Tensor &input,  // [..., d]
                         const torch::Tensor &mul,    // [..., d]
                         const torch::Tensor &weight, // [3]
                         const torch::Tensor &bias,   // [1]
                         double eps) {
  AssertTensorShapeEqual(input, out, "input", "out");
  AssertTensorShapeEqual(input, mul, "input", "mul");
  AssertTensorNotNull(weight, "weight");
  AssertTensorNotNull(bias, "bias");
  // TODO shape check

  int d = input.size(-1);
  int64_t num_tokens = input.numel() / d;
  dim3 grid(num_tokens);
  const int max_block_size = (num_tokens < 256) ? 1024 : 256;
  dim3 block(std::min(d, max_block_size));

  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  if (d % 8 == 0) {
    LAUNCH_FUSED_MUL_POLY_NORM(8);
  } else {
    LAUNCH_FUSED_MUL_POLY_NORM(0);
  }
}

#define LAUNCH_POLY_NORM_BACKWARD(width)                                       \
  MOTIF_DISPATCH_FLOATING_TYPES(                                               \
      input.scalar_type(), "fused_mul_poly_norm_backward_kernel", [&] {        \
        motif::fused_mul_poly_norm_backward_kernel<scalar_t, float, width>     \
            <<<grid, block, 0, stream>>>(                                      \
                input_grad.data_ptr<scalar_t>(),                               \
                mul_grad.data_ptr<scalar_t>(),                                 \
                temp_weight_grad.data_ptr<float>(),                            \
                temp_bias_grad.data_ptr<float>(),                              \
                output_grad.data_ptr<scalar_t>(), input.data_ptr<scalar_t>(),  \
                mul.data_ptr<scalar_t>(), weight.data_ptr<scalar_t>(),         \
                bias.data_ptr<scalar_t>(), eps, d);                            \
      });

void fused_mul_poly_norm_backward(torch::Tensor &input_grad,        // [..., d]
                                  torch::Tensor &mul_grad,          // [..., d]
                                  torch::Tensor &weight_grad,       // [3]
                                  torch::Tensor &bias_grad,         // [1]
                                  const torch::Tensor &output_grad, // [..., d]
                                  const torch::Tensor &input,       // [..., d]
                                  const torch::Tensor &mul,         // [..., d]
                                  const torch::Tensor &weight,      // [3]
                                  const torch::Tensor &bias,        // [1]
                                  double eps) {
  AssertTensorShapeEqual(input, input_grad, "input", "input_grad");
  AssertTensorShapeEqual(input, output_grad, "input", "output_grad");
  AssertTensorShapeEqual(input, mul_grad, "input", "mul_grad");
  AssertTensorShapeEqual(input, mul, "input", "mul");
  AssertTensorNotNull(weight, "weight");
  // TODO shape check
  // weight_grad, bias_grad, mul_grad and input_grad can be nullable

  int d = input.size(-1);
  int64_t num_tokens = input.numel() / d;
  dim3 grid(num_tokens);
  const int max_block_size = (num_tokens < 256) ? 1024 : 256;
  dim3 block(std::min(d, max_block_size));

  torch::Tensor temp_weight_grad =
      torch::empty({num_tokens, 3}, input.options().dtype(torch::kFloat));
  torch::Tensor temp_bias_grad =
      torch::empty({num_tokens, 1}, output_grad.options().dtype(torch::kFloat));

  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  if (d % 8 == 0 && input.element_size() == 2) {
    LAUNCH_POLY_NORM_BACKWARD(8);
  } else if (d % 4 == 0 && input.element_size() == 4) {
    LAUNCH_POLY_NORM_BACKWARD(4);
  } else {
    LAUNCH_POLY_NORM_BACKWARD(0);
  }

  if (bias_grad.defined()) {
    torch::Tensor acc = torch::empty_like(bias_grad, temp_bias_grad.options());
    at::sum_out(acc, temp_bias_grad, {0});
    bias_grad.copy_(acc);
  }

  if (weight_grad.defined()) {
    torch::Tensor acc =
        torch::empty_like(weight_grad, temp_weight_grad.options());
    at::sum_out(acc, temp_weight_grad, {0});
    weight_grad.copy_(acc);
  }
}
activation/poly_norm.cu
CHANGED
|
@@ -7,7 +7,6 @@
|
|
| 7 |
|
| 8 |
#include "assert_utils.h"
|
| 9 |
#include "atomic_utils.h"
|
| 10 |
-
#include "block_reduce.h"
|
| 11 |
#include "cuda_compat.h"
|
| 12 |
#include "dispatch_utils.h"
|
| 13 |
|
|
@@ -17,6 +16,18 @@ template <typename type, int N> struct alignas(sizeof(type) * N) type_vec_t {
|
|
| 17 |
type data[N];
|
| 18 |
};
|
| 19 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 20 |
template <typename scalar_t, typename acc_t, int width>
|
| 21 |
__global__ std::enable_if_t<(width > 0)>
|
| 22 |
poly_norm_kernel(scalar_t *__restrict__ out, // [..., d]
|
|
@@ -39,7 +50,7 @@ poly_norm_kernel(scalar_t *__restrict__ out, // [..., d]
|
|
| 39 |
|
| 40 |
#pragma unroll
|
| 41 |
for (int i = 0; i < width; ++i) {
|
| 42 |
-
acc_t x1 =
|
| 43 |
acc_t x2 = x1 * x1;
|
| 44 |
acc_t x4 = x2 * x2;
|
| 45 |
acc_t x6 = x4 * x2;
|
|
@@ -50,14 +61,16 @@ poly_norm_kernel(scalar_t *__restrict__ out, // [..., d]
|
|
| 50 |
}
|
| 51 |
}
|
| 52 |
|
| 53 |
-
using BlockReduce = cub::BlockReduce<
|
| 54 |
__shared__ typename BlockReduce::TempStorage reduceStore;
|
| 55 |
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
|
|
|
|
|
|
| 61 |
|
| 62 |
__shared__ acc_t s_bias;
|
| 63 |
|
|
@@ -90,14 +103,12 @@ poly_norm_kernel(scalar_t *__restrict__ out, // [..., d]
|
|
| 90 |
|
| 91 |
#pragma unroll
|
| 92 |
for (int i = 0; i < width; ++i) {
|
| 93 |
-
acc_t x1 =
|
| 94 |
acc_t x2 = x1 * x1;
|
| 95 |
acc_t x3 = x2 * x1;
|
| 96 |
|
| 97 |
-
|
| 98 |
x1 * w2_inv_std1 + x2 * w1_inv_std2 + x3 * w0_inv_std3 + bias_reg;
|
| 99 |
-
|
| 100 |
-
y_vec.data[i] = static_cast<scalar_t>(y);
|
| 101 |
}
|
| 102 |
output_vec[vec_offset + idx] = y_vec;
|
| 103 |
}
|
|
@@ -127,14 +138,16 @@ poly_norm_kernel(scalar_t *__restrict__ out, // [..., d]
|
|
| 127 |
sum6 += x6;
|
| 128 |
}
|
| 129 |
|
| 130 |
-
using BlockReduce = cub::BlockReduce<
|
| 131 |
__shared__ typename BlockReduce::TempStorage reduceStore;
|
| 132 |
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
|
|
|
|
|
|
| 138 |
|
| 139 |
__shared__ acc_t s_bias;
|
| 140 |
|
|
@@ -199,7 +212,7 @@ poly_norm_backward_kernel(scalar_t *__restrict__ input_grad, // [..., d]
|
|
| 199 |
|
| 200 |
#pragma unroll
|
| 201 |
for (int i = 0; i < width; ++i) {
|
| 202 |
-
acc_t x1 =
|
| 203 |
acc_t x2 = x1 * x1;
|
| 204 |
acc_t x3 = x2 * x1;
|
| 205 |
acc_t x4 = x2 * x2;
|
|
@@ -209,7 +222,7 @@ poly_norm_backward_kernel(scalar_t *__restrict__ input_grad, // [..., d]
|
|
| 209 |
sum4 += x4;
|
| 210 |
sum6 += x6;
|
| 211 |
|
| 212 |
-
acc_t dy =
|
| 213 |
|
| 214 |
sum_dx1 += dy * x1;
|
| 215 |
sum_dx2 += dy * x2;
|
|
@@ -217,22 +230,25 @@ poly_norm_backward_kernel(scalar_t *__restrict__ input_grad, // [..., d]
|
|
| 217 |
}
|
| 218 |
}
|
| 219 |
|
| 220 |
-
using BlockReduce = cub::BlockReduce<
|
| 221 |
__shared__ typename BlockReduce::TempStorage reduceStore;
|
| 222 |
|
| 223 |
-
|
| 224 |
-
|
| 225 |
-
|
| 226 |
-
sum4 = BlockReduce(reduceStore).Sum(sum4, blockDim.x);
|
| 227 |
-
__syncthreads();
|
| 228 |
-
sum6 = BlockReduce(reduceStore).Sum(sum6, blockDim.x);
|
| 229 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 230 |
__syncthreads();
|
| 231 |
-
|
| 232 |
-
|
| 233 |
-
|
| 234 |
-
|
| 235 |
-
|
|
|
|
| 236 |
|
| 237 |
__shared__ acc_t s_mean2;
|
| 238 |
__shared__ acc_t s_mean4;
|
|
@@ -288,16 +304,16 @@ poly_norm_backward_kernel(scalar_t *__restrict__ input_grad, // [..., d]
|
|
| 288 |
|
| 289 |
#pragma unroll
|
| 290 |
for (int i = 0; i < width; ++i) {
|
| 291 |
-
acc_t x1 =
|
| 292 |
acc_t x2 = x1 * x1;
|
| 293 |
acc_t x3 = x2 * x1;
|
| 294 |
-
acc_t dy =
|
| 295 |
|
| 296 |
if (input_grad) {
|
| 297 |
acc_t dx3 = c3 * x2 * (dy * mean6 - x3 * sdx3);
|
| 298 |
acc_t dx2 = c2 * x1 * (dy * mean4 - x2 * sdx2);
|
| 299 |
acc_t dx1 = c1 * (dy * mean2 - x1 * sdx1);
|
| 300 |
-
dx_vec.data[i] =
|
| 301 |
}
|
| 302 |
|
| 303 |
sum_dy += dy;
|
|
@@ -311,13 +327,17 @@ poly_norm_backward_kernel(scalar_t *__restrict__ input_grad, // [..., d]
|
|
| 311 |
}
|
| 312 |
}
|
| 313 |
|
| 314 |
-
|
| 315 |
-
|
| 316 |
-
|
| 317 |
-
|
| 318 |
-
|
| 319 |
-
|
| 320 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 321 |
|
| 322 |
if (threadIdx.x == 0) {
|
| 323 |
temp_bias_grad[blockIdx.x] = sum_dy;
|
|
@@ -364,22 +384,25 @@ poly_norm_backward_kernel(scalar_t *__restrict__ input_grad, // [..., d]
|
|
| 364 |
sum_dx3 += dy * x3;
|
| 365 |
}
|
| 366 |
|
| 367 |
-
using BlockReduce = cub::BlockReduce<
|
| 368 |
__shared__ typename BlockReduce::TempStorage reduceStore;
|
| 369 |
|
| 370 |
-
|
| 371 |
-
|
| 372 |
-
|
| 373 |
-
sum4 = BlockReduce(reduceStore).Sum(sum4, blockDim.x);
|
| 374 |
-
__syncthreads();
|
| 375 |
-
sum6 = BlockReduce(reduceStore).Sum(sum6, blockDim.x);
|
| 376 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 377 |
__syncthreads();
|
| 378 |
-
|
| 379 |
-
|
| 380 |
-
|
| 381 |
-
|
| 382 |
-
|
|
|
|
| 383 |
|
| 384 |
__shared__ acc_t s_mean2;
|
| 385 |
__shared__ acc_t s_mean4;
|
|
@@ -445,13 +468,17 @@ poly_norm_backward_kernel(scalar_t *__restrict__ input_grad, // [..., d]
|
|
| 445 |
sum_dw2 += dy * (x1 * inv_std1);
|
| 446 |
}
|
| 447 |
|
| 448 |
-
|
| 449 |
-
|
| 450 |
-
|
| 451 |
-
|
| 452 |
-
|
| 453 |
-
|
| 454 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 455 |
|
| 456 |
if (threadIdx.x == 0) {
|
| 457 |
temp_bias_grad[token_idx] = sum_dy;
|
|
|
|
| 7 |
|
| 8 |
#include "assert_utils.h"
|
| 9 |
#include "atomic_utils.h"
|
|
|
|
| 10 |
#include "cuda_compat.h"
|
| 11 |
#include "dispatch_utils.h"
|
| 12 |
|
|
|
|
| 16 |
type data[N];
|
| 17 |
};
|
| 18 |
|
| 19 |
+
struct SumOp {
|
| 20 |
+
__device__ float3 operator()(const float3 &a, const float3 &b) const {
|
| 21 |
+
return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
|
| 22 |
+
}
|
| 23 |
+
};
|
| 24 |
+
|
| 25 |
+
struct SumOp4 {
|
| 26 |
+
__device__ float4 operator()(const float4 &a, const float4 &b) const {
|
| 27 |
+
return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
|
| 28 |
+
}
|
| 29 |
+
};
|
| 30 |
+
|
| 31 |
template <typename scalar_t, typename acc_t, int width>
__global__ std::enable_if_t<(width > 0)>
poly_norm_kernel(scalar_t *__restrict__ out, // [..., d]
...
#pragma unroll
    for (int i = 0; i < width; ++i) {
+     acc_t x1 = x_vec.data[i];
      acc_t x2 = x1 * x1;
      acc_t x4 = x2 * x2;
      acc_t x6 = x4 * x2;
...
    }
  }

+ using BlockReduce = cub::BlockReduce<float3, 1024>;
  __shared__ typename BlockReduce::TempStorage reduceStore;

+ float3 thread_sums = make_float3(sum2, sum4, sum6);
+ float3 block_sums =
+     BlockReduce(reduceStore).Reduce(thread_sums, SumOp{}, blockDim.x);
+
+ sum2 = block_sums.x;
+ sum4 = block_sums.y;
+ sum6 = block_sums.z;

  __shared__ acc_t s_bias;
...
#pragma unroll
    for (int i = 0; i < width; ++i) {
+     acc_t x1 = x_vec.data[i];
      acc_t x2 = x1 * x1;
      acc_t x3 = x2 * x1;

+     y_vec.data[i] =
          x1 * w2_inv_std1 + x2 * w1_inv_std2 + x3 * w0_inv_std3 + bias_reg;
    }
    output_vec[vec_offset + idx] = y_vec;
  }
...
      sum6 += x6;
    }

+ using BlockReduce = cub::BlockReduce<float3, 1024>;
  __shared__ typename BlockReduce::TempStorage reduceStore;

+ float3 thread_sums = make_float3(sum2, sum4, sum6);
+ float3 block_sums =
+     BlockReduce(reduceStore).Reduce(thread_sums, SumOp{}, blockDim.x);
+
+ sum2 = block_sums.x;
+ sum4 = block_sums.y;
+ sum6 = block_sums.z;

  __shared__ acc_t s_bias;
...
#pragma unroll
    for (int i = 0; i < width; ++i) {
+     acc_t x1 = x_vec.data[i];
      acc_t x2 = x1 * x1;
      acc_t x3 = x2 * x1;
      acc_t x4 = x2 * x2;
...
      sum4 += x4;
      sum6 += x6;

+     acc_t dy = dy_vec.data[i];

      sum_dx1 += dy * x1;
      sum_dx2 += dy * x2;
...
    }
  }

+ using BlockReduce = cub::BlockReduce<float3, 1024>;
  __shared__ typename BlockReduce::TempStorage reduceStore;

+ float3 thread_sums = make_float3(sum2, sum4, sum6);
+ float3 block_sums =
+     BlockReduce(reduceStore).Reduce(thread_sums, SumOp{}, blockDim.x);
+
+ sum2 = block_sums.x;
+ sum4 = block_sums.y;
+ sum6 = block_sums.z;
+
+ float3 thread_dxs = make_float3(sum_dx1, sum_dx2, sum_dx3);
  __syncthreads();
+ float3 block_sum_dxs =
+     BlockReduce(reduceStore).Reduce(thread_dxs, SumOp{}, blockDim.x);
+
+ sum_dx1 = block_sum_dxs.x;
+ sum_dx2 = block_sum_dxs.y;
+ sum_dx3 = block_sum_dxs.z;

  __shared__ acc_t s_mean2;
  __shared__ acc_t s_mean4;
...
#pragma unroll
    for (int i = 0; i < width; ++i) {
+     acc_t x1 = x_vec.data[i];
      acc_t x2 = x1 * x1;
      acc_t x3 = x2 * x1;
+     acc_t dy = dy_vec.data[i];

      if (input_grad) {
        acc_t dx3 = c3 * x2 * (dy * mean6 - x3 * sdx3);
        acc_t dx2 = c2 * x1 * (dy * mean4 - x2 * sdx2);
        acc_t dx1 = c1 * (dy * mean2 - x1 * sdx1);
+       dx_vec.data[i] = dx1 + dx2 + dx3;
      }

      sum_dy += dy;
...
    }
  }

+ using BlockReduce4 = cub::BlockReduce<float4, 1024>;
+ __shared__ typename BlockReduce4::TempStorage reduceStore4;
+
+ float4 thread_sum_ds = make_float4(sum_dy, sum_dw0, sum_dw1, sum_dw2);
+ float4 block_sum_ds =
+     BlockReduce4(reduceStore4).Reduce(thread_sum_ds, SumOp4{}, blockDim.x);
+
+ sum_dy = block_sum_ds.x;
+ sum_dw0 = block_sum_ds.y;
+ sum_dw1 = block_sum_ds.z;
+ sum_dw2 = block_sum_ds.w;

  if (threadIdx.x == 0) {
    temp_bias_grad[blockIdx.x] = sum_dy;
...
      sum_dx3 += dy * x3;
    }

+ using BlockReduce = cub::BlockReduce<float3, 1024>;
  __shared__ typename BlockReduce::TempStorage reduceStore;

+ float3 thread_sums = make_float3(sum2, sum4, sum6);
+ float3 block_sums =
+     BlockReduce(reduceStore).Reduce(thread_sums, SumOp{}, blockDim.x);
+
+ sum2 = block_sums.x;
+ sum4 = block_sums.y;
+ sum6 = block_sums.z;
+
+ float3 thread_dxs = make_float3(sum_dx1, sum_dx2, sum_dx3);
  __syncthreads();
+ float3 block_sum_dxs =
+     BlockReduce(reduceStore).Reduce(thread_dxs, SumOp{}, blockDim.x);
+
+ sum_dx1 = block_sum_dxs.x;
+ sum_dx2 = block_sum_dxs.y;
+ sum_dx3 = block_sum_dxs.z;

  __shared__ acc_t s_mean2;
  __shared__ acc_t s_mean4;
...
      sum_dw2 += dy * (x1 * inv_std1);
    }

+ using BlockReduce4 = cub::BlockReduce<float4, 1024>;
+ __shared__ typename BlockReduce4::TempStorage reduceStore4;
+
+ float4 thread_sum_ds = make_float4(sum_dy, sum_dw0, sum_dw1, sum_dw2);
+ float4 block_sum_ds =
+     BlockReduce4(reduceStore4).Reduce(thread_sum_ds, SumOp4{}, blockDim.x);
+
+ sum_dy = block_sum_ds.x;
+ sum_dw0 = block_sum_ds.y;
+ sum_dw1 = block_sum_ds.z;
+ sum_dw2 = block_sum_ds.w;

  if (threadIdx.x == 0) {
    temp_bias_grad[token_idx] = sum_dy;
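The `SumOp{}` and `SumOp4{}` functors used with `cub::BlockReduce` above are defined elsewhere in `poly_norm.cu` and are collapsed out of this view. A minimal standalone sketch of the pattern (the functor body and the demo kernel are mine, not part of the diff): packing the three partial sums into one `float3` lets a single `cub::BlockReduce` pass replace three separate scalar reductions.

```cuda
#include <cub/cub.cuh>

// Elementwise sum over float3; mirrors what SumOp{} must do above.
struct SumOp {
  __device__ float3 operator()(const float3 &a, const float3 &b) const {
    return make_float3(a.x + b.x, a.y + b.y, a.z + b.z);
  }
};

// Hypothetical demo kernel: one fused block-wide reduction of three moments.
__global__ void block_sums_demo(const float *in, float3 *out, int n) {
  using BlockReduce = cub::BlockReduce<float3, 1024>;
  __shared__ typename BlockReduce::TempStorage reduceStore;

  float sum2 = 0.f, sum4 = 0.f, sum6 = 0.f;
  for (int i = threadIdx.x; i < n; i += blockDim.x) {
    float x2 = in[i] * in[i];
    sum2 += x2;
    sum4 += x2 * x2;
    sum6 += x2 * x2 * x2;
  }
  float3 totals = BlockReduce(reduceStore)
                      .Reduce(make_float3(sum2, sum4, sum6), SumOp{}, blockDim.x);
  if (threadIdx.x == 0) *out = totals;  // aggregate is only valid on thread 0
}
```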
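For reference, the function these kernels implement (matching the reference `PolyNorm` module in `benchmarks/cases/poly.py` further down) is, with powers taken elementwise and the mean over the last dimension $d$:

```latex
\mathrm{PolyNorm}(x) = w_0\,\hat{n}(x^3) + w_1\,\hat{n}(x^2) + w_2\,\hat{n}(x) + b,
\qquad
\hat{n}(u) = \frac{u}{\sqrt{\tfrac{1}{d}\sum_{i=1}^{d} u_i^2 + \epsilon}}
```

The `sum2`, `sum4`, `sum6` accumulators above are the block-wide $\sum_i x_i^2$, $\sum_i x_i^4$, $\sum_i x_i^6$ needed for the three normalizers.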
activation/rms_norm.cu
CHANGED
@@ -7,18 +7,76 @@

 #include "assert_utils.h"
 #include "atomic_utils.h"
-#include "block_reduce.h"
 #include "cuda_compat.h"
 #include "dispatch_utils.h"

 namespace motif {

-template <typename scalar_t, typename acc_t, int BLOCK_SIZE>
-__global__ void rms_norm_kernel(scalar_t *__restrict__ out,          // [..., d]
-                                const scalar_t *__restrict__ input,  // [..., d]
-                                const scalar_t *__restrict__ weight, // [d]
-                                const float eps, const int d) {
+template <typename type, int N> struct alignas(sizeof(type) * N) type_vec_t {
+  type data[N];
+};
+
+template <typename scalar_t, typename acc_t, int width>
+__global__ std::enable_if_t<(width > 0)>
+rms_norm_kernel(scalar_t *__restrict__ out,          // [..., d]
+                const scalar_t *__restrict__ input,  // [..., d]
+                const scalar_t *__restrict__ weight, // [d]
+                const float eps, const int d) {
+  using vec_t = type_vec_t<scalar_t, width>;
+
+  const int vec_d = d / width;
+  const int64_t vec_offset = blockIdx.x * vec_d;
+  const vec_t *__restrict__ input_vec = reinterpret_cast<const vec_t *>(input);
+  acc_t sum_square = 0.0f;
+
+  for (int64_t idx = threadIdx.x; idx < vec_d; idx += blockDim.x) {
+    vec_t x_vec = input_vec[vec_offset + idx];
+
+#pragma unroll
+    for (int i = 0; i < width; ++i) {
+      acc_t x = x_vec.data[i];
+      sum_square += x * x;
+    }
+  }
+
+  using BlockReduce = cub::BlockReduce<float, 1024>;
+  __shared__ typename BlockReduce::TempStorage reduceStore;
+
+  sum_square = BlockReduce(reduceStore).Sum(sum_square, blockDim.x);
+
+  __shared__ acc_t s_scale;
+
+  if (threadIdx.x == 0) {
+    s_scale = rsqrtf(sum_square / d + eps);
+  }
+  __syncthreads();
+
+  const vec_t *__restrict__ weight_vec =
+      reinterpret_cast<const vec_t *>(weight);
+  vec_t *__restrict__ output_vec = reinterpret_cast<vec_t *>(out);
+
+  for (int64_t idx = threadIdx.x; idx < vec_d; idx += blockDim.x) {
+    vec_t x_vec = input_vec[vec_offset + idx];
+    vec_t w_vec = weight_vec[idx];
+    vec_t y_vec;
+
+#pragma unroll
+    for (int i = 0; i < width; ++i) {
+      acc_t x = x_vec.data[i];
+      acc_t w = w_vec.data[i];
+
+      y_vec.data[i] = w * x * s_scale;
+    }
+    output_vec[vec_offset + idx] = y_vec;
+  }
+}
+
+template <typename scalar_t, typename acc_t, int width>
+__global__ std::enable_if_t<(width == 0)>
+rms_norm_kernel(scalar_t *__restrict__ out,          // [..., d]
+                const scalar_t *__restrict__ input,  // [..., d]
+                const scalar_t *__restrict__ weight, // [d]
+                const float eps, const int d) {
   const int64_t token_idx = blockIdx.x;
   const int64_t vec_idx = threadIdx.x;
   acc_t sum_square = 0.0f;

@@ -28,20 +86,123 @@ __global__ void rms_norm_kernel(scalar_t *__restrict__ out, // [..., d]
     sum_square += x * x;
   }

-  acc_t variance =
-      _block_reduce_sum<acc_t, BLOCK_SIZE>(shared, sum_square, d) / d;
-  acc_t scale = rsqrt(variance + eps);
+  using BlockReduce = cub::BlockReduce<float, 1024>;
+  __shared__ typename BlockReduce::TempStorage reduceStore;
+
+  sum_square = BlockReduce(reduceStore).Sum(sum_square, blockDim.x);
+
+  __shared__ acc_t s_scale;
+
+  if (vec_idx == 0) {
+    s_scale = rsqrtf(sum_square / d + eps);
+  }
+  __syncthreads();
+
   for (int64_t idx = vec_idx; idx < d; idx += blockDim.x) {
     acc_t x = input[token_idx * d + idx];
     acc_t w = weight[idx];
-    out[token_idx * d + idx] = w * x * scale;
+    out[token_idx * d + idx] = w * x * s_scale;
   }
 }
+
+template <typename scalar_t, typename acc_t, int width>
+__global__ std::enable_if_t<(width > 0)>
+rms_norm_backward_kernel(scalar_t *__restrict__ input_grad,        // [..., d]
+                         acc_t *__restrict__ temp_weight_grad,     // [..., d]
+                         const scalar_t *__restrict__ output_grad, // [..., d]
+                         const scalar_t *__restrict__ input,       // [..., d]
+                         const scalar_t *__restrict__ weight,      // [d]
+                         const float eps, const int d) {
+  using vec_t = type_vec_t<scalar_t, width>;
+  using dw_vec_t = type_vec_t<acc_t, width>;
+
+  const int64_t token_idx = blockIdx.x;
+  const int64_t vec_idx = threadIdx.x;
+
+  const int vec_d = d / width;
+  const int64_t vec_offset = token_idx * vec_d;
+
+  const vec_t *__restrict__ input_vec = reinterpret_cast<const vec_t *>(input);
+  const vec_t *__restrict__ output_grad_vec =
+      reinterpret_cast<const vec_t *>(output_grad);
+  const vec_t *__restrict__ weight_vec =
+      reinterpret_cast<const vec_t *>(weight);
+
+  acc_t d_sum = 0.0f;
+  acc_t sum_square = 0.0f;
+
+  for (int64_t vidx = vec_idx; vidx < vec_d; vidx += blockDim.x) {
+    vec_t x_vec = input_vec[vec_offset + vidx];
+    vec_t dy_vec = output_grad_vec[vec_offset + vidx];
+    vec_t w_vec = weight_vec[vidx];
+
+#pragma unroll
+    for (int i = 0; i < width; ++i) {
+      acc_t x = x_vec.data[i];
+      acc_t dy = dy_vec.data[i];
+      acc_t w = w_vec.data[i];
+      d_sum += dy * x * w;
+      sum_square += x * x;
+    }
+  }
+
+  using BlockReduce = cub::BlockReduce<float2, 1024>;
+  __shared__ typename BlockReduce::TempStorage reduceStore;
+  struct SumOp {
+    __device__ float2 operator()(const float2 &a, const float2 &b) const {
+      return make_float2(a.x + b.x, a.y + b.y);
+    }
+  };
+  float2 thread_sums = make_float2(d_sum, sum_square);
+  float2 block_sums =
+      BlockReduce(reduceStore).Reduce(thread_sums, SumOp{}, blockDim.x);
+
+  d_sum = block_sums.x;
+  sum_square = block_sums.y;
+
+  __shared__ acc_t s_scale;
+  __shared__ acc_t s_dxx;
+
+  if (threadIdx.x == 0) {
+    acc_t scale = rsqrtf(sum_square / d + eps);
+    s_dxx = d_sum * scale * scale * scale / d;
+    s_scale = scale;
+  }
+  __syncthreads();
+  acc_t scale = s_scale;
+  acc_t dxx = s_dxx;
+  vec_t *__restrict__ input_grad_vec = reinterpret_cast<vec_t *>(input_grad);
+  dw_vec_t *__restrict__ temp_weight_grad_vec =
+      reinterpret_cast<dw_vec_t *>(temp_weight_grad);
+
+  for (int64_t vidx = vec_idx; vidx < vec_d; vidx += blockDim.x) {
+    vec_t x_vec = input_vec[vec_offset + vidx];
+    vec_t dy_vec = output_grad_vec[vec_offset + vidx];
+    vec_t w_vec = weight_vec[vidx];
+
+    vec_t in_grad_vec;
+    dw_vec_t tw_grad_vec;
+
+#pragma unroll
+    for (int i = 0; i < width; ++i) {
+      acc_t x = x_vec.data[i];
+      acc_t dy = dy_vec.data[i];
+      acc_t w = w_vec.data[i];
+
+      if (input_grad) {
+        in_grad_vec.data[i] = scale * dy * w - dxx * x;
+      }
+      tw_grad_vec.data[i] = dy * x * scale;
+    }
+
+    if (input_grad) {
+      input_grad_vec[vec_offset + vidx] = in_grad_vec;
+    }
+    temp_weight_grad_vec[vec_offset + vidx] = tw_grad_vec;
+  }
+}
+
+template <typename scalar_t, typename acc_t, int width>
+__global__ std::enable_if_t<(width == 0)>
 rms_norm_backward_kernel(scalar_t *__restrict__ input_grad,        // [..., d]
                          acc_t *__restrict__ temp_weight_grad,     // [..., d]
                          const scalar_t *__restrict__ output_grad, // [..., d]

@@ -61,30 +222,55 @@ rms_norm_backward_kernel(scalar_t *__restrict__ input_grad, // [..., d]
     sum_square += x * x;
   }

-  d_sum = ...
-  acc_t ...
-  acc_t ...
+  using BlockReduce = cub::BlockReduce<float2, 1024>;
+  __shared__ typename BlockReduce::TempStorage reduceStore;
+  struct SumOp {
+    __device__ float2 operator()(const float2 &a, const float2 &b) const {
+      return make_float2(a.x + b.x, a.y + b.y);
+    }
+  };
+  float2 thread_sums = make_float2(d_sum, sum_square);
+  float2 block_sums =
+      BlockReduce(reduceStore).Reduce(thread_sums, SumOp{}, blockDim.x);
+
+  d_sum = block_sums.x;
+  sum_square = block_sums.y;
+
+  __shared__ acc_t s_scale;
+  __shared__ acc_t s_dxx;
+
+  if (threadIdx.x == 0) {
+    acc_t scale = rsqrtf(sum_square / d + eps);
+    s_dxx = d_sum * scale * scale * scale / d;
+    s_scale = scale;
+  }
+  __syncthreads();
+
+  acc_t scale = s_scale;
+  acc_t dxx = s_dxx;

   for (int64_t idx = vec_idx; idx < d; idx += blockDim.x) {
     acc_t x = input[token_idx * d + idx];
     acc_t dy = output_grad[token_idx * d + idx];
     acc_t w = weight[idx];

-    input_grad ...
-    if (temp_weight_grad) {
-      temp_weight_grad[token_idx * d + idx] = dy * x * scale;
+    if (input_grad) {
+      input_grad[token_idx * d + idx] = scale * dy * w - dxx * x;
     }
+    temp_weight_grad[token_idx * d + idx] = dy * x * scale;
   }
 }

 } // namespace motif

+#define LAUNCH_RMS_NORM(width)                                                 \
+  MOTIF_DISPATCH_FLOATING_TYPES(input.scalar_type(), "rms_norm_kernel", [&] {  \
+    motif::rms_norm_kernel<scalar_t, float, width>                             \
+        <<<grid, block, 0, stream>>>(out.data_ptr<scalar_t>(),                 \
+                                     input.data_ptr<scalar_t>(),               \
+                                     weight.data_ptr<scalar_t>(), eps, d);     \
+  });
+
 void rms_norm(torch::Tensor &out,          // [..., d]
              const torch::Tensor &input,   // [..., d]
              const torch::Tensor &weight,  // [d]

@@ -93,27 +279,36 @@ void rms_norm(torch::Tensor &out, // [..., d]
   AssertTensorNotNull(weight, "weight");
   // TODO shape check

-  constexpr int BLOCK_SIZE = 256;
-
   int d = input.size(-1);
   int64_t num_tokens = input.numel() / input.size(-1);
   dim3 grid(num_tokens);
+  const int max_block_size = (num_tokens < 256) ? 1024 : 256;
+  dim3 block(std::min(d, max_block_size));

   const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
   const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
-  MOTIF_DISPATCH_FLOATING_TYPES(...
-  });
+  if (d % 8 == 0) {
+    LAUNCH_RMS_NORM(8);
+  } else {
+    LAUNCH_RMS_NORM(0);
+  }
 }

+#define LAUNCH_RMS_NORM_BWD(width)                                             \
+  MOTIF_DISPATCH_FLOATING_TYPES(                                               \
+      input.scalar_type(), "rms_norm_backward_kernel", [&] {                   \
+        motif::rms_norm_backward_kernel<scalar_t, float, width>                \
+            <<<grid, block, 0, stream>>>(input_grad.data_ptr<scalar_t>(),      \
+                                         temp_weight_grad.data_ptr<float>(),   \
+                                         output_grad.data_ptr<scalar_t>(),     \
+                                         input.data_ptr<scalar_t>(),           \
+                                         weight.data_ptr<scalar_t>(), eps, d); \
+      });
+
 void rms_norm_backward(torch::Tensor &input_grad,         // [..., d]
-                       torch::Tensor &weight_grad,        // [...
-                       const torch::Tensor &output_grad,  // [d]
-                       const torch::Tensor &input,        // [d]
+                       torch::Tensor &weight_grad,        // [d]
+                       const torch::Tensor &output_grad,  // [..., d]
+                       const torch::Tensor &input,        // [..., d]
                        const torch::Tensor &weight,       // [d]
                        double eps) {
   AssertTensorShapeEqual(input, input_grad, "input", "input_grad");

@@ -122,30 +317,27 @@ void rms_norm_backward(torch::Tensor &input_grad, // [..., d]
   // TODO shape check
   // weight_grad, input_grad can be nullable

-  constexpr int BLOCK_SIZE = 256;
-
   int d = input.size(-1);
   int64_t num_tokens = input.numel() / input.size(-1);
   dim3 grid(num_tokens);
+  const int max_block_size = (num_tokens < 256) ? 1024 : 256;
+  dim3 block(std::min(d, max_block_size));

   torch::Tensor temp_weight_grad =
       torch::empty({num_tokens, d}, input.options().dtype(torch::kFloat));

-  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
-
   const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
-  MOTIF_DISPATCH_FLOATING_TYPES(...
-          input.data_ptr<scalar_t>(),
-          weight.data_ptr<scalar_t>(), eps, d);
-  });
+  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();
+  if (d % 8 == 0) {
+    LAUNCH_RMS_NORM_BWD(8);
+  } else {
+    LAUNCH_RMS_NORM_BWD(0);
+  }

   if (weight_grad.defined()) {
-    ...
+    torch::Tensor acc =
+        torch::empty_like(weight_grad, temp_weight_grad.options());
+    at::sum_out(acc, temp_weight_grad, {0});
+    weight_grad.copy_(acc);
   }
 }
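A side note on the `type_vec_t` trick used by the `width > 0` kernels above: the `alignas(sizeof(type) * N)` forces vector alignment, so a whole-struct copy compiles to a single wide load or store rather than `N` scalar accesses. A small compile-time check (standalone sketch, assuming 2-byte `__half` scalars with `width = 8`):

```cuda
#include <cuda_fp16.h>

template <typename type, int N> struct alignas(sizeof(type) * N) type_vec_t {
  type data[N];
};

// width = 8 halves -> one 16-byte (128-bit) aligned transaction per access,
// which is why the launchers dispatch LAUNCH_RMS_NORM(8) when d % 8 == 0
// and fall back to the scalar width == 0 kernel otherwise.
static_assert(sizeof(type_vec_t<__half, 8>) == 16, "one 128-bit load/store");
static_assert(alignof(type_vec_t<__half, 8>) == 16, "vector-aligned");
```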
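The backward kernels implement the standard RMSNorm gradient. In the code's names, `scale` is $r$, `d_sum` is $\sum_j g_j x_j w_j$, and `dxx` is `d_sum * scale^3 / d` (derivation notation mine; $g$ is the incoming output gradient):

```latex
r = \Big(\tfrac{1}{d}\textstyle\sum_j x_j^2 + \epsilon\Big)^{-1/2},
\qquad y_i = w_i\, x_i\, r
```

```latex
\frac{\partial L}{\partial x_i} = r\, g_i w_i \;-\; \frac{r^3}{d}\, x_i \sum_j g_j x_j w_j,
\qquad
\frac{\partial L}{\partial w_i} = \sum_{\text{tokens}} g_i\, x_i\, r
```

`temp_weight_grad` stores the per-token terms $g_i x_i r$ in fp32; the host-side `at::sum_out(acc, temp_weight_grad, {0})` then reduces over tokens before `weight_grad.copy_(acc)` casts back to the weight dtype.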
benchmarks/README.md
ADDED
@@ -0,0 +1,35 @@
# Benchmark Runner

This script benchmarks **forward/backward performance** of several operations (`rms`, `add_rms`, `poly`, `mul_poly`).
Results can be saved as **CSV files** or **plots**.

> **Note**<br>
> To run the benchmarks, you must select the appropriate Torch version along with the corresponding CUDA/ROCm build from within the `build` directory.
>
> **Example:**
>
> ```bash
> export PYTHONPATH=$PYTHONPATH:<YOUR_PATH>/activation/build/torch27-cxx11-cu128-x86_64-linux
> ```

## Usage

```bash
python main.py --case <CASE> [--plot] [--save-path <DIR>]
```

- `--case` (required): one of `rms`, `add_rms`, `poly`, `mul_poly`
- `--plot`: save plots instead of CSVs
- `--save-path`: output directory (default: `./configs/`)

## Examples

```bash
python main.py --case add_rms --save-path ./results/
python main.py --case poly --plot --save-path ./plots/
```

## Output

- CSV: `<case>-fwd-perf.csv`, `<case>-bwd-perf.csv`
- Plots: `plot_<case>-fwd-perf.png`, `plot_<case>-bwd-perf.png`
benchmarks/cases/__init__.py
ADDED
@@ -0,0 +1 @@
+
benchmarks/cases/add_rms.py
ADDED
@@ -0,0 +1,55 @@
import torch
from common.diff_engine import DiffCase

import activation


class FusedAddRMSNorm(torch.nn.Module):

    def __init__(self, d, eps=1e-6, dtype: torch.dtype = torch.float32):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(d, dtype=dtype))
        self.eps = eps

    def forward(self, x, residual):
        return activation.rms_norm((x + residual), self.weight, self.eps)


class AddRMS(DiffCase):

    def build_inputs(self, bs, sl, hidden, dtype, eps):
        return {
            "x": torch.randn(bs, sl, hidden, dtype=dtype, requires_grad=True),
            "residual": torch.randn(bs, sl, hidden, dtype=dtype,
                                    requires_grad=True),
            "weight": torch.ones(hidden, dtype=dtype),
            "dim": hidden,
            "eps": eps,
            "dtype": dtype,
        }

    def make_naive(self, I):
        m = FusedAddRMSNorm(I["dim"], I["eps"], dtype=I["dtype"])
        m.weight = torch.nn.Parameter(I["weight"].detach().clone())
        return m

    def make_cuda(self, I):
        m = activation.layers.FusedAddRMSNorm(I["dim"],
                                              I["eps"],
                                              dtype=I["dtype"])
        m.weight = torch.nn.Parameter(I["weight"].detach().clone())
        return m

    def forward(self, obj, I):
        return obj(I["x"], I["residual"])

    def grad_inputs(self, I):
        return [I["x"], I["residual"]]


CASE = AddRMS()
benchmarks/cases/mul_poly.py
ADDED
@@ -0,0 +1,53 @@
import torch
from common.diff_engine import DiffCase

import activation


class FusedMulPolyNorm(torch.nn.Module):

    def __init__(self, eps=1e-6, dtype: torch.dtype = torch.float32):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(3, dtype=dtype) / 3)
        self.bias = torch.nn.Parameter(torch.zeros(1, dtype=dtype))
        self.eps = eps

    def forward(self, x, mul):
        output = activation.poly_norm(x, self.weight, self.bias, self.eps)
        return output * mul


class MulPoly(DiffCase):

    def build_inputs(self, bs, sl, hidden, dtype, eps):
        return {
            "x": torch.randn(bs, sl, hidden, dtype=dtype, requires_grad=True),
            "mul": torch.randn(bs, sl, hidden, dtype=dtype,
                               requires_grad=True),
            "weight": torch.ones(3, dtype=dtype),
            "bias": torch.ones(1, dtype=dtype),
            "dim": hidden,
            "eps": eps,
            "dtype": dtype,
        }

    def make_naive(self, I):
        m = FusedMulPolyNorm(I["eps"], dtype=I["dtype"])
        m.weight = torch.nn.Parameter(I["weight"].detach().clone())
        m.bias = torch.nn.Parameter(I["bias"].detach().clone())
        return m

    def make_cuda(self, I):
        m = activation.layers.FusedMulPolyNorm(I["eps"], dtype=I["dtype"])
        m.weight = torch.nn.Parameter(I["weight"].detach().clone())
        m.bias = torch.nn.Parameter(I["bias"].detach().clone())
        return m

    def forward(self, obj, I):
        return obj(I["x"], I["mul"])

    def grad_inputs(self, I):
        return [I["x"], I["mul"]]


CASE = MulPoly()
benchmarks/cases/poly.py
ADDED
@@ -0,0 +1,58 @@
import torch
from common.diff_engine import DiffCase

import activation


class PolyNorm(torch.nn.Module):

    def __init__(self, eps=1e-6, dtype: torch.dtype = torch.float32):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(3, dtype=dtype) / 3)
        self.bias = torch.nn.Parameter(torch.zeros(1, dtype=dtype))
        self.eps = eps

    def _norm(self, x):
        return x / torch.sqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        orig_dtype = x.dtype
        x_float = x.to(torch.float32)
        output = (self.weight[0] * self._norm(x_float**3) +
                  self.weight[1] * self._norm(x_float**2) +
                  self.weight[2] * self._norm(x_float) + self.bias)
        return output.to(orig_dtype)


class Poly(DiffCase):

    def build_inputs(self, bs, sl, hidden, dtype, eps):
        return {
            "x": torch.randn(bs, sl, hidden, dtype=dtype, requires_grad=True),
            "weight": torch.ones(3, dtype=dtype),
            "bias": torch.ones(1, dtype=dtype),
            "dim": hidden,
            "eps": eps,
            "dtype": dtype,
        }

    def make_naive(self, I):
        m = PolyNorm(I["eps"], dtype=I["dtype"])
        m.weight = torch.nn.Parameter(I["weight"].detach().clone())
        m.bias = torch.nn.Parameter(I["bias"].detach().clone())
        return m

    def make_cuda(self, I):
        m = activation.layers.PolyNorm(I["eps"], dtype=I["dtype"])
        m.weight = torch.nn.Parameter(I["weight"].detach().clone())
        m.bias = torch.nn.Parameter(I["bias"].detach().clone())
        return m

    def forward(self, obj, I):
        return obj(I["x"])

    def grad_inputs(self, I):
        return [I["x"]]


CASE = Poly()
benchmarks/cases/rms.py
ADDED
@@ -0,0 +1,35 @@
import torch
from common.diff_engine import DiffCase

import activation


class RMS(DiffCase):

    def build_inputs(self, bs, sl, hidden, dtype, eps):
        return {
            "x": torch.randn(bs, sl, hidden, dtype=dtype, requires_grad=True),
            "weight": torch.ones(hidden, dtype=dtype),
            "dim": hidden,
            "eps": eps,
            "dtype": dtype,
        }

    def make_naive(self, I):
        m = torch.nn.RMSNorm(I["dim"], I["eps"], dtype=I["dtype"])
        m.weight = torch.nn.Parameter(I["weight"].detach().clone())
        return m

    def make_cuda(self, I):
        m = activation.layers.RMSNorm(I["dim"], I["eps"], dtype=I["dtype"])
        m.weight = torch.nn.Parameter(I["weight"].detach().clone())
        return m

    def forward(self, obj, I):
        return obj(I["x"])

    def grad_inputs(self, I):
        return [I["x"]]


CASE = RMS()
benchmarks/common/__init__.py
ADDED
@@ -0,0 +1 @@
+
benchmarks/common/bench_framework.py
ADDED
@@ -0,0 +1,220 @@
import collections
import math
import re
from typing import Any, Dict, Sequence

import torch
import triton

from .diff_engine import DiffCase


def make_fwd_key(batch_size, seq_len, dim):
    return f"forward : ({batch_size}, {seq_len}, {dim})"


def make_bwd_key(batch_size, seq_len, dim):
    return f"backward : ({batch_size}, {seq_len}, {dim})"


def parse_config_string(config_str):
    match = re.match(r"(\w+)\s*:\s*\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)",
                     config_str)
    if not match:
        raise ValueError(f"Invalid config string: {config_str}")
    _, bs, sl, d = match.groups()
    return int(bs), int(sl), int(d)


def make_fwd_benchmark_for_case(
    *,
    case: DiffCase,
    configs: Sequence[tuple[int, int, int]],
    plot_name: str,
    ylabel: str = "us",
    line_vals=("naive", "cuda", "speedup"),
    line_names: Dict[str, str] | None = None,
    dtype=torch.bfloat16,
    eps: float = 1e-6,
    time_unit_scale: float = 1000,
):
    timings_ms = collections.defaultdict(dict)
    line_vals = list(line_vals)
    line_names = line_names or {v: v.title() for v in line_vals}
    x_vals = [list(_) for _ in configs]

    @triton.testing.perf_report(
        triton.testing.Benchmark(x_names=["dim", "batch_size", "seq_len"],
                                 x_vals=x_vals,
                                 line_arg="provider",
                                 line_vals=line_vals,
                                 line_names=[line_names[v] for v in line_vals],
                                 ylabel=ylabel,
                                 plot_name=plot_name,
                                 args={}))
    def bench(dim, batch_size, seq_len, provider):
        key = make_fwd_key(dim, batch_size, seq_len)
        I = case.build_inputs(batch_size, seq_len, dim, dtype, eps)
        if provider == "speedup":
            return timings_ms["naive"][key] / timings_ms["cuda"][key]
        obj = case.make_naive(I) if provider == "naive" else case.make_cuda(I)
        run = lambda: case.forward(obj, I)
        ms = triton.testing.do_bench(run)
        timings_ms[provider][key] = ms
        return time_unit_scale * ms

    return bench


def make_fwd_benchmark_plot_for_case(
    *,
    case: DiffCase,
    configs: Sequence[tuple[int, int, int]],
    plot_name: str,
    ylabel: str = "Relative Speedup",
    line_vals=("naive", "cuda"),
    line_names: Dict[str, str] | None = None,
    dtype=torch.bfloat16,
    eps: float = 1e-6,
):
    timings_ms = collections.defaultdict(dict)
    spdup_ratio = list()
    line_vals = list(line_vals)
    line_names = line_names or {v: v.title() for v in line_vals}
    x_vals = [make_fwd_key(*_) for _ in configs]
    x_vals.append("Geometric Mean")

    @triton.testing.perf_report(
        triton.testing.Benchmark(x_names=["config"],
                                 x_vals=x_vals,
                                 line_arg="provider",
                                 line_vals=line_vals,
                                 line_names=[line_names[v] for v in line_vals],
                                 ylabel=ylabel,
                                 plot_name=plot_name,
                                 args={}))
    def bench(config, provider):
        if config == "Geometric Mean":
            if provider == "cuda":
                return round(math.prod(spdup_ratio)**(1 / len(spdup_ratio)), 2)
            else:
                return 1.00
        batch_size, seq_len, dim = parse_config_string(config)
        I = case.build_inputs(batch_size, seq_len, dim, dtype, eps)
        obj = case.make_naive(I) if provider == "naive" else case.make_cuda(I)
        run = lambda: case.forward(obj, I)
        ms = triton.testing.do_bench(run)
        timings_ms[provider][config] = ms
        if provider == "cuda":
            ratio = timings_ms["naive"][config] / timings_ms["cuda"][config]
            spdup_ratio.append(ratio)
            return round(ratio, 2)
        else:
            return 1.00

    return bench


def make_bwd_benchmark_for_case(
    *,
    case: DiffCase,
    configs: Sequence[tuple[int, int, int]],
    plot_name: str,
    ylabel: str = "us",
    line_vals=("naive", "cuda", "speedup"),
    line_names: Dict[str, str] | None = None,
    dtype=torch.bfloat16,
    eps: float = 1e-6,
    time_unit_scale: float = 1000,
):
    timings_ms = collections.defaultdict(dict)
    line_vals = list(line_vals)
    line_names = line_names or {v: v.title() for v in line_vals}
    x_vals = [list(_) for _ in configs]

    @triton.testing.perf_report(
        triton.testing.Benchmark(x_names=["dim", "batch_size", "seq_len"],
                                 x_vals=x_vals,
                                 line_arg="provider",
                                 line_vals=line_vals,
                                 line_names=[line_names[v] for v in line_vals],
                                 ylabel=ylabel,
                                 plot_name=plot_name,
                                 args={}))
    def bench(dim, batch_size, seq_len, provider):
        key = make_bwd_key(dim, batch_size, seq_len)
        I = case.build_inputs(batch_size, seq_len, dim, dtype, eps)
        if provider == "speedup":
            return timings_ms["naive"][key] / timings_ms["cuda"][key]
        obj = case.make_naive(I) if provider == "naive" else case.make_cuda(I)
        y = case.forward(obj, I)
        gin = list(case.grad_inputs(I)) + list(obj.parameters())
        g = torch.randn_like(y)
        run = lambda: torch.autograd.grad(y,
                                          gin,
                                          g,
                                          retain_graph=True,
                                          create_graph=False,
                                          allow_unused=False)
        ms = triton.testing.do_bench(run)
        timings_ms[provider][key] = ms
        return time_unit_scale * ms

    return bench


def make_bwd_benchmark_plot_for_case(
    *,
    case: DiffCase,
    configs: Sequence[tuple[int, int, int]],
    plot_name: str,
    ylabel: str = "Relative Speedup",
    line_vals=("naive", "cuda"),
    line_names: Dict[str, str] | None = None,
    dtype=torch.bfloat16,
    eps: float = 1e-6,
):
    timings_ms = collections.defaultdict(dict)
    spdup_ratio = list()
    line_vals = list(line_vals)
    line_names = line_names or {v: v.title() for v in line_vals}
    x_vals = [make_bwd_key(*_) for _ in configs]
    x_vals.append("Geometric Mean")

    @triton.testing.perf_report(
        triton.testing.Benchmark(x_names=["config"],
                                 x_vals=x_vals,
                                 line_arg="provider",
                                 line_vals=line_vals,
                                 line_names=[line_names[v] for v in line_vals],
                                 ylabel=ylabel,
                                 plot_name=plot_name,
                                 args={}))
    def bench(config, provider):
        if config == "Geometric Mean":
            if provider == "cuda":
                return round(math.prod(spdup_ratio)**(1 / len(spdup_ratio)), 2)
            else:
                return 1.00
        batch_size, seq_len, dim = parse_config_string(config)
        I = case.build_inputs(batch_size, seq_len, dim, dtype, eps)
        obj = case.make_naive(I) if provider == "naive" else case.make_cuda(I)
        y = case.forward(obj, I)
        gin = list(case.grad_inputs(I)) + list(obj.parameters())
        g = torch.randn_like(y)
        run = lambda: torch.autograd.grad(y,
                                          gin,
                                          g,
                                          retain_graph=True,
                                          create_graph=False,
                                          allow_unused=False)
        ms = triton.testing.do_bench(run)
        timings_ms[provider][config] = ms
        if provider == "cuda":
            ratio = timings_ms["naive"][config] / timings_ms["cuda"][config]
            spdup_ratio.append(ratio)
            return round(ratio, 2)
        else:
            return 1.00

    return bench
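A minimal sketch of driving one of these factories directly (the config values are made up; `run_cases.py` below is the real entry point, and this assumes it is run from the `benchmarks` directory so `cases` and `common` are importable):

```python
import torch
from common.bench_framework import make_fwd_benchmark_for_case
from cases.rms import CASE

torch.set_default_device("cuda")
# configs follow the x_names order ("dim", "batch_size", "seq_len")
bench = make_fwd_benchmark_for_case(
    case=CASE,
    configs=[(2048, 1, 1024), (4096, 1, 1024)],
    plot_name="rms-fwd-perf",
)
bench.run(print_data=True, save_path="./results/rms")
```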
benchmarks/common/diff_engine.py
ADDED
@@ -0,0 +1,85 @@
from abc import ABC, abstractmethod
from typing import Any, Dict, Sequence

import torch


class DiffCase(ABC):

    @abstractmethod
    def build_inputs(self, bs: int, sl: int, hidden: int, dtype: torch.dtype,
                     eps: float) -> Dict[str, Any]:
        ...

    @abstractmethod
    def make_naive(self, I: Dict[str, Any]) -> Any:
        ...

    @abstractmethod
    def make_cuda(self, I: Dict[str, Any]) -> Any:
        ...

    @abstractmethod
    def forward(self, obj: Any, I: Dict[str, Any]) -> torch.Tensor:
        ...

    @abstractmethod
    def grad_inputs(self, I: Dict[str, Any]) -> Sequence[torch.Tensor]:
        ...


def _clone_payload(d, device):
    out = {}
    for k, v in d.items():
        if isinstance(v, torch.Tensor):
            t = v.detach().clone().to(device)
            t.requires_grad_(v.requires_grad)
            out[k] = t
        else:
            out[k] = v
    return out


def _unit_grad_like(y):
    g = torch.randn_like(y)
    n = g.norm()
    return g if n == 0 else g / n


def calculate_diff(
    case: DiffCase,
    *,
    batch_size: int,
    seq_len: int,
    hidden_size: int,
    dtype=torch.bfloat16,
    eps: float = 1e-6,
    atol: float = 1e-2,
    rtol: float = 1e-2,
    device="cuda",
) -> None:
    # Argument order (bs, sl, hidden) matches the concrete cases'
    # build_inputs signature and the calls in bench_framework.
    base = case.build_inputs(batch_size, seq_len, hidden_size, dtype, eps)
    I_n = _clone_payload(base, device)
    I_c = _clone_payload(base, device)
    obj_n = case.make_naive(I_n)
    obj_c = case.make_cuda(I_c)
    y_n = case.forward(obj_n, I_n)
    y_c = case.forward(obj_c, I_c)
    torch.testing.assert_close(y_n, y_c, atol=atol, rtol=rtol)
    gin_n = list(case.grad_inputs(I_n)) + list(obj_n.parameters())
    gin_c = list(case.grad_inputs(I_c)) + list(obj_c.parameters())
    g = _unit_grad_like(y_n).to(device)
    ng = torch.autograd.grad(y_n,
                             gin_n,
                             g,
                             retain_graph=False,
                             create_graph=False,
                             allow_unused=False)
    cg = torch.autograd.grad(y_c,
                             gin_c,
                             g,
                             retain_graph=False,
                             create_graph=False,
                             allow_unused=False)
    torch.testing.assert_close(ng, cg, atol=atol, rtol=rtol)
    print("✅ forward + backward match")
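Sketch of wiring a case through the engine (this mirrors the smoke test in `run_cases.py` below):

```python
import torch
from cases.rms import CASE
from common.diff_engine import calculate_diff

torch.set_default_device("cuda")
# Raises via torch.testing.assert_close on any forward/backward mismatch.
calculate_diff(CASE, batch_size=2, seq_len=128, hidden_size=4096)
```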
benchmarks/plots/h100/add_rms/plot_add_rms-bwd-perf.png  ADDED
benchmarks/plots/h100/add_rms/plot_add_rms-fwd-perf.png  ADDED
benchmarks/plots/h100/mul_poly/plot_mul_poly-bwd-perf.png  ADDED
benchmarks/plots/h100/mul_poly/plot_mul_poly-fwd-perf.png  ADDED
benchmarks/plots/h100/poly/plot_poly-bwd-perf.png  ADDED
benchmarks/plots/h100/poly/plot_poly-fwd-perf.png  ADDED
benchmarks/plots/h100/rms/plot_rms-bwd-perf.png  ADDED
benchmarks/plots/h100/rms/plot_rms-fwd-perf.png  ADDED
benchmarks/plots/mi250/add_rms/plot_add_rms-bwd-perf.png  ADDED
benchmarks/plots/mi250/add_rms/plot_add_rms-fwd-perf.png  ADDED
benchmarks/plots/mi250/mul_poly/plot_mul_poly-bwd-perf.png  ADDED
benchmarks/plots/mi250/mul_poly/plot_mul_poly-fwd-perf.png  ADDED
benchmarks/plots/mi250/poly/plot_poly-bwd-perf.png  ADDED
benchmarks/plots/mi250/poly/plot_poly-fwd-perf.png  ADDED
benchmarks/plots/mi250/rms/plot_rms-bwd-perf.png  ADDED
benchmarks/plots/mi250/rms/plot_rms-fwd-perf.png  ADDED
benchmarks/run_cases.py
ADDED
@@ -0,0 +1,143 @@
import argparse
import glob
import importlib
import itertools
import os

import torch
from common.bench_framework import (make_bwd_benchmark_for_case,
                                    make_bwd_benchmark_plot_for_case,
                                    make_fwd_benchmark_for_case,
                                    make_fwd_benchmark_plot_for_case)
from common.diff_engine import DiffCase, calculate_diff


def make_title_tag():
    if torch.cuda.is_available():
        dev_name = torch.cuda.get_device_name(0)
    else:
        dev_name = "CPU"

    torch_ver = torch.__version__

    return f"[{dev_name} | torch {torch_ver}]"


def plot_result(r_path):
    import matplotlib.pyplot as plt
    import pandas as pd
    df = pd.read_csv(r_path + ".csv")
    plt.figure(figsize=(12, 6))
    ax = df.plot(x="config", y=["Naive", "Cuda"], kind="bar", ax=plt.gca())
    ax.set_title("Speedup over torch (higher is better)\n" + make_title_tag(),
                 fontsize=14,
                 fontweight="bold")
    ax.set_ylabel("Relative Speedup", fontsize=14)
    ax.set_xlabel("")
    plt.xticks(rotation=45, fontsize=12, ha="right", rotation_mode="anchor")
    for container in ax.containers:
        labels = [f"x{v.get_height():.2f}" for v in container]
        ax.bar_label(container, labels=labels, label_type="edge", fontsize=10)
    plt.tight_layout()
    plt.savefig(r_path + ".png", bbox_inches="tight")


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--case",
                    choices=["rms", "add_rms", "poly", "mul_poly"],
                    required=True)
    ap.add_argument("--plot", action="store_true")
    ap.add_argument(
        "--save-path",
        type=str,
        default="./configs/",
        help="Path to save benchmark results",
    )
    args = ap.parse_args()

    torch.set_default_device("cuda")
    mod = importlib.import_module(f"cases.{args.case}")
    case: DiffCase = mod.CASE

    calculate_diff(
        case,
        batch_size=2,
        seq_len=128,
        hidden_size=4096,
    )

    save_dir = os.path.join(args.save_path, args.case)
    if args.plot:
        batch_size_range = [1]
        seq_length_range = [4096, 8192, 16384]
        dim = [8192, 16384] if "poly" in args.case else [2048, 4096]
        configs = list(
            itertools.product(batch_size_range, seq_length_range, dim))
        plot_name = f"plot_{args.case}-fwd-perf"
        bench = make_fwd_benchmark_plot_for_case(
            case=case,
            configs=configs,
            plot_name=plot_name,
            line_names={
                "naive": "Naive",
                "cuda": "Cuda",
            },
        )
        bench.run(print_data=True, save_path=save_dir)
        plot_result(os.path.join(save_dir, plot_name))

        plot_name = f"plot_{args.case}-bwd-perf"
        bench = make_bwd_benchmark_plot_for_case(
            case=case,
            configs=configs,
            plot_name=plot_name,
            line_names={
                "naive": "Naive",
                "cuda": "Cuda",
            },
        )
        bench.run(print_data=True, save_path=save_dir)
        plot_result(os.path.join(save_dir, plot_name))
        for f in glob.glob(os.path.join(save_dir, "*.html")) + glob.glob(
                os.path.join(save_dir, "*.csv")):
            os.remove(f)
    else:
        batch_size_range = [2**i for i in range(0, 4, 1)]
        seq_length_range = [2**i for i in range(10, 14, 1)]
        dim = [8192, 16384] if "poly" in args.case else [2048, 4096]
        configs = list(
            itertools.product(dim, batch_size_range, seq_length_range))

        bench = make_fwd_benchmark_for_case(
            case=case,
            configs=configs,
            plot_name=f"{args.case}-fwd-perf",
            line_names={
                "naive": "Naive",
                "cuda": "Cuda",
                "speedup": "SpeedUp"
            },
        )

        bench.run(print_data=True, save_path=save_dir)

        bench = make_bwd_benchmark_for_case(
            case=case,
            configs=configs,
            plot_name=f"{args.case}-bwd-perf",
            line_names={
                "naive": "Naive",
                "cuda": "Cuda",
                "speedup": "SpeedUp"
            },
        )

        bench.run(print_data=True, save_path=save_dir)
        for f in glob.glob(os.path.join(save_dir, "*.html")) + glob.glob(
                os.path.join(save_dir, "*.png")):
            os.remove(f)


if __name__ == "__main__":
    main()
build.toml
CHANGED
@@ -13,9 +13,10 @@ backend = "rocm"
 rocm-archs = [ "gfx90a", "gfx942" ]
 src = [
     "activation/poly_norm.cu",
+    "activation/fused_mul_poly_norm.cu",
     "activation/rms_norm.cu",
+    "activation/fused_add_rms_norm.cu",
     "activation/cuda_compat.h",
-    "activation/block_reduce.h",
     "activation/dispatch_utils.h",
     "activation/assert_utils.h",
     "activation/atomic_utils.h",
@@ -26,9 +27,10 @@ depends = [ "torch" ]
 backend = "cuda"
 src = [
     "activation/poly_norm.cu",
+    "activation/fused_mul_poly_norm.cu",
     "activation/rms_norm.cu",
+    "activation/fused_add_rms_norm.cu",
     "activation/cuda_compat.h",
-    "activation/block_reduce.h",
     "activation/dispatch_utils.h",
     "activation/assert_utils.h",
     "activation/atomic_utils.h",
build/torch27-cxx11-cu118-x86_64-linux/activation/__init__.py
CHANGED
@@ -2,8 +2,8 @@ import torch

 from . import layers
 from ._ops import ops
-from .poly_norm import PolyNormFunction
-from .rms_norm import RMSNormFunction
+from .poly_norm import FusedMulPolyNormFunction, PolyNormFunction
+from .rms_norm import FusedAddRMSNormFunction, RMSNormFunction


 def poly_norm(
@@ -15,6 +15,16 @@ def poly_norm(
     return PolyNormFunction.apply(x, weight, bias, eps)


+def fused_mul_poly_norm(
+    x: torch.Tensor,
+    mul: torch.Tensor,
+    weight: torch.Tensor,
+    bias: torch.Tensor,
+    eps: float = 1e-6,
+) -> None:
+    return FusedMulPolyNormFunction.apply(x, mul, weight, bias, eps)
+
+
 def rms_norm(
     x: torch.Tensor,
     weight: torch.Tensor,
@@ -23,8 +33,20 @@ def rms_norm(
     return RMSNormFunction.apply(x, weight, eps)


+def fused_add_rms_norm(
+    x: torch.Tensor,
+    residual: torch.Tensor,
+    weight: torch.Tensor,
+    eps: float = 1e-6,
+) -> None:
+    return FusedAddRMSNormFunction.apply(x, residual, weight, eps)[0]
+
+
 __all__ = [
     "poly_norm",
+    "fused_mul_poly_norm",
+    "rms_norm",
+    "fused_add_rms_norm",
     "layers",
     "ops",
 ]
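A hedged usage sketch of the two new functional entry points exported above (shapes and dtypes are examples, not requirements taken from the diff):

```python
import torch
import activation

x = torch.randn(2, 128, 4096, device="cuda", dtype=torch.bfloat16)
residual = torch.randn_like(x)
w = torch.ones(4096, device="cuda", dtype=torch.bfloat16)
# rms_norm(x + residual) computed in a single fused kernel
y = activation.fused_add_rms_norm(x, residual, w, eps=1e-6)

mul = torch.randn_like(x)
pw = torch.ones(3, device="cuda", dtype=torch.bfloat16) / 3
pb = torch.zeros(1, device="cuda", dtype=torch.bfloat16)
# poly_norm(x) * mul computed in a single fused kernel
z = activation.fused_mul_poly_norm(x, mul, pw, pb, eps=1e-6)
```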
tests/perf.png → build/torch27-cxx11-cu118-x86_64-linux/activation/_activation_20250907180255.abi3.so
RENAMED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:...
-size ...
+oid sha256:d21a85bf21aa74f1281541e658acfd4f4326d902efe3578b059eccf054443284
+size 8089696
build/torch27-cxx11-cu118-x86_64-linux/activation/_ops.py
CHANGED
@@ -1,9 +1,9 @@
 import torch
-from . import ...
-ops = torch.ops....
+from . import _activation_20250907180255
+ops = torch.ops._activation_20250907180255

 def add_op_namespace_prefix(op_name: str):
     """
     Prefix op by namespace.
     """
-    return f"..."
+    return f"_activation_20250907180255::{op_name}"
build/torch27-cxx11-cu118-x86_64-linux/activation/layers.py
CHANGED
@@ -2,8 +2,8 @@ import torch
 import torch.nn as nn
 from torch.nn import init

-from .poly_norm import PolyNormFunction
-from .rms_norm import RMSNormFunction
+from .poly_norm import FusedMulPolyNormFunction, PolyNormFunction
+from .rms_norm import FusedAddRMSNormFunction, RMSNormFunction


 class PolyNorm(nn.Module):
@@ -28,6 +28,30 @@ class PolyNorm(nn.Module):
         init.zeros_(self.bias)


+class FusedMulPolyNorm(nn.Module):
+
+    def __init__(self, eps=1e-6, dtype: torch.dtype = torch.float32):
+        super().__init__()
+        self.weight = torch.nn.Parameter(torch.ones(3, dtype=dtype) / 3)
+        self.bias = torch.nn.Parameter(torch.zeros(1, dtype=dtype))
+        self.eps = eps
+
+    def forward(
+        self,
+        x: torch.Tensor,
+        mul: torch.Tensor,
+    ):
+        return FusedMulPolyNormFunction.apply(x, mul, self.weight, self.bias,
+                                              self.eps)
+
+    def reset_parameters(self) -> None:
+        """
+        Resets parameters based on their initialization used in __init__.
+        """
+        init.ones_(self.weight)
+        init.zeros_(self.bias)
+
+
 class RMSNorm(nn.Module):

     def __init__(self, dim: int, eps=1e-6, dtype: torch.dtype = torch.float32):
@@ -46,3 +70,25 @@ class RMSNorm(nn.Module):
         Resets parameters based on their initialization used in __init__.
         """
         init.ones_(self.weight)
+
+
+class FusedAddRMSNorm(nn.Module):
+
+    def __init__(self, dim: int, eps=1e-6, dtype: torch.dtype = torch.float32):
+        super().__init__()
+        self.weight = torch.nn.Parameter(torch.ones(dim, dtype=dtype))
+        self.eps = eps
+
+    def forward(
+        self,
+        x: torch.Tensor,
+        residual: torch.Tensor,
+    ):
+        return FusedAddRMSNormFunction.apply(x, residual, self.weight,
+                                             self.eps)[0]
+
+    def reset_parameters(self) -> None:
+        """
+        Resets parameters based on their initialization used in __init__.
+        """
+        init.ones_(self.weight)
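The new modules wrap the fused autograd Functions with learnable parameters, mirroring the existing PolyNorm and RMSNorm wrappers. A brief sketch (hypothetical shapes; the comment on FusedMulPolyNorm's semantics is inferred from the op name, not from the kernel source):

import torch
from activation.layers import FusedAddRMSNorm, FusedMulPolyNorm

hidden = 1024
add_norm = FusedAddRMSNorm(hidden, eps=1e-6, dtype=torch.float16).cuda()
mul_norm = FusedMulPolyNorm(eps=1e-6, dtype=torch.float16).cuda()

x = torch.randn(4, hidden, device="cuda", dtype=torch.float16)
residual = torch.randn_like(x)
gate = torch.randn_like(x)

y = add_norm(x, residual)  # rms-norm applied to (x + residual), one kernel
z = mul_norm(x, gate)      # poly-norm of x scaled elementwise by gate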
build/torch27-cxx11-cu118-x86_64-linux/activation/poly_norm.py
CHANGED
@@ -37,3 +37,40 @@ class PolyNormFunction(torch.autograd.Function):
                                  input, weight, eps)

         return input_grad, weight_grad, bias_grad, None
+
+
+class FusedMulPolyNormFunction(torch.autograd.Function):
+    # Note that forward, setup_context, and backward are @staticmethods
+    @staticmethod
+    def forward(input, mul, weight, bias, eps):
+        output = torch.empty_like(input)
+        ops.fused_mul_poly_norm(output, input, mul, weight, bias, eps)
+        return output
+
+    # inputs is a Tuple of all of the inputs passed to forward.
+    # output is the output of the forward().
+    @staticmethod
+    def setup_context(ctx, inputs, output):
+        input, mul, weight, bias, eps = inputs
+        ctx.save_for_backward(input, mul, weight, bias)
+        ctx.eps = eps
+
+    # This function has only a single output, so it gets only one gradient
+    @staticmethod
+    def backward(ctx, output_grad):
+        input, mul, weight, bias = ctx.saved_tensors
+        eps = ctx.eps
+
+        input_grad = torch.empty_like(
+            input) if ctx.needs_input_grad[0] else None
+        mul_grad = torch.empty_like(mul) if ctx.needs_input_grad[1] else None
+        weight_grad = torch.empty_like(
+            weight) if ctx.needs_input_grad[2] else None
+        bias_grad = (torch.empty(1, dtype=weight.dtype, device=weight.device)
+                     if ctx.needs_input_grad[3] else None)
+
+        ops.fused_mul_poly_norm_backward(input_grad, mul_grad, weight_grad,
+                                         bias_grad, output_grad, input, mul,
+                                         weight, bias, eps)
+
+        return input_grad, mul_grad, weight_grad, bias_grad, None
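For readability, an eager-mode reference of what the fused forward presumably computes, inferred from the op name and the unfused PolyNorm (the _norm helper and the exact formula are assumptions for illustration, not the kernel's source):

import torch

def _norm(x: torch.Tensor, eps: float) -> torch.Tensor:
    # RMS-style normalization over the last dimension.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def mul_poly_norm_ref(x, mul, weight, bias, eps=1e-6):
    # PolyNorm: a weighted sum of normalized powers of x plus a bias,
    # followed by an elementwise scale by `mul` -- the two steps the
    # CUDA kernel fuses into a single pass.
    poly = (weight[0] * _norm(x ** 3, eps) + weight[1] * _norm(x ** 2, eps)
            + weight[2] * _norm(x, eps) + bias)
    return poly * mul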
build/torch27-cxx11-cu118-x86_64-linux/activation/rms_norm.py
CHANGED
@@ -35,3 +35,50 @@ class RMSNormFunction(torch.autograd.Function):
                               weight, eps)

         return input_grad, weight_grad, None
+
+
+# Inherit from Function
+class FusedAddRMSNormFunction(torch.autograd.Function):
+    # Note that forward, setup_context, and backward are @staticmethods
+    @staticmethod
+    def forward(input, residual, weight, eps):
+        output = torch.empty_like(input)
+        add_output = torch.empty_like(input)
+        ops.fused_add_rms_norm(output, add_output, input, residual, weight,
+                               eps)
+        return output, add_output
+
+    # inputs is a Tuple of all of the inputs passed to forward.
+    # outputs is the tuple of outputs of the forward().
+    @staticmethod
+    def setup_context(ctx, inputs, outputs):
+        _, _, weight, eps = inputs
+        _, add_output = outputs
+        ctx.mark_non_differentiable(add_output)
+        ctx.set_materialize_grads(False)
+        ctx.save_for_backward(weight, add_output)
+        ctx.eps = eps
+
+    # This function only needs one gradient
+    @staticmethod
+    def backward(ctx, output_grad, _):
+        weight, add_output = ctx.saved_tensors
+        eps = ctx.eps
+
+        if output_grad is None:
+            output_grad = torch.zeros_like(add_output)
+
+        need_in = ctx.needs_input_grad[0]
+        need_res = ctx.needs_input_grad[1]
+
+        grad = torch.empty_like(output_grad) if need_in or need_res else None
+
+        weight_grad = torch.empty_like(
+            weight) if ctx.needs_input_grad[2] else None
+
+        ops.rms_norm_backward(grad, weight_grad, output_grad, add_output,
+                              weight, eps)
+        input_grad = grad if need_in else None
+        residual_grad = grad if need_res else None
+
+        return input_grad, residual_grad, weight_grad, None
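Note how the backward reuses rms_norm_backward against the saved add_output: because add_output = input + residual, both inputs see an identity Jacobian through the sum, so a single gradient buffer serves input_grad and residual_grad alike. As an eager-mode reference of the forward contract (a sketch, not the kernel source):

import torch

def fused_add_rms_norm_ref(x, residual, weight, eps=1e-6):
    # The CUDA kernel produces both tensors in one pass over each row.
    add_output = x + residual
    variance = add_output.pow(2).mean(dim=-1, keepdim=True)
    output = add_output * torch.rsqrt(variance + eps) * weight
    return output, add_output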
build/torch27-cxx11-cu126-x86_64-linux/activation/__init__.py
CHANGED
@@ -2,8 +2,8 @@ import torch

 from . import layers
 from ._ops import ops
-from .poly_norm import PolyNormFunction
-from .rms_norm import RMSNormFunction
+from .poly_norm import FusedMulPolyNormFunction, PolyNormFunction
+from .rms_norm import FusedAddRMSNormFunction, RMSNormFunction


 def poly_norm(
@@ -15,6 +15,16 @@ def poly_norm(
     return PolyNormFunction.apply(x, weight, bias, eps)


+def fused_mul_poly_norm(
+    x: torch.Tensor,
+    mul: torch.Tensor,
+    weight: torch.Tensor,
+    bias: torch.Tensor,
+    eps: float = 1e-6,
+) -> torch.Tensor:
+    return FusedMulPolyNormFunction.apply(x, mul, weight, bias, eps)
+
+
 def rms_norm(
     x: torch.Tensor,
     weight: torch.Tensor,
@@ -23,8 +33,20 @@ def rms_norm(
     return RMSNormFunction.apply(x, weight, eps)


+def fused_add_rms_norm(
+    x: torch.Tensor,
+    residual: torch.Tensor,
+    weight: torch.Tensor,
+    eps: float = 1e-6,
+) -> torch.Tensor:
+    return FusedAddRMSNormFunction.apply(x, residual, weight, eps)[0]
+
+
 __all__ = [
     "poly_norm",
+    "fused_mul_poly_norm",
+    "rms_norm",
+    "fused_add_rms_norm",
     "layers",
     "ops",
 ]
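Since the fused Functions ship custom backwards, torch.autograd.gradcheck is the natural smoke test. A sketch in float64, gradcheck's preferred precision (assuming the kernels accept double inputs):

import torch
from activation import fused_add_rms_norm

x = torch.randn(2, 64, device="cuda", dtype=torch.float64, requires_grad=True)
res = torch.randn_like(x, requires_grad=True)
w = torch.randn(64, device="cuda", dtype=torch.float64, requires_grad=True)

# Compares the custom backward against numerical finite differences.
torch.autograd.gradcheck(fused_add_rms_norm, (x, res, w))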
build/torch27-cxx11-cu126-x86_64-linux/activation/_activation_20250907180255.abi3.so
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:74d4955271509451b946495da75f69a0f978e7258b8303fe3c077e585c0d3e6a
+size 8272456
build/torch27-cxx11-cu126-x86_64-linux/activation/_ops.py
CHANGED
(same diff as the cu118 variant of this file above)
build/torch27-cxx11-cu126-x86_64-linux/activation/layers.py
CHANGED
(same diff as the cu118 variant of this file above)
build/torch27-cxx11-cu126-x86_64-linux/activation/poly_norm.py
CHANGED
(same diff as the cu118 variant of this file above)
build/torch27-cxx11-cu126-x86_64-linux/activation/rms_norm.py
CHANGED
(same diff as the cu118 variant of this file above)
build/torch27-cxx11-cu128-x86_64-linux/activation/__init__.py
CHANGED
(same diff as the cu126 variant of this file above)
build/torch27-cxx11-cu128-x86_64-linux/activation/_activation_20250907180255.abi3.so
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:0bf0d2ab5ff5520704e0b0c959b61d0043d360cfd4335950e69677873a87e436
+size 12792112
build/torch27-cxx11-cu128-x86_64-linux/activation/_ops.py
CHANGED
(same diff as the cu118 variant of this file above)
build/torch27-cxx11-cu128-x86_64-linux/activation/layers.py
CHANGED
(same diff as the cu118 variant of this file above)
build/torch27-cxx11-cu128-x86_64-linux/activation/poly_norm.py
CHANGED
(same diff as the cu118 variant of this file above)